Tokenization is a crucial concept when working with LLMs, and it can be more complex than one might think!
For our tokenization implementation, please refer to mistral-common.
In this deep dive, we will dig into four versions of our tokenizer:
- V1: The tokenizer behind our very first models.
- V2: Introducing control tokens and function calling!
- V3: Better function calling implementation.
- V3-Tekken: Different version based on tiktoken, as opposed to the other versions based on sentencepiece.
Overview
Section | Description |
---|---|
Basics | Basics of tokenization. |
Boundaries & Token Healing | Main problems with tokenization and token healing. |
Control Tokens | Introduction to Control Tokens and their advantages. |
Templates | A summarized list of our tokenizers with their chat templates. |
Tokenizer | Make your own tokenizer with sentencepiece. |
Tool Calling | Learn about how tokenization for our tool calling works. |
Chat Templates | Legacy documentation around our chat templates. |