Concept Deep Dive: Tokenization

Tokenization is a core concept when working with LLMs, and it is more complex than it may first appear.

For our tokenization implementation, please refer to mistral-common.

In this deep dive, we will dig into three versions of our tokenizer, along with a variant of the latest one:

  • V1: The tokenizer behind our very first models.
  • V2: Introducing control tokens and function calling!
  • V3: An improved function calling implementation.
    • V3-Tekken: A variant based on tiktoken, as opposed to the other versions, which are based on sentencepiece.
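
As a rough illustration of the versions listed above, the sketch below maps each version to its backing library and loads the matching tokenizer from mistral-common. The version labels ("v1", "v3-tekken", and so on), the TOKENIZER_BACKENDS table, and the load_tokenizer helper are hypothetical names introduced here for illustration. The sketch assumes mistral-common is installed and exposes the MistralTokenizer.v1(), MistralTokenizer.v2(), and MistralTokenizer.v3(is_tekken=...) constructors; check the library for the API of the release you use.

```python
# Backing library for each tokenizer version, as described in the list above.
# The string labels are hypothetical and chosen only for this sketch.
TOKENIZER_BACKENDS = {
    "v1": "sentencepiece",
    "v2": "sentencepiece",
    "v3": "sentencepiece",
    "v3-tekken": "tiktoken",
}


def load_tokenizer(version: str):
    """Return a mistral-common tokenizer for a version label (hypothetical helper)."""
    if version not in TOKENIZER_BACKENDS:
        raise ValueError(f"Unknown tokenizer version: {version!r}")

    # Imported lazily so the lookup table above works without mistral-common.
    # Assumption: this import path and these constructors match your installed release.
    from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

    if version == "v1":
        return MistralTokenizer.v1()
    if version == "v2":
        return MistralTokenizer.v2()
    # V3 and V3-Tekken share one constructor and differ only by the is_tekken flag.
    return MistralTokenizer.v3(is_tekken=(version == "v3-tekken"))
```

For example, load_tokenizer("v3-tekken") would return the tiktoken-based variant, while load_tokenizer("v3") would return the sentencepiece-based one.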

Overview

  • Basics: Basics of tokenization.
  • Boundaries & Token Healing: Main problems with tokenization and token healing.
  • Control Tokens: Introduction to Control Tokens and their advantages.
  • Templates: A summarized list of our tokenizers with their chat templates.
  • Tokenizer: Make your own tokenizer with sentencepiece.
  • Tool Calling: Learn about how tokenization for our tool calling works.
  • Chat Templates: Legacy documentation around our chat templates.