[1' read]

View on GitHub ↗

Concept Deep Dive: Tokenization

Tokenization

Tokenization is a crucial concept around LLMs, and it can be more complex than one may think!

For our tokenization implementation, please refer to mistral-common.

In this deep dive, we will dig into 3 versions of our tokenizer:

V1: The tokenizer behind our very first models.
V2: Introducing control tokens and function calling!
V3: Better function calling implementation.
- V3-Tekken: Different version based on tiktoken, opposed to the other versions based on sentencepiece.

Overview

Section	Description
Basics	Basics of tokenization.
Boundaries & Token Healing	Main problems with tokenization and token healing.
Control Tokens	Introduction to Control Tokens and their advantages.
Templates	A summarized list of our tokenizers with their chat templates.
Tokenizer	Make your own tokenizer with sentencepiece.
Tool Calling	Learn about how tokenization for our tool calling works.

Chat Templates	Legacy documentation around our chat templates.