Tokenizer

Introduction

A tokenizer has the role of converting a request, often a string, into a sequence of token IDs that the model will receive as input. A token can be anything from a single character to a subword or a symbol. The model possesses a vocabulary of tokens with their corresponding IDs. The tokenizer's job is to break apart the string and convert it into a sequence of token IDs, a step called "encoding." It is also the tokenizer's job to convert a sequence of IDs back into a string, a step called "decoding." Once provided with a sequence of token IDs, the model tries to predict the next token.

For simplicity, let's describe tokens as full words.

Imagine a scenario where we have the following sentence: "Hello, how are"

Encoding

  1. The tokenizer will first break apart the sentence into different pieces, or tokens. For example:

    • Hello,
    • how
    • are
  2. The tokenizer will then get the token ID corresponding to each token. For example:

    • Hello, -> 432
    • how -> 523
    • are -> 87

    This sequence of token IDs is then provided to the model.

Decoding

  1. The model will try to predict the next token. Let's say it predicts:

    • 75

    We will then have a full sequence of token IDs:

    • 432 + 523 + 87 + 75
  2. This sequence can be decoded back into a full string:

    • Hello, + how + are + you -> Hello, how are you
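
To make this concrete, here is a minimal, hedged sketch of encoding and decoding with a toy word-level vocabulary. The token IDs are the illustrative values from the example above, not IDs from any real model.

vocab = {"Hello,": 432, "how": 523, "are": 87, "you": 75}
id_to_token = {i: t for t, i in vocab.items()}

def encode(text):
    # Split on whitespace and look up each piece in the vocabulary.
    return [vocab[token] for token in text.split()]

def decode(token_ids):
    # Map each ID back to its token and join them with spaces.
    return " ".join(id_to_token[i] for i in token_ids)

token_ids = encode("Hello, how are")  # [432, 523, 87]
token_ids.append(75)                  # the model "predicts" the next token
print(decode(token_ids))              # Hello, how are you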

Types of Tokenizers

There are multiple approaches to building tokenizers, each with its own trade-offs. Let's look at a few of them...

ASCII/Character-Based Tokenizer

A simple tokenizer is the ASCII/Character-based tokenizer, where each character and ASCII symbol is treated as a token. This approach is straightforward but has several drawbacks:

  • Slow Prediction: Generating text one character at a time requires many prediction steps, which is inefficient.
  • Statistical Errors: With such fine-grained tokens, the model has many more steps at which it can make a mistake over a long sequence.
  • Information/Sequence Length Ratio: Each token carries very little information, so the same text produces much longer sequences.
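
To illustrate, here is a minimal sketch of a character-level tokenizer that simply uses each character's ASCII/Unicode code point as its token ID; even a short sentence turns into a long sequence of token IDs.

def encode(text):
    # One token per character: the code point is the token ID.
    return [ord(ch) for ch in text]

def decode(token_ids):
    return "".join(chr(i) for i in token_ids)

ids = encode("Hello, how are")
print(len(ids))     # 14 tokens for a 14-character string
print(decode(ids))  # Hello, how are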

Word-Based Tokenizer

Another type is the word-based tokenizer, where each word is considered a token. While this may seem intuitive, it is not optimal. For example, consider the following words:

  • "fast"
  • "fastest"
  • "faster"
  • "slow"
  • "slowest"
  • "slower"

A word-based tokenizer would translate these into 6 tokens. However, a more efficient approach would be to use subword tokens:

  • "fast"
  • "est"
  • "er"
  • "slow"

With only 4 tokens, we can reproduce the original 6 words by combining them.
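
As a quick sanity check, the short sketch below (just an illustration) recombines the 4 subword tokens into the original 6 words.

stems = ["fast", "slow"]
suffixes = ["", "er", "est"]  # "" means the stem is used on its own

words = [stem + suffix for stem in stems for suffix in suffixes]
print(words)  # ['fast', 'faster', 'fastest', 'slow', 'slower', 'slowest']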

BPE and More

Byte Pair Encoding (BPE) is a simple data compression technique adapted for tokenization. It works by iteratively merging the most frequent bigram (pair of consecutive tokens) in the text, and this process continues until the desired vocabulary size is reached. BPE is particularly useful for handling out-of-vocabulary words by breaking them down into subword units!

Example

Let's consider a simple example to understand BPE better. Suppose we have the following text:

low
lower
lowest
new
news
newer

We'll start with each character as a separate token:

l o w
l o w e r
l o w e s t
n e w
n e w s
n e w e r

For this example, the objective is to train a tokenizer with a vocabulary size of 5, meaning 5 unique tokens.

Step 1: Initial Tokens
l, o, w, e, r, s, t, n

-> 8 tokens

Step 2: Merge the Most Frequent Bigram

The most frequent bigram here is l followed by o (several bigrams are tied; we pick this one). We merge them to form a new token lo.

lo w
lo w e r
lo w e s t
n e w
n e w s
n e w e r

Step 3: New Tokens
lo, w, e, r, s, t, n

-> 7 tokens

Step 4: Merge the Next Most Frequent Bigram

The next most frequent bigram is lo and w. We merge them to form a new token low.

low
low e r
low e s t
n e w
n e w s
n e w e r

Step 5: New Tokens
low, e, r, s, t, n, w

-> 7 tokens

Repeat this process...

Repeat the procedure until the desired vocabulary size is reached. In the end, the corpus will look like this:

low
low er
low est
new
new s
new er

Final Tokens
low, new, er, est, s

-> 5 tokens
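
For reference, here is a small, hedged sketch of this merge loop in Python. It is a simplified illustration (not how sentencepiece implements BPE): ties between equally frequent bigrams are broken arbitrarily, and the number of merges is fixed for the example, so the exact merges may differ slightly from the walkthrough above.

from collections import Counter

# Toy corpus from the walkthrough, split into characters.
corpus = [list(word) for word in ["low", "lower", "lowest", "new", "news", "newer"]]

def most_frequent_bigram(corpus):
    # Count every pair of adjacent tokens across the whole corpus.
    counts = Counter()
    for tokens in corpus:
        counts.update(zip(tokens, tokens[1:]))
    return counts.most_common(1)[0][0]

def apply_merge(tokens, pair):
    # Replace every occurrence of `pair` with the concatenated token.
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

for step in range(4):  # number of merges chosen for this illustration
    pair = most_frequent_bigram(corpus)
    corpus = [apply_merge(tokens, pair) for tokens in corpus]
    print(f"merge {step + 1}: {pair}")

print([" ".join(tokens) for tokens in corpus])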

Building a tokenizer that encodes text efficiently and compresses it well is a crucial step towards efficient language models, and it comes with many trade-offs to consider.

Trade-offs in Tokenization

The trade-offs between vocabulary size, sequence length, and compression ratio are critical in the world of LLMs for several reasons:

  • Model Efficiency: Smaller vocabulary sizes can lead to faster training and inference times.
  • Memory Usage: Larger vocabularies require more memory.
  • Generalization: Subword tokenization can help the model generalize better by capturing common subword patterns.
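
One simple way to reason about compression is to measure how many characters each token covers on average. The helper below is just an illustrative sketch; it works with any encode function, such as the one from the sentencepiece tokenizer trained later in this guide.

def chars_per_token(text, encode):
    # Average number of characters covered by each token:
    # a higher value means better compression of the text.
    return len(text) / len(encode(text))

# With a character-level "tokenizer" (list splits a string into characters),
# the ratio is 1.0 by construction; subword tokenizers score higher.
print(chars_per_token("Hello, how are you", list))  # 1.0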

Make your own Tokenizer

Let's make a tokenizer using the sentencepiece library and BPE.

Step 1: Obtain Training Data

First, we need a large text dataset to train our tokenizer. For this example, we'll use a subset of Wikipedia.

from datasets import load_dataset

# Stream the English Wikipedia dump so we don't have to download it all.
dataset = load_dataset("wikimedia/wikipedia", "20231101.en", split="train", streaming=True)

# Write the first 10,000 articles to a plain-text file for training.
with open("wikipedia_subset.txt", "w", encoding="utf-8") as f:
    for i, example in enumerate(dataset):
        f.write(example["text"] + "\n")
        if i >= 9999:
            break

Step 2: Train the Tokenizer

We can train a tokenizer quickly using the sentencepiece library.

import sentencepiece as spm

data_path = "wikipedia_subset.txt"
model_file = "wikipedia_tokenizer"
vocab_size = 512

def get_longest_sentence_length(file_path):
    # sentencepiece's --max_sentence_length is measured in bytes,
    # so track the longest line in bytes rather than in words.
    max_length = 0
    with open(file_path, "rb") as file:
        for line in file:
            max_length = max(max_length, len(line))
    return max_length

longest_sentence_from_dataset = get_longest_sentence_length(data_path)

# Train a BPE tokenizer; this produces wikipedia_tokenizer.model and .vocab.
spm.SentencePieceTrainer.train(
    f'--input={data_path} --model_prefix={model_file} --vocab_size={vocab_size} '
    f'--model_type=bpe --max_sentence_length={longest_sentence_from_dataset}'
)

tokenizer = spm.SentencePieceProcessor(model_file=model_file + ".model")

Step 3: Encode Example

Let's encode a simple example to see how the tokenizer works.

text = "We love Mistral!"
token_ids = tokenizer.encode(text)
token_str = tokenizer.encode(text, out_type=str)
print(f"Token Strings: {token_str}")
print(f"Token IDs: {token_ids}")
print(f"Decoded Text: {tokenizer.decode(token_ids)}")

Token Strings: ['▁W', 'e', '▁l', 'ov', 'e', '▁M', 'ist', 'r', 'al', '!']
Token IDs: [111, 392, 58, 206, 392, 59, 92, 398, 19, 481]
Decoded Text: We love Mistral!

Note: By default, sentencepiece prepends a dummy whitespace ('▁') to the input before encoding, so it's important to be aware of this.

We can quickly see the impact of the vocabulary size by training a new tokenizer with a larger vocabulary, let's say vocab_size = 8192.

Token Strings: ['▁We', '▁love', '▁M', 'ist', 'ral', '!']
Token IDs: [1820, 4859, 59, 92, 448, 8161]
Decoded Text: We love Mistral!
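
If you want to reproduce this comparison yourself, one way is to reuse the training call from Step 2 with a larger vocabulary and then encode the same sentence with both tokenizers. The model prefix wikipedia_tokenizer_8k below is just an illustrative name.

# Retrain with a larger vocabulary, keeping the same data and settings.
spm.SentencePieceTrainer.train(
    f'--input={data_path} --model_prefix=wikipedia_tokenizer_8k --vocab_size=8192 '
    f'--model_type=bpe --max_sentence_length={longest_sentence_from_dataset}'
)
tokenizer_8k = spm.SentencePieceProcessor(model_file="wikipedia_tokenizer_8k.model")

text = "We love Mistral!"
print(len(tokenizer.encode(text)), "tokens with vocab_size=512")
print(len(tokenizer_8k.encode(text)), "tokens with vocab_size=8192")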

That's not all

There is more to building a tokenizer, especially when considering control tokens. Building a proper tokenizer is an important step for good performance, efficiency, and compression ratio of your models!