Sampling: An overview of our sampling settings

Here, we will discuss the sampling settings that influence the output of Large Language Models (LLMs). This guide covers parameters such as Temperature, N, Top P, Presence Penalty, and Frequency Penalty, and explains how to adjust them. Whether you aim to generate creative content or ensure accurate responses, understanding these settings is key.

Let's explore each parameter and learn how to fine-tune LLM outputs effectively.

N Completions

N represents the number of completions to return for each request. This parameter is useful when you want to generate multiple responses for a single input. Each completion will be a unique response generated by the model, providing a variety of outputs to choose from.

Key Points

  • Multiple Responses: By setting N to a value greater than 1, you can get multiple responses for the same input.
  • Cost Efficiency: Input tokens are only billed once, regardless of the number of completions requested. This makes it cost-effective to explore different possibilities.

Example

Here's an example of how to use the N parameter in the API:

import os
from mistralai import Mistral

api_key = os.environ["MISTRAL_API_KEY"]
model = "ministral-3b-latest"

client = Mistral(api_key=api_key)

chat_response = client.chat.complete(
    model=model,
    messages=[
        {
            "role": "user",
            "content": "What is the best mythical creature? Answer with a single word.",
        },
    ],
    temperature=1,  # Increases randomness and diversity; must be above 0 to get varied completions
    n=10,  # Number of completions to return
)

for i, choice in enumerate(chat_response.choices):
    print(choice.message.content)

Output

Phoenix.
Dragon
Dragon
Unicorn
Unicorn
Phoenix
Unicorn
Dragon
Dragon.
Unicorn

In this example, the model generates 10 responses for the same input prompt. This allows you to see a variety of possible answers and choose the one that best fits your needs.

Temperature

Temperature in Large Language Models (LLMs) controls output diversity. Lower values make the model more deterministic, focusing on likely responses for accuracy. Higher values increase creativity and diversity. During text generation, LLMs predict tokens with associated probabilities using a softmax function. Temperature scales these probabilities: higher temperatures flatten the distribution, making outputs more varied, while lower temperatures amplify differences, favoring more likely tokens.
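
To make the scaling concrete, here is a minimal sketch of a temperature-scaled softmax in plain Python. The logit values are invented for illustration and do not come from any Mistral model:

import math

def softmax_with_temperature(logits, temperature):
    # Scale logits by the Temperature before applying the softmax.
    # Lower values sharpen the distribution (favoring the most likely token),
    # higher values flatten it (making outputs more varied).
    # A Temperature of 0 corresponds to simply taking the argmax of the logits.
    scaled = [l / temperature for l in logits]
    max_l = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(l - max_l) for l in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the top tokens of a prompt (illustrative values only)
logits = [5.0, 4.2, 3.9, 2.0]

for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])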

Visualization

To better understand the underlying principle and its impact on the probability distribution, here is a visualization of Temperature with a simple prompt: "What is the best mythical creature? Answer with a single word."

Example Image: Barplot comparing the distribution of the top 5 tokens with different Temperature values, using Mistral 7B at 4-bit precision.

Temperature significantly affects the probability distribution in LLMs. At a Temperature of 0, the model always outputs the most likely token, e.g., "Dragon". Increasing the Temperature to 0.2 introduces variability, allowing for tokens like "Un" (as in "Unicorn"). Further increases reveal more diverse tokens: the third token might still be "Drag" (for "Dragon"), but the fourth could be the start of "Pegasus", and the fifth of "Phoenix". Higher Temperatures make less likely tokens more probable, enhancing the diversity of the model's output.

API

You can easily set a temperature value via our clients; let's experiment with our API.

import os
from mistralai import Mistral

api_key = os.environ["MISTRAL_API_KEY"]
model = "ministral-3b-latest"

client = Mistral(api_key=api_key)

chat_response = client.chat.complete(
    model=model,
    messages=[
        {
            "role": "user",
            "content": "What is the best mythical creature? Answer with a single word.",
        },
    ],
    temperature=0.1,
    n=10,
)

for i, choice in enumerate(chat_response.choices):
    print(choice.message.content)

Output

Dragon
Dragon
Dragon
Dragon
Dragon
Dragon
Dragon
Dragon
Dragon
Dragon

The model answered mostly with "Dragon"! Let's try a higher temperature to get more diverse outputs by setting temperature = 1.

Unicorn
Dragon
Phoenix
Unicorn
Dragon
Phoenix.
Dragon.
Phoenix
Dragon
Unicorn.

The outputs are much more diverse: the model answers with a different creature more often, giving us "Dragon", "Unicorn", and "Phoenix".

The Best Temperature

There's no one-size-fits-all Temperature for all use cases, but some guidelines can help you find the best value for your application.

Determinism

  • Requirements: Tasks needing consistent, accurate responses, such as Mathematics, Classification, Healthcare, or Reasoning.
  • Temperature: Use very low values, sometimes slightly above zero to add a touch of variety.

For example, a classification agent should use a Temperature of 0 to always pick the best token. A math chat assistant might use very low Temperature values to avoid repetition while maintaining accuracy.

Creativity

  • Requirements: Tasks needing diverse, unique text, like brainstorming, writing novels, creating slogans, or roleplaying.
  • Temperature: Use high values, but avoid excessively high Temperatures, which can produce random or nonsensical outputs.

Consider the trade-off: higher Temperatures increase creativity but may decrease quality and accuracy.
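
As a rough illustration of these guidelines, the sketch below reuses the same client pattern as the earlier examples; the prompts and Temperature values are illustrative choices for each kind of task, not fixed recommendations:

import os
from mistralai import Mistral

api_key = os.environ["MISTRAL_API_KEY"]
model = "ministral-3b-latest"

client = Mistral(api_key=api_key)

# Deterministic task (classification): Temperature 0 always picks the most likely token
classification = client.chat.complete(
    model=model,
    messages=[{"role": "user", "content": "Classify this review as positive or negative: 'Great product!'"}],
    temperature=0,
)

# Creative task (brainstorming): a higher Temperature yields more varied suggestions
brainstorm = client.chat.complete(
    model=model,
    messages=[{"role": "user", "content": "Brainstorm five slogans for a coffee shop."}],
    temperature=0.9,
)

print(classification.choices[0].message.content)
print(brainstorm.choices[0].message.content)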

Top P

Top P is a setting that limits the tokens considered by a language model based on a probability threshold. It helps focus on the most likely tokens, improving output quality.

Visualization

For these examples, we set the Temperature first, then apply a Top P of 50%. Note that a Temperature of 0 is deterministic, making Top P irrelevant in that case.

The process is as follows:

  1. Apply the Temperature.
  2. Use Top P (0.5) to keep only the most likely tokens, i.e., the top tokens until their cumulative probability reaches 50%.
  3. Renormalize the probabilities of the remaining tokens (a sketch of this filtering follows below).
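
Here is a minimal sketch of this filtering in plain Python, as an illustration of the principle rather than the exact server-side implementation (the probability values are made up):

def apply_top_p(probs, top_p=0.5):
    # probs: token -> probability, already scaled by the Temperature (step 1)
    # Keep the most likely tokens until their cumulative probability reaches top_p (step 2),
    # then renormalize the survivors (step 3). Exact tie-breaking may differ from the server.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = {}, 0.0
    for token, p in ranked:
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}

# Hypothetical post-Temperature probabilities (illustrative values only)
probs = {"Dragon": 0.40, "Unicorn": 0.30, "Phoenix": 0.15, "Pegasus": 0.10, "Griffin": 0.05}
print(apply_top_p(probs, top_p=0.5))  # only "Dragon" and "Unicorn" survive, renormalized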

We will visualize the token probability distribution across different temperature values for the question:

  • "What is the best mythical creature? Answer with a single word."
Example Image: Different Temperature values and the top 5 tokens using Mistral 7B at 4-bit precision.
Example Image: Top P considers only the top tokens until reaching 50% probability.
Example Image: Other tokens' probabilities are set to 0, and the remaining tokens' probabilities are adjusted.

Top P ensures that only high-quality tokens are considered, maintaining output quality by excluding unlikely tokens. Balancing Temperature and Top P can be challenging, so it's recommended to fix one and adjust the other. However, you should experiment to find the best settings for your use case!

To Summarize

  1. Role of Top P: Top P limits the tokens considered based on a probability threshold, focusing on the most likely tokens to improve output quality.
  2. Interaction with Temperature: Top P is applied after Temperature.
  3. Impact on Outputs: Top P avoids considering very unlikely tokens, maintaining output quality and coherence.
  4. Balancing Temperature and Top P: It's challenging to balance both. Start by fixing one parameter and adjusting the other, and experiment to find the optimal settings.

Example

Here's an example of how to use the Top P parameter with our Python client:

import os
from mistralai import Mistral

api_key = os.environ["MISTRAL_API_KEY"]
model = "ministral-3b-latest"

client = Mistral(api_key=api_key)

chat_response = client.chat.complete(
    model=model,
    messages=[
        {
            "role": "user",
            "content": "What is the best mythical creature? Answer with a single word.",
        },
    ],
    temperature=1,
    top_p=0.5,
    n=10,
)

for i, choice in enumerate(chat_response.choices):
    print(choice.message.content)

Output

Unicorn
Unicorn
Unicorn
Unicorn
Dragon
Unicorn
Dragon
Dragon
Dragon
Dragon

Output Table

Temperature 0.1 | Temperature 1 | Temperature 1 & Top P 50%
--------------- | ------------- | -------------------------
Dragon | Unicorn | Unicorn
Dragon | Dragon | Unicorn
Dragon | Phoenix | Unicorn
Dragon | Unicorn | Unicorn
Dragon | Dragon | Dragon
Dragon | Phoenix. | Unicorn
Dragon | Dragon. | Dragon
Dragon | Phoenix | Dragon
Dragon | Dragon | Dragon
Dragon | Unicorn. | Dragon

In this example, the model generates responses considering only the top tokens whose cumulative probability reaches the 50% threshold. This keeps some diversity in the output while still sampling only from the best tokens; in this case, only 2 tokens reach the 50% threshold.

Penalties

Presence/Frequency Penalty

Presence Penalty

Presence Penalty determines how much the model penalizes the repetition of words or phrases. It encourages the model to use a wider variety of words and phrases, making the output more diverse and creative.

  • Range: [-2, 2]
  • Default: 0

A higher presence penalty encourages the model to avoid repeating words or phrases that have already appeared in the output, ensuring a more varied and creative text.

Specifically, the presence penalty is a one-time adjustment applied to all tokens that have been used at least once. It reduces the likelihood of repeating any token that has already appeared. This encourages the model to use a diverse range of tokens, promoting creativity and variety in the output.

Frequency Penalty

Frequency Penalty is a parameter that penalizes the repetition of words based on their frequency in the generated text. It helps to promote diversity and reduce repetition in the output.

  • Range: [-2, 2]
  • Default: 0

A higher frequency penalty discourages the model from repeating words that have already appeared frequently in the output. This ensures that the generated text is more varied and less repetitive.

Specifically, the frequency penalty is cumulative: it grows with how often a token appears in the generated text, so the more a token is sampled, the higher its penalty. It reduces the likelihood of repeating any token that has already appeared frequently, ensuring that the generated text is more varied and less repetitive.

Differences Between Presence Penalty and Frequency Penalty

  • Presence Penalty: This is a one-off additive contribution that applies to all tokens that have been sampled at least once. It encourages the model to include a diverse range of tokens in the generated text.
  • Frequency Penalty: This is a contribution that is proportional to how often a particular token has already been sampled. It discourages the model from repeating the same words or phrases too frequently within the generated text.

Both parameters can be tweaked to shape the quality and diversity of the generated text. The best values for these parameters can differ based on the specific task and the desired outcome.
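
To make the difference concrete, here is a minimal sketch of how these two penalties are commonly applied to the model's logits before sampling. This follows the formulation popularized by OpenAI-compatible APIs and is meant for intuition only; Mistral's exact server-side implementation may differ, and the logit values below are made up:

from collections import Counter

def penalized_logits(logits, generated_tokens, presence_penalty=0.0, frequency_penalty=0.0):
    # logits: token -> raw score before sampling
    # generated_tokens: tokens already produced in the output so far
    counts = Counter(generated_tokens)
    adjusted = {}
    for token, logit in logits.items():
        count = counts.get(token, 0)
        adjusted[token] = (
            logit
            - frequency_penalty * count  # grows with every repetition
            - presence_penalty * (1 if count > 0 else 0)  # one-off if the token appeared at all
        )
    return adjusted

# Hypothetical logits and generation history (illustrative values only)
logits = {"dragon": 2.0, "unicorn": 1.5, "castle": 1.0}
history = ["dragon", "dragon", "unicorn"]
print(penalized_logits(logits, history, presence_penalty=0.5, frequency_penalty=0.5))
# "dragon" is penalized twice by frequency (2 occurrences) plus once by presence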

Example Without Presence Penalty

Here's an example of how the output looks without the Presence Penalty parameter:

import os
from mistralai import Mistral

api_key = os.environ["MISTRAL_API_KEY"]
model = "ministral-3b-latest"

client = Mistral(api_key=api_key)

chat_response = client.chat.complete(
    model=model,
    messages=[
        {
            "role": "user",
            "content": "List 10 possible titles for a fantasy book. Give a list only.",
        },
    ],
    temperature=0,
)

print(chat_response.choices[0].message.content)

Output Without Presence Penalty

1. "The Shattered Crown"
2. "Whispers of the Old Magic"
3. "Echoes of the Forgotten Realm"
4. "The Chronicles of the Silver Moon"
5. "The Enchanted Forest's Secret"
6. "The Last Dragon's Legacy"
7. "The Shadowed Path"
8. "The Song of the Siren's Call"
9. "The Lost City of the Stars"
10. "The Whispering Winds of Destiny"

Penalties are a sensitive parameter that can have a significant impact on long-context and long-output queries. They can also help the model avoid the highly repetitive loops it may otherwise fall into, making them a valuable tool.