Judges
Judges are LLM-based evaluators that score or classify model responses.
Define your quality criteria once, then apply them consistently at scale across thousands of events.
Judges aren't used on their own. They power Campaigns, which batch-annotate production traffic.
A Judge has three main components: its type, model, and instructions.
Two types of Judges
There are two types of Judges: Classification and Regression.
Classification Judges
Classification Judges assign a discrete label to each response. Use them when you need a clear category, not a score.
Good for:
- Binary decisions: helpful/not helpful, safe/unsafe
- Multi-class tagging: excellent/acceptable/poor
- Routing labels: code/search/guide
Example:
Evaluate the assistant's final response. Is it accurate, relevant, and complete? Classify as excellent, acceptable, or poor.
Then define labels as follows:
- excellent → "Fully addresses the question with accurate, clear information"
- acceptable → "Addresses the question but with minor issues"
- poor → "Fails to address the question or contains significant errors"
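Once a Classification Judge returns a label, downstream logic can branch on it. A minimal sketch of such a dispatch (the `triage` helper and follow-up actions here are hypothetical, not part of the Judges API):

```python
# Hypothetical mapping from a Judge's classification label to a follow-up action.
FOLLOW_UP = {
    "excellent": "none",
    "acceptable": "flag_for_weekly_review",
    "poor": "escalate_to_human",
}

def triage(label: str) -> str:
    """Return the follow-up action for a Judge label; unknown labels escalate."""
    return FOLLOW_UP.get(label, "escalate_to_human")

print(triage("acceptable"))  # flag_for_weekly_review
print(triage("unknown"))     # escalate_to_human
```

Defaulting unknown labels to escalation is a deliberately conservative choice: a label outside your defined set usually signals a misconfigured Judge rather than a safe response.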
Choosing a model
Pick a model based on the trade-off between evaluation quality and cost:
- Stronger models produce more nuanced judgments but cost more per event.
- Faster models work well for straightforward, well-defined criteria.
See the full list of available models and their pricing on the Models page.
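To make the trade-off concrete, here is a back-of-envelope cost sketch. The model names and per-event prices are entirely hypothetical; see the Models page for real pricing:

```python
# Hypothetical per-event costs in USD (real prices are on the Models page).
COST_PER_EVENT = {"strong-model": 0.004, "fast-model": 0.0005}

def campaign_cost(model: str, n_events: int) -> float:
    """Estimated cost of judging n_events with the given model."""
    return COST_PER_EVENT[model] * n_events

print(f"{campaign_cost('strong-model', 10_000):.2f}")  # 40.00
print(f"{campaign_cost('fast-model', 10_000):.2f}")    # 5.00
```

At Campaign scale the gap compounds quickly, which is why well-defined criteria that a faster model can handle are worth the prompt-engineering effort.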
Instructions guidelines
Craft your instructions with care:
- Be specific. Avoid vague prompts; prefer detailed instructions.
- Never assume the Judge understands your context. Spell out what "good" means for your use case.
- Ensure testability. If you can’t apply the criteria consistently, neither can the Judge.
- Use boundary examples: "A score of 3 means the response is partially helpful (= answers the question but omits a key detail)."
The instructions field is prefilled with a template structure. Just write under the # Instructions block. Conversation history, user message, assistant response, and available tools are auto-injected into the Judge’s context.
Advanced: using Jinja2 template variables
Advanced users can tweak the template and reference event data directly using Jinja2 syntax ({{ }}).
Here is a list of available variables:
| Variable | What it contains |
|---|---|
| {{ conversation_history }} | The full conversation history (messages in order). |
| {{ user_message }} | The user's last message. |
| {{ assistant_message }} | The assistant's last response. |
| {{ system_prompt }} | The system prompt used during the request. |
| {{ available_tools }} | Tools available to the model during the request. |
| {{ answer_type_definition }} | The output schema (auto-generated from your output type config). |
| {{ properties.* }} | Custom properties from the dataset record. |
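In a real Judge, the platform renders these variables for you with Jinja2. Purely to illustrate what the substitution does, here is a minimal standard-library stand-in (not the actual renderer, which supports the full Jinja2 syntax):

```python
import re

def render(template: str, variables: dict) -> str:
    """Toy "{{ var }}" substitution; unknown variables are left untouched."""
    return re.sub(
        r"\{\{\s*([\w.]+)\s*\}\}",
        lambda m: str(variables.get(m.group(1), m.group(0))),
        template,
    )

template = "User asked: {{ user_message }}\nAssistant replied: {{ assistant_message }}"
print(render(template, {
    "user_message": "What is 2 + 2?",
    "assistant_message": "4",
}))
```

The same pattern applies to dotted names such as properties.expected_output: at render time, the placeholder is replaced by the corresponding value from the event or dataset record.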
Properties let you attach structured context that Judges can reference.
For instance, if your dataset records have a properties.expected_output field, you can reference it to compare against a known-good answer:
Compare the assistant's response to the expected output.
Expected output: {{ properties.expected_output }}
Validate before you scale
Always test your Judge on a small sample before scaling to thousands of events:
- Select a Source: either Traffic or Dataset.
- Optionally filter the events (using the same filter language as in Explorer).
- Use the Try it button to run the Judge on all events, or click a row to analyze a single event.
Run on 10–20 records and inspect every judgment:
- Check agreement: do the Judge's scores match what you'd give?
- Check stability: does the Judge score similar inputs consistently?
- Check failure patterns: are there undefined answers or contradictions?
If the Judge disagrees with your expectations, revise the instructions before scaling up.
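The agreement check can be as simple as comparing your own labels against the Judge's on the sample. A sketch with made-up labels:

```python
# Your hand labels vs. the Judge's labels on a small validation sample (made-up data).
human = ["excellent", "poor", "acceptable", "poor", "excellent"]
judge = ["excellent", "acceptable", "acceptable", "poor", "excellent"]

agreement = sum(h == j for h, j in zip(human, judge)) / len(human)
print(f"Agreement: {agreement:.0%}")  # Agreement: 80%

# Inspect every disagreement before scaling up.
for i, (h, j) in enumerate(zip(human, judge)):
    if h != j:
        print(f"Record {i}: you said {h!r}, the Judge said {j!r}")
```

Each disagreement is a prompt-improvement opportunity: it usually points to a criterion you left implicit in the instructions.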
[Developer] Use Judges programmatically
The SDK lets you create, manage, and version Judges from code. Useful for automated Campaign workflows or teams that want Judge definitions in version control.
import os
from mistralai import Mistral
mistral = Mistral(
api_key=os.getenv("MISTRAL_API_KEY", ""),
)
# Classification Judge
classifier = mistral.beta.observability.judges.create(
name="Response Quality Classifier",
description="Classifies responses as excellent, acceptable, or poor",
model_name="mistral-medium-latest",
instructions="""You are evaluating the quality of an AI assistant's response.
Conversation:
{{ conversation_history }}
Evaluate the assistant's final response based on:
- Accuracy and correctness
- Helpfulness and relevance
- Clarity and conciseness
Classify as:
- excellent: Fully addresses the question with accurate, clear information
- acceptable: Addresses the question but with minor issues
- poor: Fails to address the question or contains significant errors""",
output={
"type": "CLASSIFICATION",
"options": [
{"value": "excellent", "description": "High quality, accurate response"},
{"value": "acceptable", "description": "Adequate but improvable"},
{"value": "poor", "description": "Inadequate or incorrect"}
]
},
tools=[]
)
print(f"Classifier created: {classifier.id}")
# Regression Judge
scorer = mistral.beta.observability.judges.create(
name="Helpfulness Score",
description="Rates response helpfulness on a 1-5 scale",
model_name="mistral-small-latest",
instructions="""Rate the helpfulness of the assistant's response.
Conversation:
{{ conversation_history }}
Score from 1 to 5:
1 = Completely unhelpful or harmful
2 = Mostly unhelpful with major issues
3 = Partially helpful but with significant gaps
4 = Helpful with minor issues
5 = Excellent, fully addresses the user's needs""",
output={
"type": "REGRESSION",
"min": 1,
"max": 5,
"min_description": "Completely unhelpful",
"max_description": "Excellent response"
},
tools=[]
)
print(f"Scorer created: {scorer.id}")
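Building on the create calls above, Judge definitions can live as plain data in version control and be created in a loop. A sketch of that pattern (the spec layout mirrors the arguments used above but is trimmed for brevity; real specs would also carry description and instructions):

```python
# Judge definitions kept as plain dicts (e.g. loaded from a YAML/JSON file in git).
JUDGE_SPECS = [
    {
        "name": "Response Quality Classifier",
        "model_name": "mistral-medium-latest",
        "output": {"type": "CLASSIFICATION", "options": [
            {"value": "excellent", "description": "High quality"},
            {"value": "acceptable", "description": "Adequate"},
            {"value": "poor", "description": "Inadequate"},
        ]},
    },
    {
        "name": "Helpfulness Score",
        "model_name": "mistral-small-latest",
        "output": {"type": "REGRESSION", "min": 1, "max": 5},
    },
]

def create_all(client, specs):
    """Create every Judge in specs; returns the created Judge objects."""
    return [client.beta.observability.judges.create(**spec) for spec in specs]

# create_all(mistral, JUDGE_SPECS)  # uncomment once the client is configured
```

Keeping the specs as data means Judge changes go through code review, and re-running the loop after an edit gives you a reproducible set of evaluators.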