Judges
Judges are LLM-based evaluators that score or classify model responses.
Define your quality criteria once, then apply them consistently at scale across thousands of events.
Judges aren't used on their own. They power Campaigns, which batch-annotate production traffic.
A Judge has three main components: its type, model, and instructions.
Two types of Judges
There are two types of Judges: Classification and Regression.
Classification Judges
Classification Judges assign a discrete label to each response. Use them when you need a clear category, not a score.
Good for:
- Binary decisions: helpful/not helpful, safe/unsafe
- Multi-class tagging: excellent/acceptable/poor
- Routing labels: code/search/guide
Example:
Evaluate the assistant's final response. Is it accurate, relevant, and complete? Classify as excellent, acceptable, or poor.
Then define labels as follows:
- excellent → "Fully addresses the question with accurate, clear information"
- acceptable → "Addresses the question but with minor issues"
- poor → "Fails to address the question or contains significant errors"
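Once a Classification Judge returns a label, downstream logic can branch on it. A minimal sketch of such a dispatch (the `triage` helper and follow-up actions here are hypothetical, not part of the Judges API):

```python
# Hypothetical mapping from a Judge's classification label to a follow-up action.
FOLLOW_UP = {
    "excellent": "none",
    "acceptable": "flag_for_weekly_review",
    "poor": "escalate_to_human",
}

def triage(label: str) -> str:
    """Return the follow-up action for a Judge label; unknown labels escalate."""
    return FOLLOW_UP.get(label, "escalate_to_human")

print(triage("acceptable"))  # flag_for_weekly_review
print(triage("unknown"))     # escalate_to_human
```

Defaulting unknown labels to escalation is a deliberately conservative choice: a label outside your defined set usually signals a misconfigured Judge rather than a safe response.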
Choosing a model
Pick a model based on the trade-off between evaluation quality and cost:
- Stronger models produce more nuanced judgments but cost more per event.
- Faster models work well for straightforward, well-defined criteria.
See the full list of available models and their pricing on the Models page.
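To make the trade-off concrete, here is a back-of-envelope cost sketch. The model names and per-event prices are entirely hypothetical; see the Models page for real pricing:

```python
# Hypothetical per-event costs in USD (real prices are on the Models page).
COST_PER_EVENT = {"strong-model": 0.004, "fast-model": 0.0005}

def campaign_cost(model: str, n_events: int) -> float:
    """Estimated cost of judging n_events with the given model."""
    return COST_PER_EVENT[model] * n_events

print(f"{campaign_cost('strong-model', 10_000):.2f}")  # 40.00
print(f"{campaign_cost('fast-model', 10_000):.2f}")    # 5.00
```

At Campaign scale the gap compounds quickly, which is why well-defined criteria that a faster model can handle are worth the prompt-engineering effort.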
Instructions guidelines
Craft your instructions with care:
- Be specific. Avoid vague prompts; prefer detailed instructions.
- Never assume the Judge understands your context. Spell out what "good" means for your use case.
- Ensure testability. If you can’t apply the criteria consistently, neither can the Judge.
- Use boundary examples: "A score of 3 means the response is partially helpful (= answers the question but omits a key detail)."
The instructions field is prefilled with a template structure. Just write under the # Instructions block. Conversation history, user message, assistant response, and available tools are auto-injected into the Judge’s context.
Advanced: using Jinja2 template variables
Advanced users can tweak the template and reference event data directly using Jinja2 syntax ({{ }}).
Here is a list of available variables:
| Variable | What it contains |
|---|---|
| {{ conversation_history }} | The full conversation history (messages in order). |
| {{ user_message }} | The user's last message. |
| {{ assistant_message }} | The assistant's last response. |
| {{ system_prompt }} | The system prompt used during the request. |
| {{ available_tools }} | Tools available to the model during the request. |
| {{ answer_type_definition }} | The output schema (auto-generated from your output type config). |
| {{ properties.* }} | Custom properties from the dataset record. |
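In a real Judge, the platform renders these variables for you with Jinja2. Purely to illustrate what the substitution does, here is a minimal standard-library stand-in (not the actual renderer, which supports the full Jinja2 syntax):

```python
import re

def render(template: str, variables: dict) -> str:
    """Toy "{{ var }}" substitution; unknown variables are left untouched."""
    return re.sub(
        r"\{\{\s*([\w.]+)\s*\}\}",
        lambda m: str(variables.get(m.group(1), m.group(0))),
        template,
    )

template = "User asked: {{ user_message }}\nAssistant replied: {{ assistant_message }}"
print(render(template, {
    "user_message": "What is 2 + 2?",
    "assistant_message": "4",
}))
```

The same pattern applies to dotted names such as properties.expected_output: at render time, the placeholder is replaced by the corresponding value from the event or dataset record.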
Properties let you attach structured context that Judges can reference.
For instance, if your dataset records have a properties.expected_output field, you can reference it to compare against a known-good answer:
Compare the assistant's response to the expected output.
Expected output: {{ properties.expected_output }}
Validate before you scale
Always test your Judge on a small sample before scaling to thousands of events:
- Select a Source: either Traffic or Dataset.
- Optionally filter the events (using the same filter language as in Explorer).
- Use the Try it button to run the Judge on all events, or click a row to analyze a single event.
Run on 10–20 records and inspect every judgment:
- Check agreement: do the Judge's scores match what you'd give?
- Check stability: does the Judge score similar inputs consistently?
- Check failure patterns: are there undefined answers or contradictions?
If the Judge disagrees with your expectations, revise the instructions before scaling up.
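The agreement check can be as simple as comparing your own labels against the Judge's on the sample. A sketch with made-up labels:

```python
# Your hand labels vs. the Judge's labels on a small validation sample (made-up data).
human = ["excellent", "poor", "acceptable", "poor", "excellent"]
judge = ["excellent", "acceptable", "acceptable", "poor", "excellent"]

agreement = sum(h == j for h, j in zip(human, judge)) / len(human)
print(f"Agreement: {agreement:.0%}")  # Agreement: 80%

# Inspect every disagreement before scaling up.
for i, (h, j) in enumerate(zip(human, judge)):
    if h != j:
        print(f"Record {i}: you said {h!r}, the Judge said {j!r}")
```

Each disagreement is a prompt-improvement opportunity: it usually points to a criterion you left implicit in the instructions.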
[Developer] Use Judges programmatically
The SDK lets you create, manage, and version Judges from code. Useful for automated Campaign workflows or teams that want Judge definitions in version control.
import os
from mistralai import Mistral
mistral = Mistral(
api_key=os.getenv("MISTRAL_API_KEY", ""),
)
# Classification Judge
classifier = mistral.beta.observability.judges.create(
name="Response Quality Classifier",
description="Classifies responses as excellent, acceptable, or poor",
model_name="mistral-medium-latest",
instructions="""You are evaluating the quality of an AI assistant's response.
Conversation:
{{ conversation_history }}
Evaluate the assistant's final response based on:
- Accuracy and correctness
- Helpfulness and relevance
- Clarity and conciseness
Classify as:
- excellent: Fully addresses the question with accurate, clear information
- acceptable: Addresses the question but with minor issues
- poor: Fails to address the question or contains significant errors""",
output={
"type": "CLASSIFICATION",
"options": [
{"value": "excellent", "description": "High quality, accurate response"},
{"value": "acceptable", "description": "Adequate but improvable"},
{"value": "poor", "description": "Inadequate or incorrect"}
]
},
tools=[]
)
print(f"Classifier created: {classifier.id}")
# Regression Judge
scorer = mistral.beta.observability.judges.create(
name="Helpfulness Score",
description="Rates response helpfulness on a 1-5 scale",
model_name="mistral-small-latest",
instructions="""Rate the helpfulness of the assistant's response.
Conversation:
{{ conversation_history }}
Score from 1 to 5:
1 = Completely unhelpful or harmful
2 = Mostly unhelpful with major issues
3 = Partially helpful but with significant gaps
4 = Helpful with minor issues
5 = Excellent, fully addresses the user's needs""",
output={
"type": "REGRESSION",
"min": 1,
"max": 5,
"min_description": "Completely unhelpful",
"max_description": "Excellent response"
},
tools=[]
)
print(f"Scorer created: {scorer.id}")
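Building on the create calls above, Judge definitions can live as plain data in version control and be created in a loop. A sketch of that pattern (the spec layout mirrors the arguments used above but is trimmed for brevity; real specs would also carry description and instructions):

```python
# Judge definitions kept as plain dicts (e.g. loaded from a YAML/JSON file in git).
JUDGE_SPECS = [
    {
        "name": "Response Quality Classifier",
        "model_name": "mistral-medium-latest",
        "output": {"type": "CLASSIFICATION", "options": [
            {"value": "excellent", "description": "High quality"},
            {"value": "acceptable", "description": "Adequate"},
            {"value": "poor", "description": "Inadequate"},
        ]},
    },
    {
        "name": "Helpfulness Score",
        "model_name": "mistral-small-latest",
        "output": {"type": "REGRESSION", "min": 1, "max": 5},
    },
]

def create_all(client, specs):
    """Create every Judge in specs; returns the created Judge objects."""
    return [client.beta.observability.judges.create(**spec) for spec in specs]

# create_all(mistral, JUDGE_SPECS)  # uncomment once the client is configured
```

Keeping the specs as data means Judge changes go through code review, and re-running the loop after an edit gives you a reproducible set of evaluators.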