Datasets
Datasets are curated collections of conversation records used to evaluate model quality, build regression tests, or prepare fine-tuning data.
Unlike raw traffic in Explorer, Dataset records are editable: fix messages, add expected outputs, remove noise, and shape the data directly from the Studio interface.
What's in a record
Each record has three parts:
| Field | Content | Purpose |
|---|---|---|
| Conversation | System messages, user inputs, assistant responses, and tool calls. | The core data that Judges evaluate. |
| Properties | Custom metadata: expected output, category, grading guidance, difficulty, etc. | Judges reference properties in their instructions via {{ properties.* }}. |
| Source | Where the record came from: EXPLORER, UPLOADED_FILE, DIRECT_INPUT, or PLAYGROUND. | Traceability back to the record's origin. |
Properties
Properties are what make Datasets more than a list of conversations. They let you attach structured context to each record:
- `expected_output`: the ideal response, so a Judge can compare against it.
- `category`: the type of request (e.g., `billing`, `technical`, `general`).
- `grading_guidance`: specific instructions for how the Judge should evaluate this record.
- `difficulty`: a difficulty marker to segment results by complexity.
Judges can reference any property in their instructions using {{ properties.your_field_name }}. See Instructions guidelines for details.
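To build intuition for how these placeholders behave, here is a minimal sketch of double-brace substitution. This is illustrative only: Studio resolves `{{ properties.* }}` server-side, and the `render_instructions` helper, template, and property values below are hypothetical.

```python
import re


def render_instructions(template: str, properties: dict) -> str:
    """Replace {{ properties.field }} placeholders with record values.

    Illustrative sketch only: assumes simple double-brace substitution,
    with missing fields rendered as empty strings.
    """
    def substitute(match: re.Match) -> str:
        field = match.group(1)
        return str(properties.get(field, ""))

    return re.sub(r"\{\{\s*properties\.(\w+)\s*\}\}", substitute, template)


# Hypothetical Judge instructions and record properties.
template = (
    "Compare the assistant's reply to the expected answer: "
    "{{ properties.expected_output }}. "
    "Apply this guidance: {{ properties.grading_guidance }}."
)
properties = {
    "expected_output": "Clear reset instructions",
    "grading_guidance": "Penalize missing security warnings",
}
print(render_instructions(template, properties))
```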
Add data to a Dataset
Click New dataset, then choose a source:
Create manually
Add records by hand in Studio. Define conversation turns, then attach properties and metadata.
Useful for:
- Regression tests targeting a known edge case.
- Golden examples with carefully crafted expected outputs.
- Specific scenarios that don't appear naturally in production traffic.
Properties can be entered as key-value pairs or pasted as raw JSON for bulk editing.
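In the raw-JSON view, a record's properties form a flat object. A hypothetical example using the fields described above:

```json
{
  "expected_output": "Clear reset instructions",
  "category": "account",
  "difficulty": "easy"
}
```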
From the Playground
Import conversations from the Playground — useful if you've been testing agents or prompts and want to reuse those conversations without recreating them manually.
From a Campaign
Import all or a subset of a Campaign's records, including the Judge's annotations as properties. This lets you build curated Datasets from evaluated traffic.
From Explorer
Select events in Explorer and click Export to Dataset. See the Explorer guide for details.
From a file
Upload a JSONL file to import records in bulk. Each line must be a JSON object with messages and optionally properties:
```jsonl
{"messages": [{"role": "user", "content": "How do I reset my password?"}, {"role": "assistant", "content": "Go to Settings > Security > Reset password."}], "properties": {"expected_output": "Clear reset instructions", "category": "account"}}
{"messages": [{"role": "user", "content": "What's the rate limit?"}], "properties": {"expected_output": "Tier-specific rate limit info", "category": "technical"}}
```

Imports can take some time. Check the status by clicking the Import Tasks button.
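If you generate import files from code, a small sketch like the following can serialize records into this one-object-per-line shape. It assumes only the `messages`/`properties` layout shown above; the `to_jsonl` helper and the output file name are arbitrary.

```python
import json

# Records in the shape the importer expects: messages plus optional properties.
records = [
    {
        "messages": [
            {"role": "user", "content": "How do I reset my password?"},
            {"role": "assistant", "content": "Go to Settings > Security > Reset password."},
        ],
        "properties": {"expected_output": "Clear reset instructions", "category": "account"},
    },
    {
        "messages": [{"role": "user", "content": "What's the rate limit?"}],
        "properties": {"expected_output": "Tier-specific rate limit info", "category": "technical"},
    },
]


def to_jsonl(records: list[dict]) -> str:
    """Serialize records as JSONL: one JSON object per line."""
    for record in records:
        if "messages" not in record:
            raise ValueError("each record needs a 'messages' key")
    return "\n".join(json.dumps(record) for record in records)


with open("dataset.jsonl", "w") as f:
    f.write(to_jsonl(records) + "\n")
```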
Export a Dataset
Click Actions → Export to JSONL to export a Dataset as a JSONL file. Each line contains a record with its conversation and properties.
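An exported file can be post-processed with a few lines of code. The sketch below counts records per category; it assumes the export follows the same `messages`/`properties` shape as the import format, and the `category` property is hypothetical.

```python
import json
from collections import Counter


def category_counts(jsonl_text: str) -> Counter:
    """Count records per 'category' property in an exported JSONL string."""
    counts = Counter()
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        counts[record.get("properties", {}).get("category", "unknown")] += 1
    return counts


# A hypothetical two-record export.
export = (
    '{"messages": [{"role": "user", "content": "Hi"}], "properties": {"category": "general"}}\n'
    '{"messages": [{"role": "user", "content": "Rate limit?"}], "properties": {"category": "technical"}}\n'
)
print(category_counts(export))
```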
Best practices
Curate your records
Click into any record to edit:
- Messages: fix typos, clarify ambiguous inputs, or reshape the conversation to better represent a test case.
- Properties: add `expected_output`, `grading_guidance`, or any metadata your Judges need.
Remove low-value records
Strip out records that add noise:
- Duplicates: similar conversations that over-represent one scenario.
- Out of scope: records that don't match the Dataset's purpose.
- Ambiguous: conversations where even a human couldn't reliably score the response.
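Duplicate removal can often be automated before manual review. The sketch below keeps the first record per normalized first user message; this is a deliberately simple heuristic (the `drop_duplicates` helper is hypothetical, and real curation may need fuzzy matching).

```python
def drop_duplicates(records: list[dict]) -> list[dict]:
    """Keep the first record per normalized first user message."""
    seen: set[str] = set()
    kept = []
    for record in records:
        user_turns = [m["content"] for m in record["messages"] if m["role"] == "user"]
        key = user_turns[0].strip().lower() if user_turns else ""
        if key in seen:
            continue
        seen.add(key)
        kept.append(record)
    return kept


records = [
    {"messages": [{"role": "user", "content": "Reset my password"}]},
    {"messages": [{"role": "user", "content": "reset my password "}]},  # near-duplicate
    {"messages": [{"role": "user", "content": "What's the rate limit?"}]},
]
print(len(drop_duplicates(records)))  # prints 2
```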
Test before you commit
Run a Judge on a single record before launching a full Campaign. This is the fastest way to verify that your instructions and properties work together. See Validate before you scale.
Keep Datasets healthy
If you reuse Datasets over time (and you should):
- Name explicitly. Include scope and date: `support_billing_baseline_2025_06`, not `test_data`.
- Track sources. Note where records came from and what curation you applied.
- Version baselines. Freeze a baseline Dataset between uses. Create a new version when you need changes.
- Don't mix unrelated tasks. Keep "support quality" and "code generation accuracy" in separate Datasets.
- Check class balance. If 90% of your records are easy cases, the Dataset won't reveal real problems.
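The balance check above can be scripted. A minimal sketch, assuming records carry a `difficulty` property (both the field and the `balance_report` helper are hypothetical):

```python
from collections import Counter


def balance_report(records: list[dict], field: str = "difficulty") -> dict[str, float]:
    """Return the share of records per value of a given property."""
    values = [r.get("properties", {}).get(field, "unlabeled") for r in records]
    counts = Counter(values)
    total = len(records)
    return {value: count / total for value, count in counts.items()}


# A skewed example: 9 easy records, 1 hard record.
records = [{"properties": {"difficulty": "easy"}}] * 9 + [{"properties": {"difficulty": "hard"}}]
print(balance_report(records))  # {'easy': 0.9, 'hard': 0.1}
```

A report like `{'easy': 0.9, 'hard': 0.1}` is the signal to add harder cases before relying on the Dataset.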
[Developer] Use Datasets programmatically
The SDK lets you create Datasets, import records, and manage data from code.
```python
import os

from mistralai.client import Mistral

mistral = Mistral(
    api_key=os.getenv("MISTRAL_API_KEY", ""),
)

# Create an empty Dataset
dataset = mistral.beta.observability.datasets.create(
    name="Customer Support Analysis Set",
    description="Curated examples for analyzing support agent quality",
)
print(f"Dataset created: {dataset.id}")
```