Datasets

Datasets are curated collections of conversation records used to evaluate model quality, build regression tests, or prepare fine-tuning data.

Unlike raw traffic in Explorer, Dataset records are editable: fix messages, add expected outputs, remove noise, and shape the data directly from the Studio interface.

What's in a record

Each record has three parts:

  • Conversation: system messages, user inputs, assistant responses, and tool calls. This is the core data that Judges evaluate.
  • Properties: custom metadata such as expected output, category, grading guidance, or difficulty. Judges reference properties in their instructions via {{ properties.* }}.
  • Source: where the record came from (EXPLORER, UPLOADED_FILE, DIRECT_INPUT, or PLAYGROUND), for traceability back to the record's origin.
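The three parts map naturally onto a small dictionary. A minimal sketch with illustrative values (the field names and shape here are for intuition, not a wire-format specification):

```python
# Illustrative shape of a Dataset record: conversation, properties, source.
record = {
    "messages": [
        {"role": "system", "content": "You are a support assistant."},
        {"role": "user", "content": "How do I reset my password?"},
        {"role": "assistant", "content": "Go to Settings > Security > Reset password."},
    ],
    "properties": {
        "expected_output": "Clear reset instructions",
        "category": "account",
    },
    # One of: EXPLORER, UPLOADED_FILE, DIRECT_INPUT, PLAYGROUND
    "source": "EXPLORER",
}
print(record["source"])
```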

Properties

Properties are what make Datasets more than a list of conversations. They let you attach structured context to each record:

  • expected_output: the ideal response, so a Judge can compare against it.
  • category: the type of request (e.g., billing, technical, general).
  • grading_guidance: specific instructions for how the Judge should evaluate this record.
  • difficulty: a difficulty marker to segment results by complexity.
Tip

Judges can reference any property in their instructions using {{ properties.your_field_name }}. See Instructions guidelines for details.
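To see what this substitution amounts to, here is a toy stand-in for the platform's templating, written with Python's `re` module. It is only an illustration of the `{{ properties.* }}` idea, not the actual rendering engine:

```python
import re

def render(instructions: str, properties: dict) -> str:
    """Replace {{ properties.<name> }} placeholders with values
    from a record's properties. Missing fields render as ""."""
    def substitute(match):
        return str(properties.get(match.group(1), ""))
    return re.sub(r"\{\{\s*properties\.(\w+)\s*\}\}", substitute, instructions)

instructions = (
    "Compare the answer to {{ properties.expected_output }}. "
    "Guidance: {{ properties.grading_guidance }}"
)
props = {
    "expected_output": "Clear reset instructions",
    "grading_guidance": "Penalize missing steps",
}
print(render(instructions, props))
```

Each record's properties fill the same instruction template, so one Judge configuration adapts to every record in the Dataset.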

Add data to a Dataset

Click New dataset, then choose a source:

Create manually

Add records by hand in Studio. Define conversation turns, then attach properties and metadata.

Useful for:

  • Regression tests targeting a known edge case.
  • Golden examples with carefully crafted expected outputs.
  • Specific scenarios that don't appear naturally in production traffic.
Information

Properties can be entered as key-value pairs or pasted as raw JSON for bulk editing.

From the Playground

Import conversations from the Playground — useful if you've been testing agents or prompts and want to reuse those conversations without recreating them manually.

From a Campaign

Import all or a subset of a Campaign's records, including the Judge's annotations as properties. This lets you build curated Datasets from evaluated traffic.

From Explorer

Select events in Explorer and click Export to Dataset. See the Explorer guide for details.

From a file

Upload a JSONL file to import records in bulk. Each line must be a JSON object with messages and optionally properties:

{"messages": [{"role": "user", "content": "How do I reset my password?"}, {"role": "assistant", "content": "Go to Settings > Security > Reset password."}], "properties": {"expected_output": "Clear reset instructions", "category": "account"}}
{"messages": [{"role": "user", "content": "What's the rate limit?"}], "properties": {"expected_output": "Tier-specific rate limit info", "category": "technical"}}
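When generating an import file from code, serializing each record with `json.dumps` guarantees that every line is valid JSON. A minimal sketch (the file name is arbitrary):

```python
import json

# Each record: required "messages", optional "properties".
records = [
    {
        "messages": [
            {"role": "user", "content": "How do I reset my password?"},
            {"role": "assistant", "content": "Go to Settings > Security > Reset password."},
        ],
        "properties": {"expected_output": "Clear reset instructions", "category": "account"},
    },
    {
        "messages": [{"role": "user", "content": "What's the rate limit?"}],
        "properties": {"expected_output": "Tier-specific rate limit info", "category": "technical"},
    },
]

# One JSON object per line: the JSONL format expected by the importer.
with open("records.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")
```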
Information

Imports can take some time. Check the status by clicking the Import Tasks button.

Export a Dataset

Click Actions → Export to JSONL to export a Dataset as a JSONL file. Each line contains a record with its conversation and properties.

Best practices

Curate your records

Click into any record to edit:

  • Messages: fix typos, clarify ambiguous inputs, or reshape the conversation to better represent a test case.
  • Properties: add expected_output, grading_guidance, or any metadata your Judges need.

Remove low-value records

Strip out records that add noise:

  • Duplicates: similar conversations that over-represent one scenario.
  • Out of scope: records that don't match the Dataset's purpose.
  • Ambiguous: conversations where even a human couldn't reliably score the response.

Test before you commit

Run a Judge on a single record before launching a full Campaign. This is the fastest way to verify that your instructions and properties work together. See Validate before you scale.

Keep Datasets healthy

If you reuse Datasets over time (and you should):

  • Name explicitly. Include scope and date: support_billing_baseline_2025_06, not test_data.
  • Track sources. Note where records came from and what curation you applied.
  • Version baselines. Freeze a baseline Dataset between uses. Create a new version when you need changes.
  • Don't mix unrelated tasks. Keep "support quality" and "code generation accuracy" in separate Datasets.
  • Check class balance. If 90% of your records are easy cases, the Dataset won't reveal real problems.
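A quick way to check class balance is to count records per `category` property in an exported JSONL file. A sketch, assuming records carry the `category` property shown earlier:

```python
import json
from collections import Counter

def category_balance(path: str) -> Counter:
    """Count records per `category` property in a JSONL Dataset file."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            counts[record.get("properties", {}).get("category", "unknown")] += 1
    return counts

# Demo with a tiny file; in practice, point this at a real export.
with open("dataset.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps({"messages": [], "properties": {"category": "billing"}}) + "\n")
    f.write(json.dumps({"messages": [], "properties": {"category": "billing"}}) + "\n")
    f.write(json.dumps({"messages": [], "properties": {"category": "technical"}}) + "\n")

print(category_balance("dataset.jsonl"))
```

If one category dominates the counts, add records for the underrepresented cases before drawing conclusions from a Campaign.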
[Developer] Use Datasets programmatically

The SDK lets you create Datasets, import records, and manage data from code.

import os
from mistralai import Mistral

mistral = Mistral(
    api_key=os.getenv("MISTRAL_API_KEY", ""),
)

# Create an empty Dataset
dataset = mistral.beta.observability.datasets.create(
    name="Customer Support Analysis Set",
    description="Curated examples for analyzing support agent quality"
)

print(f"Dataset created: {dataset.id}")
FAQ