Quickstart

Information

This guide walks through the Observability workflow in Studio.

Are you a developer? Look for the API & SDK Implementation callout included in each step. It shows the exact SDK method(s) for the same action and links to the corresponding section in the API/SDK docs.

What you'll get

By the end of this guide, you will have:

A filtered view of your production traffic showing relevant conversations.
A Judge that automatically identifies these conversations and scores or labels them for you.
A completed Campaign that applies the Judge across your traffic at scale.
A Dataset built from the Campaign results.

Before you start

Make sure you have:

An Enterprise-tier Organization.
Admin access to the Observability features in your Organization.
Production traffic with chat completion events.

Step 1: Explore production traffic

Click Observe, then open Explorer in the sidebar.

Your goal is to find a filter combination that surfaces relevant conversations (e.g., failure modes, low-quality responses, or specific behaviors you want to investigate).

Start with a broad time range (e.g., last 7 days) and one model (e.g., mistral-medium-2508).
Experiment with filters until you find a combination that suits your needs. For example:
- invoked_tools includes "web_search" to isolate tool-using conversations.
- last_user_message_preview contains "reset password" to find conversations about a specific topic.
- total_time_elapsed > 5 to find slow responses.
Click into individual events to inspect messages, tool calls, and metadata.
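
The three example filters above can be read as simple predicates over event fields. The sketch below is a local illustration in plain Python, not the Studio or SDK implementation. The sample events are made up, and it assumes that invoked_tools is a list of tool names, that last_user_message_preview is a string, and that total_time_elapsed is a number of seconds. Matching is case-insensitive here, which may differ from how Explorer matches text.

```python
# Hypothetical sample events; real events carry many more fields.
events = [
    {"id": "evt-1", "invoked_tools": ["web_search"],
     "last_user_message_preview": "How do I reset password?",
     "total_time_elapsed": 7.2},
    {"id": "evt-2", "invoked_tools": [],
     "last_user_message_preview": "Summarize this article",
     "total_time_elapsed": 1.4},
    {"id": "evt-3", "invoked_tools": ["web_search"],
     "last_user_message_preview": "Latest release notes?",
     "total_time_elapsed": 3.0},
]


def uses_tool(event, tool):
    """Mirror of: invoked_tools includes <tool>."""
    return tool in event["invoked_tools"]


def mentions(event, phrase):
    """Mirror of: last_user_message_preview contains <phrase>."""
    return phrase.lower() in event["last_user_message_preview"].lower()


def slower_than(event, seconds):
    """Mirror of: total_time_elapsed > <seconds> (unit assumed to be seconds)."""
    return event["total_time_elapsed"] > seconds


def matching_ids(events, *predicates):
    """Return the ids of events that satisfy every predicate."""
    return [e["id"] for e in events if all(p(e) for p in predicates)]


tool_events = matching_ids(events, lambda e: uses_tool(e, "web_search"))
slow_tool_events = matching_ids(
    events,
    lambda e: uses_tool(e, "web_search"),
    lambda e: slower_than(e, 5),
)
password_events = matching_ids(events, lambda e: mentions(e, "reset password"))
```

Combining predicates with a logical AND, as matching_ids does, is one way to think about stacking several filters in the filter bar: each added condition narrows the result set.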

tip

Refining your filters improves Judge and Campaign accuracy. This step helps you define "relevant" before automating evaluations.

The Explorer view with the filter bar showing `model_name = 'mistral-medium-2508'` and a date range. Below, a list of events with columns for timestamp, model, tokens, and latency. One event is expanded showing the full conversation.

API & SDK Implementation

In the SDK, use the chat_completion_events.search() method to filter your events programmatically.

Step 2: Create a Judge

You have identified a filter combination that surfaces relevant conversations. You will now create a Judge to evaluate them automatically.

Go to Judges in the sidebar and click Create Judge.
Select a model among the available options.

Provide clear instructions detailing how to evaluate the conversations. For example:

Rate how helpful the assistant's response is to the user's question.
Consider whether the response is accurate, relevant, and complete.

Add tools (Optional):
- Enable Web Search to give the Judge access to the internet.
- Choose Code Interpreter to let the Judge run its own Python code.
Select a Judge type and provide the corresponding labels or score ranges:
- Classification for discrete labels (e.g., helpful / not helpful).
- Regression for a numeric score (e.g., 0 to 5).
Click Create Judge, provide a name and description, and confirm.

The Judge creation form showing an instruction text area, the output type selector (Classification selected with two options: 'helpful' and 'not helpful'), and the model dropdown.

tip

Test your Judge on real records before running a Campaign.

API & SDK Implementation

In the SDK, call judges.create() and pass your instructions and the other Judge settings as arguments.
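
Before calling the SDK, it can help to collect the form choices from this step in one place and check them for consistency. The sketch below is a hypothetical helper, not part of the SDK: the JudgeSpec class, its field names, and the validation rules (at least two distinct labels, a score range with min below max) are assumptions made for illustration, and the model name is only a placeholder. The actual parameter names accepted by judges.create() are listed in the API/SDK docs.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class JudgeSpec:
    """Hypothetical container for the choices made in the Judge form."""
    name: str
    description: str
    model: str
    instructions: str
    judge_type: str  # "classification" or "regression"
    labels: list = field(default_factory=list)   # used by classification
    score_range: Optional[tuple] = None          # (min, max), used by regression
    tools: list = field(default_factory=list)    # e.g. ["web_search"]

    def validate(self):
        """Raise ValueError if the spec is inconsistent; return self otherwise."""
        if not self.instructions.strip():
            raise ValueError("instructions must not be empty")
        if self.judge_type == "classification":
            if len(set(self.labels)) < 2:
                raise ValueError("classification needs at least two distinct labels")
        elif self.judge_type == "regression":
            if self.score_range is None or self.score_range[0] >= self.score_range[1]:
                raise ValueError("regression needs a (min, max) range with min < max")
        else:
            raise ValueError(f"unknown judge type: {self.judge_type}")
        return self


helpfulness = JudgeSpec(
    name="helpfulness",
    description="Rates how helpful the assistant's response is",
    model="mistral-medium-2508",  # placeholder; pick one of the available options
    instructions=(
        "Rate how helpful the assistant's response is to the user's question. "
        "Consider whether the response is accurate, relevant, and complete."
    ),
    judge_type="classification",
    labels=["helpful", "not helpful"],
).validate()
```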

Step 3: Run a Campaign

A Campaign applies your Judge to a set of filtered events. To run a Campaign:

Go to Campaigns in the sidebar and click Create Campaign.
In the Campaign creation form:
- Select the Judge you created in Step 2.
- Select a time range (e.g., last 7 days).
- Define your filters (reuse the filter conditions from Step 1, or widen the scope if needed).
- Limit the number of events to process (from 100 to 10,000).
Click Create Campaign, set a Campaign name and description, then confirm.
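
The form choices above can be summarized as one configuration object. The following is a minimal sketch under stated assumptions: build_campaign_config, the dictionary keys, and the filter representation are hypothetical and do not reflect the SDK's actual request shape. Only the 100 to 10,000 event bounds come from this guide.

```python
from datetime import datetime, timedelta, timezone

MIN_EVENTS, MAX_EVENTS = 100, 10_000  # bounds stated in the Campaign form


def build_campaign_config(judge_name, filters, days=7, max_events=1_000, now=None):
    """Assemble the Campaign form choices into one dict (hypothetical shape)."""
    if not MIN_EVENTS <= max_events <= MAX_EVENTS:
        raise ValueError(
            f"max_events must be between {MIN_EVENTS} and {MAX_EVENTS}"
        )
    end = now or datetime.now(timezone.utc)
    start = end - timedelta(days=days)
    return {
        "judge": judge_name,
        "time_range": {"start": start.isoformat(), "end": end.isoformat()},
        "filters": list(filters),
        "max_events": max_events,
    }


config = build_campaign_config(
    judge_name="helpfulness",
    filters=[
        # Same conditions as the Step 1 examples, in a made-up dict form.
        {"field": "invoked_tools", "op": "includes", "value": "web_search"},
        {"field": "total_time_elapsed", "op": ">", "value": 5},
    ],
    days=7,
    max_events=500,
    now=datetime(2025, 1, 8, tzinfo=timezone.utc),  # fixed for reproducibility
)
```

Validating the event limit before submitting avoids a round trip for a value the form would reject anyway.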

Campaigns run in the background. Check back later in the Campaigns dashboard for results.

The Campaign detail view showing the Judge used, the time range, the filter conditions, and the number of events considered.

API & SDK Implementation

In the SDK, use campaigns.create() to define filters and attach your Judge, then monitor progress with campaigns.fetch_status().
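
Because Campaigns run in the background, a script typically polls for completion. The sketch below shows one generic polling loop with a timeout. It takes the status lookup as a callable so that a wrapper around campaigns.fetch_status() could be passed in. The status strings ("pending", "running", "completed", "failed", "cancelled") and the assumption that the lookup yields a plain string are hypothetical, so check the API/SDK docs for the real return value.

```python
import time


def wait_for_campaign(fetch_status, campaign_id, poll_seconds=30.0,
                      timeout_seconds=3600.0, sleep=time.sleep,
                      clock=time.monotonic):
    """Poll until the campaign reaches a terminal status or the timeout elapses.

    fetch_status: any callable taking a campaign id and returning a status string.
    """
    terminal = {"completed", "failed", "cancelled"}  # hypothetical status names
    deadline = clock() + timeout_seconds
    while True:
        status = fetch_status(campaign_id)
        if status in terminal:
            return status
        if clock() >= deadline:
            raise TimeoutError(
                f"campaign {campaign_id} still {status!r} after {timeout_seconds}s"
            )
        sleep(poll_seconds)


# Demonstration with a fake status source instead of a real API call.
_fake_statuses = iter(["pending", "running", "completed"])
final_status = wait_for_campaign(
    lambda _campaign_id: next(_fake_statuses),
    "camp-123",            # made-up campaign id
    poll_seconds=0,
    sleep=lambda _s: None,  # skip real sleeping in the demo
)
```

Injecting sleep and clock keeps the loop testable without waiting; in a real script the defaults are usually fine.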

Step 4: Save results to a Dataset

Your Campaign has completed. All events are now annotated with the Judge's output and you can save them to a Dataset:

Select the relevant events (you may apply additional filters).
Click Actions and choose whether to add the matching events to a new Dataset or append them to an existing one.

tip

Campaign annotations are linked to their original events. View them anytime in Explorer.

API & SDK Implementation

In the SDK (see Campaigns & Datasets), use campaigns.list_events() then datasets.import_from_explorer() to pipe the matching events directly into a dataset.
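
The selection step can be sketched locally as follows. This is an illustration only: the record shape (event_id, judge_output) is an assumption about what annotated events might look like, not the documented return value of campaigns.list_events(), and the guide does not state whether a Dataset deduplicates on append, so the sketch filters out already-present ids itself.

```python
# Hypothetical annotated events, as a Campaign with a Classification Judge
# might produce them.
annotated = [
    {"event_id": "evt-1", "judge_output": "not helpful"},
    {"event_id": "evt-2", "judge_output": "helpful"},
    {"event_id": "evt-3", "judge_output": "not helpful"},
]


def select_event_ids(annotated_events, keep_labels):
    """Keep the ids of events whose Judge output is one of keep_labels."""
    return [e["event_id"] for e in annotated_events
            if e["judge_output"] in keep_labels]


def new_ids_only(candidate_ids, existing_ids):
    """Drop ids already in the target Dataset, preserving input order."""
    seen = set(existing_ids)
    result = []
    for event_id in candidate_ids:
        if event_id not in seen:
            seen.add(event_id)
            result.append(event_id)
    return result


selected = select_event_ids(annotated, {"not helpful"})
to_append = new_ids_only(selected, existing_ids=["evt-3"])
```

The resulting id list is what you would then hand to the import call when appending to an existing Dataset.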

Congratulations

Congratulations! You have created a curated, annotated Dataset, built from real production data.

Want to learn more? Explore the following deep dives in the Observability docs:

Explorer: Query specific events and filter production logs.
Judges: Design complex instructions, schemas, and validation techniques.
Campaigns: Annotate thousands of production events in bulk.
Datasets: Manage record structures, curation, and file imports.

Troubleshooting

Explorer shows no events

I cannot export events to a dataset

Judge scores seem inconsistent

The Campaign is pending or stuck

Campaign annotations don't look right