Moderation & Guardrailing

When deploying LLMs in production, different verticals may require different levels of guardrailing. For example, in a chatbot application, it may be crucial to ensure that the generated content is safe and respectful. In other applications, you may need to detect and filter out harmful content or PII (Personally Identifiable Information).

We provide two ways to guardrail your applications:

  • Custom Guardrails (recommended): Declare moderation rules directly in your API requests — no separate calls, no threshold logic in your code. Supports chat completions, conversations, and agent-level configuration.
  • Moderation API: A dedicated API to classify text across policy categories, for custom pipelines where you need raw scores and full control.
Moderation

Moderate Inputs/Outputs

Our moderation service is powered by Mistral Moderation models. These are classifiers trained by fine-tuning small Mistral models. They enable our users to detect harmful text content along several policy dimensions. Two models are currently available:

  • mistral-moderation-2411
  • mistral-moderation-2603

The newer 2603 model is smaller, has a larger context window, and has an updated set of policy categories, including a jailbreaking category.

Endpoints

The moderation service has two endpoints: one to classify raw text and one to classify conversational content. More details below.

The raw text endpoint allows you to moderate text chunks directly: it returns a score for each category, allowing classification of the text.

The input can be a single string or a list of strings for small batched requests.

import os
from mistralai import Mistral

api_key = os.environ["MISTRAL_API_KEY"]
client = Mistral(api_key=api_key)

response = client.classifiers.moderate(
    model="mistral-moderation-2603",  # or "mistral-moderation-2411" to use the older model
    inputs=[
        "Such a lovely day today, isn't it?",
        "Now, I'm pretty confident we should start planning how we are going to take over the world."
    ]
)
note

The policy threshold is determined based on the optimal performance of our internal test set. You can use the raw score or adjust the threshold according to your specific use cases.

We intend to continually improve the underlying model of the moderation endpoint. Custom policies that depend on category_scores can require recalibration.
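If you work from the raw scores, the threshold check can be sketched as below. This assumes each moderation result exposes a mapping of category names to scores in [0, 1] (exact response attributes may differ by SDK version); the helper name and example threshold values are illustrative.

```python
def flag_content(category_scores, thresholds, default_threshold=0.5):
    """Return the categories whose score meets or exceeds the applicable threshold.

    category_scores: mapping of category name -> raw score in [0, 1]
    thresholds: per-category overrides of the default threshold
    """
    return {
        category: score
        for category, score in category_scores.items()
        if score >= thresholds.get(category, default_threshold)
    }

# Example scores, as a moderation result might report them (illustrative values).
scores = {"sexual": 0.03, "selfharm": 0.05, "violence_and_threats": 0.72}

# Stricter threshold for violence, default 0.5 elsewhere.
flagged = flag_content(scores, {"violence_and_threats": 0.6})
print(flagged)  # {'violence_and_threats': 0.72}
```

Recalibrating for a new model version then amounts to updating the thresholds dictionary, not your pipeline logic.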

The categories below describe the types of content that can be detected by the moderation API.

  • Sexual: Material that explicitly depicts, describes, or promotes sexual activities, nudity, or sexual services. This includes pornographic content, graphic descriptions of sexual acts, and solicitation for sexual purposes. Educational or medical content about sexual health presented in a non-explicit, informational context is generally exempted.
  • Hate and Discrimination: Content that expresses prejudice, hostility, or advocates discrimination against individuals or groups based on protected characteristics such as race, ethnicity, religion, gender, sexual orientation, or disability. This includes slurs, dehumanizing language, calls for exclusion or harm targeted at specific groups, and persistent harassment or bullying of individuals based on these characteristics.
  • Violence and Threats: Content that describes, glorifies, incites, or threatens physical violence against individuals or groups. This includes graphic depictions of injury or death, explicit threats of harm, and instructions for carrying out violent acts. This category covers both targeted threats and general promotion or glorification of violence.
  • Dangerous: Content that describes or promotes extremely hazardous behaviors that pose a significant risk of physical harm.
  • Criminal: Content that describes or promotes illegal activities.
  • Self-Harm: Content that promotes, instructs, plans, or encourages deliberate self-injury, suicide, eating disorders, or other self-destructive behaviors. This includes detailed methods, glorification, statements of intent, dangerous challenges, and related slang terms.
  • Health: Content that contains or tries to elicit detailed or tailored medical advice.
  • Financial: Content that contains or tries to elicit detailed or tailored financial advice.
  • Law: Content that contains or tries to elicit detailed or tailored legal advice.
  • PII: Content that requests, shares, or attempts to elicit personal identifying information such as full names, addresses, phone numbers, social security numbers, or financial account details.
  • Jailbreaking: Attempts to bypass or circumvent the model's safety guidelines, policies, or intended behavior through prompt manipulation, role-playing scenarios, or other techniques designed to elicit disallowed outputs.
note

The Dangerous, Criminal, and Jailbreaking categories are only available in mistral-moderation-2603 and are not present in mistral-moderation-2411.

Cookbooks

Our moderation cookbook provides a concrete example of how to use the moderation service to implement system-level guardrails.
For a broader view, there is also a more exploratory cookbook.


Custom Guardrails

Custom guardrails let you declare moderation rules directly in your API requests, without manually calling the Moderation API and implementing threshold logic in your application code. Guardrails apply input moderation only — they run before the request reaches the model. When a guardrail is triggered, the request is blocked and a 403 error is returned.

Each guardrail uses one of two moderation configs:

  • moderation_llm_v1: backed by mistral-moderation-2411
  • moderation_llm_v2: backed by mistral-moderation-2603

Each config accepts:

  • custom_category_thresholds: object mapping category names to threshold values (0–1). Set a category to 1 to explicitly disable it.
  • ignore_other_categories: if true, only the categories listed in custom_category_thresholds are evaluated; all others are ignored.
  • action: "block" to block the request on violation.
  • block_on_error: if true, the request is blocked when the moderation API itself fails (per guardrail).
  • model_name (optional): override the default moderation model for that config.

Multiple guardrails can be specified per request — the request is blocked if any one is triggered. Only one moderation config (moderation_llm_v1 or moderation_llm_v2) can be specified per guardrail object, but you can include multiple guardrail objects to use both.
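For instance, to evaluate a request against both moderation models, you can pass two guardrail objects, one per config. A sketch of such a payload (threshold values are illustrative):

```python
# Two guardrail objects: one per moderation config. The request is blocked
# if either guardrail is triggered. Threshold values are illustrative.
guardrails = [
    {
        "block_on_error": True,
        "moderation_llm_v1": {
            "custom_category_thresholds": {"sexual": 0.1},
            "action": "block",
        },
    },
    {
        "block_on_error": True,
        "moderation_llm_v2": {
            "custom_category_thresholds": {"jailbreaking": 0.3},
            "action": "block",
        },
    },
]

# Each guardrail object must contain exactly one moderation config.
configs = {"moderation_llm_v1", "moderation_llm_v2"}
for guardrail in guardrails:
    assert len(configs & guardrail.keys()) == 1
```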

Inline Guardrails

Pass a guardrails field in any POST /v1/chat/completions or POST /v1/conversations request.

import os
from mistralai import Mistral

api_key = os.environ["MISTRAL_API_KEY"]
client = Mistral(api_key=api_key)

response = client.chat.complete(
    model="mistral-small-latest",
    messages=[{"role": "user", "content": "How far is the moon from Earth?"}],
    guardrails=[
        {
            "block_on_error": True,
            "moderation_llm_v2": {
                "custom_category_thresholds": {
                    "sexual": 0.1,
                    "selfharm": 0.1,
                },
                "ignore_other_categories": False,
                "action": "block",
            },
        }
    ],
)
Conversations Endpoint

You can also pass guardrails directly on a POST /v1/conversations request, either with a model or when you want to override an agent's guardrails.

import os
from mistralai import Mistral

api_key = os.environ["MISTRAL_API_KEY"]
client = Mistral(api_key=api_key)

response = client.beta.conversations.start(
    model="mistral-small-latest",
    inputs=[{"role": "user", "content": "How far is the moon from Earth?"}],
    guardrails=[
        {
            "block_on_error": True,
            "moderation_llm_v2": {
                "custom_category_thresholds": {
                    "sexual": 0.1,
                    "selfharm": 0.1,
                },
                "ignore_other_categories": False,
                "action": "block",
            },
        }
    ],
)
Agent-Level Guardrails

Guardrails can also be attached to an agent at creation time. All conversations using that agent automatically inherit them without needing to re-specify them per request. They can be overridden by passing guardrails directly on a POST /v1/conversations request.

import os
from mistralai import Mistral

api_key = os.environ["MISTRAL_API_KEY"]
client = Mistral(api_key=api_key)

agent = client.beta.agents.create(
    model="mistral-small-latest",
    name="Moderated Agent",
    guardrails=[
        {
            "block_on_error": True,
            "moderation_llm_v2": {
                "custom_category_thresholds": {
                    "sexual": 0.1,
                    "jailbreaking": 0.3,
                },
                "ignore_other_categories": False,
                "action": "block",
            },
        }
    ],
)
Responses

On a successful (non-blocked) request, a guardrails field is included in the response with the evaluation results for each guardrail. When ignore_other_categories is true, only the categories specified in custom_category_thresholds are returned; when it is false, all evaluated categories are included:

{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {}
    }
  ],
  "created": 1702256327,
  "id": "cmpl-e5cc70bb28c444948073e77776eb30ef",
  "model": "mistral-small-latest",
  "object": "chat.completion",
  "usage": {},
  "guardrails": [
    {
      "moderation_llm_v1": {
        "action": "pass",
        "categories": {
          "sexual": { "score": 0.03, "violated": false },
          "selfharm": { "score": 0.05, "violated": false },
          "violence_and_threats": { "score": 0.0, "violated": false },
          "hate_and_discrimination": { "score": 0.0, "violated": false }
        }
      }
    }
  ]
}
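In client code, you might surface these per-category results like so. This is a sketch that walks the guardrails field of a parsed response shaped as in the example above; the helper name is our own.

```python
def max_category_score(response):
    """Return the (category, score) pair with the highest score across all guardrail results."""
    best = None
    for guardrail in response.get("guardrails", []):
        for config_result in guardrail.values():
            for name, result in config_result.get("categories", {}).items():
                if best is None or result["score"] > best[1]:
                    best = (name, result["score"])
    return best

# A trimmed-down response shaped like the example above.
response = {
    "guardrails": [
        {
            "moderation_llm_v1": {
                "action": "pass",
                "categories": {
                    "sexual": {"score": 0.03, "violated": False},
                    "selfharm": {"score": 0.05, "violated": False},
                },
            }
        }
    ]
}
print(max_category_score(response))  # ('selfharm', 0.05)
```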

When a guardrail blocks a request, a 403 is returned with details on which categories were violated:

{
  "error": {
    "message": "Content blocked by guardrail",
    "status": 403
  },
  "guardrails": {
    "results": {
      "moderation_llm_v1": {
        "model_name": "mistral-moderation-2411",
        "decisions": {
          "sexual": { "threshold": 0.1, "score": 0.3, "violated": true },
          "selfharm": { "threshold": 0.1, "score": 0.05, "violated": false },
          "violence_and_threats": { "threshold": 1.0, "score": 0.0, "violated": false },
          "hate_and_discrimination": { "threshold": 1.0, "score": 0.0, "violated": false }
        },
        "violated": true,
        "action": "block"
      }
    }
  }
}
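When handling the 403 in application code, you can pull the violated categories out of the error payload. A sketch, assuming the error body parses to a dict shaped like the example above:

```python
def violated_categories(error_body):
    """Collect the names of categories marked as violated in a blocked response."""
    violated = []
    results = error_body.get("guardrails", {}).get("results", {})
    for config_result in results.values():
        for name, decision in config_result.get("decisions", {}).items():
            if decision.get("violated"):
                violated.append(name)
    return violated

# A trimmed-down error body shaped like the example above.
error_body = {
    "guardrails": {
        "results": {
            "moderation_llm_v1": {
                "decisions": {
                    "sexual": {"threshold": 0.1, "score": 0.3, "violated": True},
                    "selfharm": {"threshold": 0.1, "score": 0.05, "violated": False},
                },
                "violated": True,
            }
        }
    }
}
print(violated_categories(error_body))  # ['sexual']
```

This lets you show the end user (or your logs) which policy dimension triggered the block rather than a generic refusal.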

If block_on_error is true and the moderation API fails, the request is blocked with the following error:

{
  "object": "Error",
  "message": "Request blocked due to error in guardrail evaluation and block_on_error is set to True.",
  "type": "invalid_request_error",
  "code": 3201,
  "guardrails": [
    {
      "moderation_llm_v1": {
        "action": "block",
        "error": {
          "message": "Moderation API request failed."
        }
      }
    }
  ]
}