Moderation & Guardrailing

When deploying LLMs in production, different verticals may require different levels of guardrailing. For example, in a chatbot application, it may be crucial to ensure that the generated content is safe and respectful. Other applications may need to detect and filter out content that is harmful or that contains personally identifiable information (PII).

We provide two ways to guardrail your applications:

Custom Guardrails (recommended): Declare moderation rules directly in your API requests — no separate calls, no threshold logic in your code. Supports chat completions, conversations, and agent-level configuration.
Moderation API: A dedicated API to classify text across policy categories, for custom pipelines where you need raw scores and full control.

Moderation

Moderate Inputs/Outputs

Our moderation service is powered by mistral-moderation-2603. It classifies text across policy categories including a jailbreaking category.

Note

mistral-moderation-2411 was deprecated on March 31, 2026. See Mistral Moderation 2411 for legacy reference.

Endpoints

The moderation service has two endpoints: one to classify raw text and one to classify conversational content. More details below.

The raw text endpoint lets you moderate text chunks directly. It returns a score for each category, which you can use to classify the text.

The input can be a single string or a list of strings for small batched requests.

import os
from mistralai.client import Mistral

api_key = os.environ["MISTRAL_API_KEY"]
client = Mistral(api_key=api_key)

response = client.classifiers.moderate(
    model="mistral-moderation-2603",
    inputs=[
        "Such a lovely day today, isn't it?",
        "Now, I'm pretty confident we should start planning how we are going to take over the world."
    ]
)

Note

The default policy threshold is chosen for optimal performance on our internal test set. You can use the raw scores or adjust the threshold to suit your specific use case.

We intend to continually improve the underlying model of the moderation endpoint. Custom policies that depend on category_scores may require recalibration when the model changes.

The table below describes the types of content that the moderation API can detect.
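As a minimal sketch of applying your own thresholds to raw scores, the helper below compares a mapping of category scores against per-category limits. The function name `flag_categories`, the default threshold of 0.5, and the sample scores are illustrative assumptions, not values from the API. In practice the scores would come from the category_scores of a moderation result.

```python
from typing import Dict, List, Optional


def flag_categories(
    category_scores: Dict[str, float],
    thresholds: Optional[Dict[str, float]] = None,
    default_threshold: float = 0.5,  # assumed default, not an official value
) -> List[str]:
    """Return the categories whose score meets or exceeds its threshold."""
    thresholds = thresholds or {}
    flagged = []
    for category, score in category_scores.items():
        # Fall back to the default when no custom threshold is given.
        limit = thresholds.get(category, default_threshold)
        if score >= limit:
            flagged.append(category)
    return sorted(flagged)


# Made-up scores for illustration only.
sample_scores = {"sexual": 0.02, "violence_and_threats": 0.41, "selfharm": 0.07}

# A stricter threshold for one category, the default for the rest.
print(flag_categories(sample_scores, {"violence_and_threats": 0.3}))
```

Keeping the thresholds in one mapping makes recalibration after a model update a data change and not a code change.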

Category	Description
Sexual	Material that explicitly depicts, describes, or promotes sexual activities, nudity, or sexual services. This includes pornographic content, graphic descriptions of sexual acts, and solicitation for sexual purposes. Educational or medical content about sexual health presented in a non-explicit, informational context is generally exempted.
Hate and Discrimination	Content that expresses prejudice, hostility, or advocates discrimination against individuals or groups based on protected characteristics such as race, ethnicity, religion, gender, sexual orientation, or disability. This includes slurs, dehumanizing language, calls for exclusion or harm targeted at specific groups, and persistent harassment or bullying of individuals based on these characteristics.
Violence and Threats	Content that describes, glorifies, incites, or threatens physical violence against individuals or groups. This includes graphic depictions of injury or death, explicit threats of harm, and instructions for carrying out violent acts. This category covers both targeted threats and general promotion or glorification of violence.
Dangerous	Content that describes or promotes extremely hazardous behaviors that pose a significant risk of physical harm.
Criminal	Content that describes or promotes illegal activities.
Self-Harm	Content that promotes, instructs, plans, or encourages deliberate self-injury, suicide, eating disorders, or other self-destructive behaviors. This includes detailed methods, glorification, statements of intent, dangerous challenges, and related slang terms.
Health	Content that contains or tries to elicit detailed or tailored medical advice.
Financial	Content that contains or tries to elicit detailed or tailored financial advice.
Law	Content that contains or tries to elicit detailed or tailored legal advice.
PII	Content that requests, shares, or attempts to elicit personal identifying information such as full names, addresses, phone numbers, social security numbers, or financial account details.
Jailbreaking	Attempts to bypass or circumvent the model's safety guidelines, policies, or intended behavior through prompt manipulation, role-playing scenarios, or other techniques designed to elicit disallowed outputs.

Cookbooks

Our moderation cookbook provides a concrete example of how to use the Moderation service to implement system-level guardrails.
For a broader view, there is also a more exploratory cookbook.

Custom Guardrails

Custom guardrails let you declare moderation rules directly in your API requests, without manually calling the Moderation API and implementing threshold logic in your application code. Guardrails apply input moderation only — they run before the request reaches the model. When a guardrail is triggered, the request is blocked and a 403 error is returned.

Each guardrail uses the moderation_llm_v2 config, backed by mistral-moderation-2603.

Each config accepts:

custom_category_thresholds: object mapping category names to threshold values (0–1). Set a category to 1 to explicitly disable it.
ignore_other_categories: if true, only the categories listed in custom_category_thresholds are evaluated; all others are ignored.
action: "block" to block the request on violation.
block_on_error: if true, the request is blocked when the moderation API itself fails (per guardrail).
model_name (optional): override the default moderation model for that config.

Multiple guardrails can be specified per request — the request is blocked if any one is triggered. Only one moderation_llm_v2 config can be specified per guardrail object, but you can include multiple guardrail objects.
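The fields above can be assembled with a small helper. This is only a sketch: `build_guardrail` is a hypothetical function, not part of the SDK, and the range check simply mirrors the 0–1 threshold range described above.

```python
from typing import Any, Dict


def build_guardrail(
    thresholds: Dict[str, float],
    ignore_other_categories: bool = False,
    block_on_error: bool = True,
) -> Dict[str, Any]:
    """Build one guardrail object holding a single moderation_llm_v2 config."""
    for category, value in thresholds.items():
        # Thresholds are documented as values between 0 and 1.
        if not 0.0 <= value <= 1.0:
            raise ValueError(
                f"threshold for {category!r} must be in [0, 1], got {value}"
            )
    return {
        "block_on_error": block_on_error,
        "moderation_llm_v2": {
            "custom_category_thresholds": dict(thresholds),
            "ignore_other_categories": ignore_other_categories,
            "action": "block",
        },
    }


# Two guardrail objects; the request is blocked if either one is triggered.
guardrails = [
    build_guardrail({"sexual": 0.1, "selfharm": 0.1}),
    build_guardrail({"jailbreaking": 0.3}, ignore_other_categories=True),
]
```

The resulting list has the same shape as the guardrails field used in the request examples in this section.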

Inline Guardrails

Pass a guardrails field in any POST /v1/chat/completions or POST /v1/conversations request.

import os
from mistralai.client import Mistral

api_key = os.environ["MISTRAL_API_KEY"]
client = Mistral(api_key=api_key)

response = client.chat.complete(
    model="mistral-small-latest",
    messages=[{"role": "user", "content": "How far is the moon from Earth?"}],
    guardrails=[
        {
            "block_on_error": True,
            "moderation_llm_v2": {
                "custom_category_thresholds": {
                    "sexual": 0.1,
                    "selfharm": 0.1,
                },
                "ignore_other_categories": False,
                "action": "block",
            },
        }
    ],
)

Conversations Endpoint

You can also pass guardrails directly on a POST /v1/conversations request, either with a model or when you want to override an agent's guardrails.

import os
from mistralai.client import Mistral

api_key = os.environ["MISTRAL_API_KEY"]
client = Mistral(api_key=api_key)

response = client.beta.conversations.start(
    model="mistral-small-latest",
    inputs=[{"role": "user", "content": "How far is the moon from Earth?"}],
    guardrails=[
        {
            "block_on_error": True,
            "moderation_llm_v2": {
                "custom_category_thresholds": {
                    "sexual": 0.1,
                    "selfharm": 0.1,
                },
                "ignore_other_categories": False,
                "action": "block",
            },
        }
    ],
)

Agent-Level Guardrails

Guardrails can also be attached to an agent at creation time. All conversations using that agent automatically inherit them without needing to re-specify them per request. They can be overridden by passing guardrails directly on a POST /v1/conversations request.

import os
from mistralai.client import Mistral

api_key = os.environ["MISTRAL_API_KEY"]
client = Mistral(api_key=api_key)

agent = client.beta.agents.create(
    model="mistral-small-latest",
    name="Moderated Agent",
    guardrails=[
        {
            "block_on_error": True,
            "moderation_llm_v2": {
                "custom_category_thresholds": {
                    "sexual": 0.1,
                    "jailbreaking": 0.3,
                },
                "ignore_other_categories": False,
                "action": "block",
            },
        }
    ],
)

Responses

On a successful (non-blocked) request, a guardrails field is included in the response with the evaluation results for each guardrail. When ignore_other_categories is true, only the categories specified in custom_category_thresholds are returned; when it is false, all evaluated categories are included:

{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {}
    }
  ],
  "created": 1702256327,
  "id": "cmpl-e5cc70bb28c444948073e77776eb30ef",
  "model": "mistral-small-latest",
  "object": "chat.completion",
  "usage": {},
  "guardrails": [
    {
      "moderation_llm_v2": {
        "action": "pass",
        "categories": {
          "sexual": { "score": 0.03, "violated": false },
          "selfharm": { "score": 0.05, "violated": false },
          "violence_and_threats": { "score": 0.0, "violated": false },
          "hate_and_discrimination": { "score": 0.0, "violated": false }
        }
      }
    }
  ]
}
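For logging or monitoring, the guardrails field of a successful response can be reduced to one score per category. The sketch below works on a trimmed copy of the JSON shown above; `highest_scores` is a hypothetical helper, and it assumes every guardrail entry follows the moderation_llm_v2 shape in that example.

```python
import json
from typing import Dict

# Trimmed copy of the guardrails field from the example response.
body = json.loads("""
{
  "guardrails": [
    {
      "moderation_llm_v2": {
        "action": "pass",
        "categories": {
          "sexual": {"score": 0.03, "violated": false},
          "selfharm": {"score": 0.05, "violated": false}
        }
      }
    }
  ]
}
""")


def highest_scores(response_body: dict) -> Dict[str, float]:
    """Return the highest score seen per category across all guardrails."""
    best: Dict[str, float] = {}
    for guardrail in response_body.get("guardrails", []):
        categories = guardrail.get("moderation_llm_v2", {}).get("categories", {})
        for name, result in categories.items():
            best[name] = max(best.get(name, 0.0), result["score"])
    return best


print(highest_scores(body))
```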

When a guardrail blocks a request, a 403 is returned with details on which categories were violated:

{
  "error": {
    "message": "Content blocked by guardrail",
    "status": 403
  },
  "guardrails": {
    "results": {
      "moderation_llm_v2": {
        "model_name": "mistral-moderation-2603",
        "decisions": {
          "sexual": { "threshold": 0.1, "score": 0.3, "violated": true },
          "selfharm": { "threshold": 0.1, "score": 0.05, "violated": false },
          "violence_and_threats": { "threshold": 1.0, "score": 0.0, "violated": false },
          "hate_and_discrimination": { "threshold": 1.0, "score": 0.0, "violated": false }
        },
        "violated": true,
        "action": "block"
      }
    }
  }
}
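To report why a request was blocked, the violated categories can be read from the decisions object. The following is a sketch that assumes the 403 body has exactly the shape shown above; `violated_categories` is a hypothetical helper, and how your HTTP client or SDK exposes the error body is not covered here.

```python
from typing import List

# Trimmed copy of the 403 body from the example.
blocked_body = {
    "error": {"message": "Content blocked by guardrail", "status": 403},
    "guardrails": {
        "results": {
            "moderation_llm_v2": {
                "model_name": "mistral-moderation-2603",
                "decisions": {
                    "sexual": {"threshold": 0.1, "score": 0.3, "violated": True},
                    "selfharm": {"threshold": 0.1, "score": 0.05, "violated": False},
                },
                "violated": True,
                "action": "block",
            }
        }
    },
}


def violated_categories(error_body: dict) -> List[str]:
    """Return the sorted names of categories marked as violated."""
    result = (
        error_body.get("guardrails", {})
        .get("results", {})
        .get("moderation_llm_v2", {})
    )
    decisions = result.get("decisions", {})
    return sorted(name for name, d in decisions.items() if d.get("violated"))


print(violated_categories(blocked_body))
```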

If block_on_error is true and the moderation API fails, the request is blocked with the following error:

{
  "object": "Error",
  "message": "Request blocked due to error in guardrail evaluation and block_on_error is set to True.",
  "type": "invalid_request_error",
  "code": 3201,
  "guardrails": [
    {
      "moderation_llm_v2": {
        "action": "block",
        "error": {
          "message": "Moderation API request failed."
        }
      }
    }
  ]
}
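Client code may want to tell a policy block apart from a block caused by a failed evaluation, since the second case is often worth retrying. The sketch below classifies a decoded response body using only the fields visible in the examples in this section. `guardrail_outcome` is a hypothetical helper, and treating code 3201 as specific to this error is an assumption.

```python
def guardrail_outcome(body: dict) -> str:
    """Classify a decoded body as 'passed', 'blocked', or 'evaluation_error'."""
    # Evaluation failure with block_on_error: top-level "object": "Error".
    # Assumption: code 3201 identifies this case.
    if body.get("object") == "Error" and body.get("code") == 3201:
        return "evaluation_error"
    # Policy block: nested error object carrying status 403.
    error = body.get("error")
    if isinstance(error, dict) and error.get("status") == 403:
        return "blocked"
    return "passed"


# Minimal bodies modelled on the examples in this section.
evaluation_failure = {"object": "Error", "code": 3201, "guardrails": []}
policy_block = {"error": {"message": "Content blocked by guardrail", "status": 403}}
success = {"object": "chat.completion", "guardrails": []}

for sample in (evaluation_failure, policy_block, success):
    print(guardrail_outcome(sample))
```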
