Benchmarks

LLM (Large Language Model) benchmarks are standardized tests or datasets used to evaluate the performance of large language models. These benchmarks help researchers and developers understand the strengths and weaknesses of their models and compare them with other models in a systematic way.
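
As a concrete illustration of how such a benchmark works, the sketch below scores a model on a tiny MMLU-style multiple-choice set: each question is formatted with lettered choices, the model is asked for a single letter, and accuracy is computed over the set. This is a minimal sketch; `query_model` is a hypothetical stand-in for any chat-completion call, and the two sample questions are invented for illustration.

```python
# Minimal sketch of multiple-choice benchmark scoring (MMLU-style).
# `query_model` is a hypothetical stand-in for a real chat-completion call.

def query_model(prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM and return its raw reply."""
    return "A"  # stubbed reply so the sketch runs end to end

def score(questions: list[dict]) -> float:
    """Return accuracy over a list of {question, choices, answer} items."""
    correct = 0
    for q in questions:
        options = "\n".join(
            f"{label}. {text}" for label, text in zip("ABCD", q["choices"])
        )
        prompt = (
            f"{q['question']}\n{options}\n"
            "Answer with a single letter (A, B, C, or D)."
        )
        prediction = query_model(prompt).strip().upper()[:1]
        correct += prediction == q["answer"]
    return correct / len(questions)

sample = [
    {"question": "2 + 2 = ?", "choices": ["4", "3", "5", "22"], "answer": "A"},
    {"question": "What is the capital of France?",
     "choices": ["Paris", "Rome", "Oslo", "Bonn"], "answer": "A"},
]
print(f"accuracy: {score(sample):.2f}")
```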

Mistral benchmarks

Mistral models demonstrate top-tier performance in advanced reasoning, multilingual tasks, math, and code generation. The company reports results on popular public benchmarks such as MMLU (Massive Multitask Language Understanding) and MT-bench, among others.

You can find the benchmark results in the following blog posts (a short API sketch for trying these models follows the list):

  • Pixtral: Pixtral 12B is the first open-source model to demonstrate state-of-the-art multimodal understanding without regressing on pure-text abilities.
  • Mistral Large: a cutting-edge text generation model with top-tier reasoning capabilities. It can be used for complex multilingual reasoning tasks, including text understanding, transformation, and code generation.
  • Mistral Nemo: Mistral Nemo's reasoning, world knowledge, and coding performance are state-of-the-art in its size category. Because it relies on a standard architecture, Mistral Nemo is easy to use and serves as a drop-in replacement for Mistral 7B, which it supersedes.
  • Codestral: a 22B model that sets a new standard in the performance/latency space for code generation compared to previous coding models.
  • Codestral-Mamba: trained with advanced code and reasoning capabilities, delivering performance on par with state-of-the-art Transformer-based models.
  • Mathstral: builds on Mistral 7B and specializes in STEM subjects. It achieves state-of-the-art reasoning capabilities in its size category across various industry-standard benchmarks.
  • Mixtral 8x22B: our most performant open model. It handles English, French, Italian, German, and Spanish, performs strongly on code-related tasks, and natively supports function calling.
  • Mixtral 8x7B: outperforms Llama 2 70B on most benchmarks with 6x faster inference and matches or outperforms GPT-3.5 on most standard benchmarks. It handles English, French, Italian, German, and Spanish, and shows strong performance in code generation.
  • Mistral 7B: outperforms Llama 2 13B on all benchmarks and Llama 1 34B on many benchmarks.
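
To spot-check any of the models above yourself, the minimal sketch below assumes the v1 `mistralai` Python client (`pip install mistralai`) and a `MISTRAL_API_KEY` environment variable; the model alias and prompt are illustrative.

```python
# Sketch: querying a Mistral model for a quick, informal check.
# Assumes the v1 `mistralai` client and MISTRAL_API_KEY in the environment.
import os

from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

response = client.chat.complete(
    model="mistral-large-latest",  # e.g. "open-mistral-nemo" or "codestral-latest"
    messages=[
        {"role": "user", "content": "Write a Python one-liner that reverses a string."}
    ],
)
print(response.choices[0].message.content)
```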

Scale SEAL Leaderboard

Scale AI reports private benchmark results in coding, instruction following, math, and Spanish. Mistral Large performs exceptionally well in code and Spanish, outperforming Llama 3 405B in these areas.

Artificial Analysis

Artificial Analysis compares and evaluates AI models across key performance metrics, including quality, price, output speed, latency, and context window. Our model stands out in several of these areas, highlighted below.

  • Artificial Analysis Quality Index: Our model ranks 3rd on this index, surpassing even the 405B model, underscoring its ability to generate high-quality output.
  • Coding (HumanEval): Our model takes 3rd place on HumanEval, again ahead of the 405B model, highlighting its proficiency in coding tasks (the standard pass@k metric behind this benchmark is sketched after this list).
  • Quantitative Reasoning (MATH): Our model places 4th on MATH, ahead of the 405B model, demonstrating strong quantitative reasoning.
  • Scientific Reasoning & Knowledge (GPQA): Our model ranks 4th on GPQA, showcasing robust scientific reasoning and knowledge.
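
For context, HumanEval results are conventionally reported as pass@k: the probability that at least one of k generated code samples passes the unit tests. The sketch below implements the standard unbiased estimator from the original HumanEval paper (Chen et al., 2021), where n samples are generated per problem and c of them pass; the example numbers are made up.

```python
# Unbiased pass@k estimator for HumanEval-style code benchmarks
# (Chen et al., 2021): 1 - C(n-c, k) / C(n, k), computed as a
# numerically stable product.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """n samples per problem, c of them correct, evaluated at budget k."""
    if n - c < k:  # every size-k draw must contain a correct sample
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Illustrative numbers: 200 samples per problem, 40 passing
print(pass_at_k(200, 40, 1))   # ≈ 0.2 (equals c/n for k=1)
print(pass_at_k(200, 40, 10))
```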

Qualitative feedback

We've gathered a lot of valuable insights from platforms like Reddit and Twitter. Below are some highlights and quotes from users who have shared their experiences with our models.

Pixtral:

“Pixtral absolutely SLAYS at OCR.”

“Very impressive at charts and diagrams and drawings and photos of screens.”

“It outperforms GPT-4o-mini in many examples I’ve tested.”

Mistral Large:

“Mistral large 2 has been my go to model.”

“This model is so good. In terms of local models, this is probably the first that I honestly felt was proprietary tier for coding.”

Mistral Nemo:

“I’ve been playing with Nemo for a few days now, and it blows me away at how coherent it is. It’s slightly ‘less creative and more repetitive’ than Llama 3 8B fine-tunes… But it feels ‘more coherent and has better instruction capabilities’.”

“Just wanna say thank you to those genius french over at Mistral for Nemo. 12B parameters and 128k context is a very useful combination. It’s enough of a size improvement over 7B to feel a little more ‘solid’ when talking to it, and it runs circles around Llama-2-13B, with 32x the context length. Thank you mistral!”