Self-deployment
Mistral AI provides ready-to-use Docker images on the GitHub registry. The model weights are distributed separately.
To run these images, you need a cloud virtual machine matching the requirements for a given model. These requirements can be found in the model description.
We recommend three different serving frameworks for our models:
- vLLM: A Python-only serving framework that deploys an API matching OpenAI's spec. vLLM provides a paged attention kernel to improve serving throughput (a sketch of querying such an OpenAI-compatible endpoint follows this list).
- NVIDIA's TensorRT-LLM served with NVIDIA's Triton Inference Server: TensorRT-LLM provides a DSL to build fast inference engines with dedicated kernels for large language models. Triton Inference Server allows efficient serving of these inference engines.
- TGI: A toolkit for deploying and serving LLMs, including an OpenAI-compatible API, grammar support, production monitoring, and tool-calling functionality.
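
Once one of these servers is running, it can be queried through its OpenAI-compatible endpoint. The sketch below is only an illustration: the base URL, API key, and model identifier are assumptions and should be replaced with the values of your own deployment.

```python
# Minimal sketch: query an OpenAI-compatible endpoint exposed by vLLM or TGI.
# The base_url, api_key, and model name are assumptions, not fixed values.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",          # assumed local server address
    api_key="not-needed-for-local-deployments",   # placeholder token
)

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",   # hypothetical model identifier
    messages=[{"role": "user", "content": "Write a haiku about deployment."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```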
These images can be run locally, or on your favorite cloud provider, using SkyPilot.
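
As an illustration, SkyPilot can provision a cloud VM and start such a container through its Python API. This is a sketch under assumptions: the GPU type, port, and Docker image reference below are placeholders, not the actual requirements of any specific model; use the values from the model description instead.

```python
# Minimal sketch of launching a serving container on a cloud VM with SkyPilot.
# The accelerator, port, and image reference are assumptions for illustration.
import sky

task = sky.Task(
    name="mistral-serving",
    run=(
        "docker run --gpus all -p 8000:8000 "
        "ghcr.io/example/mistral-serving:latest"  # hypothetical image reference
    ),
)
task.set_resources(sky.Resources(accelerators="A100:1"))  # assumed GPU requirement

# Provision a VM on the configured cloud provider and run the task on it.
sky.launch(task, cluster_name="mistral-serving")
```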