vLLM
vLLM is an open-source LLM inference and serving engine. It is particularly well suited as a target platform for self-deploying Mistral models on-premises.
Pre-requisites
- The hardware requirements for vLLM are listed on its installation documentation page.
- By default, vLLM sources the model weights from Hugging Face. To access the Mistral model repositories you need to be authenticated on Hugging Face, so an access token `HF_TOKEN` with the `READ` permission is required. You should also make sure that you have accepted the conditions of access on each model card page.
- If you already have the model artifacts on your infrastructure, you can use them directly by pointing vLLM to their local path instead of a Hugging Face model ID (a short sketch is shown at the end of the offline inference section below). In this scenario you can skip all Hugging Face-related setup steps.
Getting started
The following sections will guide you through the process of deploying and querying Mistral models on vLLM.
Installing vLLM
- Create a Python virtual environment and install the `vllm` package (version `>=0.6.1.post1` to ensure maximum compatibility with all Mistral models).
- Authenticate on the Hugging Face Hub using your access token `$HF_TOKEN`:
huggingface-cli login --token $HF_TOKEN
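Optionally, you can sanity-check the setup from Python. The short sketch below (not required, and assuming `HF_TOKEN` is exported in your shell) prints the installed vLLM version and logs in to the Hugging Face Hub programmatically as an alternative to the CLI command above:

```python
import os

import vllm
from huggingface_hub import login

# The installed version should satisfy the >=0.6.1.post1 recommendation above.
print("vLLM version:", vllm.__version__)

# Programmatic alternative to `huggingface-cli login`, reading the token from
# the HF_TOKEN environment variable (skip this if you use local model files).
login(token=os.environ["HF_TOKEN"])
```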
Offline mode inference
When using vLLM in offline mode the model is loaded and used for one-off batch inference workloads.
- Text input (Mistral NeMo)
- Text input (Mistral Small)
- Image + text input (Pixtral-12B)
from vllm import LLM
from vllm.sampling_params import SamplingParams
model_name = "mistralai/Mistral-NeMo-Instruct-2407"
sampling_params = SamplingParams(max_tokens=8192)
llm = LLM(
model=model_name,
tokenizer_mode="mistral",
load_format="mistral",
config_format="mistral",
)
messages = [
{
"role": "user",
"content": "Who is the best French painter. Answer with detailed explanations.",
}
]
res = llm.chat(messages=messages, sampling_params=sampling_params)
print(res[0].outputs[0].text)
from vllm import LLM
from vllm.sampling_params import SamplingParams
model_name = "mistralai/Mistral-Small-Instruct-2409"
sampling_params = SamplingParams(max_tokens=8192)
llm = LLM(
model=model_name,
tokenizer_mode="mistral",
load_format="mistral",
config_format="mistral",
)
messages = [
{
"role": "user",
"content": "Who is the best French painter. Answer with detailed explanations.",
}
]
res = llm.chat(messages=messages, sampling_params=sampling_params)
print(res[0].outputs[0].text)
Suppose you want to caption a set of images, referenced by their URLs in the code below:
You can do so by running the following code:
from vllm import LLM
from vllm.sampling_params import SamplingParams
model_name = "mistralai/Pixtral-12B-2409"
max_img_per_msg = 3
sampling_params = SamplingParams(max_tokens=8192)
llm = LLM(
model=model_name,
tokenizer_mode="mistral",
load_format="mistral",
config_format="mistral",
limit_mm_per_prompt={"image": max_img_per_msg},
)
urls = [f"https://picsum.photos/id/{id}/512/512" for id in ["1", "11", "111"]]
messages = [
{
"role": "user",
"content": [
{"type": "text", "text": "Describe this image"},
] + [{"type": "image_url", "image_url": {"url": f"{u}"}} for u in urls],
},
]
res = llm.chat(messages=messages, sampling_params=sampling_params)
print(res[0].outputs[0].text)
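As mentioned in the pre-requisites, vLLM can also load model artifacts that are already present on your infrastructure by pointing it at a local path instead of a Hugging Face model ID. A minimal sketch, assuming a hypothetical local directory containing the model files in the Mistral format expected by the `mistral` load and config options:

```python
from vllm import LLM
from vllm.sampling_params import SamplingParams

# Hypothetical local directory holding the Mistral-format model artifacts
# (tokenizer, params.json and consolidated weights); adjust to your setup.
local_model_path = "/opt/models/Mistral-Nemo-Instruct-2407"

llm = LLM(
    model=local_model_path,    # a local path can replace a Hugging Face model ID
    tokenizer_mode="mistral",  # use Mistral's own tokenizer (mistral_common)
    load_format="mistral",     # load the consolidated Mistral-format weights
    config_format="mistral",   # read the model configuration from params.json
)

sampling_params = SamplingParams(max_tokens=1024)
messages = [
    {"role": "user", "content": "Summarize what vLLM is in one sentence."}
]

res = llm.chat(messages=messages, sampling_params=sampling_params)
print(res[0].outputs[0].text)
```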
Server mode inference
In server mode, vLLM spawns an HTTP server that continuously waits for clients to connect and send requests concurrently. The server exposes a REST API that implements the OpenAI protocol, allowing you to directly reuse existing code relying on the OpenAI API.
- Text input (Mistral NeMo)
- Text input (Mistral Small)
- Image + text input (Pixtral-12B)
Start the inference server to deploy your model, e.g. for Mistral NeMo:
vllm serve mistralai/Mistral-Nemo-Instruct-2407 \
--tokenizer_mode mistral \
--config_format mistral \
--load_format mistral
You can now run inference requests with text input:
- cURL
- Python
curl --location 'http://localhost:8000/v1/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer token' \
--data '{
"model": "mistralai/Mistral-Nemo-Instruct-2407",
"messages": [
{
"role": "user",
"content": "Who is the best French painter? Answer in one short sentence."
}
]
}'
import httpx
url = 'http://localhost:8000/v1/chat/completions'
headers = {
'Content-Type': 'application/json',
'Authorization': 'Bearer token'
}
data = {
"model": "mistralai/Mistral-Nemo-Instruct-2407",
"messages": [
{
"role": "user",
"content": "Who is the best French painter? Answer in one short sentence."
}
]
}
response = httpx.post(url, headers=headers, json=data)
print(response.json())
Start the inference server to deploy your model, e.g. for Mistral Small:
vllm serve mistralai/Mistral-Small-Instruct-2409 \
--tokenizer_mode mistral \
--config_format mistral \
--load_format mistral
You can now run inference requests with text input:
- cURL
- Python
curl --location 'http://localhost:8000/v1/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer token' \
--data '{
"model": "mistralai/Mistral-Small-Instruct-2409",
"messages": [
{
"role": "user",
"content": "Who is the best French painter? Answer in one short sentence."
}
]
}'
import httpx
url = 'http://localhost:8000/v1/chat/completions'
headers = {
'Content-Type': 'application/json',
'Authorization': 'Bearer token'
}
data = {
"model": "mistralai/Mistral-Small-Instruct-2409",
"messages": [
{
"role": "user",
"content": "Who is the best French painter? Answer in one short sentence."
}
]
}
response = httpx.post(url, headers=headers, json=data)
print(response.json())
Start the inference server to deploy your model, e.g. for Pixtral-12B:
vllm serve mistralai/Pixtral-12B-2409 \
--tokenizer_mode mistral \
--config_format mistral \
--load_format mistral
- The default number of image inputs per prompt is set to 1. To increase it, set the `--limit_mm_per_prompt` option (e.g. `--limit_mm_per_prompt 'image=4'`).
- If you encounter memory issues, set the `--max_model_len` option to reduce the memory requirements of vLLM (e.g. `--max_model_len 16384`). More troubleshooting details can be found in the vLLM documentation.
You can now run inference requests combining image and text inputs. Suppose you want to caption an image referenced by its URL in the request below.
You can prompt the model and retrieve its response like so:
- cURL
- Python
curl --location 'http://localhost:8000/v1/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer token' \
--data '{
"model": "mistralai/Pixtral-12B-2409",
"messages": [
{
"role": "user",
"content": [
{"type" : "text", "text": "Describe this image in a short sentence."},
{"type": "image_url", "image_url": {"url": "https://picsum.photos/id/237/200/300"}}
]
}
]
}'
import httpx
url = "http://localhost:8000/v1/chat/completions"
headers = {"Content-Type": "application/json", "Authorization": "Bearer token"}
data = {
"model": "mistralai/Pixtral-12B-2409",
"messages": [
{
"role": "user",
"content": [
{"type": "text", "text": "Describe this image in a short sentence."},
{
"type": "image_url",
"image_url": {"url": "https://picsum.photos/id/237/200/300"},
},
],
}
],
}
response = httpx.post(url, headers=headers, json=data)
print(response.json())
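If the server was started with a larger image budget (e.g. `--limit_mm_per_prompt 'image=4'`, as described in the notes above), a single request can reference several images. A minimal sketch using `httpx`, with placeholder sample image URLs:

```python
import httpx

url = "http://localhost:8000/v1/chat/completions"
headers = {"Content-Type": "application/json", "Authorization": "Bearer token"}

# Placeholder sample images; replace with your own URLs.
image_urls = [
    "https://picsum.photos/id/1/512/512",
    "https://picsum.photos/id/11/512/512",
]

data = {
    "model": "mistralai/Pixtral-12B-2409",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe each image in one sentence."},
                # One image_url entry per image, up to the server's --limit_mm_per_prompt value.
                *[{"type": "image_url", "image_url": {"url": u}} for u in image_urls],
            ],
        }
    ],
}

response = httpx.post(url, headers=headers, json=data, timeout=60.0)
print(response.json())
```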
Deploying with Docker
If you are looking to deploy vLLM as a containerized inference server you can leverage the project's official Docker image (see more details in the vLLM Docker documentation).
- Set the Hugging Face access token environment variable in your shell:
export HF_TOKEN=your-access-token
- Run the Docker command to start the container:
- Mistral NeMo
- Mistral Small
- Pixtral-12B
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}" \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model mistralai/Mistral-NeMo-Instruct-2407 \
--tokenizer_mode mistral \
--load_format mistral \
--config_format mistral

docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}" \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model mistralai/Mistral-Small-Instruct-2409 \
--tokenizer_mode mistral \
--load_format mistral \
--config_format mistral

docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}" \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model mistralai/Pixtral-12B-2409 \
--tokenizer_mode mistral \
--load_format mistral \
--config_format mistral
Once the container is up and running you will be able to run inference on your model using the same code as in a standalone deployment.
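Because the endpoint implements the OpenAI protocol whether vLLM runs directly on the host or inside the container, you can also query it with the official `openai` Python client instead of raw HTTP calls. A minimal sketch, assuming the Mistral NeMo server from the examples above is reachable on localhost:8000 and that the `openai` package is installed:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # the local vLLM server
    api_key="token",  # vLLM only checks this value if it was started with --api-key
)

response = client.chat.completions.create(
    model="mistralai/Mistral-Nemo-Instruct-2407",
    messages=[
        {
            "role": "user",
            "content": "Who is the best French painter? Answer in one short sentence.",
        }
    ],
)
print(response.choices[0].message.content)
```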