vLLM
vLLM is an open-source LLM inference and serving engine. It is particularly well suited as a target platform for self-deploying Mistral models on-premises.
Pre-requisites
- The hardware requirements for vLLM are listed on its installation documentation page.
- By default, vLLM sources the model weights from Hugging Face. To access the Mistral model repositories you need to be authenticated on Hugging Face, so an access token `HF_TOKEN` with the `READ` permission is required. You should also make sure that you have accepted the conditions of access on each model card page.
- If you already have the model artifacts on your infrastructure, you can use them directly by pointing vLLM to their local path instead of a Hugging Face model ID (a short sketch is shown at the end of the offline inference section below). In this scenario you can skip all Hugging Face-related setup steps.
Getting started
The following sections will guide you through the process of deploying and querying Mistral models on vLLM.
Installing vLLM
- Create a Python virtual environment and install the `vllm` package (version `>=0.6.1.post1` to ensure maximum compatibility with all Mistral models).
- Authenticate on the Hugging Face Hub using your access token `$HF_TOKEN`:
huggingface-cli login --token $HF_TOKEN
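Optionally, you can sanity-check the setup from Python. The short sketch below (not required, and assuming `HF_TOKEN` is exported in your shell) prints the installed vLLM version and logs in to the Hugging Face Hub programmatically as an alternative to the CLI command above:

```python
import os

import vllm
from huggingface_hub import login

# The installed version should satisfy the >=0.6.1.post1 recommendation above.
print("vLLM version:", vllm.__version__)

# Programmatic alternative to `huggingface-cli login`, reading the token from
# the HF_TOKEN environment variable (skip this if you use local model files).
login(token=os.environ["HF_TOKEN"])
```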
Offline mode inference
When using vLLM in offline mode the model is loaded and used for one-off batch inference workloads.
- Text input (Mistral NeMo)
- Text input (Mistral Small)
- Image + text input (Pixtral-12B)
from vllm import LLM
from vllm.sampling_params import SamplingParams
model_name = "mistralai/Mistral-NeMo-Instruct-2407"
sampling_params = SamplingParams(max_tokens=8192)
llm = LLM(
model=model_name,
tokenizer_mode="mistral",
load_format="mistral",
config_format="mistral",
)
messages = [
{
"role": "user",
"content": "Who is the best French painter. Answer with detailed explanations.",
}
]
res = llm.chat(messages=messages, sampling_params=sampling_params)
print(res[0].outputs[0].text)
from vllm import LLM
from vllm.sampling_params import SamplingParams
model_name = "mistralai/Mistral-Small-Instruct-2409"
sampling_params = SamplingParams(max_tokens=8192)
llm = LLM(
model=model_name,
tokenizer_mode="mistral",
load_format="mistral",
config_format="mistral",
)
messages = [
{
"role": "user",
"content": "Who is the best French painter. Answer with detailed explanations.",
}
]
res = llm.chat(messages=messages, sampling_params=sampling_params)
print(res[0].outputs[0].text)
Suppose you want to caption a set of images, referenced by their URLs in the code below:
You can do so by running the following code:
from vllm import LLM
from vllm.sampling_params import SamplingParams
model_name = "mistralai/Pixtral-12B-2409"
max_img_per_msg = 3
sampling_params = SamplingParams(max_tokens=8192)
llm = LLM(
model=model_name,
tokenizer_mode="mistral",
load_format="mistral",
config_format="mistral",
limit_mm_per_prompt={"image": max_img_per_msg},
)
urls = [f"https://picsum.photos/id/{id}/512/512" for id in ["1", "11", "111"]]
messages = [
{
"role": "user",
"content": [
{"type": "text", "text": "Describe this image"},
] + [{"type": "image_url", "image_url": {"url": f"{u}"}} for u in urls],
},
]
res = llm.chat(messages=messages, sampling_params=sampling_params)
print(res[0].outputs[0].text)
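As mentioned in the pre-requisites, vLLM can also load model artifacts that are already present on your infrastructure by pointing it at a local path instead of a Hugging Face model ID. A minimal sketch, assuming a hypothetical local directory containing the model files in the Mistral format expected by the `mistral` load and config options:

```python
from vllm import LLM
from vllm.sampling_params import SamplingParams

# Hypothetical local directory holding the Mistral-format model artifacts
# (tokenizer, params.json and consolidated weights); adjust to your setup.
local_model_path = "/opt/models/Mistral-Nemo-Instruct-2407"

llm = LLM(
    model=local_model_path,    # a local path can replace a Hugging Face model ID
    tokenizer_mode="mistral",  # use Mistral's own tokenizer (mistral_common)
    load_format="mistral",     # load the consolidated Mistral-format weights
    config_format="mistral",   # read the model configuration from params.json
)

sampling_params = SamplingParams(max_tokens=1024)
messages = [
    {"role": "user", "content": "Summarize what vLLM is in one sentence."}
]

res = llm.chat(messages=messages, sampling_params=sampling_params)
print(res[0].outputs[0].text)
```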
Server mode inference
In server mode, vLLM spawns an HTTP server that continuously waits for clients to connect and send requests concurrently. The server exposes a REST API that implements the OpenAI protocol, allowing you to directly reuse existing code relying on the OpenAI API.
- Text input (Mistral NeMo)
- Text input (Mistral Small)
- Image + text input (Pixtral-12B)
Start the inference server to deploy your model, e.g. for Mistral NeMo:
vllm serve mistralai/Mistral-Nemo-Instruct-2407 \
--tokenizer_mode mistral \
--config_format mistral \
--load_format mistral
You can now run inference requests with text input:
- cURL
- Python
curl --location 'http://localhost:8000/v1/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer token' \
--data '{
"model": "mistralai/Mistral-Nemo-Instruct-2407",
"messages": [
{
"role": "user",
"content": "Who is the best French painter? Answer in one short sentence."
}
]
}'
import httpx
url = 'http://localhost:8000/v1/chat/completions'
headers = {
'Content-Type': 'application/json',
'Authorization': 'Bearer token'
}
data = {
"model": "mistralai/Mistral-Nemo-Instruct-2407",
"messages": [
{
"role": "user",
"content": "Who is the best French painter? Answer in one short sentence."
}
]
}
response = httpx.post(url, headers=headers, json=data)
print(response.json())
Start the inference server to deploy your model, e.g. for Mistral Small:
vllm serve mistralai/Mistral-Small-Instruct-2409 \
--tokenizer_mode mistral \
--config_format mistral \
--load_format mistral
You can now run inference requests with text input:
- cURL
- Python
curl --location 'http://localhost:8000/v1/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer token' \
--data '{
"model": "mistralai/Mistral-Small-Instruct-2409",
"messages": [
{
"role": "user",
"content": "Who is the best French painter? Answer in one short sentence."
}
]
}'
import httpx
url = 'http://localhost:8000/v1/chat/completions'
headers = {
'Content-Type': 'application/json',
'Authorization': 'Bearer token'
}
data = {
"model": "mistralai/Mistral-Small-Instruct-2409",
"messages": [
{
"role": "user",
"content": "Who is the best French painter? Answer in one short sentence."
}
]
}
response = httpx.post(url, headers=headers, json=data)
print(response.json())
Start the inference server to deploy your model, e.g. for Pixtral-12B:
vllm serve mistralai/Pixtral-12B-2409 \
--tokenizer_mode mistral \
--config_format mistral \
--load_format mistral
- The default number of image inputs per prompt is set to 1. To increase it, set the `--limit_mm_per_prompt` option (e.g. `--limit_mm_per_prompt 'image=4'`).
- If you encounter memory issues, set the `--max_model_len` option to reduce the memory requirements of vLLM (e.g. `--max_model_len 16384`). More troubleshooting details can be found in the vLLM documentation.
You can now run inference requests combining image and text inputs. Suppose you want to caption an image referenced by its URL in the request below.
You can prompt the model and retrieve its response like so:
- cURL
- Python
curl --location 'http://localhost:8000/v1/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer token' \
--data '{
"model": "mistralai/Pixtral-12B-2409",
"messages": [
{
"role": "user",
"content": [
{"type" : "text", "text": "Describe this image in a short sentence."},
{"type": "image_url", "image_url": {"url": "https://picsum.photos/id/237/200/300"}}
]
}
]
}'
import httpx
url = "http://localhost:8000/v1/chat/completions"
headers = {"Content-Type": "application/json", "Authorization": "Bearer token"}
data = {
"model": "mistralai/Pixtral-12B-2409",
"messages": [
{
"role": "user",
"content": [
{"type": "text", "text": "Describe this image in a short sentence."},
{
"type": "image_url",
"image_url": {"url": "https://picsum.photos/id/237/200/300"},
},
],
}
],
}
response = httpx.post(url, headers=headers, json=data)
print(response.json())
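If the server was started with a larger image budget (e.g. `--limit_mm_per_prompt 'image=4'`, as described in the notes above), a single request can reference several images. A minimal sketch using `httpx`, with placeholder sample image URLs:

```python
import httpx

url = "http://localhost:8000/v1/chat/completions"
headers = {"Content-Type": "application/json", "Authorization": "Bearer token"}

# Placeholder sample images; replace with your own URLs.
image_urls = [
    "https://picsum.photos/id/1/512/512",
    "https://picsum.photos/id/11/512/512",
]

data = {
    "model": "mistralai/Pixtral-12B-2409",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe each image in one sentence."},
                # One image_url entry per image, up to the server's --limit_mm_per_prompt value.
                *[{"type": "image_url", "image_url": {"url": u}} for u in image_urls],
            ],
        }
    ],
}

response = httpx.post(url, headers=headers, json=data, timeout=60.0)
print(response.json())
```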
Deploying with Docker
If you are looking to deploy vLLM as a containerized inference server you can leverage the project's official Docker image (see more details in the vLLM Docker documentation).
- Set the Hugging Face access token environment variable in your shell:
export HF_TOKEN=your-access-token
- Run the Docker command to start the container:
- Mistral NeMo
- Mistral Small
- Pixtral-12B
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}" \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model mistralai/Mistral-NeMo-Instruct-2407 \
--tokenizer_mode mistral \
--load_format mistral \
--config_format mistral

docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}" \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model mistralai/Mistral-Small-Instruct-2409 \
--tokenizer_mode mistral \
--load_format mistral \
--config_format mistral

docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}" \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model mistralai/Pixtral-12B-2409 \
--tokenizer_mode mistral \
--load_format mistral \
--config_format mistral
Once the container is up and running you will be able to run inference on your model using the same code as in a standalone deployment.
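Because the endpoint implements the OpenAI protocol whether vLLM runs directly on the host or inside the container, you can also query it with the official `openai` Python client instead of raw HTTP calls. A minimal sketch, assuming the Mistral NeMo server from the examples above is reachable on localhost:8000 and that the `openai` package is installed:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # the local vLLM server
    api_key="token",  # vLLM only checks this value if it was started with --api-key
)

response = client.chat.completions.create(
    model="mistralai/Mistral-Nemo-Instruct-2407",
    messages=[
        {
            "role": "user",
            "content": "Who is the best French painter? Answer in one short sentence.",
        }
    ],
)
print(response.choices[0].message.content)
```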