Text Generation Inference
Text Generation Inference (TGI) is a toolkit for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation for the most popular open-access LLMs. Its features include quantization, tensor parallelism, token streaming, continuous batching, Flash Attention, guidance, and more.
The easiest way to get started with TGI is to use the official Docker container.
Deploying
- Mistral-7B
- Mixtral-8X7B
- Mixtral-8X22B
model=mistralai/Mistral-7B-Instruct-v0.3
model=mistralai/Mixtral-8x7B-Instruct-v0.1
model=mistralai/Mixtral-8x22B-Instruct-v0.1
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
    ghcr.io/huggingface/text-generation-inference:2.0.3 \
    --model-id $model
This will spawn a TGI instance exposing an OpenAI-like API, as documented in the API section.
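Once the container is running and the model has loaded, a quick sanity check (assuming the port mapping above) is to query the server's info route, which reports the loaded model and the TGI version:
curl http://127.0.0.1:8080/info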
Make sure to set the HUGGING_FACE_HUB_TOKEN environment variable to your Hugging Face user access token. To use Mistral models, you must first visit the corresponding model page and fill out the short access form; access is then granted automatically.
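For example, export the token in your shell before launching the container so that the -e flag above picks it up (replace the placeholder with your own token):
export HUGGING_FACE_HUB_TOKEN=<your_hf_user_access_token>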
If the model does not fit in your GPU, you can also use quantization methods (AWQ, GPTQ, etc.), as sketched below. You can find all TGI launch options in the TGI documentation.
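As a sketch, quantization is enabled through the launcher's --quantize flag, reusing the variables from the Deploying section; awq and gptq expect pre-quantized checkpoints, while bitsandbytes quantizes on the fly:
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
    ghcr.io/huggingface/text-generation-inference:2.0.3 \
    --model-id $model \
    --quantize bitsandbytes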
Using the API
With a chat-compatible endpoint
TGI supports the Messages API, which is compatible with the Mistral and OpenAI Chat Completion APIs.
- Using MistralClient
- Using OpenAI Client
- Using cURL
from mistralai.client import MistralClient
from mistralai.models.chat_completion import ChatMessage
# init the client but point it to TGI
client = MistralClient(api_key="-", endpoint="http://127.0.0.1:8080")
chat_response = client.chat(
    model="-",
    messages=[
        ChatMessage(role="user", content="What is the best French cheese?")
    ]
)
print(chat_response.choices[0].message.content)
from openai import OpenAI
# init the client but point it to TGI
client = OpenAI(api_key="-", base_url="http://127.0.0.1:8080/v1")
chat_response = client.chat.completions.create(
    model="-",
    messages=[
        {"role": "user", "content": "What is deep learning?"}
    ]
)
print(chat_response)
curl http://127.0.0.1:8080/v1/chat/completions \
    -X POST \
    -d '{
        "model": "tgi",
        "messages": [
            {
                "role": "user",
                "content": "What is deep learning?"
            }
        ]
    }' \
    -H 'Content-Type: application/json'
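The chat endpoint also supports token streaming: setting "stream": true returns the completion as server-sent events. A minimal cURL sketch against the same server (-N disables curl's output buffering so tokens appear as they arrive):
curl -N http://127.0.0.1:8080/v1/chat/completions \
    -X POST \
    -d '{
        "model": "tgi",
        "stream": true,
        "messages": [
            {
                "role": "user",
                "content": "What is deep learning?"
            }
        ]
    }' \
    -H 'Content-Type: application/json'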
Using a generate endpoint
If you want more control over what you send to the server, you can use the generate
endpoint. In this case, you're responsible for formatting the prompt with the correct template and stop tokens (see the templating sketch at the end of this section).
- Using Python
- Using JavaScript
- Using cURL
# Make sure to install the huggingface_hub package first
from huggingface_hub import InferenceClient

client = InferenceClient(model="http://127.0.0.1:8080")
print(client.text_generation(prompt="What is Deep Learning?"))
async function query() {
    const response = await fetch(
        'http://127.0.0.1:8080/generate',
        {
            method: 'POST',
            headers: { 'Content-Type': 'application/json' },
            body: JSON.stringify({
                'inputs': 'What is Deep Learning?'
            })
        }
    );
    // Parse the JSON body so the caller receives the generated text
    return await response.json();
}

query().then((response) => {
    console.log(JSON.stringify(response));
});
curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
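As an illustration of manual templating, Mistral instruct models expect the user prompt wrapped in [INST] ... [/INST] markers. The request below is a sketch assuming that format; check the model card for the exact chat template and any special tokens (such as <s> and </s>) your model expects:
curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"[INST] What is Deep Learning? [/INST]","parameters":{"max_new_tokens":64,"stop":["</s>"]}}' \
    -H 'Content-Type: application/json'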