# Deploy with SkyPilot
SkyPilot is a framework for running LLMs, AI, and batch jobs on any cloud, offering maximum cost savings, highest GPU availability, and managed execution.
We provide an example SkyPilot config that deploys the Mistral-7B-v0.1 model on an AWS `g5.xlarge` instance (a node with a single NVIDIA A10G GPU).
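If you have not set up SkyPilot yet, a minimal installation might look like the following (the `[aws]` extra and the `sky check` command are part of SkyPilot's standard installation flow; this assumes your AWS credentials are already configured locally):

```bash
# Install SkyPilot with AWS support (use the extra matching your cloud).
pip install "skypilot[aws]"

# Verify that SkyPilot can see your cloud credentials.
sky check
```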
## SkyPilot Configuration
After installing SkyPilot, create a configuration file that tells SkyPilot how and where to deploy your inference server, using our pre-built Docker container:
```yaml
envs:
  MODEL_NAME: mistralai/Mistral-7B-v0.1

resources:
  cloud: aws
  accelerators: A10G:1
  ports:
    - 8000

run: |
  docker run --gpus all -p 8000:8000 ghcr.io/mistralai/mistral-src/vllm:latest \
    --host 0.0.0.0 \
    --model $MODEL_NAME \
    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE
```
Save this configuration as `mistral-7b-v0.1.yaml`. Note that SkyPilot sets `$SKYPILOT_NUM_GPUS_PER_NODE` at runtime to match the `accelerators` spec, so the tensor parallelism degree follows the instance type automatically. You can then use `sky launch` to launch the inference server with the name `mistral-7b`:
```bash
sky launch -c mistral-7b mistral-7b-v0.1.yaml --region us-east-1
```
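Provisioning can take a few minutes. To follow progress and see the container's output, you can use SkyPilot's standard status and log commands (with the cluster name `mistral-7b` chosen above):

```bash
# List your clusters and their state.
sky status

# Stream the logs of the run section (the vLLM server) on the cluster.
sky logs mistral-7b
```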
When deployed this way, the model is accessible to the whole world. You must secure it: expose it exclusively on your private network (change the `--host` Docker option for that), add a load balancer with an authentication mechanism in front of it, or configure your instance's networking properly.
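As one sketch of the private-network option, you could bind the server to localhost and reach it through an SSH tunnel. This assumes the `run` command was changed to use `--host 127.0.0.1` (and `-p 127.0.0.1:8000:8000`), and relies on the SSH host alias SkyPilot typically adds to `~/.ssh/config` for the cluster:

```bash
# Forward local port 8000 to the server running on the cluster;
# `mistral-7b` is the SSH alias SkyPilot creates for the cluster.
ssh -L 8000:localhost:8000 mistral-7b

# In another terminal: the API is now reachable only through the tunnel.
curl http://localhost:8000/v1/models
```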
## Test it out!
To easily retrieve the IP address of the deployed `mistral-7b` cluster, you can use:

```bash
sky status --ip mistral-7b
```
You can then use `curl` to send a completion request:

```bash
IP=$(sky status --ip mistral-7b)

curl http://$IP:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-v0.1",
    "prompt": "My favourite condiment is",
    "max_tokens": 25
  }'
```
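To quickly check that the server is up before sending prompts, you can query the OpenAI-compatible model list endpoint that vLLM exposes (using the same `$IP` as above):

```bash
# Should return a JSON list containing mistralai/Mistral-7B-v0.1.
curl http://$IP:8000/v1/models
```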
## Usage Quotas
Many cloud providers require you to explicitly request access to powerful GPU instances. Read SkyPilot's guide on how to do this.
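Finally, when you are done experimenting, remember to release the cluster so the GPU instance stops accruing charges. These are standard SkyPilot commands:

```bash
# Stop the cluster (it can be restarted later with `sky start mistral-7b`).
sky stop mistral-7b

# Or terminate it entirely.
sky down mistral-7b
```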