TensorRT
Follow the official TensorRT-LLM documentation to build the engine.
- For Mistral-7B, you can use the LLaMA example
- For Mixtral-8X7B, official documentation coming soon...
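For Mistral-7B, the build via the LLaMA example typically has two steps: convert the Hugging Face checkpoint, then build the engine. A minimal sketch follows; the local paths are assumptions and the exact flags depend on your TensorRT-LLM version, so defer to the official documentation.

```shell
# Hedged sketch: building a Mistral-7B engine with the TensorRT-LLM
# LLaMA example. Paths and flag availability vary across TensorRT-LLM
# versions -- treat this as illustrative, not authoritative.

# 1. Convert the Hugging Face checkpoint
#    (convert_checkpoint.py lives under examples/llama).
#    ./Mistral-7B-v0.1 is an assumed local checkpoint path.
python convert_checkpoint.py \
    --model_dir ./Mistral-7B-v0.1 \
    --output_dir ./ckpt_mistral_7b \
    --dtype float16

# 2. Build the serialized engine from the converted checkpoint.
trtllm-build \
    --checkpoint_dir ./ckpt_mistral_7b \
    --output_dir ./engine_mistral_7b \
    --gemm_plugin float16
```

The resulting `engine_mistral_7b` directory holds the engine files that the deployment step below expects.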
Deploying the engine
Once the engine is built, it can be deployed using the Triton Inference Server and its TensorRT-LLM backend.
Follow the official documentation.
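As a rough sketch of that deployment, the engine is placed into a Triton model repository prepared from the `tensorrtllm_backend` templates, and the server is launched against it. The repository layout and container image tag below are assumptions; check the official `tensorrtllm_backend` documentation for the versions matching your build.

```shell
# Hedged sketch: serving the built engine with Triton's TensorRT-LLM
# backend. The repository layout and image tag are assumptions.

# Copy the built engine into a model repository prepared from the
# tensorrtllm_backend templates (e.g. all_models/inflight_batcher_llm).
cp ./engine_mistral_7b/* triton_model_repo/tensorrt_llm/1/

# Launch Triton with the model repository mounted; ports 8000/8001/8002
# expose the HTTP, gRPC, and metrics endpoints respectively.
docker run --rm --gpus all \
    -v "$(pwd)/triton_model_repo:/models" \
    -p 8000:8000 -p 8001:8001 -p 8002:8002 \
    nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3 \
    tritonserver --model-repository=/models
```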