Skip to main content



LLMs (Large Language Models), such as the Mistral AI models, are AI models trained on extensive text data to predict the next word in a sentence. They are capable of understanding and generating text in a way that's similar to how humans communicate. They can answer questions, draft documents, summarize texts, extract information, translate languages, write code, and more.

Text generation

Text generation in large language models is the process of producing coherent and contextually relevant text based on a given input prompt. These models, such as Mistral AI, are trained on vast amounts of text data to predict the next word in a sentence, given the previous words as context. This capability enables them to generate text that is similar to human communication and can be used for various applications, including answering questions, drafting documents, summarizing texts, translating languages, and coding.


Tokens serve as the smallest individual units that a language model processes, typically representing common sequences of characters such as words or subwords. In order for a language model to comprehend text, it must be converted into numerical representations. This is accomplished by encoding the text into a series of tokens, where each token is assigned a unique numerical index. The process of converting text into tokens is known as tokenization. One widely used tokenization algorithm is Byte-Pair Encoding (BPE), which initially treats each byte in a text as a separate token. BPE then iteratively adds new tokens to the vocabulary for the most frequent pair of tokens in the corpus, replacing occurrences of the pair with the new token, until no more replacements can be made. This results in a compact and efficient representation of the text for processing by the language model.

Mixture of Experts

Mixture of Experts (MoE) is the underlying architecture of Mixtral 8x7b and Mixtral 8x22b. It is a neural network architecture that incorporates expert layers within the Transformer block, allowing models to be pretrained with much less compute while maintaining the same quality as dense models. This is achieved by replacing dense feed-forward network (FFN) layers with sparse MoE layers, which contain multiple "experts" (FFNs). A gate network or router determines which input tokens are sent to which expert for computation. MoE offers benefits such as efficient pretraining and faster inference, but also presents challenges like overfitting during fine-tuning and high memory requirements. Nevertheless, MoE is a valuable method for achieving improved model quality at lower computing costs by dynamically assigning input tokens to specialized experts for processing.


Retrieval-augmented generation (RAG) is an AI framework that synergizes the capabilities of LLMs and information retrieval systems. There are two main steps in RAG: 1) retrieval: retrieve relevant information from a knowledge base with text embeddings stored in a vector store; 2) generation: insert the relevant information to the prompt for the LLM to generate information. RAG is useful to answer questions or generate content leveraging external knowledge including up-to-date information and domain-specific information. RAG allows the model to access and utilize information beyond its training data, reducing hallucination and improving factual accuracy. Check out our Basic RAG guide for details.


Fine-tuning is a process used in large language models to adapt a pre-trained model to a specific task or domain. It involves continuing the training process on a smaller, task-specific dataset and adjusting the model's parameters to optimize its performance on the new dataset. This enables the model to learn task-specific language patterns and improve its performance on the target task. Fine-tuning can be beneficial for adapting the model to a particular format or tone, domain-specific tasks, and improving performance through distillation from a larger model. This approach can achieve state-of-the-art performance with fewer data and computational resources compared to training a model from scratch.

Function calling

Function calling allows Mistral models to connect to external tools and call external functions or APIs to perform tasks beyond the model's capabilities. This allows the model to access and leverage external tools and resources to improve its performance and provide more accurate responses. Function calling can be used for tasks such as retrieving real-time data, performing calculations, accessing databases, and interacting with other systems or services. It improves the model's accuracy, efficiency, and versatility. Check out our Function Calling guide to learn more.


Embeddings are vectorial representations of text that capture the semantic meaning of paragraphs through their position in a high dimensional vector space. These vectors capture the semantic meaning and context of the text, allowing the model to understand and generate language more effectively. Mistral AI Embeddings API offers cutting-edge, state-of-the-art embeddings for text, which can be used for many NLP tasks. Check out our Embeddings guide to learn more.