Observability
Why observability?
Observability is essential for Large Language Model (LLM) systems across prototyping, testing, and production for several reasons:
- Visibility: Observability provides detailed insights into the internal states of LLM applications, allowing developers to understand system behavior. This visibility is crucial for identifying, diagnosing, and debugging issues.
- Production requirement: Implementing observability in production environments addresses critical requirements, including monitoring, scalability, security, and compliance.
- Reproducibility: By recording the inputs, configurations, and outputs of each run, observability enables developers to reproduce LLM system behavior.
- Continuous improvement: The insights gained from observability data can be used to drive continuous improvement initiatives.
What components do we observe?
The short answer is: anything and everything!
An LLM application can include one or more LLM calls. Understanding both the details of each individual API call and the sequence of these calls at the application level is crucial:
- Individual LLM call level: At the individual LLM API call level, an LLM receives an input prompt and generates an output. Therefore, we can monitor and observe three key components: the input prompt, the model, and the output.
- Application level: At the application level, it’s important to observe the pattern and sequence of LLM calls. This sequence determines the flow of information, the order in which LLMs are called, and which tasks are executed. A minimal tracing sketch for both levels follows this list.
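To make this concrete, here is a minimal sketch of call-level and application-level tracing in plain Python. The `LLMCallRecord` and `ApplicationTrace` classes are illustrative, not a standard API; production systems typically use a tracing framework such as OpenTelemetry instead.

```python
import time
import uuid
from dataclasses import dataclass, field, asdict

@dataclass
class LLMCallRecord:
    """Observability record for a single LLM API call."""
    prompt: str            # input prompt sent to the model
    model: str             # model identifier used for the call
    output: str            # generated completion
    latency_s: float       # wall-clock latency of the call

@dataclass
class ApplicationTrace:
    """Observability record for one application run: an ordered sequence of calls."""
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    calls: list = field(default_factory=list)

    def record_call(self, prompt: str, model: str, call_fn):
        """Wrap one LLM call, capturing its input, model, output, and latency."""
        start = time.perf_counter()
        output = call_fn(prompt)  # call_fn stands in for a real LLM client
        self.calls.append(LLMCallRecord(prompt, model, output, time.perf_counter() - start))
        return output

# Routing every call through the trace makes both the call-level details
# and the application-level call sequence observable.
trace = ApplicationTrace()
trace.record_call("Summarize: ...", "example-model", lambda p: "stub output")
print([asdict(c) for c in trace.calls])
```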
Individual level: what components can we observe?
For effective observability, we need to monitor and record detailed information for each version of each component involved in the interaction with the LLM. Here's a breakdown of what to observe and some expected modules in an observability tool:
Input prompt
- Prompt template
- The standardized format or structure used to generate the input prompt, including any placeholders or variables within the template.
- Observability tools often provide a registry of prompt templates that the community or an organization can use and share.
- Examples
- Few-shot in-context learning is often effective in prompt engineering. Specific examples or sample inputs can be used to guide the model's response.
- Retrieved context
- In a Retrieval-Augmented Generation (RAG) system, relevant context is retrieved from external sources or databases to provide information for the LLM, making the results more reliable.
- Memory
- Historical data or previous interactions stored in memory.
- How this memory is used to influence the current prompt, such as summarizing past memory, retrieving relevant memory, or using the most recent memory.
- Tools
- Any tools or utilities used to preprocess or enhance the input prompt.
- Tools are becoming increasingly important in LLM applications, serving as the bridge to real-life applications.
- Specific configurations or settings applied by these tools and their impact.
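Putting these prompt components together, a single structured record per call can capture each element and its version. Below is a minimal sketch; the `PromptTrace` fields are illustrative, not a standard schema.

```python
from dataclasses import dataclass, field

@dataclass
class PromptTrace:
    """Versioned record of every component that shaped one input prompt."""
    template_id: str                                        # registered prompt template
    template_version: str                                   # version of that template
    variables: dict = field(default_factory=dict)           # values filled into placeholders
    few_shot_examples: list = field(default_factory=list)   # in-context examples
    retrieved_context: list = field(default_factory=list)   # RAG passages with their sources
    memory_strategy: str = "none"                           # e.g. "summarize", "retrieve", "most_recent"
    memory_items: list = field(default_factory=list)        # past interactions included
    tools: list = field(default_factory=list)               # tools/configs applied to the prompt

prompt_trace = PromptTrace(
    template_id="qa-with-context",
    template_version="v3",
    variables={"question": "What is observability?"},
    retrieved_context=[{"source": "docs/llm.md", "text": "..."}],
    memory_strategy="most_recent",
)
```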
Model
- Model specs
- The specific version or identifier of the model being used.
- Configuration settings, hyperparameters, and any customizations applied to the model.
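For instance, the model side of a trace can pin down exactly which model and settings produced an output. All identifiers and values below are illustrative:

```python
# Illustrative model-spec record attached to each call's trace.
model_trace = {
    "model": "example-provider/example-model",   # hypothetical model identifier
    "model_version": "2024-06-01",               # exact version/snapshot used
    "hyperparameters": {"temperature": 0.2, "top_p": 0.9, "max_tokens": 512},
    "customizations": {"fine_tune_id": None, "system_prompt_id": "support-agent-v2"},
}
```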
Output
- Formatting
- The structure and format of the output generated by the model.
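Since many applications expect structured output, a simple practice is to log the raw completion together with whether it parsed into the expected format. A minimal sketch, assuming JSON output is expected:

```python
import json

def record_output(raw_output: str) -> dict:
    """Log the raw completion plus whether it parsed into the expected JSON structure."""
    record = {"raw": raw_output, "format_ok": False, "parsed": None}
    try:
        record["parsed"] = json.loads(raw_output)
        record["format_ok"] = True
    except json.JSONDecodeError:
        pass  # keep format_ok=False; alert if the failure rate rises
    return record

print(record_output('{"answer": "42"}'))   # format_ok=True
print(record_output("not json"))           # format_ok=False
```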
Application level: what workflow patterns can we observe?
An LLM system is often composed of more than just one LLM. At the application level, there are workflow patterns that require observability at each step of the workflow. Here are some example workflows:
- RAG
- A RAG system includes the document retrieval step in addition to the generation step from an LLM. Additional observability is needed to track and monitor the external document/dataset and the retrieval step (see the sketch after this list).
- LLM as part of a system
- An LLM system could involve multiple LLMs chained together, flow engineering with various iterations, or a complicated multi-agent system, for example to create a simulated world. The input and output of each step need to be observed to understand the overall system behavior, identify bottlenecks, and ensure the system's reliability and performance.
- Fine-tuning
- Fine-tuning is a distinct workflow that might be part of a larger workflow or a prerequisite step of another workflow. It involves preparing a fine-tuning dataset, uploading data, creating a fine-tuning job, and using a fine-tuned model. Each of these steps, especially the fine-tuning training job, could benefit from observability to track fine-tuning datasets, monitor progress, identify issues, and ensure the quality of the fine-tuned model.
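As a concrete example of step-level observability, the sketch below traces the retrieval and generation steps of a RAG pipeline separately; `retrieve` and `generate` are hypothetical stand-ins for a real retriever and LLM client.

```python
import time

def observed_step(trace: list, step_name: str, fn, *args):
    """Run one workflow step and append its inputs, output, and latency to the trace."""
    start = time.perf_counter()
    result = fn(*args)
    trace.append({
        "step": step_name,
        "inputs": args,
        "output": result,
        "latency_s": round(time.perf_counter() - start, 4),
    })
    return result

def retrieve(query):            # hypothetical retriever
    return [{"doc_id": "doc-1", "score": 0.82, "text": "..."}]

def generate(query, passages):  # hypothetical LLM call
    return f"Answer to {query!r} based on {len(passages)} passage(s)."

trace = []
query = "What is observability?"
passages = observed_step(trace, "retrieval", retrieve, query)
answer = observed_step(trace, "generation", generate, query, passages)
# The trace now shows each step's inputs and outputs, e.g. which documents were retrieved.
```

The same `observed_step` wrapper applies to chained-LLM or multi-agent steps, and to the stages of a fine-tuning workflow (dataset preparation, upload, job creation).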
What metrics do we observe?
At each step of the LLM system workflow, we can observe the following and set overall Service Level Objectives (SLOs), alerts, and monitoring:
Token and cost
- Track the number of tokens processed and the associated costs.
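A minimal sketch of per-call token and cost accounting, assuming token counts come from the provider's API response and using illustrative prices:

```python
# Illustrative prices per 1M tokens; check your provider's current pricing.
PRICE_PER_M = {"input": 5.00, "output": 15.00}

def track_cost(usage_totals: dict, prompt_tokens: int, completion_tokens: int) -> float:
    """Accumulate token counts and return the cost of this call in dollars."""
    usage_totals["prompt_tokens"] = usage_totals.get("prompt_tokens", 0) + prompt_tokens
    usage_totals["completion_tokens"] = usage_totals.get("completion_tokens", 0) + completion_tokens
    cost = (prompt_tokens * PRICE_PER_M["input"]
            + completion_tokens * PRICE_PER_M["output"]) / 1_000_000
    usage_totals["cost_usd"] = usage_totals.get("cost_usd", 0.0) + cost
    return cost

totals = {}
# Token counts typically come from the API response's usage field.
track_cost(totals, prompt_tokens=1200, completion_tokens=300)
print(totals)
```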