Building an LLM Agent to Find Relevant Research Papers from Arxiv


This notebook was created by Andrei Chernov (Github, Linkedin). In this tutorial, we will create an LLM agent based on the MistralAI language model. The agent's primary purpose will be to find and summarize research papers from Arxiv that are relevant to the user's query. To build the agent, we will use the LlamaIndex framework.

Tools Used by the Agent

The agent will utilize the following three tools:

  1. RAG Query Engine This tool will store and retrieve recent papers from Arxiv, serving as a knowledge base for efficient and quick access to relevant information.

  2. Paper Fetch Tool If the user specifies a topic that is not covered in the RAG Query Engine, this tool will fetch recent papers on the specified topic directly from Arxiv.

  3. PDF Download Tool This tool allows the agent to download a research paper's PDF file locally using a link provided by Arxiv.

First, let's install the necessary libraries.

!pip install arxiv==2.1.3 llama_index==0.12.3 llama-index-llms-mistralai==0.3.0 llama-index-embeddings-mistralai==0.3.0 
!pip install arize-phoenix==7.2.0 arize-phoenix-evals==0.18.0 openinference-instrumentation-llama-index==3.0.2
from getpass import getpass
import requests
import sys
import arxiv
from llama_index.llms.mistralai import MistralAI
from llama_index.embeddings.mistralai import MistralAIEmbedding
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Document, StorageContext, load_index_from_storage, PromptTemplate, Settings
from llama_index.core.tools import FunctionTool, QueryEngineTool
from llama_index.core.agent import ReActAgent

Additionally, You Need to Provide Your API Key to Access Mistral Models

You can obtain an API key from the Mistral AI console at https://console.mistral.ai.

api_key = getpass("Type your API Key")
llm = MistralAI(api_key=api_key, model='mistral-large-latest')
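If you want to confirm the key works before going further, a single completion call is enough. This optional sanity check is not part of the original flow; the prompt text is arbitrary.

# Optional sanity check: one completion call confirms the API key is valid
print(llm.complete("Reply with the single word: ready"))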

To Build a RAG Query Engine, We Will Need an Embedding Model

For this tutorial, we will use the MistralAI embedding model.

model_name = "mistral-embed"
embed_model = MistralAIEmbedding(model_name=model_name, api_key=api_key)
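As another optional check, you can embed a short string and inspect the vector size (mistral-embed produces 1024-dimensional vectors); the sample text below is just a placeholder.

# Optional sanity check: embed a short string and inspect the vector dimension
sample_embedding = embed_model.get_text_embedding("large language models")
print(len(sample_embedding))  # mistral-embed returns 1024-dimensional vectors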

Now, We Will Download Recent Papers About Large Language Models from ArXiv

To keep this tutorial accessible with the free Mistral API version, we will download only the last 10 papers. Downloading more would exceed the limit later while building the RAG query engine. However, if you have a Mistral subscription, you can download additional papers.

def fetch_arxiv_papers(title: str, papers_count: int):
    search_query = f'all:"{title}"'
    search = arxiv.Search(
        query=search_query,
        max_results=papers_count,
        sort_by=arxiv.SortCriterion.SubmittedDate,
        sort_order=arxiv.SortOrder.Descending
    )

    papers = []
    # Use the Client to execute the search
    client = arxiv.Client()
    results = client.results(search)

    for result in results:
        paper_info = {
                'title': result.title,
                'authors': [author.name for author in result.authors],
                'summary': result.summary,
                'published': result.published,
                'journal_ref': result.journal_ref,
                'doi': result.doi,
                'primary_category': result.primary_category,
                'categories': result.categories,
                'pdf_url': result.pdf_url,
                'arxiv_url': result.entry_id
            }
        papers.append(paper_info)

    return papers
    
papers = fetch_arxiv_papers("Language Models", 10)
[[p['title']] for p in papers]

To Build a RAG Agent, We First Need to Index All Documents

This process creates a vector representation for each chunk of a document using the embedding model.

def create_documents_from_papers(papers):
    documents = []
    for paper in papers:
        content = f"Title: {paper['title']}\n" \
                  f"Authors: {', '.join(paper['authors'])}\n" \
                  f"Summary: {paper['summary']}\n" \
                  f"Published: {paper['published']}\n" \
                  f"Journal Reference: {paper['journal_ref']}\n" \
                  f"DOI: {paper['doi']}\n" \
                  f"Primary Category: {paper['primary_category']}\n" \
                  f"Categories: {', '.join(paper['categories'])}\n" \
                  f"PDF URL: {paper['pdf_url']}\n" \
                  f"arXiv URL: {paper['arxiv_url']}\n"
        documents.append(Document(text=content))
    return documents



# Create documents for LlamaIndex
documents = create_documents_from_papers(papers)
Settings.chunk_size = 1024
Settings.chunk_overlap = 50

index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)
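Before wiring the index into an agent, it can be reassuring to test retrieval directly. This is a quick sketch; the question is only a placeholder, and any query about the downloaded papers will do.

# Quick retrieval check: fetch the two chunks most similar to a sample question
retriever = index.as_retriever(similarity_top_k=2)
retrieved_nodes = retriever.retrieve("Which papers discuss instruction tuning?")
for node_with_score in retrieved_nodes:
    print(node_with_score.score, node_with_score.node.get_content()[:200])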

Now, We Will Store the Index

Indexing a large number of texts can be time-consuming and costly since it requires making API calls to the embedding model. In real-world applications, it is better to store the index in a vector database to avoid reindexing. However, for simplicity, we will store the index locally in a directory in this tutorial, without using a vector database.

index.storage_context.persist('index/')
# rebuild storage context
storage_context = StorageContext.from_defaults(persist_dir='index/')

#load index
index = load_index_from_storage(storage_context, embed_model=embed_model)

We Are Ready to Build a RAG Query Engine for Our Agent

It is a good practice to provide a meaningful name and a clear description for each tool. This helps the agent select the most appropriate tool when needed.

query_engine = index.as_query_engine(llm=llm, similarity_top_k=5)

rag_tool = QueryEngineTool.from_defaults(
    query_engine,
    name="research_paper_query_engine_tool",
    description="A RAG engine with recent research papers.",
)
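You can also query the engine directly before handing it to the agent. The question below is only an illustrative example.

# Sanity-check the RAG engine on its own before the agent uses it
response = query_engine.query("Which recent papers discuss evaluation of large language models?")
print(response)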

Let's Take a Look at the Prompts the RAG Tool Uses to Answer a Query Based on Context

Note that there are two prompts. By default, LlamaIndex uses a refine prompt before returning an answer. You can find more information about the response modes here.

from llama_index.core import PromptTemplate
from IPython.display import Markdown, display
# define prompt viewing function
def display_prompt_dict(prompts_dict):
    for k, p in prompts_dict.items():
        text_md = f"**Prompt Key**: {k}<br>" f"**Text:** <br>"
        display(Markdown(text_md))
        print(p.get_template())
        display(Markdown("<br><br>"))
        
prompts_dict = query_engine.get_prompts()
display_prompt_dict(prompts_dict)
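The refine prompt shown above comes from LlamaIndex's default response mode. If you want to experiment with a different synthesis strategy, you can pass response_mode when creating the query engine; a minimal sketch, where tree_summarize is just one of the built-in modes:

# Example: rebuild the query engine with a different response synthesis mode
summary_engine = index.as_query_engine(llm=llm, similarity_top_k=5, response_mode="tree_summarize")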

Building the other two tools is straightforward because they are simply Python functions.

def download_pdf(pdf_url, output_file):
    """
    Downloads a PDF file from the given URL and saves it to the specified file.

    Args:
        pdf_url (str): The URL of the PDF file to download.
        output_file (str): The path and name of the file to save the PDF to.

    Returns:
        str: A message indicating success or the nature of an error.
    """
    try:
        # Send a GET request to the PDF URL
        response = requests.get(pdf_url)
        response.raise_for_status()  # Raise an error for HTTP issues

        # Write the content of the PDF to the output file
        with open(output_file, "wb") as file:
            file.write(response.content)

        return f"PDF downloaded successfully and saved as '{output_file}'."

    except requests.exceptions.RequestException as e:
        return f"An error occurred: {e}"
download_pdf_tool = FunctionTool.from_defaults(
    download_pdf,
    name='download_pdf_file_tool',
    description='Python function that downloads a PDF file from a given URL and saves it locally.'
)
fetch_arxiv_tool = FunctionTool.from_defaults(
    fetch_arxiv_papers,
    name='fetch_from_arxiv',
    description='Fetches the {papers_count} most recent arXiv papers on the topic {title}.'
)
# build a ReAct agent with the three tools
agent = ReActAgent.from_tools([download_pdf_tool, rag_tool, fetch_arxiv_tool], llm=llm, verbose=True)

Let's Chat with Our Agent

We built a ReAct agent, which operates in two main stages:

  1. Reasoning: Upon receiving a query, the agent evaluates whether it has enough information to answer directly or if it needs to use a tool.
  2. Acting: If the agent decides to use a tool, it executes the tool and then returns to the Reasoning stage to determine whether it can now answer the query or if further tool usage is necessary. (You can inspect the prompt that implements this loop, as shown below.)
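Before chatting, you can peek at the prompts that drive this loop. agent.get_prompts() returns them keyed by name; this is just an optional inspection step.

# List the prompts behind the ReAct loop; print any of them with .get_template()
for prompt_key in agent.get_prompts():
    print(prompt_key)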
# create a prompt template to chat with an agent
q_template = (
    "I am interested in {topic}. \n"
    "Find papers in your knowledge database related to this topic; use the following template to query the research_paper_query_engine_tool tool: "
    "'Provide title, summary, authors and link to download for papers related to {topic}'. "
    "If there are none, could you fetch recent ones from arXiv? \n"
)
answer = agent.chat(q_template.format(topic="Audio Models"))
Markdown(answer.response)

The agent chose to use the RAG tool, found the relevant papers, and summarized them for us.

Since the agent retains the chat history, we can request to download the papers without mentioning them explicitly.

answer = agent.chat("Download the papers you mentioned above")
Markdown(answer.response)
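Because the agent keeps the conversation in memory, you can inspect what it has stored so far. This is an optional check; chat_history is exposed on the agent object in this version of LlamaIndex.

# Inspect the messages the agent has accumulated so far
for message in agent.chat_history:
    print(message.role, "-", str(message.content)[:120])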

Let's see what happens if we ask about a topic that is not available in the RAG.

answer = agent.chat(q_template.format(topic="Gaussian process"))
Markdown(answer.response)

As You Can See, the Agent Did Not Find the Papers in Storage and Fetched Them from ArXiv.

For a more detailed view of the agent's execution, you can trace it with Phoenix, as covered in the optional section below.

(Optional) Let's Trace and Evaluate the Agent

LlamaIndex has a built-in observability layer powered by Arize Phoenix. We can use this to trace the agent's execution and evaluate its performance.

If you don't have a Phoenix API key, you can get one at https://app.phoenix.arize.com.

from phoenix.otel import register
from openinference.instrumentation.llama_index import LlamaIndexInstrumentor
import os

PHOENIX_API_KEY = getpass("Type your Phoenix API Key")
os.environ["PHOENIX_CLIENT_HEADERS"] = f"api_key={PHOENIX_API_KEY}"
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"

tracer_provider = register(project_name="arxiv-agentic-rag")
LlamaIndexInstrumentor().instrument(tracer_provider=tracer_provider)

Now any calls we make to LlamaIndex will be traced and logged to your Phoenix instance.

Because we've only just turned on tracing, we'll need to run the agent again to see the trace data. Typically, you would enable tracing earlier in the notebook to capture all of the agent's execution.

answer = agent.chat(q_template.format(topic="Audio Models"))
answer = agent.chat("Download the papers, which you mentioned above")
answer = agent.chat(q_template.format(topic="Gaussian process"))

Now if you go to your Phoenix instance, you should see the trace data for the agent's execution.

Evaluate the agent's performance

While it's easy to manually spot check the first few iterations of your agent's execution, it's not practical to do this for every iteration.

Let's add a more scalable way to evaluate the agent's performance.

There are countless ways to evaluate an agent's performance. Let's look at two common ones:

  1. Evaluating the agent's RAG skill
  2. Evaluating the agent's function calling accuracy

We'll use an LLM as a Judge for both of these evaluations, with Mistral as our Judge.

from phoenix.session.evaluation import get_retrieved_documents, get_qa_with_reference
from phoenix.trace import SpanEvaluations, DocumentEvaluations
import phoenix as px
from phoenix.evals import (
    MistralAIModel,
    RelevanceEvaluator,
    HallucinationEvaluator,
    QAEvaluator,
    run_evals,
)

import nest_asyncio
nest_asyncio.apply()

eval_model = MistralAIModel(api_key=api_key)

Evaluate the agent's RAG skill

# First retrieve documents from Phoenix
retrieved_documents_df = get_retrieved_documents(px.Client(), project_name="arxiv-agentic-rag")
retrieved_documents_df.head()

# Use Phoenix's RelevanceEvaluator to evaluate the relevance of the retrieved documents
relevance_evaluator = RelevanceEvaluator(eval_model)

retrieved_documents_relevance_df = run_evals(
    evaluators=[relevance_evaluator],
    dataframe=retrieved_documents_df,
    provide_explanation=True,
    concurrency=5,
)[0]
# Retrieve Question and Answer pairs with reference answers
qa_with_reference_df = get_qa_with_reference(px.Client(), project_name="arxiv-agentic-rag")

# Evaluate the correctness of the Q&A pairs
qa_evaluator = QAEvaluator(eval_model)

# Evaluate the hallucination of the Q&A pairs
hallucination_evaluator = HallucinationEvaluator(eval_model)

# Run evaluations for Q&A correctness and hallucination
qa_correctness_eval_df, hallucination_eval_df = run_evals(
    evaluators=[qa_evaluator, hallucination_evaluator],
    dataframe=qa_with_reference_df,
    provide_explanation=True,
    concurrency=5,
)
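Before logging the results, it can be useful to glance at the label distributions locally; the evaluator output dataframes include a label column (along with explanations).

# Quick local summary of the evaluation labels before logging to Phoenix
print(retrieved_documents_relevance_df["label"].value_counts())
print(qa_correctness_eval_df["label"].value_counts())
print(hallucination_eval_df["label"].value_counts())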

With these three metrics calculated on our RAG skill, we can log them to Phoenix to view them alongside the trace data.

px.Client().log_evaluations(
    SpanEvaluations(dataframe=qa_correctness_eval_df, eval_name="Q&A Correctness"),
    SpanEvaluations(dataframe=hallucination_eval_df, eval_name="Hallucination"),
    DocumentEvaluations(dataframe=retrieved_documents_relevance_df, eval_name="relevance"),
)

Evaluate the agent's function calling accuracy

Now let's evaluate the agent's function calling accuracy, i.e. how often the agent uses the correct tool to answer a query.

from phoenix.trace.dsl import SpanQuery
from phoenix.evals import (
    llm_classify,
    TOOL_CALLING_PROMPT_RAILS_MAP,
    TOOL_CALLING_PROMPT_TEMPLATE,
)

Same as before, we'll start by retrieving the relevant trace data. In the previous section, we were able to use helper methods in the Phoenix SDK to retrieve the trace data. Here, we'll use the more general SpanQuery DSL to retrieve the trace data based on the filters we set.

query = (
    SpanQuery()
    .where(
        # Filter for the `LLM` span kind.
        # The filter condition is a string containing a valid Python boolean expression.
        "span_kind == 'LLM'",
    )
    .select(
        # Extract and rename the following span attributes
        question="llm.input_messages",
        tool_call="llm.function_call",
    )
)
trace_df = px.Client().query_spans(query, project_name="arxiv-agentic-rag")
trace_df["tool_call"] = trace_df["tool_call"].fillna("No tool used")
trace_df["question"] = trace_df["question"].fillna("No question")

We also need to pass in the tool definitions to the evaluator so it knows the possible tools available to the agent.

tool_definitions = ""

for current_tool in [download_pdf_tool, rag_tool, fetch_arxiv_tool]:
    tool_definitions += f"""
    {current_tool.metadata.name}: {current_tool.metadata.description}
    """

tool_definitions = tool_definitions.replace("{", "").replace("}", "")
trace_df["tool_definitions"] = tool_definitions
print(tool_definitions)

Now we're ready to run the evaluations. We'll use the llm_classify method to classify the tool calls as correct or incorrect.

rails = list(TOOL_CALLING_PROMPT_RAILS_MAP.values())

# llm_classify fills the {tool_definitions} variable in TOOL_CALLING_PROMPT_TEMPLATE
# from the tool_definitions column we added to the dataframe above.

function_calling_evals = llm_classify(
    dataframe=trace_df,
    template=TOOL_CALLING_PROMPT_TEMPLATE,
    model=eval_model,
    rails=rails,
    concurrency=5,
    provide_explanation=True,
)
function_calling_evals["score"] = function_calling_evals.apply(
    lambda x: 1 if x["label"] == "correct" else 0, axis=1
)
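With the binary score in place, a single number summarizes how often the agent picked the right tool.

# Overall tool-calling accuracy across the evaluated spans
print(f"Function calling accuracy: {function_calling_evals['score'].mean():.2f}")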

And finally, we can log the evaluations to Phoenix to view them alongside the trace data.

px.Client().log_evaluations(
    SpanEvaluations(dataframe=function_calling_evals, eval_name="Function Calling Accuracy"),
)

Congratulations! You've now built an LLM agent with LlamaIndex and evaluated its performance using Phoenix.