UNIVERSITAT DE BARCELONA
FUNDAMENTAL PRINCIPLES OF DATA SCIENCE MASTER’S THESIS

Evaluating Tool-Augmented ReAct Language Agents

Author: Jokin Eguzkitza
Supervisors: Laura Igual, Pablo Álvarez

A thesis submitted in partial fulfillment of the requirements for the degree of MSc in Fundamental Principles of Data Science in the Facultat de Matemàtiques i Informàtica

June 30, 2025

UNIVERSITAT DE BARCELONA

Abstract

Facultat de Matemàtiques i Informàtica
MSc in Fundamental Principles of Data Science

Evaluating Tool-Augmented ReAct Language Agents
by Jokin Eguzkitza

This thesis studies how to evaluate ReAct agents that use external tools. ReAct agents are AI agents that combine reasoning and tool use (functions), allowing large language models to perform tasks that require accessing external sources of information. These agents are becoming more common in real applications, but evaluating their behaviour remains a challenge. Using LangGraph and LangChain, three different AI agents are created using locally deployed LLMs served with Ollama. These agents use open-source tools such as Wikipedia, Wikidata, Yahoo Finance and PDF readers. To evaluate them, the project combines rule-based checks with RAGAS metrics to measure tool use, answer quality, factual correctness and context use. The results show that prompt design is crucial for guiding the agent’s behaviour, and that typical question-answering metrics are not always enough to measure how well an agent works. This work offers a simple and practical way to test LLM agents. All the corresponding code and notebooks can be found in the following repository: https://github.com/Jokinn9/Evaluating-Tool-Augmented-ReAct-Language-Agents

Acknowledgements

I would like to thank my academic advisor, Laura Igual, for her support, feedback and guidance throughout the development of this thesis. I also want to thank Pablo Álvarez, my company supervisor, for his practical advice and help during the project.
I also would like to thank my classmates and friends for their encouragement and useful discussions during this year, as well as my family for their constant support and patience.

Contents

Abstract ii
Acknowledgements iii
1 Introduction 4
1.1 Introduction 4
1.2 The Rise of the LLM Agent Ecosystem 4
1.3 The Challenge of Evaluating LLM Agents 5
1.4 Objectives of the thesis 5
2 Background 7
2.1 What Are Agents? 7
2.2 From Agents to Agentic Design Patterns 7
2.2.1 Tool Use Pattern 8
2.2.2 Reflection Pattern 8
2.2.3 ReAct Pattern 8
2.2.4 Planning Pattern 9
2.2.5 Multi-Agent Pattern 9
2.3 Evaluation Framework and Metrics 10
2.3.1 Custom Evaluation Metrics 10
2.3.2 Evaluation with RAGAS 11
3 Experimental Design and Results 14
3.1 Common Agent Architecture 14
3.1.1 Choice of Language Model 14
3.1.2 Agent Construction with LangGraph 14
3.1.3 Shared Agent Loop and Tool Integration 15
3.1.4 Message Format Adaptation 15
3.1.5 Evaluation Setup 16
3.2 Agent 1: Baseline ReAct Agent 16
3.2.1 Available Tools 16
3.2.2 Example Behaviour 16
3.2.3 Implementation Notes 17
3.2.4 First Evaluation Results and Observations 17
3.2.5 Limitations and Targeted Improvements 18
3.3 Agent 2: ReAct Agent with Wikipedia and Wikidata Tools 20
3.3.1 Available Tools 20
3.3.2 Implementation Notes 20
3.3.3 First Evaluation Results and Observations 20
3.3.4 Limitations and Targeted Improvements 23
3.4 Agent 3: Metal-Focused Agent with PDF, Finance and Wikipedia Tools 26
3.4.1 Available Tools 26
3.4.2 Implementation Notes 26
3.4.3 First Evaluation Results and Observations 26
3.4.4 Limitations and Targeted Improvements 28
4 Conclusions and future work 30
Bibliography 33

Chapter 1
Introduction

1.1 Introduction

In recent years, the development of large language models (LLMs) has significantly transformed the field of artificial intelligence. Trained on massive text datasets and later fine-tuned for different tasks, models like OpenAI’s GPT-4 (OpenAI, 2023), Meta’s LLaMA (Touvron et al., 2023) and Mistral (Jiang et al., 2023) have shown strong performance in natural language understanding, reasoning and generation. These models have rapidly moved from being research tools to becoming core elements in both industry and academia, supporting applications such as conversational agents, educational tutors, legal reasoning systems and tools for automated software development.
Building on these capabilities, a new generation of intelligent systems has appeared: AI agents, systems that embed LLMs within agentic frameworks. These agents expand the functionality of LLMs beyond isolated prompt completion by enabling structured reasoning, dynamic decision-making and interaction with external tools such as APIs, databases or code execution environments. This marks a shift from passive language models to active agents that can follow goals, complete tasks and adapt to new information.

These agents are powerful because they are flexible and can work on their own. They combine language generation with capabilities such as memory, planning, tool use and even collaboration with other agents. Because of this, they are now used in more complex tasks such as research, customer support or data analysis. What used to be just an idea, an AI that can think, act and adapt, is now a real solution used in many areas (Amazon Web Services, 2024).

1.2 The Rise of the LLM Agent Ecosystem

This rapid progress has been made possible not only by the models themselves but also by the growing ecosystem of tools and frameworks that make their orchestration easier. Libraries like LangChain, LangGraph, AutoGen and CrewAI have introduced modular structures for creating agents that combine language generation with tool use, memory and control flows. These frameworks provide ways to build agents as sequences of reasoning steps, decision points and external actions, effectively turning LLMs into systems that can be programmed to reason and act.

At the same time, the open-source movement has played a crucial role in making these technologies more accessible. Initiatives such as Hugging Face Transformers, the release of open-source LLMs like LLaMA and Mistral, and tools that allow these models to be deployed locally, such as Ollama or LM Studio, have made it possible to build and experiment with LLM agents without depending on commercial APIs or cloud-based black boxes.

As a result, there has been a rapid increase in agentic applications, from simple personal assistants to fully autonomous multi-agent systems. These agents are no longer just experimental prototypes; they are now being integrated into real production environments, business processes and decision-support systems across different sectors. The ability to design and deploy these systems using only open-source components makes this trend especially relevant from both a technological and social point of view (Wu et al., 2023).

1.3 The Challenge of Evaluating LLM Agents

The increasing complexity and autonomy of LLM-based agents has also introduced new challenges, particularly in how these systems are evaluated. Traditional NLP evaluation metrics, such as exact match, are not sufficient to capture the behaviour of agents operating in dynamic environments. Unlike static models that produce a single response to a prompt, agents often follow multi-step reasoning chains, call external tools, adapt to user feedback and make decisions based on internal state or memory. Evaluating these systems requires a more detailed approach, one that considers both the final result and the reasoning process that leads to it.

Moreover, agent performance cannot be reduced to a single metric. A proper evaluation should take into account multiple aspects of behaviour and reasoning. These questions are essential not only for benchmarking but also for real-world deployment. Poor reasoning or incorrect tool use can lead to issues such as misinformation, system failures or loss of trust. As agents are increasingly adopted in areas such as healthcare and finance, robust and transparent evaluation methods become necessary (Wu et al., 2025).
Some of the key questions to consider include the following:

• Is the answer correct or helpful?
• Were the appropriate tools selected and used?
• Is the reasoning trace logically sound and interpretable?
• How robust is the agent to input variation or ambiguous queries?
• Are there signs of hallucination or inconsistent behaviour?

1.4 Objectives of the thesis

This thesis emerges from the need to demonstrate how agents can be evaluated. The focus is not only on assessing whether they produce correct outputs, but also on examining how they reach those outputs, how they interact with tools and how their behaviour changes across different tasks. By doing so, the goal is to provide a structured and reproducible framework for evaluating agentic systems that rely on open-source components and can be applied in practical scenarios.

The challenge is that while language models can be evaluated using embeddings and well-established metrics, this is not as straightforward with agents. Even though some evaluation is possible, current tools are more limited. Since this is still a developing area, there is not much work yet on how to properly assess whether an agent is working as expected (Wu et al., 2023). For this reason, we will use the RAGAS library (RAGAS, 2024), as it offers different metrics specifically designed for evaluating this type of system. The idea is to build the agent, send it a set of queries and evaluate the responses. Based on the results, we will analyse where the agent fails and use that information to improve its performance.

The data used to evaluate the different agents is synthetic, meaning it has been generated internally. Since each agent is designed to handle a specific type of query, the evaluation set is created to simulate realistic usage. These queries are not taken from existing datasets but are instead crafted to reflect the tasks the agents are expected to perform.
This allows us to test their behaviour in a controlled but relevant way.

Chapter 2
Background

2.1 What Are Agents?

Large Language Models (LLMs) such as GPT, LLaMA or Mistral have significantly advanced the field of natural language processing. These models are impressive at tasks like summarization, translation, question answering and text generation. However, LLMs are fundamentally passive systems: they take an input prompt, produce a probabilistic response and terminate the interaction. They lack memory, persistence and the ability to act autonomously across multiple steps or over time.

These limitations have pushed the development of new approaches in AI. While LLMs are strong at generating text, agents build on top of them by adding memory, tool use and the ability to interact with the environment (Fauscette, 2024).

Agents operate within defined boundaries and are capable of adapting to different inputs or goals. They use tools like APIs or databases to perform tasks that go beyond simple text generation, ranging from automation to more complex problem-solving.

They are especially useful in open-ended and dynamic settings where instructions are given in natural language, such as virtual assistants or embedded helpers. With little to no human supervision, these systems can manage workflows, trigger actions and combine multiple sources of information to complete a task (Google Cloud, 2025).

2.2 From Agents to Agentic Design Patterns

As LLM-based agents have become more common, the way they are built and used has started to vary. Although they all aim to enable autonomous behaviour through language models, the way they reason, act and use tools can differ quite a bit between systems. This variation comes from underlying design choices, such as how the agent’s control flow is organised, how much reasoning is passed to the model, how memory is managed and how actions are selected and carried out.
These recurring structures are often described as agentic design patterns. Understanding these patterns is essential not only for building agents, but also for evaluating them. Each pattern brings distinct strengths and weaknesses in terms of interpretability, generalization and performance. Therefore, the choice of agentic design is not merely technical but depends on the application it is intended for. Five of the most popular design patterns used when building AI agents are explained below.

2.2.1 Tool Use Pattern

The tool use pattern extends the capabilities of a language model by allowing it to interact with external tools (Figure 2.1). These tools can include things like querying a vector database, running Python code or calling APIs. This gives the agent access to up-to-date or specialised information, so it is not limited to the knowledge stored in the model itself. Instead of trying to generate the full answer on its own, the LLM decides whether it needs to use a tool and how to structure the tool call. It is especially useful in scenarios where real-time data or external operations are needed (Microsoft, 2024).

FIGURE 2.1: Architecture of a Tool Use Pattern.

2.2.2 Reflection Pattern

The reflection pattern is based on the idea that the agent can review its own output before giving a final answer (Figure 2.2). After generating an initial response, the model goes back, checks for possible mistakes or inconsistencies and tries to improve the result through one or more iterations. This process allows the agent to refine its reasoning and fix basic mistakes that may come up on the first try. It is used for tasks that involve several steps or need a certain level of precision. Rather than just generating an answer in one go, the model revisits what it did and adjusts the output (Vidhya, 2024).

FIGURE 2.2: Architecture of a Reflection Pattern.
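The tool use pattern can be sketched in a few lines of plain Python. The following is an illustrative, framework-free sketch, not code from this thesis: the language model is replaced by a stub (fake_llm) that emits a structured JSON tool call, and the tool names and prices are invented for the example.

```python
import json

def get_metal_price(metal_name: str) -> float:
    """Toy tool: returns a hard-coded price in USD per gram (invented values)."""
    prices = {"gold": 88.0, "silver": 1.1}
    return prices[metal_name.lower()]

# Registry mapping tool names to callables, as exposed to the "model".
TOOLS = {"get_metal_price": get_metal_price}

def fake_llm(query: str) -> str:
    """Stand-in for the LLM: decides to call a tool and emits a JSON tool call."""
    metal = query.split()[-1].strip("?").lower()
    return json.dumps({"tool": "get_metal_price", "args": {"metal_name": metal}})

def run_tool_agent(query: str) -> str:
    """One iteration of the tool use loop: model decides, dispatcher executes."""
    decision = json.loads(fake_llm(query))
    result = TOOLS[decision["tool"]](**decision["args"])
    return f"The price is {result} USD per gram."

print(run_tool_agent("What is the price of gold?"))
# -> The price is 88.0 USD per gram.
```

In a real system the stub is replaced by an LLM call, and the loop repeats until the model decides no further tool is needed; frameworks such as LangChain automate exactly this dispatch step.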
2.2.3 ReAct Pattern

The ReAct (Reason and Act) pattern combines the two previous ideas: the agent is able to reflect on its own reasoning and it can also interact with external tools (Figure 2.3). This means it can generate thoughts, decide on an action, execute it and then use the result to continue reasoning.

By mixing reflection and tool use, the agent becomes much more flexible and effective. It can break down a task into steps, access external information when needed and revise its own thinking. This approach was first introduced in the ReAct framework (Yao et al., 2023), which showed how combining reasoning and tool use leads to better performance in complex tasks.

FIGURE 2.3: Architecture of a ReAct Pattern.

2.2.4 Planning Pattern

The planning pattern is about breaking a task into smaller parts before trying to solve it (Figure 2.4). Instead of generating an answer all at once, the agent starts by creating a plan that includes sub-goals or intermediate steps. This can involve subdividing the task, outlining what needs to be done or listing objectives that help guide the reasoning process (DeepLearning.AI, 2024a).

FIGURE 2.4: Architecture of a Planning Pattern.

2.2.5 Multi-Agent Pattern

The multi-agent pattern is based on using several agents that work together to complete a task (Figure 2.5). Each agent has a specific role and is responsible for a particular part of the process. Just like in other patterns, each agent can access tools when needed. What makes this setup different is the collaboration between agents. They can communicate, delegate tasks to one another and combine their outputs to produce the final result (DeepLearning.AI, 2024b).

FIGURE 2.5: Architecture of a Multi-Agent Pattern.

2.3 Evaluation Framework and Metrics

Evaluating LLM-based agents requires more than assessing the correctness of their final output.
As explained in Section 2.2, these systems perform multi-step reasoning and make tool-related decisions, all of which must be analysed to understand their real capabilities and limitations. To address this goal, different evaluation metrics will be used. Some are more traditional, while others are based on the RAGAS framework.

2.3.1 Custom Evaluation Metrics

To provide an interpretable and reproducible baseline, several manual and rule-based metrics are applied. These metrics are intuitive, offering insight into tool usage correctness, output structure and task efficiency.

• Tool Call Accuracy (Exact Match): this metric compares the actual tool calls made by the agent with a predefined list of expected tool invocations. A match is counted when both the tool name and its arguments match. Accuracy is computed as:

Tool Accuracy = (number of correct tool calls) / (number of expected tool calls)

• Substring Match: for tasks where specific keywords or formats are required (e.g., "EUR", "converted price"), a simple heuristic checks whether the expected substring appears in the final output. The metric returns a boolean value: True if the substring is found, and False otherwise.

• Numeric Value Comparison: when the output is expected to contain a specific numerical value, the system extracts all numbers from the response and compares them to a target value, allowing for a small floating-point tolerance. The result is a boolean value: True if the expected number (within tolerance) is found, and False otherwise.

• Number of reasoning steps: in addition to correctness, the agent trace is analysed by counting the number of reasoning steps involved in producing the final answer.
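The rule-based checks above are simple enough to implement directly. The following sketch shows one possible implementation of the three correctness metrics; the function names and the tuple representation of tool calls are our own choices for illustration, not the thesis code.

```python
import re

def tool_accuracy(actual_calls, expected_calls):
    """Exact-match tool accuracy: correct calls / expected calls.
    Calls are represented here as (tool_name, args) tuples."""
    correct = sum(1 for call in expected_calls if call in actual_calls)
    return correct / len(expected_calls)

def substring_match(output: str, expected: str) -> bool:
    """True if the expected keyword (case-insensitive) appears in the output."""
    return expected.lower() in output.lower()

def numeric_match(output: str, expected: float, tol: float = 0.01) -> bool:
    """Extract all numbers from the output and check whether any of them
    is within `tol` of the expected value."""
    numbers = [float(n) for n in re.findall(r"-?\d+(?:\.\d+)?", output)]
    return any(abs(n - expected) < tol for n in numbers)
```

For example, an agent that only made the first of two expected tool calls would score tool_accuracy = 0.5, matching the behaviour discussed for the baseline agent later in Chapter 3.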
2.3.2 Evaluation with RAGAS

To complement the rule-based approach, the thesis integrates RAGAS (Retrieval Augmented Generation Assessment), a framework designed for evaluating LLM-based systems using language model scoring and semantic understanding.

• ToolCallAccuracy (Multi-Turn): this metric accounts for tool use correctness over time, even when names or arguments are not literal string matches (RAGAS Metrics for Agents: Tool Call Accuracy 2024). It returns True if the correct tool was used with appropriate arguments at any point in the interaction, and False otherwise.

Score = (number of correct tool uses) / (total number of queries evaluated)

RAGAS also provides more classical QA-style metrics:

• Response Relevancy: measures how relevant a response is to the user input. Higher scores indicate better alignment with the user input, while lower scores are given if the response is incomplete or includes redundant information (RAGAS: Answer Relevance Metric 2024). This metric is calculated using the user_input and the response as follows:

1. Generate a set of artificial questions (default N = 3) based on the response, designed to reflect its content.
2. Compute the cosine similarity between the embedding of the user input (E_0) and the embedding of each generated question (E_gi).
3. Take the average of these cosine similarity scores:

Response Relevancy = (1/N) Σ_{i=1}^{N} cosine_similarity(E_gi, E_0) = (1/N) Σ_{i=1}^{N} (E_gi · E_0) / (||E_gi|| ||E_0||)

Where:
– E_gi: embedding of the i-th generated question.
– E_0: embedding of the user input.
– N: number of generated questions (default is 3).

An example can help to understand this:
– User question: Who is the president of Finland?
– Agent response: The current president of Finland is Alexander Stubb.
– Generated questions based on the response:
* Who is the current president of Finland?
* What is the name of Finland’s president?
* Who holds the presidency in Finland?

These generated questions are compared with the original user question using cosine similarity between their embeddings. Since they are semantically very close, the response receives a high Response Relevancy score.

• Faithfulness: measures how factually consistent a response is with the retrieved context. It ranges from 0 to 1, with higher scores indicating better consistency (RAGAS: Faithfulness Metric 2024). A response is considered faithful if all its claims can be supported by the retrieved context. To calculate this:

1. Identify all the claims in the response.
2. Check each claim to see if it can be inferred from the retrieved context.
3. Compute the faithfulness score using the formula:

Score = (number of claims in the response supported by the retrieved context) / (total number of claims in the response)

Example:
– User question: Who is the president of France?
– Retrieved context: Emmanuel Macron
– Agent response: Emmanuel Macron became president of France in 2010 and was born in Marseille.

Claims in the response:
1. Emmanuel Macron is president of France → supported
2. He became president in 2010 → not supported
3. He was born in Marseille → not supported

Faithfulness score: Faithfulness = 1/3 ≈ 0.33

• Context Precision: measures the proportion of relevant chunks in the retrieved contexts. It is calculated as the mean of Precision@k over the chunks in the context, where Precision@k is the ratio of relevant chunks in the top k to the total number of chunks in the top k (RAGAS: Context Precision Metric 2024). In our setup, only one chunk is retrieved per query (K = 1), so the metric simplifies to a binary decision based on whether that single chunk is relevant or not.
Context Precision@K = (Σ_{k=1}^{K} Precision@k × v_k) / (total number of relevant items in the top K results)

Precision@k = true positives@k / (true positives@k + false positives@k)

where K is the total number of chunks in retrieved_contexts and v_k ∈ {0, 1} is the relevance indicator at rank k.

Example 1 – Relevant context (score = 1.0)
• User question: Who is the president of France?
• Retrieved context: Emmanuel Macron
• LLM evaluation: the context is relevant (v_1 = 1)

Context Precision@1 = (1.0 × 1) / 1 = 1.0

Example 2 – Irrelevant context (score = 0.0)
• User question: What is the capital of Spain?
• Retrieved context: Barcelona is famous for its modernist architecture.
• LLM evaluation: the fragment does not answer the question, so it is marked as not relevant (v_1 = 0)

Context Precision@1 = (1 × 0) / 1 = 0.0

These metrics are computed using a locally deployed LLM and open-source embeddings.

Chapter 3
Experimental Design and Results

3.1 Common Agent Architecture

In this project, all evaluated agents share a common architectural structure. While their behaviour is defined by different design patterns and tool configurations, the underlying agent loop, language model and framework are the same.

3.1.1 Choice of Language Model

A core component of any LLM-based agent is the language model itself, which serves as the reasoning engine behind decisions, tool selection and natural language generation.

While commercial models such as ChatGPT or Claude are among the most capable and widespread LLMs, they require access to paid APIs. To address this problem, this project uses a locally hosted open-source model, LLaMA 3.2 [1], developed by Meta AI. In order to run this LLM, the Ollama [2] framework was used, which allows different language models to be downloaded and executed locally on personal hardware.

At first, the idea was to work with Mistral because of its smaller size.
However, after running several experiments, Mistral was replaced by LLaMA 3.2 due to Mistral’s instability and inconsistent results.

3.1.2 Agent Construction with LangGraph

To manage the reasoning and decision-making of the agent, this project uses LangGraph [3], a framework for building LLM agents as stateful graphs. In this setup, the agent is represented as a directed graph where each node performs a specific step and the edges define how the process moves from one step to the next. The architecture used in this project consists of the following nodes:

• Start Node: initializes the interaction and passes the user query into the agent flow.
• Agent Node: invokes the LLM to perform reasoning and decide whether a tool should be used.
• Tools Node: executes external tool calls if requested, such as querying prices or performing conversions.
• End Node: terminates the execution either after producing a final answer or reaching a stop condition.

[1] https://ollama.com/library/llama3.2
[2] https://ollama.com/
[3] https://www.langchain.com/langgraph

This structure is wrapped in a named subgraph called assistant, which allows LangGraph to control the flow while keeping a clear separation between the reasoning part and the actions taken by the agent.

FIGURE 3.1: LangGraph structure of the agent loop.

As shown in Figure 3.1, the flow begins with the start node and proceeds to the agent reasoning node. Based on the LLM’s output, the system may either move to a tool invocation phase or end the interaction. After each tool call, the agent returns to the reasoning node with the new observation, which enables multi-step reasoning and chaining of actions.

3.1.3 Shared Agent Loop and Tool Integration

All the agents created in this project follow the structure shown in Figure 3.1. However, there are two main differences between the agents:

1. Each agent will use different tools, such as currency conversion, internet access or metal pricing.
2.
The reasoning they follow will be different. In other words, the prompt used to build each agent is different, so that it adapts to the situation.

3.1.4 Message Format Adaptation

Although LLaMA 3.2 behaves similarly to OpenAI models in terms of reasoning and tool selection, it differs in how tool calls are structured in the output. LangGraph expects tool invocations to be returned in a dedicated tool_calls field, but LLaMA often embeds them directly in the text, which requires additional parsing to extract them properly.

The same issue appears in the evaluation phase: RAGAS is designed to work with the chat-format message structure used by models like ChatGPT. For this reason, the messages must be converted into the expected format before running the evaluation.

To achieve this transformation, a post-processing function was implemented to reformat the model’s output. It extracts tool names and arguments and injects them into the expected schema without altering the model’s reasoning.

3.1.5 Evaluation Setup

In order to compute the evaluation metrics, the agent’s responses need to be compared against the provided context and the original query. This process requires both a language model to interpret the responses and a set of embeddings to measure semantic similarity between texts. For this purpose, the evaluation pipeline uses the open-source model Mistral [4] as the LLM and the all-MiniLM-L6-v2 [5] model from Sentence Transformers to generate embeddings. The embeddings are loaded via the HuggingFaceEmbeddings interface from LangChain [6].

Although the agent itself runs on LLaMA 3.2, a different model is used for evaluation. This separation is justified by the fact that the evaluation task has different requirements: rather than generating fluent responses, it focuses on comparing meaning and factual consistency.
In this context, using a lighter and faster model like Mistral allows efficient scoring without significantly affecting the results.

3.2 Agent 1: Baseline ReAct Agent

This first agent serves as the baseline configuration for evaluating reasoning and tool use. It implements the ReAct design pattern, interleaving natural language reasoning with tool invocation in a loop guided by LangGraph.

3.2.1 Available Tools

The agent is equipped with two tools that simulate access to external data:

• get_metal_price(metal_name): returns the current price (in USD per gram) of a specified metal.
• get_currency_exchange(base, target): returns a fake exchange rate between two currencies.

3.2.2 Example Behaviour

When asked, for example, “What is the price of gold in EUR?”, the agent should follow this reasoning process:

1. Call get_metal_price("gold") to retrieve the price in USD.
2. Call get_currency_exchange("USD", "EUR") to retrieve the exchange rate.
3. Combine both results to compute and return the final price in EUR.

[4] https://mistral.ai
[5] https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
[6] https://python.langchain.com/docs/integrations/text_embedding/huggingface

3.2.3 Implementation Notes

Initially, no explicit prompt was used, as the integration with LangChain [7] and LangGraph [8] exposes tool descriptions to the LLM, allowing it to decide when to use them. However, this approach proved unreliable: the agent often failed to invoke the tools correctly or hallucinated responses without accessing them. To address this issue, a custom prompt was introduced to guide the agent on how and when to use the available tools. After this change, the agent consistently used the tools as intended, and its performance improved significantly. The prompt used was:

You are a ReAct agent. Read carefully the dataset.
For example: "What is the price of METAL in CURRENCY?"
1. Always call get_metal_price with the name of the metal.
2.
If the currency is not USD, call get_currency_exchange with base="USD" and target=CURRENCY.
3. Use both results to compute and explain the final price.
4. If the currency is already USD, you don't need to call get_currency_exchange.

3.2.4 First Evaluation Results and Observations

To validate the behaviour of the baseline ReAct agent, we conducted a series of evaluations focused on its ability to correctly use tools, return accurate values and maintain efficiency across a diverse set of queries. The metrics considered include tool usage accuracy, output validation and number of reasoning steps.

The following figure compares the symbolic Tool Accuracy (exact match of calls) with the semantic RAGAS ToolCallAccuracy, which evaluates whether tools were used correctly.

FIGURE 3.2: Tool Accuracy vs. RAGAS Tool Accuracy

As seen in Figure 3.2, while the agent performs well in most cases, a small subset of queries led to mismatches between expected and detected tool calls, which negatively affected the RAGAS score. A Tool Accuracy of 0.5 means the agent invoked only one of the two expected tools. In these cases, the agent either skipped the second tool call or replaced it with an incorrect one. This leads RAGAS to assign a score of 0, as the tool usage no longer aligns with the intended task structure.

[7] https://www.langchain.com/
[8] https://www.langchain.com/langgraph

The next plot reports whether the final output contained the expected keywords, and whether the extracted numerical values matched the expected value within a tolerance of 0.01. As introduced in Section 2.3.1, the metric Output Contains Expected measures whether the final response includes the name of the queried metal and the target currency, ensuring that the agent has correctly interpreted and answered the question.
On the other hand, Value Correctness checks whether the numerical result falls within an acceptable margin of the expected value, allowing for slight deviations due to rounding. Together, these two criteria provide a practical way to evaluate the accuracy and informativeness of the agent’s final output.

FIGURE 3.3: Output contains expected phrases and value correctness

Figure 3.3 reveals that most responses were complete and correct, though a few returned inaccurate values despite proper tool usage. In particular, there were cases where the agent selected and called the correct tools, but failed to compute or express the final result accurately. We further explored reasoning complexity by tracking the number of LLM steps per query and the total execution time required.

FIGURE 3.4: Number of Reasoning Steps per Query

As shown in Figure 3.4, the vast majority of queries required six reasoning steps, which aligns with the expected structure: initial input, internal thought, first tool call, second thought, second tool call, and final answer. However, a small number of queries triggered significantly more steps, likely due to repeated tool calls or loops caused by agent hallucination. On the other end, some queries ended early, either because the agent failed to complete the full reasoning loop or because it terminated prematurely due to hallucination.

3.2.5 Limitations and Targeted Improvements

Although the baseline ReAct agent demonstrated solid performance overall, the evaluation surfaced specific limitations that required refinement. In particular, some responses exhibited minor numerical inaccuracies despite correct tool usage, and others showed inconsistent behaviour when interpreting instructions from the initial prompt. To address these issues, two focused improvements were introduced:

• Prompt refinement: The original prompt was updated and simplified to provide more instructive guidance.

You are a ReAct agent. Read carefully the dataset.
For example: "What is the price of METAL in CURRENCY?":
1) Call get_metal_price with METAL.
2) Then call get_currency_exchange with base='USD', target=CURRENCY.
3) Finally, respond combining both results.

• Relaxed value comparison threshold: The evaluation logic was adjusted to a higher tolerance when comparing numeric outputs to expected values. Specifically, the strict comparison abs(num - expected_value) < 0.01 was relaxed to avoid false negatives caused by rounding.

These two modifications improved the agent’s output correctness without altering its architectural structure. These are the results obtained for the same inputs used before.

FIGURE 3.5: Tool Accuracy vs. RAGAS Tool Accuracy

FIGURE 3.6: Output contains expected phrases and value correctness

Comparing with the previous configuration, Figures 3.5 and 3.6 show that the results have improved and that the agent fails less often than before.

3.3 Agent 2: ReAct Agent with Wikipedia and Wikidata Tools

This agent is designed to answer factual questions using external sources instead of relying only on the model’s internal knowledge. It has access to two tools: one that gets short summaries of people from Wikipedia, and another that finds out who the current president of a given country is using Wikidata.

3.3.1 Available Tools

The agent is set up with two specific tools to get information from public sources:

• get_summary_of(person): returns a short summary of a person by querying Wikipedia’s API.

• get_current_president_of(country): finds the name of the current president of a country by querying Wikidata.

3.3.2 Implementation Notes

At the beginning, we tried accessing the internet using a tool based on the DuckDuckGo API, but it did not work well, so we switched to Wikipedia. There were also issues with the prompt, since the model was hallucinating.
After testing different versions, we ended up keeping only the Wikipedia tool and the one for checking who the current president is. The final prompt used was:

You are a ReAct agent. You must use tools to answer questions; do not assume you know any answer beforehand.

If the question is like "Who is the president of COUNTRY?":
1) Call the tool `get_current_president_of` with the country's name.
2) Use the tool's output as your answer.

If the question is like "Give me a summary of PERSON":
1) Call the tool `get_summary_of` with the person's name.
2) Use the tool output to generate a natural, fluent summary.
3) Your answer must be factually faithful to the tool output; do not invent or include information from outside sources.

Example:
- Question: "Give me a summary of Barack Obama?"
- Tool output: "Barack Hussein Obama II is an American politician who (...)"
- Your response: "Barack Obama served as the 44th U.S. president and was (...)"

Don't call two tools at the same time; first call one and then the next.
Make sure your response is accurate, relevant, and based only on the tool result.

3.3.3 First Evaluation Results and Observations

Once the agent was created, a smaller set of inputs was used for testing than for the previous agent. The main reason is that the evaluation metrics used this time require more processing time, so a reduced number of queries was necessary.

9 https://en.wikipedia.org/api/rest_v1/
10 https://www.wikidata.org/w/api.php

FIGURE 3.7: Tool Accuracy scores for each query using RAGAS.

As shown in Figure 3.7, some queries score the maximum value for tool accuracy, which means the agent selected and executed the correct tools in those cases. However, there are clear failures where the score drops to zero. This happens with prompts where the agent does not identify the type of action that must be taken, mainly those where more than one tool needed to be used.
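The symbolic, exact-match notion of tool accuracy used in these comparisons can be sketched in a few lines. The function name, the (tool_name, args) call representation and the in-order matching rule are illustrative assumptions, not the exact thesis code:

```python
# Illustrative sketch of a symbolic (exact-match) tool-accuracy score,
# assuming each call is recorded as a (tool_name, args) pair.

def tool_accuracy(expected_calls, actual_calls):
    """Fraction of expected tool calls reproduced, in order, by the agent."""
    if not expected_calls:
        return 1.0
    matched = 0
    remaining = list(actual_calls)
    for call in expected_calls:
        # Consume actual calls in order; each expected call must match next.
        if remaining and remaining[0] == call:
            remaining.pop(0)
            matched += 1
    return matched / len(expected_calls)

# Example: the agent called the price tool but skipped the exchange tool,
# so only one of the two expected calls matches -> score 0.5.
expected = [("get_metal_price", {"metal_name": "gold"}),
            ("get_currency_exchange", {"base": "USD", "target": "EUR"})]
actual = [("get_metal_price", {"metal_name": "gold"})]
score = tool_accuracy(expected, actual)  # 0.5
```

Under this scoring, skipping one of two expected calls yields exactly the 0.5 value discussed for Figure 3.2, while calling a wrong tool in place of an expected one yields 0 for that position.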
Secondly, focusing on faithfulness (Figure 3.8), most responses are well aligned with the retrieved content. The majority of scores are 1.0, meaning the model did not introduce any unsupported information. However, there are a few notable exceptions that illustrate the behaviour of the metric.

In the case of Who is the president of Argentina?, even though the tool was correctly invoked and returned the right answer, the final response was not marked as faithful (score 0.0). This happened because the model rephrased the output.

A similar score of 0.0 appears in Who is the president of France and give me a summary, where the tool retrieved two correct fragments about Emmanuel Macron, but the model failed to use them and instead claimed it could not find any relevant information. Since the final answer ignored the retrieved content completely, faithfulness drops to the lowest possible value.

For Tell me about Barack., the tool returned a summary of Obama’s background, but the model generated a longer response. Even though all of this information is correct, it was not present in the retrieved content, so the metric scored 0.8125.

Summarise the current president of Mexico receives a perfect faithfulness score of 1.0 despite the tool returning an error. This is because the model did not invent an answer but acknowledged the lack of retrieved content and responded accordingly, which the metric rewards as fully faithful behaviour.

FIGURE 3.8: Faithfulness results for the evaluated queries.

Regarding answer relevancy (Figure 3.9), results are generally strong (above 0.75), suggesting that the majority of responses address the user’s question effectively. For instance, answers about the presidents of Argentina, Germany or Chile receive high scores, indicating that the information provided is relevant and directly answers the input prompt. However, there are clear drops in some cases.
Lower scores appear, for instance, in Give me a summary of Marie Curie and Tell me about Barack. Although the model provides accurate historical facts, it introduces additional details well beyond the retrieved biography. While informative, this extra content partially dilutes the focus of the response in relation to the original question.

The two lowest scores appear in Who is the president of France and give me a summary and Summarise the current president of Mexico. In the first, the model ignores the tool output and claims no information was found, which fails to address the question. In the second, the tool fails and the model gives a fallback response without answering the user’s request. Although these replies may be faithful, they are not considered relevant as they do not fulfil the user’s intent.

FIGURE 3.9: Answer Relevancy per query.

Finally, context precision (Figure 3.10) remains consistently high across almost all queries, indicating that the retrieved passages are generally well targeted and relevant to the user question. In most cases, the content used by the model aligns closely with the expected answer.

The only exception is Summarise the current president of Mexico, where the score drops due to a tool failure that returned no usable content. Notably, Who is the president of France and give me a summary also scores high, as the tool correctly retrieved relevant information. However, the model failed to use it, which affects other metrics but not context precision itself.

FIGURE 3.10: Context Precision score for each query using a single retrieved context.

3.3.4 Limitations and Targeted Improvements

To improve the agent’s answers and reliability, a new prompt was designed. The main problem was that the agent was ignoring the tool output and answering the query from its own memory, so the prompt had to be refined mainly to address that.
You are a ReAct agent. You must use tools to answer questions; do not assume you know any answer beforehand.

If the question is like "Who is the president of COUNTRY?":
1) Call the tool `get_current_president_of` with the country's name.
2) Use the tool's output as your answer.

If the question is like "Give me a summary of PERSON":
1) Call the tool `get_summary_of` with the person's name.
2) Use the tool output to generate a natural, fluent summary.
3) Your answer must be factually faithful to the tool output; do not invent or include information from outside sources.

If the question is like "Give me a summary of the president of COUNTRY":
1) First, call the tool `get_current_president_of` using the country name.
2) Then, take the returned name (the president's name) as a string.
3) Call the tool `get_summary_of` with that name.
4) Finally, use the result to write a fluent, factual summary based strictly on the tool output.

IMPORTANT:
- Do not use the name of the tool as input.
- Only use the actual content returned by the tool (the person's name).

Example:
- Question: "Give me a summary of Barack Obama?"
- Tool output: "Barack Hussein Obama II is an American politician who was the 44th president of the United States from 2009 to 2017..."
- Your response: "Barack Obama served as the 44th U.S. president and was the first African American to hold the office..."

Example 2:
- Question: "Give me a summary of the president of Germany."
- First tool call: `get_current_president_of("Germany")`
- First tool output: "Frank-Walter Steinmeier"
- Second tool call: `get_summary_of("Frank-Walter Steinmeier")`
- Second tool output: "Frank-Walter Steinmeier is a German politician serving as President of Germany since 2017. He previously served twice as Minister for Foreign Affairs and as Chief of the Federal Chancellery. He is a member of the Social Democratic Party of Germany."
- Your response: "Frank-Walter Steinmeier is a German politician who has been President of Germany since 2017. He has also served as Minister for Foreign Affairs and is a member of the Social Democratic Party."

Don't call two tools at the same time; first call one, then the next.

IMPORTANT: Your final answer must only include facts that are *explicitly present* in the tool output. Do not add anything, even if you know it is true. Do not infer, expand, or include details from memory.

Make sure your response is accurate, relevant, and based only on the tool result. Always make sure to wait for the result of each tool call before using its output in the next step. Do not hardcode or reuse tool names as strings. Use only the content returned by the tool as input for further reasoning or tool calls.

Once this second prompt was defined, the same metrics as before can be measured. Firstly, looking at tool accuracy: as shown in Figure 3.11, tool accuracy has improved compared to the previous version and the agent interprets the queries correctly.

FIGURE 3.11: Tool Accuracy scores for each query using RAGAS.

Secondly, regarding faithfulness (Figure 3.12), the scores remain mostly high, except for Give me a summary of Marie Curie, where the agent hallucinates by relying on internal memory and ignoring the response retrieved from the tool. Since some adjectives match, the score is not completely 0. For the other queries, the agent provides factual information based on the retrieved context.

FIGURE 3.12: Faithfulness results for the evaluated queries.

In contrast, answer relevancy (Figure 3.13) shows slightly more consistency, with fewer extreme drops. The prompt adjustment seems to help the agent stay more focused on the user query. Some cases, however, highlight certain limitations.
In Give me a summary of Marie Curie, the response includes valid and factual information, but it adds content and partially shifts the focus, which results in a lower score (0.45). Similarly, Tell me about Barack also scores poorly (0.46), as the answer is overly general and lacks specific alignment with the retrieved tool output.

FIGURE 3.13: Answer Relevancy per query.

Lastly, as shown in Figure 3.14, context precision remains consistently high across all evaluated queries, with a perfect score of 1.0 in every case. This indicates that the passages retrieved by the tool are always relevant and correctly aligned with the user’s question.

FIGURE 3.14: Context Precision score for each query using a single retrieved context.

3.4 Agent 3: Metal-Focused Agent with PDF, Finance and Wikipedia Tools

This third agent is meant to be a more complex assistant than the first one, also focused on metals. It can do three main things: look up technical information about a metal (such as its description, industrial uses and the 2023 reference price) from a local PDF; get the latest market price from Yahoo Finance; and fetch general background information from Wikipedia. It is mainly useful for people like metal investors, students or engineers who want quick details about how metals are used, and even for chatbots on industrial or educational websites.

3.4.1 Available Tools

The agent includes the following tools:

• get_metal_info(metal): returns the metal’s description, industrial uses and its 2023 reference price by parsing a local PDF file.

• get_metal_price_yfinance(metal): gets the latest market price of the metal using Yahoo Finance data, limited to a small set of supported symbols.

• describe_metal(metal): retrieves a general description from Wikipedia, first trying the “(metal)” page and falling back to the base term if needed.
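The “(metal)” fallback logic of the last tool can be sketched as below. The fetch function is injected so the lookup order can be shown without network access; in the real tool it would wrap a Wikipedia API client, and all names and messages here are illustrative assumptions:

```python
# Sketch of the describe_metal lookup order: try the disambiguated
# "<Metal> (metal)" page first, then fall back to the base term.
# fetch_summary stands in for a Wikipedia API call; it returns a
# summary string, or None when the page does not exist.

def describe_metal(metal, fetch_summary):
    title = metal.strip().capitalize()
    for candidate in (f"{title} (metal)", title):
        summary = fetch_summary(candidate)
        if summary:
            return summary
    # Hypothetical fallback message when neither page exists.
    return f"No description found for '{metal}'."

# Tiny dict-backed stand-in for the API:
pages = {"Gold": "Gold is a chemical element with symbol Au ..."}
print(describe_metal("gold", pages.get))
```

Injecting the fetcher keeps the lookup strategy testable; swapping in a real HTTP client would not change the fallback logic.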
3.4.2 Implementation Notes

This is the prompt used for the agent:

You are a ReAct agent specialized in answering questions about metals.

You have access to the following tools:
1. `get_metal_info(metal_name)`: Retrieves a technical description, industrial uses, and the 2023 reference price for a specific metal from a local PDF document.
2. `get_metal_price_yfinance(metal_name)`: Retrieves the most recent market price (USD/ounce) of common metals like gold, silver, platinum, palladium, and copper.
3. `describe_metal(metal_name)`: Provides a general encyclopedic description of a metal from Wikipedia.

You must follow this step-by-step reasoning:
1. First, identify exactly which tool matches the user's request.
   - Use `get_metal_info` if the question refers to "the document", "technical data", "description", "uses", or "2023 price".
   - Use `get_metal_price_yfinance` if the question is about current or market price.
   - Use `describe_metal` only if the user is asking for general knowledge.
2. Call the tool.
3. Then summarize or quote the tool output explicitly in your final answer. Do not invent information. Do not skip this step.

Your answers must be clear and informative. Do not write anything until the tool has responded. Always base your answers on the tool output.

3.4.3 First Evaluation Results and Observations

The agent is tested with 10 queries and with the same metrics as the previous agent, except for ToolCallAccuracy, as its utility has already been proven in the previous cases. This time, the metrics fit better as the answers are longer strings.

11 https://pypi.org/project/yfinance/
12 https://en.wikipedia.org/api/rest_v1/

The agent achieves high faithfulness scores in most cases (Figure 3.15), with values of 1.0. However, some drops reveal specific issues.
Describe gold receives a faithfulness score of 0.5, as the model provides only a vague closing question ("Is there anything else...") instead of using the retrieved information. In Can you give me the description and price of silver from the document?, the score drops to 0.0: since the model generates a generic tool-failure message without any retrieved content, the answer cannot be evaluated as grounded. What’s the technical description of silver? receives a relatively low faithfulness score. Although the response is factually correct and related to the source, the way it is reformulated apparently weakens the traceability to the retrieved content, lowering the faithfulness score.

FIGURE 3.15: Faithfulness results for the evaluated queries.

As shown in Figure 3.16, answer relevancy scores are generally high, with most responses achieving values above 0.75. However, Can you give me the description and price of silver from the document? scores below 0.5, as a tool failure prevented the model from retrieving any meaningful content, resulting in a fallback response that does not fully address the original intent. Describe gold receives a relatively high score (0.75) even though the response does not directly address the question. As the response stays on topic and refers implicitly to the requested entity, it is still considered thematically relevant, despite lacking informative value. This highlights one of the limitations of the metric when evaluating vague or evasive answers.

FIGURE 3.16: Answer Relevancy per query.

Context precision (Figure 3.17) is perfect for most queries. However, there are two clear exceptions. In Describe gold, context precision drops to 0.0 because, although the system retrieved a relevant definition of gold, the model did not use it at all in the response. The generated output simply asks a follow-up question without incorporating any retrieved facts, which breaks the link between context and answer. A similar issue occurs in Can you give me the description and price of silver from the document?, where the tool fails and returns an error. Since no content is retrieved, the model generates a generic fallback message, and context precision is set to 0.0 due to the complete absence of usable input.

FIGURE 3.17: Context precision per query for a single retrieved context.

3.4.4 Limitations and Targeted Improvements

Looking at the answers received, a new prompt was designed. The main goal of this new prompt was to ensure that the model would not answer the query using its internal knowledge. That rule has been emphasised many times throughout the prompt.

You are a ReAct agent specialized in answering questions about metals.

GOLDEN RULE: you MUST only use the information provided by the tools. NEVER add information by yourself.

Read carefully the output of the tool to generate a focused and concise response to the specific question. Do NOT just repeat or copy the entire tool result.

For example:
- If the question is only about industrial uses, your answer should only summarize the "Industrial Uses" section from the tool output.
- If the question asks about the 2023 price, your answer should mention only the price part from the tool.

Avoid including unrelated parts (e.g., don't mention the description if it wasn't asked). Be precise and stay on topic. Summarize, don't copy-paste. Your job is to extract the relevant information and rephrase it clearly.

You have access to the following tools:
1. `get_metal_info(metal_name)`: Retrieves a technical description, industrial uses, and "2023 price" for a specific metal from a local PDF document.
2. `get_metal_price_yfinance(metal_name)`: Retrieves the most recent market price (USD/ounce) of common metals like gold, silver, platinum, palladium, and copper.
3. `describe_metal(metal_name)`: Provides a general encyclopedic description of a metal from Wikipedia.
You must follow this step-by-step reasoning:
1. First, identify exactly which tool matches the user's request.
   - Use `get_metal_info` if the question is about the 2023 price, something from the PDF, technical stuff, industrial uses, or if it says "according to the document".
   - Use `get_metal_price_yfinance` ONLY if the question says something like "current", "today", or "latest market price". Never use this tool for 2023 prices; it's wrong.
   - Use `describe_metal` only if the user is asking for general knowledge.
2. Call the tool.
3. Then summarize or quote the tool output explicitly in your final answer. Do not invent information. Do not skip this step.

Your answers must be clear and informative. Do not write anything until the tool has responded. ALWAYS base your answers on the tool output.

As shown, the prompt has been extended mainly to prevent hallucinations and to ensure the agent answers using the output from the tools rather than generating its own response. Once the new prompt has been applied, these are the results obtained.

FIGURE 3.18: Faithfulness results for the evaluated queries.

The agent improves faithfulness (Figure 3.18) compared to the previous version. Some deviations are observed, such as in Can you check the latest price of copper and describe it? (faithfulness 0.4). Although the model does not introduce any unsupported information, the response reformulates and simplifies the retrieved content to the point where the connection to the original context becomes less explicit. This weaker traceability results in a lower score. Similar to the previous agent version, Describe gold gets a low score. Although relevant information was retrieved, the model ignores it and replies with a vague follow-up question.

FIGURE 3.19: Answer Relevancy per query.

For answer relevancy (Figure 3.19), the agent achieves consistent performance across most queries.
As happened in the previous case (Figure 3.16), Describe gold achieves a score around 0.75 despite not directly answering the prompt.

FIGURE 3.20: Context Precision per query

As seen in Figure 3.20, context precision remains perfect across nearly all questions, correcting previous errors except for Describe gold.

Chapter 4

Conclusions and future work

This thesis explored the evaluation of ReAct agents with access to external tools, using both custom rules and RAGAS metrics to assess tool usage, answer relevance, factual consistency and context exploitation.

Evaluation metrics

As agents are still a relatively new paradigm in AI, the need to evaluate them properly has become increasingly important. Unlike traditional language models, agents operate through multi-step reasoning, tool usage and memory, which introduces new dimensions to assess. However, the field of agent evaluation is still in its early stages, and most existing methods are not specifically designed for agentic workflows. As a result, current solutions are limited and often rely on either manual inspection or metrics that were originally built for static question answering systems. This creates a gap between how agents behave in practice and how well we are able to measure their performance.

On the one hand, the most established option for evaluating agents so far is the use of RAGAS metrics. In this work, we applied several of them: Faithfulness, Context Precision, Answer Relevancy and Tool Accuracy, alongside custom rule-based metrics. Each metric provides a different perspective on the agent’s behaviour.

Answer Relevancy has been useful to verify whether the response stays on topic, but it does not guarantee that the content is factually correct. It confirms that the agent understood the theme of the question, but not whether the information provided is accurate.
Answer Relevancy is therefore best interpreted with caution: because it measures topical alignment rather than factual accuracy, its score can be high even when the answer is wrong, or low even when the answer is correct, depending on how closely the wording of the response mirrors the wording of the question.

Faithfulness, in contrast, is more helpful when trying to detect hallucinations, since it checks whether the claims in the answer are supported by the retrieved context. Still, if the agent answers from its own memory and happens to align with the context by coincidence, the score can be misleading.

Context Precision could be a powerful metric in scenarios where multiple documents or passages are retrieved, as it measures how focused the retrieved information is. However, in our case, since only one chunk is used per query, the metric simplifies to a binary outcome (0 or 1), which limits its usefulness.

Tool Accuracy, by contrast, proved to be particularly valuable, as it allowed us to track the reasoning process step by step and identify exactly where the agent failed or succeeded at choosing the right tool, passing the correct arguments or combining the results properly.

On the other hand, in addition to RAGAS, custom rule-based metrics were developed for Agent 1 to evaluate specific aspects of task completion. These included checks for correct tool selection, presence of the metal name in the output, and numerical accuracy of the returned price. These metrics worked very well for structured tasks involving numerical calculations or clearly defined entities. However, they are only applicable in scenarios where the expected output can be precisely defined. For more open or descriptive responses, such as summaries or general explanations, these checks become less useful and cannot be validated through exact matching or numeric thresholds.
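The rule-based checks described above can be sketched as two small functions. The regex extraction, the keyword test and the 0.05 tolerance are illustrative assumptions about how such checks might be implemented, not the thesis's exact code:

```python
import re

# Sketch of Agent 1's custom checks: keyword presence in the final
# answer, and numerical accuracy within a (relaxed) tolerance.

def output_contains_expected(answer, metal, currency):
    """Check that the response mentions both the metal and the currency."""
    text = answer.lower()
    return metal.lower() in text and currency.lower() in text

def value_correct(answer, expected_value, tol=0.05):
    """Extract numbers from the answer; pass if any is within tolerance."""
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", answer)]
    return any(abs(n - expected_value) < tol for n in numbers)

answer = "The price of gold is 61.84 EUR per gram."
assert output_contains_expected(answer, "gold", "EUR")
assert value_correct(answer, 61.82)  # passes under the relaxed tolerance
```

As the surrounding text notes, such checks only make sense when the expected metal, currency and value can be stated precisely in advance; they say nothing about free-form summaries.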
Problems faced

Apart from evaluation metrics, one of the main takeaways is the importance of prompt design. For both Agent 2 and Agent 3, the tools were called correctly from the beginning, but the model often ignored the tool output or generated an answer from memory. Iterating on the prompt was necessary to make the agent incorporate the retrieved content and remain grounded. In Agent 3, this was particularly important when using domain-specific data from PDFs or financial APIs.

Today, prompt engineering is gaining increasing attention and demand. Being as precise and specific as possible when interacting with a language model has become essential to ensure useful and grounded outputs. It is a growing field, and this project has highlighted its importance in practice. Despite having a correct architecture and working tool integrations, the agents still failed to produce reliable answers until the prompt was carefully refined. In this case, prompt engineering was the main strategy used to fix issues and improve the overall behaviour of the system. As models become more capable but also more sensitive to phrasing, this area is rapidly evolving and opening up new professional opportunities in both research and industry.

Hallucinations were one of the most frequent and difficult problems during the project. Even when the agent used the correct tools, it often ignored the tool output or gave answers based on its own memory. Many of these responses sounded correct but were not really connected to the retrieved information. Fixing this required many small tests and changes until the agent started working as expected. Although it may seem simple in theory, improving and debugging an agent like this was much harder than it looked. Even so, avoiding hallucinations completely is still very difficult, especially for open or general questions.
Another key insight from this project is that LLM-based agents tend to perform poorly when dealing with generic tasks, such as accessing the internet or answering broad questions. In these cases, hallucinations are far more likely, and achieving reliable behaviour requires substantial prompt tuning and validation. In contrast, when the task is narrowly defined and the tools provide clear, structured outputs, the agent performs much more reliably. This suggests that LLM agents are currently best suited for highly specific use cases where the reasoning path and expected output can be tightly controlled.

Future work

Looking ahead, one of the biggest challenges is making sure that an agent can be used in real applications and still behave reliably. In practice, it is very hard to guarantee that it will work well 100% of the time. What is usually done in these cases is to define a small set of key queries that must always work. This way, you can make changes or improvements without breaking the most important behaviours.

As for evaluation, while the metrics used in this project were useful, there are other tools worth exploring. One of them is Giskard, which allows you to write tests for your model, check for hallucinations or risky outputs, and make sure it behaves as expected in different situations. Using something like that would make it easier to catch problems before putting the agent into production.

In general, what is needed going forward is a more complete way to build and test these agents. That includes writing better prompts and learning how to write good ones, defining what they are supposed to do and checking carefully how they behave. As these systems get more complex, having a solid process for improving and testing them becomes even more important.

Overall, the most time-consuming challenge was getting the agent to generate the correct final answer after using the tools.
Once tool calls were reliable, the final reasoning step became the main source of error. Future work should focus on improving how tool outputs are used in the response, and on finding better ways to avoid hallucinations without limiting the agent too much.

1 https://www.giskard.ai/

Bibliography

Amazon Web Services (2024). What is an AI Agent? Accessed: 2025-06-26. URL: https://aws.amazon.com/what-is/ai-agents/?nc1=h_ls.
DeepLearning.AI (2024a). Agentic Design Patterns Part 4: Planning. Accessed: 2025-05-27. URL: https://www.deeplearning.ai/the-batch/agentic-design-patterns-part-4-planning/.
— (2024b). Agentic Design Patterns Part 5: Multi-Agent Collaboration. Accessed: 2025-05-27. URL: https://www.deeplearning.ai/the-batch/agentic-design-patterns-part-5-multi-agent-collaboration/.
Fauscette, Michael (2024). Agentic AI vs LLMs: Understanding the Shift from Reactive to Proactive AI. Accessed: 2025-04-25. URL: https://www.arionresearch.com/blog/agentic-ai-vs-llms-understanding-the-shift-from-reactive-to-proactive-ai.
Google Cloud (2025). What are AI agents? Definition, examples, and types. Accessed: 2025-05-07. URL: https://cloud.google.com/discover/what-are-ai-agents.
Jiang, Albert Q. et al. (2023). “Mistral 7B”. In: arXiv preprint arXiv:2310.06825.
Microsoft (2024). Tool Use. Accessed: 2025-05-28. URL: https://microsoft.github.io/ai-agents-for-beginners/04-tool-use/.
OpenAI (2023). GPT-4 Technical Report. Accessed: 2025-05-26. URL: https://openai.com/research/gpt-4.
RAGAS (2024). RAGAS Documentation. Accessed: 2025-05-26. URL: https://docs.ragas.io.
RAGAS: Answer Relevance Metric (2024). Accessed: 2025-05-10. URL: https://docs.ragas.io/en/latest/concepts/metrics/available_metrics/answer_relevance/#response-relevancy.
RAGAS: Context Precision Metric (2024). Accessed: 2025-05-10. URL: https://docs.ragas.io/en/latest/concepts/metrics/available_metrics/context_precision/#example.
RAGAS: Faithfulness Metric (2024). Accessed: 2025-05-10. URL: https://docs.ragas.io/en/latest/concepts/metrics/available_metrics/faithfulness/.
RAGAS Metrics for Agents: Tool Call Accuracy (2024). Accessed: 2025-05-10. URL: https://docs.ragas.io/en/latest/concepts/metrics/available_metrics/agents/#tool-call-accuracy.
Touvron, Hugo et al. (2023). “LLaMA: Open and Efficient Foundation Language Models”. In: arXiv preprint arXiv:2302.13971.
Vidhya, Analytics (2024). Agentic AI: Reflection Pattern. Accessed: 2025-05-27. URL: https://www.analyticsvidhya.com/blog/2024/10/agentic-ai-reflection-pattern/.
Wu, Juncheng et al. (2025). Knowledge or Reasoning? A Close Look at How LLMs Think Across Domains. arXiv preprint arXiv:2506.02126. URL: https://arxiv.org/pdf/2506.02126v1.
Wu, Qingyun et al. (2023). AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. arXiv: 2308.08155 [cs.AI]. URL: https://arxiv.org/abs/2308.08155.
Yao, Shunyu et al. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. arXiv: 2210.03629 [cs.CL]. URL: https://arxiv.org/abs/2210.03629.