RAG evaluation using Ragas - A proof-of-concept

Summary

One of the most promising use cases for Large Language Models (LLMs) is Retrieval Augmented Generation (RAG). RAG helps LLMs work with large amounts of contextual data: instead of stuffing an entire corpus into the context window, it augments the text generation capabilities of an LLM with the ability to retrieve the parts of a larger context that are relevant to the user's query. Qualifying an LLM with automated testing is a challenge, given the non-deterministic nature of its responses. To tackle this challenge, several LLM assessment frameworks are on the rise, one of which is Ragas (RAG Assessment). The aim of this research task is to build a RAG model, qualify it using Ragas as a proof-of-concept, and share the findings with the community.

Why conduct this research?

How does RAG work?

Let us start by breaking down what RAG (Retrieval Augmented Generation) is. Similar to human working memory, LLMs (Large Language Models) have a limited context window. The more that context window is loaded, the more error-prone an LLM's responses can be. ChatGPT currently supports 128k tokens, or about 96,000 words (as of 11.09.25). Despite this massive size, studies have shown that LLM performance degrades when working with large context windows.

RAG offers an antidote to this degradation. With RAG, a large document or data source is broken down into smaller chunks. These chunks are indexed and stored in a database. When a user queries the LLM, the chunks that are most relevant to the query are retrieved from the database and inserted into the LLM's context window. The LLM then uses these retrieved chunks to formulate a response. Retrieval is typically done with machine learning techniques such as semantic search. For example, say you upload a high-school general science textbook to an LLM. When you ask a question about gravitation, only the chunks that pertain to this topic are retrieved, and the LLM answers using these specific chunks rather than the entire textbook.
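
To make the retrieval step concrete, here is a minimal sketch of semantic search over chunks. It assumes the sentence-transformers package and a toy in-memory list of chunks standing in for the database; a real RAG system would use a proper vector store, but the idea is the same.

    from sentence_transformers import SentenceTransformer, util

    # Toy "database": chunks of a larger document, e.g. a science textbook.
    chunks = [
        "Gravitation is the force of attraction between two masses.",
        "Photosynthesis converts light energy into chemical energy in plants.",
        "Newton's law states that the gravitational force is proportional to the product of the masses.",
    ]

    # Embed every chunk once and keep the embeddings as the index.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    chunk_embeddings = model.encode(chunks, convert_to_tensor=True)

    # At query time, embed the question and retrieve the most similar chunks.
    query = "Why do objects fall towards the Earth?"
    query_embedding = model.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(query_embedding, chunk_embeddings)[0]
    top_k = scores.topk(2)

    retrieved = [chunks[int(i)] for i in top_k.indices]
    # These retrieved chunks, not the entire textbook, would then be placed
    # in the LLM's context window alongside the question.
    print(retrieved)

In production, the chunk embeddings live in a vector database and retrieval is a nearest-neighbour search, but the query-to-chunk similarity shown here is the core of the technique.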

Understanding how RAG works helps us appreciate how widely applicable this technique is. Many companies stand to benefit from building RAG models that index their internal information and offer users a chat interface to access it. For example, one of my former clients, an industrial pump manufacturer, is currently implementing a RAG model to build an LLM that helps customers choose the right pump for their application.

Qualifying a RAG model

It can be difficult to objectively qualify a RAG model. However, there are some metrics that we can use to qualify most RAG models, such as Answer Relevance, Faithfulness, and Context Relevance.

Another challenge with testing LLMs is their non-deterministic nature. Most automated software testing techniques rely on determinism for qualifying software, which makes them unsuitable for LLM testing. This challenge also persists with RAG models.

Ragas (RAG Assessment) is a framework that aims to tackle these challenges. Firstly, Ragas establishes a set of objective metrics. The three metrics mentioned above (Answer Relevance, Faithfulness, Context Relevance) were proposed and defined in the paper that announced Ragas; the framework now offers many more metrics of a similar nature. Secondly, non-determinism is tackled by using another LLM as a judge. This study found LLM-as-judge to be on par with human evaluation within its context. To evaluate the metrics, Ragas uses carefully designed prompts to have an LLM break down a response and assign pass-or-fail verdicts on the corresponding criteria, which are then aggregated into a metric score.
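
As an illustration, a Ragas evaluation run can look roughly like the sketch below. It assumes the ragas and datasets packages (the 0.1.x API; column names and metric imports have changed between releases) and an OpenAI API key for the judge LLM; context-oriented metrics additionally require reference answers, which are omitted here.

    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import answer_relevancy, faithfulness

    # One evaluation sample: the user question, the RAG model's answer,
    # and the chunks that the retriever supplied to the LLM.
    samples = {
        "question": ["What causes objects to fall towards the Earth?"],
        "answer": ["Objects fall because of the gravitational attraction between masses."],
        "contexts": [[
            "Gravitation is the force of attraction between two masses.",
            "Newton's law states that the gravitational force is proportional to the product of the masses.",
        ]],
    }

    dataset = Dataset.from_dict(samples)

    # Ragas prompts a judge LLM (OpenAI by default, via OPENAI_API_KEY) to score
    # each sample on each metric, then aggregates the scores across the dataset.
    result = evaluate(dataset, metrics=[answer_relevancy, faithfulness])
    print(result)

The judge LLM and embedding model can be swapped out via the evaluate() arguments, which is useful when the RAG model under test and the judge should not share the same underlying model.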

Research objective

The objective of this research task is to build a RAG model, qualify it using Ragas, and share what we learn with the community. While building a RAG model from scratch can seem like an intimidating task, it is made much easier by beginner-friendly tutorials from LangChain.
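
For orientation, a LangChain-based RAG pipeline along the lines of those tutorials can be sketched as follows. Module paths shift between LangChain releases, the source file name is hypothetical, and FAISS requires the faiss-cpu package, so treat this as an outline rather than a drop-in implementation.

    from langchain_community.vectorstores import FAISS
    from langchain_core.output_parsers import StrOutputParser
    from langchain_core.prompts import ChatPromptTemplate
    from langchain_core.runnables import RunnablePassthrough
    from langchain_openai import ChatOpenAI, OpenAIEmbeddings
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    # 1. Split the source document into chunks and index them in a vector store.
    with open("textbook.txt") as f:  # hypothetical source document
        text = f.read()
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    docs = splitter.create_documents([text])
    vectorstore = FAISS.from_documents(docs, OpenAIEmbeddings())
    retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

    # 2. Build the generation step: stuff the retrieved chunks into the prompt.
    prompt = ChatPromptTemplate.from_template(
        "Answer the question using only the context below.\n\n"
        "Context:\n{context}\n\nQuestion: {question}"
    )
    llm = ChatOpenAI(model="gpt-4o-mini")

    def format_docs(docs):
        return "\n\n".join(d.page_content for d in docs)

    # 3. Wire retrieval and generation together into a single chain.
    rag_chain = (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )

    print(rag_chain.invoke("What is gravitation?"))

The question, the chain's answer, and the retrieved chunks from such a pipeline are exactly the columns that the Ragas evaluation above expects, which is what makes the two pieces fit together in this proof-of-concept.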

These learnings can be shared in the form of a repository, live demonstrations, articles, webinars, and so on. We will figure this out as we go.

What has been done so far?

The following steps have been completed:

All of this work can be viewed in this Jupyter Notebook within the project's repository.

Observations

Overall, the task has met its objectives. We were able to build a PoC, and here are a few of our learnings:

This control is likely to be crucial for taming a RAG model so that it behaves as intended.

Exploratory Testing Results (WIP)

Next Steps

We aim to share these findings as artifacts, one of which is this article. Other promising avenues are live demos and video recordings.