WEBINAR | AI Prototype to Production: Operationalizing and Orchestrating AI
December 28, 2023

What is RAG? (Retrieval Augmented Generation)

Table of Contents:


Large language models (LLMs) like GPT-3.5 have proven to be capable when asked about commonly known subjects or topics that they would have received a large quantity of training data for. However, when asked about topics that include data they have not been trained on, they either state that they do not possess the knowledge or, worse, can hallucinate plausible answers.

Retrieval Augmented Generation (RAG) is a method that improves the performance of Large Language Models (LLMs) by integrating an information retrieval component with the model's text generation capabilities. This approach addresses two main limitations of LLMs:

  1. Outdated Knowledge: Traditional LLMs, like ChatGPT, have a static knowledge base that ends at a certain point in time (for example, ChatGPT's knowledge cut-off is in January 2022). This means they lack information on recent events or developments.

  2. Knowledge Gaps and Hallucination: When LLMs encounter gaps in their training data, they may generate plausible but inaccurate information, a phenomenon known as "hallucination."

RAG tackles these issues by combining the generative capabilities of LLMs with real-time information retrieval from external sources. When a query is made, RAG retrieves relevant and current information from an external knowledge store and uses this information to provide more accurate and contextually appropriate responses by adding this information to the prompt. This is equivalent to handing someone a pile of papers covered in text and instructing them that "the answer to this question is contained in this text; please find it and write it out for me using natural language." This approach allows LLMs to respond with up-to-date information and reduces the risk of providing incorrect information due to knowledge gaps.

RAG Architecture

This article focuses on what's known as "naive RAG", which is the foundational approach of integrating LLMs with knowledge bases. We'll discuss more advanced techniques at the end of this article, but the fundamental ideas of RAG systems (of all levels of complexity) still share several key components working together:

  1. Orchestration Layer: This layer manages the overall workflow of the RAG system. It receives user input along with any associated metadata (like conversation history), interacts with various components, and orchestrates the flow of information between them. These layers typically include tools like LangChain, Semantic Kernel, and custom native code (often in Python) to integrate different parts of the system.

  2. Retrieval Tools: These are a set of utilities that provide relevant context for responding to user prompts. They play an important role in grounding the LLM's responses in accurate and current information. They can include knowledge bases for static information and API-based retrieval systems for dynamic data sources.

  3. LLM: The LLM is at the heart of the RAG system, responsible for generating responses to user prompts. There are many varieties of LLM, and can include models hosted by third parties like OpenAI, Anthropic, or Google, as well as models running internally on an organization's infrastructure. The specific model used can vary based on the application's needs.

  4. Knowledge Base Retrieval: Involves querying a vector store, a type of database optimized for textual similarity searches. This requires an Extract, Transform, Load (ETL) pipeline to prepare the data for the vector store. The steps taken include aggregating source documents, cleaning the content, loading it into memory, splitting the content into manageable chunks, creating embeddings (numerical representations of text), and storing these embeddings in the vector store.

  5. API-based Retrieval: For data sources that allow programmatic access (like customer records or internal systems), API-based retrieval is used to fetch contextually relevant data in real-time.

  6. Prompting with RAG: Involves creating prompt templates with placeholders for user requests, system instructions, historical context, and retrieved context. The orchestration layer fills these placeholders with relevant data before passing the prompt to the LLM for response generation. Steps taken can include tasks like cleaning the prompt of any sensitive information and ensuring the prompt stays within the LLM's token limits


The challenge with RAG is finding the correct information to provide along with the prompt!

Indexing Stage

  • Data Organization: Imagine you’re the little guy in the cartoon above, surrounded by textbooks. We take each of these books and break them into bite-sized pieces—one might be about quantum physics, while another might be about space exploration. Each of these pieces, or documents, is processed to create a vector, which is like an address in the library that points right to that chunk of information.
  • Vector Creation: Each of these chunks is passed through an embedding model, a type of model that creates a vector representation of hundreds or thousands of numbers that encapsulate the meaning of the information. The model assigns a unique vector to each chunk—sort of like creating a unique index that a computer can understand. This is known as the indexing stage.

Querying Stage

  • Querying: When you want to ask an LLM a question it may not have the answer to, you start by giving it a prompt, such as “What’s the latest development in AI legislation?”
  • Retrieval: This prompt goes through an embedding model and transforms into a vector itself—it's like it’s getting its own search terms based on its meaning and not just identical matches to its keywords. The system then uses this search term to scour the vector database for the most relevant chunks related to your question.

  • Prepending the Context: The most relevant chunks are then served up as context. It’s similar to handing over reference material before asking your question, except we give the LLM a directive: “Using this information, answer the following question.” While the prompt to the LLM gets extended with a lot of this background information, you as a user don’t see any of this. The complexity is handled behind the scenes.
  • Answer Generation: Finally, equipped with this newfound information, the LLM generates a response that ties in the data it’s just retrieved, answering your question in a way that feels like it knew the answer all along.


The image is a two-panel cartoon. Both panels depict a woman asking a man wearing a “ChatGPT” sweater, “What’s the newest MacBook?” The cartoon is a commentary on the difference in information provided when using training data only versus incorporating additional context.

Chunking techniques

The actual chunking of the documents is somewhat of an art in itself. GPT-3.5 has a maximum context length of 4,096 tokens, or about 3,000 words. Those words represent the sum total of what the model can handle—if we create a prompt with a context 3,000 words long, the model will not have enough room to generate a response. Realistically, we shouldn’t prompt with more than about 2,000 words for GPT-3.5. This means there is a trade-off with chunk size that is data-dependent.

With smaller chunk_size values, the text returned produces more detailed chunks of text but risks missing information if they’re located far away in the text. On the other hand, larger chunk_size values are more likely to include all necessary information in the top chunks, ensuring better response quality, but if the information is distributed throughout the text, it will miss important sections.

Let’s use some examples to illustrate how this trade-off works, using the recent Tesla Cybertruck release event. While some models of the truck will be available in 2024, the cheapest model—with just RWD—will not be available until 2025. Depending on the formatting and chunking of the text used for RAG, the model’s response may or may not encounter this fact!

In these images, blue indicates where a match was found and the chunk was returned; the grey box indicates the chunk was not retrieved; and the red text indicates where relevant text existed but was not retrieved. Let’s take a look at an example where shorter chunks succeed:

Exhibit A: Shorter chunks are better… sometimes.

In the image above, on the left, the text is structured so that the admission that the RWD will be released in 2025 is separated by a paragraph but also has relevant text that is matched by the query. The method of retrieving two shorter chunks works better because it captures all the information. On the right, the retriever is only retrieving a single chunk and therefore does not have the room to return the additional information, and the model is given incorrect information.

However, this isn’t always the case; sometimes longer chunks work better when text that holds the true answer to the question doesn’t strongly match the query. Here’s an example where longer chunks succeed:

Exhibit B: Longer chunks are better… sometimes. 

Optimizing RAG

Improving the performance of a RAG system involves several strategies that focus on optimizing different components of the architecture:

  1. Enhance Data Quality (Garbage in, Garbage out): Ensure the quality of the context provided to the LLM is high. Clean up your source data and ensure your data pipeline maintains adequate content, such as capturing relevant information and removing unnecessary markup. Carefully curate the data used for retrieval to ensure it's relevant, accurate, and comprehensive.

  2. Tune Your Chunking Strategy: As we saw earlier, chunking really matters! Experiment with different text chunk sizes to maintain adequate context. The way you split your content can significantly affect the performance of your RAG system. Analyze how different splitting methods impact the context's usefulness and the LLM's ability to generate relevant responses.

  3. Optimize System Prompts: Fine-tune the prompts used for the LLM to ensure they guide the model effectively in utilizing the provided context. Use feedback from the LLM's responses to iteratively improve the prompt design.

  4. Filter Vector Store Results: Implement filters to refine the results returned from the vector store, ensuring that they are closely aligned with the query's intent. Use metadata effectively to filter and prioritize the most relevant content.

  5. Experiment with Different Embedding Models: Try different embedding models to see which provides the most accurate representation of your data. Consider fine-tuning your own embedding models to better capture domain-specific terminology and nuances.

  6. Monitor and Manage Computational Resources: Be aware of the computational demands of your RAG setup, especially in terms of latency and processing power. Look for ways to streamline the retrieval and processing steps to reduce latency and resource consumption.

  7. Iterative Development and Testing: Regularly test the system with real-world queries and use the outcomes to refine the system. Incorporate feedback from end-users to understand performance in practical scenarios.

  8. Regular Updates and Maintenance: Regularly update the knowledge base to keep the information current and relevant. Adjust and retrain models as necessary to adapt to new data and changing user requirements.

Advanced RAG techniques

So far, I've covered what's known as "naive RAG." Naive RAG typically begins with a basic corpus of text documents, where texts are chunked, vectorized, and indexed to create prompts for LLMs. This approach, while fundamental, has been significantly advanced by more complex techniques. Advancements in RAG architecture have significantly evolved from the basic or 'naive' approaches, incorporating more sophisticated methods for enhancing the accuracy and relevance of generated responses. Aas you can see by the list below, this is a fast developing field and covering all these techniques would necessitate its own article: 
  • Enhanced Chunking and Vectorization: Instead of simple text chunking, advanced RAG uses more nuanced methods for breaking down text into meaningful chunks, perhaps even summarizing them using another model. These chunks are then vectorized using transformer models. The process ensures that each chunk better represents the semantic meaning of the text, leading to more accurate retrieval.
  • Hierarchical Indexing: This involves creating multiple layers of indices, such as one for document summaries and another for detailed document chunks. This hierarchical structure allows for more efficient searching and retrieval, especially in large databases, by first filtering through summaries and then going deeper into relevant chunks.
  • Context Enrichment: Advanced RAG techniques focus on retrieving smaller, more relevant text chunks and enriching them with additional context. This could involve expanding the context by adding surrounding sentences or using larger parent chunks that contain the smaller, retrieved chunks.
  • Fusion Retrieval or Hybrid Search: This approach combines traditional keyword-based search methods with modern semantic search techniques. By integrating different algorithms, such as tf-idf (term frequency–inverse document frequency) or BM25 with vector-based search, RAG systems can leverage both semantic relevance and keyword matching, leading to more comprehensive search results.
  • Query Transformations and Routing: Advanced RAG systems use LLMs to break down complex user queries into simpler sub-queries. This enhances the retrieval process by aligning the search more closely with the user's intent. Query routing involves decision-making about the best approach to handle a query, such as summarizing information, performing a detailed search, or using a combination of methods.
  • Agents in RAG: This involves using agents (smaller LLMs or algorithms) that are assigned specific tasks within the RAG framework. These agents can handle tasks like document summarization, detailed query answering, or even interacting with other agents to synthesize a comprehensive response.
  • Response Synthesis: In advanced RAG systems, the process of generating responses based on retrieved context is more intricate. It may involve iterative refinement of answers, summarizing context to fit within LLM limits, or generating multiple responses based on different context chunks for a more rounded answer.
  • LLM and Encoder Fine-Tuning: Tailoring the LLM and the Encoder (responsible for context retrieval quality) for specific datasets or applications can greatly enhance the performance of RAG systems. This fine-tuning process adjusts these models to be more effective in understanding and utilizing the context provided for response generation.

Putting it all together

RAG is a highly effective method for enhancing LLMs due to its ability to integrate real-time, external information, addressing the inherent limitations of static training datasets. This integration ensures that the responses generated are both current and relevant, a significant advancement over traditional LLMs. RAG also mitigates the issue of hallucinations, where LLMs generate plausible but incorrect information, by supplementing their knowledge base with accurate, external data. The accuracy and relevance of responses are significantly enhanced, especially for queries that demand up-to-date knowledge or domain-specific expertise.

Furthermore, RAG is customizable and scalable, making it adaptable to a wide range of applications. It offers a more resource-efficient approach than continuously retraining models, as it dynamically retrieves information as needed. This efficiency, combined with the system's ability to continuously incorporate new information sources, ensures ongoing relevance and effectiveness. For end-users, this translates to a more informative and satisfying interaction experience, as they receive responses that are not only relevant but also reflect the latest information. RAG's ability to dynamically enrich LLMs with updated and precise information makes it a robust and forward-looking approach in the field of artificial intelligence and natural language processing.