Large language models (LLMs) like GPT-3.5 have proven to be capable when asked about commonly known subjects or topics for which they received a large quantity of training data. However, when asked about topics outside their training data, they either state that they do not possess the knowledge or, worse, can hallucinate plausible answers.
Retrieval Augmented Generation (RAG) is a method that improves the performance of LLMs by integrating an information retrieval component with the model's text generation capabilities. This approach addresses two main limitations of LLMs:
Outdated Knowledge: Traditional LLMs, like ChatGPT, have a static knowledge base that ends at a certain point in time (for example, ChatGPT's knowledge cut-off is in January 2022). This means they lack information on recent events or developments.
Knowledge Gaps and Hallucination: When LLMs encounter gaps in their training data, they may generate plausible but inaccurate information, a phenomenon known as "hallucination."
RAG tackles these issues by combining the generative capabilities of LLMs with real-time information retrieval from external sources. When a query is made, RAG retrieves relevant and current information from an external knowledge store and uses this information to provide more accurate and contextually appropriate responses by adding this information to the prompt. This is equivalent to handing someone a pile of papers covered in text and instructing them that "the answer to this question is contained in this text; please find it and write it out for me using natural language." This approach allows LLMs to respond with up-to-date information and reduces the risk of providing incorrect information due to knowledge gaps.
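In code, that query-time flow boils down to three steps: retrieve, augment the prompt, generate. Here is a minimal sketch, assuming the OpenAI Python client and a hypothetical retrieve() helper that performs a similarity search against an external knowledge store:

```python
# Minimal query-time RAG flow (illustrative sketch, not a full implementation).
from openai import OpenAI

client = OpenAI()

def retrieve(question: str, top_k: int = 3) -> list[str]:
    """Hypothetical helper: similarity search against an external knowledge store.
    A real implementation would query a vector database (covered below)."""
    raise NotImplementedError("plug in your knowledge store here")

def answer_with_rag(question: str) -> str:
    # 1. Retrieve passages relevant to the question.
    passages = retrieve(question, top_k=3)

    # 2. Add the retrieved text to the prompt as context.
    context = "\n\n".join(passages)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

    # 3. Let the LLM generate a response grounded in that context.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```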
This article focuses on what's known as "naive RAG", which is the foundational approach of integrating LLMs with knowledge bases. We'll discuss more advanced techniques at the end of this article, but the fundamental ideas of RAG systems (of all levels of complexity) still share several key components working together:
Orchestration Layer: This layer manages the overall workflow of the RAG system. It receives user input along with any associated metadata (like conversation history), interacts with the various components, and orchestrates the flow of information between them. This layer is typically built with tools like LangChain or Semantic Kernel, along with custom native code (often Python) to integrate the different parts of the system.
Retrieval Tools: These are a set of utilities that provide relevant context for responding to user prompts. They play an important role in grounding the LLM's responses in accurate and current information. They can include knowledge bases for static information and API-based retrieval systems for dynamic data sources.
LLM: The LLM is at the heart of the RAG system, responsible for generating responses to user prompts. There are many varieties of LLM, including models hosted by third parties like OpenAI, Anthropic, or Google, as well as models running on an organization's own infrastructure. The specific model used can vary based on the application's needs.
Knowledge Base Retrieval: Involves querying a vector store, a type of database optimized for textual similarity searches. This requires an Extract, Transform, Load (ETL) pipeline to prepare the data for the vector store. The steps include aggregating source documents, cleaning the content, loading it into memory, splitting the content into manageable chunks, creating embeddings (numerical representations of text), and storing these embeddings in the vector store; a sketch of this pipeline appears below.
API-based Retrieval: For data sources that allow programmatic access (like customer records or internal systems), API-based retrieval is used to fetch contextually relevant data in real-time.
Prompting with RAG: Involves creating prompt templates with placeholders for user requests, system instructions, historical context, and retrieved context. The orchestration layer fills these placeholders with relevant data before passing the prompt to the LLM for response generation. Additional steps can include cleaning the prompt of any sensitive information and ensuring the prompt stays within the LLM's token limits, as sketched below.
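To make the knowledge-base retrieval pipeline more concrete, here is a minimal sketch using LangChain with OpenAI embeddings and a FAISS vector store. The file path, chunking parameters, and query are placeholders, and import paths vary between LangChain versions:

```python
# Illustrative ETL sketch: load, split, embed, and store documents.
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

# Extract: load the cleaned source documents into memory (hypothetical file).
documents = TextLoader("knowledge_base/cybertruck_announcement.txt").load()

# Transform: split the content into manageable, overlapping chunks.
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(documents)

# Load: embed each chunk and store the vectors for similarity search.
vector_store = FAISS.from_documents(chunks, OpenAIEmbeddings())

# At query time, the orchestration layer retrieves the most similar chunks.
relevant_chunks = vector_store.similarity_search("When does the RWD model ship?", k=3)
```

The chunk_size and chunk_overlap values here are arbitrary; as we'll see shortly, tuning them is one of the most consequential decisions in a RAG system.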
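The prompting step can be sketched just as simply. The template wording, placeholder names, and length guard below are assumptions for illustration, not a standard format:

```python
# Illustrative prompt assembly: the orchestration layer fills placeholders
# with system instructions, conversation history, retrieved context, and
# the user's request before calling the LLM.
RAG_TEMPLATE = """{system_instructions}

Conversation so far:
{history}

Context retrieved from the knowledge base:
{context}

User request:
{user_request}"""

def build_prompt(system_instructions: str, history: str,
                 retrieved_chunks: list[str], user_request: str,
                 max_chars: int = 12_000) -> str:
    context = "\n\n".join(retrieved_chunks)
    prompt = RAG_TEMPLATE.format(
        system_instructions=system_instructions,
        history=history,
        context=context,
        user_request=user_request,
    )
    # Crude length guard: a real system would count tokens, not characters.
    return prompt[:max_chars]
```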
The challenge with RAG is finding the correct information to provide along with the prompt!
The actual chunking of the documents is somewhat of an art in itself. GPT-3.5 has a maximum context length of 4,096 tokens, or about 3,000 words. Those words represent the sum total of what the model can handle—if we create a prompt with a context 3,000 words long, the model will not have enough room to generate a response. Realistically, we shouldn’t prompt with more than about 2,000 words for GPT-3.5. This means there is a trade-off with chunk size that is data-dependent.
With smaller chunk_size values, the retriever returns more granular chunks of text, but it risks missing information when the relevant pieces are located far apart in the source. Larger chunk_size values, on the other hand, are more likely to include all the necessary information within the top chunks, ensuring better response quality; however, because fewer large chunks fit within the token budget, information that is distributed throughout the text can still be missed.
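To see this trade-off in terms of the token budget, here is a rough sketch that splits the same document at two different chunk sizes and estimates how many chunks can be passed along with the prompt. It assumes the tiktoken tokenizer and a hypothetical source file; a real splitter would also add overlap between chunks:

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

def split_into_chunks(text: str, chunk_size: int) -> list[str]:
    """Split text into chunks of at most chunk_size tokens (no overlap, for brevity)."""
    tokens = enc.encode(text)
    return [enc.decode(tokens[i:i + chunk_size]) for i in range(0, len(tokens), chunk_size)]

# Hypothetical source document containing the announcement text.
document = open("cybertruck_announcement.txt").read()

for chunk_size in (128, 512):
    chunks = split_into_chunks(document, chunk_size)
    # The 4,096-token window is shared by the template, question, retrieved
    # context, and the model's answer, so only part of it is left for chunks.
    context_budget = 2_048  # rough allowance for retrieved context, in tokens
    max_chunks = context_budget // chunk_size
    print(f"chunk_size={chunk_size}: {len(chunks)} chunks total, "
          f"room for roughly {max_chunks} of them in the prompt")
```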
Let's use some examples to illustrate how this trade-off works, using the recent Tesla Cybertruck release event. While some models of the truck will be available in 2024, the cheapest model, the RWD-only variant, will not be available until 2025. Depending on the formatting and chunking of the text used for RAG, the context retrieved for the model may or may not include this fact!
In these images, blue indicates where a match was found and the chunk was returned; the grey box indicates the chunk was not retrieved; and the red text indicates where relevant text existed but was not retrieved. Let’s take a look at an example where shorter chunks succeed:
Exhibit A: Shorter chunks are better… sometimes.
In the image above, on the left, the text is structured so that the statement that the RWD model won't arrive until 2025 is separated by a paragraph from the text that most strongly matches the query. Retrieving two shorter chunks works better here because together they capture all of the relevant information. On the right, the retriever returns only a single chunk, which has no room for the additional detail, so the model is given incomplete context and answers incorrectly.
However, this isn’t always the case; sometimes longer chunks work better when text that holds the true answer to the question doesn’t strongly match the query. Here’s an example where longer chunks succeed:
Exhibit B: Longer chunks are better… sometimes.
Improving the performance of a RAG system involves several strategies that focus on optimizing different components of the architecture:
Enhance Data Quality (Garbage in, Garbage out): Ensure the quality of the context provided to the LLM is high. Clean up your source data and make sure your data pipeline preserves the content that matters, for example by capturing the relevant information and stripping unnecessary markup. Carefully curate the data used for retrieval to ensure it's relevant, accurate, and comprehensive.
Tune Your Chunking Strategy: As we saw earlier, chunking really matters! Experiment with different text chunk sizes to maintain adequate context. The way you split your content can significantly affect the performance of your RAG system. Analyze how different splitting methods impact the context's usefulness and the LLM's ability to generate relevant responses.
Optimize System Prompts: Fine-tune the prompts used for the LLM to ensure they guide the model effectively in utilizing the provided context. Use feedback from the LLM's responses to iteratively improve the prompt design.
Filter Vector Store Results: Implement filters to refine the results returned from the vector store, ensuring that they are closely aligned with the query's intent. Use metadata effectively to filter and prioritize the most relevant content; a short sketch of metadata filtering follows this list.
Experiment with Different Embedding Models: Try different embedding models to see which provides the most accurate representation of your data. Consider fine-tuning your own embedding models to better capture domain-specific terminology and nuances.
Monitor and Manage Computational Resources: Be aware of the computational demands of your RAG setup, especially in terms of latency and processing power. Look for ways to streamline the retrieval and processing steps to reduce latency and resource consumption.
Iterative Development and Testing: Regularly test the system with real-world queries and use the outcomes to refine the system. Incorporate feedback from end-users to understand performance in practical scenarios.
Regular Updates and Maintenance: Regularly update the knowledge base to keep the information current and relevant. Adjust and retrain models as necessary to adapt to new data and changing user requirements.
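As an example of the vector store filtering mentioned above, here is a hedged sketch using the chromadb client; the collection name, documents, and metadata fields are hypothetical:

```python
# Illustrative metadata filtering with a Chroma vector store.
import chromadb

client = chromadb.Client()
collection = client.create_collection("product_announcements")

# Each chunk is stored alongside metadata we can later filter on.
collection.add(
    ids=["chunk-1", "chunk-2"],
    documents=[
        "The dual-motor AWD Cybertruck ships in 2024.",
        "The single-motor RWD Cybertruck is not expected until 2025.",
    ],
    metadatas=[
        {"topic": "cybertruck", "year": 2024},
        {"topic": "cybertruck", "year": 2025},
    ],
)

# At query time, the metadata filter narrows the candidate chunks before
# similarity ranking, which helps keep the retrieved context on-topic.
results = collection.query(
    query_texts=["When will the cheapest Cybertruck be available?"],
    n_results=2,
    where={"topic": "cybertruck"},
)
print(results["documents"])
```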
RAG is a highly effective method for enhancing LLMs due to its ability to integrate real-time, external information, addressing the inherent limitations of static training datasets. This integration ensures that the responses generated are both current and relevant, a significant advancement over traditional LLMs. RAG also mitigates the issue of hallucinations, where LLMs generate plausible but incorrect information, by supplementing their knowledge base with accurate, external data. The accuracy and relevance of responses are significantly enhanced, especially for queries that demand up-to-date knowledge or domain-specific expertise.
Furthermore, RAG is customizable and scalable, making it adaptable to a wide range of applications. It offers a more resource-efficient approach than continuously retraining models, as it dynamically retrieves information as needed. This efficiency, combined with the system's ability to continuously incorporate new information sources, ensures ongoing relevance and effectiveness. For end-users, this translates to a more informative and satisfying interaction experience, as they receive responses that are not only relevant but also reflect the latest information. RAG's ability to dynamically enrich LLMs with updated and precise information makes it a robust and forward-looking approach in the field of artificial intelligence and natural language processing.