Long context window models vs. RAG

What is the best way to give a large language model the context it needs to answer a specific query or accomplish a specific task? Today we will explore two popular methods: 1) “prompt stuffing” a model with a long context window and 2) Retrieval Augmented Generation (RAG). Let’s dive in.

Prompt stuffing LLMs with long context windows

A few months back, OpenAI announced GPT-4 Turbo, a model with a 128k input context window. A few months before that, Anthropic released Claude 2, a model with a 100K input context window.

There’s a lot to like about long context windows. It’s convenient to drop a bunch of files into an LLM prompt, ask a few questions and get neatly organized answers. However, empirical evidence shows that “prompt stuffing” decreases answer accuracy (higher risk of hallucinations) and increases costs (due to the greater computational requirements of processing long context windows).

Lost in the middle: the accuracy problem with prompt stuffing

In their November 2023 paper “Lost in the Middle”, Liu (Stanford) and team found that models with long context windows do not robustly retrieve information that is buried in the middle of a long prompt. The bigger the context window, the greater the loss of mid-prompt context.

Figure 1: LLM response accuracy goes down when context needed to answer correctly is found in the middle of the context window. The problem gets worse with larger context models.

Source: “Lost in the Middle: How Language Models Use Long Contexts”, F. Liu et al. 2023.

“LLM accuracy is highest when relevant information occurs at the very start or end of a long input sequence, and rapidly degrades when models must reason over information in the middle of their input context” - F. Liu et al. 2023

F. Liu et al. tested several open source and proprietary models on a simple question-answering test. Each test involved a question (like "Who got the first Nobel prize in physics?"). The models were given up to 30 documents or passages from Wikipedia to help them answer the question. Here's the catch: only one of these documents actually contained the answer. The rest did not and were there to distract the models. In each experiment, the researchers changed the position of the one document that contained the answer.

They found that model accuracy was highest when the relevant document was at either the beginning or the end of the context window (see Figure 1).
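To make the setup concrete, here is a minimal sketch of how such a position sweep could be built. It is not the authors' code: the function names, the prompt template and the document numbering are assumptions made for illustration.

```python
def build_prompt(question, gold_doc, distractors, gold_position):
    """Build one test prompt with the answer-bearing document at a chosen index.

    The "Document [n]: ..." template is an illustrative assumption,
    not the exact format used in the paper.
    """
    docs = list(distractors)
    docs.insert(gold_position, gold_doc)
    numbered = [f"Document [{i + 1}]: {doc}" for i, doc in enumerate(docs)]
    return "\n".join(numbered) + f"\n\nQuestion: {question}\nAnswer:"


def position_sweep(question, gold_doc, distractors):
    """Yield (position, prompt) for every slot the gold document can occupy."""
    for position in range(len(distractors) + 1):
        yield position, build_prompt(question, gold_doc, distractors, position)
```

Each prompt from `position_sweep` would be sent to the model under test, and accuracy would then be plotted against `position`, which is the kind of curve Figure 1 describes.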

Compute: the cost problem with long context windows

Running inference on a request with a large input prompt requires more memory and compute than running it on a request with a small input prompt. This means it’s more expensive.

We confirmed this in an experiment where we ran prompts of various lengths through a Llama 2 70B chat variant.

As you would expect, all else being equal, longer prompts are slower to process end-to-end, leading to greater costs. Even with hardware held constant, inference costs per request go up with longer input sequences.

Figure 2a: Inference throughput goes down with longer input sequences

Figure 2b: Inference costs per request go up with longer input sequences

The cost difference gets even larger in production, where request volumes are several orders of magnitude larger. This is especially true for real-time and near-real-time use cases supporting hundreds or thousands of concurrent users.

Is more context worth it though? It might be. For complex coding use cases, ingesting a full application may be needed. 128k tokens is barely enough to hold the raw HTML of a single web page or to ingest the full contents of all documents that might be required to navigate complex knowledge and draw nuanced conclusions. However, as of the time of writing (early 2024), these use cases are far beyond the reasoning capabilities of even the most powerful AI models.

128k tokens is roughly equivalent to 300 pages of text. Most question-answering use cases don’t require anywhere close to 300 pages of information. If I am looking for the order details of customerID #19248103910’s November 29th purchase, all I need is a few hundred tokens of information to know what this customer bought, how much they paid, their shipping address and their order status. Any irrelevant context fed into the model slows down my query, costs me more money to process and increases the likelihood that my model will get distracted and hallucinate.
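A quick back-of-the-envelope calculation shows the size of the gap. The sketch below assumes flat per-token pricing on the input side. The price itself is a made-up placeholder, and 500 tokens stands in for "a few hundred tokens". Only the 128k-token and 300-page figures come from the text above.

```python
# Hypothetical price, for illustration only; check your provider's pricing.
PRICE_PER_1K_INPUT_TOKENS = 0.01  # USD, assumed

CONTEXT_WINDOW_TOKENS = 128_000   # full window, as discussed in the text
TARGETED_PROMPT_TOKENS = 500      # "a few hundred tokens" (assumed value)
PAGES = 300                       # page equivalent quoted in the text


def input_cost(tokens: int, price_per_1k: float = PRICE_PER_1K_INPUT_TOKENS) -> float:
    """Input-side cost of one request under flat per-token pricing."""
    return tokens / 1000 * price_per_1k


tokens_per_page = CONTEXT_WINDOW_TOKENS / PAGES
# Under flat per-token pricing the ratio does not depend on the price.
cost_ratio = CONTEXT_WINDOW_TOKENS / TARGETED_PROMPT_TOKENS

print(f"~{tokens_per_page:.0f} tokens per page")
print(f"stuffed prompt:  ${input_cost(CONTEXT_WINDOW_TOKENS):.4f} per request")
print(f"targeted prompt: ${input_cost(TARGETED_PROMPT_TOKENS):.4f} per request")
print(f"input cost ratio: {cost_ratio:.0f}x")
```

Whatever the actual price, the ratio between the two prompts is what matters. It only covers billed input tokens, so it says nothing about the latency difference.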

Here’s where RAG helps.

Retrieval Augmented Generation (RAG)

At a high level, RAG retrieves contextually relevant information from a collection of documents held in a data store outside the model (typically a vector store) and appends that context to the prompt that is ultimately sent to the large language model (see Figure 3 below). If the RAG system is well constructed, the LLM is fed only the context it needs to provide an accurate response to the user’s query.
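The retrieve-then-append flow can be sketched in a few lines. This is a toy illustration, not a production design: `embed` is a bag-of-words stand-in for a real embedding model, the in-memory list stands in for a vector store, and the prompt template and sample chunks are invented for the example.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_rag_prompt(query, chunks, k=2):
    """Append only the retrieved chunks to the prompt sent to the LLM."""
    context = "\n".join(f"- {c}" for c in retrieve(query, chunks, k))
    return (
        "Use only the context below to answer.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )


# Invented sample data standing in for the document store.
chunks = [
    "Order 1001 for customer 42 shipped on November 30.",
    "Our return policy allows refunds within 30 days.",
    "Customer 42 paid 59.99 USD for order 1001.",
    "The office is closed on public holidays.",
]
prompt = build_rag_prompt("What did customer 42 pay for order 1001?", chunks)
```

With these sample chunks, only the two order-related lines end up in `prompt`. The return policy and office hours never reach the model.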

Figure 3: anatomy of a Retrieval Augmented Generation System

Source: Maxim Saplin - GPT-4, 128K context - it is not big enough (Nov 2023)

RAG has a lot going for it:

Retrieval systems are time-tested tech that has been optimized over decades to extract relevant information from large corpora of text and documents cost-effectively
Extendability - documents can be easily added to the vector store to expand the overall RAG system’s coverage
Durability - documents within the data store can be updated to keep the data fresh over time without changing anything about the LLM
Flexibility - there are a number of parameters one can play with to optimize RAG (e.g., chunk size, overlap, retrieval method, re-ranker, embedding model, chunk metadata + filters)
Cost-efficient vs. prompt stuffing - RAG only appends text chunks relevant to the user’s query to the LLM prompt. All things equal, shorter prompts make inference faster and cheaper. Prompt stuffing feeds the LLM all documents that might contain the context relevant to the user’s query, a rather ham-fisted approach in comparison.
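Two of the tuning parameters mentioned above, chunk size and overlap, are easy to show in code. The sketch below is one simple way to do it, assuming word-based splitting. Real systems often split on tokens, sentences or document structure instead, and the function name and defaults here are illustrative.

```python
def chunk_words(text, chunk_size=200, overlap=50):
    """Split text into word windows of `chunk_size` that share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


# Ten "words", windows of four, each sharing two words with its neighbor.
sample = " ".join(str(i) for i in range(10))
print(chunk_words(sample, chunk_size=4, overlap=2))
# ['0 1 2 3', '2 3 4 5', '4 5 6 7', '6 7 8 9']
```

A larger `overlap` reduces the chance that a fact is cut in half at a chunk boundary, at the cost of storing and embedding more redundant text.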

RAG is not without drawbacks though. Setting up a RAG system involves managing both the generation model and the retriever component, which increases complexity:

Chunking is difficult and important: if your chunks are too small (insufficient information density) or too large (low signal-to-noise ratio in the chunk), the retrieval system could end up fetching results that are irrelevant to the user’s query or exceeding the model’s context limit. Either scenario is computationally inefficient and increases hallucination risk (more info on chunking for LLM applications here)
Retrieval is important: when a user passes a query through a RAG system, the available chunks are scored and ranked from most to least likely to be relevant to the user’s query (usually via top-k embedding similarity). Only the most relevant chunks are passed on to the LLM. Depending on the data structure and type (text, images, tables), different similarity measures (e.g., Euclidean distance, cosine similarity, Jaccard similarity) can provide better results.
You may need to fine-tune an embedding model: pre-trained embedding models were optimized for their pre-training objectives. In order to retrieve the right context, your embedding model might need to be optimized for the specific semantic structures that matter to your use case.
Chunk ranking is difficult and important: traditional RAG uses a top-k embedding similarity look-up to decide which pieces of context are relevant to the query and should be added to the prompt. The thing is: top-k embedding similarity search (the most common retrieval method) is good at taking a large corpus of data and short-listing potentially relevant chunks, but it can fall short on finding the best chunks within the short-list. This is where re-rankers come in as an additional step to ensure only the most relevant chunks are passed on to the LLM (more info on re-ranking here)
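The similarity measures and the two-stage "short-list, then re-rank" pattern mentioned above can be sketched as follows. This is an illustration under assumptions, not a reference implementation: the function names are invented, and `first_stage_score` and `rerank_score` are placeholders for whatever scorers you plug in (for example, an embedding similarity for the first stage and a slower, more precise model for the second).

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two dense vectors (lower = closer)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Angle-based similarity between two dense vectors (higher = closer)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def jaccard_similarity(a, b):
    """Set overlap between two collections of items, e.g. tokens."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve_then_rerank(query, chunks, first_stage_score, rerank_score,
                         shortlist_k=10, final_k=3):
    """Short-list with a cheap scorer, then reorder with a more precise one.

    Both scorers take (query, chunk) and return a number where higher
    means more relevant. Only the short-list is seen by the re-ranker.
    """
    shortlist = sorted(
        chunks, key=lambda c: first_stage_score(query, c), reverse=True
    )[:shortlist_k]
    return sorted(
        shortlist, key=lambda c: rerank_score(query, c), reverse=True
    )[:final_k]
```

Keeping `shortlist_k` larger than `final_k` is the point of the design: the cheap scorer only has to get the right chunks somewhere into the short-list, and the expensive scorer only has to run on a handful of candidates.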

Despite these difficulties, for most present-day use cases RAG results in more accurate answers, faster response times and lower costs in production than prompt stuffing a model with a long context window.

But RAG alone is often not enough. Fine-tuning is often required in conjunction with RAG to:

Enhance model reasoning capabilities in specific domains
Increase LLM output reliability and consistency (e.g., systematically outputting valid JSON)
Further reduce input prompt length, decreasing costs in production
Build sustainable differentiation and defensibility into your product (a fine-tuned model is your I.P. and source of differentiation)

For a deeper look into how RAG and fine-tuning come together, check out our post on how to improve LLM response quality.

We are excited about helping application developers build better AI-powered apps.

Check out our API if you’re looking to test different models to power your RAG system.
Want to test models but don't feel like coding? Try the no-code playground.
Looking to create a custom model fine-tuned on your data? We can help.

References:

Lost in the Middle: How Language Models Use Long Contexts (F. Liu et al.)

GPT-4, 128K context - it is not big enough (Maxim Saplin)

Less is More: Why Use Retrieval Instead of Larger Context Windows (Pinecone)
