How to increase LLM response quality?

Today we’ll take a look at how to increase Large Language Model (LLM) response quality. For an application to be successful, it can’t lose money for prolonged periods of time. As such, we will also consider how different approaches to increasing LLM response quality affect cost and speed.


We’ll share a structured approach on how to think about response improvement techniques like fine-tuning and RAG holistically and dive into the pros and cons of each technique.

The goal is to help you improve the quality of your large language model responses at the best possible cost and speed for your application. We hope you’ll walk away from this with a better sense for when to apply which technique and how to get the results you want out of your app. Let’s dive in.

A mental model to improve LLM response quality:

When it comes to improving LLM response quality, you have three primary techniques at your disposal:

  • Prompt engineering - tailoring the prompts to guide the model’s responses
  • Retrieval-Augmented Generation (RAG) - supplying the model with context from data sources that the model wasn’t trained on (e.g., your proprietary knowledge base)
  • Fine-tuning - customizing a model by partially re-training it on domain or task-specific data

Each of these techniques serves a different purpose. They are additive. If fine-tuning helps, it can be combined with prompt engineering and/or RAG to improve performance even further. The right answer for your use-case likely involves using multiple techniques simultaneously.

Because each technique brings something different to the table, it’s important to understand the advantages and limitations of each. This is the foundation of a clear mental model for when to use which technique. Here’s a summary (we go into much greater detail below):

  • Prompt engineering is a great starting point: it’s cost-effective and easy to iterate with, but it doesn’t scale well.
  • Retrieval Augmented Generation (RAG) works great to supply the model with context or information it did not have in pre-training but needs to get the job done (e.g., information specific to your company).
  • Fine-tuning is great to teach the model to consistently behave the way you want it to (e.g., I want my model to always output valid JSON).

Source: A Survey of Techniques for Maximizing LLM Performance (OpenAI)

With this mental model in place, it becomes much easier to follow a structured and iterative approach to systematically improve response quality.

  • Start by prompt engineering a high-performing model and take it as far as you can
  • Once you’ve maxed out response quality with prompt engineering alone, pause and ask yourself: “when my model’s response quality is poor, is it because it’s missing context / knowledge or is it because of the way it’s responding?”
  • If the model’s remaining shortcomings are due to a lack of context or knowledge, you should explore RAG
  • If the problem is the way the model is responding, (e.g., the output structure is off, the model’s behavior is inconsistent, the model’s tone is off), you should explore fine-tuning
  • Every time you make a change to your model or the system around your model, pause, measure response quality and ask yourself the same question: “when response quality is poor, is it because it’s missing context / knowledge or is it because of the way it’s responding?” The answer should guide your next step.
  • Keep iterating until your model is consistently responding as expected. Be ready: it may take multiple iterations and experimentation across multiple techniques to get there.
  • If you’re happy with your model’s response quality but cost and/or latency are an issue, then consider using a smaller fine-tuned version of the same model to reduce inference cost and increase speed

Now that we’ve laid out the broad approach for how prompt engineering, RAG and fine-tuning come together to maximize LLM response quality and performance, let’s take a closer look at each technique on its own.

Prompt engineering:

TLDR: 🐎 Prompt engineering is the best first step: easy and cheap to get started with but will only get you so far (limited personalization, expensive inference for long prompts)

Prompt engineering means designing the language model’s input prompt to steer it towards a desired behavior.

Clear and specific prompting, proper context and examples can drastically improve response accuracy and relevance.

Advantages:
  • Easy and cost-effective to get started with: all you need is a pre-trained model. No additional software, data or model re-training required.
  • Versatile: you can achieve broad behavioral changes in model responses with a few examples and minor prompting tweaks
  • Easy to iterate: you can tweak a prompt with a few keystrokes and immediately observe changes in response / behavior. You can start with a simple prompt and progressively iterate your way into richer, more context and example-heavy versions.
Limitations and considerations:
  • Limited personalization: most foundation models have been pre-trained on publicly available data like Wikipedia and data scraped from the web. As a result, their knowledge base is broad but shallow. Pre-trained foundation models probably don’t know the intricacies of your application and they definitely don’t know anything about your users or their relationship to your product. Given context window limits, it’s unlikely you can provide a personalized experience at scale for each user / company / industry.
  • High costs at scale: detailed instructions and examples lead to longer prompts. All things equal, longer prompts require more computational resources (GPU seconds) to process. In fact, when running inference, doubling the size of the input prompt can roughly quadruple the attention layers’ memory and compute requirements (see the note after this list).
  • High latency in production: longer prompts take more time to process, leading to a poor UX for real-time or near real-time use-cases. This problem is compounded when processing multiple concurrent requests. It can be mitigated by adding more GPUs (albeit at a higher cost).
  • Poor edge case coverage: the amount of context you can supply a model is limited by the model’s context window length. Unless your use-case is very simple, you won’t be able to instruct the model on how to behave on every edge-case.
  • Model size matters: small models have limited in-context learning capabilities and will benefit very little from prompt engineering
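
A quick note on the cost point above: per layer, the self-attention computation scales with the square of the prompt length n (for a fixed hidden size d), which is where the roughly-4x figure comes from. Other components of the model (e.g., the feed-forward layers) scale linearly with n, so the end-to-end factor in practice sits somewhere below 4x.

$$\text{attention cost per layer} \propto n^2 d \quad\Rightarrow\quad \frac{\text{cost}(2n)}{\text{cost}(n)} = \frac{(2n)^2\,d}{n^2\,d} = 4$$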
Getting the most out of prompt engineering:

LLMs benefit from clear instructions. The prompt engineering methods described below can sometimes be combined for greater effectiveness.

  • Write clear and specific instructions - If your instructions are vague, the model will have to take a guess to provide a response, increasing the risk of hallucinations or off-topic responses. Clearly state your request or question.
  • Provide sufficient relevant context - this helps the LLM understand the broader picture and tailor its response accordingly.
  • Provide examples (few-shot prompting) - demonstrating the desired input-output format across one or more examples gives the model a reference to base its response on
  • Specify the desired length of output - You can ask the model to produce outputs of a specific length (in count of words, sentences, paragraphs or bullet points)
  • Specify the tone or style - (e.g., formal, casual, technical). This ensures that the response matches your expectations in terms of presentation.
  • Specify intermediary steps - if the model needs to systematically follow pre-determined steps to carry out its task you can lay them out in the prompt
  • Ask the model to expose its thinking - explicitly instructing the model to reason from first principles before coming to a conclusion helps. Small prompt additions such as “break down your reasoning into its component steps before answering” or “think step-by-step” have been shown to improve performance

Sample prompt - customer support bot
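
As one illustration, here’s a sketch of what a customer-support prompt applying the tips above might look like, assembled in Python. The product, policies and example exchange are hypothetical placeholders.

```python
# A hypothetical customer-support prompt applying the tips above: clear
# instructions, relevant context, a few-shot example, a length limit, a
# tone requirement and a step-by-step cue. All details are made up.
system_prompt = """You are a support assistant for Acme Cloud (a fictional product).

Instructions:
- Answer only questions about Acme Cloud billing and account access.
- Use the context below; if the answer isn't there, say you don't know
  and offer to open a support ticket.
- Keep answers under 120 words, in a friendly, professional tone.
- Think through the relevant policy step by step before answering.

Context:
- Refunds are available within 30 days of purchase for annual plans.
- Password resets are self-serve via Settings > Security.

Example:
User: Can I get my money back? I bought the annual plan last week.
Assistant: Yes. Annual plans purchased within the last 30 days are
refundable. I can open a refund request for you, or you can do it
yourself under Billing > Refunds.
"""

# The prompt is typically paired with the live user question in a
# chat-style message list before being sent to the model.
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "I forgot my password, what do I do?"},
]
```

Note how the instructions, context, few-shot example, length limit, tone and step-by-step cue from the list above each show up in the prompt.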

Because of its simplicity, prompt engineering is an ideal starting point when tackling a new use-case. Test different prompts and take note of where your model is falling short. This is where our other tools come into play:

  • Experiment with RAG if the model is behaving as expected but its responses are missing context
  • Fine-tune if the model isn’t consistently behaving as expected (e.g., you’re asking for a SQL query and getting prose instead, or the response style or structure is off)

You can test prompting strategies across multiple models side-by-side in our playground.

Where to go once you’ve exhausted the limits of prompt engineering?

Source: A Survey of Techniques for Maximizing LLM Performance (OpenAI)

Retrieval-Augmented Generation (RAG)

TLDR: 📰 RAG is great for supplying the model with relevant, up-to-date context from external knowledge sources

Retrieval-Augmented Generation (RAG) means augmenting an LLM’s prompt with relevant chunks of context retrieved from an external knowledge base. When a user sends a query, the RAG system fetches chunks of data relevant to that query from the external knowledge base using semantic search and appends them to the LLM’s prompt before generating the response.

The external database can contain domain/industry/product context and can be updated without changing anything about the LLM.
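
To make that flow concrete, here’s a minimal retrieve-and-augment sketch. It assumes the sentence-transformers library for embeddings (any embeddings model would do) and a plain Python list of made-up chunks standing in for the vector database.

```python
# Minimal retrieve-and-augment sketch: embed the query, rank stored chunks
# by cosine similarity, and prepend the top matches to the prompt.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Placeholder knowledge base; in practice these chunks would live in a
# vector database and be embedded at ingestion time.
chunks = [
    "Annual plans are refundable within 30 days of purchase.",
    "Password resets are self-serve via Settings > Security.",
    "Enterprise customers get a dedicated support channel.",
]
chunk_vectors = embedder.encode(chunks)

def build_rag_prompt(query: str, top_k: int = 2) -> str:
    """Fetch the chunks most similar to the query and prepend them to the prompt."""
    query_vec = embedder.encode([query])[0]
    # Cosine similarity between the query and every stored chunk
    scores = chunk_vectors @ query_vec / (
        np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(query_vec)
    )
    best = np.argsort(scores)[::-1][:top_k]
    context = "\n".join(chunks[i] for i in best)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

print(build_rag_prompt("Can I get a refund on my annual plan?"))
```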

Advantages:
  • Data freshness: RAG can pull in information from a large, constantly updated knowledge base, allowing it to supply up-to-date data to a static pre-trained model
  • Ability to incorporate large knowledge corpuses: RAG systems can scale their knowledge bases easily by updating or expanding their retrieval sources, without the need for retraining the entire model
Limitations and considerations:
  • Does not give the model depth or understanding of a broad knowledge domain; it merely feeds the same model context deemed relevant to the user’s query
  • Does not fix the issue of high costs and latency at scale: RAG relies on model input prompts to pass on context relevant to the query, leading to long prompts. Long prompts mean higher compute consumption per query, leading to higher costs and slower inference at scale
  • Introduces an additional context search and retrieval system on top of the LLM. Retrieval has its own set of vulnerabilities: it may fetch information that is irrelevant to the user’s query, eroding response quality
Getting the most out of RAG:

RAG has many failure modes. Oftentimes, data preparation and retrieval fall short. If the RAG system fetches context that is unhelpful to the query, the LLM’s capabilities won’t matter: response quality will be poor.

  • Chunking is important: RAG fetches data stored in vector databases. Any data stored in a vector database needs to be embedded first. What this means is unstructured data (like a PDF) gets chunked and each chunk gets turned into an n-dimensional vector that captures its semantic meaning. These vectors are then stored in the vector database. If your chunks are too small (insufficient information density) or too large (low signal-to-noise ratio within the chunk), the retrieval system could end up fetching results that are irrelevant to the user’s query or exceeding the model’s context limit (more info on chunking for LLM applications here)
  • Retrieval is important: When a user passes a query through a RAG system, the available chunks are ranked from most to least likely to be relevant to the user’s query (usually via top-k embedding similarity) and the most relevant chunks are passed on to the LLM. Depending on the data structure and type (text, images, tables), different similarity measures (e.g., Euclidean distance, cosine similarity, Jaccard similarity) can provide better results.
  • Fine-tune an embeddings model: pre-trained embeddings models were optimized to their pre-training objectives. These may not align to your own retrieval objectives and data. For example, if you’re building for an insurance use-case, words like “network”, “rider” and “waiting period” mean something very different than they do in day-to-day life. A general purpose embeddings model will get things mixed up and fall short. In order to retrieve the right context, your embeddings model might need to be optimized to the specific semantic structures that matter to your use-case.
  • Rerankers and two-stage retrieval - traditional RAG uses top-k embedding similarity look-up to decide which pieces of context are relevant to the query and should be added to the prompt. The thing is, top-k embedding similarity search is good at taking a large corpus of data and short-listing potentially relevant chunks, but it can fall short on finding the best chunks within the short-list. This is where re-rankers come in as an additional step to ensure only the most relevant chunks are passed on to the LLM (see the sketch after this list; more info on re-ranking here)
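
Here’s a sketch of that two-stage flow, again assuming the sentence-transformers library; the embedding and cross-encoder checkpoints named below are common public models, but you’d swap in whatever fits your domain.

```python
# Two-stage retrieval: a fast embedding similarity search shortlists
# candidate chunks, then a cross-encoder reranker re-scores the shortlist
# against the query so only the most relevant chunks reach the LLM.
import numpy as np
from sentence_transformers import CrossEncoder, SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve(query: str, chunks: list[str],
             shortlist_k: int = 20, final_k: int = 3) -> list[str]:
    # Stage 1: cheap top-k shortlist via embedding similarity
    chunk_vecs = embedder.encode(chunks)
    query_vec = embedder.encode([query])[0]
    scores = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    shortlist = [chunks[i] for i in np.argsort(scores)[::-1][:shortlist_k]]

    # Stage 2: the cross-encoder scores each (query, chunk) pair directly,
    # which is slower but better at fine-grained relevance
    rerank_scores = reranker.predict([(query, c) for c in shortlist])
    order = np.argsort(rerank_scores)[::-1][:final_k]
    return [shortlist[i] for i in order]
```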

See here for a cool framework to measure RAG pipeline generation and retrieval quality

Fine-tuning:

TLDR: 🏎️ Fine-tuning is great to optimize performance on domain-specific tasks, improve efficiency at scale and build defensibility

Fine-tuning a language model involves updating its weights using a task-specific dataset to improve its performance in a particular task or domain. The process results in a new model altogether.

A pre-trained foundation model acts as a starting point. Fine-tuning adjusts some of the pre-trained model’s weights to better fit the fine-tuning dataset. The result is a better understanding of the specific context and language patterns of the task it is being fine-tuned for (without the need to train a model from scratch).

Fine-tuning is ideal to control the way in which your model responds (e.g., consistently output SQL). Another big reason is so you can use a small model to do a task that would normally require a much larger model. This means you can do the same task, but cheaper and faster.
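
As a concrete illustration of the “consistently output SQL” case, a fine-tuning dataset is typically just a set of prompt/completion pairs. The schema, questions and field names below are hypothetical, and the exact format depends on the fine-tuning framework or service you use.

```python
import json

# A few hypothetical training rows teaching the model to always answer
# with a SQL query against a specific schema. Field names vary by
# fine-tuning framework; this is just the general shape of the data.
rows = [
    {
        "prompt": "Schema: orders(id, customer_id, total, created_at)\n"
                  "Question: total revenue in January 2024?",
        "completion": "SELECT SUM(total) FROM orders "
                      "WHERE created_at >= '2024-01-01' AND created_at < '2024-02-01';",
    },
    {
        "prompt": "Schema: orders(id, customer_id, total, created_at)\n"
                  "Question: how many orders has customer 42 placed?",
        "completion": "SELECT COUNT(*) FROM orders WHERE customer_id = 42;",
    },
]

# Fine-tuning services and libraries commonly accept data as JSONL
with open("finetune_data.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```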

Impact of fine-tuning on llama-2-13b across various tasks and performance vs. GPT-4

Advantages:
  • Task-specific depth: fine-tuning tailors the model to specific tasks and domains or knowledge bases. Fine-tuning can incorporate millions or hundreds of millions of tokens of data, orders of magnitude more than what can be fit into a prompt, resulting in responses with greater depth, relevance and accuracy for specific domains / product / users.
  • Reliability and Consistency: fine-tuning can emphasize desired behaviors and relevant knowledge that already exists in a model leading to more reliable and consistent outputs (e.g., systematically outputting valid JSON formats, understanding a specific SQL schema)
  • Control: by customizing the model on a curated dataset, you can have greater control over the format, style, tone of generated outputs, reducing the likelihood of off-target responses (e.g., wrong data format)
  • Faster and cheaper at scale: you can fine-tune a small model (e.g., Mistral 7B) to do a task that would normally require a larger model (e.g., GPT-4). This means you can do the same task, but cheaper and faster. You can also encode your fine-tuned model’s desired behavior and tone into its fine-tuning dataset instead of its prompt. This means no lengthy prompts with instructions or examples. Fine-tuned models have been trained to behave as you want them to, so you don’t need to explain the desired behavior with every single prompt. Shorter prompts mean more cost-effective and faster inference
  • Differentiation: fine-tuning allows for the continuous integration of unique, proprietary, or specialized data. You can set-up regular fine-tuning jobs to incorporate new information so your model is always learning, making it a lasting differentiator that is uniquely specific to your application.
Limitations and considerations:
  • Not good for quickly iterating on a new use-case. Each individual fine-tuning iteration takes time and getting to the right model will take multiple iterations. This should not be the first step in your LLM journey.
  • Base model matters: Fine tuning emphasizes select properties imparted upon the model during pre-training. As such, it is essential to select a base model pre-trained for the task you’d like to tackle. If you’re fine-tuning a model to learn a specific SQL schema, start with a foundation model pre-trained for text-to-SQL use-cases. Fine-tuning a model into acquiring a capability it wasn’t pre-trained for (e.g., teaching a large language model advanced mathematical reasoning) is extremely data and compute intensive.

Fine-tuning can be involved. The wrong technique, configuration or fine-tuning dataset might degrade the base model’s performance or overfit it to the training data. A model may also forget its emergent reasoning capabilities and information it learned during pre-training as it learns new information (this is known as catastrophic forgetting).

  • It’s all about data quality: data quality will drive model performance. Aim for 100 rows of high-quality data at a minimum for effective fine-tuning. This data needs to be thoughtfully gathered, normalized and cleaned for the best results
  • Choice of technique matters: full-parameter fine-tuning involves retraining every parameter in the model. This may be needed for certain use-cases, but it is computationally expensive, requires ample training data and runs the risk of forgetting previously acquired knowledge from pre-training. LoRA is a resource-efficient (therefore cost-effective) technique that works great to improve summarization, text-to-SQL, classification and Q&A use-cases, but won’t do much to improve a model’s ability to do accurate math (especially if that model is small). A minimal LoRA setup is sketched after this list.
  • The process is iterative: Every task and dataset is unique. Getting the best results typically requires experimentation with various techniques, hyper-parameters, freezing and unfreezing layers, loss functions and numbers of epochs. Fine-tuning works best as an iterative process with experimentation and validation across a dozen or more model variants.
  • Maintenance and updating: A fine-tuned model might need periodic retraining to stay current, especially if the domain of application is dynamic and evolves over time.
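
For the LoRA route mentioned in the list above, a minimal setup might look like the sketch below. It assumes the Hugging Face transformers and peft libraries with Mistral 7B as the base model; the hyper-parameters shown are common starting points, not recommendations.

```python
# A minimal LoRA fine-tuning setup (sketch): load a pre-trained base model,
# attach low-rank adapter weights to its attention projections, and train
# only those adapters. Base model and hyper-parameters are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora_config = LoraConfig(
    r=8,                 # rank of the low-rank update matrices
    lora_alpha=16,       # scaling factor applied to the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all weights

# From here, train as usual (e.g., with transformers' Trainer) on your
# task-specific dataset, then load the adapter at inference time.
```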
Getting the most out of fine-tuning:
  • Start small - baseline your foundation model’s performance before fine-tuning. Then, create a small but high quality dataset (10-20 rows) and do an initial fine-tuning run. Measure again and see if you are moving in the right direction. If results are moving in the right direction, double down with more similar data. Otherwise, adjust accordingly.
  • Iterate intentionally - identify areas in which the model is falling short and prepare datasets that specifically address these shortcomings.
  • Focus on data quality - fine-tuning is about refining a model to fit your use-case. Data quality trumps data quantity here.

In closing:

Enhancing Large Language Model response quality is a multifaceted and iterative process. Prompt engineering, RAG and fine-tuning each offer distinct benefits and address different aspects of model optimization.

  • Prompt engineering is the simplest and most cost-effective way to start, though it has scalability limitations
  • RAG enriches the model’s context with external knowledge, bridging gaps in its understanding
  • Fine-tuning tailors the model specifically to your domain or task, leading to improved performance and efficiency (lower cost and faster inference) at scale

The key to maximizing LLM performance is to understand the unique requirements of your application and to combine these techniques effectively.

Start with prompt engineering. Once you hit a plateau, ask yourself if the model’s remaining shortcomings are due to a lack of context or knowledge (go down the RAG route) or if the problem is either the way the model is responding or cost / latency (time to explore fine-tuning).

Remember, the journey to optimize an LLM to your use-case is iterative and requires continuous evaluation and adjustment. Embrace experimentation, measure performance at each step and be prepared to dive into multiple techniques.

We are excited about this problem and are working on ways to help application developers build better AI-powered apps.

  • Looking to test different models and prompts? Check out our API.
  • Don't feel like coding? Try the no-code playground.
  • Looking to create a custom model fine-tuned on your data? We can help.
  • Looking to run AI models in production but don't want to deal with the hassles of running production infrastructure at scale? We can help.

References:

A Survey of Techniques for Maximizing LLM Performance (OpenAI)

Vector Embeddings for Developers: The Basics (Pinecone)

Chunking Strategies for LLM Applications (Pinecone)

Rerankers and Two-Stage Retrieval (Pinecone)

Shivani Modi