The Silver Bullet(s) to Defeating RAG Hallucinations
Spoiler alert: there's no silver bullet to completely eliminating RAG hallucinations... but I can show you an easy path to get very close.
I've implemented hundreds of Retrieval-Augmented Generation (RAG) workflows; trust me, this is the way. The expert diagram below, although a piece of art in and of itself and an homage to Street Fighter, also represents the two RAG models that I pitted against each other to win the RAG Fight belt and showcase a RAG champion:
On the left of the diagram is the model of a basic RAG. It represents the ideal architecture for the ChatGPT and LangChain weekend warriors living on the Pinecone free tier.
On the right is the model of the "silver bullet" RAG. If you added hybrid search, it would basically be the FAANG of RAGs. (You can deploy the "silver bullet" RAG in one click using a template.)
Given a set of 99 questions about a specific domain (33 easy, 33 medium, and 33 technical hard… huge sample size, I know), I asked each RAG every question and hand-checked the results. Here's what I observed:
Basic RAG
- Easy: 94% accuracy (31/33 correct)
- Medium: 82% accuracy (27/33 correct)
- Technical Hard: 45% accuracy (15/33 correct)
Silver Bullet RAG
- Easy: 100% accuracy (33/33 correct)
- Medium: 94% accuracy (31/33 correct)
- Technical Hard: 82% accuracy (27/33 correct)
So, what are the "silver bullets" in this case?
- Generated Knowledge Prompting
- Multi-Response Generation
- Response Quality Checks
Let's delve into each of these:
1. Generated Knowledge Prompting
Enhance. Generated Knowledge Prompting enriches the input prompt with knowledge the model generates in an earlier step. By folding that intermediate output and the relevant retrieved information back into the prompt, the model gains additional context that lets it work through complex topics more thoroughly.
This technique is especially effective with technical concepts and nested topics that may span multiple documents. For example, before attempting to answer the user’s input, you can pass the user’s query and semantic search results to an LLM with a prompt like this:
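Something along these lines works; treat it as a rough sketch rather than the exact prompt. `call_llm` is a hypothetical stand-in for whatever chat-completion client you use, and the wording is just one way to phrase the knowledge-generation step:

```python
def call_llm(prompt: str) -> str:
    """Hypothetical helper -- wire this up to your actual chat-completion client."""
    raise NotImplementedError

KNOWLEDGE_PROMPT = """\
You are preparing background knowledge for a question-answering system.

User question:
{query}

Excerpts retrieved via semantic search:
{chunks}

Do NOT answer the question yet. Instead, list the key facts, definitions,
and relationships from the excerpts (and your own domain knowledge) that
are needed to answer it. If something important is missing from the
excerpts, say so explicitly.
"""

def generate_knowledge(query: str, chunks: list[str]) -> str:
    # Produce the intermediate "generated knowledge" that gets prepended to
    # the final answering prompt alongside the original query and chunks.
    prompt = KNOWLEDGE_PROMPT.format(query=query, chunks="\n\n".join(chunks))
    return call_llm(prompt)
```

The output of `generate_knowledge` then rides along with the original query and chunks into the prompt that actually writes the answer.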
Benefits:
- Enhances understanding of complex queries.
- Reduces the chances of missing critical information in semantic search.
- Improves coherence and depth in responses.
2. Multi-Response Generation
Multi-Response Generation involves generating multiple responses for a single query and then selecting the best one. By leveraging the model's ability to produce varied outputs, we increase the likelihood of obtaining a correct, high-quality answer. Kinda like mutation in evolution (it's still OK to say the "e" word, right?).
How it works (a minimal sketch follows this list):
- Multiple Generations: For each query, the model generates several responses (e.g., 3-5).
- Evaluation: Each response is evaluated against predefined criteria such as relevance, accuracy, and coherence.
- Selection: The best response is selected either through automatic scoring mechanisms or a secondary evaluation model.
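Here's a best-of-n sketch of that loop, again with a hypothetical `call_llm` helper standing in for your actual client; the 1-10 scoring prompt and the default of three candidates are illustrative assumptions, not requirements:

```python
def call_llm(prompt: str, temperature: float = 0.7) -> str:
    """Hypothetical helper -- wire this up to your actual chat-completion client."""
    raise NotImplementedError

SCORING_PROMPT = """\
Question:
{query}

Candidate answer:
{answer}

Rate this answer from 1 to 10 for relevance, accuracy, and coherence.
Reply with the number only.
"""

def best_of_n(query: str, answer_prompt: str, n: int = 3) -> str:
    # Multiple generations: sample n candidates at a non-zero temperature
    # so the outputs actually vary.
    candidates = [call_llm(answer_prompt, temperature=0.8) for _ in range(n)]

    # Evaluation: a second LLM call scores each candidate.
    def score(answer: str) -> float:
        raw = call_llm(SCORING_PROMPT.format(query=query, answer=answer),
                       temperature=0.0)
        try:
            return float(raw.strip())
        except ValueError:
            return 0.0  # an unparseable score just loses the comparison

    # Selection: keep the highest-scoring candidate.
    return max(candidates, key=score)
```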
Benefits:
- By comparing multiple outputs, inconsistencies can be identified and discarded.
- The chance of at least one response being correct is higher when multiple attempts are made.
- Allows for more nuanced and well-rounded answers.
3. Response Quality Checks
"Response Quality Checks" is a pseudo-scientific name for basically just double-checking the output before it goes to the end user. This step acts as a safety net to catch potential hallucinations or errors. The ideal path here is a human-in-the-loop approval or QA process in Slack or wherever your team lives, but the quality check can also be automated with meaningful impact.
How it works (sketched in code after this list):
- Automated Evaluation: After a response is generated, it is assessed using another LLM that checks for factual correctness and relevance.
- Feedback Loop: If the response fails the quality check, the system can prompt the model to regenerate the answer or adjust the prompt.
- Final Approval: Only responses that meet the quality criteria are presented to the user.
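An automated version of that loop might look roughly like this; `call_llm` is once more a hypothetical stand-in, and the PASS/FAIL convention, retry count, and check prompt are assumptions you'd tune for your own stack:

```python
def call_llm(prompt: str) -> str:
    """Hypothetical helper -- wire this up to your actual chat-completion client."""
    raise NotImplementedError

CHECK_PROMPT = """\
Source excerpts:
{chunks}

Question:
{query}

Draft answer:
{answer}

Is the draft answer factually consistent with the source excerpts, and does
it actually address the question? Reply with PASS or FAIL, followed by a
one-sentence reason.
"""

def quality_checked_answer(query: str, chunks: list[str],
                           answer_prompt: str, max_retries: int = 2) -> str | None:
    for _ in range(max_retries + 1):
        draft = call_llm(answer_prompt)

        # Automated evaluation: a second LLM grades the draft against the
        # retrieved sources.
        verdict = call_llm(CHECK_PROMPT.format(
            chunks="\n\n".join(chunks), query=query, answer=draft))

        # Final approval: only a PASS verdict reaches the user.
        if verdict.strip().upper().startswith("PASS"):
            return draft

        # Feedback loop: fold the critique back into the prompt and retry.
        answer_prompt += (
            f"\n\nA reviewer rejected a previous draft for this reason: {verdict}\n"
            "Write a new answer that fixes the issue."
        )

    # Nothing passed -- signal the caller to escalate (e.g. to a human in Slack).
    return None
```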
Benefits:
- Users receive information that has been vetted for accuracy.
- Reduces the spread of misinformation, increasing user confidence in the system.
- Helps in fine-tuning the model for better future responses.
Using these three "silver bullets," I promise you can significantly mitigate hallucinations and improve the overall quality of responses. The "silver bullet" RAG outperformed the basic RAG across every difficulty level, especially on the technical hard questions where accuracy is crucial. Also, people tend to forget this: your RAG workflow doesn’t have to respond. The most reliable way to deploy a customer-facing RAG and avoid hallucinations is to have it simply not answer when it isn’t highly confident it has a solution to the question.
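In code, that "don't respond" rule can be as simple as a fallback branch around the quality check sketched above (stubbed here so the snippet stands alone; the fallback copy is just an example):

```python
def quality_checked_answer(query: str, chunks: list[str],
                           answer_prompt: str) -> str | None:
    """Hypothetical check from the previous section: returns None when no draft passes."""
    raise NotImplementedError

FALLBACK = ("I'm not confident I can answer that from the documentation I have, "
            "so I've flagged it for a human to follow up.")

def respond(query: str, chunks: list[str], answer_prompt: str) -> str:
    # A None result means no draft survived the quality check -- abstain
    # with a canned fallback instead of risking a hallucination.
    answer = quality_checked_answer(query, chunks, answer_prompt)
    return answer if answer is not None else FALLBACK
```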
Disagree? Have better ideas? Let me know!
Happy building~ 🚀