
How AI Is Built #021 The Problems You Will Encounter With RAG At Scale And How To Prevent (or fix) Them
Sep 12, 2024

Nirant Kasliwal, an author known for his expertise in metadata extraction and evaluation strategies, shares invaluable insights on scaling Retrieval-Augmented Generation (RAG) systems. He dives into common pitfalls such as the challenges posed by naive RAG and the sensitivity of LLMs to input. Strategies for query profiling, user personalization, and effective metadata extraction are discussed. Nirant emphasizes the importance of understanding user context to deliver precise information, ultimately aiming to enhance the efficiency of RAG implementations.
Debug With Real User Logs First
- Start debugging by reviewing user complaints and the logged retrieval → prompt → answer chains to identify clear failures.
- Fixes for the first 30–50% of failures are quick; use better prompts, examples, and diagnostics to capture the low-hanging gains.
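A first triage pass over logged chains can be sketched as below. The trace fields and failure buckets are hypothetical, chosen only to illustrate sorting logs into obvious, quick-to-fix failure modes before anything fancier:

```python
from dataclasses import dataclass

@dataclass
class TraceRecord:
    """One logged retrieval -> prompt -> answer chain (hypothetical schema)."""
    query: str
    retrieved: list        # ids of retrieved chunks
    prompt: str
    answer: str
    user_complaint: str = ""  # complaint text, if the user filed one

def triage(records):
    """Sort traces into coarse failure buckets: empty retrieval and
    complained-about answers are the clear failures to review first."""
    buckets = {"empty_retrieval": [], "complained": [], "ok": []}
    for r in records:
        if not r.retrieved:
            buckets["empty_retrieval"].append(r)
        elif r.user_complaint:
            buckets["complained"].append(r)
        else:
            buckets["ok"].append(r)
    return buckets

logs = [
    TraceRecord("reset password", ["doc_12"], "p1", "Go to Settings > Security."),
    TraceRecord("refund policy", [], "p2", "I don't know."),
    TraceRecord("api rate limits", ["doc_3"], "p3", "Unlimited.",
                user_complaint="wrong, the limit is 100 rps"),
]
result = triage(logs)
```

Reviewing the `empty_retrieval` and `complained` buckets by hand is usually enough to find the prompt and retrieval fixes that capture the first tranche of gains.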
Use Balanced Synthetic And Human Eval
- Build an eval set from synthetic QA plus real user complaints, but keep the synthetic volume at or below the number of human-derived cases.
- Use human verification for domain-expert areas and avoid letting synthetic data dominate evaluation.
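The balancing rule above can be sketched as a small helper. The case format is hypothetical; the point is simply capping synthetic cases at the human-derived count so they never dominate the eval:

```python
import random

def build_eval_set(human_cases, synthetic_cases, seed=0):
    """Combine human-derived and synthetic eval cases, downsampling the
    synthetic pool so it never exceeds the human-derived count."""
    rng = random.Random(seed)  # fixed seed for a reproducible eval set
    cap = len(human_cases)
    if len(synthetic_cases) > cap:
        synthetic_cases = rng.sample(synthetic_cases, cap)
    return human_cases + synthetic_cases

humans = [{"q": f"user question {i}", "source": "complaint"} for i in range(20)]
synth = [{"q": f"generated QA {i}", "source": "synthetic"} for i in range(200)]
eval_set = build_eval_set(humans, synth)
```

Domain-expert verification still applies to the human-derived side; the cap only guards against synthetic data swamping the metrics.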
Generate Hard Negatives From Baselines
- Derive hard negatives by running known good queries against a strong baseline (e.g., BM25/Elastic) and flagging mismatches as negatives.
- Use domain experts to verify hard negatives when user feedback shows failures without clear ground truth.
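The mining step can be sketched as follows. A plain BM25 scorer stands in for the Elastic/BM25 baseline, and the corpus, gold labels, and `mine_hard_negatives` helper are illustrative assumptions: top-ranked documents that are not the known-good answer become hard-negative candidates for expert review.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Textbook BM25 scoring over pre-tokenized docs (baseline stand-in)."""
    n = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / n
    df = Counter(t for d in docs_tokens for t in set(d))  # document frequencies
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        s = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

def mine_hard_negatives(query, gold_id, corpus, top_k=3):
    """Run a known-good query against the baseline; docs that rank high
    but are not the gold document become hard-negative candidates."""
    ids = list(corpus)
    tokenized = [corpus[i].lower().split() for i in ids]
    scores = bm25_scores(query.lower().split(), tokenized)
    ranked = sorted(zip(ids, scores), key=lambda p: -p[1])[:top_k]
    return [doc_id for doc_id, s in ranked if doc_id != gold_id and s > 0]

corpus = {
    "d1": "how to reset your account password in settings",
    "d2": "password strength requirements and rotation policy",
    "d3": "shipping and refund policy for hardware orders",
}
negatives = mine_hard_negatives("reset password", gold_id="d1", corpus=corpus)
```

Candidates mined this way still need the expert verification the snip calls for, especially where user feedback flags failures without clear ground truth.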
