Agentic RAG Cookbook
In this lesson: design retrieval first, measure real behavior, and iterate on the signal.
Top 3 takeaways
Retrieval design comes first
A production RAG system lives or dies on retrieval. Get the retrieval layer right before you reach for a bigger model or a cleverer prompt.
Measure real behavior
Evaluation metrics only help when they reflect how the system actually behaves on real queries. A polished demo can hide the issues that show up in production.
Iterate on the signal
A loop that turns evaluation signal into improvement is what gets you from a prototype that demos well to a system you can ship.

Ash Tilawat
CTO, Gauntlet AI
CTO of Gauntlet AI, leading the company's technical direction and AI-native training programs. Has trained 1,200+ engineers across 104 companies and run multiple corporate trainings this year — including the AI sales course that firms like a16z, Mainsail, and PwC brought Gauntlet in to teach. Focused on turning AI from a prototype tool into something teams use in real production workflows, with an emphasis on evaluation and systems thinking.
Lesson notes
A written walkthrough of the lecture, covering the patterns, the code, and the things that trip people up.
What Agentic RAG Is and Why It Exists
Retrieval-Augmented Generation (RAG) improves an LLM by retrieving relevant information before generating a response.
Traditional RAG embeds a query, searches a knowledge base for similar content, and places the retrieved information into the model's context window. This works well until the system starts retrieving too much irrelevant information. Larger context windows don't solve this problem—they often make it worse by overwhelming the model with unnecessary context.
Agentic RAG addresses this by placing an agent between the user and the knowledge base. Instead of retrieving everything, the agent decides what information to retrieve, where to retrieve it from, and when it has enough context to answer the question.
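The traditional pipeline described above can be sketched in a few lines. This is a minimal illustration, not a production design: the `embed` function is a toy bag-of-words stand-in for a real embedding model, and `DOCS`, `retrieve`, and `build_prompt` are hypothetical names invented for this example.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, docs: list[str], top_k: int = 2) -> list[str]:
    """Embed the query, rank every document by similarity, keep the top_k."""
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:top_k]

def build_prompt(query: str, docs: list[str], top_k: int = 2) -> str:
    """Place the retrieved passages into the context that the model will see."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs, top_k))
    return f"Context:\n{context}\n\nQuestion: {query}"

# Hypothetical knowledge base, for illustration only.
DOCS = [
    "refunds are processed within five business days",
    "the office is closed on public holidays",
    "refunds require the original receipt",
]
```

Note that `retrieve` always returns `top_k` passages, whether or not they are relevant. Raising `top_k` to 3 here pulls the unrelated office-hours passage into the prompt, which is the failure mode that larger context windows tend to amplify.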
The Evolution of RAG
RAG systems generally evolve through five stages:
- Naive RAG – Retrieve from a single knowledge base.
- Metadata Filtering – Narrow results using filters like topic, title, or date.
- Hybrid Search – Combine keyword and vector search.
- Graph RAG – Retrieve information through knowledge graphs.
- Agentic RAG – Let an agent choose the best retrieval strategy and tools.
More advanced systems also use multi-hop retrieval, where the agent performs multiple searches before generating a final answer.
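To make the middle stages concrete, here is a rough sketch that combines metadata filtering with hybrid search. It assumes each document already carries a precomputed embedding and a `topic` field. The `Doc` class, the `alpha` blending weight, and the scoring functions are illustrative choices, not something prescribed in the lecture.

```python
import math
from dataclasses import dataclass

@dataclass
class Doc:
    text: str
    topic: str
    vector: tuple[float, ...]  # hypothetical precomputed embedding

def keyword_score(query: str, text: str) -> float:
    """Fraction of query terms that appear verbatim in the text."""
    q_terms = set(query.lower().split())
    t_terms = set(text.lower().split())
    return len(q_terms & t_terms) / len(q_terms) if q_terms else 0.0

def vector_score(a: tuple[float, ...], b: tuple[float, ...]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def hybrid_search(query, query_vector, docs, topic=None, alpha=0.5, top_k=3):
    """Filter by metadata first, then blend vector and keyword scores."""
    candidates = [d for d in docs if topic is None or d.topic == topic]
    scored = [
        (alpha * vector_score(query_vector, d.vector)
         + (1 - alpha) * keyword_score(query, d.text), d)
        for d in candidates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:top_k]]

# Hypothetical documents with made-up two-dimensional embeddings.
DOCS = [
    Doc("reset your password from the login page", "account", (1.0, 0.0)),
    Doc("billing cycles start on the first of the month", "billing", (0.0, 1.0)),
    Doc("change the email on your account", "account", (0.8, 0.6)),
]
```

Filtering before scoring keeps off-topic documents out of the ranking entirely. The weighted sum is only one way to merge the two signals, and the right value for `alpha` is something to tune against your own evaluation set.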
How Agentic RAG Works
At its core, an agent follows a simple loop: reason, act, and observe.
Rather than following a fixed retrieval pipeline, the agent decides:
- Whether retrieval is needed at all.
- Which data source or tool to use.
- How to break complex questions into smaller ones.
- When enough information has been gathered.
- When to stop searching or escalate to a human.
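This can make retrieval more accurate and more efficient than a fixed pipeline, because the agent skips searches it doesn't need and stops once it has enough context. It also adds resilience: when one search comes back empty, the agent can try another source or escalate.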
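The loop and the decisions listed above can be sketched as follows. This is a simplified illustration under stated assumptions: in a real system an LLM would fill the `decide` role, whereas here it is a hand-written rule so the example runs on its own. The tool names, the action dictionary shape, and `max_steps` are all hypothetical.

```python
from typing import Callable

Tool = Callable[[str], list[str]]

def run_agent(question: str, tools: dict[str, Tool], decide, max_steps: int = 4) -> dict:
    """Reason, act, observe until the policy answers or the step budget runs out."""
    observations: list[str] = []
    for step in range(max_steps):
        action = decide(question, observations)            # reason
        if action["type"] == "answer":
            return {"status": "answered", "answer": action["text"], "tool_calls": step}
        results = tools[action["tool"]](action["query"])   # act
        observations.extend(results)                       # observe
    # Budget exhausted without enough context: hand off to a human.
    return {"status": "escalated", "answer": None, "tool_calls": max_steps}

def decide(question: str, observations: list[str]) -> dict:
    """Rule-based stand-in for the LLM that would normally choose the next action."""
    q = question.lower()
    if q in {"hi", "hello"}:
        # No retrieval needed at all.
        return {"type": "answer", "text": "Hello! What can I help you find?"}
    if observations:
        # Enough context gathered: stop searching and answer.
        return {"type": "answer", "text": observations[0]}
    # Pick a data source based on the question.
    tool = "policies" if "refund" in q else "wiki"
    return {"type": "search", "tool": tool, "query": question}

# Hypothetical tools: each takes a query and returns matching passages.
TOOLS: dict[str, Tool] = {
    "policies": lambda query: ["Refunds are processed within five business days."],
    "wiki": lambda query: [],
}
```

The `max_steps` cap matters as much as the policy. Without it, an agent whose searches keep coming back empty would loop indefinitely instead of escalating.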
Evals Are the Competitive Advantage
The lecture emphasizes that evaluation—not retrieval—is what separates great AI systems from average ones.
Every RAG system should measure:
- Groundedness – Is the answer supported by retrieved information?
- Precision – Was the retrieved information actually relevant?
- Recall – Did the system retrieve everything it needed?
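- Latency – How long does the user wait for an answer?
- Cost – How much does each query cost to serve?
- User satisfaction – Do users find the answers helpful?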
Agentic systems should also evaluate tool selection, query decomposition, and stopping behavior. Treat retrieval and evaluation as separate systems so they can evolve independently.
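As a starting point, the first three metrics can be computed with very little code. The sketch below assumes you have labeled which document IDs are relevant for each test query. The `groundedness` function is a deliberately crude token-overlap proxy that I am using for illustration; production systems typically rely on a model-based or human grader instead.

```python
def precision_recall(retrieved: list[str], relevant: set[str]) -> tuple[float, float]:
    """Precision: share of retrieved docs that are relevant.
    Recall: share of relevant docs that were retrieved."""
    unique = list(dict.fromkeys(retrieved))  # drop duplicate IDs, keep order
    hits = sum(1 for doc_id in unique if doc_id in relevant)
    precision = hits / len(unique) if unique else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def groundedness(answer: str, passages: list[str]) -> float:
    """Crude proxy (an assumption of this sketch): the share of answer tokens
    that appear somewhere in the retrieved passages."""
    answer_tokens = answer.lower().split()
    if not answer_tokens:
        return 0.0
    passage_tokens = set(" ".join(passages).lower().split())
    supported = sum(1 for token in answer_tokens if token in passage_tokens)
    return supported / len(answer_tokens)

# Hypothetical labeled example: four docs retrieved, three docs known to be relevant.
precision, recall = precision_recall(["d1", "d2", "d3", "d4"], {"d1", "d3", "d9"})
score = groundedness(
    "refunds take five days",
    ["refunds are processed within five business days"],
)
```

Because these functions take only retrieved IDs, labels, and text as input, they can live in a separate evaluation harness and be rerun unchanged whenever the retrieval layer is swapped out.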
Practical Guidance
Building a production RAG system requires more than retrieval.
Balance latency against answer quality, keep deterministic workflows outside the LLM whenever possible, and continuously refresh your knowledge base. Build evaluation datasets with domain experts, launch with real users, and use human review to calibrate automated graders.
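One way to use human review for calibration is to have reviewers and the automated grader label the same sample of answers, then measure how often they agree beyond chance. The sketch below uses Cohen's kappa for that comparison. That choice of statistic is my assumption rather than something the lecture specifies, and the labels are made-up illustration data.

```python
def cohens_kappa(human: list[bool], grader: list[bool]) -> float:
    """Agreement between two binary labelers, corrected for chance agreement."""
    if len(human) != len(grader) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    observed = sum(h == g for h, g in zip(human, grader)) / n
    p_human = sum(human) / n    # share of answers the human marked as passing
    p_grader = sum(grader) / n  # share of answers the grader marked as passing
    expected = p_human * p_grader + (1 - p_human) * (1 - p_grader)
    if expected == 1.0:
        return 1.0  # both labelers gave every item the same single label
    return (observed - expected) / (1 - expected)

# Hypothetical pass/fail labels on the same eight answers.
human_labels = [True, True, True, False, False, False, True, False]
grader_labels = [True, True, False, False, False, True, True, False]
kappa = cohens_kappa(human_labels, grader_labels)
```

A low kappa suggests the grader's rubric or prompt needs revising before its scores are trusted at scale. What counts as "high enough" depends on the domain and how costly a wrong grade is.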
The broader lesson is that retrieval is only one piece of the puzzle. The long-term advantage comes from designing agents that retrieve intelligently and measuring whether they actually improve outcomes.
FAQ
What is Agentic RAG?
How is Agentic RAG different from basic RAG?
RAG versus fine-tuning, which should you use?
Why do RAG systems return confident wrong answers?
Why does retrieval design come first?
How do you evaluate a RAG system?
What metrics actually reflect real RAG behavior?