Agentic RAG Cookbook: A Production Build

Name: Agentic RAG Cookbook
Uploaded: 2026-04-22
Duration: 60 min
Channel: Ash Tilawat
Description: Build a production-ready Agentic RAG system from the ground up: retrieval design, evaluation metrics that reflect real behavior, and the iteration loop that turns signal into improvement.

Top 3 takeaways

01

Retrieval design comes first

A production RAG system lives or dies on retrieval. Get the retrieval layer right before you reach for a bigger model or a cleverer prompt.

02

Measure real behavior

Evaluation metrics only help when they reflect how the system actually behaves on real queries. A polished demo can hide the issues that show up in production.

03

Iterate on the signal

A loop that turns evaluation signal into improvement is what gets you from a prototype that demos well to a system you can ship.

Ash Tilawat

CTO, Gauntlet AI

CTO of Gauntlet AI, leading the company's technical direction and AI-native training programs. Has trained 1,200+ engineers across 104 companies and run multiple corporate trainings this year, including the AI sales course that firms like a16z, Mainsail, and PwC brought Gauntlet in to teach. Focused on turning AI from a prototype tool into something teams use in real production workflows, with an emphasis on evaluation and systems thinking.

Lesson notes

A written walkthrough of the lecture, covering the patterns, the code, and the things that trip people up.

What Agentic RAG Is and Why It Exists

Retrieval-Augmented Generation (RAG) improves an LLM by retrieving relevant information before generating a response.

Traditional RAG embeds a query, searches a knowledge base for similar content, and places the retrieved information into the model's context window. This works well until the system starts retrieving too much irrelevant information. Larger context windows do not solve this problem; they often make it worse by burying the relevant passages under unnecessary context.
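The traditional pipeline described above can be sketched in a few lines. This is a minimal illustration, not the lecture's code: the corpus, the function names, and the bag-of-words `embed` function are all assumptions standing in for a real embedding model and vector store.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus; a real system would hold chunked documents.
DOCS = [
    "Refunds are processed within five business days.",
    "Password resets are sent by email.",
    "Our office is closed on public holidays.",
]

def embed(text: str) -> Counter:
    """Stand-in embedding (assumption): lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank every document by similarity to the query and keep the top k."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(query: str, docs: list[str], k: int = 2) -> str:
    """Place the retrieved text into the context the model will see."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

print(build_prompt("How long do refunds take?", DOCS))
```

With `k=2` the prompt includes one passage that has nothing to do with refunds, because the pipeline always fills its quota. That is a small-scale version of the irrelevant-context problem.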

Agentic RAG addresses this by placing an agent between the user and the knowledge base. Instead of retrieving everything, the agent decides what information to retrieve, where to retrieve it from, and when it has enough context to answer the question.

The Evolution of RAG

RAG systems generally evolve through five stages:

Naive RAG – Retrieve from a single knowledge base.
Metadata Filtering – Narrow results using filters like topic, title, or date.
Hybrid Search – Combine keyword and vector search.
Graph RAG – Retrieve information through knowledge graphs.
Agentic RAG – Let an agent choose the best retrieval strategy and tools.
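Stages two and three can be illustrated together. The sketch below filters by a metadata field, ranks the survivors twice, and merges the two rankings with reciprocal rank fusion. Reciprocal rank fusion is one common way to combine rankings and is an assumption here, not something the lecture prescribes. The documents, field names, and the character-trigram `fuzzy_score` (a stand-in for vector similarity) are hypothetical.

```python
import re

# Hypothetical documents with metadata; field names are illustrative.
DOCS = [
    {"id": "a", "topic": "billing", "text": "Refund policy for annual plans"},
    {"id": "b", "topic": "billing", "text": "Invoices and refunds explained"},
    {"id": "c", "topic": "security", "text": "Refund scams and phishing"},
]

def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def trigrams(text):
    s = re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()
    return {s[i:i + 3] for i in range(len(s) - 2)}

def keyword_score(query, doc):
    """Keyword search stand-in: number of exact terms shared with the query."""
    return len(tokens(query) & tokens(doc["text"]))

def fuzzy_score(query, doc):
    """Vector search stand-in (assumption): Jaccard overlap of character trigrams."""
    q, d = trigrams(query), trigrams(doc["text"])
    return len(q & d) / len(q | d) if q | d else 0.0

def rank(docs, score):
    """Return document ids ordered from best to worst under the given scorer."""
    return [d["id"] for d in sorted(docs, key=score, reverse=True)]

def rrf(rankings, k=60):
    """Reciprocal rank fusion: each list contributes 1 / (k + position)."""
    scores = {}
    for ranking in rankings:
        for position, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + position)
    return sorted(scores, key=scores.get, reverse=True)

def hybrid_search(query, docs, topic=None):
    # Metadata filtering: narrow the candidate set before any ranking.
    candidates = [d for d in docs if topic is None or d["topic"] == topic]
    by_keyword = rank(candidates, lambda d: keyword_score(query, d))
    by_vector = rank(candidates, lambda d: fuzzy_score(query, d))
    return rrf([by_keyword, by_vector])

print(hybrid_search("refund policy", DOCS, topic="billing"))
```

Filtering first keeps the security document out of the results entirely, and the fusion step needs only rank positions, so the two scorers do not have to share a scale.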

More advanced systems also use multi-hop retrieval, where the agent performs multiple searches, each informed by the results of the previous one, before generating a final answer.
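A minimal sketch of multi-hop retrieval follows, assuming a hypothetical knowledge base keyed by entity and relation. The entities, relation names, and the fixed list of hops are made up for illustration; in an agentic system the agent would choose each hop itself.

```python
# Hypothetical knowledge base: (entity, relation) -> value.
KB = {
    ("Widget Pro", "made_by"): "Acme Corp",
    ("Acme Corp", "ceo"): "Dana Lee",
}

def lookup(entity, relation):
    """One retrieval hop; returns None when the fact is missing."""
    return KB.get((entity, relation))

def multi_hop(start, relations):
    """Follow relations one hop at a time, feeding each result into the next search."""
    entity, trail = start, []
    for relation in relations:
        result = lookup(entity, relation)
        trail.append((entity, relation, result))
        if result is None:
            return None, trail  # a hop failed, so there is no grounded answer
        entity = result
    return entity, trail

# "Who is the CEO of the company that makes Widget Pro?" needs two searches.
answer, trail = multi_hop("Widget Pro", ["made_by", "ceo"])
print(answer)
for hop in trail:
    print(hop)
```

Keeping the trail of hops makes it possible to inspect later which search produced which piece of evidence.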

How Agentic RAG Works

At its core, an agent follows a simple loop: reason, act, and observe.

Rather than following a fixed retrieval pipeline, the agent decides:

Whether retrieval is needed at all.
Which data source or tool to use.
How to break complex questions into smaller ones.
When enough information has been gathered.
When to stop searching or escalate to a human.

When the agent makes these decisions well, retrieval becomes more accurate, more efficient, and more resilient than in a fixed RAG pipeline.
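The reason, act, observe loop and the decisions listed above can be sketched as follows. This is an illustration under stated assumptions: the two tools, the `State` class, and the rule-based `reason` function are hypothetical, and `reason` stands in for what would be an LLM call in a real agent. Query decomposition is left out to keep the sketch short.

```python
from dataclasses import dataclass, field

# Hypothetical tools: each maps a query to a list of text snippets.
TOOLS = {
    "docs": lambda q: ["Refunds take five business days."] if "refund" in q.lower() else [],
    "tickets": lambda q: ["Login loop reported after password reset."] if "login" in q.lower() else [],
}

@dataclass
class State:
    question: str
    evidence: list = field(default_factory=list)
    tried: list = field(default_factory=list)

def reason(state):
    """Stand-in for the LLM policy: return the next action as (kind, tool)."""
    if state.question.lower().strip() in {"hi", "hello", "thanks"}:
        return ("answer", None)      # retrieval is not needed at all
    if state.evidence:
        return ("answer", None)      # enough information has been gathered
    for tool in TOOLS:
        if tool not in state.tried:
            return ("search", tool)  # choose a source not yet tried
    return ("escalate", None)        # nothing left to try: hand off to a human

def run_agent(question, max_steps=5):
    state = State(question)
    for _ in range(max_steps):
        kind, tool = reason(state)                 # reason
        if kind != "search":
            return kind, state
        observation = TOOLS[tool](state.question)  # act
        state.tried.append(tool)                   # observe
        state.evidence.extend(observation)
    return "escalate", state                       # step budget exhausted

for q in ["hi", "How long does a refund take?", "login problem", "What is the weather?"]:
    outcome, state = run_agent(q)
    print(q, "->", outcome, state.tried)
```

The step budget is a deliberate design choice: without it, an agent that never finds evidence has no guaranteed stopping point.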

Evals Are the Competitive Advantage

The lecture emphasizes that evaluation, rather than retrieval, is what separates great AI systems from average ones.

Every RAG system should measure:

Groundedness – Is the answer supported by retrieved information?
Precision – Was the retrieved information actually relevant?
Recall – Did the system retrieve everything it needed?
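Latency – How long does a user wait for an answer?
Cost – What does each answered query cost in model and retrieval calls?
User satisfaction – Do users find the answers helpful?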
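The first three metrics in the list can be computed from labeled examples. The sketch below assumes each test query comes with a set of document ids that a reviewer marked as relevant. The groundedness function is only a crude word-overlap proxy introduced for illustration; production systems typically use a stronger check, such as a model-based grader.

```python
import re

def precision_recall(retrieved, relevant):
    """Precision: share of retrieved items that are relevant.
    Recall: share of relevant items that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def _words(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def groundedness_proxy(answer, contexts):
    """Crude proxy (assumption): share of distinct answer words found in the retrieved text."""
    answer_words = _words(answer)
    context_words = _words(" ".join(contexts))
    return len(answer_words & context_words) / len(answer_words) if answer_words else 0.0

# Hypothetical example: four chunks retrieved, three labeled relevant.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d9"])
print(f"precision={p:.2f} recall={r:.2f}")

context = ["Refunds take five business days."]
print(groundedness_proxy("Refunds take five days", context))
print(groundedness_proxy("Refunds take ten days", context))
```

Precision and recall pull in opposite directions as the number of retrieved chunks grows, so it helps to report them together.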

For agentic systems, also evaluate tool selection, query decomposition, and stopping behavior. Treat retrieval and evaluation as separate systems so they can evolve independently.
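Tool selection and stopping behavior can be scored from recorded agent traces. The sketch below assumes a hypothetical `Trace` record in which a reviewer has labeled the tool the agent should have used and the point at which the evidence became sufficient. The field names and the two scoring functions are illustrative, not taken from the lecture.

```python
from dataclasses import dataclass

@dataclass
class Trace:
    expected_tool: str   # labeled by a reviewer
    tools_called: list   # tools in the order the agent called them
    sufficient_at: int   # 1-based call after which evidence was enough; 0 if never

def tool_selection_accuracy(traces):
    """Share of traces whose first tool call matches the labeled tool."""
    correct = sum(
        1 for t in traces
        if t.tools_called and t.tools_called[0] == t.expected_tool
    )
    return correct / len(traces) if traces else 0.0

def stopping_report(traces):
    """Count traces that kept searching after they had enough, or stopped without enough."""
    over = sum(1 for t in traces if 0 < t.sufficient_at < len(t.tools_called))
    short = sum(1 for t in traces if t.sufficient_at == 0)
    return {"over_searched": over, "stopped_short": short}

# Hypothetical labeled traces.
traces = [
    Trace("docs", ["docs"], 1),                        # right tool, stopped on time
    Trace("tickets", ["docs", "tickets", "docs"], 2),  # wrong first tool, then over-searched
    Trace("docs", ["docs"], 0),                        # right tool, stopped without enough
]
print(tool_selection_accuracy(traces))
print(stopping_report(traces))
```

Because these functions read only traces, they can change without touching the retrieval code, which fits the advice to keep retrieval and evaluation as separate systems.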

Practical Guidance

Building a production RAG system requires more than retrieval.

Balance latency against answer quality, keep deterministic workflows outside the LLM whenever possible, and continuously refresh your knowledge base. Build evaluation datasets with domain experts, launch with real users, and use human review to calibrate automated graders.
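One way to use human review to calibrate an automated grader is to score the same answers both ways and measure how often the two agree. The sketch below computes raw agreement and Cohen's kappa, which corrects for agreement expected by chance. The choice of kappa, the labels, and the sample data are assumptions made for illustration.

```python
from collections import Counter

def agreement(human, grader):
    """Share of items on which the human and the automated grader gave the same label."""
    if len(human) != len(grader) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(h == g for h, g in zip(human, grader)) / len(human)

def cohen_kappa(human, grader):
    """Agreement corrected for chance: (observed - expected) / (1 - expected)."""
    n = len(human)
    observed = agreement(human, grader)
    h_counts, g_counts = Counter(human), Counter(grader)
    expected = sum(h_counts[label] * g_counts[label] for label in h_counts) / (n * n)
    return (observed - expected) / (1 - expected) if expected != 1 else 1.0

# Hypothetical review batch: the same eight answers labeled by an expert and by the grader.
human = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass"]
grader = ["pass", "pass", "fail", "pass", "pass", "fail", "fail", "pass"]

print(f"agreement={agreement(human, grader):.2f}")
print(f"kappa={cohen_kappa(human, grader):.2f}")
```

If agreement is low, the disagreements themselves are the useful output: reviewing them shows whether the grader's rubric or the human guidelines need tightening.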

The broader lesson is that retrieval is only one piece of the puzzle. The long-term advantage comes from designing agents that retrieve intelligently and measuring whether they actually improve outcomes.

FAQ

What is Agentic RAG?

Agentic RAG pairs retrieval-augmented generation with an agent loop that retrieves context, reasons over it, and iterates. Building one for production means getting retrieval design right, measuring behavior with evaluation metrics that reflect reality, and tightening the loop that turns that signal into improvement.

How is Agentic RAG different from basic RAG?

Basic RAG retrieves once and answers in a single pass, while agentic RAG adds a reasoning loop that can decide what to retrieve next, verify, and iterate before answering.

RAG versus fine-tuning: which should you use?

Use retrieval when the model is missing knowledge, and fine-tuning when you need consistent style, format, or domain behavior. For most cases where the model simply does not know X, RAG is the cheaper and faster fix.

Why do RAG systems return confident wrong answers?

Usually it is a retrieval issue: when the right context is not retrieved, the model fills the gap with a fluent guess. Fixing retrieval and evaluation does more for accuracy than tweaking the prompt.

Why does retrieval design come first?

A production RAG system lives or dies on retrieval, so get the retrieval layer right before reaching for a bigger model or a cleverer prompt.

How do you evaluate a RAG system?

You use evaluation metrics that reflect real behavior on representative queries, and you run them as part of the iteration loop rather than as a one-off demo check.

What metrics actually reflect real RAG behavior?

The ones tied to whether the right context was retrieved and used on representative queries. Vanity scores that look good only in a polished demo tell you very little.