AI Code Review: From Vibe Checks to Real QA
In this lesson: AI ships 1.7x more bugs · Plan before you prompt · Separate verifying from coding
Top 3 takeaways
AI ships 1.7x more bugs
AI-generated PRs carry measurably more performance, security, and logic issues, so faster output needs real quality gates to catch them.
Spell out the architecture yourself
Describe the path from button to service to database so the model follows your design. Talk it through with your team, turn the transcript into a spec, then build.
Make verification its own loop
Run tests on every change, keep PRs small, and use a separate reviewer model, because your ability to verify quality is the real bottleneck.

Byron Mackay
Director of Learning, Gauntlet AI
Director of Learning at Gauntlet AI, currently training hundreds of engineers to work AI-first. 16+ years as a mobile/iOS engineer before becoming an AI platform engineer (Savant, School AI), where he built eval platforms from scratch. Led curriculum development at BloomTech (a cohort of his saw nearly every graduate land an engineering role at Amazon) and ran the Amazon partnership/SDR program that moved non-traditional candidates into engineering roles at Amazon. Deep across platform engineering, AI, mobile, and learning.
Lesson notes
A written walkthrough of the lecture, covering the patterns, the code, and the gotchas.
The quality drop shows up in the numbers
The lecture opens with a blunt premise: AI lets you build faster, but not better. The speaker backs it with concrete metrics on how AI-generated code degrades quality. AI-generated PRs contain 1.7x more issues overall than human-written ones. Performance regressions are 7x more likely, readability issues are 3x more common, security vulnerabilities are 2.74x more common, error-handling gaps are 2x more common, and logic and incorrectness errors are 75% more common.
To frame it, the speaker notes that you wouldn't hire an engineer who made errors 1.7x more often than your team, unless they coded 100x faster, and that tradeoff is the world we now live in. The point is to reconsider how we use AI rather than to abandon it. The deeper issue is throughput: before AI, bugs arrived at "human speed" and could be reviewed, whereas now features ship faster than anyone can read the code.
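A back-of-the-envelope sketch can make the throughput point concrete. Only the 1.7x multiplier comes from the lecture; the baseline of 10 PRs per week, 2 issues per PR, and the 5x speedup are made-up numbers chosen purely for illustration.

```python
def weekly_issues(prs_per_week: float, issues_per_pr: float) -> float:
    """Total issues that reach review in one week."""
    return prs_per_week * issues_per_pr


# Hypothetical human-only baseline (illustrative, not from the lecture).
BASELINE_PRS = 10
BASELINE_ISSUES_PER_PR = 2.0

# 1.7x is the lecture's overall issue multiplier; the speedup is an assumption.
AI_ISSUE_MULTIPLIER = 1.7
ASSUMED_SPEEDUP = 5

human = weekly_issues(BASELINE_PRS, BASELINE_ISSUES_PER_PR)
ai = weekly_issues(
    BASELINE_PRS * ASSUMED_SPEEDUP,
    BASELINE_ISSUES_PER_PR * AI_ISSUE_MULTIPLIER,
)

print(f"human-only:  {human:.0f} issues/week")
print(f"AI-assisted: {ai:.0f} issues/week ({ai / human:.1f}x the review load)")
```

Under these assumptions the review load grows with the product of the speedup and the issue multiplier, which is why review capacity, not typing speed, becomes the constraint.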
Missing context is the first failure mode
LLMs have no inherent knowledge of your business rules, architectural constraints, or that useful helper function buried elsewhere. Claude Code will search through your code and do an impressive job, yet it still misses constraints. A recurring trap is context-window bloat. Even though Opus advertises a million-token window, that doesn't mean you should fill it. Stop around 100,000 tokens, consolidate, and start a fresh session, because models begin losing recall well before the window fills. Keeping a session alive too long "because it went so well last time" makes mistakes more frequent.
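One way to make the 100,000-token habit mechanical is a small budget tracker. This is a minimal sketch, not a real tokenizer: the 4-characters-per-token estimate and the `SessionBudget` class are assumptions introduced here for illustration.

```python
from dataclasses import dataclass

# Soft budget suggested in the lecture, well below the advertised window.
SOFT_BUDGET_TOKENS = 100_000


def estimate_tokens(text: str) -> int:
    """Crude estimate: about 4 characters per token (an assumption)."""
    return len(text) // 4


@dataclass
class SessionBudget:
    """Hypothetical helper that tracks roughly how much context a session has used."""

    used: int = 0

    def add(self, message: str) -> None:
        self.used += estimate_tokens(message)

    def should_restart(self) -> bool:
        # True once the session reaches the soft budget: consolidate and start fresh.
        return self.used >= SOFT_BUDGET_TOKENS


session = SessionBudget()
session.add("x" * 200_000)           # roughly 50,000 tokens
first_check = session.should_restart()
session.add("x" * 200_000)           # roughly 100,000 tokens in total
second_check = session.should_restart()
print(first_check, second_check)
```

In practice you would read the token count from whatever usage figure your tool reports rather than estimating it from character length.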
Design the path before you prompt
Rather than asking the LLM to "make this button," spell out the full path: how the button sends the API request, how that request hits the service layer, and how it reaches the database. The model does a poor job when left to decide the architecture on its own. The concrete workflow for larger tasks looks like this:
- Get the engineers on the task into a room and talk through the architecture for an hour or two, transcribing with Notion.
- Hand the transcript to Claude and have it build out a spec list and a task list.
- Go through those tasks meticulously, editing and fleshing them out.
- Only then hand it to Claude Code or Codex to implement.
He calls this "pseudocode the logic first," meaning the planning, research, and engineer discussion that builds context before any code is written.
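As an illustration of what "spelling out the path" can look like, here is a minimal sketch of a button-to-database flow in three layers. Every name in it (`Order`, `OrderRepository`, `OrderService`, `handle_place_order_request`) is hypothetical, and the database is an in-memory dict so the example stays self-contained.

```python
from dataclasses import dataclass


@dataclass
class Order:
    order_id: int
    item: str
    quantity: int


class OrderRepository:
    """Database layer: the only place that touches storage (a dict here)."""

    def __init__(self) -> None:
        self._rows = {}
        self._next_id = 1

    def insert(self, item: str, quantity: int) -> Order:
        order = Order(self._next_id, item, quantity)
        self._rows[order.order_id] = order
        self._next_id += 1
        return order


class OrderService:
    """Service layer: business rules live here, not in the request handler."""

    def __init__(self, repo: OrderRepository) -> None:
        self._repo = repo

    def place_order(self, item: str, quantity: int) -> Order:
        if quantity <= 0:
            raise ValueError("quantity must be positive")
        return self._repo.insert(item, quantity)


def handle_place_order_request(service: OrderService, payload: dict) -> dict:
    """API layer: what the button's request hits. Parses, delegates, shapes the response."""
    try:
        order = service.place_order(str(payload["item"]), int(payload["quantity"]))
    except (KeyError, ValueError) as exc:
        return {"status": 400, "error": str(exc)}
    return {"status": 201, "order_id": order.order_id}


service = OrderService(OrderRepository())
print(handle_place_order_request(service, {"item": "book", "quantity": 2}))
print(handle_place_order_request(service, {"item": "book", "quantity": 0}))
```

Writing the layers down like this, even as a rough outline in the spec, gives the model boundaries to follow instead of leaving it to invent its own.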
Guardrails that keep a team safe
- Context files (CLAUDE.md, Codex's AGENTS.md) set rules, constraints, and patterns so the LLM doesn't re-derive your codebase each time, for example a rule that authentication logic may only live in one file.
- Centralize risk with single ownership and triggers, so that when a sensitive file changes you require review from a specific person or team.
- Smaller PRs, since above roughly 400 lines you should stop and ask why, and a 7-file-change rule of thumb works the same way. A 1,000-line PR should prompt a "what's really going on here" pause.
- Naming, because LLMs throw in generic names and shorthand. Name critical things yourself, or remind the model to be explicit.
- Logical checkpoint commits, plus the tool Graphite for stacked PRs where each builds on the last (database layer, then service layer, then API layer).
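Two of the guardrails above, PR size limits and review triggers on sensitive files, are easy to automate. The following is a minimal sketch under stated assumptions: the path patterns and team names are hypothetical, and the 7-file rule is interpreted here as "more than 7 files changed."

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase

MAX_CHANGED_LINES = 400  # the lecture's rough "stop and ask why" threshold
MAX_CHANGED_FILES = 7    # the lecture's rule of thumb, read as "more than 7"

# Hypothetical ownership map: glob pattern -> reviewer who must sign off.
SENSITIVE_PATHS = {
    "src/auth/*": "security-team",
    "src/payments/*": "payments-team",
}


@dataclass
class ChangedFile:
    path: str
    lines_changed: int


def check_pr(files: list) -> dict:
    """Return size warnings and the reviewers required by sensitive paths."""
    total_lines = sum(f.lines_changed for f in files)
    warnings = []
    if total_lines > MAX_CHANGED_LINES:
        warnings.append(f"{total_lines} changed lines (limit {MAX_CHANGED_LINES}): consider splitting")
    if len(files) > MAX_CHANGED_FILES:
        warnings.append(f"{len(files)} changed files (limit {MAX_CHANGED_FILES}): consider splitting")
    reviewers = sorted(
        {
            owner
            for f in files
            for pattern, owner in SENSITIVE_PATHS.items()
            if fnmatchcase(f.path, pattern)
        }
    )
    return {"warnings": warnings, "required_reviewers": reviewers}


report = check_pr([ChangedFile("src/auth/session.py", 50), ChangedFile("README.md", 10)])
print(report)
```

A check like this could run in CI against the diff summary; hosted platforms also offer their own ownership files, which would replace the hand-rolled map used here.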
Tests, skills, and a separate verification loop
LLMs love writing tests, but those tests are often low value, and the favorite example is a test verifying that a static map is indeed that static map. Direct them toward edge cases, security, and parameter verification instead. Configure the agent to run all tests after every change so it catches regressions, encoded as a skill plus a CLAUDE.md instruction ("always run the test before committing") and backed by pre-commit hooks. The same approach applies to UI via Playwright.
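The contrast between a low-value test and the tests worth asking for can be sketched as follows. The `shipping_cost` function and `SHIPPING_RATES` map are hypothetical stand-ins, and the example uses the standard-library `unittest` module so it runs without extra dependencies.

```python
import unittest

SHIPPING_RATES = {"standard": 5.0, "express": 15.0}  # hypothetical static map


def shipping_cost(method: str, weight_kg: float) -> float:
    """Hypothetical function under test: flat rate plus 1.0 per kilogram."""
    if method not in SHIPPING_RATES:
        raise ValueError(f"unknown shipping method: {method!r}")
    if weight_kg <= 0:
        raise ValueError("weight_kg must be positive")
    return SHIPPING_RATES[method] + 1.0 * weight_kg


class LowValueTest(unittest.TestCase):
    def test_static_map_is_static_map(self):
        # Restates the constant; it exercises no behavior at all.
        self.assertEqual(SHIPPING_RATES, {"standard": 5.0, "express": 15.0})


class EdgeCaseTests(unittest.TestCase):
    def test_rejects_unknown_method(self):
        with self.assertRaises(ValueError):
            shipping_cost("teleport", 1.0)

    def test_rejects_zero_and_negative_weight(self):
        for weight in (0, -1.5):
            with self.assertRaises(ValueError):
                shipping_cost("standard", weight)

    def test_typical_order(self):
        self.assertAlmostEqual(shipping_cost("express", 2.0), 17.0)


# Run only the behavior-focused tests and keep the result for inspection.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(EdgeCaseTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

The first class would pass forever without protecting anything, while the second pins down parameter validation and a representative calculation, which is the kind of coverage a regression run after every change can actually benefit from.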
Other guardrails include skills that automate recurring processes (scanning untrusted code, scaling ClickHouse nodes, generating docs), linters (static rather than AI, yet they still catch a wide range of issues and let the LLM self-correct), and model selection, where you use the smartest model at critical gates like pre-production code review.
Finally, separate the loops: keep verification distinct from coding, and remember that verification can start as early as the planning phase. This echoes segregation of duties (credited to Dex of HumanLayer), where "the bottleneck is your ability to verify quality." Slow down to a realistic 2x pace, never deploy anything touching payments or user data without review, and treat quality as a moat.
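A separate verification loop can be sketched as a small orchestration function in which the coder never approves its own work. Everything here is an assumption for illustration: the `Verdict` type, the function name, and the stub coder, reviewer, and test runner, which in practice would be calls to two different models or agents and a real test suite.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    approved: bool
    notes: str


def build_with_separate_verification(
    task: str,
    coder: Callable[[str, str], str],          # (task, feedback) -> proposed change
    reviewer: Callable[[str, str], Verdict],   # (task, change) -> independent verdict
    run_tests: Callable[[str], bool],          # change -> did the suite pass?
    max_rounds: int = 3,
) -> Optional[str]:
    """Accept a change only if tests pass AND a separate reviewer approves."""
    feedback = ""
    for _ in range(max_rounds):
        change = coder(task, feedback)
        if not run_tests(change):
            feedback = "tests failed"
            continue
        verdict = reviewer(task, change)
        if verdict.approved:
            return change
        feedback = verdict.notes
    return None  # no approved change: escalate to a human


# Stubs standing in for real models and a real test run.
def stub_coder(task: str, feedback: str) -> str:
    return "v2: validates input" if feedback else "v1: no validation"


def stub_reviewer(task: str, change: str) -> Verdict:
    ok = "validates input" in change
    return Verdict(ok, "" if ok else "missing input validation")


def stub_tests(change: str) -> bool:
    return True


accepted = build_with_separate_verification(
    "add signup endpoint", stub_coder, stub_reviewer, stub_tests
)
print(accepted)
```

The design choice worth noting is that approval lives outside the coding step, so a confident but wrong coder cannot wave its own change through.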
FAQ
How do you QA AI-generated code?
What's wrong with vibe-checking AI code?
What quality gates should I put in place?
How do evals fit into code quality?
Can AI review AI-generated code?
How do I test code when I'm shipping faster with AI?
Does this apply to prototypes too?
What single habit improves AI code quality most?