AI Code Review: QA for AI-Generated Code

Name: AI Code Review: From Vibe Checks to Real QA
Uploaded: 2026-04-01
Duration: 60 min
Channel: Byron Mackay
Description: Evaluation strategies that actually work: practical quality gates, testing approaches, and review workflows for AI-assisted code. Walk away with a clear framework for evaluating AI-generated code before it hits production.

Top 3 takeaways

01

AI ships 1.7x more bugs

AI-generated PRs carry measurably more performance, security, and logic issues, so faster output needs real quality gates to catch them.

02

Spell out the architecture yourself

Describe the path from button to service to database so the model follows your design. Talk it through with your team, turn the transcript into a spec, then build.

03

Make verification its own loop

Run tests on every change, keep PRs small, and use a separate reviewer model, because your ability to verify quality is the real bottleneck.

Byron Mackay

Director of Learning, Gauntlet AI

Director of Learning at Gauntlet AI, currently training hundreds of engineers to work AI-first. 16+ years as a mobile/iOS engineer before becoming an AI platform engineer (Savant, School AI), where he built eval platforms from scratch. Led curriculum development at BloomTech (a cohort of his saw nearly every graduate land an engineering role at Amazon) and ran the Amazon partnership/SDR program that moved non-traditional candidates into engineering roles at Amazon. Deep across platform engineering, AI, mobile, and learning.

Lesson notes

A written walkthrough of the lecture, covering the patterns, the code, and the gotchas.

The quality drop shows up in the numbers

The lecture opens with a blunt premise: AI lets you build faster, but not better. The speaker backs it with concrete metrics on how AI-generated code degrades quality. AI-generated PRs contain 1.7x more issues overall than human-written ones. Performance regressions are 7x more likely, readability issues are 3x more common, security vulnerabilities are 2.74x more common, error-handling gaps are 2x more common, and logic and correctness errors are 75% more common (1.75x).

To frame it, the speaker notes that you wouldn't hire an engineer who made errors 1.7x more often than your team, unless they coded 100x faster, and that tradeoff is the world we now live in. The point is to reconsider how we use AI rather than to abandon it. The deeper issue is throughput: before AI, bugs arrived at "human speed" and could be reviewed, whereas now features ship faster than anyone can read the code.
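A rough back-of-the-envelope calculation makes the throughput point concrete. Only the 1.7x issue rate comes from the lecture; the baseline figures (PRs per week, issues per PR) and the 5x output increase are illustrative assumptions, not measurements.

```python
def weekly_issues(prs_per_week: float, issues_per_pr: float,
                  speedup: float = 1.0, issue_multiplier: float = 1.0) -> float:
    """Expected number of issues arriving for review each week."""
    return prs_per_week * speedup * issues_per_pr * issue_multiplier


# Hypothetical baseline: 10 human-written PRs a week, 2 issues per PR.
human = weekly_issues(prs_per_week=10, issues_per_pr=2)

# Same team with AI: 1.7x issues per PR (from the lecture),
# at an assumed 5x increase in PR output.
ai = weekly_issues(prs_per_week=10, issues_per_pr=2,
                   speedup=5, issue_multiplier=1.7)

print(f"human: {human:.0f} issues/week, AI-assisted: {ai:.0f} issues/week")
print(f"review load grows {ai / human:.1f}x")
```

Under these assumptions the review load grows by 1.7 × 5 = 8.5x, even though the per-PR issue rate rose by only 1.7x. The multiplier that hurts is the output volume, which is why review capacity becomes the constraint.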

Missing context is the first failure mode

LLMs have no inherent knowledge of your business rules, architectural constraints, or that useful helper function buried elsewhere. Claude Code will search through your code and do an impressive job, yet it still misses constraints. A recurring trap is context-window bloat. Opus may advertise a million-token window, but that doesn't mean you should fill it. Stop around 100,000 tokens, consolidate, and start a fresh session, because models begin losing recall well before the window fills. Keeping a session alive too long "because it went so well last time" makes issues more frequent.
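The 100,000-token habit can be sketched as a small budget tracker. This is a minimal illustration, not a feature of any tool: `SessionBudget` and `estimate_tokens` are hypothetical names, and the 4-characters-per-token estimate is a crude assumption standing in for a real tokenizer.

```python
from dataclasses import dataclass

TOKEN_BUDGET = 100_000  # soft limit suggested in the lecture


def estimate_tokens(text: str) -> int:
    """Crude estimate of about 4 characters per token (an assumption, not a tokenizer)."""
    return max(1, len(text) // 4)


@dataclass
class SessionBudget:
    """Tracks estimated tokens used in one session (hypothetical helper)."""
    budget: int = TOKEN_BUDGET
    used: int = 0

    def record(self, text: str) -> None:
        """Add a prompt or a response to the running total."""
        self.used += estimate_tokens(text)

    def should_restart(self) -> bool:
        """True once it is time to consolidate and open a fresh session."""
        return self.used >= self.budget


session = SessionBudget()
session.record("x" * 300_000)      # roughly 75,000 estimated tokens
print(session.should_restart())    # still under budget
session.record("x" * 120_000)      # roughly 30,000 more
print(session.should_restart())    # over budget: summarize and start fresh
```

The point of the sketch is the habit rather than the numbers: treat the budget as a trigger to write down what the session learned and carry only that summary forward.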

Design the path before you prompt

Rather than asking the LLM to "make this button," spell out the full path of how the button sends the API request, how that hits the service layer, and how that reaches the database. The model does a poor job when left to decide the architecture on its own. The concrete workflow for larger tasks looks like this:

Get the engineers on the task into a room and talk through the architecture for an hour or two, transcribing with Notion.
Hand the transcript to Claude and have it build out a spec list and a task list.
Go through those tasks meticulously, editing and fleshing them out.
Only then hand it to Claude Code or Codex to implement.

He calls this "pseudocode the logic first," meaning the planning, research, and engineer discussion that builds context before any code is written.
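As an illustration of what "spelling out the path" can look like, here is a minimal sketch of a button-to-database path written as three explicit layers. The order-cancelling feature, the class names, and the status codes are all hypothetical; the lecture only describes the layering, not this code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Order:
    order_id: int
    status: str


class OrderRepository:
    """Database layer: the only place that touches storage (an in-memory dict here)."""

    def __init__(self) -> None:
        self._rows = {1: Order(1, "open")}

    def get(self, order_id: int) -> Optional[Order]:
        return self._rows.get(order_id)

    def save(self, order: Order) -> None:
        self._rows[order.order_id] = order


class OrderService:
    """Service layer: business rules live here, not in the handler."""

    def __init__(self, repo: OrderRepository) -> None:
        self._repo = repo

    def cancel(self, order_id: int) -> Order:
        order = self._repo.get(order_id)
        if order is None:
            raise LookupError(f"order {order_id} not found")
        if order.status != "open":
            raise ValueError(f"order {order_id} is {order.status}, cannot cancel")
        order.status = "cancelled"
        self._repo.save(order)
        return order


def handle_cancel_request(service: OrderService, payload: dict) -> tuple:
    """API layer: what the Cancel button's request hits. Returns (status, body)."""
    try:
        order_id = int(payload["order_id"])
    except (KeyError, TypeError, ValueError):
        return 400, {"error": "order_id must be an integer"}
    try:
        order = service.cancel(order_id)
    except LookupError as exc:
        return 404, {"error": str(exc)}
    except ValueError as exc:
        return 409, {"error": str(exc)}
    return 200, {"order_id": order.order_id, "status": order.status}


service = OrderService(OrderRepository())
print(handle_cancel_request(service, {"order_id": "1"}))
```

Writing even this much down before prompting gives the model a shape to fill in: which layer owns validation, which owns the business rule, and which is allowed to touch storage.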

Guardrails that keep a team safe

Context files (CLAUDE.md, Codex's AGENTS.md) set rules, constraints, and patterns so the LLM doesn't re-derive your codebase each time, for example a rule that authentication logic may only live in one file.
Centralize risk with single ownership and triggers, so that when a sensitive file changes you require review from a specific person or team.
Smaller PRs, since above roughly 400 lines, or more than about 7 changed files, you should stop and ask why. A 1,000-line PR should prompt a "what's really going on here" pause.
Naming, because LLMs throw in generic names and shorthand. Name critical things yourself, or remind the model to be explicit.
Logical checkpoint commits, plus Graphite for stacked PRs, where each PR builds on the last (database layer, then service layer, then API layer).
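Two of the guardrails above, the PR size thresholds and the review triggers on sensitive files, are easy to automate. The sketch below is one possible shape for such a check. The 400-line and 7-file limits come from the lecture; the path patterns, team names, and the `review_pr` function are hypothetical.

```python
from dataclasses import dataclass
from fnmatch import fnmatch

MAX_CHANGED_LINES = 400   # threshold from the lecture
MAX_CHANGED_FILES = 7     # rule of thumb from the lecture

# Hypothetical mapping of sensitive paths to the owner who must review them.
SENSITIVE_PATHS = {
    "src/auth/*": "security-team",
    "src/billing/*": "payments-team",
}


@dataclass
class ChangedFile:
    path: str
    lines_changed: int


def review_pr(files: list) -> dict:
    """Return size warnings and the reviewers required by sensitive paths."""
    warnings = []
    total_lines = sum(f.lines_changed for f in files)
    if total_lines > MAX_CHANGED_LINES:
        warnings.append(f"{total_lines} changed lines exceeds {MAX_CHANGED_LINES}")
    if len(files) > MAX_CHANGED_FILES:
        warnings.append(f"{len(files)} changed files exceeds {MAX_CHANGED_FILES}")

    required = sorted({
        owner
        for f in files
        for pattern, owner in SENSITIVE_PATHS.items()
        if fnmatch(f.path, pattern)
    })
    return {"warnings": warnings, "required_reviewers": required}


result = review_pr([
    ChangedFile("src/auth/login.py", 350),
    ChangedFile("README.md", 100),
])
print(result)
```

In practice the file list would come from the diff in CI, and the size warnings are better treated as a prompt to stop and ask why than as a hard failure.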

Tests, skills, and a separate verification loop

LLMs love writing tests, but those tests are often low value. The favorite example is a test verifying that a static map is indeed that static map. Direct them toward edge cases, security, and parameter verification instead. Configure the agent to run all tests after every change so it catches regressions. Encode this as a skill plus a CLAUDE.md instruction ("always run the test before committing") and back it with pre-commit hooks. The same approach applies to UI via Playwright.
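The contrast between a low-value test and the kind worth asking for can be shown side by side. This is an illustrative sketch using Python's standard unittest module; the `shipping_cost` function and its rates are hypothetical.

```python
import unittest

SHIPPING_RATES = {"standard": 5.0, "express": 15.0}


def shipping_cost(method: str, weight_kg: float) -> float:
    """Hypothetical function under test: rate per kg times weight."""
    if method not in SHIPPING_RATES:
        raise ValueError(f"unknown shipping method: {method!r}")
    if weight_kg <= 0:
        raise ValueError("weight_kg must be positive")
    return SHIPPING_RATES[method] * weight_kg


class LowValueTest(unittest.TestCase):
    def test_rates_are_the_rates(self):
        # Restates the constant. It verifies no behavior and only has to be
        # edited in lockstep whenever the rates change.
        self.assertEqual(SHIPPING_RATES, {"standard": 5.0, "express": 15.0})


class EdgeCaseTests(unittest.TestCase):
    def test_unknown_method_is_rejected(self):
        with self.assertRaises(ValueError):
            shipping_cost("teleport", 1.0)

    def test_zero_and_negative_weight_are_rejected(self):
        for weight in (0, -2.5):
            with self.assertRaises(ValueError):
                shipping_cost("standard", weight)

    def test_cost_scales_with_weight(self):
        self.assertAlmostEqual(shipping_cost("express", 2), 30.0)
```

Saved as a test module, this runs with `python -m unittest`. The first class passes forever without protecting anything, while the second would fail if someone removed the parameter checks.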

Other guardrails include skills that automate recurring processes (scanning untrusted code, scaling ClickHouse nodes, generating docs), linters (static rather than AI, yet they still catch a wide range of issues and let the LLM self-correct), and model selection, where you use the smartest model at critical gates like pre-production code review.

Finally, separate the loops: keep verification distinct from coding, and start verifying as early as the planning phase. This echoes segregation of duties (credited to Dex of HumanLayer), where "the bottleneck is your ability to verify quality." Slow down to a realistic 2x pace, never deploy anything touching payments or user data without review, and treat quality as a moat.
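One way to picture the separate loops is a coder and a reviewer that never share a role, with a human escalation when they fail to converge. The sketch below is an assumption about how that could be wired, not the speaker's implementation: `build_with_verification` is a hypothetical name, and the two stand-in functions take the place of real model calls.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Review:
    approved: bool
    feedback: str = ""


Coder = Callable[[str, str], str]         # (task, feedback) -> code
Reviewer = Callable[[str, str], Review]   # (task, code) -> review


def build_with_verification(task: str, coder: Coder, reviewer: Reviewer,
                            max_rounds: int = 3) -> tuple:
    """Alternate coding and independent review; return (code, approved)."""
    feedback = ""
    code = ""
    for _ in range(max_rounds):
        code = coder(task, feedback)
        review = reviewer(task, code)
        if review.approved:
            return code, True
        feedback = review.feedback
    return code, False  # not approved: escalate to a human reviewer


# Stand-ins for two separate models (hypothetical behavior).
def fake_coder(task: str, feedback: str) -> str:
    if "validate" in feedback:
        return ("def f(x):\n    if x < 0:\n"
                "        raise ValueError('x')\n    return x")
    return "def f(x):\n    return x"


def fake_reviewer(task: str, code: str) -> Review:
    if "raise ValueError" in code:
        return Review(approved=True)
    return Review(approved=False, feedback="validate inputs")


code, approved = build_with_verification("write f", fake_coder, fake_reviewer)
print(approved)
```

The design choice worth noting is that the reviewer sees only the task and the code, never the coder's reasoning, so it cannot inherit the coder's blind spots.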

FAQ

How do you QA AI-generated code?

Replace gut-feel vibe checks with real quality gates, meaning automated tests, structured review workflows, and evaluation criteria applied before code reaches production. The goal is to catch AI mistakes systematically rather than by spot-checking.

What's wrong with vibe-checking AI code?

Eyeballing output catches obvious errors but misses subtle logic, edge-case, and regression bugs, which is exactly where AI-generated code fails. It also doesn't scale as volume grows.

What quality gates should I put in place?

Put in place tests that encode expected behavior, a review step against clear criteria, and CI checks that block merges when something fails.

How do evals fit into code quality?

Evals turn "does this work?" into a repeatable, measurable check you run on every change, so quality stays verified continuously as the code keeps changing.
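A minimal sketch of what "repeatable, measurable" can mean in code: a fixed set of cases, a pass rate, and a threshold that decides whether the change is acceptable. The `slugify` function, the cases, and the `run_evals` helper are hypothetical examples, not part of the lecture.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    name: str
    prompt: str
    check: Callable[[str], bool]  # returns True when the output is acceptable


def slugify(title: str) -> str:
    """Hypothetical system under test: turn a title into a URL slug."""
    return "-".join(title.lower().split())


CASES = [
    EvalCase("basic", "Hello World", lambda out: out == "hello-world"),
    EvalCase("extra whitespace", "  padded   title ",
             lambda out: out == "padded-title"),
    EvalCase("empty input", "", lambda out: out == ""),
]


def run_evals(system: Callable[[str], str], cases: list,
              threshold: float = 1.0) -> dict:
    """Run every case and report the pass rate against a threshold."""
    if not cases:
        raise ValueError("no eval cases supplied")
    failures = []
    for case in cases:
        try:
            ok = case.check(system(case.prompt))
        except Exception:
            ok = False  # a crash counts as a failed case
        if not ok:
            failures.append(case.name)
    pass_rate = (len(cases) - len(failures)) / len(cases)
    return {"pass_rate": pass_rate, "failures": failures,
            "passed": pass_rate >= threshold}


report = run_evals(slugify, CASES)
print(report)
```

Run on every change, a report like this turns "does it still work?" into a number that CI can compare against a threshold. For outputs that are not deterministic, a threshold below 1.0 may be the more realistic gate.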

Can AI review AI-generated code?

Yes, as a first-pass triage to flag issues, though it isn't ground truth, so keep a human gate for anything high-risk.

How do I test code when I'm shipping faster with AI?

Generate tests alongside the implementation and make them part of the definition of done, because skipping them while moving fast just ships bugs faster.

Does this apply to prototypes too?

You can stay lighter-weight for throwaway prototypes, but anything headed for users needs the gates. The line is "will a real user touch this?"

What single habit improves AI code quality most?

Write the check, a test or an eval, before or alongside the code, so correctness gets defined up front while you build.