AgentEngineering
Case Study · Production · Code Generation · Reliability

Building a Production Code Review Agent: Lessons From the Field

How one engineering team replaced a three-hour manual code review workflow with an autonomous agent that runs in under four minutes — and what they learned along the way.

AgentEngineering Editorial · 3 min read

The Problem

A mid-size fintech engineering team was spending an estimated 15 engineer-hours per day on initial code review — checking for security vulnerabilities, style violations, test coverage gaps, and documentation completeness. The work was repetitive, inconsistent across reviewers, and increasingly a bottleneck as the team scaled.

The question: could an agent reliably automate the first-pass review, freeing human reviewers to focus on architectural judgment and business logic?

Architecture Overview

The team designed a multi-step review pipeline with five discrete agent calls:

  1. Diff parser — extracts changed files, diffs, and metadata from the GitHub PR API.
  2. Security scanner — invokes Semgrep rules via a code execution tool, passes results to an LLM for triage.
  3. Test coverage analyzer — runs coverage tooling and identifies uncovered changed lines.
  4. Style and documentation checker — inspects docstrings, naming conventions, and inline comments.
  5. Review synthesizer — aggregates findings from steps 2–4 and produces a structured review comment in GitHub Markdown.

Each step is isolated: a failure in the security scanner does not block the style check. Results are passed downstream via a shared state object.
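The isolation-plus-shared-state pattern can be sketched as follows. The step names and the `ReviewState` fields are illustrative stand-ins, not the team's actual code; the point is that each step writes either a result or an error, and a failure never propagates.

```python
from dataclasses import dataclass, field

@dataclass
class ReviewState:
    """Shared state object passed downstream between pipeline steps."""
    diff: str = ""
    findings: dict = field(default_factory=dict)  # step name -> results
    errors: dict = field(default_factory=dict)    # step name -> error message

def run_step(name, fn, state):
    """Run one step in isolation: a failure is recorded, not raised."""
    try:
        state.findings[name] = fn(state)
    except Exception as exc:
        state.errors[name] = str(exc)
    return state

# Example: the security scanner failing does not block the style check.
def security_scanner(state):
    raise RuntimeError("semgrep timed out")

def style_checker(state):
    return ["missing docstring in foo()"]

state = ReviewState(diff="...")
for name, fn in [("security", security_scanner), ("style", style_checker)]:
    run_step(name, fn, state)

print(state.errors)    # {'security': 'semgrep timed out'}
print(state.findings)  # {'style': ['missing docstring in foo()']}
```

The synthesizer at the end of the pipeline can then report partial results ("security scan unavailable") instead of failing the whole review.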

Key Engineering Decisions

Deterministic Pre-processing, LLM Post-processing

The team explicitly separated deterministic tooling (Semgrep, coverage, linters) from LLM reasoning. The LLM never executes code — it receives structured outputs and summarizes, ranks, and explains findings. This containment strategy made the system more reliable and auditable.
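A minimal sketch of that boundary: the parsing of tool output is pure, deterministic code, and only the resulting structured findings are handed to the model as a prompt. The field names below (`results`, `check_id`, `path`, `start`, `extra.message`) follow Semgrep's JSON report format, but the overall flow is an assumption for illustration, not the team's implementation.

```python
import json

def parse_semgrep_json(raw: str) -> list[dict]:
    """Deterministic: reduce raw Semgrep JSON to structured findings."""
    report = json.loads(raw)
    return [
        {
            "rule": r["check_id"],
            "file": r["path"],
            "line": r["start"]["line"],
            "message": r["extra"]["message"],
        }
        for r in report.get("results", [])
    ]

def triage_prompt(findings: list[dict]) -> str:
    """The LLM only summarizes and ranks; it never executes code."""
    header = "Triage these static-analysis findings and rank by severity:\n"
    return header + "\n".join(
        f"- {f['rule']} at {f['file']}:{f['line']}: {f['message']}"
        for f in findings
    )

raw = json.dumps({"results": [{
    "check_id": "python.lang.security.insecure-hash",
    "path": "app/auth.py",
    "start": {"line": 42},
    "extra": {"message": "MD5 used for password hashing"},
}]})
findings = parse_semgrep_json(raw)
print(triage_prompt(findings))
```

Because the model sees only structured findings, every claim in its summary can be traced back to a concrete tool result, which is what makes the system auditable.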

Confidence Scoring

Each finding includes a model-generated confidence score (low / medium / high). Low-confidence findings are grouped separately and labeled as "suggestions to verify" rather than "required changes." This reduced false-positive friction significantly.
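The grouping logic might look like the sketch below. The labels come from the article; the assumption that medium- and high-confidence findings both land in "required changes" is mine, for illustration.

```python
def group_by_confidence(findings):
    """Split findings: low confidence becomes 'suggestions to verify'."""
    required, suggestions = [], []
    for f in findings:
        if f["confidence"] in ("medium", "high"):
            required.append(f)
        else:
            suggestions.append(f)  # low confidence -> softer framing
    return {"required_changes": required, "suggestions_to_verify": suggestions}

findings = [
    {"id": 1, "confidence": "high"},
    {"id": 2, "confidence": "low"},
    {"id": 3, "confidence": "medium"},
]
groups = group_by_confidence(findings)
print([f["id"] for f in groups["required_changes"]])       # [1, 3]
print([f["id"] for f in groups["suggestions_to_verify"]])  # [2]
```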

Human-in-the-Loop on High-Severity Findings

Security findings rated "critical" trigger a Slack notification to the security team before the automated comment is posted. A human must confirm or dismiss before the review is finalized.
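One way to express that gate: critical findings are held in a pending queue and a notification is fired, while everything else is eligible to post immediately. `notify_security_team` stands in for a Slack webhook; the function and data shapes are hypothetical.

```python
def finalize_review(findings, notify_security_team):
    """Hold critical findings for human confirmation before posting."""
    postable, pending = [], []
    for f in findings:
        if f["severity"] == "critical":
            notify_security_team(f)  # e.g. a Slack webhook in production
            pending.append(f)        # held until a human confirms or dismisses
        else:
            postable.append(f)
    return postable, pending

notified = []
findings = [
    {"severity": "critical", "rule": "hardcoded-secret"},
    {"severity": "low", "rule": "missing-docstring"},
]
postable, pending = finalize_review(findings, notified.append)
print(len(postable), len(pending), len(notified))  # 1 1 1
```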

Important

Fully automated agents should not take irreversible actions (posting security findings, blocking merges) without a human checkpoint when the stakes are high. Design your escalation path before you need it.

Results

After three months in production across 1,200+ pull requests:

  • Average first-pass review time: 3h 40m → 4 minutes.
  • Security findings surfaced by the agent that had slipped past human review: 23 (of which 17 were confirmed genuine).
  • False positive rate on security findings: 26% (reduced from 61% in the initial prototype through prompt iteration).
  • Developer satisfaction with review quality: 4.1/5 (versus 3.8/5 for human-only first-pass reviews in an earlier survey).

What Did Not Work

  • End-to-end autonomous merge — attempted briefly, abandoned after two incidents where the agent approved PRs with subtle logic errors that required context the agent lacked.
  • Single monolithic LLM call — the initial design tried to pass the entire diff to one model call. Accuracy dropped significantly on PRs larger than 400 changed lines. Breaking into discrete steps improved consistency dramatically.

Lessons

  1. Use deterministic tools for deterministic tasks. Reserve LLM reasoning for synthesis, explanation, and judgment.
  2. Small, focused agent calls outperform large, general ones.
  3. Design your escalation path before you deploy.
  4. Eval on real data early — prototype accuracy on toy examples rarely predicts production accuracy.

Cite this article

@article{agentengineering2025,
  title   = {Building a Production Code Review Agent: Lessons From the Field},
  author  = {AgentEngineering Editorial},
  journal = {AgentEngineering},
  year    = {2025},
  url     = {https://agentengineering.io/topics/case-studies/code-review-agent}
}