Agentic Security Review

Multi-agent security review with evidence-first pipeline, deterministic consolidation, severity rubric, and release-policy engine.

Run three AI agents (Codex, Claude, optional Bob) through structured security review of your codebase. Get an HTML report with executive summary, full audit log, and machine-readable consolidation — without trusting any single agent.

Install

One command. Requires Python 3.13+ and pipx:

curl -fsSL https://anderssol.github.io/CodeReview/install.sh | bash

The install script verifies SHA-256 against latest.json before installing.

What you get

agentic-security-review — pipeline CLI for consolidating agent outputs into HTML report + JSON audit log
agentic-security-preflight — deterministic preflight (sandbox + inventory + optional gitleaks/bandit/semgrep/pip-audit + 5-level prompt-injection scan)
agentic-evidence-verify — standalone evidence-verifier CLI for ad-hoc finding sanity checks
agentic-security-claude-adapter — EXPERIMENTAL Claude adapter for non-Claude-Code contexts (CI, scripts)
Claude Code skill at ~/.claude/skills/security-review (invoke with /security-review <path>)
Bundled v2 JSON schema + v5 HTML template + 8-rule default release policy

How it works

You hand the framework a codebase. Three AI agents read it in parallel. The framework then checks each finding against actual source code, applies a fixed rule for how serious it is, and gives you a clear "ship / review / block" verdict — with full evidence.

You point it at a codebase

A folder of source files. Could be a small library or a full application. The framework first locks down the target so it cannot read outside that folder.

sandbox + inventory
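The confinement idea can be sketched in a few lines of Python. This is an illustrative sketch only, not the framework's actual sandbox: the function name `is_inside_target` is hypothetical, and a real sandbox would likely enforce more than a path check.

```python
from pathlib import Path


def is_inside_target(target_root: str, candidate: str) -> bool:
    """Return True only if `candidate` resolves to a location under `target_root`.

    Hypothetical helper for illustration; not part of the published CLI.
    """
    root = Path(target_root).resolve()
    # Join first, then resolve, so "../" segments and symlinks are
    # followed before the containment check. An absolute `candidate`
    # replaces `root` entirely in the join, so it is checked as-is.
    resolved = (root / candidate).resolve()
    return resolved == root or root in resolved.parents
```

Resolving before comparing matters: a plain string-prefix check would accept a path such as `src/../../secrets.txt`, whereas the resolved form shows it lands outside the target folder.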

Automated scanners run first

Standard security tools (like virus scanners for code) check for leaked passwords, known weak patterns, and dependencies with known vulnerabilities. A separate scan looks for "prompt injection": text inside the codebase that could trick the AIs.

gitleaks · bandit · semgrep · pip-audit

Three AI agents read in parallel

Each agent gets the same prompt, whose hash is recorded, plus the scanner results. They look for security weaknesses independently and write structured findings. One AI alone misses things; three together catch more.

Codex (OpenAI) Claude (Anthropic) Bob (IBM, optional)

Each finding is fact-checked against the source code

The framework opens the file the AI cited, finds the exact line, and confirms the quoted code is actually there. AIs sometimes hallucinate — citing line 247 when the code is on line 142, or quoting something that doesn't exist. Those findings are flagged as unverified before they get consolidated.

file open · ±5-line search · quote match
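The three steps (open the file, search within ±5 lines, match the quote) might look roughly like the sketch below. The function name `verify_quote`, the status labels, and the whitespace normalisation are assumptions made for illustration; the bundled agentic-evidence-verify CLI may behave differently.

```python
from pathlib import Path


def verify_quote(path, cited_line, quote, window=5):
    """Check that `quote` appears at `cited_line` or within +/- `window` lines.

    Returns (status, actual_line), where status is 'exact', 'nearby',
    or 'unverified'. Hypothetical helper, not the real verifier.
    """
    try:
        lines = Path(path).read_text(encoding="utf-8", errors="replace").splitlines()
    except OSError:
        # The cited file does not exist or cannot be read.
        return ("unverified", None)

    # Collapse runs of whitespace so indentation differences do not matter.
    needle = " ".join(quote.split())
    if not needle:
        return ("unverified", None)

    def matches(n):
        # Line numbers are 1-indexed; out-of-range numbers never match.
        return 1 <= n <= len(lines) and needle in " ".join(lines[n - 1].split())

    if matches(cited_line):
        return ("exact", cited_line)
    # Walk outwards from the cited line, nearest lines first.
    for offset in range(1, window + 1):
        for n in (cited_line - offset, cited_line + offset):
            if matches(n):
                return ("nearby", n)
    return ("unverified", None)
```

Under this sketch, a finding that cites line 247 for code that actually sits on line 142 falls outside the window and comes back as unverified, which matches the behaviour described above.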

Severity comes from a fixed rule — not AI opinion

Each finding is scored on six factors: what an attacker could do, what access they need, what environment is hit, where the input comes from, how easy it is to trigger, and whether the code is actually reachable. The combination determines Critical / High / Medium / Low deterministically: the same finding always gets the same severity.

impact × privilege × env × input × trigger × reachability
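To make "deterministic" concrete, here is a toy rubric written as a pure function. The six factor names come from the text above; the 1-3 scale, the multiplication, and the thresholds are invented for illustration and are not the framework's real rubric.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Factors:
    # Hypothetical scale: 1 (least severe) to 3 (most severe) per factor.
    impact: int
    privilege: int
    env: int
    input_source: int
    trigger: int
    reachability: int


def severity(f: Factors) -> str:
    """Map six factor scores to a severity label with no randomness or model call."""
    values = (f.impact, f.privilege, f.env, f.input_source, f.trigger, f.reachability)
    if any(v not in (1, 2, 3) for v in values):
        raise ValueError("each factor must be 1, 2, or 3")
    score = 1
    for v in values:
        score *= v  # product ranges from 1 to 729
    # Invented thresholds, chosen only to produce four bands.
    if score >= 324:
        return "Critical"
    if score >= 96:
        return "High"
    if score >= 16:
        return "Medium"
    return "Low"
```

Because the function depends only on its inputs, calling it twice with the same factors always returns the same label, which is the property the rubric is meant to guarantee.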

Every finding gets one of five states

Nothing is filtered away — every finding is preserved and labelled. You see exactly which ones are firmly real, which need a human glance, and which look like noise.

confirmed plausible weak evidence needs manual review false-positive candidate

Eight rules decide: ship, review, accept, or block

A release policy engine evaluates every finding against eight standard rules ("any confirmed Critical → block", "exposed secret → block", "auth-bypass at Medium+ → needs review", etc.). Every rule is evaluated and logged, and the first matching rule sets the verdict. You get a clear verdict, not a maybe.

first match wins · all evaluated · override with --release-policy
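A minimal sketch of "first match wins, all evaluated" follows. Only the three rules quoted above are modelled; the finding fields (`state`, `severity`, `category`), the rule names, and the default verdict of "ship" when nothing matches are assumptions, not the bundled policy format.

```python
SEVERITY_RANK = {"Low": 0, "Medium": 1, "High": 2, "Critical": 3}

# (rule name, predicate, verdict) in priority order. Hypothetical encoding
# of three of the eight default rules.
RULES = [
    ("confirmed-critical",
     lambda f: f["state"] == "confirmed" and f["severity"] == "Critical",
     "block"),
    ("exposed-secret",
     lambda f: f["category"] == "exposed-secret",
     "block"),
    ("auth-bypass-medium-plus",
     lambda f: f["category"] == "auth-bypass"
     and SEVERITY_RANK[f["severity"]] >= SEVERITY_RANK["Medium"],
     "review"),
]


def evaluate(finding):
    """Return (verdict, log). Every rule is evaluated; the first match decides."""
    # Evaluate all rules up front so the audit log is complete.
    log = [(name, bool(pred(finding))) for name, pred, _ in RULES]
    verdict = "ship"  # assumed default when no rule matches
    for (name, _, outcome), (_, matched) in zip(RULES, log):
        if matched:
            verdict = outcome
            break
    return verdict, log
```

Evaluating every rule even after a match costs little and means the log shows which lower-priority rules would also have fired, which is useful when someone later asks why a release was blocked.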

You get a report + a full audit log

A single HTML file with an executive summary, per-finding business-risk narrative, evidence citations, and the 6×6 coverage matrix. Plus a JSON audit log with every hash needed to reproduce the result later — schema, rubric, policy, prompt, every agent's raw output.

HTML report + reproducible audit JSON

Why three AIs and a deterministic rubric?

AI agents fail in ways you can't predict, sometimes catastrophically. One day Codex misses an auth-bypass bug; another day Claude hallucinates a line number; Bob has been known to recommend the buggy code as the fix.

Three agents mean that when one fails, the other two usually catch what it missed (or expose what it got wrong).
Evidence verification catches the "AI said X but the code says Y" failure mode.
The fixed rubric means that a real Critical bug stays Critical even if only one agent caught it — instead of being dismissed as low-confidence.
The release policy turns "should we ship?" from an opinion into a documented decision you can audit.

This is operational triangulation, not magic. Limitations are stated prominently in every report.

Production status

Use case	Status
Personal use on small/medium codebases	✅ Ready (219 tests, live-validated 2026-05-14)
Internal team shared use	🟡 Alpha (rough edges on Bob format, CWE-policy, enum aliases)
External customer / board delivery	❌ Not yet (auto-gen business-risk is draft-tier; needs hand-written executive narrative)
CI/CD integration	❌ Not yet

This release is v5.2.0rc1. Auto-update via latest.json manifest. Source code is private at AndersSol/agentic-security-review — only release artefacts and install script are public.

Requirements

Python 3.13+
pipx (brew install pipx)
Codex CLI (required) — primary code-reasoner agent
Claude Code (required) — threat-model reviewer (sub-agent)
Bob / IBM Bob Shell (optional) — adversarial reviewer
Optional scanners: gitleaks, bandit, semgrep, pip-audit

Verifying a release

curl -fsSL https://anderssol.github.io/CodeReview/latest.json
# version, wheel_url, sha256, size_bytes, python_requires, released

The install script downloads from wheel_url, recomputes the SHA-256, and fails hard on a mismatch.
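If you want to repeat the check by hand, the comparison can be sketched in Python as below. The manifest keys `sha256` and `size_bytes` are the ones listed above; the function name `verify_wheel`, the assumption that `sha256` is a hex string, and the extra size check are illustrative and not taken from the install script.

```python
import hashlib
import hmac


def verify_wheel(wheel_bytes: bytes, manifest: dict) -> None:
    """Raise ValueError unless the downloaded bytes match the manifest.

    Hypothetical helper; assumes manifest["sha256"] is a hex digest.
    """
    # Cheap sanity check first (an assumption; the script may skip this).
    if len(wheel_bytes) != manifest["size_bytes"]:
        raise ValueError("size mismatch")
    actual = hashlib.sha256(wheel_bytes).hexdigest()
    expected = manifest["sha256"].lower()
    if not hmac.compare_digest(actual, expected):
        raise ValueError("SHA-256 mismatch")
```

In practice you would fetch latest.json, download the file at `wheel_url`, and pass its bytes together with the parsed manifest to this function before installing anything.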

Releases

See GitHub Releases for all versions.