Agentic Security Review
Multi-agent security review with evidence-first pipeline, deterministic consolidation, severity rubric, and release-policy engine.
Run three AI agents (Codex, Claude, optional Bob) through structured security review of your codebase. Get an HTML report with executive summary, full audit log, and machine-readable consolidation — without trusting any single agent.
Install
One command, requires Python 3.13+ and pipx:
The install script verifies SHA-256 against latest.json before installing.
What you get
agentic-security-review— pipeline CLI for consolidating agent outputs into HTML report + JSON audit logagentic-security-preflight— deterministic preflight (sandbox + inventory + optional gitleaks/bandit/semgrep/pip-audit + 5-level prompt-injection scan)agentic-evidence-verify— standalone evidence-verifier CLI for ad-hoc finding sanity checksagentic-security-claude-adapter— EXPERIMENTAL Claude adapter for non-Claude-Code contexts (CI, scripts)- Claude Code skill at
~/.claude/skills/security-review(invoke with/security-review <path>) - Bundled v2 JSON schema + v5 HTML template + 8-rule default release policy
How it works
You hand the framework a codebase. Three AI agents read it in parallel. The framework then checks each finding against actual source code, applies a fixed rule for how serious it is, and gives you a clear "ship / review / block" verdict — with full evidence.
You point it at a codebase
A folder of source files. Could be a small library or a full application. The framework first locks down the target so it cannot read outside that folder.
Automated scanners run first
Standard security tools (like virus scanners for code) check for leaked passwords, known weak patterns, and dependencies with known vulnerabilities. Plus a scan for "prompt-injection" — text inside the codebase that could trick the AIs.
Three AI agents read in parallel
Each gets the same hashed prompt and the scanner results. They look for security weaknesses independently and write structured findings. One AI alone misses things; three together catch more.
Each finding is fact-checked against the source code
The framework opens the file the AI cited, finds the exact line, and confirms the quoted code is actually there. AIs sometimes hallucinate — citing line 247 when the code is on line 142, or quoting something that doesn't exist. Those findings are flagged as unverified before they get consolidated.
Severity comes from a fixed rule — not AI opinion
Each finding is scored on six factors: what an attacker could do, what access they need, what environment is hit, where the input comes from, how easy to trigger, and whether the code is actually reachable. The combination determines Critical / High / Medium / Low — deterministically. The same finding always gets the same severity.
Every finding gets one of five states
Nothing is filtered away — every finding is preserved and labelled. You see exactly which ones are firmly real, which need a human glance, and which look like noise.
Eight rules decide: ship, review, accept, or block
A release policy engine evaluates every finding against eight standard rules ("any confirmed Critical → block", "exposed secret → block", "auth-bypass at Medium+ → needs review", etc.). All rules are tested and logged. You get a clear verdict, not a maybe.
You get a report + a full audit log
A single HTML file with an executive summary, per-finding business-risk narrative, evidence citations, and the 6×6 coverage matrix. Plus a JSON audit log with every hash needed to reproduce the result later — schema, rubric, policy, prompt, every agent's raw output.
Why three AIs and a deterministic rubric?
AI agents fail in ways you can't predict. One day Codex misses an auth-bypass bug; another day Claude hallucinates a line number; Bob has been known to recommend the buggy code as the fix. Catastrophically.
- Three agents mean that when one fails, the other two usually catch what it missed (or expose what it got wrong).
- Evidence verification catches the "AI said X but the code says Y" failure mode.
- The fixed rubric means that a real Critical bug stays Critical even if only one agent caught it — instead of being dismissed as low-confidence.
- The release policy turns "should we ship?" from an opinion into a documented decision you can audit.
This is operational triangulation, not magic. Limitations are stated prominently in every report.
Production status
| Use case | Status |
|---|---|
| Personal use on small/medium codebases | ✅ Ready (219 tests, live-validated 2026-05-14) |
| Internal team shared use | 🟡 Alpha (rough edges on Bob format, CWE-policy, enum aliases) |
| External customer / board delivery | ❌ Not yet (auto-gen business-risk is draft-tier; needs hand-written executive narrative) |
| CI/CD integration | ❌ Not yet |
AndersSol/agentic-security-review — only release artefacts and install script are public.
Requirements
- Python 3.13+
- pipx (
brew install pipx) - Codex CLI (required) — primary code-reasoner agent
- Claude Code (required) — threat-model reviewer (sub-agent)
- Bob / IBM Bob Shell (optional) — adversarial reviewer
- Optional scanners: gitleaks, bandit, semgrep, pip-audit
Verifying a release
curl -fsSL https://anderssol.github.io/CodeReview/latest.json
# version, wheel_url, sha256, size_bytes, python_requires, released
Install script downloads from wheel_url, recomputes SHA-256, fails hard on mismatch.
Releases
See GitHub Releases for all versions.