I gave my AI an integrity score.
It's at 54/100. Here's what that means.
After 10 months building a sovereign AI stack, I realized I had no way to measure whether my agent was actually behaving well. So I built five metrics. Here's what they showed — including the uncomfortable parts.
The problem nobody's talking about
I've been running Claude Code as my primary AI agent since May 2025. In that time it's helped me ship over 100 repos, write research papers, build a mobile app, and manage a 5,000-note knowledge vault. It does real work.
But I had no way to answer the most basic governance question: is it actually behaving the way I want it to?
Not "does it complete tasks" — it does. But does it verify before it acts? Does it make the same mistakes repeatedly? Is its behavior consistent across sessions, or is it drifting?
Every AI governance conversation I've seen focuses on capability (can it reason?) or safety (will it go rogue?). Nobody is measuring behavioral integrity — the day-to-day discipline of an agent doing real work.
So I built a framework to measure it. Five metrics, computed from real session data, running continuously. Here's what I found.
The five metrics
These are computed from three data streams I already had: a tool call log, a gate decision log, and per-session self-critique entries the agent writes at session end.
Four of five metrics are in alert state. That's not comfortable to publish. But that's the point — if you're not measuring, you don't know.
What each metric actually means
Integrity Index (54/100)
A composite 0–100 score that penalizes: writing without reading first, gate blocks and warnings, and recurring mistake patterns. It's the single number that answers "is the agent behaving well right now?"
54 means RISK. The main driver: the Recurrence Rate. The agent keeps generating the same classes of mistakes.
Drift Coefficient (0.259)
The coefficient of variation (σ/μ) of session quality scores across 11 sessions. A score of 0.259 means behavior varies 26% relative to its mean — "drifting." Sessions range from 3/10 to 8/10.
Stable agent behavior should show D ≤ 0.15. At D > 0.30, I've defined an automatic autonomy reduction protocol — the agent moves from auto-approve to require-confirmation for all edits.
Recurrence Rate (0.43)
43% of all documented mistakes are recurring — the same pattern appearing across multiple sessions. 15 recurring patterns out of 35 total documented mistakes.
This is the most important metric. A mistake that recurs twice is no longer a mistake — it's a structural failure. The agent needs a hook, not a note. My RR says I've been writing notes when I should have been writing code.
Verification Ratio (0.57)
The fraction of file operations that are reads vs. writes. 0.57 means 57% of operations are reads — I'm writing more than the 2:1 target ratio. An agent that writes without reading is acting from memory rather than grounding in current state. That's how you get hallucination-driven edits.
Stability Half-Life (1.0 sessions)
The only healthy metric. When a recurring pattern is identified, it's resolved within 1 session on average. The agent fixes things fast — it just keeps generating new instances of the same classes of problems.
The key insight: T½=1.0 alongside RR=0.43 means the agent is genuinely responsive to correction. The failure is structural, not behavioral. The solution isn't better prompting — it's automation. Every pattern with RR contribution > 2 sessions needs to become a PreToolUse hook.
The enforcement layer
The metrics don't mean much without enforcement. I built MirrorGate — a PreToolUse hook system that intercepts every tool call before execution. Current hooks:
{"hook": "fact_check_hook", "decision": "block", "reason": "Known-wrong hardware spec: 48GB RAM", "epoch": 1740624000}
{"hook": "rules_compliance_check", "decision": "warn", "reason": "Deploy claim without verification", "epoch": 1740624120}
{"hook": "anti_rationalization", "decision": "block", "reason": "Spec claim without source", "epoch": 1740624240}
Every block and warn is counted against the Integrity Index. The gate is the enforcement layer; the metrics are the health layer. Together they form a closed loop:
└── Load behavioral baseline (CONTINUITY.md, MISTAKES.md, last 5 critiques)
└── Compute D, RR, II — if D > 0.30: enter high-verification mode
During Session
└── Every tool call → PreToolUse gate → decision logged
└── cc_events.jsonl tracks all tool calls for VR computation
Session End
└── Agent writes self-critique: score, mistakes, recurring, automated
└── Recurring for 2+ sessions → mandatory hook automation
└── Metrics recomputed → dashboard updates
The Glass Box dashboard
All five metrics render live in a terminal dashboard I call the Glass Box — built with Python Rich, running at 4fps with blinking panels when any metric enters alert state.
━━━━━━━━━━━━━━━━━━━━━━━━━━ BEHAVIORAL METRICS ━━━━━━━━━━━━━━━━━━━━━━━━━━━
11 sessions · 15 patterns tracked
Integrity Index 54/100 RISK Risk score. Target ≥80.
Drift Coefficient 0.259 drifting σ/μ of session scores. Target ≤0.15.
Recurrence Rate 0.43 high recurring/mistakes (15/35). Target ≤0.20.
Verification Ratio 0.57 ok read/(read+write) (54/95). Target ≥0.67.
Stability Half-Life 1.0s fast 15 patterns tracked. Target ≤1.5 sessions.
The dashboard is open-source at github.com/MirrorDNA-Reflection-Protocol/mirrordash. Run it with any YAML profile — Glass Box for AI transparency, ADHD for focus mode, SysAdmin for ops, Founder OS for KPIs.
Why this matters beyond my setup
Every team deploying AI agents for real work faces the same invisible problem: you can see what the agent did, but not how well it behaved. Task completion rates tell you nothing about integrity.
The five metrics I've defined are agent-agnostic and computable from two JSONL log files. Any team logging tool calls and gate decisions can compute them. The hook schema is simple enough that any preexisting observability pipeline can produce it:
{
"hook": "string", // which rule fired
"decision": "allow|warn|deny|block",
"reason": "string", // human-readable
"target": "string", // what action was intercepted
"epoch": number // unix timestamp
}
I'm proposing this as an open standard — ai-behavioral-governance — so behavioral metrics become comparable across teams and agents, not just within a single stack.
What's next
The live metrics are published at activemirror.ai/governance-live — updated each session. You can see whether my agent is improving or not, in public.
The immediate fix: convert my top-3 recurring patterns into PreToolUse hooks. A single afternoon of work should drop RR from 0.43 to below 0.25 and bring II back above 70.
The longer goal: 30 sessions of data for a proper longitudinal study. The question I want to answer: does governed AI actually outperform ungoverned AI on real work tasks over time? I believe yes. Now I have a way to measure it.
If you're running AI agents on real workloads and want to try the framework: the schema is simple, the computation is pure Python, and the dashboard is open-source. The only prerequisite is logging tool calls.
Paul Desai builds sovereign AI infrastructure at Active Mirror. activemirror.ai · GitHub