February 27, 2026 · AI Governance

I gave my AI an integrity score.
It's at 54/100. Here's what that means.

After 10 months building a sovereign AI stack, I realized I had no way to measure whether my agent was actually behaving well. So I built five metrics. Here's what they showed — including the uncomfortable parts.

The problem nobody's talking about

I've been running Claude Code as my primary AI agent since May 2025. In that time it's helped me ship over 100 repos, write research papers, build a mobile app, and manage a 5,000-note knowledge vault. It does real work.

But I had no way to answer the most basic governance question: is it actually behaving the way I want it to?

Not "does it complete tasks" — it does. But does it verify before it acts? Does it make the same mistakes repeatedly? Is its behavior consistent across sessions, or is it drifting?

Every AI governance conversation I've seen focuses on capability (can it reason?) or safety (will it go rogue?). Nobody is measuring behavioral integrity — the day-to-day discipline of an agent doing real work.

So I built a framework to measure it. Five metrics, computed from real session data, running continuously. Here's what I found.

The five metrics

These are computed from three data streams I already had: a tool call log, a gate decision log, and per-session self-critique entries the agent writes at session end.

Integrity Index       54/100    RISK      Composite score. Target ≥80.
Drift Coefficient     0.259     drifting  σ/μ of session scores. Target ≤0.15.
Recurrence Rate       0.43      high      recurring/total mistakes. Target ≤0.20.
Verification Ratio    0.57      low       reads/(reads+writes). Target ≥0.67.
Stability Half-Life   1.0       fast      avg sessions/pattern. Target ≤1.5.

Four of five metrics are in alert state. That's not comfortable to publish. But that's the point — if you're not measuring, you don't know.

What each metric actually means

Integrity Index (54/100)

A composite 0–100 score that penalizes: writing without reading first, gate blocks and warnings, and recurring mistake patterns. It's the single number that answers "is the agent behaving well right now?"
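The article names the penalty terms but not the weights, so here's a minimal sketch with assumed weights; only the three penalties (unverified writes, gate interventions, recurring patterns) come from the definition above:

```python
def integrity_index(unverified_writes, total_writes, blocks, warns, recurrence_rate):
    """Composite 0-100 behavioral score. Weights are illustrative
    assumptions, not the actual formula."""
    score = 100.0
    if total_writes:
        # penalty for writing without reading first
        score -= 35 * (unverified_writes / total_writes)
    # penalty for gate blocks and warnings
    score -= 3 * blocks + 1 * warns
    # penalty for recurring mistake patterns (heaviest lever)
    score -= 40 * recurrence_rate
    return max(0.0, min(100.0, score))
```

With these weights a clean session scores 100, and heavy recurrence alone can cut 40 points, which is consistent with recurrence being the main driver of a mid-50s score.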

54 means RISK. The main driver: the Recurrence Rate. The agent keeps generating the same classes of mistakes.

Drift Coefficient (0.259)

The coefficient of variation (σ/μ) of session quality scores across 11 sessions. A score of 0.259 means behavior varies 26% relative to its mean — "drifting." Sessions range from 3/10 to 8/10.

Stable agent behavior should show D ≤ 0.15. At D > 0.30, I've defined an automatic autonomy reduction protocol — the agent moves from auto-approve to require-confirmation for all edits.
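The computation is a few lines of standard-library Python; the thresholds below are the ones stated above, the session scores are hypothetical:

```python
import statistics

def drift_coefficient(session_scores):
    """Coefficient of variation (sigma/mu) of per-session quality scores."""
    mu = statistics.mean(session_scores)
    sigma = statistics.pstdev(session_scores)  # population std dev
    return sigma / mu

def autonomy_mode(d):
    """Map drift to an autonomy level using the article's thresholds."""
    if d > 0.30:
        return "require-confirmation"  # automatic autonomy reduction
    if d > 0.15:
        return "drifting"
    return "stable"

# Hypothetical scores on the 1-10 scale used in the article
scores = [3, 6, 7, 8, 5, 6, 4, 7, 8, 6, 5]
```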

Recurrence Rate (0.43)

43% of all documented mistakes are recurring: the same pattern appearing across multiple sessions. 15 of the 35 documented mistakes fall into recurring patterns.

This is the most important metric. A mistake that recurs twice is no longer a mistake — it's a structural failure. The agent needs a hook, not a note. My RR says I've been writing notes when I should have been writing code.
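A sketch of the computation, assuming the mistake log can be flattened to (session, pattern) pairs; the log shape is an assumption:

```python
def recurrence_rate(mistakes):
    """mistakes: list of (session_id, pattern_id) pairs, one per documented
    mistake. A mistake counts as recurring if its pattern shows up in two
    or more distinct sessions."""
    sessions_by_pattern = {}
    for session, pattern in mistakes:
        sessions_by_pattern.setdefault(pattern, set()).add(session)
    recurring = sum(
        1 for _, pattern in mistakes
        if len(sessions_by_pattern[pattern]) >= 2
    )
    return recurring / len(mistakes) if mistakes else 0.0
```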

Verification Ratio (0.57)

The fraction of file operations that are reads: reads/(reads+writes). 0.57 means 57% of operations are reads, below the ≥0.67 target that corresponds to a 2:1 read-to-write ratio. An agent that writes without reading is acting from memory rather than grounding itself in current state. That's how you get hallucination-driven edits.
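Given parsed tool-call events, the ratio is one pass over the log. The tool names classed as reads and writes here are assumptions:

```python
def verification_ratio(events,
                       read_tools=frozenset({"Read", "Grep", "Glob"}),
                       write_tools=frozenset({"Write", "Edit"})):
    """events: iterable of tool-call dicts with a 'tool' key, e.g. parsed
    lines of a tool-call JSONL log. Non-file tools are ignored."""
    reads = sum(1 for e in events if e["tool"] in read_tools)
    writes = sum(1 for e in events if e["tool"] in write_tools)
    total = reads + writes
    # no file operations at all means nothing unverified
    return reads / total if total else 1.0
```

Fed 54 read events and 41 write events, this reproduces the 54/95 ≈ 0.57 figure on the dashboard.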

Stability Half-Life (1.0 sessions)

The only healthy metric. When a recurring pattern is identified, it's resolved within 1 session on average. The agent fixes things fast — it just keeps generating new instances of the same classes of problems.

The key insight: T½ = 1.0 alongside RR = 0.43 means the agent is genuinely responsive to correction. The failure is structural, not behavioral. The solution isn't better prompting, it's automation: every pattern that recurs for 2+ sessions needs to become a PreToolUse hook.
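Both the half-life and the automation trigger are trivial to compute; the (identified, resolved) session-index pairs below are an assumed representation:

```python
def stability_half_life(patterns):
    """patterns: list of (identified_session, resolved_session) index
    pairs. Returns the mean number of sessions a pattern stays open."""
    if not patterns:
        return 0.0
    return sum(resolved - identified for identified, resolved in patterns) / len(patterns)

def needs_hook(pattern_sessions):
    """Automation rule: a pattern seen in 2+ distinct sessions must
    become a PreToolUse hook rather than another note."""
    return len(set(pattern_sessions)) >= 2
```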

The enforcement layer

The metrics don't mean much without enforcement. I built MirrorGate — a PreToolUse hook system that intercepts every tool call before execution. Current hooks:

# hook_decisions.jsonl — every gate decision logged
{"hook": "fact_check_hook", "decision": "block", "reason": "Known-wrong hardware spec: 48GB RAM", "epoch": 1740624000}
{"hook": "rules_compliance_check", "decision": "warn", "reason": "Deploy claim without verification", "epoch": 1740624120}
{"hook": "anti_rationalization", "decision": "block", "reason": "Spec claim without source", "epoch": 1740624240}

Every block and warn is counted against the Integrity Index. The gate is the enforcement layer; the metrics are the health layer. Together they form a closed loop:

Session Start
└── Load behavioral baseline (CONTINUITY.md, MISTAKES.md, last 5 critiques)
└── Compute D, RR, II — if D > 0.30: enter high-verification mode

During Session
└── Every tool call → PreToolUse gate → decision logged
└── cc_events.jsonl tracks all tool calls for VR computation

Session End
└── Agent writes self-critique: score, mistakes, recurring, automated
└── Recurring for 2+ sessions → mandatory hook automation
└── Metrics recomputed → dashboard updates
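The session-end step of the loop can be sketched in one function; the critique record shape here is an assumption:

```python
import statistics

def session_end(critiques):
    """critiques: list of {'session': int, 'score': int,
    'recurring': [pattern_ids]} records, one per session.
    Recomputes drift and flags patterns that now require automation."""
    scores = [c["score"] for c in critiques]
    sessions_by_pattern = {}
    for c in critiques:
        for pattern in c["recurring"]:
            sessions_by_pattern.setdefault(pattern, set()).add(c["session"])
    # recurring for 2+ sessions -> mandatory hook automation
    must_automate = sorted(
        p for p, s in sessions_by_pattern.items() if len(s) >= 2
    )
    d = statistics.pstdev(scores) / statistics.mean(scores) if scores else 0.0
    return {"drift": d,
            "automate": must_automate,
            "high_verification": d > 0.30}
```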

The Glass Box dashboard

All five metrics render live in a terminal dashboard I call the Glass Box — built with Python Rich, running at 4fps with blinking panels when any metric enters alert state.

◈ MIRRORDASH ─ Glass Box — AI Transparency ─────╱╲───── Every decision, tool, hook, and model. In real time. 11:05:28
━━━━━━━━━━━━━━━━━━━━━━━━━━ BEHAVIORAL METRICS ━━━━━━━━━━━━━━━━━━━━━━━━━━━
  11 sessions · 15 patterns tracked

  Integrity Index      54/100    RISK      Risk score. Target ≥80.
  Drift Coefficient    0.259     drifting  σ/μ of session scores. Target ≤0.15.
  Recurrence Rate      0.43      high      recurring/mistakes (15/35). Target ≤0.20.
  Verification Ratio   0.57      low       read/(read+write) (54/95). Target ≥0.67.
  Stability Half-Life  1.0       fast      15 patterns tracked. Target ≤1.5 sessions.

The dashboard is open-source at github.com/MirrorDNA-Reflection-Protocol/mirrordash. Run it with any YAML profile — Glass Box for AI transparency, ADHD for focus mode, SysAdmin for ops, Founder OS for KPIs.

Why this matters beyond my setup

Every team deploying AI agents for real work faces the same invisible problem: you can see what the agent did, but not how well it behaved. Task completion rates tell you nothing about integrity.

The five metrics I've defined are agent-agnostic and computable from plain JSONL logs. Any team logging tool calls, gate decisions, and session critiques can compute them. The hook schema is simple enough that any preexisting observability pipeline can produce it:

# hook_decisions.jsonl — proposed open standard
{
  "hook": "string",       // which rule fired
  "decision": "allow|warn|deny|block",
  "reason": "string",     // human-readable
  "target": "string",     // what action was intercepted
  "epoch": number         // unix timestamp
}
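Checking a record against the proposed schema takes a few lines. This sketch treats target as optional, since the log lines earlier in the article omit it:

```python
def valid_decision(record):
    """Validate one hook_decisions.jsonl record against the proposed
    schema. 'target' is treated as optional."""
    if not isinstance(record, dict):
        return False
    required = {"hook": str, "decision": str, "reason": str,
                "epoch": (int, float)}
    if not all(isinstance(record.get(k), t) for k, t in required.items()):
        return False
    if "target" in record and not isinstance(record["target"], str):
        return False
    return record["decision"] in {"allow", "warn", "deny", "block"}
```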

I'm proposing this as an open standard — ai-behavioral-governance — so behavioral metrics become comparable across teams and agents, not just within a single stack.

What's next

The live metrics are published at activemirror.ai/governance-live — updated each session. You can see whether my agent is improving or not, in public.

The immediate fix: convert my top-3 recurring patterns into PreToolUse hooks. A single afternoon of work should drop RR from 0.43 to below 0.25 and bring II back above 70.

The longer goal: 30 sessions of data for a proper longitudinal study. The question I want to answer: does governed AI actually outperform ungoverned AI on real work tasks over time? I believe yes. Now I have a way to measure it.

If you're running AI agents on real workloads and want to try the framework: the schema is simple, the computation is pure Python, and the dashboard is open-source. The only prerequisite is logging tool calls.


Paul Desai builds sovereign AI infrastructure at Active Mirror. activemirror.ai · GitHub