Research & Findings

Measuring AI Coding
Tool Effectiveness

A hook-based instrumentation framework across
Claude Code, Codex CLI, and Cursor

agentisd

April 2026

Burak Mert Köseoğlu — @mksglu

Prepared for Berkay Mollamustafaoğlu

The Problem

Research Question

"Engineering teams spend $42K/month on AI coding tools.
80% have zero usage metrics.
Can we measure what these tools actually do?"

$42K
avg. monthly spend
(50-seat engineering org)
80%
of teams with
zero usage metrics
63
internal analytics events
(Anthropic tengu_*, not exposed)

Source: Anthropic internal telemetry analysis — 63 tengu_* events tracked in bridge/replBridge.ts; total_cost_usd computed in cli/print.ts:SDKResult

Literature Review

The Productivity Paradox

PUBLISHED FINDINGS

METR 2025 — Randomized Controlled Trial

19% slower with AI · believed they were 20% faster

arXiv:2507.09089 — n=16, 246 tasks, mature OSS repos

Faros AI 2025 — The Verification Bottleneck

+21% tasks, +98% PRs — but +91% review time

10K+ developers. Organizational throughput: flat.

SonarSource Jan 2026

42% of code is AI-generated · 96% don't trust it

Also: DORA 2024 — -1.5% throughput, -7.2% stability with AI adoption (dora.dev, Figure 7, 39K respondents). METR Feb 2026 — RCT abandoned, 30-50% refused tasks without AI.

"Vendors track everything.
Customers track nothing."

THE MEASUREMENT GAP

What Anthropic Measures Internally

63 x tengu_* analytics events
total_cost_usd per session
token counts (input/output/cache)
model routing decisions
tool invocation patterns
session duration & outcomes

bridge/replBridge.ts, cli/print.ts

What Customers See
nothing

Market Validation

context-mode — We Already Have the Data Layer

context-mode is an open-source MCP plugin that saves 98% of the context window.
It already captures every tool call, every session, every outcome.

56K+
installs
(npm + marketplace)
6.5K
GitHub stars
in 3 months
12
platform adapters
Claude, Codex, Cursor, +9
Active community: 415 forks, 6.5K stars in 3 months, weekly npm downloads trending up. Open source under Elastic-2.0.

WHAT context-mode ALREADY TRACKS (per session)

• Tool call count & distribution
• Bytes processed vs context
• Session duration & compactions
• Error rate per tool
• Git operations (commit, push)
• File read/write/edit patterns
• Context savings ratio (98%)
• FTS5 search index usage
• Session lifecycle events

"The data already exists. ctx_agentisd makes it visible."

Methodology

Hook-Based Instrumentation

"AI coding tools expose hook events at every tool call."

SessionStart
PreToolUse
[Tool Executes]
PostToolUse
Stop
UserPromptSubmit
← async, captures prompts before processing

WIRE PROTOCOL

JSON stdin → hook script → JSON stdout

Same protocol across Claude Code, Codex CLI, and Cursor

cli/structuredIO.ts — Claude Code hook I/O types and processing

codex-rs/hooks/hook_runtime.rs — Codex CLI hook execution runtime

cursor/plugins/schemas — Cursor plugin hook schemas
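The wire protocol above can be sketched as a small hook script: one JSON event in on stdin, one JSON decision out on stdout. This is an illustrative sketch, not the shipped implementation — the field names (`tool_name`, `tool_input`, `session_id`) follow the data-model table in this document, and the `decision`/`reason` response shape is an assumption; the real schemas live in the platform sources cited above.

```typescript
// Hypothetical PreToolUse hook. Reads one JSON event from stdin, writes a
// JSON decision to stdout — the same wire shape on all three platforms.
import { stdin, stdout } from "node:process";

interface HookEvent {
  hook_event_name: string;              // e.g. "PreToolUse"
  session_id: string;
  tool_name?: string;                   // e.g. "Bash", "Edit"
  tool_input?: Record<string, unknown>;
}

interface HookDecision {
  decision: "allow" | "block";
  reason?: string;
}

// Example policy: block an obviously dangerous shell command.
export function decide(event: HookEvent): HookDecision {
  const cmd = String(event.tool_input?.command ?? "");
  if (event.tool_name === "Bash" && /rm\s+-rf\s+\//.test(cmd)) {
    return { decision: "block", reason: "dangerous command pattern" };
  }
  return { decision: "allow" };
}

// In a real hook, the host CLI invokes this once per tool call.
export async function main(): Promise<void> {
  const chunks: Buffer[] = [];
  for await (const chunk of stdin) chunks.push(chunk as Buffer);
  const event = JSON.parse(Buffer.concat(chunks).toString("utf8")) as HookEvent;
  stdout.write(JSON.stringify(decide(event)) + "\n");
}
```

Because the protocol is plain JSON over stdio, the same script can be registered with Claude Code, Codex CLI, or Cursor without changes.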

Data Model

What Data Flows Through Hooks

Field         | Claude Code | Codex CLI    | Cursor
tool_name     | PreToolUse  | PreToolUse   | preToolUse
tool_input    | full args   | full args    | full args
tool_output   | PostToolUse | PostToolUse  | postToolUse
session_id    | all hooks   | all hooks    | conversation_id
cost / tokens | SDKResult   | OTEL metrics | not exposed
exit_code     | PostToolUse | PostToolUse  | postToolUse

FIELD COVERAGE

Claude Code
95%
Codex CLI
93%
Cursor
62%
Sources: cli/structuredIO.ts (Claude Code), codex-rs/hooks/hook_runtime.rs (Codex), cursor/plugins/schemas (Cursor)
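A hypothetical normalizer shows how the field mapping above collapses into one record shape. Only Cursor renames the session key (`conversation_id`), and only Cursor lacks cost data; the `SessionEvent` type and `normalize` function are illustrative names, not part of any platform's API.

```typescript
// Sketch: reduce each platform's hook payload to one shared record.
type Platform = "claude-code" | "codex-cli" | "cursor";

export interface SessionEvent {
  platform: Platform;
  session_id: string;
  tool_name: string;
  exit_code: number | null;
  cost_usd: number | null; // null where the platform doesn't expose it
}

export function normalize(platform: Platform, raw: Record<string, any>): SessionEvent {
  return {
    platform,
    // Cursor reports conversation_id; Claude Code and Codex report session_id.
    session_id: String(raw.session_id ?? raw.conversation_id ?? ""),
    tool_name: String(raw.tool_name ?? ""),
    exit_code: typeof raw.exit_code === "number" ? raw.exit_code : null,
    // Cost: SDKResult (Claude Code) / OTEL (Codex) / not exposed (Cursor).
    cost_usd:
      platform === "cursor"
        ? null
        : typeof raw.cost_usd === "number"
          ? raw.cost_usd
          : null,
  };
}
```

The lower Cursor coverage figure falls out directly: two of the table's six fields arrive renamed or not at all.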

Core Analysis — Business Personas

Persona → Metric → Evidence → Action → ROI

CTO
Seat Utilization 69% (127/184)
AI ROI Score 2.8x
Cross-tool Cost $2.40 / $1.80 / $0
Hook: SessionStart session_id + SDKResult.total_cost_usd
→ Cut 57 idle seats → $27K/yr saved. Board deck: "AI returns 2.8x"
ROI: $27K/yr license savings + board-ready ROI proof
Engineering Manager
Session Effectiveness 72% deliverable
Edit→Test→Edit Cycles 12 vs 3 iter
Rework Rate 38% sessions
Hook: PostToolUse git commit + Edit file_path count
→ Coach Dev A (12 iter) toward Dev B (3 iter). Data-driven retros.
ROI: 2.6x sprint velocity + 60% fewer rework cycles
DevEx Lead
Onboarding 1.8 days (-62%)
Context Quality Score 78/100
"Explain it twice" 47 → 12 skills
Hook: SessionStart hash + UserPromptSubmit clustering
→ 47 repeated prompts converted to 12 reusable skills. Onboarding 6wk→2wk.
ROI: -62% onboarding time + zero repeated context

Core Analysis — Operational Personas

Persona → Metric → Evidence → Action → ROI

Security Officer
Permission Denials 342/mo + 12 MCP
Dangerous Cmd Blocks 89 blocked
Tool Access Audit Full trail
Hook: PreToolUse deny + tool_input pattern match
→ Complete audit trail per developer. Anomaly alerts <24h.
ROI: 12 MCP violations flagged + zero unauthorized access
FinOps Manager
Cost per Session $1.42 (-18%)
Model Mix 62S/28H/10O
Budget Utilization 3 over / 2 under
Hook: SDKResult.total_cost_usd + model metadata
→ Shift Opus→Sonnet saves $8K/mo. Right-size team budgets.
ROI: $8K/mo model optimization + 3 teams rebalanced
QA Lead
AI Bug Density 38% debugging
Test Pass Rate 67% (was 52%)
Error Rate by Platform 3.2/4.1/7.8%
Hook: PostToolUse exit_code + is_error per adapter
→ Deploy TDD skill → +15% pass rate. Data-backed platform choice.
ROI: +15% first-run pass + platform error benchmarks

Full Metric Coverage

52 Metrics Across 8 Personas

Persona             | Metrics | Claude Code | Codex CLI | Cursor | Top Impact
CTO                 | 8       | 7           | 7         | 5      | $27K/yr license savings
Engineering Manager | 7       | 7           | 7         | 5      | 2.6x sprint velocity
DevEx Lead          | 7       | 6           | 6         | 4      | 6wk → 2wk onboarding
Security Officer    | 5       | 5           | 5         | 3      | Full compliance audit
FinOps Manager      | 5       | 5           | 5         | 1      | 15-30% cost optimization
QA Lead             | 5       | 5           | 5         | 4      | Targeted tech debt sprints
Developer           | 5       | 5           | 5         | 4      | Personal mastery curve
Onboarding          | 5       | 5           | 5         | 3      | 4x faster ramp time
Context Sharing     | 5       | 4           | 4         | 2      | Knowledge compounds
TOTAL               | 52      | 88%         | 87%       | 57%    |

Full catalog: github.com/mksglu/agentisd/blob/main/docs/vc/metric-catalog-full.md

Analysis

Context Sharing Intelligence

"Teams explain the same thing to AI 47 times a week."

CLAUDE.md Freshness

"3 teams haven't updated project instructions in 45 days"

SessionStart content hash tracking detects stale context files

SessionStart hook → SHA-256 hash comparison
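The freshness check above can be sketched in a few lines: hash the context file at SessionStart, compare against the last stored hash, and flag files unchanged for too long. The 45-day threshold matches the example; the function names (`contentHash`, `checkFreshness`) are illustrative, not the shipped API.

```typescript
import { createHash } from "node:crypto";

export function contentHash(content: string): string {
  return createHash("sha256").update(content, "utf8").digest("hex");
}

export function isStale(lastChangedAt: Date, now: Date, maxAgeDays = 45): boolean {
  const ageDays = (now.getTime() - lastChangedAt.getTime()) / 86_400_000;
  return ageDays > maxAgeDays;
}

// On SessionStart: a changed hash resets the clock; an unchanged hash
// older than the threshold marks the context file stale.
export function checkFreshness(
  prev: { hash: string; changedAt: Date } | null,
  content: string,
  now: Date,
): { hash: string; changedAt: Date; stale: boolean } {
  const hash = contentHash(content);
  if (!prev || prev.hash !== hash) return { hash, changedAt: now, stale: false };
  return { hash, changedAt: prev.changedAt, stale: isStale(prev.changedAt, now) };
}
```

Hashing rather than diffing keeps the hook cheap: a 64-character digest per session is enough to detect "CLAUDE.md hasn't moved in 45 days."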

Skill Effectiveness

QA skill → 92% commit rate
Finance skill → 41% commit rate

PostToolUse skill invocation → outcome analysis

PostToolUse tool_name + git commit correlation
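The skill-effectiveness numbers above reduce to a commit rate: sessions that ended in a git commit after invoking a skill, divided by all sessions that invoked it. A minimal sketch, with an assumed record shape (`committed` derived from a PostToolUse `git commit` event):

```typescript
export interface SkillSession {
  skill: string;
  committed: boolean; // derived from a PostToolUse `git commit` event
}

// Commit rate per skill: committed sessions / all sessions using the skill.
export function commitRateBySkill(sessions: SkillSession[]): Map<string, number> {
  const totals = new Map<string, { used: number; committed: number }>();
  for (const s of sessions) {
    const t = totals.get(s.skill) ?? { used: 0, committed: 0 };
    t.used += 1;
    if (s.committed) t.committed += 1;
    totals.set(s.skill, t);
  }
  const rates = new Map<string, number>();
  for (const [skill, t] of totals) rates.set(skill, t.committed / t.used);
  return rates;
}
```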

"Explain It Twice" Detection

"5 engineers typed similar deploy instructions"

UserPromptSubmit cosine similarity clustering → auto-suggest skill creation

UserPromptSubmit prompt embedding analysis
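At its core the "explain it twice" detector is similarity clustering over submitted prompts. A production system would use embeddings, as noted above; the same idea can be shown with bag-of-words cosine similarity. The threshold value here is an illustrative assumption.

```typescript
// Word-count vector for a prompt.
function counts(text: string): Map<string, number> {
  const m = new Map<string, number>();
  for (const w of text.toLowerCase().split(/\W+/).filter(Boolean)) {
    m.set(w, (m.get(w) ?? 0) + 1);
  }
  return m;
}

// Cosine similarity between two prompts, in [0, 1].
export function cosineSim(a: string, b: string): number {
  const ca = counts(a), cb = counts(b);
  let dot = 0, na = 0, nb = 0;
  for (const v of ca.values()) na += v * v;
  for (const v of cb.values()) nb += v * v;
  for (const [w, v] of ca) dot += v * (cb.get(w) ?? 0);
  return na && nb ? dot / Math.sqrt(na * nb) : 0;
}

// Prompts above the threshold cluster as repeated context → skill candidate.
export function isRepeat(a: string, b: string, threshold = 0.8): boolean {
  return cosineSim(a, b) >= threshold;
}
```

When the same cluster recurs across several engineers (the "5 engineers typed similar deploy instructions" case), the cluster centroid becomes the suggested skill.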

Reference: Brian Scanlan, VP Engineering at Intercom — @brian_scanlan
30+ custom skills in production. JAMF-managed deployment. Weekly usage reports. Built over 6+ months of platform engineering. agentisd automates the detection.

Product Feature

Skill Marketplace & Team Management

Teams upload, share, and measure skills across Claude Code, Codex CLI, and Cursor.

Skill Management

• Upload skills to org marketplace (Git-backed)
• Domain-scoped: iOS, backend, shared, design-system
• Version control + automatic update propagation
• JAMF/MDM push for enterprise (non-optional skills)
• Cross-platform: same skill works on all 3 platforms

Skill Analytics (agentisd measures)

• Adoption rate per skill per team
• Effectiveness score (commit rate after skill use)
• Discovery latency (days from install to first use)
• Cross-team overlap detection → promote to shared
• Staleness alert (last updated > 30 days + high usage)

PLATFORM SUPPORT

Capability                   | Claude Code           | Codex CLI             | Cursor
Skill invocation detection   | PostToolUse           | PostToolUse           | postToolUse
Marketplace integration      | plugins.ts            | plugins/              | cursor-team-kit
Enterprise push (managed)    | admin settings        | config.toml           | Business plan
Skill effectiveness tracking | PostToolUse → outcome | PostToolUse → outcome | postToolUse → outcome

Industry Reference

Intercom: 6 Months of Manual Effort

Brian Scanlan, VP Engineering at Intercom — @brian_scanlan, March 2026

What Intercom Built                                  | Effort             | agentisd Metric
30+ analytics skills (Snowflake, Gong, finance, QA)  | 6+ months          | Skill adoption tracking
JAMF deployment to 200+ engineers                    | Enterprise MDM     | Marketplace health score
Weekly usage reports & quality evals                 | Custom Snowflake   | Skill effectiveness score
QA skill: 7-stage pipeline → GitHub issues           | Weeks              | Test pass rate tracking
Code review agents with quality filters              | Custom dev         | Session effectiveness
Weekly CLAUDE.md fact-check GitHub Action            | Automation         | Context freshness score
Incident/troubleshooting with progressive disclosure | Months             | "Explain it twice" detection
All runbooks followable by Claude in 6 weeks         | Systematic program | Runbook→skill coverage

Intercom spent 6+ months of platform engineering.
agentisd automates the measurement from day one.

Platform Analysis

Platform Coverage Matrix

Capability       | Claude Code    | Codex CLI      | Cursor
Hook Events      | 5              | 5 + Stop       | 5 (stop, afterAgentResponse)
PreToolUse       | block + modify | block + modify | block only
PostToolUse      | full output    | full output    | read-only
SessionStart     | inject context | inject context | unreliable
UserPromptSubmit |                |                |
Token / Cost     | native         | OTEL           | not exposed
Coverage         | 95% (28/30)    | 93% (27/30)    | 62% (16/30)

Claude Code: cli/structuredIO.ts, cli/print.ts:SDKResult

Codex CLI: codex-rs/hooks/hook_runtime.rs, OTEL codex.cost_usd

Cursor: cursor/plugins/schemas, cursor/coreCommands

Adoption Intelligence

Adoption by Team & Seniority Level

ADOPTION BY SENIORITY

Senior Engineers 89%
Mid-Level Engineers 72%
Junior Engineers 45%
Design Engineers 31%

Juniors: 45% adoption but 2.1x slower time-to-solution → training opportunity

ADOPTION BY TEAM

Team       | Score | Sessions/wk
Platform   | 92    | 1,247
API        | 87    | 983
Frontend   | 71    | 756
Mobile     | 54    | 312
Design Eng | 43    | 89

Mobile: score 54, low session count → investigate blockers or tool fit

Engineering Teams (Platform, API, Backend)

Score: 87-92 · Sessions: 983-1,247/wk
High adoption, high effectiveness. Focus: optimize model mix, reduce cost.

Source: PostToolUse session frequency + commit rate

Design & Non-Engineering Teams

Score: 31-54 · Sessions: 89-312/wk
Low adoption. Action: specialized skills, onboarding, or re-evaluate tool fit.

Source: Same hooks — low adoption is the signal itself

DATA PIPELINE — HOW THIS WORKS

From Hooks (automatic)

• session_id → unique developer
• project_dir → repo/project
• tool calls → usage patterns
• error rate → effectiveness
• commit rate → productivity

Onboarding Form (developer self-selects)

• Email from git config
• Team: Platform / API / Frontend / ...
• Level: Junior / Mid / Senior / Staff
Filled once at plugin install.
Email links Claude/Codex/Cursor identity.
Org inferred from email domain.

Computed Metrics

adoption = active / total
score = 0.3 × adoption + 0.3 × productive_rate + 0.2 × (1 − error_rate) + 0.2 × tool_diversity

All 3 platforms (Claude Code, Codex CLI, Cursor). Seniority can be admin-configured or auto-inferred after 4 weeks of session data.
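The scoring formula above, as code. All inputs are rates in [0, 1] and the weights sum to 1, so the score is also in [0, 1]; the `TeamStats` field names are illustrative.

```typescript
export interface TeamStats {
  adoption: number;       // active developers / total seats
  productiveRate: number; // sessions ending in a commit / all sessions
  errorRate: number;      // failed tool calls / all tool calls
  toolDiversity: number;  // distinct tools used / tools available
}

// score = 0.3·adoption + 0.3·productive_rate + 0.2·(1 − error_rate) + 0.2·tool_diversity
export function adoptionScore(s: TeamStats): number {
  return (
    0.3 * s.adoption +
    0.3 * s.productiveRate +
    0.2 * (1 - s.errorRate) +
    0.2 * s.toolDiversity
  );
}
```

Note that error rate enters inverted: a team that uses the tools heavily but fails often scores below one that uses them less but cleanly.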

Competitive Landscape

Why Existing Tools Can't Do This

Every developer analytics tool today is git-based.
They measure the output. They can't see the process.

Capability Jellyfish DX Swarmia LinearB Sleuth agentisd
Data source Git + Jira Git + Surveys Git + Jira Git only Config layer Hook-level sessions
AI session observation governance
Edit→Test→Edit cycles
Cost per AI session per-skill
Context quality scoring catalog versioning
"Explain it twice" detection
Cross-tool comparison inferred inferred inferred distribution
AI tool ROI git-derived git-derived adoption only

Git-based tools

Measure commits, PRs, cycle time.
Can't see what happens inside an AI session.

agentisd (hook-based)

Observes every tool call, every iteration, every outcome.
Data that is structurally impossible for git-based tools.

Competitor data verified via jellyfish.co, getdx.com, swarmia.com, linearb.io, sleuth.io (April 2026). Jellyfish/DX "AI Impact" modules infer AI usage from git metadata, not session data.

System Design

Architecture

Two sibling products, not parent-child. ctx_agentisd sends nothing to cloud.

ctx_agentisd — Local MCP Tool (inside context-mode)

Claude Code / Codex / Cursor
↓ hook events
hooks (SessionStart, PreToolUse, PostToolUse, Stop)
↓ structured writes
SessionDB (SQLite)
↓ ctx_agentisd MCP tool
localhost browser dashboard
shadcn UI

Purely local. Free. No cloud dependency.

agentisd Cloud — Separate Product

Cloudflare D1
Team dashboards
Org-wide analytics

Paid, Separate Infrastructure

k-anonymity ≥ 5
No raw events, no code, no prompts
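The k-anonymity gate can be sketched as a rollup that suppresses small groups: a per-team aggregate is emitted only when at least k distinct developers contribute, so no cloud row traces back to fewer than k people. k = 5 matches the guarantee above; the record shape is an assumption.

```typescript
export interface DevAggregate {
  team: string;
  developerId: string;
  sessions: number;
}

// Per-team session totals, emitted only for groups with >= k distinct devs.
export function teamRollup(rows: DevAggregate[], k = 5): Map<string, number> {
  const byTeam = new Map<string, { devs: Set<string>; sessions: number }>();
  for (const r of rows) {
    const t = byTeam.get(r.team) ?? { devs: new Set<string>(), sessions: 0 };
    t.devs.add(r.developerId);
    t.sessions += r.sessions;
    byTeam.set(r.team, t);
  }
  const out = new Map<string, number>();
  for (const [team, t] of byTeam) {
    if (t.devs.size >= k) out.set(team, t.sessions); // suppress small groups
  }
  return out;
}
```

Suppression happens before anything leaves the org's aggregate store, which is what makes "no raw events, no code, no prompts" enforceable rather than aspirational.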

Developer Experience

What Developers See

Developer types ctx_agentisd or /agentisd — browser opens with local SessionDB metrics

Session Duration
45m
Commits
2
Files Touched
7
Error Rate
8.5%

Tool Distribution

Bash 34
Edit 12
Read 8
Iteration Cycles (Edit → Test → Edit)
3
Context Quality
CLAUDE.md updated 2d ago ✓

"ctx_agentisd is a local MCP tool. No separate install. No cloud. Just your data in your browser."

Team analytics is a separate product: agentisd cloud (Cloudflare D1).

Evidence

Evidence Summary

Finding 1
"The Measurement Gap"

63 tengu_* analytics events tracked internally by Anthropic. total_cost_usd computed per session.

Zero exposed to customers.

bridge/replBridge.ts

cli/print.ts:SDKResult

Finding 2
"Production-Ready Hook Protocol"

5 hook events, Zod-validated JSON I/O, blocking + async modes. Identical wire protocol across 3 platforms.

cli/structuredIO.ts

codex-rs/hooks/hook_runtime.rs

Finding 3
"Enterprise Demand Signal"

forceLoginMethod (SSO), organization.uuid, maxBudgetUsd, allowedMcpServers, RBAC roles, trusted devices.

bridge/types.ts

cli/auth/

Business Model

Pricing & Financial Projection

Free — ctx_agentisd

$0

MCP tool inside context-mode. Local dashboard.

• Personal session analytics
• Tool distribution & error rate
• Iteration cycles & commit tracking
• Context quality score
• All data stays on your machine

56K installs, 6.5K GitHub stars

agentisd — Enterprise Product

$18/seat/mo

Team

$34/seat/mo

Enterprise

• Cross-developer & cross-tool comparison
• CTO board view & AI ROI dashboard
• Onboarding velocity tracking
• Context sharing intelligence
• SSO/SAML, audit logs, SLA

Financial Projection (50-seat engineering org)

$42K
current AI spend/mo
$27K
annual license waste found
$900
agentisd cost/mo (50×$18)
30x
ROI (annual savings ÷ monthly cost)

Conclusion

52 metrics. 8 personas.
3 platforms. Source-code proven.

context-mode (open source)

Data collection layer. 56K installs.
ctx_agentisd: free local dashboard.

agentisd (enterprise product)

Team analytics. Cloudflare D1.
$18-34/seat. Privacy-first.

"You measure everything about your code.
Why not the AI writing it?"

Research & documentation: github.com/mksglu/agentisd