Research & Findings

Measuring AI Coding
Tool Effectiveness

A hook-based instrumentation framework across
Claude Code, Codex CLI, and Cursor

agentisd

April 2026

Burak Mert Köseoğlu — @mksglu

Prepared for Berkay Mollamustafaoğlu

The Problem

Research Question

"Engineering teams spend $42K/month on AI coding tools.
80% have zero usage metrics.
Can we measure what these tools actually do?"

$42K
avg. monthly spend
(50-seat engineering org)
80%
of teams with
zero usage metrics
63
internal analytics events
(Anthropic tengu_*, not exposed)

Source: Anthropic internal telemetry analysis — 63 tengu_* events tracked in bridge/replBridge.ts; total_cost_usd computed in cli/print.ts:SDKResult

Literature Review

The Productivity Paradox

PUBLISHED FINDINGS

METR 2025 — Randomized Controlled Trial

19% slower with AI · believed they were 20% faster

arXiv:2507.09089 — n=16, 246 tasks, mature OSS repos

Faros AI 2025 — The Verification Bottleneck

+21% tasks, +98% PRs — but +91% review time

10K+ developers. Organizational throughput: flat.

SonarSource Jan 2026

42% of code is AI-generated · 96% don't trust it

Also: DORA 2024 — -1.5% throughput, -7.2% stability with AI adoption (dora.dev, Figure 7, 39K respondents). METR Feb 2026 — RCT abandoned, 30-50% refused tasks without AI.

"Vendors track everything.
Customers track nothing."

THE MEASUREMENT GAP

What Anthropic Measures Internally

63 x tengu_* analytics events
total_cost_usd per session
token counts (input/output/cache)
model routing decisions
tool invocation patterns
session duration & outcomes

bridge/replBridge.ts, cli/print.ts

What Customers See
nothing

Market Validation

context-mode — We Already Have the Data Layer

context-mode is an open-source MCP plugin that saves 98% of the context window.
It already captures every tool call, every session, every outcome.

56K+
installs
(npm + marketplace)
6.5K
GitHub stars
in 3 months
12
platform adapters
Claude, Codex, Cursor, +9
Active community: 415 forks, 6.5K stars in 3 months, weekly npm downloads trending up. Open source under Elastic-2.0.

WHAT context-mode ALREADY TRACKS (per session)

• Tool call count & distribution
• Bytes processed vs context
• Session duration & compactions
• Error rate per tool
• Git operations (commit, push)
• File read/write/edit patterns
• Context savings ratio (98%)
• FTS5 search index usage
• Session lifecycle events

"The data already exists. ctx_agentisd makes it visible."

Methodology

Hook-Based Instrumentation

"AI coding tools expose hook events at every tool call."

SessionStart
PreToolUse
[Tool Executes]
PostToolUse
Stop
UserPromptSubmit
← async, captures prompts before processing

WIRE PROTOCOL

JSON stdin → hook script → JSON stdout

Same protocol across Claude Code, Codex CLI, and Cursor

cli/structuredIO.ts — Claude Code hook I/O types and processing

codex-rs/hooks/hook_runtime.rs — Codex CLI hook execution runtime

cursor/plugins/schemas — Cursor plugin hook schemas
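The wire protocol above can be sketched as a small hook script: one JSON event in on stdin, one JSON decision out on stdout. This is an illustrative sketch, not the shipped implementation — the field names (`tool_name`, `tool_input`, `session_id`) follow the data-model table in this document, and the `decision`/`reason` response shape is an assumption; the real schemas live in the platform sources cited above.

```typescript
// Hypothetical PreToolUse hook. Reads one JSON event from stdin, writes a
// JSON decision to stdout — the same wire shape on all three platforms.
import { stdin, stdout } from "node:process";

interface HookEvent {
  hook_event_name: string;              // e.g. "PreToolUse"
  session_id: string;
  tool_name?: string;                   // e.g. "Bash", "Edit"
  tool_input?: Record<string, unknown>;
}

interface HookDecision {
  decision: "allow" | "block";
  reason?: string;
}

// Example policy: block an obviously dangerous shell command.
export function decide(event: HookEvent): HookDecision {
  const cmd = String(event.tool_input?.command ?? "");
  if (event.tool_name === "Bash" && /rm\s+-rf\s+\//.test(cmd)) {
    return { decision: "block", reason: "dangerous command pattern" };
  }
  return { decision: "allow" };
}

// In a real hook, the host CLI invokes this once per tool call.
export async function main(): Promise<void> {
  const chunks: Buffer[] = [];
  for await (const chunk of stdin) chunks.push(chunk as Buffer);
  const event = JSON.parse(Buffer.concat(chunks).toString("utf8")) as HookEvent;
  stdout.write(JSON.stringify(decide(event)) + "\n");
}
```

Because the protocol is plain JSON over stdio, the same script can be registered with Claude Code, Codex CLI, or Cursor without changes.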

Data Model

What Data Flows Through Hooks

Field         | Claude Code | Codex CLI    | Cursor
tool_name     | PreToolUse  | PreToolUse   | preToolUse
tool_input    | full args   | full args    | full args
tool_output   | PostToolUse | PostToolUse  | postToolUse
session_id    | all hooks   | all hooks    | conversation_id
cost / tokens | SDKResult   | OTEL metrics | not exposed
exit_code     | PostToolUse | PostToolUse  | postToolUse

FIELD COVERAGE

Claude Code
95%
Codex CLI
93%
Cursor
62%
Sources: cli/structuredIO.ts (Claude Code), codex-rs/hooks/hook_runtime.rs (Codex), cursor/plugins/schemas (Cursor)
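A hypothetical normalizer shows how the field mapping above collapses into one record shape. Only Cursor renames the session key (`conversation_id`), and only Cursor lacks cost data; the `SessionEvent` type and `normalize` function are illustrative names, not part of any platform's API.

```typescript
// Sketch: reduce each platform's hook payload to one shared record.
type Platform = "claude-code" | "codex-cli" | "cursor";

export interface SessionEvent {
  platform: Platform;
  session_id: string;
  tool_name: string;
  exit_code: number | null;
  cost_usd: number | null; // null where the platform doesn't expose it
}

export function normalize(platform: Platform, raw: Record<string, any>): SessionEvent {
  return {
    platform,
    // Cursor reports conversation_id; Claude Code and Codex report session_id.
    session_id: String(raw.session_id ?? raw.conversation_id ?? ""),
    tool_name: String(raw.tool_name ?? ""),
    exit_code: typeof raw.exit_code === "number" ? raw.exit_code : null,
    // Cost: SDKResult (Claude Code) / OTEL (Codex) / not exposed (Cursor).
    cost_usd:
      platform === "cursor"
        ? null
        : typeof raw.cost_usd === "number"
          ? raw.cost_usd
          : null,
  };
}
```

The lower Cursor coverage figure falls out directly: two of the table's six fields arrive renamed or not at all.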

Core Analysis — Business Personas

Persona → Metric → Evidence → Action → ROI

CTO
Seat Utilization 69% (127/184)
AI ROI Score 2.8x
Cross-tool Cost $2.40 / $1.80 / $0
Hook: SessionStart session_id + SDKResult.total_cost_usd
→ Cut 57 idle seats → $27K/yr saved. Board deck: "AI returns 2.8x"
ROI: $27K/yr license savings + board-ready ROI proof
Engineering Manager
Session Effectiveness 72% deliverable
Edit→Test→Edit Cycles 12 vs 3 iter
Rework Rate 38% sessions
Hook: PostToolUse git commit + Edit file_path count
→ Coach Dev A (12 iter) toward Dev B (3 iter). Data-driven retros.
ROI: 2.6x sprint velocity + 60% fewer rework cycles
DevEx Lead
Onboarding 1.8 days (-62%)
Context Quality Score 78/100
"Explain it twice" 47 → 12 skills
Hook: SessionStart hash + UserPromptSubmit clustering
→ 47 repeated prompts converted to 12 reusable skills. Onboarding 6wk→2wk.
ROI: -62% onboarding time + zero repeated context

Core Analysis — Operational Personas

Persona → Metric → Evidence → Action → ROI

Security Officer
Permission Denials 342/mo + 12 MCP
Dangerous Cmd Blocks 89 blocked
Tool Access Audit Full trail
Hook: PreToolUse deny + tool_input pattern match
→ Complete audit trail per developer. Anomaly alerts <24h.
ROI: 12 MCP violations flagged + zero unauthorized access
FinOps Manager
Cost per Session $1.42 (-18%)
Model Mix 62S/28H/10O
Budget Utilization 3 over / 2 under
Hook: SDKResult.total_cost_usd + model metadata
→ Shift Opus→Sonnet saves $8K/mo. Right-size team budgets.
ROI: $8K/mo model optimization + 3 teams rebalanced
QA Lead
AI Bug Density 38% debugging
Test Pass Rate 67% (was 52%)
Error Rate by Platform 3.2/4.1/7.8%
Hook: PostToolUse exit_code + is_error per adapter
→ Deploy TDD skill → +15% pass rate. Data-backed platform choice.
ROI: +15% first-run pass + platform error benchmarks

Full Metric Coverage

52 Metrics Across 8 Personas

Persona             | Metrics | Claude Code | Codex CLI | Cursor | Top Impact
CTO                 | 8       | 7           | 7         | 5      | $27K/yr license savings
Engineering Manager | 7       | 7           | 7         | 5      | 2.6x sprint velocity
DevEx Lead          | 7       | 6           | 6         | 4      | 6wk → 2wk onboarding
Security Officer    | 5       | 5           | 5         | 3      | Full compliance audit
FinOps Manager      | 5       | 5           | 5         | 1      | 15-30% cost optimization
QA Lead             | 5       | 5           | 5         | 4      | Targeted tech debt sprints
Developer           | 5       | 5           | 5         | 4      | Personal mastery curve
Onboarding          | 5       | 5           | 5         | 3      | 4x faster ramp time
Context Sharing     | 5       | 4           | 4         | 2      | Knowledge compounds
TOTAL               | 52      | 88%         | 87%       | 57%    |

Full catalog: github.com/mksglu/agentisd/blob/main/docs/vc/metric-catalog-full.md

Analysis

Context Sharing Intelligence

"Teams explain the same thing to AI 47 times a week."

CLAUDE.md Freshness

"3 teams haven't updated project instructions in 45 days"

SessionStart content hash tracking detects stale context files

SessionStart hook → SHA-256 hash comparison
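The freshness check above can be sketched in a few lines: hash the context file at SessionStart, compare against the last stored hash, and flag files unchanged for too long. The 45-day threshold matches the example; the function names (`contentHash`, `checkFreshness`) are illustrative, not the shipped API.

```typescript
import { createHash } from "node:crypto";

export function contentHash(content: string): string {
  return createHash("sha256").update(content, "utf8").digest("hex");
}

export function isStale(lastChangedAt: Date, now: Date, maxAgeDays = 45): boolean {
  const ageDays = (now.getTime() - lastChangedAt.getTime()) / 86_400_000;
  return ageDays > maxAgeDays;
}

// On SessionStart: a changed hash resets the clock; an unchanged hash
// older than the threshold marks the context file stale.
export function checkFreshness(
  prev: { hash: string; changedAt: Date } | null,
  content: string,
  now: Date,
): { hash: string; changedAt: Date; stale: boolean } {
  const hash = contentHash(content);
  if (!prev || prev.hash !== hash) return { hash, changedAt: now, stale: false };
  return { hash, changedAt: prev.changedAt, stale: isStale(prev.changedAt, now) };
}
```

Hashing rather than diffing keeps the hook cheap: a 64-character digest per session is enough to detect "CLAUDE.md hasn't moved in 45 days."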

Skill Effectiveness

QA skill → 92% commit rate
Finance skill → 41% commit rate

PostToolUse skill invocation → outcome analysis

PostToolUse tool_name + git commit correlation
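The skill-effectiveness numbers above reduce to a commit rate: sessions that ended in a git commit after invoking a skill, divided by all sessions that invoked it. A minimal sketch, with an assumed record shape (`committed` derived from a PostToolUse `git commit` event):

```typescript
export interface SkillSession {
  skill: string;
  committed: boolean; // derived from a PostToolUse `git commit` event
}

// Commit rate per skill: committed sessions / all sessions using the skill.
export function commitRateBySkill(sessions: SkillSession[]): Map<string, number> {
  const totals = new Map<string, { used: number; committed: number }>();
  for (const s of sessions) {
    const t = totals.get(s.skill) ?? { used: 0, committed: 0 };
    t.used += 1;
    if (s.committed) t.committed += 1;
    totals.set(s.skill, t);
  }
  const rates = new Map<string, number>();
  for (const [skill, t] of totals) rates.set(skill, t.committed / t.used);
  return rates;
}
```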

"Explain It Twice" Detection

"5 engineers typed similar deploy instructions"

UserPromptSubmit cosine similarity clustering → auto-suggest skill creation

UserPromptSubmit prompt embedding analysis
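At its core the "explain it twice" detector is similarity clustering over submitted prompts. A production system would use embeddings, as noted above; the same idea can be shown with bag-of-words cosine similarity. The threshold value here is an illustrative assumption.

```typescript
// Word-count vector for a prompt.
function counts(text: string): Map<string, number> {
  const m = new Map<string, number>();
  for (const w of text.toLowerCase().split(/\W+/).filter(Boolean)) {
    m.set(w, (m.get(w) ?? 0) + 1);
  }
  return m;
}

// Cosine similarity between two prompts, in [0, 1].
export function cosineSim(a: string, b: string): number {
  const ca = counts(a), cb = counts(b);
  let dot = 0, na = 0, nb = 0;
  for (const v of ca.values()) na += v * v;
  for (const v of cb.values()) nb += v * v;
  for (const [w, v] of ca) dot += v * (cb.get(w) ?? 0);
  return na && nb ? dot / Math.sqrt(na * nb) : 0;
}

// Prompts above the threshold cluster as repeated context → skill candidate.
export function isRepeat(a: string, b: string, threshold = 0.8): boolean {
  return cosineSim(a, b) >= threshold;
}
```

When the same cluster recurs across several engineers (the "5 engineers typed similar deploy instructions" case), the cluster centroid becomes the suggested skill.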

Reference: Brian Scanlan, VP Engineering at Intercom — @brian_scanlan
30+ custom skills in production. JAMF-managed deployment. Weekly usage reports. Built over 6+ months of platform engineering. agentisd automates the detection.

Product Feature

Skill Marketplace & Team Management

Teams upload, share, and measure skills across Claude Code, Codex CLI, and Cursor.

Skill Management

• Upload skills to org marketplace (Git-backed)
• Domain-scoped: iOS, backend, shared, design-system
• Version control + automatic update propagation
• JAMF/MDM push for enterprise (non-optional skills)
• Cross-platform: same skill works on all 3 platforms

Skill Analytics (agentisd measures)

• Adoption rate per skill per team
• Effectiveness score (commit rate after skill use)
• Discovery latency (days from install to first use)
• Cross-team overlap detection → promote to shared
• Staleness alert (last updated > 30 days + high usage)

PLATFORM SUPPORT

Capability                   | Claude Code           | Codex CLI             | Cursor
Skill invocation detection   | PostToolUse           | PostToolUse           | postToolUse
Marketplace integration      | plugins.ts            | plugins/              | cursor-team-kit
Enterprise push (managed)    | admin settings        | config.toml           | Business plan
Skill effectiveness tracking | PostToolUse → outcome | PostToolUse → outcome | postToolUse → outcome

Industry Reference

Intercom: 6 Months of Manual Effort

Brian Scanlan, VP Engineering at Intercom — @brian_scanlan, March 2026

What Intercom Built                                  | Effort             | agentisd Metric
30+ analytics skills (Snowflake, Gong, finance, QA)  | 6+ months          | Skill adoption tracking
JAMF deployment to 200+ engineers                    | Enterprise MDM     | Marketplace health score
Weekly usage reports & quality evals                 | Custom Snowflake   | Skill effectiveness score
QA skill: 7-stage pipeline → GitHub issues           | Weeks              | Test pass rate tracking
Code review agents with quality filters              | Custom dev         | Session effectiveness
Weekly CLAUDE.md fact-check GitHub Action            | Automation         | Context freshness score
Incident/troubleshooting with progressive disclosure | Months             | "Explain it twice" detection
All runbooks followable by Claude in 6 weeks         | Systematic program | Runbook→skill coverage

Intercom spent 6+ months of platform engineering.
agentisd automates the measurement from day one.

Platform Analysis

Platform Coverage Matrix

Capability       | Claude Code    | Codex CLI      | Cursor
Hook Events      | 5              | 5 + Stop       | 5 (stop, afterAgentResponse)
PreToolUse       | block + modify | block + modify | block only
PostToolUse      | full output    | full output    | read-only
SessionStart     | inject context | inject context | unreliable
UserPromptSubmit |                |                |
Token / Cost     | native         | OTEL           | not exposed
Coverage         | 95% (28/30)    | 93% (27/30)    | 62% (16/30)

Claude Code: cli/structuredIO.ts, cli/print.ts:SDKResult

Codex CLI: codex-rs/hooks/hook_runtime.rs, OTEL codex.cost_usd

Cursor: cursor/plugins/schemas, cursor/coreCommands

Adoption Intelligence

Adoption by Team & Seniority Level

ADOPTION BY SENIORITY

Senior Engineers 89%
Mid-Level Engineers 72%
Junior Engineers 45%
Design Engineers 31%

Juniors: 45% adoption but 2.1x slower time-to-solution → training opportunity

ADOPTION BY TEAM

Team       | Score | Sessions/wk
Platform   | 92    | 1,247
API        | 87    | 983
Frontend   | 71    | 756
Mobile     | 54    | 312
Design Eng | 43    | 89

Mobile: score 54, low session count → investigate blockers or tool fit

Engineering Teams (Platform, API, Backend)

Score: 87-92 · Sessions: 983-1,247/wk
High adoption, high effectiveness. Focus: optimize model mix, reduce cost.

Source: PostToolUse session frequency + commit rate

Design & Non-Engineering Teams

Score: 31-54 · Sessions: 89-312/wk
Low adoption. Action: specialized skills, onboarding, or re-evaluate tool fit.

Source: Same hooks — low adoption is the signal itself

DATA PIPELINE — HOW THIS WORKS

From Hooks (automatic)

• session_id → unique developer
• project_dir → repo/project
• tool calls → usage patterns
• error rate → effectiveness
• commit rate → productivity

Onboarding Form (developer self-selects)

• Email from git config
• Team: Platform / API / Frontend / ...
• Level: Junior / Mid / Senior / Staff
Filled once at plugin install.
Email links Claude/Codex/Cursor identity.
Org inferred from email domain.

Computed Metrics

adoption = active / total
score = 0.3 × adoption + 0.3 × productive_rate + 0.2 × (1 − error_rate) + 0.2 × tool_diversity

All 3 platforms (Claude Code, Codex CLI, Cursor). Seniority can be admin-configured or auto-inferred after 4 weeks of session data.
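The scoring formula above, as code. All inputs are rates in [0, 1] and the weights sum to 1, so the score is also in [0, 1]; the `TeamStats` field names are illustrative.

```typescript
export interface TeamStats {
  adoption: number;       // active developers / total seats
  productiveRate: number; // sessions ending in a commit / all sessions
  errorRate: number;      // failed tool calls / all tool calls
  toolDiversity: number;  // distinct tools used / tools available
}

// score = 0.3·adoption + 0.3·productive_rate + 0.2·(1 − error_rate) + 0.2·tool_diversity
export function adoptionScore(s: TeamStats): number {
  return (
    0.3 * s.adoption +
    0.3 * s.productiveRate +
    0.2 * (1 - s.errorRate) +
    0.2 * s.toolDiversity
  );
}
```

Note that error rate enters inverted: a team that uses the tools heavily but fails often scores below one that uses them less but cleanly.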

Competitive Landscape

Why Existing Tools Can't Do This

Every developer analytics tool today is git-based.
They measure the output. They can't see the process.

Capability Jellyfish DX Swarmia LinearB Sleuth agentisd
Data source Git + Jira Git + Surveys Git + Jira Git only Config layer Hook-level sessions
AI session observation governance
Edit→Test→Edit cycles
Cost per AI session per-skill
Context quality scoring catalog versioning
"Explain it twice" detection
Cross-tool comparison inferred inferred inferred distribution
AI tool ROI git-derived git-derived adoption only

Git-based tools

Measure commits, PRs, cycle time.
Can't see what happens inside an AI session.

agentisd (hook-based)

Observes every tool call, every iteration, every outcome.
Data that is structurally impossible for git-based tools.

Competitor data verified via jellyfish.co, getdx.com, swarmia.com, linearb.io, sleuth.io (April 2026). Jellyfish/DX "AI Impact" modules infer AI usage from git metadata, not session data.

System Design

Architecture

Two sibling products, not parent-child. ctx_agentisd sends nothing to cloud.

ctx_agentisd — Local MCP Tool (inside context-mode)

Claude Code / Codex / Cursor
↓ hook events
hooks (SessionStart, PreToolUse, PostToolUse, Stop)
↓ structured writes
SessionDB (SQLite)
↓ ctx_agentisd MCP tool
localhost browser dashboard
shadcn UI

Purely local. Free. No cloud dependency.

agentisd Cloud — Separate Product

Cloudflare D1
Team dashboards
Org-wide analytics

Paid, Separate Infrastructure

k-anonymity ≥ 5
No raw events, no code, no prompts
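The k-anonymity gate can be sketched as a rollup that suppresses small groups: a per-team aggregate is emitted only when at least k distinct developers contribute, so no cloud row traces back to fewer than k people. k = 5 matches the guarantee above; the record shape is an assumption.

```typescript
export interface DevAggregate {
  team: string;
  developerId: string;
  sessions: number;
}

// Per-team session totals, emitted only for groups with >= k distinct devs.
export function teamRollup(rows: DevAggregate[], k = 5): Map<string, number> {
  const byTeam = new Map<string, { devs: Set<string>; sessions: number }>();
  for (const r of rows) {
    const t = byTeam.get(r.team) ?? { devs: new Set<string>(), sessions: 0 };
    t.devs.add(r.developerId);
    t.sessions += r.sessions;
    byTeam.set(r.team, t);
  }
  const out = new Map<string, number>();
  for (const [team, t] of byTeam) {
    if (t.devs.size >= k) out.set(team, t.sessions); // suppress small groups
  }
  return out;
}
```

Suppression happens before anything leaves the org's aggregate store, which is what makes "no raw events, no code, no prompts" enforceable rather than aspirational.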

Developer Experience

What Developers See

Developer types ctx_agentisd or /agentisd — browser opens with local SessionDB metrics

Session Duration
45m
Commits
2
Files Touched
7
Error Rate
8.5%

Tool Distribution

Bash 34
Edit 12
Read 8
Iteration Cycles (Edit → Test → Edit)
3
Context Quality
CLAUDE.md updated 2d ago ✓

"ctx_agentisd is a local MCP tool. No separate install. No cloud. Just your data in your browser."

Team analytics is a separate product: agentisd cloud (Cloudflare D1).

Evidence

Evidence Summary

Finding 1
"The Measurement Gap"

63 tengu_* analytics events tracked internally by Anthropic. total_cost_usd computed per session.

Zero exposed to customers.

bridge/replBridge.ts

cli/print.ts:SDKResult

Finding 2
"Production-Ready Hook Protocol"

5 hook events, Zod-validated JSON I/O, blocking + async modes. Identical wire protocol across 3 platforms.

cli/structuredIO.ts

codex-rs/hooks/hook_runtime.rs

Finding 3
"Enterprise Demand Signal"

forceLoginMethod (SSO), organization.uuid, maxBudgetUsd, allowedMcpServers, RBAC roles, trusted devices.

bridge/types.ts

cli/auth/

Business Model

Pricing & Financial Projection

Free — ctx_agentisd

$0

MCP tool inside context-mode. Local dashboard.

• Personal session analytics
• Tool distribution & error rate
• Iteration cycles & commit tracking
• Context quality score
• All data stays on your machine

56K installs, 6.5K GitHub stars

agentisd — Enterprise Product

$18/seat/mo

Team

$34/seat/mo

Enterprise

• Cross-developer & cross-tool comparison
• CTO board view & AI ROI dashboard
• Onboarding velocity tracking
• Context sharing intelligence
• SSO/SAML, audit logs, SLA

Financial Projection (50-seat engineering org)

$42K
current AI spend/mo
$27K
annual license waste found
$900
agentisd cost/mo (50×$18)
30x
ROI (annual savings ÷ monthly cost)

Conclusion

52 metrics. 8 personas.
3 platforms. Source-code proven.

context-mode (open source)

Data collection layer. 56K installs.
ctx_agentisd: free local dashboard.

agentisd (enterprise product)

Team analytics. Cloudflare D1.
$18-34/seat. Privacy-first.

"You measure everything about your code.
Why not the AI writing it?"

Research & documentation: github.com/mksglu/agentisd