Series: Cursor Agent Harness (Part 3 of 3)
Part 1: Lightweight harness — you keep model and mode control
Part 2: Four-tier memory loop
Cost context: Copilot vs OpenRouter pricing
The problem: you cannot optimize what you do not log
Leaderboard harnesses report accuracy on fixed tasks. Your repo has different failure modes: wrong fix after tests were skipped, batch subagent on a one-file tweak, deploy without npm test, OpenRouter spend creeping up while cache hit drops.
Without a one-week baseline, every new tool looks rational. With Layer 4 instrumentation — CSV rows, footer Agents line, CI eval — defer lists write themselves. You still decide whether to build more; the harness only surfaces evidence.
Why Phase 0 before Phase 2
docs/HARNESS-DEFER.md says: do not build scripts/agent-harness.mjs or plug Harbor until all gates pass — repeated wrong-fix rows with tests already run, subagent overuse on small tasks, and engineering time budget.
Measurement is cheaper than a microservice. It also respects your control: no auto-install of orchestrators when the CSV says failure modes are none.
What to log (week 1)
docs/harness-session-log.csv columns:
date, task_type, subagent_used, tests_before_deploy, openrouter_usd_notes, failure_mode, notes
Log every real session. No synthetic benchmark runs.
This is Layer 4 feedback in spreadsheet form — the same tier as footer contracts and eval:gate, not Layer 2 narrative. Compare weekly; promote patterns into rules when the same failure_mode repeats.
Week 1 review checklist
- Subagents only on rows tagged
batch? - Every deploy row has
tests_before_deploy=yes? -
failure_modemostlynone? - OpenRouter Activity flat or down vs prior week (same model)?
CI harness (zero chat tokens)
npm run eval:validate # workflows.json paths exist
npm run eval:gate # score vs baseline.json
Add cases for your repeat regressions. Examples from my Shopify app work: Prisma migration discipline, feed-widget sync, storefront API debug rule.
ROI formula (honest, not leaderboard)
net_benefit = (failed_sessions_avoided × avg_session_cost)
+ (deploy_rollbacks_avoided × rollback_cost)
− (extra_orchestration_tokens × price)
− (engineering_hours × your_rate)
If net_benefit is negative after four weeks and failure modes are none, the harness is already enough. Stop shopping for Harbor.
Footer and summaries (Layer 4 — no new contract lines)
Use existing v3.1 footer fields from the Session Continuity System:
- Agents:
none — direct Composer,<project>-batch, etc. — proves which harness path ran; does not change your model - Verified:
harness: tests_before_ship=yes|no
Session Summaries prefix (Part 2) is Layer 2. Footer + CSV + eval:gate are Layer 4. Together they close the feedback loop: session → evidence → weekly review → defer or promote to rules.
Reader action
Create docs/harness-session-log.csv today. Log this week's sessions. Run eval:gate once. If you have three or more wrong-fix rows with tests run, re-read HARNESS-DEFER.md. Otherwise ship the policy and stop building orchestrators.
Get practical posts on enterprise AI and transformation. Only useful updates, sent as a weekly digest.
One practical digest each week. Unsubscribe anytime.







