Series: Cursor Agent Harness (Part 1 of 3)
Companion: Token saving post-mortem (proxy retired; direct OpenRouter + MCP)
Memory stack: Four tiers of external memory (Layer 4 = feedback loop)
Next: Part 2 memory loop · Part 3 measurement
The problem: benchmark harnesses are not your repo harness
Terminal-Bench ranks agents like LemonHarness and Harbor on hard CLI tasks in Docker. The leaderboard rewards multi-step tool loops and verification. That is useful research. It is also easy to misread as a shopping list: install another orchestrator, add Redis, route every Composer call through a microservice.
On my production Shopify app (React Router 7, Fly.io, Cursor Composer 2.5 direct), the failure mode was different. I did not need a sandbox benchmark. I needed fewer bad deploys, fewer retry threads, and fewer tokens wasted on subagents and session bootstrap when a one-line question would do.
I had already built most of a harness. I had not named it or written a one-page routing policy.
Why this matters
A second HTTP orchestrator in front of Cursor usually duplicates what Agent mode already does: plan, call tools, observe, revise. Each extra planner or verifier LLM round adds cost. On my traffic, a token proxy experiment hurt OpenRouter cache hit rate without meaningful dollar savings (companion post).
What did help was discipline:
- When to use direct Composer vs a batch subagent
- When to run tests before ship
- When to skip a 14k-token brain-pack for a trivial question
- CI checks that cost zero chat tokens
That is a product harness: rules, optional subagents, deterministic verifiers. Not Harbor.
You keep the controls — the harness supports, not overrides
A harness is easy to misread as something that takes over: auto-routing your model, forcing Plan mode, or swapping GLM for Composer behind your back. That is not what this stack does.
| Control | Who owns it | What the harness does instead |
|---|---|---|
| Model (Composer 2.5, GLM 5.2, etc.) | You — Cursor dropdown | Nothing. No proxy, no model router. Your baseline model policy is a workflow choice, not code that blocks other models. |
| Mode (Plan vs Agent) | You — Cursor UI | Rules still apply in Plan (alwaysApply .cursor/rules). Plan-only turns stay advisory: no deploy, lighter bootstrap for simple Q&A. |
| When to spawn subagents | Policy table — agent follows HARNESS-POLICY | Suggests <project>-batch only for 3+ tasks; direct Composer for 1–2 file fixes. You can override by saying "do this inline." |
| When to load memory | Harness gates (Part 2) | Skips brain-pack on Mode A; loads once on code sessions. You still own what is in memories/ and your vault. |
| When to ship | You approve deploy | Harness insists on npm test / <project>-test before release — support, not autopilot. |
Re-assess means the agent asks "Does this task need a subagent, full bootstrap, or tests?" — not "Should I change your model?"
If you pick Plan + GLM 5.2, Cursor runs Plan with GLM 5.2. The harness shapes procedure (footer Mode B, skip heavy vault sweep for a one-line question, no batch agent on a single file). It does not silently revert you to Composer 2.5 or Agent mode.
Power stays with you. The harness is the operating manual the agent reads — routing table, test gates, memory discipline, CI checks — so support is consistent even when the model or mode changes.
What was already in the repo
Before I wrote docs/HARNESS-POLICY.md, my main app repo already had:
| Pattern | Location | Job |
|---|---|---|
| Orchestrator | .github/agents/<project>-batch.agent.md | Split 3+ tasks to workers |
| Worker | <project>-impl.agent.md | One scoped change |
| Verifier | <project>-test.agent.md + npm test | Pass/fail |
| Release gate | <project>-release-manager + pre-deploy script | No dirty deploys |
| CI harness | agent-quality/evals/workflows.json | Static regression checks on push |
| Session bootstrap | sessionStart hook → brain-pack.md | Cross-session context |
| Routing prose | .github/copilot-instructions.md | When to invoke agents |
Composer 2.5 direct is the default path. Subagents are the heavy path when the table says so.
How the lightweight harness works
1. One policy page (git, not Obsidian)
docs/HARNESS-POLICY.md is a decision table:
| Trigger | Action | Avoid |
|---|---|---|
| Quick question | Direct Composer, skip heavy bootstrap | Batch subagent |
| 1–2 file fix | Direct + docs/symbol MCP | <project>-batch |
| 3+ independent tasks | <project>-batch | Implement whole list inline |
| Before deploy | npm test or <project>-test | Trust agent claim alone |
| Deep debug | Main agent + test loop | LangChain on week one |
2. Context budget rule (Cursor)
.cursor/rules/context-budget.mdc (~25 lines, alwaysApply: true) points at the policy. Mode A for Q&A. No subagents on single-file fixes.
3. CI verifier (push, not chat)
npm run eval:gate runs workflow cases (fileExists, textContains) against routes and rules that regressed in production. Chat tokens: zero.
4. Defer list
docs/HARNESS-DEFER.md states what not to build until a CSV baseline proves a gap: LangChain service, Harbor as daily driver, multi-model ensemble.
What I measured (and what I skipped)
Benchmarked claims from Terminal-Bench agents optimize leaderboard score, often with more tokens. I skipped plugging those agents into Cursor; they expect Harbor plus Docker task layouts, not a monolith Shopify repo.
I also skipped a LangChain microservice after the proxy post-mortem: complexity without ROI on chat-heavy sessions.
What you can copy in one afternoon
- Add
docs/HARNESS-POLICY.md(one table). - Add
.cursor/rules/context-budget.mdc(link to policy). - Optional: one batch agent + one test agent, not five on day one.
- Optional:
agent-quality/or a singlefileExistsscript on CI. - Add
docs/harness-session-log.csvfor weekly review (Part 3 of this series). - Add
docs/HARNESS-DEFER.mdso future-you does not install Redis out of FOMO.
Policy files live in git so they diff in PRs. Session narrative stays in Obsidian (Part 2).
Reader action
Open your main repo. List what you already have: subagents, test script, deploy gate, CI check, session rules. If the list has three or more rows, you likely have a harness. Write the policy page before you write another service.
Part 2 covers the four-tier memory loop (including Layer 4 feedback) without loading everything every turn. Part 3 covers the CSV, OpenRouter dollars, and when to build more.
Get practical posts on enterprise AI and transformation. Only useful updates, sent as a weekly digest.
One practical digest each week. Unsubscribe anytime.







