AI & Building

You Already Have an AI Harness in Cursor (Without LangChain)

Terminal-Bench harnesses look like separate products. On a production Shopify app I already had subagents, CI gates, and session rules. You keep model and mode control — the harness supports routing, tests, and memory gates, not autopilot.

7 min read
Agentic AI · Developer Tools · AI Quality

Series: Cursor Agent Harness (Part 1 of 3)
Companion: Token saving post-mortem (proxy retired; direct OpenRouter + MCP)
Memory stack: Four tiers of external memory (Layer 4 = feedback loop)
Next: Part 2 memory loop · Part 3 measurement

The problem: benchmark harnesses are not your repo harness

Terminal-Bench ranks agents like LemonHarness and Harbor on hard CLI tasks in Docker. The leaderboard rewards multi-step tool loops and verification. That is useful research. It is also easy to misread as a shopping list: install another orchestrator, add Redis, route every Composer call through a microservice.

On my production Shopify app (React Router 7, Fly.io, Cursor Composer 2.5 direct), the failure mode was different. I did not need a sandbox benchmark. I needed fewer bad deploys, fewer retry threads, and fewer tokens wasted on subagents and session bootstrap when a one-line question would do.

I had already built most of a harness. I had not named it or written a one-page routing policy.

Why this matters

A second HTTP orchestrator in front of Cursor usually duplicates what Agent mode already does: plan, call tools, observe, revise. Each extra planner or verifier LLM round adds cost. On my traffic, a token proxy experiment hurt OpenRouter cache hit rate without meaningful dollar savings (companion post).

What did help was discipline:

  • When to use direct Composer vs a batch subagent
  • When to run tests before ship
  • When to skip a 14k-token brain-pack for a trivial question
  • CI checks that cost zero chat tokens

That is a product harness: rules, optional subagents, deterministic verifiers. Not Harbor.

You keep the controls — the harness supports, not overrides

A harness is easy to misread as something that takes over: auto-routing your model, forcing Plan mode, or swapping GLM for Composer behind your back. That is not what this stack does.

| Control | Who owns it | What the harness does instead |
| --- | --- | --- |
| Model (Composer 2.5, GLM 5.2, etc.) | You (Cursor dropdown) | Nothing. No proxy, no model router. Your baseline model policy is a workflow choice, not code that blocks other models. |
| Mode (Plan vs Agent) | You (Cursor UI) | Rules still apply in Plan (alwaysApply .cursor/rules). Plan-only turns stay advisory: no deploy, lighter bootstrap for simple Q&A. |
| When to spawn subagents | Policy table (agent follows HARNESS-POLICY) | Suggests <project>-batch only for 3+ tasks; direct Composer for 1–2 file fixes. You can override by saying "do this inline." |
| When to load memory | Harness gates (Part 2) | Skips brain-pack on Mode A; loads once on code sessions. You still own what is in memories/ and your vault. |
| When to ship | You approve deploy | Harness insists on npm test / <project>-test before release: support, not autopilot. |

Re-assess means the agent asks "Does this task need a subagent, full bootstrap, or tests?" — not "Should I change your model?"

If you pick Plan + GLM 5.2, Cursor runs Plan with GLM 5.2. The harness shapes procedure (footer Mode B, skip heavy vault sweep for a one-line question, no batch agent on a single file). It does not silently revert you to Composer 2.5 or Agent mode.

[Diagram: You (model + mode dropdown); Cursor runtime (Plan or Agent); Harness support (rules + policy); OpenRouter / provider; Subagent vs direct; Memory gates; Tests before ship. Edge labels: "procedure only", "no model swap".]

Power stays with you. The harness is the operating manual the agent reads — routing table, test gates, memory discipline, CI checks — so support is consistent even when the model or mode changes.

What was already in the repo

Before I wrote docs/HARNESS-POLICY.md, my main app repo already had:

| Pattern | Location | Job |
| --- | --- | --- |
| Orchestrator | .github/agents/<project>-batch.agent.md | Split 3+ tasks to workers |
| Worker | <project>-impl.agent.md | One scoped change |
| Verifier | <project>-test.agent.md + npm test | Pass/fail |
| Release gate | <project>-release-manager + pre-deploy script | No dirty deploys |
| CI harness | agent-quality/evals/workflows.json | Static regression checks on push |
| Session bootstrap | sessionStart hook → brain-pack.md | Cross-session context |
| Routing prose | .github/copilot-instructions.md | When to invoke agents |

Composer 2.5 direct is the default path. Subagents are the heavy path when the table says so.

How the lightweight harness works

1. One policy page (git, not Obsidian)

docs/HARNESS-POLICY.md is a decision table:

| Trigger | Action | Avoid |
| --- | --- | --- |
| Quick question | Direct Composer, skip heavy bootstrap | Batch subagent |
| 1–2 file fix | Direct + docs/symbol MCP | <project>-batch |
| 3+ independent tasks | <project>-batch | Implement whole list inline |
| Before deploy | npm test or <project>-test | Trust agent claim alone |
| Deep debug | Main agent + test loop | LangChain on week one |
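
The policy itself is prose that the agent reads, not code that runs. Still, writing the table out as a function can make the thresholds easier to check. The sketch below is only an illustration: the type names, the `forceInline` flag (standing in for "do this inline"), and the `canDeploy` helper are all hypothetical and do not come from the repo.

```typescript
// Hypothetical encoding of the decision table above. Every name here is
// illustrative; the real policy is a Markdown table, not executable code.
type Route = "direct-composer" | "batch-subagent";

interface Task {
  kind: "question" | "code";
  independentTasks: number; // separate work items in the request
  forceInline?: boolean;    // the user said "do this inline"
}

interface Decision {
  route: Route;
  heavyBootstrap: boolean;  // whether the full session bootstrap is worth loading
}

function decide(task: Task): Decision {
  if (task.kind === "question") {
    // Quick question: direct Composer, skip heavy bootstrap.
    return { route: "direct-composer", heavyBootstrap: false };
  }
  // 3+ independent tasks go to the batch agent unless the user overrides.
  const useBatch = task.independentTasks >= 3 && task.forceInline !== true;
  return {
    route: useBatch ? "batch-subagent" : "direct-composer",
    heavyBootstrap: true,
  };
}

// "Before deploy" row: only a passing test run counts, never the agent's claim.
function canDeploy(evidence: { testsPassed: boolean; agentSaysDone: boolean }): boolean {
  return evidence.testsPassed;
}
```

The point of `canDeploy` ignoring `agentSaysDone` is the "Trust agent claim alone" column: the claim is recorded but carries no weight in the decision.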

2. Context budget rule (Cursor)

.cursor/rules/context-budget.mdc (~25 lines, alwaysApply: true) points at the policy. Mode A for Q&A. No subagents on single-file fixes.
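
The rule file is instructions for the agent, so nothing below actually executes in Cursor. As a way to reason about the gate, here is a minimal sketch of "skip the brain-pack on Mode A, load it once on code sessions". The `BrainPackGate` class and the `"code"` mode label are assumptions made for illustration only.

```typescript
// "A" is the Q&A mode named in the post; "code" is an assumed label for
// any session that edits files.
type SessionMode = "A" | "code";

class BrainPackGate {
  private loaded = false;

  // Returns true only on the turn where the brain-pack should be loaded.
  shouldLoad(mode: SessionMode): boolean {
    if (mode === "A") return false; // Q&A: skip the heavy bootstrap
    if (this.loaded) return false;  // code session: already loaded once
    this.loaded = true;
    return true;
  }
}

const gate = new BrainPackGate();
const turns: boolean[] = [
  gate.shouldLoad("A"),    // question first: nothing loaded
  gate.shouldLoad("code"), // first code turn: load once
  gate.shouldLoad("code"), // later code turns: reuse what is in context
];
```

With this ordering, only the second turn pays for the bootstrap, which is the behaviour the rule asks the agent to follow.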

3. CI verifier (push, not chat)

npm run eval:gate runs workflow cases (fileExists, textContains) against routes and rules that regressed in production. Chat tokens: zero.
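
The post does not show the schema of agent-quality/evals/workflows.json or the script behind eval:gate, so the following is a sketch under assumptions: a hypothetical case shape with the two check names mentioned above, and a runner that takes a file reader as a parameter so it can be exercised without touching disk.

```typescript
// Hypothetical case shape; the real workflows.json schema may differ.
type EvalCase =
  | { name: string; check: "fileExists"; path: string }
  | { name: string; check: "textContains"; path: string; needle: string };

// Returns the file contents, or undefined when the file is missing.
type ReadFile = (path: string) => string | undefined;

// Runs every case and returns one message per failure (empty array = pass).
function runGate(cases: EvalCase[], read: ReadFile): string[] {
  const failures: string[] = [];
  for (const c of cases) {
    const text = read(c.path);
    if (text === undefined) {
      failures.push(c.name + ": missing " + c.path);
    } else if (c.check === "textContains" && text.indexOf(c.needle) === -1) {
      failures.push(c.name + ": text not found in " + c.path);
    }
  }
  return failures;
}

// In-memory stand-in for the repo, used here instead of the file system.
const files: Record<string, string> = {
  ".cursor/rules/context-budget.mdc": "alwaysApply: true",
};

const failures = runGate(
  [
    { name: "rule present", check: "fileExists", path: ".cursor/rules/context-budget.mdc" },
    { name: "rule always applies", check: "textContains", path: ".cursor/rules/context-budget.mdc", needle: "alwaysApply: true" },
    { name: "policy present", check: "fileExists", path: "docs/HARNESS-POLICY.md" },
  ],
  (p) => files[p],
);
```

In a CI script the reader would wrap the file system and the process would exit non-zero when `failures` is non-empty. Here the only failure is the missing policy page, which is the kind of regression such a gate can catch without spending a chat token.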

4. Defer list

docs/HARNESS-DEFER.md states what not to build until a CSV baseline proves a gap: LangChain service, Harbor as daily driver, multi-model ensemble.

[Diagram: User message; context-budget.mdc + HARNESS-POLICY; Composer 2.5 direct; Subagent or test gate; eval:gate on push. Branch labels: "Mode A", "Q&A", "Code work".]

What I measured (and what I skipped)

Terminal-Bench agents are tuned for leaderboard score, often at the cost of more tokens. I skipped plugging those agents into Cursor; they expect Harbor plus Docker task layouts, not a monolith Shopify repo.

I also skipped a LangChain microservice after the proxy post-mortem: complexity without ROI on chat-heavy sessions.

What you can copy in one afternoon

  1. Add docs/HARNESS-POLICY.md (one table).
  2. Add .cursor/rules/context-budget.mdc (link to policy).
  3. Optional: one batch agent + one test agent, not five on day one.
  4. Optional: agent-quality/ or a single fileExists script on CI.
  5. Add docs/harness-session-log.csv for weekly review (Part 3 of this series).
  6. Add docs/HARNESS-DEFER.md so future-you does not install Redis out of FOMO.

Policy files live in git so they diff in PRs. Session narrative stays in Obsidian (Part 2).

Reader action

Open your main repo. List what you already have: subagents, test script, deploy gate, CI check, session rules. If the list has three or more rows, you likely have a harness. Write the policy page before you write another service.

Part 2 covers the four-tier memory loop (including Layer 4 feedback) without loading everything every turn. Part 3 covers the CSV, OpenRouter dollars, and when to build more.