← Back to all posts

AI & Building

Beyond Headroom: What I Tried to Save Cursor Tokens, What Failed, and What I Use Now

I ran Headroom, built a 300-line proxy, wired a Cloudflare tunnel, and added RTK. On my Cursor + OpenRouter workload the dollars did not move. Here is what is worth doing instead.

·8 min read
Agentic AIDeveloper ToolsAI Memory
Beyond Headroom: What I Tried to Save Cursor Tokens, What Failed, and What I Use Now

Cost context: GitHub Copilot vs OpenRouter pricing
Workflow context: Copilot → Cursor migration
What replaced the proxy: Lightweight agent harness (Part 1) · Four-tier memory

The problem: I wanted cheaper Cursor chats without changing how I work

Cursor bills through token volume. Compression proxies, shell filters, and cache tricks all promise 30% to 95% savings without touching your prompts. I tried that stack for real production work on a Shopify-style web app (React Router 7, hosted API, OpenRouter, and an OpenRouter model in a region where several frontier APIs are region-blocked).

As of July 2026 I turned it all off. The proxy scheduled tasks are disabled. The named Cloudflare tunnel hostname is decommissioned (DNS should not point at my machine; do not probe old URLs from superseded drafts). Cursor runs direct OpenRouter. The monthly dollar delta was negligible. Worse: running traffic through my proxy lowered OpenRouter prompt-cache hit rate (about 90%+ direct vs about 80% through the proxy).

This post is the digest: what I tried, what savings are actually realizable, why the experiment failed, what I learned, and what I run now (Context7, Serena, harness policy, four-tier memory).


Why this matters before you install anything

Token-saving tools are orthogonal layers. They stack only when each layer attacks a different part of the bill:

LayerWhat it shrinksTypical tools
SourceWhat the agent reads or runsSerena, Context7, LeanCTX, ripgrep aliases
PathBytes on the wire to the modelProxy postprocessor, RTK, Tamp
CacheRepeated prompt prefixesOpenRouter native cache, cache_control
ModelPrice per tokenComposer 2.5 baseline, routing
SessionContext bloat across turnsRules, harness gates, footer contract
ToolingShell output noiseRTK, jq/rg/fd instead of cat/grep/find

If your bill is mostly completion tokens and long chat history, a proxy that only tweaks the prompt stream will not move the needle. If OpenRouter already caches your prefixes, adding cache_control injection in front can change request shape and hurt cache keys. That is what happened here.


What I tried (timeline)

PhaseWhatGoal
1Headroom proxy on a local portHTTP-layer compression
2Custom Token Optimizer proxy (proxy.py, ~300 lines) on a second local portFive small layers instead of Headroom ML
3Cloudflare named tunnel (e.g. https://cursor-proxy.example.dev/v1 — use your hostname)Bypass Cursor SSRF so chat could reach localhost
4RTK preToolUse hookCompress shell output before the model sees it
5A/B harness (60 calls, fresh OpenRouter keys)Measure proxy on vs off
6Retired proxy, tunnel, RTK, Headroom MCPDirect OpenRouter + MCP input discipline

Headroom on my traffic (90-minute test, 37 requests): 5.7% average compression (from internal session tracking). Below the 30% bar I set before the test. The advertised cache hit path did not match OpenRouter-shaped traffic.

Token Optimizer looked better in smoke tests (prompt-cache markers, 1 ms local repeat hits, RTK on git status). Real dashboard math did not hold up once I counted OpenRouter Activity and cache hit % with the proxy in path.


What I measured (honest numbers)

SignalResult
A/B harness (tiny prompts, fresh keys)Proxy +4.7% on chat-shaped calls; OFF cohort sometimes higher cache %
RTK telemetry34.8% compression on git status — but only ~464 tokens across four shell runs
OpenRouter cache hit~80% with proxy vs ~90%+ direct (−10 percentage points)
Monthly spend (proxy key)~$8 on ~75M tokens — savings dominated by OpenRouter's own prompt cache, not my layers
Tunnel + scheduled tasksWorked when live; infra cost (reboots, ACLs, wrong proxy.py path for weeks) with no dollar payoff

Benchmark claims from tool READMEs (Headroom 60–95%, LeanCTX 88% sessions, RTK 30–80% shell) are workload-dependent. My workload is chat-heavy Cursor Agent, not file-read-heavy Claude Code sessions. Shell output was a small slice of total tokens.


Why the experiment failed (for my workload)

  1. OpenRouter already caches. My proxy injected cache_control and headers. That may have disturbed cache keys. Net: lower hit rate, not higher savings.
  2. RTK volume was tiny. Chat sessions burn tokens on prompts and completions, not git status output.
  3. Local LRU only helps byte-identical repeats. Multi-turn conversations rarely repeat the full body.
  4. Completion tokens dominate on agentic work. The proxy barely touched output except optional caps.
  5. Complexity tax. Two scheduled tasks, tunnel config, duplicated proxy.py paths, SSRF workarounds. Maintenance without ROI.
  6. Headroom mismatch was structural, not tuning: ML compression stage on a stream OpenRouter was already caching upstream.

The tunnel pattern is still useful if you must expose localhost to Cursor (SSRF blocks private IPs and localhost). Use a hostname you control, protect the proxy (auth, bind to loopback only), and delete the DNS record when you retire the stack. I no longer route chat through a local proxy.


What savings are actually realizable (what I believe now)

ApproachWorth it on my stack?Notes
Direct OpenRouter + native prompt cacheYes — defaultClear Cursor base URL override; watch Activity dashboard
Context7 (framework docs)Yes — main app reposStops wrong-guess refactors that burn retry tokens
Serena (symbols, not full files)Yes — large reposInput reduction at source layer
Harness policy (when to batch / test / skip bootstrap)YesSee harness series
Four-tier memory + footer contractYesExternal Memory series; Layer 4 = feedback loop
Composer 2.5 baselineYesPredictable rule compliance
RTK / shell aliasesMaybe if shell-heavyLow cost; I removed RTK when proxy retired
Proxy + tunnelNo for meOnly reconsider for tool-output-heavy, exact-repeat workloads
HeadroomNo for my traffic5.7% measured; high CPU on Kompress path
LeanCTXDeferredOverlaps Serena; validate on a pilot repo first

Power stays with you: model dropdown, Plan vs Agent, ship approval. The harness and memory stack support procedure (when to load brain-pack, when to run tests). They do not auto-switch models or override your choices.


What I run now (July 2026)

Cursor → direct OpenRouter (no base URL override)
MCP: Context7 + Serena (+ optional context-mode on the busiest repo)
Rules: context-budget.mdc + HARNESS-POLICY routing table
Memory: four tiers (L4 feedback = session footer + rules + eval:gate)
Metering: OpenRouter Activity + quarterly ghost-token audit
Lab: Tokensaver repo keeps A/B harness + archived dashboards

Next reads in order:

  1. Agent harness without a microservice — you keep controls; policy routes subagents and tests
  2. Harness memory loop (four tiers) — when to load operational vs evergreen vs feedback
  3. Measure the harness — CSV, CI, dollars before building more orchestration

Proxy source and dashboards are archived in a local lab repo (proxy.py, dashboards/legacy/). Not maintained as production infra.

What I triedMeasured(cache hit %)RetiredJuly 2026What works nowHeadroomProxy +tunnelRTK hookDirectOpenRouterContext7SerenaHarness +memory L4
What I triedMeasured(cache hit %)RetiredJuly 2026What works nowHeadroomProxy +tunnelRTK hookDirectOpenRouterContext7SerenaHarness +memory L4

What I would do from zero today

  1. Meter first. OpenRouter Activity for one week on direct routing. Know cache hit % and completion share before adding middleware.
  2. Input discipline. Context7 before framework guesses; Serena before full-file reads.
  3. Session contract. Four-tier memory with footer v3.1 at ship time.
  4. Harness policy. One-page table: direct vs batch vs test. No microservice until CSV proves a gap.
  5. Shell aliases (1 hour) if agents run noisy commands daily.
  6. Proxy/tunnel last — only if measurement shows tool-output or exact-repeat dominates and direct cache is already maxed.

Reader action

  • If you run a local proxy today: compare OpenRouter cache hit % direct vs through proxy. If hit rate drops, the proxy is costing money.
  • If you see Access to private networks is forbidden: that is Cursor SSRF, not OpenRouter. A public HTTPS tunnel fixes routing; it does not fix savings by itself.
  • If you want sustainable savings: start with direct routing, MCP input tools, and harness gates. Read Part 1 of the harness series.