Anthropic brought Fable 5 back to public Cursor model pickers. Cursor published fresh numbers on CursorBench 3.1: ambiguous, multi-file tasks drawn from real agent sessions. Fable 5 Max sits at the top of the score column.
I read the table the way I read infra bills: not who wins one column, but what you pay per accepted outcome.
On this benchmark, the highest score and the best buy are not the same model.
The problem: leaderboard scores hide unit economics
Vendor launches train us to look at rank. CursorBench reports four numbers that matter together:
- Score (task success rate on their battery)
- Cost per task (priced from published per-million-token rates)
- Tokens per task
- Steps per task (agent turns until close)
A model can score 73% and still be a bad default if it spends 76 steps and $18 to get there. Another model can land at 63% for $0.55. For daily shipping work, that gap changes how many tasks you can afford in a month.
Cursor notes that results have variance and small score gaps may not be statistically meaningful. Treat the table as directional, not gospel. It is still useful for tradeoff thinking.
What CursorBench 3.1 measures
Version 3.1 added tasks focused on codebase understanding, bugfinding, planning, and code review. Earlier 3.0 work centered on edit, refactor, and bugfix problems.
That matters when you compare models. A benchmark heavy on multi-step diagnosis rewards models that plan well. It also rewards models that stop instead of burning steps on the wrong file.
Fable 5: performance tier, pricing ladder
Fable 5 ships as a family. On CursorBench 3.1 the spread is wide:
| Model | Score | Cost / task | Tokens / task | Steps / task |
|---|---|---|---|---|
| Fable 5 Max | 72.9% | $18.02 | 63,842 | 76 |
| Fable 5 Extra High | 72.0% | $13.74 | 48,754 | 63 |
| Fable 5 High | 70.6% | $10.81 | 37,173 | 54 |
| Fable 5 Medium | 69.8% | $8.27 | 28,507 | 47 |
| Fable 5 Low | 64.2% | $5.70 | 18,882 | 36 |
Raw performance winner: Fable 5 Max at 72.9%.
Cost story: Max is roughly 33× more expensive per task than Composer 2.5 ($18.02 vs $0.55) for about 10 percentage points more score (72.9% vs 63.2%).
Steps story: Max takes 76 steps. Opus 4.7 Max takes 96. Sonnet 5 Max takes 93 steps for only 61.2% score. High step counts are not free. They add latency, context churn, and review fatigue even when the task eventually passes.
If your work is high stakes and failure is expensive, the top Fable tier can be rational. If you run dozens of agent tasks a week, the Max row is a specialty tool, not a default.
Three efficiency lenses (derived from the table)
I computed three simple ratios from the public CursorBench rows. They are not official Cursor metrics. They help compare models side by side.
Score per dollar (higher is better)
| Model | Score / $1 |
|---|---|
| Composer 2.5 | 114.9 |
| GPT-5.5 Medium | 26.7 |
| Kimi K2.7 Code | 27.4 |
| GPT-5.5 High | 17.4 |
| Fable 5 Low | 11.3 |
| GLM 5.2 High | 20.6 |
| GLM 5.2 Max | 17.6 |
| Fable 5 Max | 4.0 |
Composer 2.5 is an extreme outlier on cost efficiency. Nothing else in the top third of the scoreboard comes close on score per dollar.
Score per 1K tokens (higher is better)
| Model | Score / 1K tokens |
|---|---|
| GPT-5.5 Medium | 6.53 |
| GPT-5.5 High | 4.70 |
| Composer 2.5 | 4.17 |
| GPT-5.5 Extra High | 3.59 |
| Fable 5 Medium | 2.45 |
| GLM 5.2 Max | 1.06 |
| Sonnet 5 Max | 0.65 |
GPT-5.5 Medium uses only 9,065 tokens per task at 59.2% score. Sonnet 5 Max burns 93,485 tokens for 61.2%. That is a brutal token tax for a modest score bump.
Score per step (higher is better)
| Model | Score / step |
|---|---|
| Fable 5 Low | 1.78 |
| Composer 2.5 | 1.71 |
| GPT-5.5 Medium | 1.69 |
| Fable 5 Medium | 1.49 |
| Kimi K2.7 Code | 0.75 |
| GLM 5.2 Max | 0.66 |
| Opus 4.7 Max | 0.68 |
| Sonnet 5 Max | 0.66 |
Composer 2.5 and the lighter Fable / GPT-5.5 tiers finish in fewer steps. Heavy Max tiers and several Opus / Sonnet Max configs look expensive per step.
Composer 2.5: the Pareto surprise
Cursor's chart plots score against average cost. Composer 2.5 sits in the corner you want: 63.2% score, $0.55 per task, 15,152 tokens, 37 steps.
That is not a perfect score. It beats 27 of 36 listed models on the public table while costing less than a coffee per task in the benchmark's pricing model.
I run Composer 2.5 as my default coding model in Cursor for a related reason: predictable rule compliance and a tighter session bootstrap beat frontier roulette on the work I ship daily. CursorBench gives that habit a cost line item. The model is not just cheaper. It is efficient on this task mix.
GPT-5.5: the token miser
GPT-5.5 Medium (59.2%, $2.22, 9,065 tokens, 35 steps) is the best token budget story in the table. Extra High reaches 64.3% at $4.37 with still-reasonable tokens.
If your constraint is context window pressure or API token caps, GPT-5.5 Medium deserves a look. You give up peak score versus Fable Max. You buy back tokens and steps.
Open models on CursorBench: GLM 5.2 and Kimi K2.7 Code
CursorBench includes two open-weight families I watch closely.
GLM 5.2 (Z.ai launch post)
Z.ai positions GLM 5.2 for long-horizon agent work: 1M-token context, MIT license, effort levels (High vs Max), and strong vendor-reported scores on FrontierSWE, Terminal-Bench 2.1, and SWE-bench Pro. On their table GLM 5.2 Max lands near Opus 4.8 on several coding benches while leading open source.
On CursorBench 3.1 (Cursor's harness, not Z.ai's):
| Model | Score | Cost / task | Tokens / task | Steps / task |
|---|---|---|---|---|
| GLM 5.2 Max | 54.6% | $3.11 | 51,312 | 83 |
| GLM 5.2 High | 50.7% | $2.46 | 30,621 | 76 |
GLM 5.2 is cheap versus Fable Max and open. On this benchmark it scores ~9 points below Composer 2.5 while using more steps (83 vs 37 on Max). Vendor long-horizon numbers and Cursor session tasks are not the same test. GLM may shine on hour-scale runs that CursorBench does not simulate.
Practical read: GLM 5.2 is a serious open option when you need 1M context or self-hosting. For Cursor agent sessions on ambiguous repo tasks, the public table does not show it beating Composer 2.5 on score or step efficiency.
Kimi K2.7 Code (Moonshot resource page)
Kimi K2.7 Code is an open coding-focused agentic model. Moonshot reports gains over K2.6 on Kimi Code Bench v2, Program Bench, and MLS Bench Lite, plus roughly 30% lower thinking-token usage than K2.6. Thinking mode is required; non-thinking requests fall back to K2.6 in Kimi Code.
On CursorBench 3.1:
| Model | Score | Cost / task | Tokens / task | Steps / task |
|---|---|---|---|---|
| Kimi K2.7 Code | 52.7% | $1.92 | 32,902 | 70 |
K2.7 is inexpensive and lands mid-table on score. Score per dollar is strong (~27), but 70 steps for 52.7% is a lot of agent churn. Moonshot's own benches use Kimi Code CLI with thinking enabled; Cursor's agent loop may not map 1:1.
Practical read: K2.7 Code is compelling for open-source agent coding and terminal/IDE workflows Kimi controls end to end. On CursorBench it reads as a budget exploratory model, not a drop-in replacement for Composer 2.5 on score.
LongCat 2.0: strong vendor benches, no CursorBench row yet
LongCat 2.0 is Meituan's 1.6T-parameter MoE agentic coding model (MIT license, 1M context, ~48B active parameters per token). It ran on OpenRouter for months as Owl Alpha before the official launch.
Meituan publishes a full Evaluations table on the LongCat site. Unless marked with *, scores were measured in-house under a unified harness. Asterisk rows cite external vendor numbers. That is more disciplined than a single headline chart, but it is still not CursorBench. Read it as a different test battery on a different loop.
Code agent benchmarks (LongCat vs frontier)
| Benchmark | LongCat-2.0 | Gemini 3.1 Pro | GPT-5.5 | Opus 4.8 |
|---|---|---|---|---|
| Terminal-Bench 2.1 | 70.8 | 70.7* | 73.8* | 78.9* |
| SWE-bench Pro | 59.5 | 54.2* | 58.6* | 69.2* |
| SWE-bench Multilingual | 77.3 | 76.9* | — | 84.8* |
* = external score per LongCat's table.
How to read this:
- Opus 4.8 leads coding on Terminal-Bench and both SWE-bench variants in Meituan's comparison set. The gap on SWE-bench Pro is large (69.2 vs 59.5).
- LongCat beats GPT-5.5 on SWE-bench Pro (59.5 vs 58.6*) and sits one tick above Gemini on the same bench (54.2*).
- Terminal-Bench is tight: LongCat 70.8, Gemini 70.7*, GPT-5.5 73.8*, Opus 4.8 78.9*. LongCat is in the pack, not at the front.
- SWE-bench Multilingual is Opus territory (84.8*). LongCat 77.3 is respectable but not leading.
On pure software engineering rows, LongCat looks like a credible open-weight coding model. It does not look like it clears the Opus 4.8 bar on these charts.
General agent benchmarks
| Benchmark | LongCat-2.0 | Gemini 3.1 Pro | GPT-5.5 | Opus 4.8 |
|---|---|---|---|---|
| FORTE | 73.2 | 70.3 | 77.8 | 77.2 |
| BrowseComp | 79.9 | 85.9* | 84.4* | 84.3* |
| RWSearch | 78.8 | 76.3 | 85.3 | 77.3 |
LongCat is mid-pack on general agent work. GPT-5.5 wins FORTE and RWSearch in this table. Gemini leads BrowseComp. LongCat is consistent (high 70s) but not dominant outside coding-specific rows.
Foundational (selected rows)
| Benchmark | LongCat-2.0 | Gemini 3.1 Pro | GPT-5.5 | Opus 4.8 |
|---|---|---|---|---|
| IFEval | 90.0 | 96.1 | 95.0 | 86.0 |
| Writing Bench | 83.8 | 83.7 | 84.7 | 85.2 |
| IMO-AnswerBench | 81.8 | 90.0 | 79.5 | 75.3 |
| GPQA-diamond | 88.9 | 94.3* | 93.6* | 92.4* |
LongCat is competitive on writing and reasoning subsets. Gemini and GPT-5.5 still lead several foundational rows. This matters if you want one model for coding plus general research. LongCat's pitch is narrower: agentic coding at repo scale.
LongCat vs CursorBench (why both tables matter)
LongCat 2.0 does not appear in the CursorBench 3.1 table at the time of writing. You cannot yet line it up against Fable 5 or Composer 2.5 on cost, tokens, and steps per real Cursor session.
That gap is the whole point of comparing sources:
| Lens | What it measures | LongCat signal |
|---|---|---|
| Meituan Evaluations | SWE-bench, Terminal-Bench, FORTE, etc. | Strong open coding model; trails Opus 4.8 on code rows |
| CursorBench 3.1 | Ambiguous multi-file tasks from Cursor agent sessions | No row yet; Composer 2.5 at 63.2% / $0.55 / 37 steps |
A model can score 59.5 on SWE-bench Pro and still burn 70+ steps on a Cursor storefront bug. Or the reverse. Until LongCat ships on Cursor's eval page, treat the vendor charts as capability marketing and CursorBench as IDE economics.
Practical read:
- Try LongCat when you want open weights, 1M context, and OpenRouter-style API pricing for repo-scale agent work.
- Do not drop Composer 2.5 as a Cursor default because LongCat edges GPT-5.5 on one SWE-bench row.
- Watch for the CursorBench row. Compare steps and cost per task, not just score. Open models often look better on vendor harnesses than on agent-loop efficiency.
Early builder commentary outside Meituan's tables is mixed: strong infra story (large domestic training cluster), but some preview users report rough agent edges. Hands-on beats bar charts.
Models I would not default to (on this table)
These are benchmark-specific callouts, not permanent verdicts:
| Model | Why flag it |
|---|---|
| Sonnet 5 Max | 93K tokens, 93 steps, 61.2% score |
| Opus 4.7 Max | 96 steps, $11.02/task, 64.8% score |
| Fable 5 Max | ~10 points above Composer 2.5, ~33× the cost |
| GLM 5.2 Max (on CursorBench) | 83 steps, 54.6% score |
Max tiers can still make sense for one-shot critical work. They are poor defaults for high-volume agent loops.
How I pick models after reading the table
| Need | Model to try first | Why |
|---|---|---|
| Best value on Cursor tasks | Composer 2.5 | Top-third score, lowest cost, low steps |
| Tight token budget | GPT-5.5 Medium | 9K tokens/task, decent score |
| Highest score, budget open | Fable 5 Medium | 69.8% at $8.27 (not Max) |
| Must maximize pass rate | Fable 5 Max | 72.9% if cost is secondary |
| Open weights + 1M context | GLM 5.2 or LongCat 2.0 | GLM on CursorBench today; LongCat strong on vendor SWE-bench, unmeasured on Cursor cost/steps |
| Open Kimi stack | Kimi K2.7 Code | Cheap on CursorBench; validate in Kimi Code CLI |
Model choice is one lever. Context bootstrap is the other. I benchmark file memory separately because a $0.55 model with a 3K-token gotchas pack can beat a $18 model that reads the wrong layer first. Stack both levers.
What you can do next
- Open CursorBench 3.1 and sort by cost, not just score.
- Compute score per dollar and score per step for the two models you actually use.
- Run one real repo task twice: frontier Max vs your daily model. Log tokens, steps, and whether you shipped.
- For LongCat, read Meituan's Evaluations table (code vs general agent rows) separately from CursorBench. Wait for a Cursor row before judging IDE economics.
- If you test GLM 5.2 or K2.7, compare Cursor against the vendor's native CLI before trusting launch blog tables.
- Keep Composer 2.5 (or GPT-5.5 Medium) as default; promote Fable Max only when failure cost exceeds ~$17 per task of headroom.
The benchmark answered a question vendors often skip: what does the top score cost in steps and dollars? On CursorBench 3.1, Fable 5 wins the headline. Composer 2.5 wins the spreadsheet.
Sources
- Cursor, "CursorBench 3.1." https://cursor.com/evals
- Z.ai, "GLM-5.2: Built for Long-Horizon Tasks." https://z.ai/blog/glm-5.2
- Moonshot AI, "Kimi K2.7 Code." https://www.kimi.com/resources/kimi-k2-7-code
- Meituan, "Introducing LongCat-2.0" (Evaluations table). https://longcat.chat/blog/longcat-2.0/
Get practical posts on enterprise AI and transformation. Only useful updates, sent as a weekly digest.
One practical digest each week. Unsubscribe anytime.







