#engineering #infrastructure

napkin llm router math

Route hard or don’t route at all.

It’s only been a day since Fable 5 was suspended, but damn, I miss it already. Other than OpenAI’s o3 and Anthropic’s Opus 4.5, few models have felt like as big a step change as Fable. That said, it was token-hungry and pretty expensive. Reasonably, everyone’s got one thing on their minds: what if we plan with a smart, expensive model, implement with a cheap one, then review with the smart one? It’s something I spend a lot of brain cells on while working with agents.[1]

There are several variations on this; more on them below. If you are at all price-conscious, you probably do something similar, so I figured I’d do some napkin math to get a sense of how the tradeoffs compare. The most significant consideration, other than output quality, is the cost of caching, especially with open-source models such as the DeepSeek V4 and MiMo v2.5 series.

The setup and some assumptions

A lot of variables are at play here so I’m assuming a few things to simplify the math. It’s napkin math, after all.

Some things I don’t factor in are:

With that out of the way, let’s get it. The one rule that matters: a model’s cache is its own. The first turn after you switch to a model, it has never seen your conversation, so it reads the whole prefix cold at full input price. Every turn after, that prefix is cached at about a tenth of the price.

Prices (per million tokens, as of today)[3][4][5][6][7].

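| Model | Input | Cached | Output |
| --- | --- | --- | --- |
| Claude Fable 5 | $10 | $1 | $50 |
| Claude Opus 4.8 (and 4.6) | $5 | $0.50 | $25 |
| Claude Sonnet 4.6 | $3 | $0.30 | $15 |
| GPT-5.5 | $5 | $0.50 | $30 |
| GPT-5.4 | $2.50 | $0.25 | $15 |
| MiMo v2.5 Pro | $0.435 | $0.0036 | $0.87 |
| DeepSeek V4 Pro | $0.435 | $0.0036 | $0.87 |
| Kimi K2.6 | $0.95 | $0.16 | $4 |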

A turn has two prices.

Cold turn = (prefix + new) × input price + output × output price. A warm turn bills the prefix at the cached rate instead. We’ll start at the first implementation turn, after two “plan” turns, so the prefix is about 38K (30K start + 2 turns × 4K).

For Fable 5 at this turn plus 1K new, 2K output:

cold:  (0.038 + 0.001)×$10 + 0.002×$50 = $0.390 + $0.100 = $0.490
warm:   0.038×$1 + 0.001×$10 + 0.002×$50 = $0.038 + $0.010 + $0.100 = $0.148

The cold turn is over 3x the warm one. And for Opus 4.8:

cold:  (0.038 + 0.001)×$5  + 0.002×$25 = $0.195 + $0.050 = $0.245
warm:   0.038×$0.50 + 0.001×$5 + 0.002×$25 = $0.019 + $0.005 + $0.050 = $0.074

That gap is the entire cost of switching, and it scales with how expensive the model’s input is. These are at the 38K starting prefix; by turn 15 the prefix reaches ~98K, and each warm turn costs about 40% more than at the start.
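A minimal sketch of the two formulas, assuming the same 1K of new input and 2K of output per turn used in the numbers above. The function name `turn_cost` and its parameters are illustrative labels, not part of any library.

```python
# Prices in dollars per million tokens, copied from the price table above.
PRICES = {
    "Fable 5":  {"input": 10.0, "cached": 1.0, "output": 50.0},
    "Opus 4.8": {"input": 5.0,  "cached": 0.5, "output": 25.0},
}

def turn_cost(price, prefix, new=1_000, output=2_000, warm=False):
    """Dollar cost of one turn. Token counts are raw tokens, not millions."""
    # A warm turn bills the prefix at the cached rate; new input is always full price.
    prefix_rate = price["cached"] if warm else price["input"]
    return (prefix * prefix_rate + new * price["input"] + output * price["output"]) / 1e6

for name, price in PRICES.items():
    cold = turn_cost(price, 38_000)
    warm = turn_cost(price, 38_000, warm=True)
    print(f"{name}: cold ${cold:.3f}, warm ${warm:.3f}, ratio {cold / warm:.1f}x")
```

At the 38K prefix this reproduces $0.490 cold and $0.148 warm for Fable 5, and $0.245 cold and $0.074 warm for Opus 4.8.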

[Figure: What one implement turn costs: cold vs cached]

Nothing too surprising here. The premium model is never as expensive as the sticker price once caching kicks in.

All three phases, end to end

Every strategy starts cold on its plan model (nobody’s cache is warm at the start of a session), so plan always costs 1 cold turn + 1 warm turn. Context grows through the session: ~30K during plan, ~38K at the start of implement growing to ~98K by the end, and ~98K during review. All totals below account for this growth.

Plan phase (2 turns, starting at 30K prefix).

| Plan model | Turn 1 (cold, 30K) | Turn 2 (warm, 34K) | Plan total |
| --- | --- | --- | --- |
| Fable 5 | $0.410 | $0.144 | $0.55 |
| Opus 4.8 | $0.205 | $0.072 | $0.28 |
| GPT-5.5 | $0.215 | $0.082 | $0.30 |

Implement phase (15 turns, prefix growing 38K to 98K). If you stay on your plan model, its cache is still warm from planning, so all 15 turns pay the cached rate (growing each turn as context accumulates). If you switch in a new model, its first turn is a cold read at 38K and the remaining 14 are cached but growing.

| Implement model | Scenario | Impl total |
| --- | --- | --- |
| Fable 5 | stay (all warm) | $2.64 |
| Opus 4.8/4.6 | stay (all warm) | $1.32 |
| Sonnet 4.6 | switch in (1 cold + 14 warm) | $0.89 |
| GPT-5.4 | switch in | $0.82 |
| Kimi K2.6 | switch in | $0.32 |
| MiMo v2.5 Pro | switch in | $0.05 |
| DeepSeek V4 Pro | switch in | $0.05 |

Sonnet saves about $0.43 over staying on Opus. Hold that number.
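The implement totals can be reproduced with a short loop. This is a sketch under the stated assumptions (15 turns, prefix growing by 4K per turn from 38K, 1K new input and 2K output per turn); `implement_total` and `switched_in` are hypothetical names chosen for illustration.

```python
# Dollars per million tokens (input, cached input, output), from the price table.
PRICES = {
    "Fable 5":         (10.0, 1.0, 50.0),
    "Opus 4.8":        (5.0, 0.5, 25.0),
    "Sonnet 4.6":      (3.0, 0.3, 15.0),
    "GPT-5.4":         (2.5, 0.25, 15.0),
    "Kimi K2.6":       (0.95, 0.16, 4.0),
    "MiMo v2.5 Pro":   (0.435, 0.0036, 0.87),
    "DeepSeek V4 Pro": (0.435, 0.0036, 0.87),
}

def implement_total(model, switched_in, turns=15, start_prefix=38_000,
                    growth=4_000, new=1_000, output=2_000):
    """Sum the implement phase. Only a switched-in model pays a cold first turn."""
    inp, cached, out = PRICES[model]
    total = 0.0
    for i in range(turns):
        prefix = start_prefix + i * growth   # 38K, 42K, ... 94K
        cold = switched_in and i == 0
        prefix_rate = inp if cold else cached
        total += (prefix * prefix_rate + new * inp + output * out) / 1e6
    return total

STAY = {"Fable 5", "Opus 4.8"}  # models assumed to be warm from the plan phase
for model in PRICES:
    cost = implement_total(model, switched_in=model not in STAY)
    print(f"{model:16s} ${cost:.2f}")

saving = implement_total("Opus 4.8", False) - implement_total("Sonnet 4.6", True)
print(f"Sonnet saves ${saving:.2f} over staying on Opus")
```

The cold first turn is the only difference between the two scenarios, so flipping `switched_in` for a single model isolates what the switch itself costs.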

Review phase (2 turns, prefix now at ~98K). If you stayed on the same model through implement, its cache is still warm and review is cheap. If you routed away for implement, your review model’s cache expired during those 15 turns (caches expire after 5 minutes by default[2]), so review pays a cold read of the now-fat 98K context.

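| Review model | Scenario | Turn 1 | Turn 2 (warm) | Review total |
| --- | --- | --- | --- | --- |
| Fable 5 | stay (warm) | $0.208 | $0.212 | $0.42 |
| Fable 5 | switch back (cold) | $1.090 | $0.212 | $1.30 |
| Opus 4.8 | stay (warm) | $0.104 | $0.106 | $0.21 |
| Opus 4.8 | switch back (cold) | $0.545 | $0.106 | $0.65 |
| GPT-5.5 | switch back (cold) | $0.555 | $0.116 | $0.67 |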
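The difference between a stay row and a switch-back row is just the prefix billed at full input price instead of the cached rate. A small sketch (the helper name `cold_read_penalty` is made up for illustration) shows how that penalty grows linearly with the prefix:

```python
# Dollars per million tokens (input, cached input), from the price table.
RATES = {
    "Fable 5":  (10.0, 1.0),
    "Opus 4.8": (5.0, 0.5),
    "GPT-5.5":  (5.0, 0.5),
}

def cold_read_penalty(model, prefix):
    """Extra dollars paid when the prefix is read at full price instead of cached."""
    inp, cached = RATES[model]
    return prefix * (inp - cached) / 1e6

for model in RATES:
    at_start = cold_read_penalty(model, 38_000)
    at_review = cold_read_penalty(model, 98_000)
    print(f"{model}: ${at_start:.3f} at 38K, ${at_review:.3f} at 98K")
```

For Opus 4.8 the penalty is about $0.17 at the start of implement and about $0.44 by review, which is why the return trip hurts more than the trip out.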

The review cold read on Opus at 98K costs an extra $0.44 over staying warm. That’s just barely more than the $0.43 you saved on implement. Now adding all three phases up:

Session totals (plan + implement + review):

| Strategy | Plan | Implement | Review | Total |
| --- | --- | --- | --- | --- |
| All Fable 5 | $0.55 | $2.64 | $0.42 (warm) | $3.61 |
| Fable chimera (F/MiMo/F) | $0.55 | $0.05 | $1.30 (cold) | $1.90 |
| Claude routed (Op/Son/Op) | $0.28 | $0.89 | $0.65 (cold) | $1.82 |
| Codex routed (5.5/5.4/5.5) | $0.30 | $0.82 | $0.67 (cold) | $1.79 |
| All Opus 4.8 | $0.28 | $1.32 | $0.21 (warm) | $1.81 |
| Kimi routed (Op/Kimi/Op) | $0.28 | $0.32 | $0.65 (cold) | $1.25 |
| Cross-provider (Op/MiMo/5.5) | $0.28 | $0.05 | $0.67 (cold) | $1.00 |
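The whole session can be simulated turn by turn. The sketch below assumes a model's cache is warm only if that same model served the previous turn, which is the simplification the tables rely on; `session_cost` and the strategy dictionary are illustrative names.

```python
# Dollars per million tokens (input, cached input, output), from the price table.
PRICES = {
    "Fable 5":       (10.0, 1.0, 50.0),
    "Opus 4.8":      (5.0, 0.5, 25.0),
    "Sonnet 4.6":    (3.0, 0.3, 15.0),
    "GPT-5.5":       (5.0, 0.5, 30.0),
    "GPT-5.4":       (2.5, 0.25, 15.0),
    "Kimi K2.6":     (0.95, 0.16, 4.0),
    "MiMo v2.5 Pro": (0.435, 0.0036, 0.87),
}

def session_cost(plan, implement, review, turns=(2, 15, 2),
                 start=30_000, growth=4_000, new=1_000, output=2_000):
    """Total dollars for a plan/implement/review session with a growing prefix."""
    schedule = [plan] * turns[0] + [implement] * turns[1] + [review] * turns[2]
    total, previous = 0.0, None
    for i, model in enumerate(schedule):
        inp, cached, out = PRICES[model]
        prefix = start + i * growth          # 30K, 34K, 38K, ... 102K
        # Assumption: warm only when the same model handled the previous turn.
        prefix_rate = cached if model == previous else inp
        total += (prefix * prefix_rate + new * inp + output * out) / 1e6
        previous = model
    return total

STRATEGIES = {
    "All Fable 5":    ("Fable 5", "Fable 5", "Fable 5"),
    "Fable chimera":  ("Fable 5", "MiMo v2.5 Pro", "Fable 5"),
    "Claude routed":  ("Opus 4.8", "Sonnet 4.6", "Opus 4.8"),
    "Codex routed":   ("GPT-5.5", "GPT-5.4", "GPT-5.5"),
    "All Opus 4.8":   ("Opus 4.8", "Opus 4.8", "Opus 4.8"),
    "Kimi routed":    ("Opus 4.8", "Kimi K2.6", "Opus 4.8"),
    "Cross-provider": ("Opus 4.8", "MiMo v2.5 Pro", "GPT-5.5"),
}
for name, models in STRATEGIES.items():
    print(f"{name:15s} ${session_cost(*models):.2f}")
```

Computed without intermediate rounding, the Fable chimera comes out near $1.91 rather than $1.90, because the table sums phase totals that were already rounded to the cent. The other rows match.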

[Figure: Cost per session, cache-aware]

Routing Opus to Sonnet and back loses money. The two cold reads (out and back) eat the thin savings, and the return trip is worse than it looks because by review time the context has grown to 98K so the cold read is fat. The only same-family routing that clearly pays is from Fable, because Fable’s warm rate is 2x Opus’s, so the gap is wide enough to survive the return trip.

Hmm. So when does routing actually pay?

Same formula: the cold read is a one-time tax, you save a bit each turn, and you break even after enough turns. Below, A is the model you are leaving and B is the model you switch to.

                       prefix × (B_input − B_cached)
turns to break even = -------------------------------------------
                       prefix×(A_cached − B_cached)
                     + output×(A_output − B_output)
                     + new_input×(A_input − B_input)

At the starting 38K prefix with 2K output per turn:

| Swap | Cold tax | Saved/turn | Break-even |
| --- | --- | --- | --- |
| Opus 4.8 → Sonnet 4.6 | $0.10 | $0.030 | ~4 turns |
| GPT-5.5 → GPT-5.4 | $0.09 | $0.042 | ~2 turns |
| Fable 5 → Sonnet 4.6 | $0.10 | $0.104 | ~1 turn |
| Opus 4.8 → Kimi K2.6 | $0.030 | $0.059 | <1 turn |
| Opus 4.8 → DeepSeek V4 Pro | $0.016 | $0.072 | <1 turn |
| Opus 4.8 → MiMo v2.5 Pro | $0.016 | $0.072 | <1 turn |
| Fable 5 → MiMo v2.5 Pro | $0.016 | $0.146 | <1 turn |

[Figure: Turns to break even]

A one-notch downgrade (Opus to Sonnet) needs ~4 turns just to recoup one cold read, and you pay it twice if you bounce back. A cross-provider drop to MiMo, DeepSeek, or Kimi breaks even in under one turn — MiMo and DeepSeek output is now ~29x cheaper than Opus, so even a single turn of saved output more than covers the cold read.

And from Fable the gap is even bigger. Fable to Sonnet breaks even in ~1 turn instead of the ~4 it takes from Opus, because Fable’s cached rate is $1 vs Opus’s $0.50 and its output rate is $50 vs $25, so you save more per warm turn. Fable to MiMo pays for itself immediately.
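The break-even formula translates directly into code. In this sketch A is the model being left and B the model being switched to; `break_even` is an illustrative name, and the result only means something when the per-turn saving is positive.

```python
# Dollars per million tokens (input, cached input, output), from the price table.
PRICES = {
    "Fable 5":       (10.0, 1.0, 50.0),
    "Opus 4.8":      (5.0, 0.5, 25.0),
    "GPT-5.5":       (5.0, 0.5, 30.0),
    "Sonnet 4.6":    (3.0, 0.3, 15.0),
    "GPT-5.4":       (2.5, 0.25, 15.0),
    "Kimi K2.6":     (0.95, 0.16, 4.0),
    "MiMo v2.5 Pro": (0.435, 0.0036, 0.87),
}

def break_even(a, b, prefix=38_000, new=1_000, output=2_000):
    """Return (cold tax, saving per warm turn, turns to break even) for A -> B."""
    a_in, a_cached, a_out = PRICES[a]
    b_in, b_cached, b_out = PRICES[b]
    # One-time tax: B reads the prefix at full price instead of its cached rate.
    cold_tax = prefix * (b_in - b_cached) / 1e6
    # Per-turn saving once both models would otherwise be warm.
    saved = (prefix * (a_cached - b_cached)
             + output * (a_out - b_out)
             + new * (a_in - b_in)) / 1e6
    return cold_tax, saved, cold_tax / saved

SWAPS = [("Opus 4.8", "Sonnet 4.6"), ("GPT-5.5", "GPT-5.4"),
         ("Fable 5", "Sonnet 4.6"), ("Opus 4.8", "Kimi K2.6"),
         ("Opus 4.8", "MiMo v2.5 Pro"), ("Fable 5", "MiMo v2.5 Pro")]
for a, b in SWAPS:
    tax, saved, turns = break_even(a, b)
    print(f"{a} -> {b}: tax ${tax:.3f}, saved ${saved:.3f}/turn, {turns:.2f} turns")
```

The raw ratio for Opus to Sonnet is about 3.5, which the table reports as ~4 turns.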

As the prefix grows through the session these break-evens actually improve slightly, because the per-turn savings scale with prefix size while the cold tax was already paid. But there’s a more important variable: output per turn.

| Output/turn | Opus → Sonnet | Opus → Kimi | Opus → MiMo |
| --- | --- | --- | --- |
| 200 tokens (a rename) | ~9 turns | ~1 turn | ~1 turn |
| 500 tokens (add a flag) | ~7 turns | ~1 turn | ~1 turn |
| 2K tokens (new endpoint) | ~4 turns | <1 turn | <1 turn |
| 4K tokens (CRUD + tests) | ~2 turns | <1 turn | <1 turn |
| 8K tokens (scaffold a service) | ~1 turn | <1 turn | <1 turn |

Output size drives the break-even. A session of small edits favors staying put; a session of writing big chunks of new code favors routing hard.
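A quick sweep over output sizes shows the same trend as raw, unrounded ratios. This is a sketch that holds the prefix at 38K and new input at 1K, as in the break-even numbers above; `turns_to_break_even` is a hypothetical helper name.

```python
# Dollars per million tokens (input, cached input, output), from the price table.
OPUS = (5.0, 0.5, 25.0)
TARGETS = {
    "Sonnet 4.6":    (3.0, 0.3, 15.0),
    "Kimi K2.6":     (0.95, 0.16, 4.0),
    "MiMo v2.5 Pro": (0.435, 0.0036, 0.87),
}

def turns_to_break_even(a, b, output, prefix=38_000, new=1_000):
    """Cold tax of switching A -> B divided by the saving per warm turn."""
    a_in, a_cached, a_out = a
    b_in, b_cached, b_out = b
    cold_tax = prefix * (b_in - b_cached)
    saved = (prefix * (a_cached - b_cached)
             + output * (a_out - b_out)
             + new * (a_in - b_in))
    return cold_tax / saved   # the per-million scaling cancels out

print("output  " + "  ".join(f"{name:>13s}" for name in TARGETS))
for output in (200, 500, 2_000, 4_000, 8_000):
    row = [turns_to_break_even(OPUS, b, output) for b in TARGETS.values()]
    print(f"{output:>6}  " + "  ".join(f"{t:13.2f}" for t in row))
```

Only the output term changes between rows, so the cheaper the target's output price relative to Opus, the faster the ratio collapses as outputs grow.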

A note on compaction

With a 200K window our 19-turn session peaks at ~106K and never hits the ~160K compaction threshold. When compaction fires it rewrites the prompt, which invalidates the cache for every model, including the one you stayed on. Your “stay warm” model eats a cold read just like a model you switched into. This actually makes routing look better, because it erodes the stay-warm advantage. If your sessions regularly compact, the numbers above are conservative and routing pays more readily.
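The peak-context claim is simple arithmetic. The sketch below assumes compaction fires once the accumulated context reaches the threshold and that context grows by a flat 4K per turn; both function names are illustrative, and real harnesses may trigger compaction differently.

```python
import math

def context_after(turns, start=30_000, growth=4_000):
    """Context size in tokens after the given number of turns."""
    return start + turns * growth

def first_compacting_turn(threshold=160_000, start=30_000, growth=4_000):
    """Smallest number of turns after which the context reaches the threshold."""
    return max(0, math.ceil((threshold - start) / growth))

session_turns = 2 + 15 + 2   # plan + implement + review
peak = context_after(session_turns)
limit_turn = first_compacting_turn()
print(f"peak context after {session_turns} turns: {peak:,} tokens")
print(f"compaction would first fire after turn {limit_turn}")
print("session compacts" if session_turns >= limit_turn else "session never compacts")
```

Under these assumptions the session would need to run to roughly turn 33 before compaction fires, so a longer implement phase or fatter tool outputs is what pushes a session into that regime.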

Tool support

Some tools already support routing: Cursor’s auto mode[8], Martian[9], subagents, Droid’s Router[10], and OpenRouter[11] (which seems to have a few types). Other approaches implement routing in the harness layer via a meta-harness, though the motivation there seems to be less about cost and more about leveraging the way frontier labs are RL-ifying their harnesses for their own models.

An alternative approach that I’ve grown to like is an advisor tool or agent that a weaker model can consult on the tricky cases, either mid-generation or after N turns. oh-my-pi (OMP) exposes this as a watchdog background agent[12], while Claude[13] and OpenRouter[14] expose it as a tool that the executor can call. OMP’s advisor model has read-only workspace access, severity-based interruption, and lets you define project-specific review rules that only the advisor sees, not the primary agent. It’s been one of the best ways I’ve found to be productive with the MiMo and DeepSeek models, especially for autoresearch experiments, because the advisor can catch things the executor overlooked.
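The "advisor after N turns" idea can be sketched as a plain loop. Everything here is hypothetical: `run_with_advisor`, the `Executor` and `Advisor` callables, and the fake models are stand-ins for illustration, and this is not how oh-my-pi, Claude, or OpenRouter implement their advisors.

```python
from typing import Callable, List, Optional

Executor = Callable[[List[str]], str]            # cheap model: transcript -> next step
Advisor = Callable[[List[str]], Optional[str]]   # strong model: transcript -> note or None

def run_with_advisor(task: str, executor: Executor, advisor: Advisor,
                     review_every: int = 3, max_turns: int = 12) -> List[str]:
    """Run the executor, consulting the advisor after every `review_every` turns."""
    transcript = [f"TASK: {task}"]
    for turn in range(1, max_turns + 1):
        step = executor(transcript)
        transcript.append(step)
        if step == "DONE":
            break
        if turn % review_every == 0:
            note = advisor(transcript)
            if note is not None:
                # The note joins the context the executor sees on its next turn.
                transcript.append(f"ADVISOR: {note}")
    return transcript

# Stand-ins for real model calls, so the sketch runs without any API.
def fake_executor(transcript: List[str]) -> str:
    if any(line.startswith("ADVISOR:") for line in transcript):
        return "DONE"
    return f"edit #{len(transcript)}"

def fake_advisor(transcript: List[str]) -> Optional[str]:
    return "tests are missing for the new flag"

log = run_with_advisor("add a --dry-run flag", fake_executor, fake_advisor)
print("\n".join(log))
```

The cost appeal is that the expensive model is called once every few turns on a read-only view, instead of paying a cold read on every phase switch.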

Conclusion

Like everything else, it depends on your use case. Certainly, as we’ve seen, the price gap between your anchor model and your implement model matters. Small gap (Opus to Sonnet) means thin per-turn savings that the cold reads eat. Big gap (anything to MiMo/DeepSeek/Kimi, or Fable to Sonnet) means the savings dwarf the switch cost.

The phase length also matters. Short phases (plan, review at ~2 turns) don’t clear the break-even for same-family swaps. The implement phase at 10 to 20 turns clears everything. That said, if MiMo v2.5 Pro flubs it and you spend three Opus turns cleaning up, then the savings evaporate. Today’s cheap coders (Kimi K2.6, DeepSeek V4 Pro, MiMo v2.5 Pro) are strong on SWE-style work, so it is plausible to make it work.

Is per-phase routing mostly a trap? Currently, I don’t think anyone has really solved it. Even the providers above try to maintain session stickiness, so they mostly end up routing only the initial prompt, which might not work well for bugs that look easy but are tricky, or for ambiguous prompts. They also often don’t give you granular enough visibility into what’s routed when and why. The two cases that I think work great for scoped tasks are routing from Fable (hopefully back soon), where the gap is wide enough to survive a round trip, and going cross-provider to a genuinely cheap implement model and staying there for the whole grind using a generated plan artifact.

Footnotes

  1. Original tweet / discussion: https://x.com/bnjorogedev/status/2050736215750496326?s=20

  2. Anthropic prompt caching (TTL, write/read multipliers): https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching

  3. Claude model and caching pricing: https://platform.claude.com/docs/en/about-claude/pricing

  4. OpenAI GPT-5.5 and GPT-5.4 pricing: https://openai.com/api/docs/pricing

  5. DeepSeek V4 Pro pricing: https://api-docs.deepseek.com/quick_start/pricing

  6. Kimi K2.6 pricing (Moonshot official API): https://platform.kimi.com/docs/pricing

  7. MiMo v2.5 Pro pricing (Xiaomi API, prices reduced May 2026): https://openrouter.ai/xiaomi/mimo-v2.5-pro

  8. Cursor auto mode: https://cursor.com/docs/models-and-pricing

  9. Martian RouterBench: https://withmartian.com/post/introducing-routerbench#1-%C2%B7-introduction

  10. Droid’s Router: https://factory.ai/product/router

  11. OpenRouter auto-router: https://openrouter.ai/docs/guides/routing/routers/auto-router

  12. oh-my-pi advisor-watchdog: https://github.com/can1357/oh-my-pi/blob/main/docs/advisor-watchdog.md

  13. Claude advisor tool: https://platform.claude.com/docs/en/agents-and-tools/tool-use/advisor-tool

  14. OpenRouter advisor server tool: https://openrouter.ai/docs/guides/features/server-tools/advisor

#llms #agents #mlops