Routing model choice by surface — how we use Opus, Sonnet, and Haiku in one product
Spine routes every Claude call to the smallest model that can do the job. Here is the routing table, the reasoning, and the cost numbers.
A common mistake when building on a frontier model is to treat it as a single endpoint. You sign up for the API, you point everything at the most capable model, you marvel at the results, and then your bill arrives. The shipping question is not "can we use Opus?" — it is "where does Opus actually pay for itself, and where would Sonnet or Haiku do the job equally well at a tenth the cost?"
Spine answers that question per surface. Every Claude call in the product is routed deliberately. The routing table is short, opinionated, and worth writing down.
The three tiers, and what each one does in Spine
Opus 4.7 is the synthesis tier. It reads many things at once, holds them all in context, and composes an answer that would be wrong without the full picture. We use it in three places, and only three.
The first is the editorial brief. The brief is what a senior developmental editor would hand back after a careful read of the manuscript: structural diagnoses, character-arc observations, pacing notes, plotline orphans, all anchored to specific scenes and citable. To do this honestly, the model needs the whole manuscript. Opus 4.7's million-token context lets us load the entire 130,000-word draft, give the model an eight-tool toolbox of read-only operations against the spine database, and let it iterate until it has cited evidence for every claim it is about to make. The output is roughly the length of a magazine review and roughly the quality of a senior dev editor's first pass. We could not do this with a smaller model. Believing that we could is the trap.
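To make the shape of that loop concrete, here is a minimal sketch of an iterative tool-use pass in TypeScript against the Anthropic SDK. The tool, the model id, the token budgets, and the executeReadOnlyTool dispatcher are all assumptions for illustration, not Spine's actual eight-tool set:

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();
const OPUS_MODEL_ID = "claude-opus-4-5"; // assumption: substitute whatever id your top tier resolves to

// Illustrative read-only tool; the real toolbox has eight of these.
const tools: Anthropic.Tool[] = [
  {
    name: "get_scene",
    description: "Fetch the full text and metadata of one scene by id.",
    input_schema: {
      type: "object",
      properties: { scene_id: { type: "string" } },
      required: ["scene_id"],
    },
  },
];

// Stub dispatcher; the real one queries the spine database.
async function executeReadOnlyTool(name: string, input: unknown): Promise<unknown> {
  return { name, input };
}

async function runEditorialBrief(manuscript: string): Promise<string> {
  const messages: Anthropic.MessageParam[] = [
    { role: "user", content: `Write the editorial brief for this draft. Cite scenes.\n\n${manuscript}` },
  ];

  // Let the model iterate: think, call tools, read results, repeat until
  // it stops asking for tools and emits the brief itself.
  while (true) {
    const response = await client.messages.create({
      model: OPUS_MODEL_ID,
      max_tokens: 16000,
      thinking: { type: "enabled", budget_tokens: 8000 },
      tools,
      messages,
    });

    if (response.stop_reason !== "tool_use") {
      return response.content
        .filter((b): b is Anthropic.TextBlock => b.type === "text")
        .map((b) => b.text)
        .join("\n");
    }

    // Echo the assistant turn back (thinking blocks included), then answer
    // each tool call with a tool_result block.
    messages.push({ role: "assistant", content: response.content });
    const results: Anthropic.ToolResultBlockParam[] = [];
    for (const block of response.content) {
      if (block.type === "tool_use") {
        results.push({
          type: "tool_result",
          tool_use_id: block.id,
          content: JSON.stringify(await executeReadOnlyTool(block.name, block.input)),
        });
      }
    }
    messages.push({ role: "user", content: results });
  }
}
```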
The second is character voicing in the chat. When the user opens a conversation with Lizzy Bennet on the morning after Darcy's first proposal, the model has to be Lizzy. Not "narrate as Lizzy" — be her, in character, drawing on every line she has spoken in every scene she has appeared in. Voice consistency is hard. Tone-matching across a long context is hard. Extended thinking — Opus's ability to reason silently before producing visible output — is what makes the chat replies actually land in voice. Sonnet is good but not in this league for sustained character work, and Haiku is not even a serious candidate.
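In code, the chat call comes down to two things: extended thinking turned on, and a system prompt that carries the character's actual lines. A minimal sketch, with the grounding format, thinking budget, and model id as assumptions:

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();
const OPUS_MODEL_ID = "claude-opus-4-5"; // assumption, as above

async function chatAsCharacter(
  name: string,
  spokenLines: string[], // every line the character speaks, pulled from the scene corpus
  userMessage: string,
): Promise<string> {
  const response = await client.messages.create({
    model: OPUS_MODEL_ID,
    max_tokens: 4096,
    // Silent reasoning before the visible reply; this is what keeps the voice steady.
    thinking: { type: "enabled", budget_tokens: 2000 },
    system:
      `You are ${name}. Reply in character; never narrate about ${name} from outside. ` +
      `Match the diction and rhythm of these lines:\n\n${spokenLines.join("\n")}`,
    messages: [{ role: "user", content: userMessage }],
  });

  // Thinking blocks stay internal; only text blocks reach the user.
  return response.content
    .filter((b): b is Anthropic.TextBlock => b.type === "text")
    .map((b) => b.text)
    .join("");
}
```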
The third is contradiction-scanning in the editorial intern panel. The intern does a multi-pass read of the manuscript looking for internal inconsistencies. Did Lady Catherine's age shift between chapter twelve and chapter forty? Did the Bennet sisters move from five to four sometime in the second act? Did the Pemberley estate gain a wing it never had before? Catching these requires the same operation Opus is uniquely good at: holding the whole book at once and noticing what does not match.
Sonnet 4.6 is the mid-weight reasoning tier. It is where most of the editorial intelligence lives, because most editorial intelligence is mid-weight reasoning. The diff insights pass — explaining what changed structurally between draft v3 and draft v4 — runs on Sonnet. The plotline-naming pass — turning a cluster of scenes into a single human-readable label — runs on Sonnet. The voice / audience / comparable-titles passes that go into the intern report all run on Sonnet. The pattern is: take a structured input that has already been reduced to its essentials, reason about it, produce a structured output. No need for a million-token context here. Sonnet is six times cheaper per token than Opus, and the quality difference on these mid-weight tasks is below the noise floor of human evaluation.
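The shape of these mid-weight passes is easy to show. Here is a sketch of the plotline-naming pass: force the answer through a tool so it comes back structured, then validate it with Zod. The tool name, schema shape, and model id are illustrative:

```typescript
import Anthropic from "@anthropic-ai/sdk";
import { z } from "zod";

const client = new Anthropic();
const SONNET_MODEL_ID = "claude-sonnet-4-5"; // assumption: the mid tier's actual id

// What the pass must return; the shape is illustrative.
const PlotlineName = z.object({
  name: z.string().max(80),
  rationale: z.string(),
});

async function namePlotline(sceneSummaries: string[]) {
  const response = await client.messages.create({
    model: SONNET_MODEL_ID,
    max_tokens: 1024,
    tools: [
      {
        name: "report_plotline_name",
        description: "Return one human-readable name for this cluster of scenes.",
        input_schema: {
          type: "object",
          properties: {
            name: { type: "string" },
            rationale: { type: "string" },
          },
          required: ["name", "rationale"],
        },
      },
    ],
    // Forcing the tool means the reply is always a structured tool call.
    tool_choice: { type: "tool", name: "report_plotline_name" },
    messages: [
      {
        role: "user",
        content: `Name the plotline these scenes form:\n\n${sceneSummaries.join("\n")}`,
      },
    ],
  });

  const call = response.content.find(
    (b): b is Anthropic.ToolUseBlock => b.type === "tool_use",
  );
  return PlotlineName.parse(call?.input);
}
```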
Haiku 4.5 is the hot path. Per-scene extraction during ingest — the operation that runs hundreds of times on a single book — runs on Haiku with prompt caching. The system prompt is identical across every scene; only the user message changes. When you cache the system prompt, the second call onward is almost free. We measure this directly. On a typical 320-scene novel, the ingest pipeline issues one Haiku call per scene: the first call creates the cache, and the remaining 319 hit it. The effective per-call cost is one tenth of the uncached number. The same pattern applies to per-mention character disambiguation, ask-triage (the routing decision before the actual answer), and the chapter-segment fallback used when Pioneer's NER service is unreachable.
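The caching mechanics are one field on the system prompt. A sketch of the per-scene loop, with the model id and prompt text assumed (the real extraction prompt is long enough to clear the cache's minimum token threshold, which this abridged one would not be):

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();
const HAIKU_MODEL_ID = "claude-haiku-4-5"; // assumption: the hot-path tier's actual id

// One system prompt for every scene, marked cacheable. The first call
// writes the cache; every later call reads it at a fraction of the price.
const EXTRACTION_SYSTEM: Anthropic.TextBlockParam[] = [
  {
    type: "text",
    text: "Extract the characters, locations, and beats from the scene as JSON. ...", // abridged
    cache_control: { type: "ephemeral" },
  },
];

async function ingestScenes(scenes: string[]): Promise<void> {
  for (const scene of scenes) {
    const response = await client.messages.create({
      model: HAIKU_MODEL_ID,
      max_tokens: 1024,
      system: EXTRACTION_SYSTEM,
      messages: [{ role: "user", content: scene }],
    });
    // cache_creation_input_tokens is nonzero on call one,
    // cache_read_input_tokens on every call after it.
    console.log(response.usage.cache_creation_input_tokens, response.usage.cache_read_input_tokens);
  }
}
```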
A small table of what runs where
For anyone who wants the cheat sheet, here is the surface-to-model mapping in the current build (the same table appears as code just after the list):
- Editorial brief: Opus 4.7 with extended thinking, eight read-only tools, iterative.
- Character chat: Opus 4.7 with extended thinking, in-voice grounding from scene corpus.
- Contradiction scan: Opus 4.7 with extended thinking.
- Diff insights between draft versions: Sonnet 4.6.
- Plotline name synthesis: Sonnet 4.6.
- Comp-title pass: Sonnet 4.6.
- Per-scene ingest extraction: Haiku 4.5 with prompt caching.
- Per-mention character disambiguation (Pioneer fallback): Haiku 4.5 with prompt caching.
- Ask triage (does this need the manuscript or the open web?): Haiku 4.5.
- Chapter-segment fallback: Haiku 4.5.
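As code, the table is one map. A sketch; the surface keys and Tier names are illustrative, and the real entries live inside the callClaude helper described below:

```typescript
type Tier = "opus" | "sonnet" | "haiku";

// One routing entry per surface. The flags say whether the surface gets
// extended thinking and whether its system prompt is cached.
const ROUTING: Record<string, { tier: Tier; thinking?: boolean; cached?: boolean }> = {
  editorial_brief:          { tier: "opus", thinking: true },
  character_chat:           { tier: "opus", thinking: true },
  contradiction_scan:       { tier: "opus", thinking: true },
  diff_insights:            { tier: "sonnet" },
  plotline_naming:          { tier: "sonnet" },
  comp_titles:              { tier: "sonnet" },
  scene_extraction:         { tier: "haiku", cached: true },
  character_disambiguation: { tier: "haiku", cached: true },
  ask_triage:               { tier: "haiku" },
  chapter_segment_fallback: { tier: "haiku" },
};
```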
Why this matters more than it seems
There is a real cost to picking the wrong model for a surface. If you put Haiku on the editorial brief, the brief is shallow and the user can tell. If you put Opus on the per-scene extraction, the ingest cost goes from a few dollars per book to a few hundred. Most products do one of these two things, and both are visible to users — the first as bad output, the second as a price ceiling that excludes anyone who is not already paying enterprise rates.
The third move, which is the move we made, is to treat model choice as a per-surface engineering decision. It is no different from picking a database for a workload, or a queue for a job, or a cache strategy for a read-heavy endpoint. The temptation to pick one model and use it for everything is the same temptation engineers used to have to pick one database and use it for everything. The product is better, and the bill is smaller, when you stop doing that.
What we instrumented
Every Claude call in Spine goes through a single helper, callClaude, that does six things at once:

- routes the call to the chosen tier;
- applies prompt caching when the system prompt is reusable;
- validates the output against a Zod schema derived from the prompt's tool definition;
- retries once on schema failure with a sharper prompt;
- persists a hash-keyed copy of the response to disk, so cold-cache calls become warm-cache calls;
- emits a usage event with the model used, the input/output token counts, and the cache hit/miss numbers.

The instrumented usage event is what populates the burn-rate dashboard in the admin panel. We can answer "how much did we spend this week, broken down by model and surface" in one query.
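A minimal sketch of the helper, numbered to match the list above. Everything here is an assumption: the option shape, the cache directory, the emitUsage sink, and the fact that the schema arrives as an argument rather than being derived from the tool definition:

```typescript
import Anthropic from "@anthropic-ai/sdk";
import { createHash } from "node:crypto";
import { mkdir, readFile, writeFile } from "node:fs/promises";
import { z } from "zod";

const client = new Anthropic();

function emitUsage(event: Record<string, unknown>): void {
  console.log("usage", event); // stub: the real sink writes to the usage-events table
}

function tryJson(s: string): unknown {
  try { return JSON.parse(s); } catch { return undefined; }
}

async function callClaude<T>(opts: {
  surface: string;
  model: string; // already routed to a tier
  system: Anthropic.TextBlockParam[]; // cache_control set where the prompt is reusable
  prompt: string;
  schema: z.ZodType<T>;
}): Promise<T> {
  // (5) Hash-keyed disk cache: an identical call never hits the API twice.
  const key = createHash("sha256")
    .update(JSON.stringify([opts.model, opts.system, opts.prompt]))
    .digest("hex");
  const path = `.claude-cache/${key}.json`;
  try {
    return opts.schema.parse(JSON.parse(await readFile(path, "utf8")));
  } catch {
    // cold cache: fall through to the API
  }

  let prompt = opts.prompt;
  for (let attempt = 0; attempt < 2; attempt++) {
    // (1) routed model, (2) cacheable system prompt
    const response = await client.messages.create({
      model: opts.model,
      max_tokens: 4096,
      system: opts.system,
      messages: [{ role: "user", content: prompt }],
    });

    // (6) usage event: feeds the burn-rate dashboard
    emitUsage({
      surface: opts.surface,
      model: opts.model,
      input: response.usage.input_tokens,
      output: response.usage.output_tokens,
      cacheRead: response.usage.cache_read_input_tokens ?? 0,
      cacheWrite: response.usage.cache_creation_input_tokens ?? 0,
    });

    const text = response.content
      .filter((b): b is Anthropic.TextBlock => b.type === "text")
      .map((b) => b.text)
      .join("");

    // (3) validate, (4) retry once with a sharper prompt on failure
    const parsed = opts.schema.safeParse(tryJson(text));
    if (parsed.success) {
      await mkdir(".claude-cache", { recursive: true });
      await writeFile(path, text);
      return parsed.data;
    }
    prompt = `${opts.prompt}\n\nYour previous answer failed validation:\n${parsed.error.message}\nReturn only valid JSON matching the schema.`;
  }
  throw new Error(`callClaude: schema validation failed twice for surface "${opts.surface}"`);
}
```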
That visibility is what made the routing table possible. We did not pick the routing by guesswork; we picked it by watching the usage events stream in and asking, for each surface, whether the smaller model was producing acceptable output. When it was, we moved the surface down. When it was not, we held the line.
The shipping rule
Use the smallest model that can do the job. The job is not "produce text" — the job is "produce text that is good enough that the user does not notice the model picked it up." That definition has a different answer for every surface. The routing table is the answer for Spine.
If you build on Claude and you find yourself routing everything to Opus, write the table. The bill will get smaller and the product will get better, in that order.