Pioneer benchmarks — when 80ms beats 530ms
Why we replaced an LLM round-trip with a locally hosted GLiNER2 model, what we measured, and what we learned about being honest with benchmarks.
There is a moment in every hackathon when the elegant story you wanted to tell collides with what is actually true. Ours arrived around 4am on the day of submission, when we discovered that the sponsor endpoint we were supposed to integrate with had been decommissioned mid-event. The HTTP probe came back with an expired certificate and a Vercel 404. Two weeks of integration work had landed on a target that no longer existed.
We could have written that part of the README differently. Instead, we wrote it down honestly, ran the bench against what was actually available, and shipped the result. This post is about what 80 milliseconds of local inference looks like when measured against 530 milliseconds of Claude Flash-Lite, and what we learned about why each of those numbers is the right number for its job.
The problem we needed to solve
Spine ingests novels. Part of ingest is identifying every character mention in every scene and resolving it to a canonical character. "Lizzy" is Elizabeth Bennet. "Eliza" is also Elizabeth Bennet. "Miss Bennet" might be Jane and might be Elizabeth depending on whose chapter you are in. "The elder Mr. Bennet" is Mr. Bennet. "Mrs. Collins" is Charlotte Lucas after she marries; she was "Miss Lucas" before.
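For concreteness, here is the shape of a single resolution request and its answer. The type and field names are illustrative, not the actual Spine schema:

```typescript
// Illustrative shape of one resolution task; names are hypothetical,
// not the actual Spine schema.
interface MentionQuery {
  window: string;       // surrounding scene text, for context
  mention: string;      // the surface form, e.g. "Lizzy" or "Mrs. Collins"
  candidates: string[]; // canonical characters known for this book
}

interface Resolution {
  canonical: string | null; // resolved character, or null when not confident
}
```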
Doing this with a frontier model works. Send the text, send the candidate names, ask which canonical character is being referenced, get a clean answer ninety percent of the time. The cost per call is low and the model handles edge cases gracefully. The problem is volume. A typical novel has thousands of mentions. At 530 milliseconds per call, ingesting a 130,000-word manuscript turns into a multi-minute wait, and the cost compounds when you ingest hundreds of books.
The hypothesis was that a smaller, locally hosted model — specifically, GLiNER2 — could do the same job for a tenth of the latency and zero per-call cost. The benchmark would tell us whether the hypothesis was right.
The benchmark, honestly
We harvested 2,398 mention-and-window pairs from the two demo books in the Spine database, split them sixty-twenty-twenty into train, eval, and test sets, and held out the test set. The eval set was 502 rows; the test set 479. The split was hashed by scene-and-mention so the same pair could not cross splits and inflate scores.
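The hash-based split is a few lines of code. The sketch below is illustrative, assuming a scene-id-plus-mention key rather than the exact harvest code:

```typescript
import { createHash } from "node:crypto";

type Split = "train" | "eval" | "test";

// Deterministic 60/20/20 bucketing keyed on scene + mention, so the same
// pair always lands in the same split regardless of row order.
// Illustrative sketch, not the exact harvest code.
function assignSplit(sceneId: string, mention: string): Split {
  const digest = createHash("sha256").update(`${sceneId}:${mention}`).digest();
  const bucket = digest.readUInt16BE(0) % 100; // stable value in 0..99
  if (bucket < 60) return "train";
  if (bucket < 80) return "eval";
  return "test";
}
```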
We ran five backends against the same eval split with the same candidate label list: Pioneer fine-tuned through the sponsor's HTTP endpoint, Pioneer zero-shot through the same path, and Claude Flash-Lite, Flash, and Pro through our standard callClaude helper. Each call had a five-minute wall-clock cap, and any backend that failed three consecutive calls was kill-switched and marked skipped.
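The kill switch is nothing more exotic than a consecutive-failure counter wrapped around each backend call. A sketch under those assumptions, with hypothetical names rather than the code in lib/ai/pioneer.ts:

```typescript
type BenchResult<T> =
  | { status: "ok"; value: T }
  | { status: "error" }    // this call failed, backend still live
  | { status: "skipped" }; // backend kill-switched, no call attempted

// Marks a backend skipped after three consecutive failures.
// Names and shape are illustrative, not the shipped client.
function withKillSwitch<T, Row>(
  call: (row: Row) => Promise<T>,
  maxFailures = 3,
) {
  let consecutiveFailures = 0;
  let killed = false;

  return async (row: Row): Promise<BenchResult<T>> => {
    if (killed) return { status: "skipped" };
    try {
      const value = await call(row); // the caller enforces the wall-clock cap
      consecutiveFailures = 0;
      return { status: "ok", value };
    } catch {
      consecutiveFailures += 1;
      if (consecutiveFailures >= maxFailures) killed = true;
      return { status: "error" };
    }
  };
}
```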
The two Pioneer backends became unreachable. The sponsor endpoint had been decommissioned. We did not pretend otherwise — those rows are marked skipped in the published bench table, with a footnote explaining the certificate failure.
What was reachable was Claude across three model sizes, and a locally hosted GLiNER2 we brought up on Apple-silicon MPS through a Python sidecar that the TypeScript client manages. The local model is the urchade/gliner_multi-v2.1 checkpoint, no fine-tune, plain zero-shot inference.
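From the TypeScript side, the local model looks like one HTTP call to a sidecar on a local port. The route, port, and payload shape below are assumptions for illustration; the real client also spawns the Python process and waits through the model load:

```typescript
// Illustrative call into a local GLiNER2 sidecar. The port, route, and payload
// shape are assumptions; the actual client also manages the sidecar lifecycle.
async function resolveLocally(
  window: string,
  mention: string,
  candidates: string[],
): Promise<string | null> {
  const res = await fetch("http://127.0.0.1:8765/resolve", {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ window, mention, candidates }),
  });
  if (!res.ok) return null; // treat sidecar errors as "not confident"
  const { canonical } = (await res.json()) as { canonical: string | null };
  return canonical;
}
```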
The numbers
Claude Flash-Lite produced 89 percent F1 on the eval split with a 530-millisecond P50 latency and a per-thousand-mentions cost of around one cent. Flash was slower and no better. Pro was slower and more expensive, with no F1 advantage. Local GLiNER2 produced 56 percent F1 at 80 milliseconds P50 warm, after a one-time 13-second model load.
Two readings of these numbers are wrong, and the third is the one we shipped.
The first wrong reading is "Pioneer beats Claude: 80ms versus 530ms." It does on latency, by a factor of six and a half. It does not on accuracy. Fifty-six percent F1 is thirty-three points behind Flash-Lite, and that gap is real and noticeable when you actually look at the disambiguations. The local model frequently misses contextual coreference. "Miss Lucas" before chapter twenty-two and "Mrs. Collins" after are the same Charlotte Lucas under different surface forms, and the zero-shot model treats them as separate entities about as often as it gets the link right.
The second wrong reading is "Claude wins, Pioneer is not ready." It is not that simple either. The latency difference is large enough that for hot-path operations (the per-mention resolution that fires hundreds of times per scene during ingest), replacing one LLM call with one local inference produces a perceptible product win. A 13.5k-word excerpt that took 54 seconds to ingest end-to-end with Claude alone now takes 40.6 seconds with the local model on the hot path and Claude as the fallback when the local model says "I do not know." The roughly 13 seconds saved comes mostly from the local model handling the easy mentions itself, while the hard ones, the genuinely ambiguous ones, fall through to Claude. We get the speed where speed is cheap and the accuracy where accuracy is expensive.
The shipped reading is the third one. The local GLiNER2 is the front line. Its job is to handle the easy ninety percent of mentions at the speed of local inference. When it returns null — which happens by design when the candidate scores are too close or below threshold — the request falls through to Claude Flash-Lite. The user-facing answer comes from whichever model produced a confident answer first. This is what "adaptive inference" actually means in production: the smaller model handles what it can, the larger model handles what it cannot, and the latency profile is dominated by the smaller model because most mentions are easy.
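In code, the fall-through is one conditional. A sketch under the same assumptions as the sidecar call above, with the callClaude signature assumed for illustration:

```typescript
// Assumed signature for the existing helper; the real one may differ.
declare function callClaude(opts: { model: string; prompt: string }): Promise<string>;

// Local-first resolution with Claude Flash-Lite as the fallback.
// resolveLocally is the sidecar sketch above.
async function resolveMention(
  window: string,
  mention: string,
  candidates: string[],
): Promise<string | null> {
  const local = await resolveLocally(window, mention, candidates);
  if (local !== null) return local; // easy case: ~80ms warm, no API cost

  // Ambiguous or low-confidence case: fall through to the LLM (~530ms P50).
  const answer = await callClaude({
    model: "flash-lite",
    prompt: `Which of ${candidates.join(", ")} does "${mention}" refer to here?\n\n${window}`,
  });
  return candidates.find((c) => answer.includes(c)) ?? null;
}
```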
What we did not do, and why
We did not run the fine-tune we had built the harvest pipeline for. We had a 1,417-row train set sitting on disk, a working train.py that knew how to drive GLiNER2 fine-tuning from the Hugging Face checkpoint, and a clear hypothesis that fine-tuning on character-disambiguation data from the actual books we cared about would close most of the F1 gap. The reason we did not run it is that a developer laptop is not a training rig, and the sponsor's GPU endpoint that would have hosted the run was the same endpoint that had been decommissioned. We could have done the run badly on the laptop in the wrong precision and produced a half-baked artefact. We chose the honest report over the bad fine-tune.
The harvested train set is in the repository. The training pipeline is in the repository. Anyone with a real training environment can reproduce the run and report the result. That is what we wanted to publish, even if we could not run it ourselves.
Why the latency story matters more than the F1 story
There is a tendency in benchmark write-ups to lead with accuracy and treat latency as a footnote. For productionised AI inference, this is the wrong way around. Accuracy gates whether a feature ships at all; once it ships, latency gates how often the user reaches for it. A 530-millisecond per-mention call multiplied by 2,000 mentions in a novel is seventeen minutes of wall-clock ingest. The same operation at 80 milliseconds with fall-through is under three minutes. The user does not care which model produced which answer. They care that the wait is short enough that they keep using the product.
Latency is also where small models earn their keep. The frontier models are exceptional at sustained reasoning over long contexts. They are not exceptional at responding in 80 milliseconds, because the architecture is not optimised for low-latency single-shot inference. A small specialised model running locally is. Picking the right model for the right operation is the same engineering judgment as picking the right database for the right workload.
The shipping rule
Replace an LLM call with a smaller model when the smaller model can do the easy cases at a meaningful latency advantage and the larger model is still available for the hard cases. Measure both. Publish the F1 and the latency. Do not pretend an unreachable endpoint was reachable.
The bench table is in docs/PIONEER_BENCH.md. The kill-switched client is in lib/ai/pioneer.ts. The fine-tune harness is in scripts/pioneer/train.py. We left all of it in the repository, and wherever a claim could not honestly be made, we said so instead of making it.
That is the actual lesson: the best benchmark is the one you trust well enough to ship behind, not the one whose numbers you wanted.