I started my lab book entries when I was a physics graduate student. It's kind of amusing and kind of cool how far I've come. I have the equivalent of a grad student, (aka Claude Opus 4.7), working for me now. I spent some time over the weekend setting up an OCR framework for a book research project of mine. I've been coming up to speed on evals, so I decided to run one to determine which model was the most accurate and cost effective for doing OCR on travel manifest pages. I stepped the eval along rather than automating it and talked the results through with Opus as I went.
First, it turns out that Opus at low effort is the most accurate and the most cost effective choice! That was a surprise. The result has to do with Opus' ability to look at higher res images which means it needs to think less for OCR vs. Sonnet.
Second, at the end of the eval, as I was preparing to write up my results it occurred to me that I could ask my grad student to do it instead. Here's the log book entry created by Opus.
OCR for travel manifests — cost vs. model vs. effort
Lab notes, May 2026.
Objective
I'm building a pipeline to OCR archival travel manifests through the Anthropic API and turn each scanned page into structured JSON, one record per passenger. The plan is to run this against several thousand pages, so I need to understand what each page actually costs before committing to a model and effort setting. This entry collects the bench measurements I took on a single representative page, plus two small batch runs to sanity-check the per-page cost at scale.
Apparatus
- Document: U.S. Department of Labor Form 630A, List of United States Citizens, sheet 41 of the S.S. Queen Mary manifest sailing Cherbourg to New York, arrival 19 October 1936. Thirty passenger lines, plus two struck-through duplicates and a footer with purser signature and inspector stamp. A mix of typewritten and handwritten content, several overwrites, ditto marks throughout the address column. A reasonably hard but not pathological page.
- Image: Single JPEG, roughly 2000×2500 pixels.
- Models tested:
claude-opus-4-7($5/$25 per million input/output tokens) andclaude-sonnet-4-6($3/$15). - Thinking mode: Adaptive thinking with summarized
display. Effort levels swept across
max,high,medium, andlow. - Prompt: Short, identical across all runs. Asks for structured JSON, no research on passengers.
- Driver: Small Python script using the Anthropic SDK streaming interface, base64-encoded image, output token cap of 64000 tokens unless noted.
Runs
| pages | model | effort | cost (USD) | max tokens | input tokens | output tokens | avg cost/page | notes |
|---|---|---|---|---|---|---|---|---|
| 1 | opus | max | 0.22 | 64000 | 4847 | 7708 | 0.22 | |
| 1 | opus | high | 0.14 | 64000 | 4847 | 4942 | 0.14 | |
| 1 | sonnet | high | 0.39 | 64000 | 1630 | 25307 | 0.39 | |
| 1 | sonnet | high | 0.48 | 32000 | — | — | 0.48 | did not complete |
| 1 | opus | medium | 0.14 | 64000 | 4847 | 4631 | 0.14 | |
| 1 | opus | low | 0.13 | 64000 | 4847 | 3926 | 0.13 | |
| 1 | opus | low | 0.15 | 64000 | 4847 | 3929 | 0.15 | aborted call may have added ~$0.02 |
| 1 | opus | low | 0.14 | 64000 | 4847 | 4612 | 0.14 | 1.79 cumulative cost |
| 1 | opus | low | 0.11 | 64000 | 4881 | 2919 | 0.11 | |
| 5 | opus | low | 0.44 | 64000 | — | — | 0.088 | some pages have fewer passengers |
| 14 | opus | low | 1.26 | 64000 | — | — | 0.09 |
Observations
1. Opus at max effort is a baseline, not a target.
The first run used effort=max on Opus and came in at
$0.22. That worked but felt expensive enough to be worth chasing down.
The output JSON was correct and well-formatted; the question is whether
the model needed to think that hard.
2. Sonnet at the same effort cost more, not less.
This was the first surprise. Sonnet 4.6 is meant to be the cheaper
model — $3/$15 per million tokens versus Opus's $5/$25 — but
the same image and prompt at effort=high cost $0.39 on
Sonnet against $0.14 on Opus. That's roughly 2.8× more on the
"cheaper" model.
The token columns explain it. Sonnet produced 25,307 output tokens to Opus's 4,942 at the same effort setting — about 5× as much. Since the final JSON is essentially the same size regardless of which model writes it, those extra ~20,000 tokens are almost all thinking. Sonnet's nominal per-token discount evaporates when it burns five times the tokens to reach the same answer.
3. The input token columns reveal a different image.
Looking more carefully at the input column: Opus took 4,847 input tokens; Sonnet took 1,630. Same image. Same prompt. The prompt itself is small — the difference is in how each model handles the image.
From the Anthropic vision docs: Opus 4.7 is the first model with high-resolution image support, accepting up to 2576 pixels on the long edge and producing up to about 4,784 image tokens. Other current models, including Sonnet 4.6, cap at 1568 pixels and about 1,568 image tokens. A high-resolution JPEG gets downsampled by roughly 3× before Sonnet ever sees it.
For a dense archival document — small handwriting in margins, strikethroughs to distinguish from underlines, ditto marks, partially visible overwrites — that resolution loss isn't free. The extra thinking Sonnet does is plausibly compensation for a blurrier view of the page. It's not that Sonnet is "doing worse OCR"; it's that Sonnet is being asked to do harder OCR.
Practical consequence: for vision-heavy work on small or detailed content, the two models are not interchangeable. Sonnet at the same effort doesn't see the same input.
4. The Sonnet 32k run aborted and still cost money.
One Sonnet run with max_tokens=32000 didn't complete
— it appears to have hit the cap while still in the thinking phase
— and the call still showed $0.48 on the console. Implication:
output token caps don't protect you from cost. Tokens generated before
the abort are billed. Setting max_tokens too low isn't a
cost lever; it just produces failed runs you've already paid for.
5. On Opus, medium and high effort cost the same.
This was the most useful single finding of the day. Opus at
medium came in at $0.14 with 4,631 output tokens; Opus at
high came in at $0.14 with 4,942 output tokens. The
difference is noise.
Adaptive thinking is, well, adaptive. The effort level sets an upper
bound on thinking budget; the model decides how much it actually uses
based on the task. For this OCR job, Opus self-regulates somewhere
below the medium ceiling and stops. Giving it
high doesn't make it think more — it just leaves
unused headroom. Which means there's a natural floor for this task on
this model, somewhere in the 4,500–5,000 output token range, and
no amount of effort dial-cranking will push above it without changing
the prompt.
6. Opus at low saves cost, but partly from formatting.
Opus at low effort came in around $0.11–$0.15
across four runs (variance partly explained below), with output tokens
ranging from 2,919 to 4,612. The lowest-cost run dropped about 700
tokens compared to a medium run that wrote out the same
content.
Eyeballing the JSON, the answer is partly that low
effort produces more compact JSON formatting — fewer line breaks,
less indentation between object entries. For 30 passenger records with
about a dozen keys each, that's roughly 400 line breaks, and at one to
two tokens per \n + indentation, the formatting alone
accounts for maybe 500–800 tokens. Most of the cost difference
between medium and low is formatting density,
not thinking depth.
This is exploitable. My script pretty-prints the JSON client-side
anyway with json.dumps(parsed, indent=2), so I lose nothing
by asking the model to emit compact JSON. Adding that instruction to
the prompt should lock in the format savings independent of effort
level.
7. Variance run-to-run is real but small.
Four runs of Opus at low on the same page came in at
$0.13, $0.15, $0.14, and $0.11. Output tokens varied from 2,919 to
4,612 — about 60% spread. Adaptive thinking is stochastic, and
the model decides how much to think on each pass. This is fine for a
batch pipeline; on average the cost per page lands close to $0.13. But
it does mean a single benchmark isn't sufficient — a small run
of 3–5 measurements gives a better cost estimate than one.
The $0.15 outlier had a note attached: I aborted the call mid-stream and reran. The console total may include partial billing from the aborted attempt — about $0.02 worth of tokens before abort. Consistent with the Sonnet observation above: aborted calls cost something.
8. Batch runs confirm ~$0.09 per page at scale.
Two small batch runs to check the projection: 5 pages at $0.44 total
($0.088/page average) and 14 pages at $1.26 total ($0.09/page average).
That's slightly below the single-page Opus low measurement
of $0.13, because the bench page is dense (30 passengers); some pages
in the batch had fewer passengers and produced less output.
For a research budget projection, $0.09/page is the number to use. A thousand-page batch would land around $90; ten thousand pages around $900.
Discussion
The original intuition — that Sonnet would be the cheaper production choice — turned out to be wrong for this specific workload, for two compounding reasons. First, the per-token discount on Sonnet is smaller than the thinking-token multiplier on Sonnet: 60% of the per-token price doesn't help if the model uses 5× the tokens. Second, and more interestingly, Sonnet and Opus aren't doing the same task. Opus 4.7's high-resolution image path means it sees a fundamentally higher-fidelity version of the manifest. For text-heavy archival documents, that's not a marginal improvement — it's a different problem to solve.
The adaptive thinking finding is also worth restating. I had been
treating effort as a quality dial: higher effort means
better answers. On Opus, for this task, that's wrong. Effort is a
ceiling on thinking budget, and the model already self-regulates
below the ceiling. Dropping from max to medium
saved $0.08 per page with no visible quality difference, because the
extra thinking budget wasn't being used. The lesson generalizes: when
two effort levels produce the same cost, the model has found its
natural floor for the task. low may push below the floor
and degrade quality; medium through max may
all behave the same.
The JSON formatting effect was unexpected and is the kind of thing that only shows up when you instrument runs carefully. A real fraction of "thinking token" cost differences between effort levels was actually output formatting cost. Adding "emit compact JSON, no extra whitespace" to the prompt is a free optimization that decouples format from effort.
Conclusions
- Production setting: Opus 4.7 at
effort=lowfor routine pages. - Expected average cost: ~$0.09 per page at scale.
- Sonnet 4.6 is not the right choice for this workload, despite the nominally lower per-token price, primarily because of the lower image resolution path.
- Add an instruction to the prompt requesting compact JSON output.
max_tokens=64000is a safety ceiling, not a cost driver. Don't lower it to control spend; lowereffortor switch to a fixed thinking budget instead.
Next steps
- Quality eval: hand-transcribe ground truth for 5–10 pages
covering varying difficulty (clean, heavy strikethroughs, mostly
handwritten, faded scans) and diff against Opus
lowoutput. The cost work is meaningless if the cheap setting isn't accurate enough. - Try Opus
lowwith the compact-JSON instruction and see whether the formatting savings materialize as expected. - Investigate prompt caching for the system prompt. The instruction text is identical across every page in the batch; caching it should trim a noticeable slice of input cost on long runs.
- Build a per-page confidence and ambiguities field into the JSON schema so I can flag pages worth manually reviewing without re-OCRing the whole batch.
- If quality at
lowturns out to be marginal on hard pages, build an escalation path: re-run failed or low-confidence pages atmedium, accepting the higher cost on the small subset that needs it.
Comments
Post a Comment
Please leave your comments on this topic: