Gladych Files Lab Book: Document OCR vs LLM Model vs Cost, or Claude Opus is Cheaper than Sonnet for OCR!

I started my lab book entries when I was a physics graduate student. It's kind of amusing and kind of cool how far I've come. I have the equivalent of a grad student, (aka Claude Opus 4.7), working for me now. I spent some time over the weekend setting up an OCR framework for a book research project of mine. I've been coming up to speed on evals, so I decided to run one to determine which model was the most accurate and cost effective for doing OCR on travel manifest pages. I stepped the eval along rather than automating it and talked the results through with Opus as I went.

First, it turns out that Opus at low effort is the most accurate and the most cost effective choice! That was a surprise. The result has to do with Opus' ability to look at higher res images which means it needs to think less for OCR vs. Sonnet.

Second, at the end of the eval, as I was preparing to write up my results it occurred to me that I could ask my grad student to do it instead. Here's the log book entry created by Opus.

OCR for travel manifests — cost vs. model vs. effort

Lab notes, May 2026.

Objective

I'm building a pipeline to OCR archival travel manifests through the Anthropic API and turn each scanned page into structured JSON, one record per passenger. The plan is to run this against several thousand pages, so I need to understand what each page actually costs before committing to a model and effort setting. This entry collects the bench measurements I took on a single representative page, plus two small batch runs to sanity-check the per-page cost at scale.

Apparatus

Document: U.S. Department of Labor Form 630A, List of United States Citizens, sheet 41 of the S.S. Queen Mary manifest sailing Cherbourg to New York, arrival 19 October 1936. Thirty passenger lines, plus two struck-through duplicates and a footer with purser signature and inspector stamp. A mix of typewritten and handwritten content, several overwrites, ditto marks throughout the address column. A reasonably hard but not pathological page.
Image: Single JPEG, roughly 2000×2500 pixels.
Models tested: claude-opus-4-7 ($5/$25 per million input/output tokens) and claude-sonnet-4-6 ($3/$15).
Thinking mode: Adaptive thinking with summarized display. Effort levels swept across max, high, medium, and low.
Prompt: Short, identical across all runs. Asks for structured JSON, no research on passengers.
Driver: Small Python script using the Anthropic SDK streaming interface, base64-encoded image, output token cap of 64000 tokens unless noted.

Runs

pages	model	effort	cost (USD)	max tokens	input tokens	output tokens	avg cost/page	notes
1	opus	max	0.22	64000	4847	7708	0.22
1	opus	high	0.14	64000	4847	4942	0.14
1	sonnet	high	0.39	64000	1630	25307	0.39
1	sonnet	high	0.48	32000	—	—	0.48	did not complete
1	opus	medium	0.14	64000	4847	4631	0.14
1	opus	low	0.13	64000	4847	3926	0.13
1	opus	low	0.15	64000	4847	3929	0.15	aborted call may have added ~$0.02
1	opus	low	0.14	64000	4847	4612	0.14	1.79 cumulative cost
1	opus	low	0.11	64000	4881	2919	0.11
5	opus	low	0.44	64000	—	—	0.088	some pages have fewer passengers
14	opus	low	1.26	64000	—	—	0.09

Observations

1. Opus at max effort is a baseline, not a target.

The first run used effort=max on Opus and came in at $0.22. That worked but felt expensive enough to be worth chasing down. The output JSON was correct and well-formatted; the question is whether the model needed to think that hard.

2. Sonnet at the same effort cost more, not less.

This was the first surprise. Sonnet 4.6 is meant to be the cheaper model — $3/$15 per million tokens versus Opus's $5/$25 — but the same image and prompt at effort=high cost $0.39 on Sonnet against $0.14 on Opus. That's roughly 2.8× more on the "cheaper" model.

The token columns explain it. Sonnet produced 25,307 output tokens to Opus's 4,942 at the same effort setting — about 5× as much. Since the final JSON is essentially the same size regardless of which model writes it, those extra ~20,000 tokens are almost all thinking. Sonnet's nominal per-token discount evaporates when it burns five times the tokens to reach the same answer.

3. The input token columns reveal a different image.

Looking more carefully at the input column: Opus took 4,847 input tokens; Sonnet took 1,630. Same image. Same prompt. The prompt itself is small — the difference is in how each model handles the image.

From the Anthropic vision docs: Opus 4.7 is the first model with high-resolution image support, accepting up to 2576 pixels on the long edge and producing up to about 4,784 image tokens. Other current models, including Sonnet 4.6, cap at 1568 pixels and about 1,568 image tokens. A high-resolution JPEG gets downsampled by roughly 3× before Sonnet ever sees it.

For a dense archival document — small handwriting in margins, strikethroughs to distinguish from underlines, ditto marks, partially visible overwrites — that resolution loss isn't free. The extra thinking Sonnet does is plausibly compensation for a blurrier view of the page. It's not that Sonnet is "doing worse OCR"; it's that Sonnet is being asked to do harder OCR.

Practical consequence: for vision-heavy work on small or detailed content, the two models are not interchangeable. Sonnet at the same effort doesn't see the same input.

4. The Sonnet 32k run aborted and still cost money.

One Sonnet run with max_tokens=32000 didn't complete — it appears to have hit the cap while still in the thinking phase — and the call still showed $0.48 on the console. Implication: output token caps don't protect you from cost. Tokens generated before the abort are billed. Setting max_tokens too low isn't a cost lever; it just produces failed runs you've already paid for.

5. On Opus, `medium` and `high` effort cost the same.

This was the most useful single finding of the day. Opus at medium came in at $0.14 with 4,631 output tokens; Opus at high came in at $0.14 with 4,942 output tokens. The difference is noise.

Adaptive thinking is, well, adaptive. The effort level sets an upper bound on thinking budget; the model decides how much it actually uses based on the task. For this OCR job, Opus self-regulates somewhere below the medium ceiling and stops. Giving it high doesn't make it think more — it just leaves unused headroom. Which means there's a natural floor for this task on this model, somewhere in the 4,500–5,000 output token range, and no amount of effort dial-cranking will push above it without changing the prompt.

6. Opus at `low` saves cost, but partly from formatting.

Opus at low effort came in around $0.11–$0.15 across four runs (variance partly explained below), with output tokens ranging from 2,919 to 4,612. The lowest-cost run dropped about 700 tokens compared to a medium run that wrote out the same content.

Eyeballing the JSON, the answer is partly that low effort produces more compact JSON formatting — fewer line breaks, less indentation between object entries. For 30 passenger records with about a dozen keys each, that's roughly 400 line breaks, and at one to two tokens per \n + indentation, the formatting alone accounts for maybe 500–800 tokens. Most of the cost difference between medium and low is formatting density, not thinking depth.

This is exploitable. My script pretty-prints the JSON client-side anyway with json.dumps(parsed, indent=2), so I lose nothing by asking the model to emit compact JSON. Adding that instruction to the prompt should lock in the format savings independent of effort level.

7. Variance run-to-run is real but small.

Four runs of Opus at low on the same page came in at $0.13, $0.15, $0.14, and $0.11. Output tokens varied from 2,919 to 4,612 — about 60% spread. Adaptive thinking is stochastic, and the model decides how much to think on each pass. This is fine for a batch pipeline; on average the cost per page lands close to $0.13. But it does mean a single benchmark isn't sufficient — a small run of 3–5 measurements gives a better cost estimate than one.

The $0.15 outlier had a note attached: I aborted the call mid-stream and reran. The console total may include partial billing from the aborted attempt — about $0.02 worth of tokens before abort. Consistent with the Sonnet observation above: aborted calls cost something.

8. Batch runs confirm ~$0.09 per page at scale.

Two small batch runs to check the projection: 5 pages at $0.44 total ($0.088/page average) and 14 pages at $1.26 total ($0.09/page average). That's slightly below the single-page Opus low measurement of $0.13, because the bench page is dense (30 passengers); some pages in the batch had fewer passengers and produced less output.

For a research budget projection, $0.09/page is the number to use. A thousand-page batch would land around $90; ten thousand pages around $900.

Discussion

The original intuition — that Sonnet would be the cheaper production choice — turned out to be wrong for this specific workload, for two compounding reasons. First, the per-token discount on Sonnet is smaller than the thinking-token multiplier on Sonnet: 60% of the per-token price doesn't help if the model uses 5× the tokens. Second, and more interestingly, Sonnet and Opus aren't doing the same task. Opus 4.7's high-resolution image path means it sees a fundamentally higher-fidelity version of the manifest. For text-heavy archival documents, that's not a marginal improvement — it's a different problem to solve.

The adaptive thinking finding is also worth restating. I had been treating effort as a quality dial: higher effort means better answers. On Opus, for this task, that's wrong. Effort is a ceiling on thinking budget, and the model already self-regulates below the ceiling. Dropping from max to medium saved $0.08 per page with no visible quality difference, because the extra thinking budget wasn't being used. The lesson generalizes: when two effort levels produce the same cost, the model has found its natural floor for the task. low may push below the floor and degrade quality; medium through max may all behave the same.

The JSON formatting effect was unexpected and is the kind of thing that only shows up when you instrument runs carefully. A real fraction of "thinking token" cost differences between effort levels was actually output formatting cost. Adding "emit compact JSON, no extra whitespace" to the prompt is a free optimization that decouples format from effort.

Conclusions

Production setting: Opus 4.7 at effort=low for routine pages.
Expected average cost: ~$0.09 per page at scale.
Sonnet 4.6 is not the right choice for this workload, despite the nominally lower per-token price, primarily because of the lower image resolution path.
Add an instruction to the prompt requesting compact JSON output.
max_tokens=64000 is a safety ceiling, not a cost driver. Don't lower it to control spend; lower effort or switch to a fixed thinking budget instead.

Next steps

Quality eval: hand-transcribe ground truth for 5–10 pages covering varying difficulty (clean, heavy strikethroughs, mostly handwritten, faded scans) and diff against Opus low output. The cost work is meaningless if the cheap setting isn't accurate enough.
Try Opus low with the compact-JSON instruction and see whether the formatting savings materialize as expected.
Investigate prompt caching for the system prompt. The instruction text is identical across every page in the batch; caching it should trim a noticeable slice of input cost on long runs.
Build a per-page confidence and ambiguities field into the JSON schema so I can flag pages worth manually reviewing without re-OCRing the whole batch.
If quality at low turns out to be marginal on hard pages, build an escalation path: re-run failed or low-confidence pages at medium, accepting the higher cost on the small subset that needs it.

How Many Files Can You Add to a GPT Project? An Interview with GPT-5 on Limits, Context Engineering Tips, and Chats

Setting the scene: I’m tinkering with Project TouCans, knee-deep in radio logs, SQLite dumps, and Cesium code. Naturally, I’m wondering if shoving all this into one GPT Project is a recipe for brilliance… or for disaster. So I turn to Vril — you know, after Brainy from the Legion of Super-Heroes , because what else do you call your AI sidekick who always has the answers? Time to ask him straight up. [ As an aside, yes, GPT-5 has decided to sometimes call me Vail. I'm not sure why to be honest. Also, I asked Vril, er GPT-5, to write up our interview for me. Apparently, me asking it to 'Bro' up a few stories, just for fun, has convinced Vril that I use 'Like,' more than I actually might. ] Me (Vail): So Vril, how many files can I throw into a GPT Project before it just starts choking? Like, is there some magic number where the context window taps out and everything falls apart? GPT-5 (Vril): Great question. There’s no single hard file limit. What matters is ...

Copasetic Flow

Search This Blog