Skip to main content

Gladych Files Lab Book: Document OCR vs LLM Model vs Cost or Opus is Cheaper than Sonnet for OCR!

I started my lab book entries when I was a physics graduate student. It's kind of amusing and kind of cool how far I've come. I have the equivalent of a grad student, (aka Claude Opus 4.7), working for me now. I spent some time over the weekend setting up an OCR framework for a book research project of mine. I've been coming up to speed on evals, so I decided to run one to determine which model was the most accurate and cost effective for doing OCR on travel manifest pages. I stepped the eval along rather than automating it and talked the results through with Opus as I went. 

First, it turns out that Opus at low effort is the most accurate and the most cost effective choice! That was a surprise. The result has to do with Opus' ability to look at higher res images which means it needs to think less for OCR vs. Sonnet.

Second, at the end of the eval, as I was preparing to write up my results it occurred to me that I could ask my grad student to do it instead. Here's the log book entry created by Opus.

OCR for travel manifests — cost vs. model vs. effort

Lab notes, May 2026.

Objective

I'm building a pipeline to OCR archival travel manifests through the Anthropic API and turn each scanned page into structured JSON, one record per passenger. The plan is to run this against several thousand pages, so I need to understand what each page actually costs before committing to a model and effort setting. This entry collects the bench measurements I took on a single representative page, plus two small batch runs to sanity-check the per-page cost at scale.

Apparatus

  • Document: U.S. Department of Labor Form 630A, List of United States Citizens, sheet 41 of the S.S. Queen Mary manifest sailing Cherbourg to New York, arrival 19 October 1936. Thirty passenger lines, plus two struck-through duplicates and a footer with purser signature and inspector stamp. A mix of typewritten and handwritten content, several overwrites, ditto marks throughout the address column. A reasonably hard but not pathological page.
  • Image: Single JPEG, roughly 2000×2500 pixels.
  • Models tested: claude-opus-4-7 ($5/$25 per million input/output tokens) and claude-sonnet-4-6 ($3/$15).
  • Thinking mode: Adaptive thinking with summarized display. Effort levels swept across max, high, medium, and low.
  • Prompt: Short, identical across all runs. Asks for structured JSON, no research on passengers.
  • Driver: Small Python script using the Anthropic SDK streaming interface, base64-encoded image, output token cap of 64000 tokens unless noted.

Runs

pages model effort cost (USD) max tokens input tokens output tokens avg cost/page notes
1opusmax0.2264000484777080.22 
1opushigh0.1464000484749420.14 
1sonnethigh0.39640001630253070.39 
1sonnethigh0.48320000.48did not complete
1opusmedium0.1464000484746310.14 
1opuslow0.1364000484739260.13
1opuslow0.1564000484739290.15aborted call may have added ~$0.02
1opuslow0.1464000484746120.141.79 cumulative cost
1opuslow0.1164000488129190.11 
5opuslow0.44640000.088some pages have fewer passengers
14opuslow1.26640000.09

Observations

1. Opus at max effort is a baseline, not a target.

The first run used effort=max on Opus and came in at $0.22. That worked but felt expensive enough to be worth chasing down. The output JSON was correct and well-formatted; the question is whether the model needed to think that hard.

2. Sonnet at the same effort cost more, not less.

This was the first surprise. Sonnet 4.6 is meant to be the cheaper model — $3/$15 per million tokens versus Opus's $5/$25 — but the same image and prompt at effort=high cost $0.39 on Sonnet against $0.14 on Opus. That's roughly 2.8× more on the "cheaper" model.

The token columns explain it. Sonnet produced 25,307 output tokens to Opus's 4,942 at the same effort setting — about 5× as much. Since the final JSON is essentially the same size regardless of which model writes it, those extra ~20,000 tokens are almost all thinking. Sonnet's nominal per-token discount evaporates when it burns five times the tokens to reach the same answer.

3. The input token columns reveal a different image.

Looking more carefully at the input column: Opus took 4,847 input tokens; Sonnet took 1,630. Same image. Same prompt. The prompt itself is small — the difference is in how each model handles the image.

From the Anthropic vision docs: Opus 4.7 is the first model with high-resolution image support, accepting up to 2576 pixels on the long edge and producing up to about 4,784 image tokens. Other current models, including Sonnet 4.6, cap at 1568 pixels and about 1,568 image tokens. A high-resolution JPEG gets downsampled by roughly 3× before Sonnet ever sees it.

For a dense archival document — small handwriting in margins, strikethroughs to distinguish from underlines, ditto marks, partially visible overwrites — that resolution loss isn't free. The extra thinking Sonnet does is plausibly compensation for a blurrier view of the page. It's not that Sonnet is "doing worse OCR"; it's that Sonnet is being asked to do harder OCR.

Practical consequence: for vision-heavy work on small or detailed content, the two models are not interchangeable. Sonnet at the same effort doesn't see the same input.

4. The Sonnet 32k run aborted and still cost money.

One Sonnet run with max_tokens=32000 didn't complete — it appears to have hit the cap while still in the thinking phase — and the call still showed $0.48 on the console. Implication: output token caps don't protect you from cost. Tokens generated before the abort are billed. Setting max_tokens too low isn't a cost lever; it just produces failed runs you've already paid for.

5. On Opus, medium and high effort cost the same.

This was the most useful single finding of the day. Opus at medium came in at $0.14 with 4,631 output tokens; Opus at high came in at $0.14 with 4,942 output tokens. The difference is noise.

Adaptive thinking is, well, adaptive. The effort level sets an upper bound on thinking budget; the model decides how much it actually uses based on the task. For this OCR job, Opus self-regulates somewhere below the medium ceiling and stops. Giving it high doesn't make it think more — it just leaves unused headroom. Which means there's a natural floor for this task on this model, somewhere in the 4,500–5,000 output token range, and no amount of effort dial-cranking will push above it without changing the prompt.

6. Opus at low saves cost, but partly from formatting.

Opus at low effort came in around $0.11–$0.15 across four runs (variance partly explained below), with output tokens ranging from 2,919 to 4,612. The lowest-cost run dropped about 700 tokens compared to a medium run that wrote out the same content.

Eyeballing the JSON, the answer is partly that low effort produces more compact JSON formatting — fewer line breaks, less indentation between object entries. For 30 passenger records with about a dozen keys each, that's roughly 400 line breaks, and at one to two tokens per \n + indentation, the formatting alone accounts for maybe 500–800 tokens. Most of the cost difference between medium and low is formatting density, not thinking depth.

This is exploitable. My script pretty-prints the JSON client-side anyway with json.dumps(parsed, indent=2), so I lose nothing by asking the model to emit compact JSON. Adding that instruction to the prompt should lock in the format savings independent of effort level.

7. Variance run-to-run is real but small.

Four runs of Opus at low on the same page came in at $0.13, $0.15, $0.14, and $0.11. Output tokens varied from 2,919 to 4,612 — about 60% spread. Adaptive thinking is stochastic, and the model decides how much to think on each pass. This is fine for a batch pipeline; on average the cost per page lands close to $0.13. But it does mean a single benchmark isn't sufficient — a small run of 3–5 measurements gives a better cost estimate than one.

The $0.15 outlier had a note attached: I aborted the call mid-stream and reran. The console total may include partial billing from the aborted attempt — about $0.02 worth of tokens before abort. Consistent with the Sonnet observation above: aborted calls cost something.

8. Batch runs confirm ~$0.09 per page at scale.

Two small batch runs to check the projection: 5 pages at $0.44 total ($0.088/page average) and 14 pages at $1.26 total ($0.09/page average). That's slightly below the single-page Opus low measurement of $0.13, because the bench page is dense (30 passengers); some pages in the batch had fewer passengers and produced less output.

For a research budget projection, $0.09/page is the number to use. A thousand-page batch would land around $90; ten thousand pages around $900.

Discussion

The original intuition — that Sonnet would be the cheaper production choice — turned out to be wrong for this specific workload, for two compounding reasons. First, the per-token discount on Sonnet is smaller than the thinking-token multiplier on Sonnet: 60% of the per-token price doesn't help if the model uses 5× the tokens. Second, and more interestingly, Sonnet and Opus aren't doing the same task. Opus 4.7's high-resolution image path means it sees a fundamentally higher-fidelity version of the manifest. For text-heavy archival documents, that's not a marginal improvement — it's a different problem to solve.

The adaptive thinking finding is also worth restating. I had been treating effort as a quality dial: higher effort means better answers. On Opus, for this task, that's wrong. Effort is a ceiling on thinking budget, and the model already self-regulates below the ceiling. Dropping from max to medium saved $0.08 per page with no visible quality difference, because the extra thinking budget wasn't being used. The lesson generalizes: when two effort levels produce the same cost, the model has found its natural floor for the task. low may push below the floor and degrade quality; medium through max may all behave the same.

The JSON formatting effect was unexpected and is the kind of thing that only shows up when you instrument runs carefully. A real fraction of "thinking token" cost differences between effort levels was actually output formatting cost. Adding "emit compact JSON, no extra whitespace" to the prompt is a free optimization that decouples format from effort.

Conclusions

  • Production setting: Opus 4.7 at effort=low for routine pages.
  • Expected average cost: ~$0.09 per page at scale.
  • Sonnet 4.6 is not the right choice for this workload, despite the nominally lower per-token price, primarily because of the lower image resolution path.
  • Add an instruction to the prompt requesting compact JSON output.
  • max_tokens=64000 is a safety ceiling, not a cost driver. Don't lower it to control spend; lower effort or switch to a fixed thinking budget instead.

Next steps

  1. Quality eval: hand-transcribe ground truth for 5–10 pages covering varying difficulty (clean, heavy strikethroughs, mostly handwritten, faded scans) and diff against Opus low output. The cost work is meaningless if the cheap setting isn't accurate enough.
  2. Try Opus low with the compact-JSON instruction and see whether the formatting savings materialize as expected.
  3. Investigate prompt caching for the system prompt. The instruction text is identical across every page in the batch; caching it should trim a noticeable slice of input cost on long runs.
  4. Build a per-page confidence and ambiguities field into the JSON schema so I can flag pages worth manually reviewing without re-OCRing the whole batch.
  5. If quality at low turns out to be marginal on hard pages, build an escalation path: re-run failed or low-confidence pages at medium, accepting the higher cost on the small subset that needs it.

Comments

Popular posts from this blog

Cool Math Tricks: Deriving the Divergence, (Del or Nabla) into New (Cylindrical) Coordinate Systems

Now available as a Kindle ebook for 99 cents ! Get a spiffy ebook, and fund more physics The following is a pretty lengthy procedure, but converting the divergence, (nabla, del) operator between coordinate systems comes up pretty often. While there are tables for converting between common coordinate systems , there seem to be fewer explanations of the procedure for deriving the conversion, so here goes! What do we actually want? To convert the Cartesian nabla to the nabla for another coordinate system, say… cylindrical coordinates. What we’ll need: 1. The Cartesian Nabla: 2. A set of equations relating the Cartesian coordinates to cylindrical coordinates: 3. A set of equations relating the Cartesian basis vectors to the basis vectors of the new coordinate system: How to do it: Use the chain rule for differentiation to convert the derivatives with respect to the Cartesian variables to derivatives with respect to the cylindrical variables. The chain ...

The Alcubierre Warp Drive Tophat Function and Open Science with Sage

I transferred yesterday's Mathematica file with the Alcubierre warp drive[2] line element and space curvature calculations to the  +Sage Mathematical Software System  today, (the files been  added to the public repository [3]).  If you haven't used Sage before, it's a Python based software package that's similar in functionality to Mathematica.  Oh, and it' free.  I also worked a little more on understanding the theory, but frankly, I made far more progress with the software than the theory.  What follows will be a little more of the Alcubierre theory, plus, a cool Sage interactive demo of one of the Alcubierre functions[1], as well as a bit about my first experience with using Sage. Theory The theory is fun, but it's moving slowly.  Here's the chalk board from this morning's discussion Alcubierre setup the derivation using something called the 3+1 formalism which means we consider space to be flat, (in this case), slices that are labelled ...

The Valentine's Day Magnetic Monopole

There's an assymetry to the form of the two Maxwell's equations shown in picture 1.  While the divergence of the electric field is proportional to the electric charge density at a given point, the divergence of the magnetic field is equal to zero.  This is typically explained in the following way.  While we know that electrons, the fundamental electric charge carriers exist, evidence seems to indicate that magnetic monopoles, the particles that would carry magnetic 'charge', either don't exist, or, the energies required to create them are so high that they are exceedingly rare.  That doesn't stop us from looking for them though! Keeping with the theme of Fairbank[1] and his academic progeny over the semester break, today's post is about the discovery of a magnetic monopole candidate event by one of the Fairbank's graduate students, Blas Cabrera[2].  Cabrera was utilizing a loop type of magnetic monopole detector.  Its operation is in...