2026_06_25 LLM Lab Book: OCR Variance & Claude Usage Limits Reporting

This is new

Prior to today, usage was broken out between Sonnet and other models like this

It's unclear to me whether I tripped a flag somewhere as I approached my weekly limit, or whether this dialog reflects a new Anthropic usage report.

I switched Gas Town over to all Sonnet models a few weeks ago for two reasons. First, to get more usage per week. Second, I've found that for the creative research work I have the polecats (LLM agents in Gas Town parlance) doing, Sonnet works much better than Opus 4.8. The research is for a book detailing the funding of mainstream general relativity by fringe science industrialists in the 1950s. It requires polecats to, for example, see the name Lucia Hobson and immediately jump to the fact that Nikola Tesla was the best man at her father's wedding. Thus far, Opus 4.8 has been a little too stick-in-the-mud to pull this off, but Sonnet 4.6 makes the association, and at a lower token cost.

Using Codex Instead

I switched over to Codex to run research analysis this morning as I neared my Claude usage limits for the week. Kicking off a job that uses Codex took about one percent of my Claude usage. Unlike most Claude runs, the Codex run is decidedly not kicking off subagents in parallel to do its searches. The analysis of two manifest pages consumed one percent of my weekly Codex usage on the twenty dollar plan. The results are different from the typical Sonnet 5.4 run. Codex and Sonnet tend to look in different places on the internet to do research. (As a reminder, I'm researching travel manifests of trips taken by people featured in the Gladych Files. To look for unknown or unexpected connections, I'm setting LLM agents to research each passenger on the manifest page.) For one passenger's stated address, Codex found a newspaper article, an obituary in fact, indicating the address was a residential apartment building. The obituary was not boring.

Granted, the passenger traveled in 1937, but what an interesting way to learn about the use of a building.

The following pair of manifest pages consumed two percent of my available Codex usage for the week.

Performance Variance

And then there was this: two OCR runs on an identical image with very different results.

For easier viewing, here's row 6 on the first run (correct)

and then, the second run (incorrect)

Note: these are addresses from 1937, so I think privacy-wise, we're in the clear.

The completely incorrect address seems to have been caused by the image of the page being slanted.

Possible Fixes

I can set up another agent to ensure that manifest page images are horizontal, not slanted.
I can attempt to modify the prompt to account for slanted lines in one pass.
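
Either fix needs a way to tell whether a page is slanted in the first place. Here is a minimal sketch of that check, assuming some earlier step has already sampled (x, y) pixel coordinates along one ruled row line of the manifest table. The function names, the sampled points, and the one-degree limit are all hypothetical, not part of the current pipeline.

```python
import math


def estimate_skew_degrees(points):
    """Fit a least-squares line through points sampled along one ruled
    row line and return its tilt from horizontal, in degrees.

    Image y coordinates grow downward, so a positive angle here means
    the line drops toward the right edge of the scan.
    """
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("points must span more than one x position")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    # sxy / sxx is the fitted slope; atan2 converts it to an angle.
    return math.degrees(math.atan2(sxy, sxx))


def needs_deskew(points, max_skew_degrees=1.0):
    """True when the fitted line tilts more than the (assumed) limit."""
    return abs(estimate_skew_degrees(points)) > max_skew_degrees
```

A page that fails `needs_deskew` could go straight to OCR, while one that passes it would be rotated first or sent to the slant-aware prompt.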

Ongoing Detection of the Issue

This is where it gets interesting. It's cost- and time-prohibitive to check every result. I need to set up a random audit process, similar to the one used by banks, where pages are spot-checked first by another agent and then, if they fail, by a human.
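
A rough sketch of how that two-tier spot check could be wired up, assuming pages are identified by strings and that the agent and human checks are callables returning True on a pass. The names `select_audit_sample` and `audit_page`, the sampling rate, and the status labels are placeholders I made up for illustration.

```python
import random


def select_audit_sample(page_ids, rate, seed=None):
    """Pick a random subset of pages to audit.

    rate is the fraction of pages to check (0 < rate <= 1). At least
    one page is chosen whenever there are any pages at all.
    """
    pages = list(page_ids)
    if not pages:
        return []
    rng = random.Random(seed)  # a seed makes the sample reproducible
    k = min(len(pages), max(1, round(len(pages) * rate)))
    return sorted(rng.sample(pages, k))


def audit_page(page_id, agent_check, human_check):
    """Second agent looks first; a human only sees pages the agent fails."""
    if agent_check(page_id):
        return "passed_agent"
    if human_check(page_id):
        return "passed_human"
    return "failed"
```

The human check is only called when the agent check fails, which is what keeps the review time bounded.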

I'm able to do things using agents for historical research that I couldn't accomplish otherwise. I've researched hundreds of people (I'll have an exact count soon) for connections with the book's main characters. As a side benefit, the research is making the historical texture of the book richer. Now that I have the capability, the stakes are high for missing associations that could lead to new parts of the story. To mitigate the risk, I may set up automated tests as well. The first automated test that springs to mind is to look for manifest pages with a low percentage of search results per passenger. Given the level of detail available, if the agent searches based on the correct addresses and timeframes, there are usually several web search hits per page. It's a simple enough process-oriented test (as opposed to agent-oriented).
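
The low-yield test could look something like the sketch below. It assumes the pipeline can report, for each manifest page, a list of web search hit counts with one entry per passenger. The function names and the 0.5 cutoff are placeholders; a real threshold would have to come from baseline numbers on pages known to be good.

```python
def hit_fraction(hits_per_passenger):
    """Fraction of passengers on a page with at least one search hit.

    A page with no extracted passengers scores 0.0 so that it gets
    flagged rather than silently skipped.
    """
    if not hits_per_passenger:
        return 0.0
    with_hits = sum(1 for hits in hits_per_passenger if hits > 0)
    return with_hits / len(hits_per_passenger)


def flag_low_yield_pages(hits_by_page, min_fraction=0.5):
    """Return the page ids whose hit fraction falls below the cutoff."""
    return sorted(
        page_id
        for page_id, hits in hits_by_page.items()
        if hit_fraction(hits) < min_fraction
    )
```

Because it only looks at counts the search step already produces, the test costs no extra model usage.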

Bantering about Ideas with GPT-5.4

I talked over the above with my GPT-5.4 medium-effort lab assistant, and we came up with the following table of tests that I'll be fleshing out more tomorrow.

Table

Here it is.

Test	Purpose	Trigger / Input	Metric	Threshold / Flag	Action
Preflight quality gate	Catch bad scans before extraction	Every page image	Skew, blur, contrast, clipping, row-line detectability	Any quality score below minimum or skew above limit	Auto-deskew, enhance, or route to audit queue
Repeatability check	Detect unstable extraction on the same input	Run same page 3–5 times	Field exact-match rate; page disagreement score	Meaningful row or field disagreement across runs	Mark page low-confidence and send to review
Perturbation test	Measure robustness to tiny visual changes	Variants of same page: ±1°/±2° deskew, crop, contrast, resize	Output stability across perturbed variants	Key fields change under small perturbations	Flag page type or pipeline step as brittle
Raw vs normalized consensus	Reduce trust in one fragile pass	Raw page and deskewed/cleaned page	Agreement rate on key fields	Mismatch on names, addresses, dates, or row alignment	Keep agreed fields; escalate mismatches
Row-schema validation	Catch structurally implausible records	Structured output for each row	Type / format validity for fields	Bad address shape, unparsable date, nonnumeric age, broken column mapping	Reject or review failing rows
Neighbor-row consistency	Catch line hopping and adjacent-row contamination	Rows within same manifest page	Column alignment and local consistency	Sudden row-to-row field shifts inconsistent with table structure	Review page or rerun with stricter segmentation
Search-yield anomaly	Use downstream retrieval as a smoke alarm	Post-extraction search workflow	Hits per passenger; zero-hit fraction; address-backed hit rate	Page underperforms baseline for similar manifests	Flag for audit and compare with image-quality metrics
Field-level retrieval check	Estimate confidence for historically important fields	Name/address/year or surname/street/city queries	External snippet agreement with extracted field	Weak or contradictory support for key fields	Mark field low-confidence or review manually
Canary benchmark set	Catch regressions after pipeline changes	Known-good manually verified pages	Exact row accuracy; field accuracy; disagreement; downstream yield	Regression versus previous baseline	Block rollout or investigate change
Control chart monitoring	Detect slow drift over time	Daily or weekly ops metrics	Audit fail rate, disagreement score, correction rate, percent deskewed	Metric shifts outside normal control band	Investigate provider, prompt, or preprocessing drift
Hard-case suite	Track performance on nasty but realistic pages	Slanted, faint, shadowed, broken-line, overexposed pages	Accuracy and stability on adversarial subset	Hard-case performance worsens or fails to improve	Use as targeted robustness benchmark
Weighted human audit	Spend review time where risk is highest	Pages with skew, low yield, disagreement, or schema failures	Audit sampling weighted by risk score	Composite risk score above review threshold	Manual check, correction, and root-cause tagging
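
As a starting point for the repeatability check row, here is a minimal sketch of a field exact-match rate across repeated runs on the same page. It assumes each run comes back as a list of row dicts sharing the same field names. The function name, the field names in the usage note, and the choice to treat mismatched row counts as total disagreement are all my assumptions.

```python
def field_agreement_rate(runs):
    """Share of (row, field) cells on which every run returned the
    exact same value.

    runs is a list of OCR runs over the same page image; each run is a
    list of row dicts. Runs with differing row counts score 0.0, since
    rows can no longer be compared position by position.
    """
    if len(runs) < 2:
        raise ValueError("need at least two runs to compare")
    row_count = len(runs[0])
    if any(len(run) != row_count for run in runs):
        return 0.0
    total = 0
    agreed = 0
    for rows in zip(*runs):  # the same row position from every run
        for field in rows[0]:
            total += 1
            values = {row.get(field) for row in rows}
            if len(values) == 1:
                agreed += 1
    return agreed / total if total else 1.0
```

For example, two runs of two rows each with `name` and `address` fields, differing only in one address, would score 0.75, and a page scoring below some chosen cutoff would be marked low-confidence and sent to review.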
