
2026_06_25 LLM Lab Book: OCR Variance & Claude Usage Limits Reporting

This is new:

Prior to today, usage was broken out between Sonnet and other models like this:


It's unclear to me whether I tripped a flag somewhere as I approached my weekly limit, or whether this dialog reflects a new Anthropic usage report.

I switched Gas Town over to all Sonnet models a few weeks ago for two reasons. First, to get more usage per week. Second, I've found that for the creative research work I have the polecats (LLM agents in Gas Town parlance) doing, Sonnet works much better than Opus 4.8. The research is for a book detailing the funding of mainstream general relativity by fringe science industrialists in the 1950s. It requires polecats to, for example, see the name Lucia Hobson, and immediately jump to the fact that Nikola Tesla was the best man at her father's wedding. Thus far, Opus 4.8 has been a little too stick-in-the-mud to pull this off, but Sonnet 4.6 makes the association for a lower token cost.


Using Codex Instead

I switched over to Codex to run research analysis this morning as I neared my Claude usage limits for the week. Kicking off a job that uses Codex took about one percent of my Claude usage. Unlike most Claude runs, the Codex run is decidedly not kicking off subagents in parallel to do its searches. The analysis of two manifest pages consumed one percent of my weekly Codex usage on the twenty dollar plan. The results are different from the typical Sonnet 5.4 run. Codex and Sonnet tend to look in different places on the internet to do research. (As a reminder, I'm researching travel manifests of trips taken by people featured in the Gladych Files. To look for unknown or unexpected connections, I'm setting LLM agents to research each passenger on the manifest page.) For one passenger's stated address, Codex found a newspaper article, an obituary in fact, indicating the address was a residential apartment building. The obituary was not boring.


Granted, the passenger traveled in 1937, but what an interesting way to learn about the use of a building.

The following pair of manifest pages consumed two percent of my available Codex usage for the week.  


Performance Variance

And then, there was this. Two OCR runs on an identical image with very different results.

For easier viewing, here's row 6 on the first run (correct)


and then, the second run (incorrect)

Note: these are addresses from 1937, so I think privacy-wise, we're in the clear.

The completely incorrect address seems to have been caused by the image of the page being slanted.


Possible Fixes

  • I can set up another agent to ensure that manifest page images are horizontal, not slanted.
  • I can attempt to modify the prompt to account for slanted lines in one pass.
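As a rough illustration of the first option, here is a minimal sketch of a skew check that could run before OCR. It assumes some upstream step has already found the endpoints of a few ruled row lines on the page; that detector, the function names, and the one-degree threshold are all hypothetical placeholders, not part of the current pipeline.

```python
import math
import statistics


def skew_degrees(left_xy, right_xy):
    """Angle of one ruled line relative to horizontal, in degrees.

    Uses image coordinates (y grows downward), so a positive result
    means the line drops toward the right-hand side of the page.
    """
    (x1, y1), (x2, y2) = left_xy, right_xy
    return math.degrees(math.atan2(y2 - y1, x2 - x1))


def page_skew(row_lines):
    """Median skew over several row lines, so one bad line can't dominate."""
    return statistics.median(skew_degrees(a, b) for a, b in row_lines)


def needs_deskew(row_lines, max_skew_deg=1.0):
    """True when the page is slanted by more than the (assumed) threshold."""
    return abs(page_skew(row_lines)) > max_skew_deg


# Made-up endpoints: each line drops about 35 px over 1000 px of width.
slanted = [((0, 100), (1000, 135)), ((0, 200), (1000, 236)), ((0, 300), (1000, 334))]
print(needs_deskew(slanted))
```

A page that fails the check could be rotated by the negative of `page_skew` before extraction, or routed to review if the angle is extreme.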

Ongoing Detection of the Issue

This is where it gets interesting. It's cost and time prohibitive to check every result. I need to set up a random audit process, similar to the one used by banks, where pages are spot-checked first by another agent, and then, if they fail, by a human.
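The two-tier spot check could look something like the sketch below. The five percent sampling rate, the `agent_check` callback, and the `human_queue` list are assumptions made for illustration; the real agent check would be whatever second-opinion polecat ends up doing the comparison.

```python
import random


def sample_for_audit(page_ids, rate=0.05, seed=None):
    """Pick a random subset of pages to audit; at least one if any exist."""
    if not page_ids:
        return []
    rng = random.Random(seed)  # a fixed seed makes the audit sample reproducible
    k = max(1, round(len(page_ids) * rate))
    return sorted(rng.sample(list(page_ids), k))


def audit(page_ids, agent_check, human_queue, rate=0.05, seed=None):
    """Run the agent check on sampled pages; failures go to the human queue.

    agent_check is a hypothetical callable returning True when the second
    agent agrees with the original extraction for that page.
    """
    sampled = sample_for_audit(page_ids, rate, seed)
    for page_id in sampled:
        if not agent_check(page_id):
            human_queue.append(page_id)
    return sampled


pages = [f"page-{i:03d}" for i in range(100)]
needs_human = []
# Stand-in agent check that fails every sampled page, just to show the flow.
audit(pages, agent_check=lambda page_id: False, human_queue=needs_human, seed=1)
print(len(needs_human))
```

Only the pages that fail the agent tier ever reach a person, which keeps the human time proportional to the failure rate rather than to the page count.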

I'm able to do things using agents for historical research that I couldn't accomplish otherwise. I've researched hundreds of people (I'll have an exact count soon) for connections with the book's main characters. As a side benefit, the research is making the historical texture of the book richer. Now that I have the capability, the stakes are high for missing associations that could lead to new parts of the story. To mitigate the risk, I may set up automated tests as well. The first automated test that springs to mind is to look for manifest pages with a low percentage of search results per passenger. Given the level of detail available, if the agent searches based on the correct addresses and timeframes, there are usually several web search hits per page. It's a simple enough process-oriented test (as opposed to agent-oriented).
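A minimal sketch of that search-yield test follows. It assumes the pipeline can report, for each manifest page, how many web search hits each passenger produced; the data shape, the function name, and the floor of one hit per passenger are hypothetical and would need tuning against real runs.

```python
def low_yield_pages(hits_by_page, min_hits_per_passenger=1.0):
    """Return ids of pages whose mean search hits per passenger is too low.

    hits_by_page maps a page id to a list of hit counts, one per passenger.
    """
    flagged = []
    for page_id, hits in hits_by_page.items():
        if not hits:
            # No passengers extracted at all is suspicious in its own right.
            flagged.append(page_id)
        elif sum(hits) / len(hits) < min_hits_per_passenger:
            flagged.append(page_id)
    return flagged


# Made-up counts for illustration only.
example = {"page-a": [3, 2, 4], "page-b": [0, 0, 1], "page-c": []}
print(low_yield_pages(example))
```

Because it only looks at counts the pipeline already produces, the check costs no extra tokens, which is what makes it process-oriented rather than agent-oriented.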

Bantering about Ideas with GPT-5.4

I talked over the above with my GPT-5.4 medium effort lab assistant, and we came up with the following table of tests that I'll be fleshing out more tomorrow.

Table

Here it is.



| Test | Purpose | Trigger / Input | Metric | Threshold / Flag | Action |
| --- | --- | --- | --- | --- | --- |
| Preflight quality gate | Catch bad scans before extraction | Every page image | Skew, blur, contrast, clipping, row-line detectability | Any quality score below minimum or skew above limit | Auto-deskew, enhance, or route to audit queue |
| Repeatability check | Detect unstable extraction on the same input | Run same page 3–5 times | Field exact-match rate; page disagreement score | Meaningful row or field disagreement across runs | Mark page low-confidence and send to review |
| Perturbation test | Measure robustness to tiny visual changes | Variants of same page: ±1°/±2° deskew, crop, contrast, resize | Output stability across perturbed variants | Key fields change under small perturbations | Flag page type or pipeline step as brittle |
| Raw vs normalized consensus | Reduce trust in one fragile pass | Raw page and deskewed/cleaned page | Agreement rate on key fields | Mismatch on names, addresses, dates, or row alignment | Keep agreed fields; escalate mismatches |
| Row-schema validation | Catch structurally implausible records | Structured output for each row | Type / format validity for fields | Bad address shape, unparsable date, nonnumeric age, broken column mapping | Reject or review failing rows |
| Neighbor-row consistency | Catch line hopping and adjacent-row contamination | Rows within same manifest page | Column alignment and local consistency | Sudden row-to-row field shifts inconsistent with table structure | Review page or rerun with stricter segmentation |
| Search-yield anomaly | Use downstream retrieval as a smoke alarm | Post-extraction search workflow | Hits per passenger; zero-hit fraction; address-backed hit rate | Page underperforms baseline for similar manifests | Flag for audit and compare with image-quality metrics |
| Field-level retrieval check | Estimate confidence for historically important fields | Name/address/year or surname/street/city queries | External snippet agreement with extracted field | Weak or contradictory support for key fields | Mark field low-confidence or review manually |
| Canary benchmark set | Catch regressions after pipeline changes | Known-good manually verified pages | Exact row accuracy; field accuracy; disagreement; downstream yield | Regression versus previous baseline | Block rollout or investigate change |
| Control chart monitoring | Detect slow drift over time | Daily or weekly ops metrics | Audit fail rate, disagreement score, correction rate, percent deskewed | Metric shifts outside normal control band | Investigate provider, prompt, or preprocessing drift |
| Hard-case suite | Track performance on nasty but realistic pages | Slanted, faint, shadowed, broken-line, overexposed pages | Accuracy and stability on adversarial subset | Hard-case performance worsens or fails to improve | Use as targeted robustness benchmark |
| Weighted human audit | Spend review time where risk is highest | Pages with skew, low yield, disagreement, or schema failures | Audit sampling weighted by risk score | Composite risk score above review threshold | Manual check, correction, and root-cause tagging |
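To make the repeatability check concrete, here is a hedged sketch of how a per-row agreement score might be computed from several OCR runs of the same page. The row-dict layout, the field name, and the sample addresses are invented for illustration; the real extraction output may be shaped differently.

```python
from collections import Counter


def field_agreement(runs, field):
    """Per-row share of runs that agree with the most common value of field.

    runs is a list of OCR runs; each run is a list of row dicts in page
    order. Rows are paired by position, and zip() stops at the shortest
    run, so runs with differing row counts should be flagged separately.
    """
    scores = []
    for rows in zip(*runs):
        values = Counter(str(row.get(field, "")).strip().lower() for row in rows)
        top_count = values.most_common(1)[0][1]
        scores.append(top_count / len(rows))
    return scores


def unstable_rows(runs, field, min_agreement=1.0):
    """1-based row numbers whose agreement falls below the threshold."""
    scores = field_agreement(runs, field)
    return [i for i, score in enumerate(scores, start=1) if score < min_agreement]


# Three made-up runs of a two-row page; the third run misreads row 2.
runs = [
    [{"address": "12 Example St"}, {"address": "40 Sample Ave"}],
    [{"address": "12 Example St"}, {"address": "40 Sample Ave"}],
    [{"address": "12 Example St"}, {"address": "7 Other Rd"}],
]
print(unstable_rows(runs, "address"))
```

Exact matching after lowercasing is deliberately strict; a fuzzier comparison could be swapped in if harmless punctuation differences produce too many flags.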
