This is new
Prior to today, usage was broken out between Sonnet and other models like this.
It's unclear to me whether I tripped a flag somewhere as I approached my weekly limit, or whether this dialog reflects a new Anthropic usage report.
I switched Gas Town over to all Sonnet models a few weeks ago for two reasons. First, to get more usage per week. Second, I've found that for the creative research work I have the polecats (LLM agents in Gas Town parlance) doing, Sonnet works much better than Opus 4.8. The research is for a book detailing the funding of mainstream general relativity by fringe science industrialists in the 1950s. It requires polecats to, for example, see the name Lucia Hobson and immediately jump to the fact that Nikola Tesla was the best man at her father's wedding. Thus far, Opus 4.8 has been a little too stick-in-the-mud to pull this off, but Sonnet 4.6 makes the association for a lower token cost.
Using Codex Instead
I switched over to Codex to run research analysis this morning as I neared my Claude usage limits for the week. Kicking off a job that uses Codex took about one percent of my Claude usage. Unlike most Claude runs, the Codex run is decidedly not kicking off subagents in parallel to do its searches. The analysis of two manifest pages consumed one percent of my weekly Codex usage on the twenty dollar plan. The results are different from the typical Sonnet 5.4 run. Codex and Sonnet tend to look in different places on the internet to do research. (As a reminder, I'm researching travel manifests of trips taken by people featured in the Gladych Files. To look for unknown or unexpected connections, I'm setting LLM agents to research each passenger on the manifest page.) For one passenger's stated address, Codex found a newspaper article, an obituary in fact, indicating the address was a residential apartment building. The obituary was not boring.
Granted, the passenger traveled in 1937, but what an interesting way to learn about the use of a building.
The following pair of manifest pages consumed two percent of my available Codex usage for the week.
Performance Variance
And then, there was this. Two OCR runs on an identical image with very different results.
The completely incorrect address seems to have been caused by the image of the page being slanted.
- I can set up another agent to ensure that manifest page images are horizontal, not slanted.
- I can attempt to modify the prompt to account for slanted lines in one pass.
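The first option could start with a simple skew check before any OCR happens. What follows is a minimal sketch, not what my pipeline does today. It assumes some upstream step has already picked out a handful of (x, y) pixel coordinates along one ruled row line of the manifest page; the function names and the one-degree limit are hypothetical.

```python
import math


def estimate_skew_degrees(points):
    """Fit a least-squares line through (x, y) points taken along one
    ruled row line and return its angle from horizontal, in degrees."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("points are vertically stacked; cannot fit a row line")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    # atan2(sxy, sxx) equals atan(slope) because sxx is positive here.
    return math.degrees(math.atan2(sxy, sxx))


def needs_deskew(points, max_skew_degrees=1.0):
    """True when the page is tilted more than the (assumed) limit.
    The sign of the angle is ignored, so image-style y axes are fine."""
    return abs(estimate_skew_degrees(points)) > max_skew_degrees


# A row line that drops 50 pixels over 1000 pixels of width is tilted
# a bit under three degrees, so it would be routed to the deskew step.
tilted_row = [(0, 0), (500, 25), (1000, 50)]
print(needs_deskew(tilted_row))
```

Pages that trip the check would go to the deskew agent; everything else goes straight to extraction.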
Ongoing Detection of the Issue
This is where it gets interesting. It's cost- and time-prohibitive to check every result. I need to set up a random audit process, similar to the one used by banks, where pages are spot-checked first by another agent and then, if they fail, by a human.
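Here is a rough sketch of how that two-stage spot check might be wired up. The five percent sampling rate, the function names, and the `agent_check` callback (standing in for the second agent's verdict on a page) are all assumptions for illustration.

```python
import random


def pick_audit_sample(page_ids, rate=0.05, seed=None):
    """Randomly choose a fraction of pages to audit (at least one page)."""
    pages = list(page_ids)
    if not pages:
        return []
    rng = random.Random(seed)  # a fixed seed makes the audit reproducible
    k = max(1, round(len(pages) * rate))
    return rng.sample(pages, k)


def run_audit(page_ids, agent_check, rate=0.05, seed=None):
    """Spot-check sampled pages with a second agent.

    agent_check(page_id) is a hypothetical callback returning True when
    the second agent agrees with the original extraction. Pages that
    fail are returned separately so a human can review them.
    """
    sample = pick_audit_sample(page_ids, rate=rate, seed=seed)
    needs_human = [page for page in sample if not agent_check(page)]
    return sample, needs_human


# Example with a stand-in check that fails every odd-numbered page.
sample, needs_human = run_audit(range(1, 101), lambda p: p % 2 == 0, seed=7)
print(len(sample), needs_human)
```

Only the pages the second agent rejects reach the human queue, which keeps manual review time proportional to the failure rate rather than to the number of pages.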
I'm able to do things using agents for historical research that I couldn't accomplish otherwise. I've researched hundreds of people (I'll have an exact count soon) for connections with the book's main characters. As a side benefit, the research is making the historical texture of the book richer. Now that I have the capability, the stakes are high for missing associations that could lead to new parts of the story. To mitigate the risk, I may set up automated tests as well. The first automated test that springs to mind is to look for manifest pages with a low percentage of search results per passenger. Given the level of detail available, if the agent searches based on the correct addresses and timeframes, there are usually several web search hits per page. It's a simple enough process-oriented test (as opposed to agent-oriented).
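A minimal sketch of that first test is below. It reads "low percentage of search results per passenger" as the share of passengers on a page with at least one search hit, which is one possible interpretation; the data shape, the function name, and the fifty percent cutoff are assumptions.

```python
def low_yield_pages(hits_by_page, min_hit_fraction=0.5):
    """Return the page ids whose share of passengers with at least one
    search hit falls below min_hit_fraction.

    hits_by_page maps a page id to a list of hit counts, one entry per
    passenger on that page. A page with no passengers recorded at all
    is flagged too, since that usually means extraction went wrong.
    """
    flagged = []
    for page_id, hits in hits_by_page.items():
        if not hits:
            flagged.append(page_id)
            continue
        fraction = sum(1 for h in hits if h > 0) / len(hits)
        if fraction < min_hit_fraction:
            flagged.append(page_id)
    return flagged


# Hypothetical hit counts for three manifest pages.
example = {
    "page_a": [3, 1, 0, 2],  # 3 of 4 passengers have hits
    "page_b": [0, 0, 1, 0],  # 1 of 4 passengers has hits
    "page_c": [],            # nothing extracted
}
print(low_yield_pages(example))
```

The test never looks at what the agent extracted, only at how the downstream searches fared, which is why it stays cheap to run on every page.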
Bantering about Ideas with GPT-5.4
I talked over the above with my GPT-5.4 medium effort lab assistant, and we came up with the following table of tests that I'll be fleshing out more tomorrow.
Table
Here it is.
| Test | Purpose | Trigger / Input | Metric | Threshold / Flag | Action |
|---|---|---|---|---|---|
| Preflight quality gate | Catch bad scans before extraction | Every page image | Skew, blur, contrast, clipping, row-line detectability | Any quality score below minimum or skew above limit | Auto-deskew, enhance, or route to audit queue |
| Repeatability check | Detect unstable extraction on the same input | Run same page 3–5 times | Field exact-match rate; page disagreement score | Meaningful row or field disagreement across runs | Mark page low-confidence and send to review |
| Perturbation test | Measure robustness to tiny visual changes | Variants of same page: ±1°/±2° deskew, crop, contrast, resize | Output stability across perturbed variants | Key fields change under small perturbations | Flag page type or pipeline step as brittle |
| Raw vs normalized consensus | Reduce trust in one fragile pass | Raw page and deskewed/cleaned page | Agreement rate on key fields | Mismatch on names, addresses, dates, or row alignment | Keep agreed fields; escalate mismatches |
| Row-schema validation | Catch structurally implausible records | Structured output for each row | Type / format validity for fields | Bad address shape, unparsable date, nonnumeric age, broken column mapping | Reject or review failing rows |
| Neighbor-row consistency | Catch line hopping and adjacent-row contamination | Rows within same manifest page | Column alignment and local consistency | Sudden row-to-row field shifts inconsistent with table structure | Review page or rerun with stricter segmentation |
| Search-yield anomaly | Use downstream retrieval as a smoke alarm | Post-extraction search workflow | Hits per passenger; zero-hit fraction; address-backed hit rate | Page underperforms baseline for similar manifests | Flag for audit and compare with image-quality metrics |
| Field-level retrieval check | Estimate confidence for historically important fields | Name/address/year or surname/street/city queries | External snippet agreement with extracted field | Weak or contradictory support for key fields | Mark field low-confidence or review manually |
| Canary benchmark set | Catch regressions after pipeline changes | Known-good manually verified pages | Exact row accuracy; field accuracy; disagreement; downstream yield | Regression versus previous baseline | Block rollout or investigate change |
| Control chart monitoring | Detect slow drift over time | Daily or weekly ops metrics | Audit fail rate, disagreement score, correction rate, percent deskewed | Metric shifts outside normal control band | Investigate provider, prompt, or preprocessing drift |
| Hard-case suite | Track performance on nasty but realistic pages | Slanted, faint, shadowed, broken-line, overexposed pages | Accuracy and stability on adversarial subset | Hard-case performance worsens or fails to improve | Use as targeted robustness benchmark |
| Weighted human audit | Spend review time where risk is highest | Pages with skew, low yield, disagreement, or schema failures | Audit sampling weighted by risk score | Composite risk score above review threshold | Manual check, correction, and root-cause tagging |
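
As one example of how a row in this table might turn into code, here is a rough sketch of the repeatability check. It assumes each extraction run on a page comes back as a flat dictionary of field keys to values; the key format (such as "row3.address"), the function names, and the rule that anything short of full agreement is unstable are hypothetical.

```python
from collections import Counter


def field_agreement(runs):
    """For each field key, return the share of runs that produced the
    most common value. runs is a list of dicts, one per extraction run
    on the same page image. A field missing from a run counts as None."""
    if not runs:
        return {}
    keys = set().union(*runs)  # every field key seen in any run
    agreement = {}
    for key in keys:
        values = [run.get(key) for run in runs]
        top_count = Counter(values).most_common(1)[0][1]
        agreement[key] = top_count / len(runs)
    return agreement


def unstable_fields(runs, min_agreement=1.0):
    """Field keys whose agreement across runs is below min_agreement."""
    scores = field_agreement(runs)
    return sorted(key for key, score in scores.items() if score < min_agreement)


# Three hypothetical runs on the same page; one run misreads the address.
runs = [
    {"row3.name": "Lucia Hobson", "row3.address": "12 Main St"},
    {"row3.name": "Lucia Hobson", "row3.address": "12 Main St"},
    {"row3.name": "Lucia Hobson", "row3.address": "72 Maple Ave"},
]
print(unstable_fields(runs))
```

A page with any unstable key fields would be marked low-confidence and sent to review, as the table describes.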