This is new
Prior to today, usage was broken out between Sonnet and other models like this.
It's unclear to me whether I tripped a flag somewhere as I approached my weekly limit, or whether this dialog reflects a new Anthropic usage report.
I switched Gas Town over to all Sonnet models a few weeks ago for two reasons. First, to get more usage per week. Second, I've found that for the creative research work I have the polecats (LLM agents in Gas Town parlance) doing, Sonnet works much better than Opus 4.8. The research is for a book detailing the funding of mainstream general relativity by fringe science industrialists in the 1950s. It requires polecats to, for example, see the name Lucia Hobson and immediately jump to the fact that Nikola Tesla was the best man at her father's wedding. Thus far, Opus 4.8 has been a little too stick-in-the-mud to pull this off, but Sonnet 4.6 makes the association at a lower token cost.
Using Codex Instead
I switched over to Codex to run research analysis this morning as I neared my Claude usage limits for the week. Kicking off a job that uses Codex took about one percent of my Claude usage. Unlike most Claude runs, the Codex run is decidedly not kicking off subagents in parallel to do its searches. The analysis of two manifest pages consumed one percent of my weekly Codex usage on the twenty dollar plan.
The results are different from the typical Sonnet 5.4 run. Codex and Sonnet tend to look in different places on the internet to do research. (As a reminder, I'm researching travel manifests of trips taken by people featured in the Gladych Files. To look for unknown or unexpected connections, I'm setting LLM agents to research each passenger on the manifest page.) For one passenger's stated address, Codex found a newspaper article, an obituary in fact, indicating the address was a residential apartment building. The obituary was not boring.
Granted, the passenger traveled in 1937, but what an interesting way to learn about the use of a building.
The following pair of manifest pages consumed two percent of my available Codex usage for the week.
Performance Variance
And then, there was this. Two OCR runs on an identical image with very different results.
The completely incorrect address seems to have been caused by the image of the page being slanted.
- I can set up another agent to ensure that manifest page images are horizontal, not slanted.
- I can attempt to modify the prompt to account for slanted lines in one pass.
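For the first option, a slant check could be as simple as fitting a straight line through points along one text line and measuring its angle. The sketch below is only an illustration under stated assumptions: the `(x, y)` points are hypothetical word-box centers that some OCR step would have to supply, and the one-degree tolerance is an arbitrary placeholder, not a measured value.

```python
import math


def estimate_skew_degrees(points):
    """Angle, in degrees, of the least-squares line through (x, y) points.

    `points` is assumed to be word-box centers along ONE text line
    (hypothetical input; the OCR step that produces them is not shown).
    """
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points to fit a baseline")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    if sxx == 0:
        raise ValueError("points share one x value; cannot fit a baseline")
    # slope = sxy / sxx; atan2 turns that slope into an angle.
    return math.degrees(math.atan2(sxy, sxx))


def needs_deskew(points, tolerance_degrees=1.0):
    """True when the fitted baseline tilts more than the (assumed) tolerance."""
    return abs(estimate_skew_degrees(points)) > tolerance_degrees
```

A page whose sampled text lines come back with `needs_deskew(...)` true could be rotated before OCR, or routed to the slant-aware prompt instead.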
Ongoing Detection of the Issue
This is where it gets interesting. It's cost- and time-prohibitive to check every result. I need to set up a random audit process, similar to the one used by banks, where pages are spot-checked first by another agent and then, if they fail, by a human.
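A minimal sketch of that audit loop, assuming pages are identified by some ID and that the agent and human checks can each be called as a function returning pass or fail. The names `pick_audit_sample`, `audit`, `agent_check`, and `human_check`, and the five percent default rate, are all placeholders I made up for illustration.

```python
import random


def pick_audit_sample(page_ids, rate=0.05, seed=None):
    """Return a random subset of page IDs to spot-check (at least one page)."""
    if not 0 < rate <= 1:
        raise ValueError("rate must be in (0, 1]")
    pages = list(page_ids)
    if not pages:
        return []
    rng = random.Random(seed)  # a seed makes an audit run repeatable
    sample_size = max(1, round(len(pages) * rate))
    return rng.sample(pages, sample_size)


def audit(page_ids, agent_check, human_check, rate=0.05, seed=None):
    """Spot-check sampled pages: agent first, human only when the agent fails a page."""
    results = {}
    for page in pick_audit_sample(page_ids, rate, seed):
        if agent_check(page):
            results[page] = "passed"
        elif human_check(page):
            results[page] = "agent_false_alarm"
        else:
            results[page] = "confirmed_bad"
    return results
```

Only pages the second agent rejects ever reach the human, which is what keeps the cost down.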
The stakes are high for failure, so I may set up automated tests as well. The first automated test that springs to mind is to look for manifest pages with a low percentage of search results per passenger. Given this level of detail, if the agent searches based on the correct addresses, there are usually several web search hits per page.
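One way that test might look, as a sketch only. It reads "low percentage of search results per passenger" as the fraction of passengers on a page with at least one search hit, which is my interpretation. The data shape (page ID mapped to per-passenger hit counts) and the 0.5 threshold are assumptions, not anything the agents produce today.

```python
def hit_rate(hits_per_passenger):
    """Fraction of passengers on one page with at least one search hit."""
    if not hits_per_passenger:
        return 0.0
    with_hits = sum(1 for count in hits_per_passenger.values() if count > 0)
    return with_hits / len(hits_per_passenger)


def flag_suspect_pages(pages, min_rate=0.5):
    """Return IDs of pages whose hit rate falls below the (assumed) threshold."""
    return [page_id for page_id, hits in pages.items() if hit_rate(hits) < min_rate]


# Made-up example data: page ID -> {passenger name: number of search hits}
example_pages = {
    "page_a": {"passenger 1": 3, "passenger 2": 2},
    "page_b": {"passenger 3": 0, "passenger 4": 0, "passenger 5": 1},
}
suspects = flag_suspect_pages(example_pages)  # only "page_b" falls below 0.5
```

Flagged pages would then go to the front of the audit queue rather than waiting to be picked at random.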