I'm condensing the steps to move from travel manifest page to human readable findings to sqlite database here.
History Research Contextual Recap
I'm working on a history of physics research project that explores how industrialists interested in fringe physics wound up actually funding mainstream general relativity research. As part of that research, I've been looking at the travel manifests of various industrialists and research scientists from the 1930s to the 1950s. Because there are literally thousands of passengers on their combined voyages, I'm using LLM agents orchestrated through Gas Town to coordinate the research. At present, I am working on a bit of a mystery. Multiple sources state that Hedy Lamarr came to the United States aboard the S.S. Normandy, arriving on September 30th, 1937. That's the same ship that Tom Slick, (one of the industrialists of whom I spoke above), took across the Atlantic. There's only one problem. Hedy isn't on the travel manifest. At least not under her own name.
Benefiting from Human Language vs. Programming
When you take programming classes, a lot of fuss is made about putting a lot of thought into defining the interfaces of your software system. This is a true statement for so many reasons. One is documentation, another is so you can farm work to multiple team members, another is software reuse. And so, when I started to design the page scanning, (OCR), app for this project, the agent I was working with and I thought for quite a while on what would be the perfect set of information to collect for a travel manifest page.
After a few turns at batting this problem around, two things occurred to me. First, I wasn't programming, I was setting up an agentic system. Second, I could not anticipate every kind of information that might come off a travel manifest page. I couldn't even come close. But, you know who could thing of this kind of thing on the fly? You guessed it, the LLM agent that was actually going to read the pages. It would know, hopefully with certainty, exactly what was going to be on the page without me having to think about it. Good thing too, becasue there have been hundreds of pages so far. I created only two constraints. All the information had to be placed in a json structure of the agents choosing and to save tokens, there were to be no indents or carriage returns that weren't required. (In other words, no point wasting thousands of characters making the file readable for the human that was never going to read it.)
I have yet to regret this decision. Here's why.
LLM to Parsing Code to Database
Highlights
- Don't sweat the data structure too much. You're using a world-class natural language tool. Let it do what it's good at.
- When you finally get ready to crunch lots of data through sorts, filters, and searches, do use a database, but be clever about how you get your data into a db file. Your LLM doesn't need to see every byte.
The complete U.S. travel manifest for the ship that carried Tom Slick and Hedy Lamarr, the Normandie , consisted of over 74 pages with details about more than 1890 passengers and crew members. The LLM analysis output consists of 74 .md files containing 2.4 MB of text data.
I wanted to gather these records into a database for analysis. However, if I had an LLM read them to find passenger information, that was was going to consume most of my weekly token budget. Instead I asked the LLM to write a set of python scripts to extract passenger data and deposit that data in a sqlite database.
You might think this is where my loosey-goosey data acquisition methodology bit me in the butt. You'd be wrong. On the nNormandie, there were three types of manifest pages. There were one page versions for passengers from the United States. There were two pages versions for passengers who were not from the United States. The second page held information about where they had come from, an address of someone they knew there, the amount of money in their pocket, and the address of the person they were going to stay with once they reacedthe United States. Then, there were the crew pages which were single page affairs again. (Apparently, one of the easiest ways to get into a country is by the a member of a ship's crew. At least in 1937. NO visa required. They did keep a form that said you hadn't returned to the ship if you didn't, but that was it.)
I asked the LLM agent, (Claude Sonnet 4.6 running as the mayor of Gastown), to look at one finding shee for each of the those kinds of pages and spitball a table schema for a database. I followed the five turn rule, batting ideas around until the tables felt good enough. Then, I asked the agent to grep the first column of the tables on each one of the 74 pages and create a json file that mapped every single column name to a name that existed in the schema. We didn't miss any information, but if we had, I could have asked the agent to add it after the fact.
Finally, I asked the agent to write a python script that used that map to dump all the tabular data about passengers into our sqlite db schema.
The whole thing worked like a champ and used less than 5% of the tokens available in my five hour window. (I was using the $100/month Claude subscription.)
And that was the that! The natural language capabilities of the LLM agent had done exactly what they were supposed to do. Make my life easier by automating tasks without me having to do copious paperwork or think overly much.
Analyzing the data with datasette-agent
Highlights
- datasette is very useful for analyzing large collections of data without getting an LLM involved. It has a nice web interface, you can add plugins of your own, or leverage an entire open source library of plugins that include things like charts and mapping. It was created by the same person who co-create django.
- Don't like messing with SQL? No problem. datasette-agent is a plugin, again created by Simon, that hooks back into an LLM API and generates the SQL for you so you can ask natural language questions. On top of everything else, at the end of a session, you can save your entire transcript as an .md file.
- Watch your transcript size. Remember that every time you ask an additional questin, you're sending all the tokens back again. It's a database. .Can you save intermediate results into tables that you can then pass to new LLM chat sessions without the previous context? If so, do that. Also, only use as sophisticated a model as you need.
Comments
Post a Comment
Please leave your comments on this topic: