Skip to main content

LLM Lab Book 2026-06-30: Using LLMs with Datasette-agent and Database Prep vs Token Usage on Claude

 I'm condensing the steps to move from travel manifest page to human readable findings to sqlite database here.

History Research Contextual Recap

I'm working on a history of physics research project that explores how industrialists interested in fringe physics wound up actually funding mainstream general relativity research. As part of that research, I've been looking at the travel manifests of various industrialists and research scientists from the 1930s to the 1950s. Because there are literally thousands of passengers on their combined voyages, I'm using LLM agents orchestrated through Gas Town to coordinate the research. At present, I am working on a bit of a mystery. Multiple sources state that Hedy Lamarr came to the United States aboard the S.S. Normandy, arriving on September 30th, 1937. That's the same ship that Tom Slick, (one of the industrialists of whom I spoke above), took across the Atlantic. There's only one problem. Hedy isn't on the travel manifest. At least not under her own name.


Benefiting from Human Language vs. Programming

When you take programming classes, a lot of fuss is made about putting a lot of thought into  defining the interfaces of your software system. This is a true statement for so many reasons. One is documentation, another is so you can farm work to multiple team members, another is software reuse. And so, when I started to design the page scanning, (OCR), app for this project, the agent I was working with and I thought for quite a while on what would be the perfect set of information to collect for a travel manifest page. 

After a few turns at batting this problem around, two things occurred to me. First, I wasn't programming, I was setting up an agentic system. Second, I could not anticipate every kind of information that might come off a travel manifest page. I couldn't even come close. But, you know who could thing of this kind of thing on the fly? You guessed it, the LLM agent that was actually going to read the pages. It would know, hopefully with certainty, exactly what was going to be on the page without me having to think about it. Good thing too, becasue there have been hundreds of pages so far. I created only two constraints. All the information had to be placed in a json structure of the agents choosing and to save tokens, there were to be no indents or carriage returns that weren't required. (In other words, no point wasting thousands of characters making the file readable for the human that was never going to read it.)

I have yet to regret this decision. Here's why.

LLM to Parsing Code to Database

Highlights

  • Don't sweat the data structure too much. You're using a world-class natural language tool. Let it do what it's good at.
  • When you finally get ready to crunch lots of data through sorts, filters, and searches, do use a database, but be clever about how you get your data into a db file. Your LLM doesn't need to see every byte.

The complete U.S. travel manifest for the ship that carried Tom Slick and Hedy Lamarr, the Normandie , consisted of over 74 pages with details about more than 1890 passengers and crew members. The LLM analysis output consists of 74 .md files containing 2.4 MB of text data.

I wanted to gather these records into a database for analysis. However, if I had an  LLM read them to find passenger information, that was was going to consume most of my weekly token budget. Instead I asked the LLM to  write a set of python scripts to extract passenger data and deposit that data in a sqlite database.

You might think this is where my loosey-goosey data acquisition methodology bit me in the butt. You'd be wrong. On the nNormandie, there were three types of manifest pages. There were one page versions for passengers from the United States. There were two pages versions for passengers who were not from the United States. The second page held information about where they had come from, an address of someone they knew there, the amount of money in their pocket, and the address of the person they were going to stay with once they reacedthe United States. Then, there were the crew pages which were single page affairs again. (Apparently, one of the easiest ways to get into a country is by the a member of a ship's crew. At least in 1937. NO visa required. They did keep a form that said you hadn't returned to the ship if you didn't, but that was it.)

I asked the LLM agent, (Claude Sonnet 4.6 running as the mayor of Gastown), to look at one finding shee for each of the those kinds of pages and spitball a table schema for a database. I followed the five turn rule, batting ideas around until the tables felt good enough. Then, I asked the agent to grep the first column of the tables on each one of the 74 pages and create a json file that mapped every single column name to a name that existed in the schema. We didn't miss any information, but if we had, I could have asked the agent to add it after the fact.

Finally, I asked the agent to write a python script that used that map to dump all the tabular data about passengers into our sqlite db schema.

The whole thing worked like a champ and used less than 5% of the tokens available in my five hour window. (I was using the $100/month Claude subscription.)

And that was the that! The natural language capabilities of the LLM agent had done exactly what they were supposed to do. Make my life easier by automating tasks without me having to do copious paperwork or think overly much.


Analyzing the data with datasette-agent

Highlights

  • datasette is very useful for analyzing large collections of data without getting an LLM involved. It has a nice web interface, you can add plugins of your own, or leverage an entire open source library of plugins that include things like charts and mapping. It was created by the same person who co-create django.
  • Don't like messing with SQL? No problem. datasette-agent is a plugin, again created by Simon, that hooks back into an LLM API and generates the SQL for you so you can ask natural language questions. On top of everything else, at the end of a session, you can save your entire transcript as an .md file.
  • Watch your transcript size. Remember that every time you ask an additional questin, you're sending all the tokens back again. It's a database. .Can you save intermediate results into tables that you can then pass to new LLM chat sessions without the previous context? If so, do that. Also, only use as sophisticated a model as you need.
Once  I had my db file in hand, it was easy enough to pull it into datasette where I could work on the data using SQL for the most part instead of LLM tokens.  You can see more about that in the previous, and in all likelihood, the next lab book entries.


Comments

Popular posts from this blog

Cool Math Tricks: Deriving the Divergence, (Del or Nabla) into New (Cylindrical) Coordinate Systems

Now available as a Kindle ebook for 99 cents ! Get a spiffy ebook, and fund more physics The following is a pretty lengthy procedure, but converting the divergence, (nabla, del) operator between coordinate systems comes up pretty often. While there are tables for converting between common coordinate systems , there seem to be fewer explanations of the procedure for deriving the conversion, so here goes! What do we actually want? To convert the Cartesian nabla to the nabla for another coordinate system, say… cylindrical coordinates. What we’ll need: 1. The Cartesian Nabla: 2. A set of equations relating the Cartesian coordinates to cylindrical coordinates: 3. A set of equations relating the Cartesian basis vectors to the basis vectors of the new coordinate system: How to do it: Use the chain rule for differentiation to convert the derivatives with respect to the Cartesian variables to derivatives with respect to the cylindrical variables. The chain ...

The Alcubierre Warp Drive Tophat Function and Open Science with Sage

I transferred yesterday's Mathematica file with the Alcubierre warp drive[2] line element and space curvature calculations to the  +Sage Mathematical Software System  today, (the files been  added to the public repository [3]).  If you haven't used Sage before, it's a Python based software package that's similar in functionality to Mathematica.  Oh, and it' free.  I also worked a little more on understanding the theory, but frankly, I made far more progress with the software than the theory.  What follows will be a little more of the Alcubierre theory, plus, a cool Sage interactive demo of one of the Alcubierre functions[1], as well as a bit about my first experience with using Sage. Theory The theory is fun, but it's moving slowly.  Here's the chalk board from this morning's discussion Alcubierre setup the derivation using something called the 3+1 formalism which means we consider space to be flat, (in this case), slices that are labelled ...

The Valentine's Day Magnetic Monopole

There's an assymetry to the form of the two Maxwell's equations shown in picture 1.  While the divergence of the electric field is proportional to the electric charge density at a given point, the divergence of the magnetic field is equal to zero.  This is typically explained in the following way.  While we know that electrons, the fundamental electric charge carriers exist, evidence seems to indicate that magnetic monopoles, the particles that would carry magnetic 'charge', either don't exist, or, the energies required to create them are so high that they are exceedingly rare.  That doesn't stop us from looking for them though! Keeping with the theme of Fairbank[1] and his academic progeny over the semester break, today's post is about the discovery of a magnetic monopole candidate event by one of the Fairbank's graduate students, Blas Cabrera[2].  Cabrera was utilizing a loop type of magnetic monopole detector.  Its operation is in...