I started my lab book entries when I was a physics graduate student. It's kind of amusing and kind of cool how far I've come. I have the equivalent of a grad student (aka Claude Opus 4.7) working for me now. I spent some time over the weekend setting up an OCR framework for a book research project of mine. I've been coming up to speed on evals, so I decided to run one to determine which model was the most accurate and cost-effective for doing OCR on travel manifest pages. I stepped the eval along by hand rather than automating it, and talked the results through with Opus as I went. First, it turns out that Opus at low effort is both the most accurate and the most cost-effective choice! That was a surprise. The result has to do with Opus's ability to look at higher-resolution images, which means it needs to think less for OCR than Sonnet does. Second, at the end of the eval, as I was preparing to write up my results, it occurred to me that I could ask my grad student to do it instead. Here's...
I took time to play with a new Dolt-enabled example app called Quorum last night. Quorum sets 13 LLM agents, each with a different defined persona, loose on a user's question. The agents come up with solutions to the question and then discuss their individual solutions with each other to arrive at a consensus. There's much more detail in the blog post that accompanies the app. Quorum is cool. It is not, however, what I wanted to talk about here. Instead, I'm going to focus on the blog post for the app. In short, I'm very excited to see ideas that I've used to manage verification processes for years get codified into tools for LLM agents. Here's one of the important parts: "I can shut down the app, lose the server, or disappear entirely — and the deliberation history remains, publicly accessible and cryptographically verified." Imagine how an engineer could work back through their debug hypothesis tree with that sort of infrastructure! As the article'...