
Posts

LLM Agent Research Protocol for Avoiding Stigmergy

 I'm working through a methodology for studying the behavior of teams of agents by observing them on real-world tasks. As usual with LLMs, the concept of repeatable results is squishy, especially compared to deterministic, non-LLM computing. My finding last week was that LLM agents, especially Claude (per Google's research), can exhibit stigmergic learning and behavior. (Stigmergy is a fancy word for how insects, like ants, 'learn' where important locations are from other insects.) In short, agents given the exact same instructions (prompts) can, and often will, exhibit different behaviors if they can see the results of other agents' work. If you want to study the variance in the behavior of an LLM agent over multiple runs, this stigmergic behavior has to be accounted for. Otherwise, we're not measuring the behavior of a single LLM agent given a set of inputs and prompts. With stigmergic behavior, if we're not careful, we're observing the behavior of a community of ...
Recent posts

fable-5 down for now per US Government Directive

 It was fun getting to use Anthropic's Fable-5 for a few days. Hopefully the chance will come up again. For the moment, the US government has denied access to non-US citizens.

LLM Evals Lab Book: The Importance of Statistics and Also Stigmergy

 Recap: During an analysis of a travel manifest, two agents (referred to as polecats in Gastown terminology) were accidentally handed the same manifest page as input. The agents produced different results. One agent found an association between Lucia Hobson and Nikola Tesla, a very valuable association for the research project. The other agent did not. A set of eval experiments ensued to determine how often polecats missed the association. The initial answer was that they missed it quite frequently, with only 3 out of 16 agents making the association. Models Used: In the following, all agents are using Sonnet 4.6. Orchestration is handled with Gastown. New Findings: On the fourth batch of five test case runs, four polecats made the Tesla association. The chances of this happening randomly were less than 3% in the absence of any other process changes. Here's the Fisher's test run by Gemini. Fisher's Exact Test (Recommended): This compares your two distinct groups (the past 16...

Can Agents Think Outside the Box?

 With all the work that's been put into making agents "correct" by construction, I gotta say, sometimes I need an LLM agent to take a chance at just being wrong. I'm working on a book project called The Gladych Files. While the book is narrative nonfiction about the history of general relativity research, it explores the liminal space inhabited by very rich fringe scientist speculators of the 1950s who funded mainstream general relativity advances (more or less by accident). In those spaces, you'll find Tesla, the architect of the FBI building, Timothy Leary's LSD explorations, and many, many other things, institutions, and people. I've accumulated hundreds of pages of historical documents from various archives, and I'm using orchestrated agentic AI (in the form of Gastown) to review those documents. So far, the analysis has gone well, but last week I saw something that made me look up. I'd accidentally input the same archive page twice, so i...

Gladych Files Lab Book: Document OCR vs LLM Model vs Cost or Opus is Cheaper than Sonnet for OCR!

I started my lab book entries when I was a physics graduate student. It's kind of amusing and kind of cool how far I've come. I have the equivalent of a grad student (aka Claude Opus 4.7) working for me now. I spent some time over the weekend setting up an OCR framework for a book research project of mine. I've been coming up to speed on evals, so I decided to run one to determine which model was the most accurate and cost-effective for doing OCR on travel manifest pages. I stepped the eval along rather than automating it and talked the results through with Opus as I went. First, it turns out that Opus at low effort is the most accurate and the most cost-effective choice! That was a surprise. The result has to do with Opus' ability to look at higher-res images, which means it needs to think less for OCR than Sonnet does. Second, at the end of the eval, as I was preparing to write up my results, it occurred to me that I could ask my grad student to do it instead. Here's...

Working with Process Revision Control

 I took time to play with a new Dolt-enabled app example called Quorum last night. Quorum sets 13 LLM agents with different defined personas loose on a user's question. The agents come up with solutions to the question and then discuss their individual solutions with each other to arrive at a consensus. There's much more detail in this blog post that accompanies the app. Quorum is cool. It is not, however, what I wanted to talk about here. Instead, I'm going to focus on the blog post for the app. In short, I'm very excited to see ideas that I've used to manage verification processes for years get codified into tools for LLM agents. Here's one of the important parts: "I can shut down the app, lose the server, or disappear entirely — and the deliberation history remains, publicly accessible and cryptographically verified." Imagine what an engineer can do to work back through their debug hypothesis tree with that sort of infrastructure! As the article'...

Working Through McConnell's Tensor Book

 I'm working through McConnell's tensor textbook, the one I used in general relativity class at New Mexico State. I'm probably not going to be looking at spacetime a whole lot this time around, as I'm more interested in machine learning model implementations. The first video in the series is shown below. Each video has a link to the next part in the series. You can find the textbook on the Internet Archive.