
Posts

Showing posts with the label agentic orchestration

LLM Evals Lab Book: The Importance of Statistics and Also Stigmergy

Recap: During an analysis of a travel manifest, two agents (called polecats in Gastown terminology) were accidentally handed the same manifest page as input. The agents produced different results. One agent found an association between Lucia Hobson and Nikola Tesla, a very valuable association for the research project. The other agent did not. A set of eval experiments followed to determine how often polecats missed the association. The initial answer was that they missed it quite frequently: only 3 out of 16 agents made the association.

Models Used: In what follows, all agents use Sonnet 4.6. Orchestration is handled with Gastown.

New Findings: On the fourth batch of five test case runs, four polecats made the Tesla association. If nothing else in the process had changed, the chance of seeing a result this strong at random was less than 3%. Here's the Fisher's test run by Gemini.

Fisher's Exact Test (Recommended): This compares your two distinct groups (the past 16...
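
For readers who want to check the "less than 3%" figure themselves, here is a minimal sketch of a one-sided Fisher's exact test using only the Python standard library. It is not the Gemini run from the post. The counts (3 hits in 16 earlier runs, 4 hits in the new batch of 5) come from the text above; the choice of a one-sided test and the function names are my own assumptions for illustration.

```python
from math import comb


def hypergeom_pmf(k, total, total_hits, batch_size):
    """Probability of exactly k hits in the batch, with all margins held fixed."""
    return (
        comb(total_hits, k)
        * comb(total - total_hits, batch_size - k)
        / comb(total, batch_size)
    )


def fisher_one_sided(old_hits, old_misses, new_hits, new_misses):
    """One-sided Fisher's exact test (assumed direction: new batch does better).

    Returns the probability of seeing at least `new_hits` hits in the new
    batch if hits were spread at random across all runs.
    """
    total = old_hits + old_misses + new_hits + new_misses
    total_hits = old_hits + new_hits
    batch_size = new_hits + new_misses
    upper = min(total_hits, batch_size)
    return sum(
        hypergeom_pmf(k, total, total_hits, batch_size)
        for k in range(new_hits, upper + 1)
    )


# Counts taken from the post: 3 of 16 earlier polecats made the association,
# then 4 of 5 in the new batch.
p_value = fisher_one_sided(old_hits=3, old_misses=13, new_hits=4, new_misses=1)
print(f"one-sided p-value: {p_value:.4f}")
```

With these counts the p-value works out to 511/20349, roughly 0.025, which agrees with the "less than 3%" claim. If SciPy is available, `scipy.stats.fisher_exact` is the usual way to run the same test.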

Working with Process Revision Control

 Last night I took time to play with a new Dolt-enabled app example called Quorum. Quorum sets 13 LLM agents, each with a different defined persona, loose on a user's question. The agents come up with solutions to the question and then discuss their individual solutions with each other to arrive at a consensus. There's much more detail in the blog post that accompanies the app. Quorum is cool. It is not, however, what I want to talk about here. Instead, I'm going to focus on the blog post for the app. In short, I'm very excited to see ideas that I've used for years to manage verification processes get codified into tools for LLM agents. Here's one of the important parts: "I can shut down the app, lose the server, or disappear entirely — and the deliberation history remains, publicly accessible and cryptographically verified." Imagine what an engineer could do to work back through their debug hypothesis tree with that sort of infrastructure! As the article'...