With all the work that's been put into making agents "correct" by construction, I gotta say, sometimes I need an LLM agent to take a chance at just being wrong.
I'm working on a book project called The Gladych Files. While the book is narrative nonfiction about the history of general relativity research, it explores the liminal space inhabited by very rich fringe scientist speculators of the 1950s who funded mainstream general relativity advances (more or less by accident). In those spaces, you'll find Tesla, the architect of the FBI building, Timothy Leary's LSD explorations, and many, many other things, institutions, and people.
I've accumulated hundreds of pages of historical documents from various archives, and I'm using orchestrated agentic AI (in the form of Gastown) to review those documents. So far, the analysis has gone well, but last week I saw something that made me look up. I'd accidentally input the same archive page twice, so it was analyzed by two different agents. One immediately found a connection between one of the people on the page and Nikola Tesla. The other agent did not.
Intrigued, I set up an eval today with five more agents analyzing the same page. I wanted to measure the variance of repeat agent runs. I figured most of the agents would find Tesla, and I was looking forward to studying the subtleties of how the few that missed Tesla had done that, but nope!
Not one of them found the Tesla connection. The miss wasn't the anomaly, as I'd hoped. The (absolutely correct, by the way) Tesla hit was.
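For anyone who wants to put a number on that variance, here is a minimal sketch of how the tally could be scored. It assumes each run is reduced to a single yes/no on whether the Tesla connection was reported, and it uses a Wilson score interval, which is my choice for small samples and not something Gastown provides. The function and variable names are hypothetical.

```python
import math


def wilson_interval(hits: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if runs == 0:
        raise ValueError("need at least one run")
    p = hits / runs
    denom = 1 + z**2 / runs
    center = (p + z**2 / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# One entry per run on the same page: True if the agent reported the Tesla link.
# Two accidental runs (one hit, one miss) plus the five-run eval (no hits).
found_tesla = [True, False, False, False, False, False, False]

hits = sum(found_tesla)
runs = len(found_tesla)
low, high = wilson_interval(hits, runs)
print(f"hit rate {hits}/{runs} = {hits / runs:.2f}, ~95% interval {low:.2f} to {high:.2f}")
```

With only seven runs the interval is very wide (roughly 3% to 51%), so the honest reading is "the hit is rare, but I can't yet say how rare." More repeat runs on the same page would tighten it.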
And now I'm off to construct more experiments to see if I can lean on the agents just enough that they'll be more reliable at finding connections, without leaning on them so hard that they simply make things up. My experience so far has been that I have a significant amount of room in the envelope before agents (aka polecats in Gastown) start to make anything up. The biggest fault I've seen so far in this variance experiment was that one polecat flatly claimed that the two people on the page whose familial relations reveal the Tesla association were in fact simply not related. (That polecat was wrong.)
I'll experiment with turning the temperature of the polecats up first. Perhaps that will make them more creative. The second experiment will be to cut the number of passengers each polecat has to research in half, from 30 to 15. Perhaps a lighter context window will free up space for more productive thinking. Finally, if I need to, I'll try different variants of the polecats' prompt. As it is, the tone of the project in the prompt is defined as:
"Tone: This is research for a fun, rollicking nonfiction book. The webs inside it are huge — think spy novels that happen to be true. Be expressive, imaginative, and speculative where the evidence invites it. Follow threads that feel alive. If a connection makes the hair on your neck stand up, say so and say why. The context files in the repo top directory show you the kind of story we are building — read them and catch the vibe."
I'm not sure how much further I can turn that particular knob.
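The three experiments above can be written down as a baseline plus one change at a time, which keeps it clear which knob moved the result. This is only a sketch of the bookkeeping: the temperature values, the variant label, and the placeholder passenger names are all assumptions of mine, and nothing here calls Gastown or any model API.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Trial:
    temperature: float
    batch_size: int       # passengers handed to each polecat
    prompt_variant: str   # label for which prompt text is used


def batches(items: list[str], size: int) -> list[list[str]]:
    """Split the work list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


# Baseline mirrors the current setup; 0.7 is a hypothetical starting temperature.
baseline = Trial(temperature=0.7, batch_size=30, prompt_variant="current")

# Change one knob per trial so any difference can be attributed to it.
trials = [
    baseline,
    replace(baseline, temperature=1.0),            # experiment 1: hotter sampling
    replace(baseline, batch_size=15),              # experiment 2: half the passengers
    replace(baseline, prompt_variant="variant_a"),  # experiment 3: reworded prompt
]

# Placeholder names standing in for the 30 passengers on a page.
passengers = [f"passenger_{i:02d}" for i in range(1, 31)]

for trial in trials:
    work = batches(passengers, trial.batch_size)
    print(trial, "->", len(work), "polecat run(s) per page")
```

Each trial would then be repeated several times on the same page, since a single run per setting can't distinguish a real improvement from the run-to-run variance I already saw.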
One additional note: I've had to back away from using frontier models like GPT-5.4 and Opus-4.8 because they tend to shut down further searches rather than thinking creatively about research tasks in these particular contexts.
Orchestrator: Gastown
Model: Sonnet 5.4