I'm working through a methodology to study the behavior of teams of agents via observation of real-world tasks. As usual with LLMs, the concept of repeatable results is squishy, especially as compared to non-LLM deterministic computing.
My finding last week was that LLM agents, especially Claude (per Google's research), can exhibit stigmergic learning and behavior. (Stigmergic is a fancy word for how insects, like ants, 'learn' where important locations are from other insects.) In short, agents given the exact same instructions (prompts) can, and often will, exhibit different behaviors if they can see the results of other agents' work.
If you want to study the variance in the behavior of an LLM agent over multiple runs, this stigmergic behavior has to be accounted for. Otherwise, we're not measuring the behavior of a single LLM agent given a set of inputs and prompts; if we're not careful, we're observing the behavior of a community of LLM agents. (See my previous post about finding Tesla.)
To avoid this I've slowly but surely, misstep after misstep, come up with a research protocol. Here it is.
Anti-Stigmergy Research Protocol
Purpose of Protocol
Protocol Inputs
- Prompt(s)
- Input context files.
Protocol
When a new behavior, either a failure or an interesting success, is observed in an LLM agent system, do the following.
1. Use your revision control system to create a named branch at the commit before any results were created. For example:
git switch -c test5_branch 9cc8350
Be sure to push the branch to origin:
git push -u origin test5_branch
1.b. Create a base branch that won't be modified so you have a simple way to create more identical branches for experiments.
git switch -c test5_branch_impr_base 9cc8350
git push -u origin test5_branch_impr_base
Then, for each new experiment iteration, create a working branch from the base:
git switch -c test5_branch_impr2 test5_branch_impr_base
git push -u origin test5_branch_impr2
1.c. If you have the automation or the bandwidth, simply create a named branch prior to starting every agentic task.
2. Document your experiment in a separate, parallel repo that is never mentioned in the production repository. It is important that there is no way to find the experiment details from the production repo. If the details are found by agents participating in the experiment, they will change the outcome.
3. Modify agentic prompts to specify that the agents should only work on the experimental branch and also only output their results to that branch.
3.a. For an extra layer of insulation, you can specify that each agent should output to its own independent branch, which will be neither destroyed nor merged. This adds a result-harvesting step when results are analyzed post-experiment, but no agent has any chance of seeing another agent's process.
4. Write up the results of all experiments immediately after the experiment is complete. (This isn't for isolation, it's to ensure results aren't lost and effort isn't wasted.)
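The branch steps above can be scripted. What follows is a minimal sketch in Python, not the tooling used for these experiments. It only builds the git command lines for the frozen base branch (step 1.b) and one independent branch per agent (step 3.a), and leaves running them to the caller. The function name `branch_plan` and the `_base` / `_agent_` branch suffixes are assumptions made for illustration; the git subcommands are the same `git switch -c` and `git push -u` shown in the protocol.

```python
from typing import List


def branch_plan(base_commit: str, experiment: str, agents: List[str]) -> List[List[str]]:
    """Build git commands for one frozen base branch plus one branch per agent.

    Nothing is executed here. Pass each command to subprocess.run(cmd, check=True)
    from inside the repository when you are ready.
    """
    base = f"{experiment}_base"  # hypothetical naming scheme
    commands = [
        # Step 1.b: a base branch that is never modified.
        ["git", "switch", "-c", base, base_commit],
        ["git", "push", "-u", "origin", base],
    ]
    for agent in agents:
        # Step 3.a: one independent branch per agent, cut from the frozen base.
        branch = f"{experiment}_agent_{agent}"
        commands.append(["git", "switch", "-c", branch, base])
        commands.append(["git", "push", "-u", "origin", branch])
    return commands


if __name__ == "__main__":
    for cmd in branch_plan("9cc8350", "test5_branch", ["rust", "chrome"]):
        print(" ".join(cmd))
```

Printing the plan before running it gives a record of exactly which branches an experiment created, which can be pasted into the write-up from step 4.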
Summary
Today's Results
Gas Town / bg_trav — Lab Book
Olson AEC Variance Study — 2026-06-16
This started because I noticed something. Rust, one of our polecats, had analyzed the 1958 Johnston Island HARDTACK observer flight manifest and come back with eight solid identifications and one cipher: OLSON, LOREN K. — passport A1009, 5007 Rockmere Ct, Washington 16 DC, born 2/28/14. The searches returned nothing. Olson went into the findings as an open blank.
That bothered me, because I knew who Olson was. He was the AEC General Counsel. He was right there on the manifest, on a flight three days before HARDTACK TEAK, and the polecat just… didn't find him. So the question became: was this a rust problem, a model problem, or a prompt problem?
The Setup
Before running any experiment I had to deal with contamination. An earlier test directory on main already had my notes on Olson baked into the findings. That would have primed any polecat reading prior work. So I created evals/test5/ on a clean branch — test5_branch — that predated my annotations. The eleven input JSON files are the same I-94 cards rust originally analyzed. Clean slate.
Five identical beads. Same template as the original rust run, path swapped to evals/test5/, outputs labeled test1.md through test5.md to avoid collisions. All five polecats on Sonnet. The convoy: hq-cv-64p62. Polecats dispatched: rust, chrome, nitro, guzzle, shiny.
griggs_1958_7_28_Hawaii_aec/findings/32726_B037932-00539+00549.md — rust's prior output. That file left Olson as an open blank. No AEC identification, just the anomalous passport and the DC address. That's the only prior context the polecats had; my annotation was never in it.
Results
| Polecat | Read prior findings? | Sub-agent? | Olson query framing | Found AEC? |
|---|---|---|---|---|
| Rust (original, 6/15) | No (none existed) | No | "physicist / defense" | ✗ |
| Nitro | No | No | "defense government official / nuclear" | ✗ |
| Guzzle | Yes | Yes | "nuclear 1958" | ✗ |
| Chrome | Yes | Yes | "government official, physicist, military" | ✓ |
| Shiny | Yes | Yes | "government official, physicist, military" | ✓ |
Two out of five found him. Three didn't.
What the Transcripts Said
I went through the transcripts for all five runs. A few things jumped out.
Rust (original): Ran two web searches, both built on the assumption Olson was a physicist or defense figure. Query one: "Loren Olson physicist Washington DC Air Force defense 1958." Query two: "Loren Olson OR Loren K Olson Washington 1958 defense government physicist." Both came back empty. Rust wrote him up as "no trace" and moved on. The escape hatch in the bead — searched: no results — requires only that a search was run. Two searches, same lane, same result.
Nitro: No prior findings to read. Ran direct WebSearch: "Loren Olson Washington DC 1958 nuclear test defense government official." Nuclear framing again. Nothing came back for Olson. Left him as an anomalous passport with no identification.
Guzzle: Read rust's prior findings, saw Olson as one of three open blanks, launched a sub-agent to research all nine passengers. But the sub-agent brief used "nuclear 1958" framing. The sub-agent came back with nothing on Olson and eventually timed out on him. Guzzle's fallback direct search combined him with Rosenberg in a single query: "Max Rosenberg nuclear physics 1958 OR Loren Olson Washington DC nuclear 1958." Still nothing. Final verdict: "web searches returned nothing specific."
Chrome: Read rust's prior findings. Delegated research on the three unknowns to a sub-agent. The brief to the sub-agent included this framing for Olson: "A1009 is a very low passport number suggesting official series. Research Loren K. Olson, Washington DC, 1958." The sub-agent came back with two confirmed hits:
Wikipedia, United States Atomic Energy Commission — Commissioner table lists Loren K. Olson, June 23, 1960 – June 30, 1962.
Kennedy AEC Briefings (DOE Office of Science) — Olson present at February 16, 1961 AEC briefing alongside Glenn Seaborg and other AEC leadership.
Shiny: Same pattern as chrome. Read rust's findings, delegated to sub-agent, brief framing: "Search for 'Loren Olson' government official, physicist, military." Sub-agent returned the Wikipedia Commissioner list and a UNT Monthly Catalog entry: "Commissioner, Atomic Energy Commission, Status of AEC uranium purchase program, American Mining Congress 1960 Mining Show, Las Vegas." Shiny's written conclusion was the strongest of the five:
His passport number is impossibly low. Two years after this flight, he became an AEC Commissioner. … If issued in sequence, it dates to a period when fewer than 2,000 A-series passports existed — conceivably a pre-war issue, possibly from the Manhattan Project era or early AEC. The AEC was founded in 1946; a very senior AEC official who received his government passport in 1946 or 1947 might carry A1009.
What Actually Made the Difference
I went in thinking the problem was (a) too few searches and (b) wrong profession assumed. The data refined that.
Sub-agent use isn't the variable. Guzzle used a sub-agent and still missed Olson. More searches isn't the variable either — nitro ran fresh searches and missed him. The actual differentiator is this: chrome and shiny both included "government official" as an explicit role option in their query framing. Every polecat that stayed in the physicist / nuclear / defense lane missed him, regardless of how many searches they ran or whether they used a sub-agent.
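The elimination argument above can be checked mechanically against the results table. The sketch below is illustrative only: the rows are copied from the table, and the helper names `outcomes_by` and `separates` are made up for this example. It asks, for each column, whether every value of that column maps to a single outcome. Note that nitro's query also contains the words "government official", embedded in a defense/nuclear query, so the sketch compares whole framing strings rather than substrings. With five runs this is a consistency check, not a statistical test.

```python
from collections import defaultdict

# Rows copied from the results table:
# (polecat, read_prior_findings, used_sub_agent, query_framing, found_aec)
RUNS = [
    ("rust (original)", False, False, "physicist / defense", False),
    ("nitro", False, False, "defense government official / nuclear", False),
    ("guzzle", True, True, "nuclear 1958", False),
    ("chrome", True, True, "government official, physicist, military", True),
    ("shiny", True, True, "government official, physicist, military", True),
]


def outcomes_by(column: int) -> dict:
    """Group the found/not-found outcomes by the value in one table column."""
    groups = defaultdict(list)
    for row in RUNS:
        groups[row[column]].append(row[4])
    return dict(groups)


def separates(column: int) -> bool:
    """True when every value of the column maps to a single outcome."""
    return all(len(set(found)) == 1 for found in outcomes_by(column).values())


if __name__ == "__main__":
    for name, column in [("read prior findings", 1), ("sub-agent", 2), ("query framing", 3)]:
        print(f"{name}: separates outcomes = {separates(column)}")
```

Guzzle is the row that breaks both the "read prior findings" and "sub-agent" columns: it did both and still missed Olson.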
Why did chrome and shiny use that framing? Probably because they read the prior rust findings first. Rust's file presented Olson as unresolved — anomalous passport, DC address, no institutional affiliation — rather than as a nuclear scientist. That neutral framing seems to have propagated into how the sub-agent brief was written. Nitro and rust, starting cold, looked at a manifest full of RAND physicists and AEC scientists and made a reasonable but wrong inference about what kind of person Olson was.
The fix isn't "run more searches." It's "don't assume profession." One "government official" arm in the query would have found him. The AEC General Counsel is a lawyer, not a physicist.
Proposed Prompt Changes
Two interventions, two different documents:
Digest §0 (Matching Protocol): Add a sentence under "Read occupations in context": AEC/DoD manifests are not all scientists — the agencies employ lawyers, administrators, and policy staff. When an initial search under a scientific framing fails, rotate to legal/counsel/administrative terms before declaring no results.
Bead template, Research column rule: Tighten the searched: no results escape hatch. Before writing it, require at least three queries with distinct role framings (scientist, lawyer/counsel, administrator/policy). If all three return nothing, write searched: [terms tried] — no results so the framing is auditable.
Neither change has been made yet. This is the record of the experiment that motivates them.
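To make the tightened escape-hatch rule concrete, here is a minimal sketch of how it might be enforced in code. Everything in it is an assumption for illustration: the role vocabularies in `ROLE_FRAMINGS`, the function names, and the query layout are hypothetical and are not part of the bead template. Only the three role categories and the idea of an auditable "searched: [...]" line come from the proposal above.

```python
from typing import Dict, List

# Hypothetical search terms per role; the three categories come from the proposed rule.
ROLE_FRAMINGS: Dict[str, str] = {
    "scientist": "physicist OR scientist OR engineer",
    "lawyer/counsel": "lawyer OR counsel OR attorney",
    "administrator/policy": "government official OR administrator OR policy",
}

MIN_DISTINCT_FRAMINGS = 3


def build_queries(name: str, place: str, year: int) -> Dict[str, str]:
    """Return one search query per role framing for the same person."""
    return {role: f'"{name}" {place} {year} {terms}' for role, terms in ROLE_FRAMINGS.items()}


def no_results_line(framings_tried: List[str]) -> str:
    """Format the auditable escape-hatch line, refusing it if too few framings were tried."""
    distinct = sorted(set(framings_tried))
    if len(distinct) < MIN_DISTINCT_FRAMINGS:
        raise ValueError(
            f"only {len(distinct)} distinct framings tried; {MIN_DISTINCT_FRAMINGS} required"
        )
    # ASCII hyphen used as the separator in this sketch.
    return f"searched: [{'; '.join(distinct)}] - no results"


if __name__ == "__main__":
    queries = build_queries("Loren K. Olson", "Washington DC", 1958)
    for role, query in queries.items():
        print(f"{role}: {query}")
    print(no_results_line(list(queries)))
```

The point of raising an error instead of returning a shorter line is that two searches in the same lane, as in rust's original run, can no longer satisfy the rule.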
Open Questions
- Was Olson actually the General Counsel at the time of the flight (July 1958), or did he not enter AEC leadership until later? Wikipedia has him as Commissioner from June 1960. The escholarship PDF that supposedly confirms General Counsel was too large to fetch — that claim needs verification.
- The A1009 passport hypothesis is compelling but unverified. Does the National Archives have the A-series issuance register? If so, A1009 could be dated precisely.
- Why was Olson on this particular observer flight? HARDTACK TEAK was a high-altitude nuclear test — a weapons physics event. The AEC General Counsel's presence is unusual. The AEC was joint operator of JTF-7 alongside DoD, so senior AEC staff had standing, but counsel specifically suggests legal or policy significance.
Notes on the Eval Infrastructure
Running five polecats on the same manifest simultaneously worked cleanly once the naming collision problem was solved (Eval4 had all five writing to the same filename). The test1.md–test5.md labeling scheme plus branch isolation gave five independent results with no merge conflicts. The refinery processed all five without issue. Duration for the Eval5 convoy: about 22 minutes wall-clock for all five to land.
The contamination problem in Eval4 (author notes in the prior test directory on main) was a useful lesson: if the branch has any prior findings with human annotations, those annotations will propagate into subsequent polecat runs. The clean branch strategy works. Worth keeping as a standard practice for any future variance studies.
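One way to catch this kind of contamination before dispatching a convoy is to scan the checkout for phrases that only a human annotation would contain. The sketch below is a hedged illustration, not part of the existing eval infrastructure: `find_leaks`, the file suffixes, and the example term are all assumptions. Choose terms that give away the answer (for example "General Counsel") and not terms that legitimately appear in the inputs.

```python
from pathlib import Path
from typing import Dict, Iterable, List, Tuple


def find_leaks(
    checkout: Path,
    terms: Iterable[str],
    suffixes: Tuple[str, ...] = (".md", ".json", ".txt"),
) -> Dict[str, List[str]]:
    """Map each file (path relative to the checkout) to the contamination terms it contains."""
    lowered = [t.lower() for t in terms]
    leaks: Dict[str, List[str]] = {}
    for path in sorted(checkout.rglob("*")):
        relative = path.relative_to(checkout)
        # Skip git internals, directories, and file types agents are unlikely to read.
        if ".git" in relative.parts or not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore").lower()
        hits = [t for t in lowered if t in text]
        if hits:
            leaks[str(relative)] = hits
    return leaks


if __name__ == "__main__":
    # Run from the root of the experiment branch checkout.
    for file_name, hits in find_leaks(Path("."), ["General Counsel"]).items():
        print(f"{file_name}: {hits}")
```

An empty result does not prove the branch is clean, since an annotation could be phrased differently, but a non-empty one is a reason to stop and re-cut the branch.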
Addendum: How Chrome and Shiny Got the Same Phrase
They didn't copy from each other — shiny never saw chrome's output. But they didn't arrive independently either.
Both traced back to rust's own written analysis. Rust wrote about Olson: "Zone 16 is upper northwest DC… standard for a senior military or civilian government official" and flagged the Naval Observatory corridor. Shiny read that, extracted "government official / military" directly from rust's text, and added "physicist" from the overall manifest context.
The deeply ironic finding: rust correctly diagnosed Olson in its written analysis — "senior military or civilian government official" — but then searched for him using physicist/defense framing that contradicted its own conclusion. The analysis and the searches were decoupled. Rust wrote the right answer in prose and then didn't use it.
Chrome presumably did the same thing — read rust's description of the zone 16 / Naval Observatory corridor and translated that into the sub-agent brief.
So the actual failure mode is more specific than "wrong profession assumed in search queries." It's that the polecat's own analytical reasoning about a person doesn't feed back into its search strategy. Rust figured out what Olson probably was, wrote it down, and then ran searches based on manifest context instead of its own written inference.
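As a rough illustration of what "feeding the analysis back into the search" could look like, the sketch below derives query role terms from the agent's own written analysis instead of from manifest context. It is a toy under stated assumptions: the `ROLE_TERMS` vocabulary and both function names are hypothetical, and plain substring matching stands in for whatever extraction a real agent prompt would use. The analysis text is the phrase quoted from rust's write-up.

```python
from typing import List

# Hypothetical role vocabulary an agent might scan its own prose for.
ROLE_TERMS = ["government official", "military", "physicist", "lawyer", "administrator"]


def roles_from_analysis(analysis: str, vocabulary: List[str] = ROLE_TERMS) -> List[str]:
    """Return the role terms that the agent's own written analysis mentions."""
    text = analysis.lower()
    return [term for term in vocabulary if term in text]


def queries_from_analysis(name: str, analysis: str) -> List[str]:
    """Build one query per role that the analysis itself proposed."""
    return [f'"{name}" {role}' for role in roles_from_analysis(analysis)]


if __name__ == "__main__":
    rust_analysis = (
        "Zone 16 is upper northwest DC… standard for a senior military "
        "or civilian government official"
    )
    for query in queries_from_analysis("Loren K. Olson", rust_analysis):
        print(query)
```

On rust's own text this yields a "government official" query and a "military" query, and no "physicist" query, which is the opposite of the framing rust actually searched with.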