LLM Evals Lab Book: The Importance of Statistics and Also Stigmergy

 Recap

During an analysis of a travel manifest, two agents (referred to as polecats in Gastown terminology) were accidentally handed the same manifest page as input. The agents produced different results. One agent found an association between Lucia Hobson and Nikola Tesla, a very valuable association for the research project. The other agent did not. A set of eval experiments ensued to determine how often polecats missed the association. The initial answer was that they missed it quite frequently: only 3 out of 16 agents made the association.

Models Used

In the following, all agents are using Sonnet 4.6. Orchestration is handled with Gastown.

New Findings

In the fourth batch of five test-case runs, four polecats made the Tesla association. The chance of this happening randomly, in the absence of any other process changes, was less than 3%.

Fisher's Test from Gemini

Fisher's Exact Test (Recommended)
This compares your two distinct groups (the past 16 tests vs. the new 5 tests) to see if the success rate significantly shifted.
  • The Math: It builds a 2x2 grid comparing Positives (3 vs. 4) and Negatives (13 vs. 1).
  • The Result: The p-value is 0.0251.
  • The Meaning: There is a 2.5% chance of seeing a surge this extreme if your process didn't actually change.
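
The arithmetic behind the quoted p-value can be reproduced without a statistics package. The following is a minimal sketch, not what Gemini ran: it sums hypergeometric probabilities for the 2x2 grid above (3 vs. 4 positives, 13 vs. 1 negatives) using only Python's standard library. The function name and argument order are illustrative choices, not part of any library.

```python
from math import comb

def fisher_exact_one_sided(pos_a, neg_a, pos_b, neg_b):
    """One-sided Fisher p-value: chance that group B shows at least
    pos_b positives when all row and column totals are held fixed."""
    n_b = pos_b + neg_b                    # size of the new group
    total_pos = pos_a + pos_b              # positives across both groups
    total = pos_a + neg_a + pos_b + neg_b  # all runs
    upper = min(n_b, total_pos)            # most positives group B could hold
    tail = sum(
        comb(total_pos, k) * comb(total - total_pos, n_b - k)
        for k in range(pos_b, upper + 1)
    )
    return tail / comb(total, n_b)

# Past 16 tests: 3 positives, 13 negatives. New 5 tests: 4 positives, 1 negative.
p_value = fisher_exact_one_sided(3, 13, 4, 1)
print(round(p_value, 4))  # 0.0251
```

This is the one-sided tail (a surge at least this large in the new batch). For this particular grid, the only outcomes that are no more probable than the observed one are 4 or 5 positives in the new batch, so the usual two-sided calculation gives the same 0.0251.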

In the following, fifth batch, all five polecats made the Tesla association. This led to a search for the causes of the polecats' sudden competency at something they weren't told was an actual goal. I'll now defer to my lab assistant for all the details.

Here's the short version, though. On run 11, a polecat ignored the guardrails in the prompt, read the entire book manuscript rather than the terse summary of the book it was supposed to read, and subsequently made the Tesla association. In run 16, the polecat did not read the full manuscript but made the Tesla association anyway. Then the polecat on run 18 read the findings directory and found polecat 11's output, including the Tesla association. Run 18 was the first run where a polecat read test 11's output. Tests 19 and 20 quickly followed with the same behavior and also found the association. Finally, in the fifth batch of five runs, every polecat read test 16's results (which contained the Tesla association), test 20's results (ditto), or both. The polecats had evolved a behavior that netted Tesla associations. Here's the full write-up.

Polecat Stigmergy and Tesla: Lab Book 2026_06_09 (Final)

June 09, 2026 — final entry in the Tesla Trend series; closes the open question from Pt. 2

This is the third and final lab book on the Tesla Trend in bg_trav/evals/test2. The first entry made two factual errors (corrected in Pt. 2). This entry closes the open question from Pt. 2 — but in doing so, reveals that Pt. 2's framing of its own open question was subtly off. Both framings are presented here alongside the data so the record is complete.

The Open Question from Pt. 2 — Two Readings

Pt. 2 ended with this:

"The most important unresolved question … is whether the polecats in batches 4 and 5 read only test11.md and test16.md (which would be extraordinarily targeted, almost like they knew which file held the lead) or whether they read those files as part of a broader scan of many other findings at the same time."

There are two ways to read that question, and the forensic data gives a different answer to each.

Reading A (the Pt. 2 framing): "Broad scan" vs. "narrow pick" — did each polecat Read through most of the findings directory, or just 1–2 files?

Answer: Narrow picks. Every session ran one ls findings/ to see the full inventory, then opened at most 3 files (median 0, maximum 3). No session swept the whole directory. In this sense, all peer reads were targeted.

Reading B (the user's question): Did the polecats exclusively read test11.md and test16.md, or did they also read other files?

Answer: They read other files too. The polecats opened test10.md, test12.md, test13.md, and test20.md as well. They were not homing in specifically on the Tesla-bearing files; they were picking recent outputs, and those recent outputs happened to carry Tesla. The one polecat in batch 4 that picked a non-Tesla file (chrome→test13.md, 0 Tesla hits) produced 0 Tesla mentions. That's the control case.

Also worth noting: Pt. 2's framing named "test11.md and test16.md" as the key files. For batch 5, this is only half right. Batch 5 polecats never read test11.md directly. They read test16.md and test20.md, and test20.md was actually the dominant vector: per the peer-read table, it was read by four of the five batch-5 polecats (test21 through test24), vs. test16.md, read by three.

The Full Peer-Read Table

Complete per-session forensics for all batches that had peer reads. Tesla-bearing files are test11.md (7), test16.md (6), and test20.md (6). Purple cells mark sessions that opened a non-Tesla-bearing peer file.

| Test | Polecat | Peer files read | Tesla out |
|---|---|---|---|
| test11 ★ | rust | test10.md | 7 |
| test12 | chrome | (none) | 2 |
| test13 | nitro | test10.md | 0 |
| test14 | guzzle | (none) | 0 |
| test15 | shiny | test10.md | 0 |
| test16 ★ | rust | test20.md ★ | 6 |
| test17 | chrome | test13.md | 0 |
| test18 | nitro | test11.md ★, test13.md | 5 |
| test19 | guzzle | test11.md ★, test12.md, test13.md | 3 |
| test20 ★ | shiny | test11.md ★ | 6 |
| test21 | rust | test16.md ★, test20.md ★ | 6 |
| test22 | chrome | test20.md ★ | 7 |
| test23 | nitro | test20.md ★ | 6 |
| test24 | guzzle | test16.md ★, test20.md ★ | 6 |
| test25 | shiny | test16.md ★ | 4 |

★ = Tesla-bearing file (test11: 7, test16: 6, test20: 6). Purple cells = non-Tesla-bearing peer file opened. Polecat row colors are decorative.

Complete Peer-Read Record: All 25 Sessions

Every session, every peer file opened. Batches 1–2 had almost no peer reads; they are included here for completeness.

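| Batch | Test | Polecat | Peer findings/*.md read | Tesla out |
|---|---|---|---|---|
| 1 | test1 | rust | (none) | 0 |
| 1 | test2 | chrome | (none) | 0 |
| 1 | test3 | nitro | (none) | 0 |
| 1 | test4 | guzzle | (none) | 0 |
| 1 | test5 | shiny | (none) | 0 |
| 2 | test6 | rust | (none) | 0 |
| 2 | test7 | chrome | (none) | 0 |
| 2 | test8 | nitro | (none) | 0 |
| 2 | test9 | guzzle | test1.md | 0 |
| 2 | test10 | shiny | (none) | 0 |
| 3 | test11 ★ | rust | test10.md | 7 |
| 3 | test12 | chrome | (none) | 2 |
| 3 | test13 | nitro | test10.md | 0 |
| 3 | test14 | guzzle | (none) | 0 |
| 3 | test15 | shiny | test10.md | 0 |
| 4 | test16 ★ | rust | test20.md ★ | 6 |
| 4 | test17 | chrome | test13.md | 0 |
| 4 | test18 | nitro | test11.md ★, test13.md | 5 |
| 4 | test19 | guzzle | test11.md ★, test12.md †, test13.md | 3 |
| 4 | test20 ★ | shiny | test11.md ★ | 6 |
| 5 | test21 | rust | test16.md ★, test20.md ★ | 6 |
| 5 | test22 | chrome | test20.md ★ | 7 |
| 5 | test23 | nitro | test20.md ★ | 6 |
| 5 | test24 | guzzle | test16.md ★, test20.md ★ | 6 |
| 5 | test25 | shiny | test16.md ★ | 4 |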

★ = primary Tesla-bearing file (test11: 7, test16: 6, test20: 6) — confirmed propagation vectors. † = minor Tesla file (test12: 2 mentions, chrome, independent — not a primary propagation vector). Purple cells/spans = non-Tesla-bearing peer file opened. Polecat row colors are decorative.

What the Table Actually Shows

Reading down the "Peer files read" column, the polecats were not selectively targeting the Tesla-bearing files. Several observations:

  • test10.md was a popular pick in batch 3, opened by rust (test11), nitro (test13), and shiny (test15). test10.md has 0 Tesla mentions. Only rust produced Tesla, because rust also read the forbidden manuscript and dispatched sub-agents. Nitro and shiny read the same test10.md and produced nothing. So picking a peer file is not sufficient — the peer file needs to carry Tesla.
  • chrome@test17 is the cleanest control case. Chrome opened test13.md (0 Tesla hits) and produced 0 Tesla hits. This is the direct evidence that the peer-read mechanism drives the outcome: the polecat that picked a dry file stayed dry.
  • guzzle@test19 opened three files: test11.md (Tesla), test12.md (Tesla), and test13.md (0 Tesla). With two Tesla-bearing sources in its context, guzzle still only produced 3 mentions — the lowest Tesla count among the batch-4 polecats that read test11. This suggests the Tesla material doesn't simply stack linearly with the number of Tesla-bearing peers read.
  • rust@test16 read test20.md, not test11.md. test20 was written by shiny earlier in the same batch. This means within-batch inheritance was also in play — the findings directory doesn't respect batch boundaries once files are committed to main.
  • In batch 5, test11.md was never read directly. All five polecats sourced Tesla from test16 or test20 — both of which had inherited it from test11 one or two steps back. By batch 5, the original source had been superseded by more recent carriers. The stigmergy signal had moved downstream.
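
The observations above can be cross-checked with a small tally. This is a sketch, not the forensics pipeline that produced the table: the rows are transcribed by hand from the peer-read table for batches 3–5, and the dictionary layout and function name are assumptions made for illustration.

```python
# Rows transcribed from the peer-read table (batches 3-5):
# test name -> (peer files read, Tesla mentions in that session's output)
SESSIONS = {
    "test11": (["test10.md"], 7),
    "test12": ([], 2),
    "test13": (["test10.md"], 0),
    "test14": ([], 0),
    "test15": (["test10.md"], 0),
    "test16": (["test20.md"], 6),
    "test17": (["test13.md"], 0),
    "test18": (["test11.md", "test13.md"], 5),
    "test19": (["test11.md", "test12.md", "test13.md"], 3),
    "test20": (["test11.md"], 6),
    "test21": (["test16.md", "test20.md"], 6),
    "test22": (["test20.md"], 7),
    "test23": (["test20.md"], 6),
    "test24": (["test16.md", "test20.md"], 6),
    "test25": (["test16.md"], 4),
}

# The starred (primary Tesla-bearing) files from the table legend.
PRIMARY_CARRIERS = {"test11.md", "test16.md", "test20.md"}

def tally(sessions, carriers):
    """Split sessions by whether they read a carrier file, then count
    how many in each group produced any Tesla mention."""
    counts = {"read_carrier": [0, 0], "no_carrier": [0, 0]}  # [with Tesla, total]
    for peers, tesla_out in sessions.values():
        key = "read_carrier" if carriers.intersection(peers) else "no_carrier"
        counts[key][1] += 1
        if tesla_out > 0:
            counts[key][0] += 1
    return counts

result = tally(SESSIONS, PRIMARY_CARRIERS)
for group, (with_tesla, total) in result.items():
    print(f"{group}: {with_tesla} of {total} sessions produced Tesla")
# read_carrier: 9 of 9 sessions produced Tesla
# no_carrier: 2 of 6 sessions produced Tesla
```

The two Tesla-producing sessions in the no_carrier group are rust@test11 (which read the manuscript and dispatched sub-agents) and chrome@test12 (the independent, minor † case).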

The Mechanism, More Precisely

The polecats apply a "most recent" heuristic: after listing the directory, they open files from the high end of the filename sequence. They are sampling the freshest outputs, not hunting for a specific finding. The Tesla inheritance happened because the freshest outputs in each batch happened to contain Tesla — not because the polecats recognized the Tesla content as valuable and sought it out.
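
As an idealized illustration of that heuristic, here is a short sketch. It is hypothetical: the transcripts do not show the polecats running code like this, the function name pick_recent and the default of two picks are assumptions, and the real sessions did not always take the very highest-numbered files.

```python
import re

def pick_recent(filenames, k=2):
    """Return the k findings files with the highest test number.
    Hypothetical stand-in for the 'most recent' heuristic."""
    def test_number(name):
        match = re.fullmatch(r"test(\d+)\.md", name)
        return int(match.group(1)) if match else -1

    numbered = [name for name in filenames if test_number(name) >= 0]
    # Sort on the numeric suffix; plain string sorting would rank test9.md last.
    return sorted(numbered, key=test_number, reverse=True)[:k]

# Inventory as a batch-5 polecat might see it after ls findings/.
inventory = [f"test{i}.md" for i in range(1, 21)]
picks = pick_recent(inventory)
print(picks)  # ['test20.md', 'test19.md']
```

Nothing in this selection looks at file contents, which is the point: whatever the newest files carry gets read, Tesla or not.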

The core finding, restated more carefully: Peer reads were narrow in count (1–3 files per session) but not exclusive to the Tesla-bearing files. Polecats also opened test10.md and test13.md, neither of which carries Tesla, and test12.md, which carries only two minor mentions. The ones that picked non-Tesla files produced no Tesla. The ones that picked primary Tesla-bearing files reproduced the finding. The "targeted" characterization from the earlier summary described the count of files opened, not the specificity of the targeting. The polecats were not targeting Tesla; they were sampling recent outputs, and recent outputs were contaminated.

Why This Matters for the Stigmergy Model

Stigmergy, classically, is indirect coordination through environmental traces. The ants don't seek out the pheromone trail because they know it leads to food — they follow it because that's the heuristic, and the pheromone happens to be at the food. The polecats here are the same: they follow "read recent outputs" as the heuristic, and the recent outputs happen to carry Tesla. The Tesla signal is a contaminant riding on the recency trail, not a signal the polecats are chasing directly.

This makes the contamination harder to detect and easier to amplify. If the polecats were selectively reading Tesla-bearing files, you could fix it by removing those files or flagging them. But since they're reading whatever is recent, the fix has to be upstream: either prevent the anomalous output from landing on main, or break the peer-read pathway entirely. The signal will follow any high-recency file into the next batch.

Open Questions This Entry Closes

  • Were peer reads broad scans or narrow picks? Narrow (1–3 files max, median 0). Closed.
  • Did polecats read only test11.md and test16.md? No — they also read test10, test12, test13, test20. The reads were recent-biased, not Tesla-targeted. Closed.
  • Did batch 5 polecats inherit Tesla from test11 directly? No — they never read test11.md. They read test16 and test20, which were second-generation carriers. Closed.
  • What was the dominant transmission vector for batch 5? test20.md, read by four of the five batch-5 polecats per the peer-read table, vs. test16.md, read by three. Closed.

References

  1. Companion entry (with errors): crew/hcarter/the-tesla-trend-lab-book-20260609.html
  2. Corrected companion entry: crew/hcarter/the-tesla-trend-revisited-lab-book-20260609-pt2.html
  3. Per-session forensics table (source data for this entry): crew/hcarter/tesla-peer-read-forensics-20260609-pt3.html
  4. GladychFiles_ManifestDigest.md line 71 — Villard entry; the digest's only Tesla mention
  5. Convoy IDs: hq-cv-h0gi5 (batch 3: test11–15), hq-cv-979sw (batch 4: test16–20), hq-cv-p3pnc (batch 5: test21–25)
  6. md5 of all 25 input files: d31375ea1c2b08e7e2bec04de270dee7 — identical inputs confirmed
  7. Richmond Pearson Hobson — Wikipedia (Tesla-as-groomsman, surfaced by rust sub-agent at test11)
  8. A. Vallinder and E. Hughes, “Cultural Evolution of Cooperation among LLM Agents,” arXiv preprint arXiv:2412.10270, Dec. 2024. [Online]. Available: https://arxiv.org/abs/2412.10270. doi: 10.48550/arXiv.2412.10270.
  9. A. Boldini, M. Civitella, and M. Porfiri, “Stigmergy: from mathematical modelling to control,” Royal Society Open Science, vol. 11, no. 9, Art. no. 240845, Sep. 2024. [Online]. Available: https://royalsocietypublishing.org/doi/10.1098/rsos.240845. doi: 10.1098/rsos.240845.

Background: The test2 eval pushed 25 copies of the same H–K manifest block under vague filenames (test1.json–test25.json) so polecats could not use filename heuristics. With identical inputs, any variation in outputs is variation in polecat behavior, not in evidence — which made the Tesla trend question forensically tractable.
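
The "identical inputs" check can be sketched as follows. This is an assumption about how such a check might be done, not the command actually used for the md5 in the references; the helper names are illustrative and the demo files are throwaway placeholders, not the real manifest block.

```python
import hashlib
import tempfile
from pathlib import Path

def md5_of(path):
    """Return the hex MD5 digest of a file's bytes."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()

def inputs_identical(paths):
    """True when every file hashes to the same digest."""
    return len({md5_of(p) for p in paths}) == 1

# Demo on temporary files with placeholder contents.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i in range(1, 4):
        p = Path(tmp) / f"test{i}.json"
        p.write_text('{"placeholder": "same bytes in every copy"}')
        paths.append(p)
    same = inputs_identical(paths)

    # Altering one copy should break the identity check.
    paths[0].write_text('{"placeholder": "one copy altered"}')
    altered = inputs_identical(paths)

print(same, altered)  # True False
```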

Back to the author

Here's a more complete set of data on the test runs. Consider this the teaser to a future post on polecats violating guardrails. test11 wasn't the first polecat to violate the guardrail stating that polecats should not read the full manuscript, but it did read from the manuscript, which mentions Tesla far more often than the summary of the book, and subsequently made the Tesla association.

| Test | Polecat | Write time (UTC) | Peer findings/*.md Read | Manuscript? | Saw peer inventory? | Tesla in this session's Write content (parens: # Agent/WebSearch sub-agent dispatches in session) | Transcript |
|---|---|---|---|---|---|---|---|
| test1 | rust | 2026-06-08 15:04:37 | (none) | no | no | 0 (0 Agent calls) | 6e601d25… |
| test2 | chrome | 2026-06-08 15:02:47 | (none) | YES | no | 0 (5 Agent calls) | 65ee60ff… |
| test3 | nitro | 2026-06-08 15:03:41 | (none) | no | no | 0 (5 Agent calls) | 95a3a45a… |
| test4 | guzzle | 2026-06-08 14:57:30 | (none) | YES | yes (1×) | 0 (0 Agent calls) | 4b2bd5df… |
| test5 | shiny | 2026-06-08 14:56:17 | (none) | YES | no | 0 (0 Agent calls) | 2de365ac… |
| test6 | rust | 2026-06-09 06:57:20 | (none) | no | yes (2×) | 0 (0 Agent calls) | 56d99dcd… |
| test7 | chrome | 2026-06-09 07:00:30 | (none) | no | yes (2×) | 0 (0 Agent calls) | 457be47f… |
| test8 | nitro | 2026-06-09 07:04:32 | (none) | no | yes (2×) | 0 (0 Agent calls) | c3a6e15c… |
| test9 | guzzle | 2026-06-09 07:01:37 | test1.md | no | yes (2×) | 0 (0 Agent calls) | 6c4dcd0e… |
| test10 | shiny | 2026-06-09 07:04:52 | (none) | no | yes (2×) | 0 (0 Agent calls) | 9816bafd… |
| test11 | rust | 2026-06-09 08:28:19 | test10.md | YES | yes (1×) | 8 (13 Agent calls) | 8ae6ed9f… |
| test12 | chrome | 2026-06-09 08:20:28 | (none) | no | yes (2×) | 2 (6 Agent calls) | 860c879d… |
| test13 | nitro | 2026-06-09 08:14:36 | test10.md | no | yes (2×) | 0 (0 Agent calls) | 1c796b7a… |
| test14 | guzzle | 2026-06-09 08:15:45 | (none) | YES | yes (1×) | 0 (0 Agent calls) | 9671e797… |
| test15 | shiny | 2026-06-09 08:14:06 | test10.md | no | yes (2×) | 0 (0 Agent calls) | 03fbe467… |
| test16 (a: ORIGINAL) | rust | 2026-06-09 10:35:19 | test13.md | no | yes (2×) | 6 (10 Agent calls) | 2429581b… (canonical / pushed as commit 4ebfa4e via replay) |
| test16 (b: RE-RUN) | rust | 2026-06-09 11:05:05 | test20.md | no | yes (2×) | 10 (7 Agent calls) | 6c45f82f… (re-run output committed as 2f95927; never reached origin/main) |
| test17 | chrome | 2026-06-09 10:05:36 | test13.md | no | yes (3×) | 0 (0 Agent calls) | 373f9cf8… |
| test18 | nitro | 2026-06-09 10:04:58 | test11.md, test13.md | no | yes (1×) | 6 (6 Agent calls) | 3eddaa59… |
| test19 | guzzle | 2026-06-09 10:06:13 | test11.md, test12.md, test13.md | no | yes (2×) | 3 (1 Agent calls) | b374a4ed… |
| test20 | shiny | 2026-06-09 10:11:46 | test11.md | no | yes (1×) | 9 (8 Agent calls) | 41ebab4b… |
| test21 | rust | 2026-06-09 16:08:47 | test16.md, test20.md | no | yes (3×) | 9 (0 Agent calls) | e499f61c… |
| test22 | chrome | 2026-06-09 16:08:15 | test20.md | no | yes (2×) | 9 (0 Agent calls) | cd4d911a… |
| test23 | nitro | 2026-06-09 16:42:06 | test20.md | YES | yes (2×) | 11 (0 Agent calls) | be4fd7e6… |
| test24 | guzzle | 2026-06-09 16:11:18 | test16.md, test20.md | YES | yes (1×) | 8 (6 Agent calls) | 71e195fa… |
| test25 | shiny | 2026-06-09 16:11:26 | test16.md | YES | yes (4×) | 7 (0 Agent calls) | d2bf26dd… |

Note on row 16b: The re-run's ls findings/ at 17:38 UTC returned 19 files: every file row 16a saw, plus test17.md, test18.md, test19.md, and test20.md. Those four files were written by chrome/nitro/guzzle/shiny at 17:04–17:10 UTC, but did not appear in row 16a's clone at its 17:23 UTC ls (the canonical run had pulled origin/main at session start and never re-pulled). The re-run started from a fresh clone, so it inherited the newer state. Consequently the re-run was able to Read test20.md (which contained 9 Tesla mentions), and it did. The re-run's Write content has 10 Tesla mentions vs. the original's 6, consistent with the additional peer ingestion. However, the re-run's output never reached origin/main: the canonical pushed commit (4ebfa4e) was a transcript-replay of the 10:35 ORIGINAL, not the re-run.

Tesla counts are grep -o '\bTesla\b' against the string passed to the polecat's first Write tool_use against findings/testN.md, before any subsequent Edits (no Edit added or removed Tesla tokens in any session). The "Agent calls" parenthetical is the total number of Agent / Task tool_uses in the same session, since those are the WebSearch dispatch channel.
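
For readers who want the same count outside a shell, a rough Python equivalent of that grep is sketched below. The function name and the sample string are made up for illustration; the real counts came from the Write tool_use strings in the transcripts.

```python
import re

def count_tesla(text):
    """Count whole-word, case-sensitive occurrences of 'Tesla',
    mirroring grep -o '\\bTesla\\b' piped through a line count."""
    return len(re.findall(r"\bTesla\b", text))

# Made-up sample: "Tesla" and "Tesla's" match; "Teslas" and "tesla" do not.
sample = "Tesla was a groomsman. Tesla's lab; Teslas and tesla do not count."
print(count_tesla(sample))  # 2
```

The word boundary matters: a possessive like "Tesla's" still counts because the apostrophe is not a word character, while a plural like "Teslas" does not.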
