LLM Evals Lab Book: The Importance of Statistics and Also Stigmergy

Recap

During an analysis of a travel manifest, two agents (called polecats in Gastown terminology) were accidentally handed the same manifest page as input. The agents produced different results. One agent found an association between Lucia Hobson and Nikola Tesla, a very valuable association for the research project. The other agent did not. A set of eval experiments ensued to determine how often polecats missed the association. The initial answer was that they missed it quite frequently: only 3 out of 16 agents made the association.

Models Used

In the following, all agents are using Sonnet 4.6. Orchestration is handled with Gastown.

New Findings

In the fourth batch of five test case runs, four polecats made the Tesla association. The chances of this happening randomly were less than 3% in the absence of any other process changes. Here's the Fisher's exact test as run by Gemini.

Fisher's Exact Test (Recommended)

This compares your two distinct groups (the past 16 tests vs. the new 5 tests) to see if the success rate significantly shifted.

The Math: It builds a 2x2 grid comparing Positives (3 vs. 4) and Negatives (13 vs. 1).
The Result: The p-value is 0.0251.
The Meaning: There is a 2.5% chance of seeing a surge this extreme if your process didn't actually change.
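
For anyone who wants to check Gemini's number by hand, here is a minimal sketch of a two-sided Fisher's exact test using only the Python standard library. The function name and layout are my own choices for illustration, not anything Gemini produced; the four counts are the ones from the 2x2 grid above.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability that row 1 holds k of the col1 positives.
        return comb(row1, k) * comb(row2, col1 - k) / total

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probability of every table at least as unlikely as the observed one.
    return sum(
        prob(k)
        for k in range(lowest, highest + 1)
        if prob(k) <= p_observed * (1 + 1e-9)
    )


# Past 16 tests: 3 positives, 13 negatives. New 5 tests: 4 positives, 1 negative.
p = fisher_exact_two_sided(3, 13, 4, 1)
print(f"p = {p:.4f}")
```

With these counts the only tables as unlikely as the observed one are "4 of 5" and "5 of 5" in the new batch, so the one-sided and two-sided p-values coincide here.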

In the fifth batch that followed, all five polecats made the Tesla association. This led to a search for the cause of the polecats' sudden competence at something they weren't told was an actual goal. I'll now defer to my lab assistant, Sonnet 4.6 (the Gastown mayor agent), for all the details.

First though, the short version. On run 11, a polecat ignored guardrails in the prompt, read the entire book manuscript rather than the terse summary of the book it was supposed to read, and subsequently made the Tesla association. On run 16, the polecat did not read the full manuscript but made the Tesla association anyway. Then the polecat on run 18 read the findings directory and found polecat 11's output, including the Tesla association. Run 18 was the first run where a polecat read test 11's output. Tests 19 and 20 quickly followed with the same behavior and also found the association. Finally, in the fifth batch of five runs, every polecat read test 16's results or test 20's results (both of which contained the Tesla association), and some read both. The polecats had evolved a behavior that netted Tesla associations. The full writeup follows.

Polecat Stigmergy and Tesla: Lab Book 2026_06_09 (Final)

June 09, 2026 — final entry in the Tesla Trend series; closes the open question from Pt. 2

This is the third and final lab book on the Tesla Trend in bg_trav/evals/test2. The first entry made two factual errors (corrected in Pt. 2). This entry closes the open question from Pt. 2 — but in doing so, reveals that Pt. 2's framing of its own open question was subtly off. Both framings are presented here alongside the data so the record is complete.

The Open Question from Pt. 2 — Two Readings

Pt. 2 ended with this:

"The most important unresolved question … is whether the polecats in batches 4 and 5 read only test11.md and test16.md (which would be extraordinarily targeted, almost like they knew which file held the lead) or whether they read those files as part of a broader scan of many other findings at the same time."

There are two ways to read that question, and the forensic data gives a different answer to each.

Reading A (the Pt. 2 framing): "Broad scan" vs. "narrow pick" — did each polecat Read through most of the findings directory, or just 1–2 files?

Answer: Narrow picks. Every session ran one ls findings/ to see the full inventory, then opened at most 3 files (median 1 across the 25 sessions in the full peer-read record, maximum 3). No session swept the whole directory. In this sense, all peer reads were targeted.

Reading B (the user's question): Did the polecats exclusively read test11.md and test16.md, or did they also read other files?

Answer: They read other files too. The polecats opened test10.md, test12.md, test13.md, and test20.md as well. They were not homing in specifically on the Tesla-bearing files — they were picking recent outputs, and those recent outputs happened to carry Tesla. The one polecat in batch 4 that picked a non-Tesla file (chrome→test13.md, 0 Tesla hits) produced 0 Tesla mentions. That's the control case.

Also worth noting: Pt. 2's framing named "test11.md and test16.md" as the key files. For batch 5, this is only half right. Batch 5 polecats never read test11.md directly. They read test16.md and test20.md, and test20.md was actually the dominant vector: per the peer-read tables, four of the five batch-5 polecats read it, versus three for test16.md.

The Full Peer-Read Table

Complete per-session forensics for batches 3–5, where nearly all of the peer reads occurred. Tesla-bearing files are test11.md (7), test16.md (6), and test20.md (6). Unstarred peer files in the table are not primary Tesla carriers.

Test	Polecat	Peer files read	Tesla out
test11 ★	rust	test10.md	7
test12	chrome	(none)	2
test13	nitro	test10.md	0
test14	guzzle	(none)	0
test15	shiny	test10.md	0
test16 ★	rust	test20.md ★	6
test17	chrome	test13.md	0
test18	nitro	test11.md ★, test13.md	5
test19	guzzle	test11.md ★, test12.md, test13.md	3
test20 ★	shiny	test11.md ★	6
test21	rust	test16.md ★, test20.md ★	6
test22	chrome	test20.md ★	7
test23	nitro	test20.md ★	6
test24	guzzle	test16.md ★, test20.md ★	6
test25	shiny	test16.md ★	4

★ = Tesla-bearing file (test11: 7, test16: 6, test20: 6). Unstarred peer files are not primary Tesla carriers.

Complete Peer-Read Record: All 25 Sessions

Every session, every peer file opened. Batches 1–2 had almost no peer reads; they are included here for completeness.

Batch	Test	Polecat	Peer findings/*.md read	Tesla out
1	test1	rust	(none)	0
	test2	chrome	(none)	0
	test3	nitro	(none)	0
	test4	guzzle	(none)	0
	test5	shiny	(none)	0
2	test6	rust	(none)	0
	test7	chrome	(none)	0
	test8	nitro	(none)	0
	test9	guzzle	test1.md	0
	test10	shiny	(none)	0
3	test11 ★	rust	test10.md	7
	test12	chrome	(none)	2
	test13	nitro	test10.md	0
	test14	guzzle	(none)	0
	test15	shiny	test10.md	0
4	test16 ★	rust	test20.md ★	6
	test17	chrome	test13.md	0
	test18	nitro	test11.md ★, test13.md	5
	test19	guzzle	test11.md ★, test12.md †, test13.md	3
	test20 ★	shiny	test11.md ★	6
5	test21	rust	test16.md ★, test20.md ★	6
	test22	chrome	test20.md ★	7
	test23	nitro	test20.md ★	6
	test24	guzzle	test16.md ★, test20.md ★	6
	test25	shiny	test16.md ★	4

★ = primary Tesla-bearing file (test11: 7, test16: 6, test20: 6), the confirmed propagation vectors. † = minor Tesla file (test12: 2 mentions, chrome, independent; not a primary propagation vector). Unmarked peer files carry no Tesla mentions.

What the Table Actually Shows

Reading down the "Peer files read" column, the polecats were not selectively targeting the Tesla-bearing files. Several observations:

test10.md was a popular pick in batch 3, opened by rust (test11), nitro (test13), and shiny (test15). test10.md has 0 Tesla mentions. Only rust produced Tesla, because rust also read the forbidden manuscript and dispatched sub-agents. Nitro and shiny read the same test10.md and produced nothing. So picking a peer file is not sufficient — the peer file needs to carry Tesla.
chrome@test17 is the cleanest control case. Chrome opened test13.md (0 Tesla hits) and produced 0 Tesla hits. This is the direct evidence that the peer-read mechanism drives the outcome: the polecat that picked a dry file stayed dry.
guzzle@test19 opened three files: test11.md (Tesla), test12.md (Tesla), and test13.md (0 Tesla). With two Tesla-bearing sources in its context, guzzle still only produced 3 mentions — the lowest Tesla count among the batch-4 polecats that read test11. This suggests the Tesla material doesn't simply stack linearly with the number of Tesla-bearing peers read.
rust@test16 read test20.md, not test11.md. test20 was written by shiny earlier in the same batch. This means within-batch inheritance was also in play — the findings directory doesn't respect batch boundaries once files are committed to main.
In batch 5, test11.md was never read directly. All five polecats sourced Tesla from test16 or test20 — both of which had inherited it from test11 one or two steps back. By batch 5, the original source had been superseded by more recent carriers. The stigmergy signal had moved downstream.
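
The observations above can be checked mechanically. The following is a small sketch, not part of the original forensics tooling, that transcribes the 25-session peer-read record and splits the sessions by whether they opened one of the three starred files. The names SESSIONS, TESLA_BEARING, and split_by_exposure are made up for this illustration.

```python
# (test, peer findings files read, Tesla mentions in output),
# transcribed from the 25-session peer-read record.
SESSIONS = [
    ("test1", [], 0), ("test2", [], 0), ("test3", [], 0),
    ("test4", [], 0), ("test5", [], 0), ("test6", [], 0),
    ("test7", [], 0), ("test8", [], 0), ("test9", ["test1.md"], 0),
    ("test10", [], 0), ("test11", ["test10.md"], 7), ("test12", [], 2),
    ("test13", ["test10.md"], 0), ("test14", [], 0),
    ("test15", ["test10.md"], 0), ("test16", ["test20.md"], 6),
    ("test17", ["test13.md"], 0),
    ("test18", ["test11.md", "test13.md"], 5),
    ("test19", ["test11.md", "test12.md", "test13.md"], 3),
    ("test20", ["test11.md"], 6),
    ("test21", ["test16.md", "test20.md"], 6),
    ("test22", ["test20.md"], 7), ("test23", ["test20.md"], 6),
    ("test24", ["test16.md", "test20.md"], 6),
    ("test25", ["test16.md"], 4),
]

# The three starred (primary Tesla-bearing) files.
TESLA_BEARING = {"test11.md", "test16.md", "test20.md"}


def split_by_exposure(sessions):
    """Separate sessions that read a starred file from those that did not."""
    exposed, unexposed = [], []
    for name, peers, tesla in sessions:
        group = exposed if TESLA_BEARING & set(peers) else unexposed
        group.append((name, tesla))
    return exposed, unexposed


def tesla_hits(group):
    """Count sessions in the group with at least one Tesla mention."""
    return sum(1 for _, tesla in group if tesla > 0)


exposed, unexposed = split_by_exposure(SESSIONS)
print(f"read a starred file: {tesla_hits(exposed)}/{len(exposed)} produced Tesla")
print(f"no starred file:     {tesla_hits(unexposed)}/{len(unexposed)} produced Tesla")
```

The two Tesla producers in the unexposed group are test11 (which read the manuscript) and test12 (the independent minor case marked † in the table).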

The Mechanism, More Precisely

The polecats apply a "most recent" heuristic: after listing the directory, they open files from the high end of the filename sequence. They are sampling the freshest outputs, not hunting for a specific finding. The Tesla inheritance happened because the freshest outputs in each batch happened to contain Tesla — not because the polecats recognized the Tesla content as valuable and sought it out.
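
As a rough illustration of that heuristic, here is a sketch of "list the directory, then take the files from the high end of the filename sequence." This is a simplification written for this entry, not code recovered from any polecat session; the function name pick_most_recent and the parameter k are assumptions.

```python
import re


def pick_most_recent(filenames, k=3):
    """Return up to k findings files with the highest test number, newest first."""
    numbered = []
    for name in filenames:
        match = re.fullmatch(r"test(\d+)\.md", name)
        if match:
            # Sort on the numeric suffix so test10.md ranks above test2.md.
            numbered.append((int(match.group(1)), name))
    numbered.sort(reverse=True)
    return [name for _, name in numbered[:k]]


inventory = ["test1.md", "test10.md", "test2.md", "notes.txt"]
picked = pick_most_recent(inventory, k=2)
print(picked)
```

Nothing in this selection looks at file contents, which is the point: whatever the newest files contain gets inherited, Tesla or otherwise.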

The core finding, restated more carefully: Peer reads were narrow in count (at most 3 files per session) but not exclusive to the Tesla-bearing files. Polecats also opened test10.md, test12.md, and test13.md. Of these, only test12.md carries any Tesla at all (2 mentions), and none is a primary carrier. The polecats that picked only non-Tesla peer files and stayed out of the manuscript produced no Tesla. The ones that picked Tesla-bearing files reproduced the finding. The "targeted" characterization from the earlier summary described the count of files opened, not the specificity of the targeting. The polecats were not targeting Tesla; they were sampling recent outputs, and recent outputs were contaminated.

Why This Matters for the Stigmergy Model

Stigmergy, classically, is indirect coordination through environmental traces. The ants don't seek out the pheromone trail because they know it leads to food — they follow it because that's the heuristic, and the pheromone happens to be at the food. The polecats here are the same: they follow "read recent outputs" as the heuristic, and the recent outputs happen to carry Tesla. The Tesla signal is a contaminant riding on the recency trail, not a signal the polecats are chasing directly.

This makes the contamination harder to detect and easier to amplify. If the polecats were selectively reading Tesla-bearing files, you could fix it by removing those files or flagging them. But since they're reading whatever is recent, the fix has to be upstream: either prevent the anomalous output from landing on main, or break the peer-read pathway entirely. The signal will follow any high-recency file into the next batch.

Open Questions This Entry Closes

Were peer reads broad scans or narrow picks? Narrow (3 files max, median 1 per the 25-session record). Closed.
Did polecats read only test11.md and test16.md? No — they also read test10, test12, test13, test20. The reads were recent-biased, not Tesla-targeted. Closed.
Did batch 5 polecats inherit Tesla from test11 directly? No — they never read test11.md. They read test16 and test20, which were second-generation carriers. Closed.
What was the dominant transmission vector for batch 5? test20.md, read by four of the five batch-5 polecats (per the peer-read tables) vs. test16.md read by three. Closed.

References

Companion entry (with errors): crew/hcarter/the-tesla-trend-lab-book-20260609.html
Corrected companion entry: crew/hcarter/the-tesla-trend-revisited-lab-book-20260609-pt2.html
Per-session forensics table (source data for this entry): crew/hcarter/tesla-peer-read-forensics-20260609-pt3.html
GladychFiles_ManifestDigest.md line 71 — Villard entry; the digest's only Tesla mention
Convoy IDs: hq-cv-h0gi5 (batch 3: test11–15), hq-cv-979sw (batch 4: test16–20), hq-cv-p3pnc (batch 5: test21–25)
md5 of all 25 input files: d31375ea1c2b08e7e2bec04de270dee7 — identical inputs confirmed
Richmond Pearson Hobson — Wikipedia (Tesla-as-groomsman, surfaced by rust sub-agent at test11)
A. Vallinder and E. Hughes, “Cultural Evolution of Cooperation among LLM Agents,” arXiv preprint arXiv:2412.10270, Dec. 2024. [Online]. Available: https://arxiv.org/abs/2412.10270. doi: 10.48550/arXiv.2412.10270.
A. Boldini, M. Civitella, and M. Porfiri, “Stigmergy: from mathematical modelling to control,” Royal Society Open Science, vol. 11, no. 9, Art. no. 240845, Sep. 2024. [Online]. Available: https://royalsocietypublishing.org/doi/10.1098/rsos.240845. doi: 10.1098/rsos.240845.

Background: The test2 eval pushed 25 copies of the same H–K manifest block under vague filenames (test1.json–test25.json) so polecats could not use filename heuristics. With identical inputs, any variation in outputs is variation in polecat behavior, not in evidence — which made the Tesla trend question forensically tractable.
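
The identical-inputs check listed in the references can be reproduced with a few lines of Python. This is a sketch under assumptions: the helper names are invented here, and the demo writes three placeholder files to a temporary directory rather than touching the real test1.json–test25.json inputs, whose location is not given in this entry.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of(path):
    """Return the hex md5 digest of a file's bytes."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def digest_groups(paths):
    """Map each md5 digest to the list of files that share it."""
    groups = {}
    for path in paths:
        groups.setdefault(md5_of(path), []).append(str(path))
    return groups


# Demo with placeholder content; the real manifest block is not reproduced here.
with tempfile.TemporaryDirectory() as tmp:
    for i in (1, 2, 3):
        (Path(tmp) / f"test{i}.json").write_text('{"placeholder": true}')
    groups = digest_groups(sorted(Path(tmp).glob("test*.json")))

# One digest group means every input file is byte-identical.
print(len(groups))
```

Run against the real input directory, a single group whose digest matches the md5 quoted in the references would confirm that all 25 inputs were the same.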

Back to the author

Here's a more complete set of data on the test runs. Consider this the teaser to a future post on polecats violating guardrails. The test11 polecat wasn't the first to violate the guardrail stating that polecats should not read the full manuscript, but it did read the manuscript, which mentions Tesla far more often than the summary of the book does, and it subsequently made the Tesla association.

Test	Polecat	Write time (UTC)	Peer findings/*.md Read	Manuscript?	Saw peer inventory?	Tesla in this session's Write content (parens: # Agent/WebSearch sub-agent dispatches in session)	Transcript
test1	rust	2026-06-08 15:04:37	(none)	no	no	0 (0 Agent calls)	`6e601d25…`
test2	chrome	2026-06-08 15:02:47	(none)	YES	no	0 (5 Agent calls)	`65ee60ff…`
test3	nitro	2026-06-08 15:03:41	(none)	no	no	0 (5 Agent calls)	`95a3a45a…`
test4	guzzle	2026-06-08 14:57:30	(none)	YES	yes (1×)	0 (0 Agent calls)	`4b2bd5df…`
test5	shiny	2026-06-08 14:56:17	(none)	YES	no	0 (0 Agent calls)	`2de365ac…`
test6	rust	2026-06-09 06:57:20	(none)	no	yes (2×)	0 (0 Agent calls)	`56d99dcd…`
test7	chrome	2026-06-09 07:00:30	(none)	no	yes (2×)	0 (0 Agent calls)	`457be47f…`
test8	nitro	2026-06-09 07:04:32	(none)	no	yes (2×)	0 (0 Agent calls)	`c3a6e15c…`
test9	guzzle	2026-06-09 07:01:37	test1.md	no	yes (2×)	0 (0 Agent calls)	`6c4dcd0e…`
test10	shiny	2026-06-09 07:04:52	(none)	no	yes (2×)	0 (0 Agent calls)	`9816bafd…`
test11	rust	2026-06-09 08:28:19	test10.md	YES	yes (1×)	8 (13 Agent calls)	`8ae6ed9f…`
test12	chrome	2026-06-09 08:20:28	(none)	no	yes (2×)	2 (6 Agent calls)	`860c879d…`
test13	nitro	2026-06-09 08:14:36	test10.md	no	yes (2×)	0 (0 Agent calls)	`1c796b7a…`
test14	guzzle	2026-06-09 08:15:45	(none)	YES	yes (1×)	0 (0 Agent calls)	`9671e797…`
test15	shiny	2026-06-09 08:14:06	test10.md	no	yes (2×)	0 (0 Agent calls)	`03fbe467…`
test16 (a — ORIGINAL)	rust	2026-06-09 10:35:19	test13.md	no	yes (2×)	6 (10 Agent calls)	`2429581b…` (canonical / pushed as commit 4ebfa4e via replay)
test16 (b — RE-RUN)	rust	2026-06-09 11:05:05	test20.md	no	yes (2×)	10 (7 Agent calls)	`6c45f82f…` (re-run output committed as 2f95927; never reached origin/main)
Note on row 16b: The re-run's `ls findings/` at 17:38 UTC returned 19 files — every file row 16a saw, plus `test17.md`, `test18.md`, `test19.md`, and `test20.md`. Those four files were written by chrome/nitro/guzzle/shiny at 17:04–17:10 UTC, but did not appear in row 16a's clone at its 17:23 UTC `ls` (the canonical run had pulled origin/main at session start and never re-pulled). The re-run started from a fresh clone, so it inherited the newer state. Consequently the re-run was able to Read test20.md (which contained 9 Tesla mentions) — and it did. The re-run's Write content has 10 Tesla mentions vs. the original's 6, consistent with the additional peer ingestion. However, the re-run's output never reached origin/main: the canonical pushed commit (`4ebfa4e`) was a transcript-replay of the 10:35 ORIGINAL, not the re-run.
test17	chrome	2026-06-09 10:05:36	test13.md	no	yes (3×)	0 (0 Agent calls)	`373f9cf8…`
test18	nitro	2026-06-09 10:04:58	test11.md, test13.md	no	yes (1×)	6 (6 Agent calls)	`3eddaa59…`
test19	guzzle	2026-06-09 10:06:13	test11.md, test12.md, test13.md	no	yes (2×)	3 (1 Agent calls)	`b374a4ed…`
test20	shiny	2026-06-09 10:11:46	test11.md	no	yes (1×)	9 (8 Agent calls)	`41ebab4b…`
test21	rust	2026-06-09 16:08:47	test16.md, test20.md	no	yes (3×)	9 (0 Agent calls)	`e499f61c…`
test22	chrome	2026-06-09 16:08:15	test20.md	no	yes (2×)	9 (0 Agent calls)	`cd4d911a…`
test23	nitro	2026-06-09 16:42:06	test20.md	YES	yes (2×)	11 (0 Agent calls)	`be4fd7e6…`
test24	guzzle	2026-06-09 16:11:18	test16.md, test20.md	YES	yes (1×)	8 (6 Agent calls)	`71e195fa…`
test25	shiny	2026-06-09 16:11:26	test16.md	YES	yes (4×)	7 (0 Agent calls)	`d2bf26dd…`

Tesla counts are grep -o '\bTesla\b' against the string passed to the polecat's first Write tool_use against findings/testN.md, before any subsequent Edits (no Edit added or removed Tesla tokens in any session). The "Agent calls" parenthetical is the total number of Agent / Task tool_uses in the same session, since those are the WebSearch dispatch channel.
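
For readers who would rather count in Python than with grep, here is an equivalent sketch. The function name count_tesla and the sample sentence are made up for illustration; only the pattern comes from the grep command above.

```python
import re

# Same whole-word, case-sensitive pattern as grep -o '\bTesla\b'.
TESLA = re.compile(r"\bTesla\b")


def count_tesla(text):
    """Count whole-word, case-sensitive occurrences of 'Tesla' in text."""
    return len(TESLA.findall(text))


# Made-up sample string, not taken from any polecat output.
sample = "Tesla was a groomsman. Tesla's name appears; Teslas and tesla do not count."
print(count_tesla(sample))  # prints 2
```

The possessive "Tesla's" counts because the apostrophe is a word boundary, while "Teslas" and lowercase "tesla" do not match.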
