
Posts

Showing posts with the label AI Evals

Can Agents Think Outside the Box?

With all the work that's been put into making agents "correct" by construction, I gotta say, sometimes I need an LLM agent to take a chance at just being wrong. I'm working on a book project called The Gladych Files. While the book is narrative nonfiction about the history of general relativity research, it explores the liminal space inhabited by very rich fringe scientist speculators of the 1950s who funded mainstream general relativity advances (more or less by accident). In those spaces, you'll find Tesla, the architect of the FBI building, Timothy Leary's LSD explorations, and many, many other things, institutions, and people. I've accumulated hundreds of pages of historical documents from various archives, and I'm using orchestrated agentic AI (in the form of Gastown) to review those documents. So far, the analysis has gone well, but last week I saw something that made me look up. I'd accidentally input the same archive page twice, so i...

Deploying a ChatKit Demo for PsyOps Detection

I deployed the LLM Psy-ops detection app earlier today! For those of you just hopping onboard, the WhyFiles ran an episode highlighting a simple, logical scoring method publicized by NCI for determining whether a piece of media or a news article is emotionally manipulative (think propaganda) or not. I was looking for a good app to practice deployment, guardrails, and evals, and this one, suggested by @somethingLethal on Reddit, seemed promising in all those regards. If you'd like to try it, you can find the app at https://projecttoucans.com/gladych_files_psy_ops.

LLMs, Simple Math, and Pricing

The Psy-op scoring instrument requires that the model sum the scores for the twenty categories. gpt-4o-mini did not sum any of the scores correctly. It got close, but that was about it. I experimented with the Python code interpreter to fix the simple math issue. The code interpreter seemed reasonable at first. I mean, three cents per compute minute, not bad, right? Ins...
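One way to sidestep the summing problem is to have the model return only the per-category scores and do the addition in ordinary application code. The snippet below is a minimal sketch of that idea, not the app's actual implementation: the function name `total_score`, the `category_N` keys, and the example scores are all made up for illustration. The only detail taken from the post is that the instrument has twenty categories.

```python
from typing import Mapping

# The post says the instrument has twenty categories; everything else
# in this sketch (names, example values) is a hypothetical placeholder.
EXPECTED_CATEGORIES = 20


def total_score(scores: Mapping[str, int], expected: int = EXPECTED_CATEGORIES) -> int:
    """Sum per-category scores in plain Python instead of asking the model."""
    if len(scores) != expected:
        raise ValueError(f"expected {expected} category scores, got {len(scores)}")
    for name, value in scores.items():
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise TypeError(f"score for {name!r} must be an integer, got {value!r}")
    return sum(scores.values())


# Hypothetical model output: placeholder category names, made-up scores.
example_scores = {f"category_{i}": (i % 5) + 1 for i in range(1, 21)}
print(total_score(example_scores))
```

The length check catches the case where the model drops or duplicates a category, which would otherwise produce a total that looks plausible but is wrong.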