
LLMs or SLMs? A Gladych Files PsyOps Demo Study

I put OpenAI’s gpt-5-nano and gpt-5.1 head-to-head on my psy-ops article scorer to see what you really get for the extra spend. Along the way I ran into pricing surprises, wild variance, and a reminder that ChatGPT’s shiny new memory feature can quietly bend your evals if you’re not careful.


  A post on LinkedIn a few days back suggested using Small Language Models (SLMs) instead of LLMs for repetitive tasks. That seemed like a great idea in some regards, but I was curious how it would apply to apps intended to perform language analysis. Luckily, I have the psy-ops app up and running. Also? At the moment it uses a close-to-an-SLM model, gpt-5-nano, due to pricing decisions. I used it as a test vehicle to look at the difference between gpt-5-nano and the full-featured gpt-5.1.

The testing framework I used:

Starting from this article, I first ran three separate analyses with gpt-5-nano, and then three more with gpt-5.1. I then used gpt-5.1 to analyze the differences in the output text.
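The steps above can be sketched as a small harness: run the same scoring prompt N times per model, keep the raw outputs, and hand them to the judge later. This is a minimal sketch, not the actual app; `call_model` is a stub standing in for a real API call, and the scoring prompt shown is a placeholder.

```python
# Sketch of a repeated-run harness. In practice, call_model would wrap an
# OpenAI-style API call such as client.responses.create(model=..., input=...).
def collect_runs(call_model, model, prompt, n=3):
    """Run the same prompt n times against one model, collecting raw outputs."""
    return [call_model(model, prompt) for _ in range(n)]

if __name__ == "__main__":
    # Stub standing in for a real API call.
    def fake_call(model, prompt):
        return f"{model} analysis of: {prompt[:20]}"

    prompt = "Score this article for psy-ops techniques..."  # placeholder prompt
    nano_runs = collect_runs(fake_call, "gpt-5-nano", prompt)
    full_runs = collect_runs(fake_call, "gpt-5.1", prompt)
    print(len(nano_runs), len(full_runs))  # 3 3
```

The judge (gpt-5.1 in my case) then receives both lists of outputs in a single comparison prompt.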

Pricing: Very Different

The most noticeable difference is pricing. Here's the pricing chart at the time of this writing.


Keep in mind that those prices are per million tokens. Whether they matter much depends on your usage and the economic deployment model for your app. For this test, I went from not being able to move the needle above one cent in an hour or so of testing with gpt-5-nano to spending eight cents once I invoked gpt-5.1.
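The arithmetic behind "per million tokens" is simple enough to sanity-check your own workload. The prices below are placeholders for illustration only, not OpenAI's actual rates; check the current pricing page for real numbers.

```python
def run_cost(input_tokens, output_tokens, price_in_per_m, price_out_per_m):
    """Dollar cost of one run, given per-million-token prices."""
    return (input_tokens * price_in_per_m
            + output_tokens * price_out_per_m) / 1_000_000

# Hypothetical prices, purely for illustration.
cheap = run_cost(10_000, 2_000, price_in_per_m=0.05, price_out_per_m=0.40)
pricey = run_cost(10_000, 2_000, price_in_per_m=1.25, price_out_per_m=10.00)
print(f"${cheap:.4f} vs ${pricey:.4f}")
```

With token counts like these, a cheap model stays in fractions of a cent per run while a premium model is an order of magnitude or two higher, which matches the one-cent-versus-eight-cents experience above.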

tl;dr

gpt-5.1 really did do a better job. I've included the rather extensive comparisons, (performed by gpt-5.1), below.

There was also the time gpt-5.1 didn't actually do the analysis and just repackaged what it had produced in another, earlier chat session with me.

For links to all the comparisons, and an explanation of the above, read on.

Links to Variance Evals

Comparing three gpt-5-nano chats for variance
Comparing three gpt-5.1 chats for variance

The Memory-Free Comparison of gpt-5-nano to gpt-5.1

Here's the comparison I trust, made with ChatGPT's memory feature turned off. For more details, keep reading.

Models Gonna Lie

I had to go to some lengths to get gpt-5.1 to do the work I asked of it. In the first comparison of the two models, I accidentally let it slip that the first group of outputs was from gpt-5-nano.


GPT Shirking Work

To compensate for that, I ran a second comparison with the identity of the models removed.



That comparison gave similar results, but, thanks to GPT's memory feature, the model admitted it had not forgotten that I'd let slip which model produced one of the groups:


GPT Admonished

That was annoying, but maybe the comparison was still valid? After all, I hadn't changed the order of the inputs. To find out, in my next test I did change the order of the inputs. I used the same prompt, but first admonished the LLM not to look at our previous chats.
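Blinding the judge can also be mechanized rather than trusted to admonishment: strip the model names and randomize which group comes first before building the comparison prompt. A sketch, with short run lists standing in for the real outputs; the labels and function name are my own, not part of the app.

```python
import random

def blind_groups(runs_a, runs_b, seed=None):
    """Shuffle which model's runs appear first and relabel them neutrally.
    Returns the blinded groups plus a key for un-blinding afterward."""
    rng = random.Random(seed)
    groups = [("model_a", runs_a), ("model_b", runs_b)]
    rng.shuffle(groups)  # randomize presentation order
    blinded = {f"Group {label}": runs for label, (_, runs) in zip("AB", groups)}
    key = {f"Group {label}": name for label, (name, _) in zip("AB", groups)}
    return blinded, key

blinded, key = blind_groups(["nano run 1", "nano run 2"],
                            ["full run 1", "full run 2"], seed=42)
print(sorted(blinded))  # ['Group A', 'Group B']
```

The judge only ever sees "Group A" and "Group B"; you consult `key` after it renders a verdict.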


Here are the results:


Once it bothered to read the input at all, it detected that the more full-featured model, gpt-5.1, was now in the first group rather than the second.

Memory Feature Turned Off

Finally, I turned off GPT's memory feature so I had an unbiased LLM with no recollection that we'd ever spoken to each other. The results matched my admonished chat.
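If you drive the judge through the API rather than the ChatGPT UI, there's nothing to turn off: each request carries only the context you send. A minimal sketch, assuming the Chat Completions-style message format; the prompts shown are placeholders.

```python
def fresh_request(system_prompt, article_text):
    """Build a self-contained request payload. No prior chat history is
    included, so there's no memory for the model to lean on."""
    return {
        "model": "gpt-5.1",
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": article_text},
        ],
    }

payload = fresh_request("Score this article for psy-ops techniques.",
                        "ARTICLE TEXT")
# Each call starts from scratch; nothing from earlier sessions rides along.
# In practice: client.chat.completions.create(**payload)
print(len(payload["messages"]))  # 2
```

This is the easiest way to get the "no recollection that we'd ever spoken" condition by construction rather than by settings toggle.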


Closing Thoughts

If I had anything that depended on the output of the psy-op tool, I'd use the gpt-5.1 model. It had less variance, and I didn't witness it get caught up in the article being analyzed. Also? I'd use the analysis as a set of pointers to the article to do my own analysis. In other words, I'd still need to read the article myself if I felt like there was anything at stake.

GPT's memory dossier feature is a thing to be reckoned with. I've purposely depended on it to make my prompting looser and frequently more successful. I've also seen gpt-5.* shirk work because of it more than once. Buyer beware.

If you’re experimenting with LLM-based evaluators, I’d love to hear what you’re seeing in the wild.
Leave a comment—there’s a lot more to explore about how these models behave when the stakes aren’t just toy examples.


