I put OpenAI’s gpt-5-nano and gpt-5.1 head-to-head on my psy-ops article scorer to see what you really get for the extra spend. Along the way I ran into pricing surprises, wild variance, and a reminder that ChatGPT’s shiny new memory feature can quietly bend your evals if you’re not careful.
A post on LinkedIn a few days back suggested using Small Language Models (SLMs) instead of LLMs for repetitive tasks. That seemed like a great idea in some respects, but I was curious how it would apply to apps meant to perform language analysis. Luckily, I have the psy-ops app up and running, and at the moment it uses a close-to-an-SLM model, gpt-5-nano, chosen for pricing reasons. I used it as a test vehicle to look at the difference between gpt-5-nano and the full-featured gpt-5.1.
The testing framework I used:
Starting from this article, I first ran three separate analyses with gpt-5-nano, then three more with gpt-5.1. I then used gpt-5.1 to analyze the differences in the output text.
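Here's a minimal sketch of that loop using the OpenAI Python client. The model names are the ones from this test; the article path, prompt text, and helper names are placeholders, not the app's actual code.

```python
from openai import OpenAI

client = OpenAI()

ARTICLE = open("article.txt").read()  # placeholder path for the article under test
SCORER_PROMPT = "Analyze this article for psy-ops techniques:\n\n"  # placeholder prompt

def run_analysis(model: str, runs: int = 3) -> list[str]:
    """Run the same scoring prompt several times to expose run-to-run variance."""
    outputs = []
    for _ in range(runs):
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": SCORER_PROMPT + ARTICLE}],
        )
        outputs.append(resp.choices[0].message.content)
    return outputs

nano_outputs = run_analysis("gpt-5-nano")
full_outputs = run_analysis("gpt-5.1")
```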
Pricing: Very Different
The most noticeable difference is pricing. Here's the pricing chart at the time of this writing.
Keep in mind that those prices are per million tokens. Whether they matter much depends on your usage and the economics of how you deploy your app. For this test, I went from not being able to move the needle above one cent in an hour or so of testing with gpt-5-nano to spending eight cents once I invoked gpt-5.1.
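For a sense of scale, here's the back-of-the-envelope arithmetic. The per-million-token prices below are placeholders to fill in from the chart above, not quoted figures.

```python
# Rough cost estimate for one analysis run. Prices are per million tokens;
# the example numbers are placeholders -- pull the real ones from the chart.
def run_cost(input_tokens: int, output_tokens: int,
             input_price_per_m: float, output_price_per_m: float) -> float:
    return (input_tokens / 1_000_000) * input_price_per_m + \
           (output_tokens / 1_000_000) * output_price_per_m

# e.g. a 6,000-token article plus a 2,000-token analysis:
print(run_cost(6_000, 2_000, input_price_per_m=0.05, output_price_per_m=0.40))
```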
tl;dr
gpt-5.1 really did do a better job. The comparisons (performed by gpt-5.1) are included in full below; they're extensive.
There was also the time gpt-5.1 didn't actually do the analysis at all and just repackaged what it had produced in an earlier chat session with me.
For links to all the comparisons, and an explanation of the above, read on.
Links to Variance Evals
The Memory Free Comparison of gpt-5-nano to gpt-5.1
Models Gonna Lie
I had to go to some lengths to get gpt-5.1 to do the work I asked of it. In the first comparison of the two models, I accidentally let it slip that the first group of outputs was from gpt-5-nano.
GPT Shirking Work
To compensate for that, I ran a second comparison with the models' identities removed.
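One way to do that kind of blinding mechanically is to label the outputs "Group A" and "Group B" when building the judge's prompt. This is an illustrative sketch, not the exact prompt I used:

```python
# Build a "blinded" comparison prompt: the judge sees only Group A / Group B,
# never the model names. The instruction wording here is illustrative.
def build_comparison_prompt(outputs_a: list[str], outputs_b: list[str]) -> str:
    parts = ["Compare the two groups of analyses below on depth, accuracy, and "
             "usefulness. Do not try to guess which model produced which group."]
    for label, outputs in (("Group A", outputs_a), ("Group B", outputs_b)):
        parts.append(f"\n## {label}")
        for i, text in enumerate(outputs, start=1):
            parts.append(f"\n--- Run {i} ---\n{text}")
    return "\n".join(parts)
```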
That comparison gave similar results, and, thanks to GPT's memory feature, the model admitted it had not forgotten that I'd let slip which model produced one of the groups:
GPT Admonished
That was annoying, but maybe the comparison was still valid? After all, I hadn't changed the order of the inputs. To find out, in my next test I did change the order. I used the same prompt, but first admonished the LLM not to look at our previous chats.
Here are the results
Once it bothered to read the input at all, it detected that the more full-featured model, gpt-5.1, was now in the first group rather than the second.
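If you want to check for that kind of position sensitivity systematically, you can judge both orderings and see whether the verdict flips. A sketch, reusing the helpers above:

```python
# Counterbalance check: run the judge on both orderings of the same material.
def judge(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-5.1",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

verdict_ab = judge(build_comparison_prompt(nano_outputs, full_outputs))
verdict_ba = judge(build_comparison_prompt(full_outputs, nano_outputs))
# If the preferred group changes when the order changes, position bias (or
# leaked context) is doing part of the work.
```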
Memory Feature Turned Off
Finally, I turned off ChatGPT's memory feature so I had an unbiased LLM with no recollection that we'd ever spoken. The results matched my admonished chat.
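In the ChatGPT UI that meant flipping a setting; over the Chat Completions API there's no cross-session memory to turn off in the first place, since each request sees only the messages you send it. For example:

```python
# Every call here is stateless: the judge's entire context is this one message
# list, so nothing carries over from earlier chats.
resp = client.chat.completions.create(
    model="gpt-5.1",
    messages=[{"role": "user",
               "content": build_comparison_prompt(nano_outputs, full_outputs)}],
)
print(resp.choices[0].message.content)
```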
Closing Thoughts
If you’re experimenting with LLM-based evaluators, I’d love to hear what you’re seeing in the wild.
Leave a comment—there’s a lot more to explore about how these models behave when the stakes aren’t just toy examples.