
Posts

Showing posts with the label llm evaluation

Gladych Files Lab Book: Document OCR vs LLM Model vs Cost or Opus is Cheaper than Sonnet for OCR!

I started my lab book entries when I was a physics graduate student. It's kind of amusing and kind of cool how far I've come. I have the equivalent of a grad student (aka Claude Opus 4.7) working for me now. I spent some time over the weekend setting up an OCR framework for a book research project of mine. I've been coming up to speed on evals, so I decided to run one to determine which model was the most accurate and cost-effective for doing OCR on travel manifest pages. I stepped the eval along by hand rather than automating it, and talked the results through with Opus as I went. First, it turns out that Opus at low effort is both the most accurate and the most cost-effective choice! That was a surprise. The result has to do with Opus' ability to look at higher-res images, which means it needs to think less for OCR than Sonnet does. Second, at the end of the eval, as I was preparing to write up my results, it occurred to me that I could ask my grad student to do it instead. Here's...
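To make the "pricier model can still be cheaper per page" idea concrete, here is a minimal sketch of the arithmetic. The function name, token counts, and per-million-token prices are all placeholders I made up for illustration; they are not the numbers from the eval and not any vendor's actual pricing.

```python
def cost_per_page(input_tokens, output_tokens,
                  input_price_per_mtok, output_price_per_mtok):
    """Dollar cost of one page, given token counts and prices per million tokens."""
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000


# Hypothetical numbers only: a model with higher per-token prices
# that needs few output (thinking) tokens to read a page...
pricier_model = cost_per_page(
    input_tokens=2_000, output_tokens=1_000,
    input_price_per_mtok=10.0, output_price_per_mtok=50.0,
)

# ...versus a model at half the price that spends far more output tokens.
cheaper_model = cost_per_page(
    input_tokens=2_000, output_tokens=8_000,
    input_price_per_mtok=5.0, output_price_per_mtok=25.0,
)

print(f"pricier model: ${pricier_model:.3f} per page")
print(f"cheaper model: ${cheaper_model:.3f} per page")
```

With these made-up inputs the model with the higher list price comes out cheaper per page, because output tokens dominate the bill. Whether that holds for a real workload depends entirely on the measured token counts.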

LLMs or SLMs? A Gladych Files PsyOps Demo Study

I put OpenAI’s gpt-5-nano and gpt-5.1 head-to-head on my psy-ops article scorer to see what you really get for the extra spend. Along the way I ran into pricing surprises, wild variance, and a reminder that ChatGPT’s shiny new memory feature can quietly bend your evals if you’re not careful. A post on LinkedIn a few days back suggested using Small Language Models (SLMs) as opposed to LLMs for repetitive tasks. This seemed like a great idea in some regards for me, but I was curious about how it would apply to apps that are intended to perform language analysis. Luckily, I have the psy-ops app up and running. Also, at the moment it's using a close-to-an-SLM model, gpt-5-nano, due to pricing decisions. I used it as a test vehicle to look at the difference between gpt-5-nano and the full-featured gpt-5.1. The testing framework I used: Starting from this article, I first did three separate analyses with gpt-5-nano, and then three others with gpt-5.1. I then used gpt-5.1...
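With only three runs per model, run-to-run variance matters as much as the average. Below is a minimal sketch of how repeated scores could be summarized per model. The `summarize_runs` helper and the scores are hypothetical placeholders, not the results from this study, and the sketch assumes each analysis boils down to a single numeric score.

```python
from statistics import mean, stdev


def summarize_runs(scores_by_model):
    """Return run count, mean, sample standard deviation, and range per model."""
    summary = {}
    for model, scores in scores_by_model.items():
        summary[model] = {
            "n": len(scores),
            "mean": mean(scores),
            # Sample standard deviation needs at least two runs.
            "stdev": stdev(scores) if len(scores) > 1 else 0.0,
            "range": max(scores) - min(scores),
        }
    return summary


# Placeholder scores for illustration only (three runs per model).
example_scores = {
    "gpt-5-nano": [60, 70, 80],
    "gpt-5.1": [70, 72, 74],
}

summary = summarize_runs(example_scores)
for model, stats in summary.items():
    print(f"{model}: mean={stats['mean']:.1f} "
          f"stdev={stats['stdev']:.1f} range={stats['range']}")
```

Three samples give only a rough estimate of spread, so the range is reported next to the standard deviation as a sanity check.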