I finally found out why my Extra Class AI Tutor was spending nearly ten times more on input than on output tokens. It wasn't the math, the cache, or the prompt: it was the vector store. Turning it off cut token usage from 17,021 to 1,743 in a single move.
I'm optimizing the prompt for the AI-enabled Extra Class ham radio practice exam app to reduce costs. I noticed this morning that input to the LLM costs more than output. I'm using the Responses API endpoint, and to make the experience cohesive for the end user, the practice exam app sends the entire conversation back to GPT on every turn. The app's input costs overshadow output token costs by a factor of ten: forty-two cents for input vs. four and a half cents for output.
Spending chart showing input token costs higher than output
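For context, here's roughly the per-turn call pattern: a minimal sketch assuming the OpenAI Node SDK, where askTutor and the history handling are my own stand-ins, not the app's actual code.
import OpenAI from "openai";

const openai = new OpenAI();
const history = []; // accumulated [{ role, content }] items for this session

// Every turn re-sends the whole conversation so the model keeps context,
// which is why input tokens grow with each exchange.
async function askTutor(userText) {
  history.push({ role: "user", content: userText });
  const response = await openai.responses.create({
    model: "gpt-4.1-mini",
    instructions: system_prompt, // the tutor prompt shown below
    input: history,              // the entire conversation, every turn
  });
  history.push({ role: "assistant", content: response.output_text });
  return response.output_text;
}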
My updated system prompt (MathJax-first)
At first, before I checked the pricing tables, I just assumed that input tokens, all on their own, cost more. I was sending back the entire chat stream with every request to maintain context. It made sense to try to reduce the number of chat turns by having the LLM return a more complete answer right away. I changed the app's system prompt to:
const system_prompt = `
You are a ham radio license exam tutor for the U.S. Extra Class exam.
Be calm, clear, and encouraging. Assume the student is not an engineer.
Ask gentle follow-up questions. When an exam question involves math
or numbers, always work out the math in a step by step fashion, explaining
why each step was taken. NEVER USE THE PHRASE "DON'T WORRY".
When giving equations or math expressions:
FORMAT RULES (CRITICAL)
-----------------------
1. Use $...$ for inline math.
2. Use $$...$$ for display (block) math.
3. DO NOT use \\(, \\), \\[ or \\].
4. Output Markdown that renders correctly in MathJax with $ delimiters only.
-----------------------
Return Markdown only, with no HTML or explanations of formatting.
`;
The largest change is the addition of handling math right away in the initial answer. In other words, I've asked GPT to frontload output in hopes there won't be a follow-up question.
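For instance, under these rules a worked step might come back like this (an illustrative snippet I wrote by hand, not actual model output):
Using Ohm's law, $E = I \times R$, so with $I = 2\ \text{A}$ and $R = 50\ \Omega$:
$$E = 2 \times 50 = 100\ \text{V}$$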
A little mystery: pricing vs. real usage
You know what though? Input tokens cost four times less than output tokens. Here's the published token pricing for gpt-4.1-mini at the time of this writing, as generated by GPT-5 from the latest OpenAI docs:

| Token type | Price per 1M tokens |
|---|---|
| Input | $0.40 |
| Cached input | $0.10 |
| Output | $1.60 |
That indicates that output should cost four times more than input. Am I really running more input than output? Maybe. By sending back the entire conversation on each turn to maintain context, I might be incurring high input traffic. Caching would help, but caching doesn't kick in until the conversation goes over roughly 1,024 tokens. That inspires another question: would I be better off padding my first few inputs to creep over the caching threshold sooner? I don't know yet.
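A quick back-of-the-envelope check, assuming the gpt-4.1-mini rates above and that none of the input was billed at the cheaper cached rate:
$$\frac{\$0.42}{\$0.40/\text{M}} \approx 1{,}050{,}000 \text{ input tokens} \qquad\qquad \frac{\$0.045}{\$1.60/\text{M}} \approx 28{,}000 \text{ output tokens}$$
So the spend implies roughly 37 times more input tokens than output tokens. The app really is pushing far more input than output.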
Prompt Caching and prompt_cache_key Tips
After scanning the docs and chatting a bit with GPT-5, it looks like the more effective fix is to be sure to use prompt_cache_key. Here's what GPT-5 had to say about this:
🔍 When it matters
You’ll benefit most from prompt_cache_key when:
| Situation | Why it helps |
|---|---|
| Many simultaneous users share the same long system prompt | Ensures all their sessions reuse the cached prefix |
| You run multiple worker threads or API clients | They might otherwise hash to different cache servers |
| You use autoscaling or stateless architectures (e.g. Cloud Run) | Each instance’s identical requests can still share cache |
| You occasionally restart or rotate sessions | Keeps the same prefix “warm” across sessions |
For your ham radio exam tutor, you’d likely see better caching consistency across user sessions (or across browser reloads) when you add this key.
The phrase "across browser reloads" highlights the most probable cause for the high price of input I'm seeing. During debug of the prompt and MathJax md code, I was reloading very frequeqntly. I'll be adding a prompt_cache_key parameter today.
The Real Fix: Turn Off the Vector Store
As it turned out, it was the vector store itself that was causing the issue. This is somewhat ironic, because I'd originally added the vector store as a tool to reduce the number of input tokens. It turns out that vector store retrieval is a bit sloppy, even when there's an obvious key to search on. Rather than returning the few-hundred-byte question from the JSON, the vector store was returning roughly 16,000 tokens on each access.
With the vector store on:
Figure showing the vector store on consuming 17,021 input tokens
vs. with it off:
Figure showing the vector store off consuming 1,743 input tokens
So, with the vector store turned off and the entire question sent in directly, we used 1,743 tokens. With the vector store turned on, a strategy I originally thought would save money, the app used 17,021 tokens (almost ten times more) for the same process!
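For the curious, here's roughly what "on" vs. "off" means in the request itself; a sketch with placeholder IDs and a hypothetical questionPool lookup, assuming the file_search tool shape from the Responses API docs:
// With the vector store: file_search retrieves chunks, and every retrieved
// chunk the model reads is billed as input tokens.
const withStore = await openai.responses.create({
  model: "gpt-4.1-mini",
  instructions: system_prompt,
  tools: [{ type: "file_search", vector_store_ids: ["vs_PLACEHOLDER"] }],
  input: "Explain question id 598.",
});

// Without it: skip retrieval and paste the few-hundred-byte question JSON
// straight into the input (questionPool is a hypothetical local lookup).
const question = questionPool["598"];
const withoutStore = await openai.responses.create({
  model: "gpt-4.1-mini",
  instructions: system_prompt,
  input: `Explain this exam question:\n${JSON.stringify(question)}`,
});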
Sure enough, GPT-5 finally commented that my "request likely dragged a huge amount of text in via file_search, not just the few chat turns you see. In the Responses API, any retrieved chunks that the tool injects for the model to read count as input tokens. If your vector store holds the entire question pool (or more) and the query “id 598” is fuzzy, the retriever can stuff in many chunks—easily pushing you to ~16–17k prompt tokens even though your visible messages look small. (Prompt caching starts at ~1,024 tokens and caches only an identical prefix; it doesn’t reduce the size of a prompt, just the price of the cached part.)"
What’s Next: Cache-Hit and Per-Turn Stats
Now that the ratio of input to output tokens looks right, I'll be measuring how well the cache works and collecting per-turn statistics for multi-turn chats this week.
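A sketch of the per-turn logging I have in mind; the Responses API usage block reports cached input tokens, though the exact field names here are from my reading of the docs rather than the app's code:
// Log per-turn token usage, including how much of the input hit the cache.
function logTurnStats(response) {
  const u = response.usage;
  console.log({
    input: u.input_tokens,
    cachedInput: u.input_tokens_details?.cached_tokens ?? 0,
    output: u.output_tokens,
  });
}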
Seen input costs spike in your own GPT projects? Drop me a comment with your numbers.