I finally found out why my Extra Class AI Tutor was spending nearly ten times more on input than on output tokens. It wasn't the math, the cache, or the prompt: it was the vector store. Turning it off cut token usage from 17,021 to 1,743 in a single move.
I'm optimizing the prompt for the AI-enabled Extra Class ham radio practice exam app to reduce costs. I noticed this morning that input to the LLM costs more than output. I'm using the Responses API endpoint, and to make the experience cohesive for the end user, the practice exam app sends the entire conversation back to GPT on every turn. The app's input costs overshadow output token costs by a factor of ten: forty-two cents for input vs. four and a half cents for output.
Spending chart showing input token costs higher than output
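For context, here's roughly the per-turn call pattern: a minimal sketch assuming the OpenAI Node SDK, where askTutor and the history handling are my own stand-ins, not the app's actual code.
import OpenAI from "openai";

const openai = new OpenAI();
const history = []; // accumulated [{ role, content }] items for this session

// Every turn re-sends the whole conversation so the model keeps context,
// which is why input tokens grow with each exchange.
async function askTutor(userText) {
  history.push({ role: "user", content: userText });
  const response = await openai.responses.create({
    model: "gpt-4.1-mini",
    instructions: system_prompt, // the tutor prompt shown below
    input: history,              // the entire conversation, every turn
  });
  history.push({ role: "assistant", content: response.output_text });
  return response.output_text;
}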
My updated system prompt (MathJax-first)
At first, before I checked the pricing tables, I just assumed that input tokens, all on their own, cost more. I was sending back the entire chat stream with every request to maintain context. It made sense to try to reduce the number of chat turns by having the LLM return a more complete answer right away. I changed the app's system prompt to:
const system_prompt = `
You are a ham radio license exam tutor for the U.S. Extra Class exam.
Be calm, clear, and encouraging. Assume the student is not an engineer.
Ask gentle follow-up questions. When an exam question involves math
or numbers, always work out the math in a step by step fashion, explaining
why each step was taken. NEVER USE THE PHRASE "DON'T WORRY".
When giving equations or math expressions:
FORMAT RULES (CRITICAL)
-----------------------
1. Use $...$ for inline math.
2. Use $$...$$ for display (block) math.
3. DO NOT use \\(, \\), \\[ or \\].
4. Output Markdown that renders correctly in MathJax with $ delimiters only.
-----------------------
Return Markdown only, with no HTML or explanations of formatting.
`;
The largest change is the addition of handling math right away in the initial answer. In other words, I've asked GPT to frontload output in hopes there won't be a follow-up question.
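For instance, under these rules a worked step might come back like this (an illustrative snippet I wrote by hand, not actual model output):
Using Ohm's law, $E = I \times R$, so with $I = 2\ \text{A}$ and $R = 50\ \Omega$:
$$E = 2 \times 50 = 100\ \text{V}$$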
A little mystery: pricing vs. real usage
You know what though? Input tokens cost four times less than output tokens. Here's the published token pricing for gpt-4.1-mini at the time of this writing, as generated by GPT-5 from the latest OpenAI docs:

| Token type | Price per 1M tokens |
|---|---|
| Input | $0.40 |
| Cached input | $0.10 |
| Output | $1.60 |
That indicates that output should cost four times more than input. Am I really running more input than output? Maybe. By sending back the entire conversation on each turn to maintain context, I might be incurring high input traffic. Caching would help, but caching doesn't kick in until the conversation goes over roughly 1,024 tokens. That inspires another question: would I be better off padding my first few inputs to creep over the caching threshold sooner? I don't know yet.
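A quick back-of-the-envelope check, assuming the gpt-4.1-mini rates above and that none of the input was billed at the cheaper cached rate:
$$\frac{\$0.42}{\$0.40/\text{M}} \approx 1{,}050{,}000 \text{ input tokens} \qquad\qquad \frac{\$0.045}{\$1.60/\text{M}} \approx 28{,}000 \text{ output tokens}$$
So the spend implies roughly 37 times more input tokens than output tokens. The app really is pushing far more input than output.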
Prompt Caching and prompt_cache_key Tips
After scanning the docs and chatting a bit with GPT-5, it looks like the more effective fix is to be sure to use prompt_cache_key. Here's what GPT-5 had to say about this:
🔍 When it matters
You’ll benefit most from prompt_cache_key when:
| Situation | Why it helps |
|---|---|
| Many simultaneous users share the same long system prompt | Ensures all their sessions reuse the cached prefix |
| You run multiple worker threads or API clients | They might otherwise hash to different cache servers |
| You use autoscaling or stateless architectures (e.g. Cloud Run) | Each instance’s identical requests can still share cache |
| You occasionally restart or rotate sessions | Keeps the same prefix “warm” across sessions |
For your ham radio exam tutor, you’d likely see better caching consistency across user sessions (or across browser reloads) when you add this key.
The phrase "across browser reloads" highlights the most probable cause for the high price of input I'm seeing. During debug of the prompt and MathJax md code, I was reloading very frequeqntly. I'll be adding a prompt_cache_key parameter today.
The Real Fix: Turn Off the Vector Store
As it turned out, it was the vector store itself that was causing the issue. This is somewhat ironic, because I'd originally added the vector store as a tool to reduce the number of input tokens. It turns out that vector store retrieval is a bit sloppy, even when there's an obvious key to search on. Rather than returning the few-hundred-byte question from the JSON, the vector store was returning roughly 16,000 tokens on each access.
With the vector store on:
Figure showing the vector store on consuming 17,021 input tokens
vs. with it off:
Figure showing the vector store off consuming 1,743 input tokens
So, with the vector store turned off and the entire question sent in directly, we used 1,743 tokens. With the vector store turned on, a strategy I originally thought would save money, the app used 17,021 tokens (almost ten times more) for the same process!
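For the curious, here's roughly what "on" vs. "off" means in the request itself; a sketch with placeholder IDs and a hypothetical questionPool lookup, assuming the file_search tool shape from the Responses API docs:
// With the vector store: file_search retrieves chunks, and every retrieved
// chunk the model reads is billed as input tokens.
const withStore = await openai.responses.create({
  model: "gpt-4.1-mini",
  instructions: system_prompt,
  tools: [{ type: "file_search", vector_store_ids: ["vs_PLACEHOLDER"] }],
  input: "Explain question id 598.",
});

// Without it: skip retrieval and paste the few-hundred-byte question JSON
// straight into the input (questionPool is a hypothetical local lookup).
const question = questionPool["598"];
const withoutStore = await openai.responses.create({
  model: "gpt-4.1-mini",
  instructions: system_prompt,
  input: `Explain this exam question:\n${JSON.stringify(question)}`,
});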
Sure enough, GPT-5 finally commented that my "request likely dragged a huge amount of text in via file_search, not just the few chat turns you see. In the Responses API, any retrieved chunks that the tool injects for the model to read count as input tokens. If your vector store holds the entire question pool (or more) and the query “id 598” is fuzzy, the retriever can stuff in many chunks—easily pushing you to ~16–17k prompt tokens even though your visible messages look small. (Prompt caching starts at ~1,024 tokens and caches only an identical prefix; it doesn’t reduce the size of a prompt, just the price of the cached part.)"
What’s Next: Cache-Hit and Per-Turn Stats
Now that the ratio of input to output tokens looks right, I'll be measuring how well the cache works and collecting per-turn statistics for multi-turn chats this week.
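A sketch of the per-turn logging I have in mind; the Responses API usage block reports cached input tokens, though the exact field names here are from my reading of the docs rather than the app's code:
// Log per-turn token usage, including how much of the input hit the cache.
function logTurnStats(response) {
  const u = response.usage;
  console.log({
    input: u.input_tokens,
    cachedInput: u.input_tokens_details?.cached_tokens ?? 0,
    output: u.output_tokens,
  });
}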
Seen input costs spike in your own GPT projects? Drop me a comment with your numbers.