
How I Cut GPT Input Costs 10× by Turning Off the Vector Store on the Ham Radio Practice Exams


I finally found out why my Extra Class AI Tutor was spending nearly ten times more on input than on output tokens. It wasn't the math, the cache, or the prompt: it was the vector store. Turning it off cut input token usage from 17,021 to 1,743 in a single move.

I'm optimizing the prompt for the AI-enabled Extra Class ham radio practice exams to reduce costs. I noticed this morning that input to the LLM costs more than output. I'm using the Responses API endpoint. To keep the experience cohesive for the end user, the practice exam app sends the entire conversation back to GPT on every turn. Even allowing for that, the gap is striking: the app's input costs overshadow output token costs by a factor of 10, forty-two cents for input vs. four and a half cents for output.

[Figure: usage dashboard panels showing gpt-4.1-mini input spend (~$0.42) vs. output (~$0.045) on Oct 06, a roughly 10× input:output ratio despite output's higher per-token price.]
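For context, the per-turn pattern looks roughly like this. This is a minimal sketch assuming the official openai Node SDK; the conversation array and askTutor helper are names I made up, not the app's actual code.

import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

const conversation = []; // grows by two entries every turn

async function askTutor(userText) {
  conversation.push({ role: "user", content: userText });

  const response = await client.responses.create({
    model: "gpt-4.1-mini",
    instructions: system_prompt, // the tutor prompt shown below
    input: conversation,         // the ENTIRE history goes back as input tokens
  });

  conversation.push({ role: "assistant", content: response.output_text });
  return response.output_text;
}

Because the whole history rides along on every call, per-turn input grows with conversation length while output stays a single answer, so input spend outpacing output isn't automatically a bug. The question is whether it explains a full 10×.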

My updated system prompt (MathJax-first)

At first, before I checked the pricing tables, I just assumed that input tokens, all on their own, cost more. I was sending back the entire chat stream with every request to maintain context, so it made sense to try to reduce the number of chat turns by having the LLM return a more complete answer right away. I changed the app's system prompt to:

const system_prompt = `
You are a ham radio license exam tutor for the U.S. Extra Class exam.
Be calm, clear, and encouraging. Assume the student is not an engineer.
Ask gentle follow-up questions. When an exam question involves math
or numbers, always work out the math in a step by step fashion, explaining
why each step was taken. NEVER USE THE PHRASE "DON'T WORRY".

When giving equations or math expressions:

FORMAT RULES (CRITICAL)
-----------------------
1. Use $...$ for inline math.
2. Use $$...$$ for display (block) math.
3. DO NOT use \\(, \\), \\[ or \\].
4. Output Markdown that renders correctly in MathJax with $ delimiters only.
-----------------------

Return Markdown only—no HTML or explanations of formatting.
`;

The largest change is the instruction to work any math immediately in the initial answer. In other words, I've asked GPT to frontload output in hopes there won't be a follow-up question.

A little mystery: pricing vs. real usage

You know what, though? Per token, input costs four times less than output, not more. Here are the actual token prices at the time of this writing, as generated by GPT-5:

Here's a table of the published token pricing for gpt-4.1-mini (as of the latest OpenAI docs):

Token Type     Price per 1M Tokens   Notes / Source
Input          $0.40                 OpenAI pricing docs
Cached Input   $0.10                 OpenAI pricing docs
Output         $1.60                 OpenAI pricing docs

That indicates that output should cost four times more than input. Am I really running that much more input than output? Maybe. By sending the entire conversation back on each turn to maintain context, I might be incurring high input traffic. Caching would help, but caching doesn't kick in until a prompt goes over 1,024 tokens. That inspires another question: would I be better off padding my first few inputs to creep over the caching threshold sooner? I don't know yet.
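A quick sanity check against those prices shows just how lopsided the traffic was. Dividing the dashboard spend by the per-token price:

$$\frac{\$0.42}{\$0.40/\text{M}} \approx 1{,}050{,}000 \text{ input tokens} \qquad \frac{\$0.045}{\$1.60/\text{M}} \approx 28{,}000 \text{ output tokens}$$

That's roughly 37 input tokens for every output token, far more than resending a handful of chat turns would explain.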

Prompt Caching and prompt_cache_key Tips

After scanning the docs and chatting a bit with GPT-5, it looks like the more effective fix is to use prompt_cache_key. Here's what GPT-5 had to say about this:

🔍 When it matters

You’ll benefit most from prompt_cache_key when:

• Many simultaneous users share the same long system prompt: ensures all their sessions reuse the cached prefix
• You run multiple worker threads or API clients: they might otherwise hash to different cache servers
• You use autoscaling or stateless architectures (e.g. Cloud Run): each instance's identical requests can still share cache
• You occasionally restart or rotate sessions: keeps the same prefix "warm" across sessions

For your ham radio exam tutor, you’d likely see better caching consistency across user sessions (or across browser reloads) when you add this key.


The phrase "across browser reloads" highlights the most probable cause for the high price of input I'm seeing. During debug of the prompt and MathJax md code, I was reloading very frequeqntly. I'll be adding a prompt_cache_key parameter today.
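In code, that's a one-parameter change to the call from the earlier sketch. prompt_cache_key is documented for the Responses API; the key string here is one I made up, and any stable value shared across sessions should do:

const response = await client.responses.create({
  model: "gpt-4.1-mini",
  instructions: system_prompt,
  input: conversation,
  prompt_cache_key: "extra-class-tutor-v1", // stable across sessions and reloads
});

Note that the key only steers cache routing; a request still has to share an identical prefix of at least 1,024 tokens with a recent request to actually get the cached rate.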

The Real Fix: Turn Off the Vector Store

As it turned out, it was the vector store itself that was causing the issue. This is somewhat ironic, because I'd originally added the vector store as a tool to reduce the number of input tokens. It turns out that vector store retrieval is a bit sloppy, even when there's an obvious key to search on. Rather than returning the few-hundred-byte question from the JSON, the vector store was returning 16,000 tokens on each access.
With the vector store on:

[Figure: usage detail showing 17,021 input tokens]

vs. with it off:

[Figure: usage detail showing 1,743 input tokens]


So, with the vector store turned off and the entire question sent inline, the app used 1,743 tokens. With the vector store turned on, a strategy I originally thought would save spending, it used 17,021 tokens, almost ten times more, for the same process!
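Concretely, the fix was to stop attaching the vector store tool and inline the question instead. A minimal sketch, again with made-up names (QUESTION_POOL_STORE_ID, questionJson); the tool shape follows the Responses API's file_search docs:

// Before: file_search against the question-pool vector store. Whatever it
// retrieves is appended to the prompt and billed as input tokens.
const withStore = await client.responses.create({
  model: "gpt-4.1-mini",
  instructions: system_prompt,
  input: conversation,
  tools: [{ type: "file_search", vector_store_ids: [QUESTION_POOL_STORE_ID] }],
});

// After: no tool at all; the few-hundred-byte question JSON goes in directly.
conversation.push({
  role: "user",
  content: `Here is exam question 598:\n${JSON.stringify(questionJson)}`,
});
const withoutStore = await client.responses.create({
  model: "gpt-4.1-mini",
  instructions: system_prompt,
  input: conversation,
});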

Sure enough, GPT-5 finally commented:

your request likely dragged a huge amount of text in via file_search, not just the few chat turns you see. In the Responses API, any retrieved chunks that the tool injects for the model to read count as input tokens. If your vector store holds the entire question pool (or more) and the query "id 598" is fuzzy, the retriever can stuff in many chunks—easily pushing you to ~16–17k prompt tokens even though your visible messages look small. (Prompt caching starts at ~1,024 tokens and caches only an identical prefix; it doesn't reduce the size of a prompt, just the price of the cached part.)

Why a 16,799-token prompt happens

• Tool-expanded context: file_search pulls matching passages and the API silently appends them to the model's input; those tokens are billed as prompt tokens. If the retriever returns lots of chunks, totals jump fast. (See the OpenAI docs on the File Search tool.)
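If I ever bring the store back, the File Search tool does accept a cap on how much it injects. A sketch using the documented max_num_results option, with the same made-up names as above:

const capped = await client.responses.create({
  model: "gpt-4.1-mini",
  instructions: system_prompt,
  input: conversation,
  tools: [{
    type: "file_search",
    vector_store_ids: [QUESTION_POOL_STORE_ID],
    max_num_results: 2, // cap retrieved chunks so they can't balloon the prompt
  }],
});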


What’s Next: Cache-Hit and Per-Turn Stats

Now that the ratio of input to output tokens looks right, I'll be measuring how well the cache performs and gathering per-turn statistics for multi-turn chats this week.
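The Responses API already reports what I'll need in each response's usage block; here's a minimal per-turn logging sketch, assuming the documented usage fields:

const { usage } = response;
console.log({
  input: usage.input_tokens,
  cached: usage.input_tokens_details.cached_tokens, // billed at the cached $0.10/M rate
  output: usage.output_tokens,
  cacheHitRate: (usage.input_tokens_details.cached_tokens / usage.input_tokens).toFixed(2),
});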

Seen input costs spike in your own GPT projects? Drop me a comment with your numbers.














