Skip to content

← All writing

How LLM Tokenization Works (and Why Your API Bill Is Higher Than You Think)

· by Andergrove Software

Large language models don't read words — they read tokens. Every prompt you send and every reply you get back is counted, billed, and capped in tokens, not characters or words. If you've ever been surprised by an API bill or hit a "context length exceeded" error on a prompt that looked short, tokenization is almost always why.

Here's what a token actually is, why the count is so hard to predict, and how to estimate it before you ship. You can check any text against the LLM Token Counter as you read.

What is a token?

A token is a chunk of text — usually a few characters — produced by the model's tokenizer. Tokenizers are built with algorithms like Byte Pair Encoding (BPE), which start from individual characters and merge the most common pairs over and over until they have a fixed vocabulary of perhaps 100,000 sub-word units.

The practical result: common words are often a single token, while rare words get split into pieces. The word "token" might be one token, but "tokenization" could split into token + ization. A leading space is usually part of the token too, so " the" and "the" can tokenize differently.

The rule of thumb for English prose is roughly 1 token ≈ ¾ of a word, or about 4 characters per token. So 1,000 tokens is around 750 words. That estimate is good enough for a first guess — and badly wrong for a lot of real input.

Why "4 characters per token" breaks

The ¾-of-a-word rule assumes ordinary English. The moment your text stops looking like prose, the ratio shifts — usually against you:

  • Code and JSON. Brackets, indentation, quotes and operators fragment into many small tokens. A minified JSON blob can use far more tokens per character than the same number of words of English.
  • Other languages. Text in languages the tokenizer wasn't optimized for — or any non-Latin script — often uses several tokens per character.
  • Emoji and symbols. A single emoji can be two or more tokens. Math symbols, box-drawing characters, and unusual Unicode all inflate the count.
  • Long numbers and IDs. A UUID or a long number is rarely one token — it's chopped into several.

This is why you can't eyeball a token count. Two prompts of identical word count can differ by 2× in token count depending on what's in them.

Different models count differently

There is no universal token. Each model family ships its own tokenizer, so the same text produces different token counts on different models. A prompt that's 1,000 tokens for one model might be 1,150 for another. When you migrate between models — even within the same provider — re-measure rather than assuming the old count holds.

One trap worth calling out: don't reach for a generic library like tiktoken to count tokens for a non-OpenAI model. It's OpenAI's tokenizer and will undercount other models' tokens, sometimes badly on code or non-English text. Use the provider's own counter, or a general-purpose estimate (like the one on this site) for budgeting.

Input tokens, output tokens, and the bill

You pay for two things, usually at different rates:

  • Input tokens — everything you send: the system prompt, the conversation history, retrieved documents, and the user's message.
  • Output tokens — everything the model generates back.

Output is typically the more expensive of the two. As of mid-2026, a frontier model like Claude Opus 4.8 is priced around $5 per million input tokens and $25 per million output tokens — output costs 5× as much. (Always check current pricing; these numbers move.)

The non-obvious cost driver is history. APIs are stateless: every turn of a chat re-sends the entire conversation as input. A 2,000-token system prompt plus a growing transcript is paid for again on every request. Ten turns into a conversation, most of your input bill is re-reading what you already sent.

A worked example

Say you're running a support assistant:

  • System prompt: ~1,500 tokens
  • Average user message: ~200 tokens
  • Average reply: ~400 tokens

For one request, input ≈ 1,700 tokens and output ≈ 400 tokens. At $5 / $25 per million:

  • Input: 1,700 × $5 / 1,000,000 ≈ $0.0085
  • Output: 400 × $25 / 1,000,000 = $0.0100
  • Total ≈ $0.0185 per request

That's under two cents — but multiply by 100,000 requests a month and you're at roughly $1,850, before you account for conversation history growing the input on multi-turn chats. Tokens are cheap individually and expensive in aggregate, which is exactly why estimating them up front matters.

Five ways to cut token usage

  1. Trim the system prompt. It's paid on every request. Tighten it ruthlessly and move rarely-needed detail into retrieval.
  2. Summarize or truncate history. Don't resend the whole transcript forever — compact older turns once a conversation gets long.
  3. Ask for structured, bounded output. "Reply with a JSON object of at most three fields" produces fewer output tokens than an open-ended essay.
  4. Cache repeated context. If you send the same large preamble every time, prompt caching can cut the cost of that prefix dramatically.
  5. Measure before you ship. Paste a representative prompt into the token counter and price it out at your model's rate before it's running at scale.

The takeaway

Tokens are the real currency of LLM APIs, and they don't map cleanly to words. Code, other languages, emoji and long IDs all cost more than they look, history quietly compounds your input bill, and every model counts differently. Get in the habit of measuring: paste your prompts into the LLM Token Counter, estimate the cost, and you'll never be surprised by the invoice again.