The basic formula
Most text LLM APIs publish prices per 1 million tokens. A useful first estimate for a single request is input tokens multiplied by the input rate, plus output tokens multiplied by the output rate, with the sum divided by 1,000,000 because the rates are quoted per million tokens. Multiply that per-request cost by request volume to get a total. The calculator uses that shape, then adjusts for cached input and batch pricing where those rates are published.
The key habit is to estimate input and output separately. A chat request that carries a long system prompt, retrieved documents, tool schemas, and conversation history may contain far more input than the visible user text. A model that writes long answers can be cheap on input but expensive on output.
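The estimate described above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual code: the function name, parameter names, and example rates are assumptions, and the rates are not any provider's published prices.

```python
def estimate_cost(input_tokens, output_tokens, input_rate_per_m, output_rate_per_m, requests=1):
    """Estimate the cost of `requests` identical requests.

    Rates are prices per 1 million tokens, in whatever currency the provider quotes.
    """
    # Cost of one request, still scaled by 1,000,000 because rates are per million tokens.
    per_request_scaled = input_tokens * input_rate_per_m + output_tokens * output_rate_per_m
    return per_request_scaled * requests / 1_000_000


# Hypothetical workload and rates (not real prices):
# 2,000 input tokens and 500 output tokens per request, 100,000 requests.
total = estimate_cost(2_000, 500, input_rate_per_m=1.00, output_rate_per_m=4.00, requests=100_000)
print(f"Estimated cost: {total:.2f}")
```

With these made-up numbers, output is a fifth of the tokens but half of the estimated cost, which is why the two sides are worth estimating separately.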
Input tokens and output tokens
Input tokens are the prompt side of the request: system instructions, developer messages, user messages, conversation history, retrieved context, tool definitions, images or files when represented as tokens, and any hidden framing the API counts. Output tokens are generated by the model. Providers usually charge different rates for each side.
For planning, measure real traces instead of guessing from character counts. A support bot might use hundreds of input tokens per turn; a repo-aware coding assistant or research agent can use thousands. Use the AI token cost calculator to test conservative, expected, and high-traffic cases.
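One way to turn logged traces into planning inputs is to take a typical value and a high-percentile value from real token counts. The sketch below uses a simple nearest-rank percentile; the helper name and the sample counts are hypothetical, and your logging pipeline may already provide percentiles.

```python
def percentile(values, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # ceil(p * n / 100) using integer arithmetic to avoid float rounding.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Hypothetical input-token counts taken from ten logged requests.
input_tokens_per_request = [300, 420, 380, 2500, 410, 390, 450, 400, 360, 5200]

expected_case = percentile(input_tokens_per_request, 50)
high_case = percentile(input_tokens_per_request, 95)
print(expected_case, high_case)
```

In this made-up sample the median request is a few hundred tokens, while the high-percentile request is more than ten times larger, so an estimate built only on the typical case would understate the expensive requests.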
Cached input is not the same as free input
Cached input pricing rewards repeated prompt prefixes. Examples include a stable system prompt, a large instruction block, a reused policy document, or a coding agent repeatedly sending the same repository context. When a provider recognizes the repeated input, those cached tokens may be billed at a lower cached-input rate.
Cache rules vary by provider. Some providers expose separate prices for cache writes and cache hits; some publish only a cached input rate; some models have no cached discount. This site models cached input as a percentage of prompt tokens billed at the published cached-input or cache-hit rate. It does not add cache storage fees or first-write premiums unless they are represented in the model price row.
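The cached-input model described above can be sketched as follows. It is an illustration under stated assumptions: the function and parameter names are hypothetical, the rates are made up, and cache write premiums and storage fees are deliberately left out, as in the description.

```python
def input_cost_with_cache(input_tokens, cached_fraction, input_rate_per_m, cached_rate_per_m=None):
    """Input-side cost when a fraction of prompt tokens is billed at a cached rate.

    Rates are per 1 million tokens. If no cached rate is published for the model,
    all input is billed at the standard input rate.
    """
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    if cached_rate_per_m is None:
        cached_rate_per_m = input_rate_per_m  # no cached discount for this model
    cached_tokens = input_tokens * cached_fraction
    uncached_tokens = input_tokens - cached_tokens
    return (uncached_tokens * input_rate_per_m + cached_tokens * cached_rate_per_m) / 1_000_000


# Hypothetical rates: 2.00 per 1M standard input, 0.50 per 1M cached input.
with_cache = input_cost_with_cache(10_000, 0.75, 2.00, 0.50)
without_cache = input_cost_with_cache(10_000, 0.0, 2.00, 0.50)
print(with_cache, without_cache)
```

Even with three quarters of the prompt cached, the input cost in this example is reduced rather than eliminated, because the cached tokens are still billed at the cached rate.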
Batch pricing is for work that can wait
Batch APIs are often cheaper because requests can be processed asynchronously. They fit offline jobs such as classification, extraction, tagging, evaluation, backfills, and large summarization runs. They are a poor fit for chat, agents, autocomplete, or any product path where the user is waiting for the response.
In the calculator, the batch toggle only changes the estimate when a model row has explicit batch input and output rates. If a provider page does not publish a batch rate for that model, the estimate stays on standard pricing.
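That fallback behavior can be sketched like this. The `PriceRow` structure and its field names are assumptions made for illustration and do not describe the calculator's real data model; the rates are placeholders.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class PriceRow:
    """Hypothetical price row; all rates are per 1 million tokens."""
    input_rate: float
    output_rate: float
    batch_input_rate: Optional[float] = None
    batch_output_rate: Optional[float] = None


def effective_rates(row: PriceRow, use_batch: bool) -> Tuple[float, float]:
    """Return (input_rate, output_rate), using batch rates only when both are published."""
    has_batch = row.batch_input_rate is not None and row.batch_output_rate is not None
    if use_batch and has_batch:
        return row.batch_input_rate, row.batch_output_rate
    return row.input_rate, row.output_rate


# Placeholder rates, not real prices.
with_batch = PriceRow(2.00, 8.00, batch_input_rate=1.00, batch_output_rate=4.00)
no_batch = PriceRow(2.00, 8.00)

print(effective_rates(with_batch, use_batch=True))
print(effective_rates(no_batch, use_batch=True))
```

Requiring both batch rates before switching avoids mixing a batch input rate with a standard output rate in the same estimate.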
Context windows affect what can fit, not just what costs
A context window is the maximum amount of prompt plus output a model can handle in one request. A larger context window can make long-document analysis, codebase tasks, and multi-step agents possible, but it does not mean you should fill the window every time. Large prompts cost more, may run slower, and can crowd out output space.
Compare context limits on the AI model pricing table, then estimate the actual token payload your app sends. A smaller-context model with a lower output rate can be cheaper than a larger-context model if your workload rarely needs the extra room.
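A quick fit check, using the definition above of a context window as prompt plus output, might look like the sketch below. The function names and the 128,000-token window are illustrative assumptions; some providers also publish a separate output limit, which this sketch does not model.

```python
def fits_context(prompt_tokens, max_output_tokens, context_window):
    """True when the prompt plus the requested output budget fits in the window."""
    return prompt_tokens + max_output_tokens <= context_window


def remaining_output_room(prompt_tokens, context_window):
    """Tokens left for output after the prompt; never negative."""
    return max(0, context_window - prompt_tokens)


# Hypothetical 128,000-token context window.
window = 128_000
print(fits_context(120_000, 16_000, window))   # large prompt crowds out the output budget
print(remaining_output_room(120_000, window))
print(fits_context(20_000, 16_000, window))
```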
Why calculator estimates can differ from real bills
A calculator is a planning model, not your provider invoice. Real bills can include retries, streaming responses users abandon, tool-call loops, cache misses, prompt growth over a long conversation, different rates for audio or image inputs, regional fees, taxes, committed-use discounts, enterprise contracts, and minimum billing rules.
The safest workflow is to start with a calculator estimate, instrument token usage in production, and review provider invoices after launch. If your application is agentic, also track how many model calls happen per user action. One visible user request can turn into multiple hidden API calls.
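Tracking model calls per user action can start as a simple aggregation over call logs. The sketch below assumes each logged call carries an `action_id`, `input_tokens`, and `output_tokens` field; those field names and the sample records are hypothetical, so adapt them to whatever your instrumentation records.

```python
from collections import defaultdict


def summarize_actions(call_logs):
    """Group model calls by user action and total their calls and tokens."""
    totals = defaultdict(lambda: {"calls": 0, "input_tokens": 0, "output_tokens": 0})
    for call in call_logs:
        entry = totals[call["action_id"]]
        entry["calls"] += 1
        entry["input_tokens"] += call["input_tokens"]
        entry["output_tokens"] += call["output_tokens"]
    return dict(totals)


# Hypothetical logs: one agentic action made three model calls, another made one.
logs = [
    {"action_id": "a1", "input_tokens": 1200, "output_tokens": 150},
    {"action_id": "a1", "input_tokens": 1900, "output_tokens": 80},
    {"action_id": "a1", "input_tokens": 2400, "output_tokens": 400},
    {"action_id": "a2", "input_tokens": 900, "output_tokens": 200},
]

summary = summarize_actions(logs)
for action_id, stats in summary.items():
    print(action_id, stats)
```

Feeding the per-action totals, rather than per-call averages, into a cost estimate keeps hidden calls from being undercounted.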
| Concept | What changes cost | How to estimate |
|---|---|---|
| Input tokens | Prompt size, history, retrieved context, tool definitions | Measure typical request payloads and keep high-percentile traces |
| Output tokens | Answer length, reasoning style, formatting, max token settings | Log actual generated tokens per feature or workflow |
| Cached input | Repeated prompt prefixes and provider cache eligibility | Use a cache percentage only when prompts are genuinely reused |
| Batch mode | Async processing eligibility and published batch rates | Separate offline jobs from latency-sensitive product requests |
| Context window | How much prompt plus output can fit in one request | Budget for the tokens you send, not the maximum the model allows |
Provider checks before choosing a model
Provider pricing pages are not perfectly interchangeable. Before choosing a model, check whether the page separates cache reads from cache writes, whether batch mode is available, whether the listed rate is for a preview or stable model, and whether long context tiers change the price.
Pricing data on LLM Pricing was last reviewed on 2026-06-16. Official source links are maintained on each provider page and in the footer: OpenAI, Anthropic, Google, DeepSeek.
Practical checklist
- Estimate input and output tokens separately for each feature.
- Use cached input only for prompt content that is actually reused.
- Use batch rates only for asynchronous work that can tolerate delay.
- Check context size, but optimize for the tokens your app sends.
- Compare at least one premium, one balanced, and one budget model under the same workload.
- Validate the estimate with production token logs and provider invoices.
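As a sketch of the comparison step in the checklist, the snippet below prices one workload against three model tiers. The tier labels, rates, and workload numbers are all placeholders chosen for illustration; substitute current rates from the provider pages before drawing conclusions.

```python
def workload_cost(input_tokens, output_tokens, requests, input_rate_per_m, output_rate_per_m):
    """Estimated cost of a workload; rates are per 1 million tokens."""
    return (input_tokens * input_rate_per_m + output_tokens * output_rate_per_m) * requests / 1_000_000


# Placeholder (input_rate, output_rate) pairs, not real prices.
tiers = {
    "premium": (10.00, 30.00),
    "balanced": (2.00, 8.00),
    "budget": (0.25, 1.00),
}

# Hypothetical workload: 3,000 input and 600 output tokens per request, 50,000 requests.
costs = {
    name: workload_cost(3_000, 600, 50_000, in_rate, out_rate)
    for name, (in_rate, out_rate) in tiers.items()
}

for name, cost in sorted(costs.items(), key=lambda item: item[1]):
    print(f"{name}: {cost:.2f}")
```

Holding the workload fixed means any difference in the result comes from the rates alone, which makes the tiers directly comparable.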