API Usage: Scaling AI Without Going Broke

Pascual Vila
AI Engineer // Code Syllabus
Deploying an AI app is easy. Keeping API costs under control once you hit scale is where the real engineering begins. By mastering Tokenomics, Rate Limiting, and Semantic Caching, you secure your app's future.
1. Token Economics (Tokenomics)
LLMs don't read words; they read Tokens. A token is approximately 4 characters in English. Providers like OpenAI and Anthropic charge per 1,000 or 1,000,000 tokens.
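The "roughly 4 characters per token" rule gives a quick budgeting heuristic. A minimal sketch, using that approximation only; for exact counts you would use the provider's own tokenizer (e.g. OpenAI's tiktoken library):

```python
# Rough token estimate: ~4 characters per token for English text.
# This is a budgeting heuristic, not an exact count -- real tokenizers
# split on subwords, so use the provider's tokenizer for billing math.
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

prompt = "Summarize this article in three bullet points."
print(estimate_tokens(prompt))  # quick ballpark, not a billed figure
```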
Crucially, you are billed for both Input (your prompt) and Output (the AI's response). Output tokens are generally much more expensive than input tokens.
2. Rate Limiting to Prevent Abuse
Without rate limiting, a single malicious user (or a bug in a `while` loop) can drain your entire monthly OpenAI budget in minutes.
Using middleware (like Upstash Redis for Next.js), you should enforce strict limits per user IP or API key (e.g., 10 AI requests per minute). Always return a `429 Too Many Requests` status code to halt the drain.
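The per-user limit above can be sketched as a sliding-window counter. This is an in-memory illustration only; in production you would back it with a shared store such as Redis so the limit holds across server instances. The window size, limit, and function names are assumptions for the example:

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60   # sliding window length (assumed)
MAX_REQUESTS = 10     # allowed requests per window (assumed)
_hits = defaultdict(deque)  # key -> timestamps of recent requests

def allow_request(key, now=None):
    """Return True if this key may make a request; False means send 429."""
    now = time.monotonic() if now is None else now
    hits = _hits[key]
    # Evict timestamps that have fallen out of the window.
    while hits and now - hits[0] > WINDOW_SECONDS:
        hits.popleft()
    if len(hits) >= MAX_REQUESTS:
        return False  # over the limit: respond 429 Too Many Requests
    hits.append(now)
    return True
```

A shared store matters because a serverless app runs many instances; a purely in-memory counter would give each instance its own quota.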
3. Semantic Caching
If users frequently ask "What is your refund policy?", paying an LLM to generate the exact same answer 500 times is burning money. Implement Semantic Caching.
- Exact Caching: Hash the exact prompt string. If Redis has it, return it. $0 API cost.
- Vector Caching: Convert the prompt to an embedding. If a user asks a semantically similar question (e.g., "How do I get a refund?"), return the cached answer.
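The exact-caching variant can be sketched in a few lines. The dict stands in for Redis, and `call_llm` is a placeholder for your real API call; normalizing the prompt before hashing is an assumption to catch trivial variations like casing and whitespace:

```python
import hashlib

_cache = {}  # stands in for Redis in this sketch

def cached_answer(prompt, call_llm):
    """Return a cached answer for an identical prompt, else call the LLM once."""
    key = hashlib.sha256(prompt.strip().lower().encode()).hexdigest()
    if key in _cache:
        return _cache[key]     # cache hit: $0 API cost
    answer = call_llm(prompt)  # cache miss: pay for one generation
    _cache[key] = answer
    return answer
```

Vector caching replaces the hash lookup with a nearest-neighbor search over prompt embeddings, returning the stored answer when similarity to a cached prompt exceeds a chosen threshold.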
Frequently Asked Questions
How do I calculate OpenAI API costs for my web app?
Costs are calculated from token consumption. Count the tokens in your system prompt plus user input (Input Tokens) and the tokens the AI generates (Output Tokens), then multiply each by the model's pricing tier per 1K or 1M tokens.
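That calculation is simple arithmetic once you have the token counts. A worked sketch with hypothetical placeholder prices (substitute your model's actual per-1M-token rates):

```python
# Prices are placeholder assumptions, USD per 1M tokens -- check your
# model's real pricing tier before budgeting.
INPUT_PRICE_PER_M = 0.50
OUTPUT_PRICE_PER_M = 1.50   # output typically costs more than input

def request_cost(input_tokens, output_tokens):
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# Example: 1,200 prompt tokens + 400 generated tokens per request.
per_request = request_cost(1_200, 400)   # 0.0012 USD
monthly = per_request * 100_000          # 120 USD at 100k requests/month
```

Running the numbers like this per-request makes it obvious how quickly verbose prompts and long completions compound at scale.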
Which LLM model should I choose to reduce costs?
Always use a "waterfall" approach. Default to the fastest, cheapest model (like gpt-3.5-turbo or Claude 3 Haiku) for 80% of tasks like routing, summarizing, or classification. Only escalate to flagship models (GPT-4o, Claude Opus) for complex reasoning or coding tasks.
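The waterfall can be implemented as a small router in front of your API calls. The model names come from the text above, but the complexity heuristic here is an illustrative assumption; real routers often use a cheap classifier call instead of keywords:

```python
# Waterfall routing sketch: default to the cheap model, escalate only
# when the task looks like complex reasoning or coding. The keyword
# heuristic is a stand-in for a real complexity classifier.
CHEAP_MODEL = "gpt-3.5-turbo"
FLAGSHIP_MODEL = "gpt-4o"

COMPLEX_HINTS = ("prove", "refactor", "debug", "architecture", "multi-step")

def pick_model(task: str) -> str:
    text = task.lower()
    if any(hint in text for hint in COMPLEX_HINTS):
        return FLAGSHIP_MODEL  # complex reasoning / coding
    return CHEAP_MODEL         # routing, summarizing, classification
```

Because roughly 80% of traffic takes the cheap branch, average cost per request drops sharply without degrading the hard cases.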
How can I set hard billing limits on OpenAI?
Inside the OpenAI dashboard, navigate to Billing > Usage Limits and set a "Hard limit". Once your budget is hit, all further API requests are rejected automatically, protecting you from unexpected massive bills.