Every API call is a direct cost to your business. Infrastructure optimization ensures you only pay for 'new' intelligence while ruthlessly protecting your servers from abuse.
1. Exact Match Caching
Let's be blunt: sending the exact same text to OpenAI twice is setting money on fire. The foundation of AI infrastructure optimization is Exact Match Caching. By putting a blisteringly fast in-memory store like Redis (via Upstash) in front of the model, you intercept duplicate requests before they ever reach the expensive AI provider.
If a user asks a common FAQ, you serve the stored response in milliseconds, with no provider call and no token cost. This cuts your API spend and makes repeat questions feel near-instant.
import { Redis } from '@upstash/redis';
import crypto from 'crypto';

const redis = Redis.fromEnv();

export async function getCachedResponse(prompt) {
  const hash = crypto.createHash('sha256').update(prompt).digest('hex');
  const cached = await redis.get(`chat:${hash}`);
  if (cached) return cached;
  // Fallback: Call LLM and save...
}

[CACHE HIT] Serving from Redis... (12ms)
Saved $0.03 on duplicate query.
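
The miss path is left open in the snippet above. As a minimal sketch of one way to finish it, the block below uses the cache-aside pattern with an expiry, so stale answers eventually age out. Everything here is illustrative: the `Map` is a hypothetical in-memory stand-in for Redis, and `getOrCompute`, `callModel`, and the one-hour TTL are assumptions, not part of the original code.

```javascript
// CommonJS require so the sketch runs as a standalone Node script.
const { createHash } = require('node:crypto');

// Hypothetical in-memory stand-in for Redis: key -> { value, expiresAt }.
const store = new Map();

// Same key scheme as getCachedResponse: "chat:" + SHA-256 of the prompt.
function cacheKey(prompt) {
  return `chat:${createHash('sha256').update(prompt).digest('hex')}`;
}

// Cache-aside lookup. `callModel` is an assumed async function that
// takes a prompt and resolves to the provider's answer.
// `now` is injectable so expiry can be exercised without waiting.
async function getOrCompute(prompt, callModel, ttlMs = 60 * 60 * 1000, now = Date.now()) {
  const key = cacheKey(prompt);
  const hit = store.get(key);
  if (hit && hit.expiresAt > now) {
    return { response: hit.value, cached: true };
  }
  // Miss or expired entry: pay for the call once, then remember the answer.
  const response = await callModel(prompt);
  store.set(key, { value: response, expiresAt: now + ttlMs });
  return { response, cached: false };
}
```

Against a real Redis instance the expiry would normally be handled by the store itself (for example by setting a TTL when writing the key) rather than by comparing timestamps in application code.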
2. Semantic Caching with Embeddings
Exact matching fails the moment a user adds a typo or changes a single word. 'What is the price?' and 'How much does it cost?' mean the exact same thing but have completely different string hashes. This is where we upgrade to Semantic Caching.
We convert the user's prompt into a numeric vector (an embedding) and compute its cosine similarity against the embeddings of previous questions. If the similarity exceeds a threshold (0.95 in this example), we serve the cached answer. You're no longer matching strings; you're matching human intent.
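import { pipeline } from '@xenova/transformers';
import { cosineSimilarity } from './math';

// Load the embedding model once and reuse it across calls.
// 'Xenova/all-MiniLM-L6-v2' is a MiniLM sentence-embedding model on the Hugging Face Hub.
const embedderPromise = pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');

async function checkSemanticCache(prompt, db) {
  const embedder = await embedderPromise;
  // Mean-pool the token embeddings into one normalized sentence vector.
  const output = await embedder(prompt, { pooling: 'mean', normalize: true });
  const userVector = Array.from(output.data);

  for (const entry of db) {
    const similarity = cosineSimilarity(userVector, entry.vector);
    if (similarity > 0.95) return entry.response;
  }
  return null;
}

Checking Semantic Vector Space...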
Matched: 'What is the price?' (96.2% similarity)
Action: [SERVE_CACHED_RESPONSE]
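
The `cosineSimilarity` helper imported from `./math` is not shown in the lesson. A minimal sketch of what such a helper might look like follows, assuming vectors are plain numeric arrays of equal length. The `findBestMatch` function is a hypothetical variation on the loop above: instead of returning the first entry over the threshold, it returns the closest one.

```javascript
// Cosine similarity of two equal-length numeric vectors:
// dot(a, b) / (|a| * |b|), ranging from -1 (opposite) to 1 (same direction).
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i += 1) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction, so treat it as matching nothing.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the most similar cached entry above the threshold, or null.
// Each entry is assumed to look like { vector: number[], response: string }.
function findBestMatch(userVector, entries, threshold = 0.95) {
  let best = null;
  for (const entry of entries) {
    const similarity = cosineSimilarity(userVector, entry.vector);
    if (similarity > threshold && (best === null || similarity > best.similarity)) {
      best = { response: entry.response, similarity };
    }
  }
  return best;
}
```

A linear scan like this is fine for a few thousand cached questions. Beyond that, a vector index or vector database is the usual replacement, and the threshold is worth tuning against real traffic, since a value that is too low will serve wrong answers.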
3. Rate Limiting & The Sliding Window
If you don't rate limit your endpoints, a single malicious bot or a junior developer with an infinite while loop can run up a ruinous API bill over a weekend. We defend the API using Rate Limiting.
But basic fixed-window limits that reset every minute allow bursts at the boundary: a client can spend its full quota in the last second of one window and again in the first second of the next, briefly doubling the intended rate. Instead, production applications implement the Sliding Window algorithm. It counts requests over a rolling timeframe, so the limit holds across any interval of that length and abuse is blocked as soon as the threshold is crossed.
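import { Ratelimit } from '@upstash/ratelimit';
import { Redis } from '@upstash/redis';

const ratelimit = new Ratelimit({
  redis: Redis.fromEnv(),
  // Allow 10 requests per rolling 10-second window.
  limiter: Ratelimit.slidingWindow(10, '10 s'),
});

export async function POST(req) {
  // x-forwarded-for may hold a comma-separated proxy chain; the first entry is the client.
  const ip = req.headers.get('x-forwarded-for')?.split(',')[0].trim() ?? '127.0.0.1';
  const { success } = await ratelimit.limit(ip);
  if (!success) {
    return Response.json({ error: 'Rate limit exceeded' }, { status: 429 });
  }
  // Under the limit: handle the chat request and return its response here.
}

POST /api/chat - 200 OK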
POST /api/chat - 429 Too Many Requests
{ "error": "Rate limit exceeded" }
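
To make the rolling-window idea concrete, here is a small in-memory sketch of the sliding-window log variant: it stores each client's recent request timestamps and counts only those still inside the window. This is an illustration of the concept, not necessarily how `Ratelimit.slidingWindow` works internally, and the `SlidingWindowLimiter` class and its `check` method are hypothetical names.

```javascript
// Sliding-window log limiter: remember each client's recent request
// timestamps and count only those inside the rolling window.
class SlidingWindowLimiter {
  constructor(limit, windowMs) {
    this.limit = limit;       // max requests allowed per window
    this.windowMs = windowMs; // window length in milliseconds
    this.hits = new Map();    // client id -> ascending timestamps (ms)
  }

  // `now` is injectable so the behavior can be tested without real waiting.
  check(id, now = Date.now()) {
    const windowStart = now - this.windowMs;
    // Drop timestamps that have slid out of the window.
    const recent = (this.hits.get(id) ?? []).filter((t) => t > windowStart);
    const success = recent.length < this.limit;
    // Only accepted requests count against the quota.
    if (success) recent.push(now);
    this.hits.set(id, recent);
    return { success, remaining: this.limit - recent.length };
  }
}

// Usage: 3 requests per rolling second, keyed by client IP.
const limiter = new SlidingWindowLimiter(3, 1000);
const first = limiter.check('203.0.113.7', 0);
```

Because the window moves with every request, there is no reset instant to exploit: a fourth request is refused until the oldest of the previous three is more than one window old. The trade-off is memory, since the log keeps up to `limit` timestamps per client and lives in a single process, which is why a shared store such as Redis is used in production.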
