An AI doesn't have a hard drive; it has a 'Reading Desk'. If the desk is full, you can't add more papers without taking some away.
1. The Boundary of Intelligence
Think of an AI model like a brilliant assistant who unfortunately has a very strict limit on how much they can hold in their working memory. This critical constraint is called the Context Window.
If your prompt exceeds the model's token limit, the API will typically reject the request with an error (often an HTTP 400) instead of processing it. To avoid this, count the tokens in your prompt with a tokenizer such as OpenAI's tiktoken BEFORE you send it. The count is only exact when the encoding you pick matches the model you are calling.
import { get_encoding } from 'tiktoken';

// Always count tokens before sending
function checkTokenLimit(promptText, limit = 8192) {
  // cl100k_base matches GPT-4 and GPT-3.5; newer models such as gpt-4o use o200k_base
  const encoder = get_encoding('cl100k_base');
  const count = encoder.encode(promptText).length;
  encoder.free(); // release the encoder's memory once the count is known
  if (count > limit) {
    throw new Error(`Token limit exceeded: ${count} / ${limit}`);
  }
  return true;
}
Counted before sending -> Token Count: 2
Context Used: 125,000 / 128,000
Status: [CRITICAL_WARNING]
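The readout above can be turned into a small check that runs after counting. This is a minimal sketch, not a provider-defined rule: the 80% and 95% thresholds, the status names other than CRITICAL_WARNING, and the `reservedForReply` parameter are all assumptions chosen for illustration. It also assumes the model's reply shares the same window as the prompt, so some room is set aside for it.

```javascript
// Hypothetical thresholds; tune them for your own application.
const WARN_RATIO = 0.8;
const CRITICAL_RATIO = 0.95;

// Classify how full the context window is.
// usedTokens: tokens already in the prompt; windowSize: the model's limit.
function contextStatus(usedTokens, windowSize) {
  const ratio = usedTokens / windowSize;
  if (ratio >= 1) return 'OVER_LIMIT';
  if (ratio >= CRITICAL_RATIO) return 'CRITICAL_WARNING';
  if (ratio >= WARN_RATIO) return 'WARNING';
  return 'OK';
}

// Tokens left for the prompt once room for the reply is set aside
// (assumption: prompt and reply share one window).
function inputBudget(windowSize, reservedForReply) {
  return Math.max(0, windowSize - reservedForReply);
}

console.log(contextStatus(125000, 128000)); // CRITICAL_WARNING
console.log(inputBudget(128000, 4096)); // 123904
```

Checking against `inputBudget(...)` rather than the raw window size leaves the model space to answer, which a bare limit check does not.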
2. Pruning & Truncation
When a conversation reaches the token limit, you have to 'Prune' it. The simplest method is FIFO (First-In, First-Out) Truncation: delete the oldest messages in the chat array until what remains fits.
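A minimal sketch of plain FIFO truncation, assuming messages are objects with `role` and `content` fields and that the limit is expressed as a message count (`maxMessages` is a hypothetical parameter name):

```javascript
// Plain FIFO truncation: drop the oldest messages until at most
// maxMessages remain. Returns a new array; the input is left untouched.
function fifoTruncate(messages, maxMessages) {
  if (maxMessages <= 0) return [];
  return messages.slice(-maxMessages);
}

const chat = [
  { role: 'system', content: 'You are a helpful assistant.' },
  { role: 'user', content: 'Message 1' },
  { role: 'assistant', content: 'Message 2' },
  { role: 'user', content: 'Message 3' },
];

console.log(fifoTruncate(chat, 2).map(m => m.content));
// [ 'Message 2', 'Message 3' ]
```

Because the function only looks at position, the message at index 0 is always the first to go, whatever it contains.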
However, simple FIFO has a serious flaw: the first message is usually the System Prompt, so deleting it makes the AI forget its core instructions. To solve this, we use Importance-based Pruning. We pin the critical System Prompt to the top of the array so it is never deleted, and prune only the messages that come after it.
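// Importance-based Pruning
function pruneMessages(messages, maxRetained = 5) {
  if (messages.length === 0) return [];
  // Keep the critical system prompt (index 0)
  const [systemPrompt, ...history] = messages;
  // Keep only the N most recent user/assistant messages.
  // Slicing `history` instead of `messages` avoids duplicating the system
  // prompt when the conversation is shorter than maxRetained.
  const recentMessages = maxRetained > 0 ? history.slice(-maxRetained) : [];
  // Reconstruct the array
  return [systemPrompt, ...recentMessages];
}
[DELETED] Message 1 (Oldest)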
[DELETED] Message 2
[KEPT] Message 3
[KEPT] Message 4 (Newest)
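Pruning by message count ignores how long each message is. A variant that prunes against a token budget might look like the sketch below. `estimateTokens` and `pruneToBudget` are hypothetical names, and the estimate of about 4 characters per token is only a rough rule of thumb for English text; pass a real tokenizer as `countTokens` when accuracy matters.

```javascript
// Rough token estimate (assumption: ~4 characters per token).
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Keep the system prompt (index 0), then add messages from newest to
// oldest until the token budget is used up.
function pruneToBudget(messages, maxTokens, countTokens = estimateTokens) {
  if (messages.length === 0) return [];
  const [systemPrompt, ...rest] = messages;
  let used = countTokens(systemPrompt.content);
  const kept = [];
  for (let i = rest.length - 1; i >= 0; i--) {
    const cost = countTokens(rest[i].content);
    if (used + cost > maxTokens) break; // everything older is dropped too
    used += cost;
    kept.unshift(rest[i]); // restore chronological order
  }
  return [systemPrompt, ...kept];
}

const history = [
  { role: 'system', content: 'Be brief.' },       // 3 estimated tokens
  { role: 'user', content: 'a'.repeat(40) },      // 10 estimated tokens
  { role: 'assistant', content: 'b'.repeat(40) }, // 10 estimated tokens
  { role: 'user', content: 'c'.repeat(40) },      // 10 estimated tokens
];

console.log(pruneToBudget(history, 25).length); // 3
```

The loop stops at the first message that does not fit instead of skipping it, so the retained history stays contiguous.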
3. Recursive Summarization
For long-running interactions, a widely used approach is Recursive Summarization. Instead of deleting old messages and losing them entirely, we periodically ask the AI to summarize the older part of the conversation into a single, dense paragraph.
We then inject that summary back into the prompt in place of the messages it covers. This saves a large amount of space while preserving the core context, and it lowers cost because you no longer pay to send 100,000 tokens with every request. The trade-off is that a summary is lossy: any detail the summarizer leaves out is gone.
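// Recursive Summarization strategy
// `ai` stands for whatever model client you use; `ai.generate` is assumed
// here to resolve to the summary text as a string.
async function compressHistory(oldMessages) {
  // Keep the speaker labels so the summarizer can tell who said what
  const historyText = oldMessages.map(m => `${m.role}: ${m.content}`).join('\n');
  const summary = await ai.generate({
    model: "gpt-4o-mini", // Use cheap model for summarization
    prompt: `Summarize the following conversation:\n${historyText}`
  });
  return [{ role: 'system', content: `Context: ${summary}` }];
}
[50 Long Messages]
⬇️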
[1 Short Summary Paragraph]
Cost Reduction: ACTIVE
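One way to apply this periodically is to compress only when the history grows past a threshold, keeping the system prompt and the latest few messages verbatim. The sketch below is an illustration under assumptions: `compressIfNeeded`, `maxMessages`, and `keepRecent` are hypothetical names, and `summarize` is a function you supply (text in, summary string out), backed by whatever model call you use. A stub stands in for it here so the example runs without an API.

```javascript
// Compress everything except the system prompt and the last `keepRecent`
// messages into one summary message.
async function compressIfNeeded(messages, summarize, { maxMessages = 20, keepRecent = 4 } = {}) {
  if (messages.length <= maxMessages) return messages; // nothing to do yet

  const [systemPrompt, ...rest] = messages;
  const cut = Math.max(0, rest.length - keepRecent);
  const older = rest.slice(0, cut);
  const recent = rest.slice(cut);
  if (older.length === 0) return messages; // nothing old enough to summarize

  // An earlier summary message is part of `older`, so it is folded into
  // the new summary; that is what makes the scheme recursive.
  const transcript = older.map(m => `${m.role}: ${m.content}`).join('\n');
  const summary = await summarize(transcript);

  return [
    systemPrompt,
    { role: 'system', content: `Summary of earlier conversation: ${summary}` },
    ...recent,
  ];
}

// Stub summarizer for demonstration only; a real one would call a model.
const fakeSummarize = async (text) => `${text.split('\n').length} messages condensed`;

const longChat = [{ role: 'system', content: 'Be brief.' }];
for (let i = 1; i <= 10; i++) {
  longChat.push({ role: i % 2 ? 'user' : 'assistant', content: `msg ${i}` });
}

compressIfNeeded(longChat, fakeSummarize, { maxMessages: 6, keepRecent: 2 })
  .then(result => console.log(result.map(m => m.content)));
```

Keeping the most recent messages untouched means the model still sees the exact wording of the current exchange, while only the older material is reduced to a summary.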
