NLP Preprocessing: Cleaning the Noise

Pascual Vila
Data Science Instructor // Code Syllabus
Garbage in, garbage out. The performance of any NLP model, from a simple sentiment analyzer to an advanced LLM like GPT, depends heavily on how the text is cleaned and tokenized before training.
Standardizing Text (Lowercasing)
Language is messy. A human knows that "Bank", "BANK", and "bank" all refer to the same financial institution. However, to a computer, these are completely different ASCII/Unicode strings.
By applying a simple `.lower()` transformation across our entire corpus (dataset), we drastically reduce the vocabulary size the machine learning model has to process, preventing data sparsity issues.
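To make the effect concrete, here is a minimal sketch (with a made-up three-document corpus) showing how `.lower()` collapses case variants and shrinks the vocabulary:

```python
# Illustrative corpus: "The", "Bank", "BANK", and "bank" are distinct strings.
corpus = ["The Bank raised rates.", "BANK stocks fell.", "I walked to the bank."]

# Vocabulary before normalization (naive whitespace split).
raw_vocab = {word for doc in corpus for word in doc.split()}

# After .lower(), the case variants collapse into single entries.
normalized = [doc.lower() for doc in corpus]
clean_vocab = {word for doc in normalized for word in doc.split()}

print(len(raw_vocab), len(clean_vocab))  # the normalized vocabulary is smaller
```

Even on three short sentences the vocabulary shrinks; on a corpus of millions of documents the reduction is substantial.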
Removing Punctuation
Unless you are building a model specifically designed to read emotion, where exclamation points matter, punctuation is generally treated as "noise."
We typically use Regular Expressions (regex) to strip out non-alphanumeric characters. Leaving punctuation attached to words creates spurious distinct tokens (e.g., "apple" vs. "apple,").
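A one-line regex is usually enough. The sketch below (the function name is mine) removes every character that is not a word character or whitespace:

```python
import re

def strip_punctuation(text: str) -> str:
    # \w matches letters, digits, and underscore; \s matches whitespace.
    # Everything else (commas, periods, exclamation points, ...) is removed.
    return re.sub(r"[^\w\s]", "", text)

print(strip_punctuation("I ate an apple, then another apple!"))
# -> I ate an apple then another apple
```

After this step, "apple," and "apple" tokenize identically.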
Tokenization: Slicing the Data
Tokenization is the process of breaking a document down into smaller chunks called "tokens." A token can be a word, a subword, or even a single character.
- Word Tokenization: Splitting on whitespace. Simple, but it struggles with languages written without spaces (such as Chinese) and with compound words.
- Subword Tokenization: Used by modern LLMs. It breaks down unknown words into known chunks (e.g., "unfriendly" -> "un", "friend", "ly").
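The two strategies above can be sketched side by side. The word tokenizer is just a whitespace split; the subword tokenizer below uses a tiny hand-picked vocabulary and greedy longest-match lookup purely for illustration (real subword tokenizers learn their vocabulary from data):

```python
def word_tokenize(text: str) -> list[str]:
    # Word-level: split on whitespace.
    return text.split()

# Hypothetical fixed subword vocabulary for this demo.
SUBWORDS = {"un", "friend", "ly"}

def subword_tokenize(word: str) -> list[str]:
    tokens, i = [], 0
    while i < len(word):
        # Greedily take the longest known chunk starting at position i.
        for j in range(len(word), i, -1):
            if word[i:j] in SUBWORDS:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # fall back to a single character
            i += 1
    return tokens

print(word_tokenize("this is unfriendly"))  # ['this', 'is', 'unfriendly']
print(subword_tokenize("unfriendly"))       # ['un', 'friend', 'ly']
```

Note how the subword version never fails on an unseen word: worst case, it falls back to individual characters.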
🤖 NLP Architecture FAQs
Why is tokenization necessary for Large Language Models (LLMs)?
LLMs (like GPT or BERT) process vectors of numbers, not raw text. Tokenization is the vital bridge between text and math. By splitting text into manageable tokens, we can assign a unique ID to each token. These IDs are then mapped to high-dimensional embeddings (vectors) that capture semantic meaning. Without tokenization, neural networks couldn't ingest language data efficiently.
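A minimal sketch of that text → IDs → vectors pipeline, with a toy vocabulary, embedding size, and random vectors standing in for the weights a real model would learn:

```python
import random

# Toy vocabulary mapping tokens to integer IDs; <unk> catches unseen words.
vocab = {"the": 0, "cat": 1, "sat": 2, "<unk>": 3}
EMBED_DIM = 4

random.seed(0)
# One small random vector per token ID (real models learn these weights).
embeddings = [[random.random() for _ in range(EMBED_DIM)] for _ in vocab]

def encode(text: str) -> list[int]:
    # Map each token to its integer ID, unknown words to <unk>.
    return [vocab.get(tok, vocab["<unk>"]) for tok in text.lower().split()]

ids = encode("The cat sat")          # [0, 1, 2]
vectors = [embeddings[i] for i in ids]  # what the network actually consumes
```

The model never sees "cat"; it sees ID 1, and then the 4-dimensional vector stored at row 1 of the embedding table.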
What is the difference between Word and Subword tokenization?
Word Tokenization splits text based on delimiters (like spaces). It results in a massive vocabulary size and suffers from the "Out Of Vocabulary" (OOV) problem when encountering new words.
Subword Tokenization (e.g., Byte-Pair Encoding or WordPiece) breaks rare words into smaller, frequently used sub-components. This solves the OOV problem and drastically reduces the vocabulary size, which is why virtually all modern LLMs use subword tokenizers.
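To give a feel for how such a vocabulary is learned, here is a toy sketch of a single Byte-Pair Encoding (BPE) step on a made-up corpus: count adjacent symbol pairs, then merge the most frequent pair into a new symbol. Real BPE repeats this thousands of times.

```python
from collections import Counter

# Each word starts as a list of characters.
corpus = [list("lower"), list("lowest"), list("low")]

def most_frequent_pair(words):
    # Count every adjacent symbol pair across the corpus.
    pairs = Counter()
    for word in words:
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    # Replace each occurrence of the pair with one fused symbol.
    merged = []
    for word in words:
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged.append(out)
    return merged

pair = most_frequent_pair(corpus)   # a frequent pair such as ('l', 'o')
corpus = merge_pair(corpus, pair)   # that pair now acts as one symbol
```

Iterating this merge step is what builds up subword units like "low" or "est" from raw characters.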
Should I always lowercase text before training?
Not always. For basic Bag-of-Words models or generic sentiment analysis, lowercasing is recommended to reduce vocabulary size. However, for Named Entity Recognition (NER), capitalization is a critical feature (e.g., knowing "Apple" is a company vs "apple" the fruit). Advanced models like BERT offer both "cased" and "uncased" versions depending on the task.