Natural Language Processing: Teaching Machines to Read
Computers excel at crunching numbers, but human language is messy, ambiguous, and heavily dependent on context. NLP is the art of structuring this chaos so algorithms can extract meaning.
1. Tokenization: The First Slice
Before an algorithm can understand a paragraph, the text must be split into digestible units known as tokens. While rudimentary tokenization simply splits on whitespace, robust tokenizers handle punctuation, contractions, and compound words so that structure (and therefore meaning) is not lost in the split.
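A minimal sketch of the idea using only Python's standard library: the regex below keeps contractions like "don't" as single tokens and splits punctuation into its own tokens. (This is an illustrative toy, not a production tokenizer such as spaCy's or a subword tokenizer.)

```python
import re

def tokenize(text):
    # Match: a word (optionally with an internal apostrophe, e.g. "don't"),
    # or a run of digits, or any single non-space punctuation character.
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?|\d+|[^\w\s]", text)

print(tokenize("Don't panic, it's only NLP!"))
# → ["Don't", 'panic', ',', "it's", 'only', 'NLP', '!']
```

Note how a naive `text.split()` would have produced the token `"panic,"` with the comma glued on, polluting the vocabulary with near-duplicates.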
2. Noise Reduction: Stop Words
Words like "and", "the", and "is" are highly frequent but carry minimal semantic weight. In traditional NLP workflows (like TF-IDF or Bag of Words), these are removed. This process significantly shrinks the vector space dimensions, saving computational power and focusing the model on the actual 'subject' words.
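A small sketch of this filtering step. The stop-word set here is a tiny illustrative subset chosen for the example; real pipelines use larger curated lists (e.g. the ones shipped with NLTK or scikit-learn).

```python
# Illustrative subset of English stop words (real lists contain 100+ entries).
STOP_WORDS = {"and", "the", "is", "a", "an", "of", "to", "in", "on"}

def remove_stop_words(tokens):
    # Keep only tokens that carry semantic weight.
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["the", "cat", "is", "on", "the", "mat"]))
# → ['cat', 'mat']
```

Six tokens collapse to two, which is exactly the dimensionality saving described above when this happens across an entire corpus.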
3. Normalization: Stemming & Lemmatization
"Run", "Running", and "Ran" all stem from the same core concept. Stemming aggressively chops off the ends of words to find the root (often resulting in non-words like "runn"). Lemmatization uses a dictionary to intelligently map words back to their actual base dictionary form (the lemma).
❓ Foundational NLP FAQs
What is the difference between NLP, NLU, and NLG?
NLP (Natural Language Processing): The umbrella term for all computer-language interactions.
NLU (Natural Language Understanding): A subfield focusing on machine reading comprehension (understanding intent and context).
NLG (Natural Language Generation): A subfield focusing on computers writing/producing human-like text (like ChatGPT generating this sentence).
Why do we remove Stop Words?
In traditional machine learning, each unique word becomes a column (feature) in a document-term matrix. Stop words are so common that they dilute the significance of important keywords while inflating the matrix size. However, note that in modern Deep Learning (Transformers/BERT), stop words are usually kept, because they provide critical grammatical context!
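The "each word becomes a column" point above can be made concrete by counting vocabulary size with and without stop words. This is a bag-of-words sketch over a two-document toy corpus (the stop-word set is again an illustrative subset):

```python
STOP_WORDS = {"the", "on", "and", "is"}  # illustrative subset

def vocabulary(docs, drop_stop_words=False):
    # Each entry in the returned vocabulary would become one
    # column in a Bag-of-Words / TF-IDF matrix.
    vocab = set()
    for doc in docs:
        for word in doc.lower().split():
            if drop_stop_words and word in STOP_WORDS:
                continue
            vocab.add(word)
    return sorted(vocab)

docs = ["the cat sat on the mat", "the dog and the cat"]
print(len(vocabulary(docs)))                        # → 7 columns
print(len(vocabulary(docs, drop_stop_words=True)))  # → 4 columns
```

Even on two short sentences the matrix shrinks from 7 features to 4; on a real corpus with millions of documents, the savings are substantial.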