BERT & Contextual Embeddings: Deep Context
Before BERT, a word like 'bank' had one static numeric vector, whether it meant a financial institution or the side of a river. BERT changed everything by looking at the entire sentence bidirectionally before assigning meaning.
The Problem with Word2Vec
Traditional word embeddings (like Word2Vec or GloVe) produce a static lookup table: every word in the vocabulary is mapped to a single dense vector. While this works well for capturing overall semantic similarity (e.g., King - Man + Woman ≈ Queen), the approach fails completely at polysemy (words with multiple meanings).
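To make the failure concrete, here is a minimal sketch using a made-up 3-dimensional embedding table (the vectors are illustrative, not trained): because lookup is per-word, 'bank' receives exactly the same vector in a financial and a riverside sentence.

```python
# Toy static embedding table (hypothetical, untrained 3-d vectors).
static_embeddings = {
    "bank":  [0.8, 0.1, 0.3],
    "river": [0.1, 0.9, 0.2],
    "money": [0.7, 0.2, 0.4],
}

def embed(sentence):
    """Look up each known word in the static table -- context is ignored."""
    return [static_embeddings[w] for w in sentence.split() if w in static_embeddings]

financial = embed("deposit money at the bank")
riverside = embed("sat on the river bank")

# The vector for 'bank' is identical in both contexts:
assert financial[-1] == riverside[-1]
```

A contextual model like BERT would instead produce two different vectors for 'bank' here, one per sentence.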
Enter Bidirectional Transformers
BERT utilizes the Encoder stack of the Transformer architecture. Unlike previous models (like early RNNs) that read text sequentially (left-to-right), BERT reads the entire sequence of words at once. This mechanism is known as bidirectional, though it's more accurately described as non-directional.
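The "non-directional" point can be seen in the core self-attention computation: each token's new representation is a similarity-weighted average over every position in the sequence, left and right alike. The sketch below is a stripped-down, single-head, unscaled dot-product attention in pure Python (no learned projections), just to show the information flow:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(vectors):
    """Each output vector is a weighted average over ALL positions
    (left AND right) -- there is no notion of reading direction."""
    out = []
    for q in vectors:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in vectors]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, vectors))
                    for d in range(len(q))])
    return out

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextual = self_attention(tokens)

# The first token's output already mixes in tokens to its RIGHT:
assert contextual[0] != tokens[0]
```

A left-to-right model could only have produced the first output from the first input; here every output depends on the whole sequence.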
❓ NLP FAQ
What is Masked Language Modeling (MLM)?
MLM is BERT's primary pre-training objective. During training, 15% of the input tokens are randomly selected for prediction; in the original recipe, 80% of those are replaced with a `[MASK]` token, 10% are swapped for a random token, and 10% are left unchanged. The objective is to predict the original vocabulary ID of each selected token based ONLY on its context. This forces the model to learn deep bidirectional representations.
What is Next Sentence Prediction (NSP)?
NSP is a binary classification task used alongside MLM. BERT is fed pairs of sentences (A and B). 50% of the time, B is the actual sentence that follows A (labeled IsNext); the other 50% of the time, it is a random sentence from the corpus (labeled NotNext). This helps BERT learn relationships between sentences, which is crucial for downstream tasks like Question Answering.
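The pair-construction logic can be sketched as follows (the function and label strings mirror the description above; this is an illustrative data-prep helper, not BERT's actual preprocessing code):

```python
import random

def make_nsp_pairs(sentences, rng):
    """Build (sentence_a, sentence_b, label) pairs: half the time B is the
    true next sentence ('IsNext'), otherwise a random one ('NotNext')."""
    pairs = []
    for i in range(len(sentences) - 1):
        a = sentences[i]
        if rng.random() < 0.5:
            pairs.append((a, sentences[i + 1], "IsNext"))
        else:
            b = rng.choice(sentences)
            while b == sentences[i + 1]:       # resample if we accidentally
                b = rng.choice(sentences)      # picked the true next sentence
            pairs.append((a, b, "NotNext"))
    return pairs

corpus = ["BERT reads text.", "It uses attention.",
          "NSP pairs sentences.", "Labels are binary."]
pairs = make_nsp_pairs(corpus, random.Random(42))
```

Each pair is then packed as `[CLS] A [SEP] B [SEP]`, and the `[CLS]` representation is fed to the binary classifier.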