NATURAL LANGUAGE PROCESSING /// BERT /// CONTEXTUAL EMBEDDINGS /// MASKED LANGUAGE MODELING /// TRANSFORMERS ///

BERT and Contextual Embeddings

Move beyond static vectors. Understand how bidirectional encoders and MLM pre-training give machines a deep understanding of human context.


A.I.D.E.: Word2Vec had a major flaw: it assigned ONE static vector per word, so the 'bank' in 'river bank' and 'bank account' had the exact same mathematical representation.



Static vs Contextual

Word2Vec assigns one vector per word. Contextual models like ELMo and BERT assign vectors based on the entire sentence.


BERT & Contextual Embeddings: Deep Context

Before BERT, a word like 'bank' had one static numeric vector, whether it meant a financial institution or the side of a river. BERT changed everything by looking at the entire sentence bidirectionally before assigning meaning.

The Problem with Word2Vec

Traditional word embeddings (like Word2Vec or GloVe) generate a static vocabulary dictionary. Every word is mapped to a single dense vector. While great for capturing overall semantic similarity (e.g., King - Man + Woman = Queen), it completely fails at polysemy (words with multiple meanings).
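The polysemy failure is easy to see in code. The sketch below uses hypothetical toy vectors (not real Word2Vec output) to show that a static lookup table returns the identical vector for 'bank' regardless of its neighbors:

```python
# Toy static embedding table (hypothetical 2-d vectors for illustration).
# Like Word2Vec/GloVe, it maps each surface form to exactly one vector.
static_embeddings = {
    "river": [0.1, 0.8],
    "bank": [0.5, 0.3],
    "account": [0.9, 0.2],
}

def embed(sentence):
    """Look up one fixed vector per word; context plays no role."""
    return [static_embeddings[w] for w in sentence.split()]

bank_near_river = embed("river bank")[1]
bank_near_account = embed("bank account")[0]
print(bank_near_river == bank_near_account)  # True: same vector, context ignored
```

A contextual model would instead produce two different vectors for 'bank' here, one pulled toward geography and one toward finance.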

Enter Bidirectional Transformers

BERT utilizes the Encoder stack of the Transformer architecture. Unlike previous models (like early RNNs) that read text sequentially (left-to-right), BERT reads the entire sequence of words at once. This mechanism is known as bidirectional, though it's more accurately described as non-directional.
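The mechanism that makes this "non-directional" reading possible is self-attention: every token position computes a weighted mixture over all other positions at once, with no left-to-right ordering. A minimal single-head sketch (random toy weights, not trained BERT parameters):

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Single-head self-attention with no causal mask: every position
    attends to every other position, left AND right, in one step."""
    d = x.shape[-1]
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(d)                   # (seq, seq) token affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over ALL tokens
    return weights @ v, weights

rng = np.random.default_rng(0)
d_model, seq_len = 8, 5
wq, wk, wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
x = rng.normal(size=(seq_len, d_model))             # 5 toy token embeddings
out, attn = self_attention(x, wq, wk, wv)
print(attn.shape)                                   # (5, 5)
```

Each row of `attn` is a full probability distribution over the whole sequence, which is why the output vector for a token like 'bank' can absorb information from words on both sides.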

NLP FAQ

What is Masked Language Modeling (MLM)?

MLM is BERT's primary pre-training objective. During training, 15% of the input tokens are randomly selected for prediction; of those, 80% are replaced with a `[MASK]` token, 10% with a random token, and 10% are left unchanged. The objective is to predict the original vocabulary id of each selected token based ONLY on its context. This forces the model to learn deep bidirectional representations.
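The masking recipe can be sketched in a few lines. This is a simplified illustration (toy string tokens, no special-token handling, hypothetical `mask_tokens` helper), not the actual BERT data pipeline:

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """BERT-style masking sketch: select ~mask_prob of tokens for prediction;
    of those, 80% -> [MASK], 10% -> random token, 10% -> unchanged."""
    rng = random.Random(seed)
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok                    # model must recover the original
            roll = rng.random()
            if roll < 0.8:
                inputs[i] = "[MASK]"
            elif roll < 0.9:
                inputs[i] = rng.choice(vocab)  # random replacement
            # else: token kept as-is (but still predicted)
    return inputs, labels

tokens = "the river bank was steep and muddy".split()
inputs, labels = mask_tokens(tokens, vocab=list(set(tokens)), mask_prob=0.5)
print(inputs, labels)
```

The 10% random / 10% unchanged cases exist because `[MASK]` never appears at fine-tuning time; leaking some real tokens into masked positions narrows that train/test mismatch.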

What is Next Sentence Prediction (NSP)?

NSP is a binary classification task used alongside MLM. BERT is fed pairs of sentences (A and B). 50% of the time, B is the actual next sentence that follows A (labeled IsNext). 50% of the time, it is a random sentence from the corpus (labeled NotNext). This helps BERT understand relationships between sentences, crucial for tasks like Question Answering.
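Building NSP training pairs is mechanical enough to sketch directly. Assumptions: a toy list-of-sentences corpus and a hypothetical `make_nsp_pairs` helper; real BERT samples the negative sentence from a different document, which this sketch simplifies to "any other sentence":

```python
import random

def make_nsp_pairs(corpus, seed=0):
    """For each adjacent sentence pair, flip a coin: keep the true
    successor (IsNext) or substitute a random sentence (NotNext)."""
    rng = random.Random(seed)
    pairs = []
    for i in range(len(corpus) - 1):
        a = corpus[i]
        if rng.random() < 0.5:
            pairs.append((a, corpus[i + 1], "IsNext"))
        else:
            # Simplified negative sampling: any sentence except the true next one.
            b = rng.choice([s for j, s in enumerate(corpus) if j != i + 1])
            pairs.append((a, b, "NotNext"))
    return pairs

corpus = ["It was raining.", "She opened her umbrella.",
          "The bus was late.", "He checked his watch."]
for a, b, label in make_nsp_pairs(corpus, seed=1):
    print(label, "|", a, "->", b)
```

Each (A, B, label) triple is then packed as `[CLS] A [SEP] B [SEP]`, and the `[CLS]` output vector feeds the binary IsNext/NotNext classifier.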

NLP Glossary

BERT
Bidirectional Encoder Representations from Transformers. A pre-trained language model developed by Google.
Contextual Embedding
A vector representation of a word that changes dynamically based on the surrounding text.
MLM (Masked Language Modeling)
Training task where the model predicts randomly hidden tokens in a sentence.
[CLS] & [SEP]
Special tokens. [CLS] is prepended for classification tasks. [SEP] separates sentences.