A word is defined by its neighbors. BERT brought the concept of 'Context' into the mathematical heart of AI.
1Static vs. Contextual Embeddings
Earlier NLP models like Word2Vec had a fatal mathematical flaw: a single word could only ever map to a single vector. This meant the model could not differentiate between the 'bank' of a river and a financial 'bank'.
BERT (Bidirectional Encoder Representations from Transformers) fixed this polysemy problem. By reading the entire sentence at once, BERT calculates the meaning of each word based on the words that surround it. The vector for 'bank' becomes dynamic, completely altering its mathematical representation based on its neighbors.
"""
Word2Vec: 'bank' -> [0.12, 0.88, ...]
BERT:
'river bank' -> [0.99, 0.01, ...]
'bank vault' -> [-0.45, 0.77, ...]
"""2Pretrained Base
BERT is massive. Training it from scratch requires immense computational power and mountains of text data.
Fortunately, Google released Pretrained Base models (like bert-base-uncased). This means you don't start from zero. You load a model that has already read the entire English Wikipedia and BookCorpus, possessing a world-class understanding of grammar and syntax straight out of the box.
from transformers import BertModel, BertTokenizer
# Load world-class intelligence in 2 lines
model = BertModel.from_pretrained('bert-base-uncased')
tok = BertTokenizer.from_pretrained('bert-base-uncased')3Masked Language Modeling (MLM)
How did BERT learn all of this context? Through a clever training trick called Masked Language Modeling (MLM).
During training, 15% of the input words were randomly replaced with a special [MASK] token. The model was then forced to predict the missing word by looking at the context from both the left and the right sides. This bidirectional guessing game is what forced the neural network to develop a deep understanding of syntax and semantics.
# Masked Language Modeling (MLM)
text = "The [MASK] chased the mouse."
# BERT must predict 'cat' using bidirectionality.4Next Sentence Prediction (NSP)
Understanding single sentences is great, but real-world language involves paragraphs and discourse. BERT was also trained using Next Sentence Prediction (NSP).
The model is fed two sentences (Sentence A and Sentence B) and must predict whether B naturally follows A, or if it's just a random sentence from another document. This allows BERT to grasp the logical flow of arguments, conversations, and long-form text.
# Next Sentence Prediction (NSP)
A = "He went to the store."
B = "He bought some milk."
prediction = bert_predict_next(A, B) # Returns True5WordPiece Tokenization
Human vocabulary is technically infinite due to prefixes, suffixes, and compound words. If BERT tried to memorize every word, it would run out of memory.
To solve this, BERT uses WordPiece tokenization. It breaks complex or unknown words down into smaller, recognizable sub-words. For example, 'unbelievable' might become 'un', 'believe', and '##able'. This ensures BERT never encounters an 'Out Of Vocabulary' error, allowing it to process typos, slang, and novel words gracefully.
# WordPiece Sub-word tokenization
raw = "unbelievable"
tokens = tokenizer.tokenize(raw)
print(tokens) # ['un', 'believe', '##able']