What does 'Bidirectional' mean in BERT?

It means the model processes the entire sequence at once, so the representation of a target word can draw on the words before and after it simultaneously. Older unidirectional models, such as a standard left-to-right RNN, read text in one direction only, so at each position they miss the context that comes later.

Why does WordPiece tokenization use '##' hashes?

The '##' prefix indicates that a token is a continuation of the previous token, not a new standalone word. It helps the model reconstruct the original word later (e.g., 'play' + '##ing' = 'playing').

Do I have to train BERT from scratch?

Almost never. You download a 'Pretrained' BERT model that already understands language, and then you 'Fine-Tune' it on a much smaller dataset specific to your task (like legal contract analysis or medical sentiment).

BERT & Contextual Embeddings in AI & Artificial Intelligence

This tutorial covers BERT and contextual embeddings. It explains how BERT uses Masked Language Modeling and Next Sentence Prediction to build deep, context-dependent representations of language, and how its WordPiece tokenization lets a fixed vocabulary cover an open-ended range of words.

Quick Quiz //

How does BERT fundamentally solve the problem of polysemy (words with multiple meanings)?

A word is defined by its neighbors. BERT brought the concept of 'Context' into the mathematical heart of AI.

1. Static vs. Contextual Embeddings

Earlier NLP models like Word2Vec had a built-in limitation: each word maps to exactly one vector, regardless of the sentence it appears in. This means the model cannot differentiate between the 'bank' of a river and a financial 'bank'.

BERT (Bidirectional Encoder Representations from Transformers) addresses this polysemy problem. By reading the entire sentence at once, BERT computes the representation of each word from the words that surround it. The vector for 'bank' becomes dynamic: the same word receives a different vector depending on its neighbors.

"""
Word2Vec: 'bank' -> [0.12, 0.88, ...]

BERT: 
'river bank' -> [0.99, 0.01, ...]
'bank vault' -> [-0.45, 0.77, ...]
"""
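The contrast can be sketched with a toy example. This is not BERT: the two-dimensional vectors are hand-picked for illustration, and `contextual_embed` is a hypothetical helper that applies a single round of dot-product self-attention with no learned weights. It only shows why mixing in the neighbors gives 'bank' a different vector in each sentence.

```python
import math

# Hypothetical hand-picked 2-d static vectors, for illustration only.
STATIC = {
    "river": [1.0, 0.0],
    "bank": [0.5, 0.5],
    "vault": [0.0, 1.0],
}

def static_embed(tokens):
    """Word2Vec-style lookup: one fixed vector per word, whatever the sentence."""
    return [STATIC[t] for t in tokens]

def contextual_embed(tokens):
    """One round of dot-product self-attention with no learned weights.

    Each output vector is a softmax-weighted average of every vector in the
    sentence, so it depends on the neighbors as well as on the word itself.
    """
    vecs = static_embed(tokens)
    out = []
    for query in vecs:
        scores = [sum(q * k for q, k in zip(query, key)) for key in vecs]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w / total * v[d] for w, v in zip(weights, vecs))
            for d in range(len(query))
        ])
    return out

river_sentence = ["river", "bank"]
vault_sentence = ["bank", "vault"]

# Static lookup: 'bank' gets the same vector in both sentences.
print(static_embed(river_sentence)[1] == static_embed(vault_sentence)[0])

# Contextual: the vector for 'bank' shifts toward its neighbor.
bank_near_river = contextual_embed(river_sentence)[1]
bank_near_vault = contextual_embed(vault_sentence)[0]
print(bank_near_river, bank_near_vault)
```

A real BERT model does the same kind of mixing with learned projections, multiple attention heads, and twelve or more stacked layers.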

2. Pretrained Base

BERT is massive. Training it from scratch requires immense computational power and mountains of text data.

Fortunately, Google released pretrained base models (like bert-base-uncased), so you don't start from zero. You load a model that has already been trained on English Wikipedia and BookCorpus and has picked up a strong general grasp of grammar and syntax before it sees any of your own data.

from transformers import BertModel, BertTokenizer

# Load the pretrained weights and the matching tokenizer in 2 lines
model = BertModel.from_pretrained('bert-base-uncased')
tok = BertTokenizer.from_pretrained('bert-base-uncased')
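To get a feel for how large the base model is, the parameter count can be worked out by hand. The sketch below uses the published bert-base hyperparameters (12 layers, hidden size 768, feed-forward size 3072, a 30,522-token vocabulary) and assumes the usual layout of embeddings, encoder layers, and a pooling layer, without any pretraining heads. The variable names are illustrative, not part of any library.

```python
# Published bert-base hyperparameters.
VOCAB = 30522          # WordPiece vocabulary size of bert-base-uncased
HIDDEN = 768           # size of every token vector
LAYERS = 12            # number of Transformer encoder layers
INTERMEDIATE = 3072    # inner size of each feed-forward block
MAX_POSITIONS = 512    # longest supported input, in tokens
SEGMENT_TYPES = 2      # sentence A / sentence B embeddings

def linear(n_in, n_out):
    """Parameters in a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out

LAYER_NORM = 2 * HIDDEN  # one scale and one shift value per dimension

# Token, position, and segment embedding tables, followed by a LayerNorm.
embeddings = (VOCAB + MAX_POSITIONS + SEGMENT_TYPES) * HIDDEN + LAYER_NORM

# Self-attention: query, key, value, and output projections, then a LayerNorm.
attention = 4 * linear(HIDDEN, HIDDEN) + LAYER_NORM

# Feed-forward block: expand, project back, then a LayerNorm.
feed_forward = (
    linear(HIDDEN, INTERMEDIATE) + linear(INTERMEDIATE, HIDDEN) + LAYER_NORM
)

# Dense layer applied to the [CLS] vector.
pooler = linear(HIDDEN, HIDDEN)

total = embeddings + LAYERS * (attention + feed_forward) + pooler
print(f"{total:,} parameters")  # 109,482,240 parameters
```

Under these assumptions the total lands at roughly 110 million parameters, and about a fifth of them sit in the token embedding table alone.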

3. Masked Language Modeling (MLM)

How did BERT learn all of this context? Through a clever training trick called Masked Language Modeling (MLM).

During training, 15% of the input tokens were randomly selected for prediction. Most of the selected tokens (80%) were replaced with a special [MASK] token, 10% were replaced with a random token, and 10% were left unchanged. The model was then forced to predict the original token at each selected position by looking at the context on both the left and the right. This bidirectional guessing game is what pushed the neural network to develop a deep understanding of syntax and semantics.

# Masked Language Modeling (MLM)

text = "The [MASK] chased the mouse."
# BERT must predict 'cat' using bidirectionality.
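The corruption step can be sketched in plain Python. In the original BERT recipe, about 15% of the tokens are selected as prediction targets; of those, 80% become [MASK], 10% become a random token, and 10% stay as they are. The function name `mask_tokens`, the tiny replacement vocabulary, and the fixed seed are assumptions made for this illustration, not part of any library.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_inputs, labels) for one MLM training example.

    labels[i] holds the original token at positions the model must predict
    and None everywhere else.
    """
    rng = random.Random(seed)
    n_select = max(1, round(len(tokens) * mask_prob))
    selected = set(rng.sample(range(len(tokens)), n_select))

    inputs, labels = [], []
    for i, token in enumerate(tokens):
        if i not in selected:
            inputs.append(token)   # untouched position, nothing to predict
            labels.append(None)
            continue
        labels.append(token)       # the model must recover this token
        roll = rng.random()
        if roll < 0.8:
            inputs.append(MASK)                # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs.append(rng.choice(vocab))   # 10%: replace with a random token
        else:
            inputs.append(token)               # 10%: keep the original token

    return inputs, labels

tokens = "the cat chased the mouse across the old wooden kitchen floor .".split()
toy_vocab = ["dog", "ran", "blue", "house"]  # hypothetical replacement pool
inputs, labels = mask_tokens(tokens, toy_vocab)
print(inputs)
print(labels)
```

Leaving some selected tokens unmasked or swapping in random ones keeps the model from relying on [MASK] as a cue, since that token never appears when the model is fine-tuned or used later.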

4. Next Sentence Prediction (NSP)

Understanding single sentences is great, but real-world language involves paragraphs and discourse. BERT was also trained using Next Sentence Prediction (NSP).

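The model is fed two sentences (Sentence A and Sentence B) and must predict whether B actually follows A in the source text or is a random sentence from another document. During pretraining, half of the pairs are true continuations and half are random. The objective is meant to teach BERT how sentences relate to one another, which matters for arguments, conversations, and long-form text.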

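# Next Sentence Prediction (NSP)
from transformers import BertForNextSentencePrediction, BertTokenizer

tok = BertTokenizer.from_pretrained('bert-base-uncased')
nsp = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')

A = "He went to the store."
B = "He bought some milk."

logits = nsp(**tok(A, B, return_tensors='pt')).logits
prediction = logits.argmax(dim=-1).item() == 0  # class 0 means "B follows A"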
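How the training pairs for this task might be assembled can be sketched without any model at all. The helpers below, `make_nsp_pairs` and `pack_pair`, are hypothetical names for this illustration. The 50/50 split between real and random continuations and the `[CLS] A [SEP] B [SEP]` layout with segment ids follow the original BERT setup; the sample documents are made up.

```python
import random

def make_nsp_pairs(documents, seed=0):
    """Build (sentence_a, sentence_b, is_next) examples.

    documents is a list of documents, each a list of sentences. At least two
    documents are needed so a random sentence can come from a different one.
    """
    rng = random.Random(seed)
    pairs = []
    for doc_index, doc in enumerate(documents):
        for i in range(len(doc) - 1):
            if rng.random() < 0.5:
                # Positive example: B really is the sentence after A.
                pairs.append((doc[i], doc[i + 1], True))
            else:
                # Negative example: B is drawn from some other document.
                others = [d for j, d in enumerate(documents) if j != doc_index]
                pairs.append((doc[i], rng.choice(rng.choice(others)), False))
    return pairs

def pack_pair(a_tokens, b_tokens):
    """Join two token lists into one input with segment ids (0 for A, 1 for B)."""
    tokens = ["[CLS]"] + a_tokens + ["[SEP]"] + b_tokens + ["[SEP]"]
    segment_ids = [0] * (len(a_tokens) + 2) + [1] * (len(b_tokens) + 1)
    return tokens, segment_ids

documents = [
    ["He went to the store.", "He bought some milk.", "Then he walked home."],
    ["Penguins cannot fly.", "They are strong swimmers."],
]
pairs = make_nsp_pairs(documents)
for a, b, is_next in pairs:
    print(is_next, "|", a, "->", b)

print(pack_pair(["he", "went"], ["he", "bought", "milk"]))
```

The segment ids are what let the model tell the two sentences apart once they have been packed into a single sequence.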

5. WordPiece Tokenization

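Human vocabulary is effectively open-ended because of prefixes, suffixes, and compound words. If BERT tried to store a separate vector for every possible word, its embedding table would grow without limit and rare words would still be missing.

To solve this, BERT uses WordPiece tokenization with a fixed vocabulary (30,522 entries for bert-base-uncased). It breaks complex or unknown words down into smaller, recognizable sub-words. For example, 'unbelievable' might become 'un', '##believ', and '##able', where every piece after the first carries the '##' prefix; the exact split depends on the vocabulary. This means BERT rarely has to fall back on an 'Out Of Vocabulary' placeholder, allowing it to process typos, slang, and novel words gracefully.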

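# WordPiece sub-word tokenization
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

raw = "unbelievable"
tokens = tokenizer.tokenize(raw)
print(tokens)  # continuation pieces start with '##'; the exact split depends on the vocabulary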
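The splitting itself is usually described as a greedy longest-match-first search: starting at the left of the word, take the longest piece found in the vocabulary, then repeat on what remains, prefixing every non-initial piece with '##'. Below is a minimal sketch of that idea, using a tiny hypothetical vocabulary rather than BERT's real one. The function names `wordpiece` and `join_pieces` are illustrative.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into sub-word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece   # mark pieces that continue a word
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]               # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

def join_pieces(pieces):
    """Rebuild the original word by dropping the '##' continuation markers."""
    return "".join(p[2:] if p.startswith("##") else p for p in pieces)

# Hypothetical toy vocabulary, for illustration only.
VOCAB = {"un", "##believ", "##able", "play", "##ing"}

print(wordpiece("unbelievable", VOCAB))   # ['un', '##believ', '##able']
print(wordpiece("playing", VOCAB))        # ['play', '##ing']
print(wordpiece("xyz", VOCAB))            # ['[UNK]']
print(join_pieces(["play", "##ing"]))     # playing
```

Because a real vocabulary also contains single characters and their '##' forms, the unknown-token fallback is needed only for characters the vocabulary has never seen.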

Pascual Vila

Frontend Instructor // Code Syllabus

Lesson Glossary

[01] BERT

Bidirectional Encoder Representations from Transformers; a pre-trained model designed to understand deep bidirectional context.

[02] MLM

Masked Language Modeling; the training task of predicting hidden tokens using surrounding context.

[03] Contextual Embedding

A numeric vector representing a word that changes based on the other words in the sentence.

[04] WordPiece

A sub-word tokenization algorithm that breaks words into smaller pieces to handle rare vocabulary.

[05] NSP

Next Sentence Prediction; a binary classification task to predict if one sentence follows another.
