Giant models can't fit on edge devices, but they can teach smaller ones. Knowledge distillation transfers what a heavyweight teacher model has learned to a lightweight student model.
1. Learning from Soft Probabilities
When training a standard model, we use 'Hard Labels' (e.g., 0 or 1), which say only which class is correct. A large Teacher Model provides much more information. For example, if shown a picture of a dog, a teacher might say it's 90% dog, 9% cat, and 1% car. That 9% cat is Dark Knowledge: it tells the student that this 'dog' has features similar to a cat, and far fewer in common with a car. By minimizing the difference between the teacher's 'soft' outputs and its own outputs, the Student Model learns these similarities between classes, which often lets it pick up the underlying structure of the data more efficiently than from labels alone.
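To make the dog/cat/car example concrete, here is a minimal sketch in plain Python. The two student distributions and the helper names (`cross_entropy`, `kl_divergence`) are hypothetical and chosen only for illustration. KL divergence is used as one common way to measure the "difference" between teacher and student; the text does not say which measure is intended.

```python
import math

def cross_entropy(target, probs, eps=1e-12):
    """Cross-entropy of predicted probs against a target distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, probs))

def kl_divergence(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student): zero only when the two distributions match."""
    return sum(
        t * math.log((t + eps) / (s + eps))
        for t, s in zip(teacher_probs, student_probs)
    )

# Class order: [dog, cat, car]. Teacher values mirror the example in the text.
hard_label = [1.0, 0.0, 0.0]
teacher = [0.90, 0.09, 0.01]

# Two hypothetical students that are equally confident about "dog"
# but disagree on where the remaining probability goes.
student_a = [0.80, 0.18, 0.02]  # leftover mass mostly on cat
student_b = [0.80, 0.02, 0.18]  # leftover mass mostly on car

# The hard label only looks at the "dog" entry, so it cannot tell them apart.
hard_a = cross_entropy(hard_label, student_a)
hard_b = cross_entropy(hard_label, student_b)

# The teacher's soft output penalizes the student that confuses dog with car.
soft_a = kl_divergence(teacher, student_a)
soft_b = kl_divergence(teacher, student_b)

print("hard-label loss:", hard_a, hard_b)
print("soft-label loss:", soft_a, soft_b)
```

Under these assumed numbers, both students get the same hard-label loss, while the soft-label loss is lower for the student that agrees with the teacher that a dog is more cat-like than car-like.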
Teacher_Output: [0.85, 0.12, 0.03]
Student_Target: Teacher_Soft_Logits
Loss: Distillation_Loss(Teacher, Student)
Status: KNOWLEDGE_TRANSFER_ACTIVE
2. Temperature and Transfer
To extract this knowledge, we use a hyperparameter called Temperature (T). The logits are divided by T before the softmax, so increasing T above 1 'softens' the probability distribution, making the smaller values more prominent and easier for the student to learn. The training process combines a Distillation Loss (comparing student to teacher at temperature T) and a standard Student Loss (comparing student to ground truth). This dual-signal approach can let a small student such as MobileNet get much closer to the accuracy of a massive ensemble or a deep ResNet than it typically would by training on hard labels alone.
Temp: 5.0 // Softens probabilities
Soft_Prob: exp(logit/T) / sum(exp(logit/T))
Context: LEARNING_RELATIONSHIPS
Status: ROBUST_STUDENT_TRAINING
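The temperature formula and the two-part loss can be sketched in plain Python as follows. This is an illustration under stated assumptions, not a reference implementation: the logits are made up, and the mixing weight `alpha` and the `T ** 2` scaling on the soft term are common conventions that the text above does not specify.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """exp(logit/T) / sum(exp(logit/T)), shifted by the max for stability."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, true_class,
                      T=5.0, alpha=0.5):
    """Weighted sum of a soft (teacher) term and a hard (ground-truth) term.

    alpha and the T**2 factor are assumptions; tune or drop them as needed.
    """
    # Distillation Loss: KL(teacher || student), both softened with T.
    p_teacher = softmax_with_temperature(teacher_logits, T)
    p_student_soft = softmax_with_temperature(student_logits, T)
    soft = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student_soft))

    # Student Loss: ordinary cross-entropy against the true class at T = 1.
    p_student = softmax_with_temperature(student_logits, 1.0)
    hard = -math.log(p_student[true_class])

    # T**2 keeps the soft term's gradient scale comparable as T changes.
    return alpha * (T ** 2) * soft + (1.0 - alpha) * hard

# Hypothetical logits for classes [dog, cat, car].
teacher_logits = [5.0, 2.0, -1.0]
student_logits = [2.0, 1.5, 0.0]

sharp = softmax_with_temperature(teacher_logits, T=1.0)
soft = softmax_with_temperature(teacher_logits, T=5.0)
loss = distillation_loss(student_logits, teacher_logits, true_class=0)

print("T=1:", sharp)
print("T=5:", soft)
print("combined loss:", loss)
```

With these assumed logits, the T = 5 distribution gives the non-top classes visibly more probability than the T = 1 distribution does, which is the softening effect described above. In a real training loop, the same two terms would normally be computed with a deep learning framework so that gradients flow back into the student.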