- Knowledge distillation trains a small “student” model to imitate a large “teacher” model, transferring capability into a far cheaper package.
- The technique was formalized by Geoffrey Hinton, Oriol Vinyals, and Jeff Dean in their 2015 paper, building on earlier model-compression work by Bucilă et al.
- The key trick is soft targets: the student learns from the teacher’s full probability distribution over classes, not just the single correct answer, capturing “dark knowledge” about how classes relate.
- Distillation is different from quantization (lower-precision numbers) and pruning (removing weights) — and the three are often combined.
- Real results are dramatic: DistilBERT is about 40% smaller and 60% faster than BERT while keeping roughly 97% of its language-understanding performance, per its 2019 paper.
What model distillation is
Model distillation, also called knowledge distillation, is a compression technique where a small student model is trained to reproduce the behavior of a large, accurate teacher model. The goal is to keep most of the teacher’s quality while drastically cutting size, latency, and cost — so the model can run on a phone, a browser, or a cheap server instead of a cluster of GPUs.
The name is a chemistry metaphor: just as distillation concentrates the essential parts of a mixture, model distillation concentrates the essential behavior of a big network into a small one. The idea was popularized by Geoffrey Hinton, Oriol Vinyals, and Jeff Dean in their 2015 paper Distilling the Knowledge in a Neural Network, which extended earlier 2006 model-compression work by Bucilă, Caruana, and Niculescu-Mizil.
Soft targets and “dark knowledge”
The heart of distillation is training the student on the teacher’s soft targets — its full probability distribution across all possible outputs — instead of only the hard correct label. When a teacher classifies an image of a dog, it might output 90% dog, 7% wolf, 2% cat, 1% car. Those small non-zero numbers encode how the teacher sees the world: dogs resemble wolves more than cars. Hinton’s team called this extra signal “dark knowledge.”
A hard label throws all of that away, telling the student only “dog.” The soft distribution is far richer, so each training example carries more information and the student can learn from less data. According to the 2015 paper, this is why a student trained on soft targets can come close to its teacher’s accuracy despite having far fewer parameters.
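A small sketch can make the difference between hard and soft targets concrete. It reuses the dog/wolf/cat/car distribution from the example above; the two student distributions are hypothetical and chosen only for illustration. Both students are equally confident about “dog,” so the hard label cannot tell them apart, while the soft target prefers the one that puts its leftover probability on “wolf” rather than “car.”

```python
import math

def cross_entropy(target, predicted):
    """Cross-entropy in nats; terms with zero target probability contribute nothing."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

# Class order: dog, wolf, cat, car
hard_label = [1.0, 0.0, 0.0, 0.0]
teacher_soft = [0.90, 0.07, 0.02, 0.01]  # the teacher output from the text

# Two hypothetical students, equally sure the image is a dog.
student_wolfish = [0.90, 0.08, 0.01, 0.01]  # spare mass on "wolf"
student_carish = [0.90, 0.01, 0.01, 0.08]   # spare mass on "car"

hard_a = cross_entropy(hard_label, student_wolfish)
hard_b = cross_entropy(hard_label, student_carish)
soft_a = cross_entropy(teacher_soft, student_wolfish)
soft_b = cross_entropy(teacher_soft, student_carish)

print(f"hard-label loss: wolfish={hard_a:.4f}  carish={hard_b:.4f}")
print(f"soft-target loss: wolfish={soft_a:.4f}  carish={soft_b:.4f}")
```

The hard-label losses are identical because only the “dog” probability enters the sum. The soft-target loss is lower for the student whose mistakes resemble the teacher’s, which is the extra training signal the section describes.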
Temperature: turning up the signal
To expose dark knowledge, distillation uses a temperature parameter in the softmax function. Raising the temperature softens the distribution, making the small probabilities larger and more informative; the student is trained to match these softened outputs. Hinton, Vinyals, and Dean described temperature as the knob that controls how much relational structure between classes the student sees. The student is typically trained on a blend of the soft teacher targets and the true hard labels.
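The temperature mechanism and the blended objective can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the logits, the temperature of 4.0, and the blending weight `alpha` are assumed values, and the soft term is scaled by the squared temperature, as the 2015 paper suggests, so that its gradients stay comparable in size to the hard-label term.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher temperature gives a flatter distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, true_class,
                      temperature=4.0, alpha=0.5):
    """Blend of a soft-target term and a hard-label term (alpha is an assumed weight)."""
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    # Soft term: match the teacher's softened distribution, scaled by T^2.
    soft_term = kl_divergence(soft_teacher, soft_student) * temperature ** 2
    # Hard term: ordinary cross-entropy against the true label at temperature 1.
    hard_term = -math.log(softmax(student_logits)[true_class])
    return alpha * soft_term + (1 - alpha) * hard_term

# Hypothetical teacher logits for the classes dog, wolf, cat, car.
teacher = [6.0, 3.5, 2.0, 1.0]
student = [4.0, 1.0, 1.5, 2.0]

print("T=1:", [round(p, 3) for p in softmax(teacher, 1.0)])
print("T=4:", [round(p, 3) for p in softmax(teacher, 4.0)])
print("loss:", round(distillation_loss(student, teacher, true_class=0), 4))
```

Printing the two distributions shows the effect the text describes: at the higher temperature the “dog” probability shrinks and the minority classes grow, so their relative ordering becomes a much stronger training signal. At inference time the student is normally run at temperature 1.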
Teacher and student in practice
In a distillation setup, the teacher is a large, expensive, already-trained model, and the student is a smaller architecture trained to mimic it. The teacher runs over a dataset (often the original training data, sometimes unlabeled data) to produce soft predictions, and the student is optimized to reproduce them. Once trained, the teacher is discarded at deployment — only the lightweight student ships.
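The workflow above (run the frozen teacher, cache its soft predictions, fit the student to them) can be sketched as a toy training loop. Everything here is an assumption made for illustration: `teacher_logits` is a hand-written stand-in for a large pretrained model, the student is a tiny linear classifier, and the temperature, learning rate, and step count are arbitrary. Real pipelines would use a deep-learning framework, but the control flow is the same.

```python
import math
import random

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def teacher_logits(x):
    """Hypothetical stand-in for a large, frozen, already-trained teacher."""
    a, b = x
    return [2.0 * a - b, -a + 2.0 * b, 0.5 * a * b]

def student_logits(x, weights, bias):
    """The student: a small linear model with one weight row per class."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def train_student(inputs, temperature=2.0, learning_rate=0.5, steps=200):
    n_classes, n_features = 3, 2
    weights = [[0.0] * n_features for _ in range(n_classes)]
    bias = [0.0] * n_classes
    # Run the teacher once over the unlabeled inputs and cache its soft predictions.
    soft_targets = [softmax(teacher_logits(x), temperature) for x in inputs]
    history = []  # average loss measured before each update
    for _ in range(steps):
        grad_w = [[0.0] * n_features for _ in range(n_classes)]
        grad_b = [0.0] * n_classes
        total_loss = 0.0
        for x, target in zip(inputs, soft_targets):
            probs = softmax(student_logits(x, weights, bias), temperature)
            total_loss -= sum(t * math.log(p) for t, p in zip(target, probs))
            for k in range(n_classes):
                # Gradient of the soft cross-entropy with respect to logit k.
                delta = (probs[k] - target[k]) / temperature
                grad_b[k] += delta
                for j in range(n_features):
                    grad_w[k][j] += delta * x[j]
        history.append(total_loss / len(inputs))
        # Full-batch gradient descent step on the student only.
        for k in range(n_classes):
            bias[k] -= learning_rate * grad_b[k] / len(inputs)
            for j in range(n_features):
                weights[k][j] -= learning_rate * grad_w[k][j] / len(inputs)
    return weights, bias, history

random.seed(0)
unlabeled_inputs = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(200)]
weights, bias, history = train_student(unlabeled_inputs)
print(f"average loss before: {history[0]:.4f}, after: {history[-1]:.4f}")
```

Note that no ground-truth labels appear anywhere: the teacher’s outputs are the only supervision, which is why unlabeled data works. After training, only `weights` and `bias` would ship; `teacher_logits` is no longer needed.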
The classic example is DistilBERT. In their 2019 paper DistilBERT, a distilled version of BERT, Victor Sanh and colleagues at Hugging Face distilled Google’s BERT into a model with 40% fewer parameters that runs 60% faster while retaining about 97% of BERT’s language-understanding ability on the GLUE benchmark. Both teacher and student are built on the transformer architecture, which makes layer-to-layer knowledge transfer straightforward.
Self-distillation and frontier models
Distillation is not limited to shrinking. In self-distillation, a model is distilled into a student of the same size, and the student can still improve on the original. More recently, frontier AI labs distill huge flagship models into faster, cheaper variants: the “mini” or “flash” tiers many providers offer are commonly produced by distilling a larger sibling. This is one of the main techniques behind today’s small on-device models, which deliver surprising capability at a fraction of the compute.
Distillation vs quantization vs pruning
Distillation, quantization, and pruning are three distinct routes to a smaller, faster model, and they attack different things. Distillation trains a new, smaller architecture to copy a teacher’s behavior. Quantization keeps the architecture but stores weights and activations in lower-precision numbers. Pruning keeps the architecture but deletes weights or whole neurons judged unimportant. They are complementary, not competing.
Quantization
Quantization reduces the numeric precision of a model, for example from 32-bit or 16-bit floating point down to 8-bit integers or even 4-bit. The model keeps the same number of weights, but each takes less memory and computes faster on suitable hardware. Going from 16-bit to 8-bit roughly halves memory, often with minimal accuracy loss, which is why quantization is a staple of on-device deployment and a core topic in our coverage of inference optimization and serving at scale.
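As a rough illustration of the idea, the sketch below applies symmetric per-tensor 8-bit quantization to a handful of weights and then estimates weight memory at several bit widths. The scheme is one simple choice among many, the weight values are made up, and the 1-billion-parameter model is hypothetical; production toolchains typically use per-channel scales, calibration data, and packed storage.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]

def memory_bytes(n_params, bits_per_param):
    """Storage for the weights alone, ignoring scales and other metadata."""
    return n_params * bits_per_param // 8

weights = [0.82, -0.31, 0.05, -1.27, 0.0, 0.44]  # made-up example values
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print("integers:", quantized)
print(f"largest round-trip error: {max_error:.5f} (scale {scale:.5f})")

n_params = 1_000_000_000  # hypothetical 1-billion-parameter model
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit weights: {memory_bytes(n_params, bits) / 1e9:.1f} GB")
```

The round-trip error is bounded by half the scale, which is why accuracy often survives, and the memory estimate halves with each step down in precision, matching the 16-bit to 8-bit claim above.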
Pruning
Pruning removes parameters that contribute little to the output, producing a sparser network. It can be unstructured (zeroing individual weights) or structured (removing entire neurons, attention heads, or layers). Structured pruning yields real speedups on standard hardware, while unstructured pruning needs specialized support to translate sparsity into actual gains. In practice, teams often prune, quantize, and distill the same model to stack the savings.
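The two flavors of pruning can be contrasted on a tiny weight matrix. This is a sketch under simple assumptions: importance is judged by magnitude (absolute value for individual weights, L2 norm for whole rows), each row stands in for one neuron, and the matrix values are invented. Real pruning methods often use more careful importance scores and fine-tune afterward.

```python
import math

def prune_unstructured(matrix, sparsity):
    """Zero out the smallest-magnitude weights; the matrix shape is unchanged.

    Ties at the threshold are all pruned, so slightly more than the
    requested fraction may be zeroed.
    """
    magnitudes = sorted(abs(w) for row in matrix for w in row)
    n_prune = int(len(magnitudes) * sparsity)
    if n_prune == 0:
        return [list(row) for row in matrix]
    threshold = magnitudes[n_prune - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in matrix]

def prune_structured(matrix, n_keep):
    """Drop whole rows (neurons) with the smallest L2 norm; the matrix shrinks."""
    def row_norm(i):
        return math.sqrt(sum(w * w for w in matrix[i]))
    ranked = sorted(range(len(matrix)), key=row_norm, reverse=True)
    kept = sorted(ranked[:n_keep])  # preserve the original row order
    return [list(matrix[i]) for i in kept]

layer = [
    [0.90, -0.05, 0.40],
    [0.01, 0.02, -0.03],
    [-0.70, 0.60, 0.08],
]

unstructured = prune_unstructured(layer, sparsity=0.5)
structured = prune_structured(layer, n_keep=2)
print("unstructured:", unstructured)
print("structured:  ", structured)
```

The unstructured result is still a 3x3 matrix with zeros scattered through it, so a dense matrix multiply does the same amount of work unless the hardware or kernel exploits sparsity. The structured result is a genuinely smaller 2x3 matrix, which is why it speeds up standard hardware directly.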
Tradeoffs and limits
Distillation buys speed and size at the cost of some accuracy and a dependence on a strong teacher — the student rarely exceeds its teacher, and a weak teacher produces a weak student. The technique also adds engineering overhead: you must train the teacher, generate soft targets, and train the student, which is more work than just shipping the teacher if cost were no object.
Quality loss varies by task. DistilBERT’s roughly 3% drop is acceptable for many applications but may be too much for high-stakes ones. Distilled generative models can also inherit and sometimes amplify the teacher’s biases and blind spots, since the student is explicitly trained to imitate them. The decision is always a trade between the resources you save and the capability you are willing to give up. For latency-sensitive, on-device, or high-volume use cases, that trade usually favors distillation.
Frequently asked questions
Can a distilled student model ever beat its teacher?
Usually not on raw capability — the student is trained to imitate the teacher, so the teacher sets a soft ceiling. There are exceptions: self-distillation and ensemble distillation can sometimes produce a student that generalizes slightly better than the original, and a small model tuned for a narrow domain may outperform a general teacher on that domain. But as a rule, distillation trades a small amount of accuracy for large gains in speed, size, and cost rather than improving peak quality.
Why are soft targets better than the correct labels?
Because soft targets carry far more information per example. A hard label says only which answer is correct; the teacher’s full probability distribution also reveals how the teacher relates the wrong answers to each other — that dogs resemble wolves more than cars, for instance. Hinton’s team called this extra signal “dark knowledge.” It lets the student learn the teacher’s reasoning patterns, not just its final answers, so it trains faster and generalizes better from less data.
Should I use distillation, quantization, or pruning?
Often all three, since they target different inefficiencies and stack well. Start with quantization — it is the cheapest to apply and frequently gives large memory savings with little accuracy loss. Add structured pruning if you need more speed on standard hardware. Reach for distillation when you want a fundamentally smaller architecture or need to compress a very large model into a deployable size. The right mix depends on your latency budget, target hardware, and how much accuracy you can spare.






