Distillation

What is distillation?

Large, powerful models tend to be generalists: they can perform a wide variety of tasks well. Smaller models, if properly trained, can match those generalists on specific or specialized tasks. Distillation allows you to take knowledge present in a larger model and transfer it to a smaller model, producing a smaller model with comparable quality to the larger one on a subset of relevant tasks.

This is particularly useful for specialized tasks where you don't need the full range of capabilities of the largest models. Smaller models have several advantages over larger models: they generate text faster, have lower time to first token, and cost less to host because they require less hardware. With model distillation, you may be able to, for example, get roughly the quality of a 17B parameter model from an 8B parameter model on your specific task.

Distillation vs fine-tuning

Distillation and fine-tuning are related concepts:

  • Distillation is the process of taking knowledge in a larger "teacher" model and transferring it into a smaller "student" model.
  • Fine-tuning is the process of modifying a pre-trained model's weights to make small adjustments in behavior.

Fine-tuning is a tool that can be used before or during the distillation process. This guide focuses on the distillation of language models through synthetic data generation and walks you through the process of distilling a Llama model for your specific needs.

How distillation works

Distillation transfers specific knowledge present in a "teacher" model to a smaller "student" model. The process works like this (a code sketch follows the list):

  1. Curate a set of task-specific example inputs, similar to the process used for fine-tuning. Unlike fine-tuning, you do not need to supply completions for the inputs.
  2. Use the teacher model to generate high-quality completions automatically.
  3. Evaluate the student model to establish baseline performance.
  4. Fine-tune the student model on the synthetic dataset of inputs and teacher-generated completions.
  5. Evaluate the tuned student model on the same task. You should see an improvement over the baseline student model.
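A minimal sketch of this workflow in Python. The helpers `generate_with_teacher`, `evaluate_on_task`, and `finetune` are hypothetical placeholders for your own teacher inference, evaluation, and fine-tuning code; they are not part of any Llama tooling.

```python
# Hypothetical end-to-end distillation loop. generate_with_teacher,
# evaluate_on_task, and finetune stand in for your own tooling.

def distill(teacher, student, task_inputs, eval_set):
    # Step 1: task_inputs is your curated set of prompts (no completions needed).
    # Step 2: use the teacher to generate high-quality completions.
    synthetic_dataset = [
        {"prompt": prompt, "completion": generate_with_teacher(teacher, prompt)}
        for prompt in task_inputs
    ]

    # Step 3: establish the student's baseline performance on the task.
    baseline_score = evaluate_on_task(student, eval_set)

    # Step 4: fine-tune the student on the teacher-generated dataset.
    tuned_student = finetune(student, synthetic_dataset)

    # Step 5: re-evaluate; the tuned student should improve over the baseline.
    tuned_score = evaluate_on_task(tuned_student, eval_set)
    print(f"baseline: {baseline_score:.3f} -> distilled: {tuned_score:.3f}")
    return tuned_student
```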

Distillation techniques

Synthetic data generation

For modern distillation techniques, a larger teacher model is used to generate synthetic data for the student model to learn from. This is done by curating a set of inputs and then using the teacher model to generate completions for those inputs, which effectively creates a synthetic dataset the student can learn from. The student model, a smaller pre-trained model, is then fine-tuned on the synthetic data instead of on manually curated data.
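As an illustration, here is a minimal sketch of the generation step using the Hugging Face transformers library. The teacher checkpoint and prompts are placeholders, and in practice you would batch requests and filter or score the outputs rather than generating one prompt at a time:

```python
import json
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder teacher checkpoint; substitute the larger Llama model you have access to.
TEACHER = "meta-llama/Llama-3.1-70B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(TEACHER)
teacher = AutoModelForCausalLM.from_pretrained(TEACHER, device_map="auto")

# Curated, task-specific inputs; no human-written completions required.
prompts = [
    "Summarize the following support ticket: ...",
    "Classify the sentiment of this product review: ...",
]

with open("synthetic_dataset.jsonl", "w") as f:
    for prompt in prompts:
        # Use the chat template so the instruct teacher sees a well-formed turn.
        input_ids = tokenizer.apply_chat_template(
            [{"role": "user", "content": prompt}],
            add_generation_prompt=True,
            return_tensors="pt",
        ).to(teacher.device)
        output = teacher.generate(input_ids, max_new_tokens=256, do_sample=False)
        completion = tokenizer.decode(
            output[0][input_ids.shape[-1]:], skip_special_tokens=True
        )
        # Each line becomes one training example for the student.
        f.write(json.dumps({"prompt": prompt, "completion": completion}) + "\n")
```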

Generating large amounts of synthetic data can present an infrastructure challenge, as it requires producing a large number of batch requests to the teacher, as well as managing the curriculum and data distribution. Meta offers the open-source Synthetic Data Kit tool to help with this process.

Advanced distillation techniques

Distillation signal

Most distillation techniques train the student to replicate the outputs of the teacher. This is a simple and effective way to transfer knowledge from the teacher to the student. However, some information within the network is lost: the student learns nothing about the teacher's uncertainty, nor about the teacher's internal representations. There are techniques that attempt to address this, so distillation can be broadly classified into the following categories:

  • Hard targets: The teacher's direct outputs are used as the distillation signal, with each token position having exactly one correct answer, just as in pretraining. This is the simplest form of distillation and the most common for LLMs.
  • Logit targets: The teacher's output logits (unnormalized scores over the vocabulary) are used as the distillation signal, allowing the teacher to convey its uncertainty as a probability distribution over the next token. This makes it easier for the student to learn that a sequence can have more than one plausible continuation, and can improve learning speed and overall performance (see the sketch after this list).
  • Feature targets: The teacher's internal representations are used as the distillation signal, for example by matching intermediate activations at corresponding layers of the two networks. This lets the student learn more about the teacher's learned features and guides it toward similar internal representations. However, this method typically requires either a projection network or careful architecture design so that the student's feature shapes align with the teacher's.
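As a concrete illustration of the logit-target case, here is a minimal PyTorch sketch of a soft-target distillation loss. It assumes the teacher and student share a tokenizer so their logits are aligned over the same vocabulary; in practice this term is usually mixed with the standard cross-entropy loss on the hard targets.

```python
import torch.nn.functional as F

def logit_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student
    next-token distributions. Both tensors are assumed to have shape
    (batch, seq_len, vocab_size) and to use the same vocabulary."""
    # Soften both distributions so the teacher's uncertainty is visible.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)

    # KL(teacher || student), averaged over the batch.
    kl = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")

    # The temperature**2 factor keeps gradients comparable in scale to the
    # hard-target cross-entropy loss this term is typically combined with.
    return kl * (temperature ** 2)
```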

Multiple-teacher single-student distillation

Sometimes a single generalist model may not offer the strongest performance for a specific set of tasks. In these cases, it may be useful to distill multiple models, each of which is specialized for a different task. This allows the student model to learn from the strengths of each teacher model, while still maintaining a single model. The typical setup for such an approach is to split the dataset into multiple subsets, and then produce synthetic data for each subset using the teacher that is strongest for that subtask. Then, a student is fine-tuned on the combined synthetic data all at once, enabling it to learn all tasks together.

Unlike typical distillation, the student may be the same size as, or even larger than, the teachers. This is because the student must learn all of the tasks, and therefore must be able to represent the combined knowledge of all the teachers.
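A minimal sketch of that data-routing setup. The `generate_with_teacher` helper is a hypothetical stand-in for your own teacher inference code, and the task names are illustrative:

```python
# Hypothetical multi-teacher setup: each subset of inputs is routed to the
# teacher that is strongest for that subtask, then the synthetic data is pooled.

def build_multi_teacher_dataset(inputs_by_task, teachers_by_task, generate_with_teacher):
    """inputs_by_task:      e.g. {"summarization": [...], "sql_generation": [...]}
    teachers_by_task:       e.g. {"summarization": teacher_a, "sql_generation": teacher_b}
    generate_with_teacher:  your own (teacher, prompt) -> completion function."""
    combined = []
    for task, prompts in inputs_by_task.items():
        teacher = teachers_by_task[task]  # the strongest specialist for this subtask
        for prompt in prompts:
            combined.append({
                "task": task,
                "prompt": prompt,
                "completion": generate_with_teacher(teacher, prompt),
            })
    # The student is then fine-tuned on `combined` in a single run, so it
    # learns every task from its strongest teacher at once.
    return combined
```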

Evaluating distilled models

When evaluating a distilled model, you need to measure both the quality of the knowledge transfer and the practical benefits of using a smaller model. The goal is to ensure that the student maintains the teacher's performance while being more efficient to run.

This section focuses specifically on evaluating a distilled model. For more information on evaluating a model in general, see the evaluations guide.

Knowledge transfer metrics

The most important aspect of distillation is ensuring that the student has learned the key knowledge from the teacher. There are several ways to measure this transfer; the first two are sketched in code after the list below.

  • Output similarity: Compare the outputs of both models on the same inputs. While perfect agreement isn't necessary, the student should produce similarly high-quality outputs. Typically this is measured token-for-token or with a metric like BLEU or ROUGE. A more advanced approach is to use another LLM as a judge of the semantic similarity and quality of the outputs.
  • Distribution matching: Compare the probability distributions of both models. The student should produce token distributions similar to the teacher's, indicating it has learned the underlying patterns. This should hold even if you are not doing logit distillation and are only using hard targets.
  • Task performance: Evaluate both models on your specific tasks. The student should maintain most of the teacher's performance, ideally within 5% of the original metrics. The specific metrics you use here will depend on your task, and should be thoughtfully developed using business or task-specific criteria.
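Here is an illustrative sketch of the first two checks, using the rouge_score package for output similarity and PyTorch for a per-token KL divergence between the two models' next-token distributions. The tensor shapes are assumptions, and the KL comparison is only meaningful if teacher and student share a tokenizer:

```python
import torch.nn.functional as F
from rouge_score import rouge_scorer

def output_similarity(teacher_outputs, student_outputs):
    """Average ROUGE-L F-measure between teacher and student completions
    generated from the same prompts."""
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    scores = [
        scorer.score(t, s)["rougeL"].fmeasure
        for t, s in zip(teacher_outputs, student_outputs)
    ]
    return sum(scores) / len(scores)

def distribution_gap(teacher_logits, student_logits):
    """Mean KL(teacher || student) per token position; lower means the student's
    next-token distributions track the teacher's more closely. Both tensors are
    assumed to have shape (batch, seq_len, vocab_size)."""
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    student_log_probs = F.log_softmax(student_logits, dim=-1)
    kl = F.kl_div(student_log_probs, teacher_probs, reduction="none").sum(dim=-1)
    return kl.mean().item()
```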

Practical benefits

The main reason to distill a model is to make it more efficient to run. Below are some metrics you should measure as part of evaluating your distilled model.

  • Inference speed: The student should be significantly faster than the teacher, typically 2-4x faster depending on the size reduction.
  • Resource usage: The student should use less memory and compute resources, making it cheaper to run and easier to deploy.

If you have the resources, it is best practice to create distilled models at several different sizes and evaluate them all. This will help you understand the trade-off between performance and efficiency across model sizes.
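As a rough sketch, you might measure time to first token and decode throughput with the Hugging Face transformers library as below. The model name, prompt, and token counts are placeholders, and production measurements should come from your actual serving stack under realistic traffic. Run the same benchmark for the teacher and the distilled student on the same hardware to get a like-for-like comparison.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def benchmark(model_name="meta-llama/Llama-3.2-3B-Instruct",
              prompt="Summarize the following support ticket: ...",
              max_new_tokens=128):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.bfloat16, device_map="auto"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Time to first token: generate exactly one new token.
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=1, do_sample=False)
    ttft = time.perf_counter() - start

    # Decode throughput over a longer completion.
    start = time.perf_counter()
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    elapsed = time.perf_counter() - start
    generated = output.shape[-1] - inputs["input_ids"].shape[-1]

    print(f"time to first token: {ttft:.2f}s, "
          f"throughput: {generated / elapsed:.1f} tokens/s")
```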

Additional resources

  • Distillation guide: For a complete guide showing how to distill Llama 4 using Synthetic Data Kit, see the distillation notebook on Llama Cookbook.