Large, powerful models tend to be generalists: they can perform a wide variety of tasks well. Smaller, weaker models, if properly trained, can match these generalists on specific or specialized tasks. Distillation lets you take knowledge present in a larger model and transfer it to a smaller one, resulting in a smaller model that matches the larger model's quality on a subset of relevant tasks.
This is particularly useful for specialized tasks where you don't need the full range of capabilities of the largest models. Smaller models have several advantages over larger ones: they generate text faster, have lower time to first token, and cost less to host because they need less hardware. With model distillation, you may be able to use, for example, an 8B-parameter model that delivers the quality of a 17B-parameter model on your specific task.
Distillation and fine-tuning are related concepts: fine-tuning is a tool that can be used before or during the distillation process. This guide focuses on distilling language models through synthetic data generation and walks you through distilling a Llama model for your specific needs.
Distillation transfers specific knowledge present in a "teacher" model to a smaller "student" model. The process works like this:
For modern distillation techniques, a larger teacher model is used to generate synthetic data for the student model to learn from. This is done by curating a set of inputs and then using the teacher model to generate completions for those inputs, which effectively creates a synthetic dataset the student model can learn from. The student model, a smaller pre-trained model, is then fine-tuned on this synthetic data instead of manually curated data.
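To make this concrete, here is a minimal sketch of the generation step using Hugging Face transformers. The teacher model ID, the prompt file, and the output schema are illustrative assumptions rather than requirements of any particular toolchain; adapt them to your own setup.

```python
# Minimal sketch of the synthetic data generation step, using Hugging Face
# transformers. The teacher model ID, prompt file, and output schema are
# illustrative assumptions; swap in whatever your pipeline expects.
import json

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

TEACHER_ID = "meta-llama/Llama-3.3-70B-Instruct"  # assumed teacher; pick your own

tokenizer = AutoTokenizer.from_pretrained(TEACHER_ID)
teacher = AutoModelForCausalLM.from_pretrained(
    TEACHER_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

def generate_completion(prompt: str, max_new_tokens: int = 512) -> str:
    """Ask the teacher model to complete a single curated input."""
    input_ids = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(teacher.device)
    output = teacher.generate(input_ids, max_new_tokens=max_new_tokens, do_sample=False)
    # Drop the prompt tokens so only the teacher's completion remains.
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)

# Build the synthetic dataset: one (prompt, teacher completion) pair per input.
with open("prompts.jsonl") as fin, open("synthetic_dataset.jsonl", "w") as fout:
    for line in fin:
        prompt = json.loads(line)["prompt"]
        record = {"prompt": prompt, "completion": generate_completion(prompt)}
        fout.write(json.dumps(record) + "\n")
```

Greedy decoding is used here for determinism; sampling at a moderate temperature is a common alternative when you want more varied synthetic data. The resulting dataset file then serves as the fine-tuning corpus for the smaller student model.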
Generating large amounts of synthetic data can present an infrastructure challenge: it requires issuing a large number of batched requests to the teacher, as well as managing the curriculum and data distribution. Meta offers the open-source Synthetic Data Kit tool to help with this process.
Most distillation techniques train the student to replicate the outputs of the teacher. This is a simple and effective way to transfer knowledge from the teacher to the student. However, some information within the network is lost: the student learns nothing about the teacher's uncertainty, nor about the teacher's internal representations. There are techniques that attempt to address this, and so distillation can be broadly classified into the following categories:
Sometimes a single generalist model may not offer the strongest performance across a specific set of tasks. In these cases, it may be useful to distill from multiple teacher models, each specialized for a different task. This allows the student model to learn from the strengths of each teacher while still ending up with a single model. The typical setup is to split the dataset into multiple subsets and then produce synthetic data for each subset using the teacher that is strongest for that subtask. A student is then fine-tuned on the combined synthetic data all at once, enabling it to learn all tasks together.
Unlike in typical distillation, the student may be the same size as, or even larger than, the teachers: it must learn all of the tasks, and therefore must be able to represent all of the knowledge present in the teachers.
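As a rough illustration, the sketch below routes each curated prompt to the teacher designated for its subtask and pools the results into a single fine-tuning corpus. The task labels, model IDs, and file names are placeholders, not part of any specific tooling.

```python
# Minimal sketch of multi-teacher synthetic data generation. Assumes each
# prompt in tagged_prompts.jsonl carries a "task" label and that one teacher
# has been designated per task; the model IDs below are placeholders.
import json
from functools import lru_cache

from transformers import pipeline

# Hypothetical mapping from subtask to the teacher that is strongest at it.
TEACHER_FOR_TASK = {
    "code": "placeholder/code-teacher",
    "summarization": "placeholder/summarization-teacher",
    "math": "placeholder/math-teacher",
}

@lru_cache(maxsize=None)
def get_teacher(model_id: str):
    """Load each teacher once and cache it for reuse."""
    return pipeline("text-generation", model=model_id, device_map="auto")

def generate_completion(model_id: str, prompt: str) -> str:
    out = get_teacher(model_id)(prompt, max_new_tokens=512, return_full_text=False)
    return out[0]["generated_text"]

# Route each prompt to its strongest teacher, then pool everything so the
# student can be fine-tuned on all tasks at once.
with open("tagged_prompts.jsonl") as fin, open("combined_dataset.jsonl", "w") as fout:
    for line in fin:
        record = json.loads(line)
        teacher_id = TEACHER_FOR_TASK[record["task"]]
        fout.write(json.dumps({
            "prompt": record["prompt"],
            "completion": generate_completion(teacher_id, record["prompt"]),
            "teacher": teacher_id,
        }) + "\n")
```

Keeping the teacher attribution in each record makes it easier to debug quality issues per subtask later, since you can trace weak student performance back to the teacher that produced that slice of the data.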
When evaluating a distilled model, you need to measure both the quality of the knowledge transfer and the practical benefits of using a smaller model. The goal is to ensure that the student maintains the teacher's performance while being more efficient to run.
This section focuses specifically on evaluating a distilled model. For more general information about evaluating models, see the evaluations guide.
The most important aspect of distillation is ensuring that the student has learned the key knowledge from the teacher. There are several ways to measure this transfer.
The main reason to distill a model is to make it more efficient to run. Below are some metrics you should measure when evaluating your distilled model's performance.
If you have the resources, it's best practice to create several models of different sizes and evaluate them all. This will help you understand the trade-off between performance and efficiency across model sizes.
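As a starting point, here is a minimal sketch that runs the teacher and one or more distilled students over the same held-out set and reports a quality score alongside generation throughput. The file format, model IDs, and the use of exact match as the quality metric are assumptions for illustration; for most tasks you would substitute a task-appropriate scorer and, for latency-sensitive deployments, also measure time to first token.

```python
# Minimal sketch for comparing distillation candidates of different sizes.
# Assumes eval.jsonl holds {"prompt": ..., "reference": ...} records and that
# exact match is a reasonable quality proxy for the task; the model IDs are
# placeholders for the teacher and the distilled students.
import json
import time

from transformers import pipeline

CANDIDATES = [
    "placeholder/teacher",
    "placeholder/student-8b-distilled",
    "placeholder/student-3b-distilled",
]

with open("eval.jsonl") as f:
    eval_set = [json.loads(line) for line in f]

for model_id in CANDIDATES:
    generator = pipeline("text-generation", model=model_id, device_map="auto")
    correct, generated_tokens, elapsed = 0, 0, 0.0
    for example in eval_set:
        start = time.perf_counter()
        out = generator(example["prompt"], max_new_tokens=256, return_full_text=False)
        elapsed += time.perf_counter() - start
        completion = out[0]["generated_text"]
        generated_tokens += len(generator.tokenizer.encode(completion))
        correct += int(completion.strip() == example["reference"].strip())
    print(
        f"{model_id}: accuracy={correct / len(eval_set):.1%}, "
        f"throughput={generated_tokens / elapsed:.1f} tokens/sec"
    )
```

Running every candidate over the same held-out set gives you a simple table of quality versus throughput, which is usually enough to decide where on the size/efficiency curve your task can afford to sit.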