Quantization and performance optimization

What is quantization?

Quantization is a technique used in machine learning to reduce the computational and memory requirements of models, making them more efficient for deployment on servers and edge devices. It involves representing model weights and activations, typically 32-bit floating-point numbers, with lower-precision data, from 16 bits down to just 2 or 3 bits. The benefits of quantization include smaller model sizes and faster inference—particularly beneficial in resource-constrained environments.

Quantization can be viewed as a form of lossy compression, where precision is reduced to shrink the overall model size. Neural networks are built from vast numbers of connected units, each with a weight and each producing activations during computation. Both weights and activations are typically represented as 32-bit floating-point numbers, which are very precise but also relatively large. Because neural networks are highly redundant, exact values are often not critical to producing the correct output. Quantization takes advantage of this by using fewer bits where possible, reducing the information needed to store and process the network.

This loss of precision creates a trade-off: the fewer bits you use, the greater the reduction in model quality.
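As a minimal illustration of the idea (not taken from the original guide), the sketch below applies symmetric 8-bit linear quantization to a random weight tensor: a single scale factor maps values into the INT8 range, and dequantizing recovers an approximation of the original values with a small rounding error.

```python
import torch

def quantize_symmetric_int8(w: torch.Tensor):
    # One scale for the whole tensor: maps the largest absolute value to 127.
    scale = w.abs().max() / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Recover an approximation of the original FP32 values.
    return q.to(torch.float32) * scale

w = torch.randn(4, 4)                    # stand-in for FP32 model weights
q, scale = quantize_symmetric_int8(w)
w_hat = dequantize(q, scale)
print("max quantization error:", (w - w_hat).abs().max().item())
```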

Overview of quantization methods

Due to the substantial savings in memory and compute requirements offered by quantization, it remains an active field of research. This section provides a brief overview of the different quantization methods, and then discusses the different frameworks and tools available for quantization. For a more in-depth treatment, you can begin with a survey paper on quantization, and then explore some of the latest methods such as AWQ, GPTQ, and SpinQuant. A brief overview table of the different methods is below, followed by a more detailed description.

| Quantization Type | Description | Weight Precision | Activation Precision |
|---|---|---|---|
| Weight-only | Only the weights are quantized after training; activations remain full-precision. | INT8, INT4, INT2 | Not quantized |
| Dynamic | Weights are pre-quantized; activations are quantized on-the-fly during inference. | INT8, INT4 | INT8, FP16 |
| Static | Weights and activations are quantized ahead of time after calibration with a representative dataset. | INT8, INT4 | INT8, INT4 |
| Quantization-aware Training | Simulates quantization during training so the model adapts to reduced precision. | INT8, INT4, INT2 | INT8, INT4, FP16 |

Post-training weight-only quantization

In post-training weight-only quantization, the weights of a trained neural network are quantized to lower precision (from 32-bit or 16-bit float down to 8-bit integers or fewer) after the training process is complete. This quantization step does not require access to the original training data, as it operates only on the already-learned weights. In this approach, only the weights are quantized—activations remain in full precision during inference.
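To make the "weights quantized, activations full-precision" distinction concrete, here is a minimal, framework-agnostic sketch (not from the original guide) of a weight-only quantized linear layer: the INT8 weights are dequantized on the fly, and the activations never leave full precision.

```python
import torch

class Int8WeightOnlyLinear(torch.nn.Module):
    """Toy weight-only quantization: INT8 weights with per-output-channel scales."""

    def __init__(self, linear: torch.nn.Linear):
        super().__init__()
        w = linear.weight.detach()
        # Per-output-channel scales lose less accuracy than one tensor-wide scale.
        self.scale = w.abs().amax(dim=1, keepdim=True) / 127.0
        self.weight_q = torch.clamp(torch.round(w / self.scale), -127, 127).to(torch.int8)
        self.bias = linear.bias

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Dequantize weights on the fly; activations stay in full precision.
        w = self.weight_q.to(x.dtype) * self.scale
        return torch.nn.functional.linear(x, w, self.bias)

layer = torch.nn.Linear(64, 32)
qlayer = Int8WeightOnlyLinear(layer)
x = torch.randn(8, 64)
print("max output difference:", (layer(x) - qlayer(x)).abs().max().item())
```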

Post-training dynamic quantization

In dynamic quantization, the weights of the network are quantized after training, as in weight-only quantization. Activations, however, are quantized only during inference, using statistics from the batch of data being processed. This allows the model to adapt to a wide variety of input distributions, but it introduces some computational overhead during inference, because the quantization parameters for activations must be recalculated for each batch or input. Dynamic quantization is particularly useful when the model needs to handle a wide range of input distributions, as it can adjust its quantization parameters on the fly.
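For reference, PyTorch ships an eager-mode dynamic quantization API that follows this recipe for nn.Linear layers. The sketch below uses a toy model rather than Llama; exact module coverage and supported backends vary by PyTorch version.

```python
import torch

# Toy stand-in for a model; dynamic quantization targets the nn.Linear layers.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 128),
).eval()

# Weights are converted to INT8 ahead of time; activation scales are computed
# on the fly for every batch at inference time.
qmodel = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

x = torch.randn(4, 128)
print("max output difference:", (model(x) - qmodel(x)).abs().max().item())
```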

Post-training static quantization

In static quantization, a calibration dataset is used to determine the best way to quantize the model, especially the activations, which are more sensitive to dynamic range. Because both the weights and activations are quantized after training completes but before deployment, the model is faster at inference and potentially more accurate than with dynamic quantization, assuming the calibration dataset is representative of the data the model will see in production. The key advantage of static quantization is that all quantization parameters are fixed after calibration, eliminating on-the-fly calculations during inference. However, this method requires careful calibration with a representative dataset to ensure good performance.
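The calibration step can be sketched with PyTorch's eager-mode static quantization workflow, shown below on a toy CPU model. The backend string and API surface differ across PyTorch versions, so treat this as an outline rather than a drop-in recipe.

```python
import torch
from torch.ao.quantization import QuantStub, DeQuantStub, get_default_qconfig, prepare, convert

class SmallNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = QuantStub()        # quantizes FP32 inputs to INT8
        self.fc1 = torch.nn.Linear(128, 256)
        self.relu = torch.nn.ReLU()
        self.fc2 = torch.nn.Linear(256, 128)
        self.dequant = DeQuantStub()    # returns FP32 outputs

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return self.dequant(x)

model = SmallNet().eval()
model.qconfig = get_default_qconfig("x86")   # "fbgemm" on older PyTorch releases
prepared = prepare(model)                    # inserts observers for weights and activations

# Calibration: run a representative dataset through the model so the observers
# can record activation ranges; random data is used here only as a placeholder.
for _ in range(32):
    prepared(torch.randn(8, 128))

quantized = convert(prepared)                # scales and zero-points are now fixed
print(quantized(torch.randn(8, 128)).shape)
```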

Quantization-aware training

Quantization-aware training (QAT) is a technique that simulates the effects of quantization during the training process itself. Unlike post-training quantization methods (static and dynamic quantization), QAT allows the model to learn to adapt to the reduced precision by incorporating the quantization operations directly into the training process. This approach typically results in better model performance compared to post-training quantization, as the model can learn to compensate for the precision loss.

During QAT, the model's weights and activations are quantized and dequantized in the forward pass, while the backward pass uses full-precision gradients. This process helps the model learn to work effectively with the reduced-precision representation. Because applying QAT drastically slows training, most QAT models are trained at full precision first and then fine-tuned with QAT applied.
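As a generic illustration (this is PyTorch's eager-mode QAT flow on a toy classifier, not Meta's Llama QAT recipe), fake-quantization modules are inserted before fine-tuning, and the model is converted to a real INT8 model afterwards.

```python
import torch
from torch.ao.quantization import QuantStub, DeQuantStub, get_default_qat_qconfig, prepare_qat, convert

class TinyClassifier(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.quant, self.dequant = QuantStub(), DeQuantStub()
        self.fc1, self.fc2 = torch.nn.Linear(128, 128), torch.nn.Linear(128, 10)

    def forward(self, x):
        x = self.quant(x)
        x = self.fc2(torch.relu(self.fc1(x)))
        return self.dequant(x)

model = TinyClassifier().train()
model.qconfig = get_default_qat_qconfig("x86")   # "fbgemm" on older PyTorch releases
model_qat = prepare_qat(model)                   # inserts fake-quant ops into the forward pass

optimizer = torch.optim.SGD(model_qat.parameters(), lr=1e-3)
for _ in range(10):                              # short fine-tuning loop with quantization simulated
    x, y = torch.randn(16, 128), torch.randint(0, 10, (16,))
    loss = torch.nn.functional.cross_entropy(model_qat(x), y)
    optimizer.zero_grad(); loss.backward(); optimizer.step()

model_int8 = convert(model_qat.eval())           # real INT8 model for deployment
print(model_int8(torch.randn(4, 128)).shape)
```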

More details about these methods and how they can be applied to different types of models can be found in the official PyTorch documentation. Additionally, a blog post demonstrating an end-to-end solution for QAT compares the effectiveness of common quantization methods on Llama 3, and the results and code to evaluate can be found in this GitHub repository.

Quantization frameworks and tools

PyTorch quantization with TorchAO

The TorchAO library offers several methods for quantization, each with different schemes for how the activations and weights are quantized. It distinguishes between two main types of quantization: weight-only quantization and dynamic quantization.

| Quantization Type | Bit Depth | Calibration Required | Accuracy Enhancement | Inference Performance |
|---|---|---|---|---|
| Weight-only | 4 or 8 | No | No | Same |
| Weight-only (GPTQ) | 4 | Yes | Yes | Same |
| Dynamic | 8 | No | No | Same |
| Dynamic (SmoothQuant) | 8 | Yes | Yes | Slightly slower |

The TorchAO library offers a simple API to test different methods, along with automatic selection of the best quantization for a given model, known as autoquantization. This API chooses the fastest form of quantization between 8-bit dynamic and 8-bit weight-only quantization. It first identifies the shapes of the activations that the different linear layers see, then benchmarks these shapes across different types of quantized and non-quantized layers in order to pick the fastest one. It also composes with torch.compile() to generate fast kernels.
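A sketch of the autoquantization flow, following the pattern in the torchao README (the exact API surface may differ between torchao releases, and a real Llama checkpoint would be loaded in place of the toy model):

```python
import torch
import torchao

# Stand-in for a real model, e.g. one loaded with AutoModelForCausalLM.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024),
).to("cuda")

# autoquant wraps the model so that, on first use, it records the activation shapes
# seen by each linear layer, benchmarks the candidate quantized kernels for those
# shapes, and keeps the fastest option. Composing with torch.compile generates
# the fused kernels.
model = torchao.autoquant(torch.compile(model, mode="max-autotune"))

x = torch.randn(16, 1024, device="cuda")
model(x)   # first call triggers benchmarking; later calls use the chosen kernels
```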

For additional information on torch.compile(), please see this general tutorial.

Note: The TorchAO library is in beta phase and in active development; API changes are expected.

HF-supported quantization

Hugging Face (HF) offers multiple ways to do LLM quantization with their transformers library. For additional guidance and examples on how to use each of these beyond the brief summary presented here, please refer to their quantization guide and the transformers quantization configuration documentation. Hugging Face also provides a quantization guide demonstrating how to quantize the Llama 3 family of models. The llama-cookbook code uses bitsandbytes 8-bit quantization to load the models, both for inference and fine-tuning.

TorchAO

The Hugging Face Transformers library supports TorchAO (PyTorch Architecture Optimization). As described above, TorchAO enables you to quantize and sparsify weights, gradients, optimizer states, and activations. TorchAO supports custom data types and optimizations, and you can use it for both training and inference.

from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer

See the Hugging Face page for more information and examples that describe their support for TorchAO.
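A hedged sketch of what that usually looks like end to end (the available quant_type values and keyword arguments depend on your transformers and torchao versions, and the checkpoint name is just an example of a gated Llama model you would need access to):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"   # any causal LM checkpoint you can access

# Ask transformers to quantize the weights with torchao while loading.
quant_config = TorchAoConfig("int4_weight_only", group_size=128)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    quantization_config=quant_config,
)

inputs = tokenizer("Quantization reduces", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```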

Quanto

Quanto is a versatile PyTorch quantization toolkit that uses linear quantization. It provides features such as weight quantization, activation quantization, and compatibility with various devices and modalities. It supports quantization-aware training (QAT) and is easy to integrate with custom kernels for specific devices. More details can be found in the announcement blog, GitHub repository, and HF guide.
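A minimal sketch of the Quanto workflow, assuming the optimum-quanto package (import paths have moved between the original quanto package and optimum.quanto):

```python
import torch
from optimum.quanto import quantize, freeze, qint8

# Toy model standing in for an LLM; quantize() replaces Linear layers with quantized versions.
model = torch.nn.Sequential(
    torch.nn.Linear(256, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 256),
)

quantize(model, weights=qint8)   # activations=qint8 would also quantize activations
freeze(model)                    # materialize the quantized weights

print(model(torch.randn(4, 256)).shape)
```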

AQLM

Additive Quantization of Language Models (AQLM) is a compression method for LLMs. It quantizes multiple weights together, taking advantage of interdependencies between them. AQLM represents groups comprising 8 to 16 weights each as a sum of multiple vector codes. This library supports fine-tuning its quantized models with Parameter-Efficient Fine-Tuning (PEFT) and LoRA by integrating into HF's PEFT library as well. More details can be found in the GitHub repository.

AWQ

Activation-aware Weight Quantization (AWQ) preserves the small percentage of weights that are most important for LLM performance, reducing quantization loss. This allows models to run in 4-bit precision without significant degradation in model performance. The HF Transformers library supports loading models quantized with the llm-awq and AutoAWQ libraries. More details on how to load them with the HF Transformers library can be found in the HF guide.
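Loading a pre-quantized AWQ checkpoint through transformers typically looks like the sketch below; the repository name here is a placeholder, and autoawq must be installed so transformers can dispatch to the AWQ kernels.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder name: substitute a real AWQ-quantized Llama checkpoint from the Hub.
model_id = "your-org/Llama-3.1-8B-Instruct-AWQ-INT4"

# The quantization config stored inside the checkpoint tells transformers the
# weights are 4-bit AWQ, so no extra quantization arguments are needed here.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
```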

AutoGPTQ

The AutoGPTQ library implements the GPTQ algorithm, a post-training quantization technique where each row of the weight matrix is quantized independently. These weights are quantized to INT4, but they are restored to FP16 on the fly during inference, saving 4x in memory usage for weight storage. More details can be found in the GitHub repository.
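Quantizing a model with GPTQ through transformers is usually driven by a GPTQConfig, as in the sketch below (a GPU and the optimum plus AutoGPTQ/GPTQModel packages are assumed; the checkpoint name is just an example of a model you would need access to):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"   # any causal LM checkpoint you can access
tokenizer = AutoTokenizer.from_pretrained(model_id)

# GPTQ needs a small calibration dataset; "c4" is one of the built-in choices.
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

# Weights are quantized to INT4 layer by layer while the model loads; at inference
# they are dequantized to FP16 on the fly, as described above.
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", quantization_config=gptq_config
)
```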

BitsAndBytes

BitsAndBytes is an easy option for quantizing a model to 8-bit and 4-bit. The library supports any model in any modality, as long as it supports loading with Hugging Face Accelerate and contains torch.nn.Linear layers. It also provides features for offloading weights between the CPU and GPU to support fitting very large models into memory, adjusting the outlier threshold for 8-bit quantization, skipping module conversion for certain models, and fine-tuning with 8-bit and 4-bit weights. For 4-bit models, it allows changing the compute data type, using the Normal Float 4 (NF4) data type for weights initialized from a normal distribution, and using nested quantization to save additional memory with no performance cost. More details can be found in the HF guide.
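A short sketch of the 4-bit options mentioned above (NF4 weights, nested quantization, and a bfloat16 compute dtype), using the standard BitsAndBytesConfig path in transformers; the checkpoint name is just an example:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NF4 for weights initialized from a normal distribution
    bnb_4bit_use_double_quant=True,        # nested quantization saves extra memory
    bnb_4bit_compute_dtype=torch.bfloat16, # compute data type for the dequantized matmuls
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",    # any causal LM checkpoint you can access
    device_map="auto",
    quantization_config=bnb_config,
)
```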

Tradeoffs of quantization

When choosing a quantization method, it is important to consider the trade-off between model accuracy and performance. Static quantization is generally more accurate than dynamic quantization, but requires a calibration step and more time to quantize the model. Dynamic quantization is faster to set up, but may cost some accuracy and adds overhead at inference. QAT typically preserves the most accuracy at low precision, but requires additional training time.

When evaluating quantization, it is important to consider the following:

  • Memory usage: Quantization significantly reduces the memory footprint of models. In the best case, for models trained in FP16 (16-bit), converting to INT8 (8-bit) reduces memory usage by 50%, while INT4 (4-bit) reduces it by 75%; a back-of-the-envelope calculation is sketched after this list. This is crucial for deploying models on devices with limited memory or for serving multiple models simultaneously. Because actual memory savings can be less than the computed savings, you should test actual memory usage on real hardware when comparing models.
  • Inference time: Quantized models typically run faster due to reduced memory bandwidth requirements and the ability to use optimized integer operations. The speedup varies depending on the hardware and quantization method. For example, INT8 quantization can provide a 2-4x speedup on modern hardware, while INT4 can offer even greater speedups. However, the actual performance gain depends on the specific model architecture, hardware capabilities, and whether the operations are memory-bound or compute-bound.
  • Accuracy and quality: The impact of quantization on model quality must be carefully evaluated. You should compare training metrics like perplexity and accuracy, as well as any domain-specific metrics you are using in your application. For a more detailed guide on evaluating your models, see the evaluation guide.
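As a back-of-the-envelope illustration of the memory point above (weights only, ignoring the KV cache, activations, and runtime overhead, and using an 8B-parameter model purely as an example):

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate storage for the weights alone, in gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gb(8e9, bits):.0f} GB")
# FP16: 16 GB, INT8: 8 GB (-50%), INT4: 4 GB (-75%)
```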

Performance optimization

Performance optimization beyond quantization is a broad topic, and this section covers only a few of the most common techniques. Many of these techniques are enabled by default for major runtimes. If latency performance is a top priority, there are hosting options available that provide custom accelerated inference hardware, such as Groq and Cerebras.

Batch sizes

Batching involves processing multiple input sequences simultaneously to maximize hardware utilization, particularly on GPUs. By grouping requests, the computational cost is spread across more data, which improves throughput (the number of requests processed per second). However, this comes at the cost of increased latency for individual requests, as the system must wait for the entire batch to be processed. For real-time applications, smaller batch sizes (even a batch size of one) are preferred to minimize latency, while for offline processing or high-throughput scenarios, larger batch sizes are more efficient. Inference servers often use dynamic batching, where incoming requests are automatically grouped to balance throughput and latency. For maximum throughput, you typically want the largest batch size that fits in your GPU's memory.

KV caching

In transformer models, generating each new token requires attending to all previous tokens in the sequence. The Key (K) and Value (V) projections for each token are constant once computed. KV caching is a crucial optimization that stores these K and V tensors in GPU memory after they are computed for the first time. For subsequent token generation steps, the model can reuse these cached values instead of recomputing them, significantly reducing the amount of computation and speeding up inference. While this technique dramatically improves inference speed, the cache itself consumes a large amount of memory, which can be a limiting factor for long sequences or large batch sizes.
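To see why the cache becomes the limiting factor, the sketch below estimates its size from the model shape; the parameter values are illustrative (roughly an 8B-class model with grouped-query attention and FP16 cache entries), not an official specification.

```python
def kv_cache_gb(batch: int, seq_len: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_elem: int = 2) -> float:
    """K and V tensors cached for every layer, KV head, and token position."""
    return 2 * batch * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem / 1e9

# Roughly 8B-class shapes: 32 layers, 8 KV heads, head dimension 128, FP16 cache.
print(f"{kv_cache_gb(batch=8, seq_len=8192, n_layers=32, n_kv_heads=8, head_dim=128):.1f} GB")
```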

Fused kernels

Fused kernels combine multiple individual operations (e.g., matrix multiplication, addition, and an activation function) into a single computational kernel that runs on the GPU. This optimization reduces the overhead associated with launching multiple separate kernels and minimizes data movement between the GPU's high-bandwidth memory and its on-chip memory. By keeping intermediate data within the GPU's fastest memory caches, fused kernels can provide substantial speedups. A prominent example is FlashAttention, which fuses the entire attention mechanism into a single kernel, avoiding the need to read and write the large attention matrix to and from memory. Compilers like torch.compile can automatically perform kernel fusion, simplifying the process of applying this optimization.
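As a concrete example of relying on fused kernels, PyTorch's scaled_dot_product_attention dispatches to a fused implementation (such as FlashAttention or memory-efficient attention) when the hardware and dtypes allow it; the shapes below are arbitrary.

```python
import torch
import torch.nn.functional as F

# Batch of 1, 8 heads, 1024 positions, head dimension 128, on a CUDA device.
q, k, v = (torch.randn(1, 8, 1024, 128, device="cuda", dtype=torch.float16) for _ in range(3))

# With a fused backend, the full 1024x1024 attention matrix is never written out
# to global memory; scores, softmax, and the value-weighted sum happen in one kernel.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)
```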

Balancing cost, latency, and quality

When hosting your own models, understanding your workloads can enable you to make the appropriate tradeoffs between cost, latency, throughput, and quality. Latency in LLMs comes in two forms—overall latency (the time to return the entire response) and first token latency (the time to return the first token of the response).

  • The more quantization you apply, the cheaper the model will be to serve and the lower the overall and first token latency, but the lower the quality.
  • Batching requests can dramatically improve throughput and cost efficiency. However, it increases first token latency, which may not be suitable for real-time applications. Quality will not be affected.
  • Using caching improves overall latency at the expense of higher memory usage (thus requiring more expensive hardware). Throughput will be higher, and so the cost tradeoff will vary based on workload and hardware costs. Quality will not be affected.
  • For specific applications, consider using model distillation to enable you to serve a smaller model at a similar quality. This will decrease latency and increase throughput, as well as reduce serving costs.

Additional resources

  • Quantization tutorials: Get started using TorchAO for quantization with the getting started guide or browse the documentation for additional tutorials.
  • Official Meta quantization releases: Meta publishes quantized versions of the lightweight 1B and 3B Llama models, in both QAT+LoRA and SpinQuant variants. For more information, including evaluation results, see the release blog post.
  • Evaluating Llama 3 quantization: The community has conducted studies on the effectiveness of common quantization methods on Llama 3; the results and code can be found on GitHub.