Quantization

Quantization is a technique used in machine learning to reduce the computational and memory requirements of models, making them more efficient for deployment on servers and edge devices. It involves representing model weights and activations, which are typically 32-bit floating-point numbers, with lower-precision data types such as 16-bit float, 16-bit brain float, 8-bit int, or even 4/3/2/1-bit int. The benefits of quantization include smaller model sizes, faster fine-tuning, and faster inference, all of which are particularly valuable in resource-constrained environments. The tradeoff is a reduction in model quality due to the loss of precision.
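
To make the idea concrete, the sketch below applies simple per-tensor affine quantization to map a float32 tensor to int8 and back. It is purely illustrative (real libraries typically quantize per channel or per group), but it shows the 4x memory saving and the precision loss involved:

    import torch

    def quantize_int8(x: torch.Tensor):
        # Affine quantization: map the observed float range of x onto [-128, 127].
        scale = (x.max() - x.min()) / 255.0
        zero_point = (-128 - x.min() / scale).round()
        q = torch.clamp((x / scale + zero_point).round(), -128, 127).to(torch.int8)
        return q, scale, zero_point

    def dequantize_int8(q, scale, zero_point):
        # Recover an approximation of the original float values.
        return (q.float() - zero_point) * scale

    w = torch.randn(4096, 4096)          # float32 weight matrix
    q, scale, zp = quantize_int8(w)      # int8 version: 4x smaller in memory
    error = (w - dequantize_int8(q, scale, zp)).abs().mean()
    print(f"mean absolute quantization error: {error.item():.6f}")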

Llama 3.2 Quantized Models

In a follow-up to Llama 3.2, Meta released quantized versions of the Llama 3.2 lightweight models (1B instruct and 3B instruct). Each model was quantized using two techniques for a total of four quantized models.

  • QAT+LoRA*
  • SpinQuant

*Quantization-Aware Training (QAT) combined with Low Rank Adaptation (LoRA)

Meta has open-sourced SpinQuant for use by the community.
For more information about the quantized release, see the updated Llama 3.2 documentation.

Supported quantization modes in PyTorch

  • Post-Training Dynamic Quantization: Weights are pre-quantized ahead of time and activations are converted to int8 during inference, just before computation. This results in faster computation due to efficient int8 matrix multiplication and maintains accuracy on the activation layer.
  • Post-Training Static Quantization: This technique improves performance by converting networks to use both integer arithmetic and int8 memory accesses. It involves feeding batches of data through the network and computing the resulting distributions of the different activations. This information is used to determine how the different activations should be quantized at inference time.
  • Quantization Aware Training (QAT): In QAT, all weights and activations are "fake quantized" during both the forward and backward passes of training. This means float values are rounded to mimic int8 values, but all computations are still done with floating point numbers. This method usually yields higher accuracy than the other two methods as all weight adjustments during training are made while "aware" of the fact that the model will ultimately be quantized.

More details about these methods and how they can be applied to different types of models can be found in the official PyTorch documentation. Additionally, the community has already conducted studies on the effectiveness of common quantization methods on Meta Llama 3, and the results and code to evaluate can be found in this GitHub repository.
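
As a minimal example of the first mode, post-training dynamic quantization can be applied to the linear layers of any eager-mode PyTorch model in a couple of lines. The toy model below stands in for a Transformer block or full LLM:

    import torch
    import torch.nn as nn

    # A small stand-in model; in practice this would be a full Llama model.
    model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024)).eval()

    # Post-training dynamic quantization: weights become int8 ahead of time,
    # activations are converted to int8 on the fly at inference.
    quantized_model = torch.ao.quantization.quantize_dynamic(
        model, {nn.Linear}, dtype=torch.qint8
    )

    with torch.inference_mode():
        out = quantized_model(torch.randn(1, 1024))
    print(out.shape)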

We will focus next on quantization tools available for Meta Llama models. As this is a constantly evolving space, the libraries and methods detailed here are the most widely used at the moment and are subject to change as the space evolves.

PyTorch quantization with TorchAO

The TorchAO library offers several methods for quantization, each with different schemes for how the activations and weights are quantized. We distinguish between two main types of quantization: weight-only quantization and dynamic quantization.
For weight-only quantization, we support 8-bit and 4-bit quantization. The 4-bit quantization also has GPTQ support for improved accuracy, which requires calibration but delivers the same final performance.
For dynamic quantization, we support 8-bit activation quantization and 8-bit weight quantization. We also support this type of quantization with SmoothQuant for improved accuracy, which requires calibration and has slightly worse performance.
Additionally, the library offers a simple API to test different methods and automatic detection of the best quantization for a given model, known as autoquantization. This API chooses the fastest form of quantization out of the 8-bit dynamic and 8-bit weight-only quantization. It first identifies the shapes of the activations that the different linear layers see, then benchmarks these shapes across different types of quantized and non-quantized layers in order to pick the fastest one. It also composes with torch.compile() to generate fast kernels. For additional information on torch.compile, please see this general tutorial.
Note: This library is in beta phase and in active development; API changes are expected.
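
Because the API is still in flux, the sketch below should be treated as a rough example against one recent torchao release rather than a stable recipe; the exact entry points have moved between versions, so check the TorchAO README for the current names. The toy module stands in for a Llama model, and a GPU with bfloat16 support is assumed for good performance:

    import torch
    from torchao.quantization import quantize_, int8_weight_only

    # Any torch.nn.Module containing Linear layers; shown here with a toy module.
    model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).to(torch.bfloat16)

    # Weight-only int8 quantization applied in place to the Linear layers.
    # int4_weight_only and int8_dynamic_activation_int8_weight are other options.
    quantize_(model, int8_weight_only())

    # torch.compile generates the fused low-precision kernels.
    # Alternatively, torchao.autoquant(torch.compile(model)) benchmarks the
    # candidate quantizations per layer and picks the fastest automatically.
    model = torch.compile(model, mode="max-autotune")
    out = model(torch.randn(2, 1024, dtype=torch.bfloat16))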

HF supported quantization

Hugging Face (HF) offers multiple ways to do LLM quantization with their transformers library. For additional guidance and examples on how to use each of these beyond the brief summary presented here, please refer to their quantization guide and the transformers quantization configuration documentation. The llama-cookbook code uses bitsandbytes 8-bit quantization to load the models, both for inference and fine-tuning. (See below for more information about using the bitsandbytes library with Llama.)
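
The general pattern is the same across backends: pass a quantization config to from_pretrained. For example, 8-bit loading with bitsandbytes looks roughly like the following (the model id is illustrative and gated checkpoints require Hub access); the 4-bit options are covered in the BitsAndBytes section below:

    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-3.1-8B-Instruct",   # illustrative model id
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
        device_map="auto",
    )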

TorchAO

The Hugging Face Transformers library supports TorchAO (PyTorch Architecture Optimization). As described above, TorchAO enables you to quantize and sparsify weights, gradients, optimizer states, and activations. TorchAO supports custom data types and optimizations, and you can use it for both training and inference.
    from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
See the Hugging Face page for more information and examples that describe their support for TorchAO.
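
Building on that import, a minimal sketch of loading a Llama checkpoint with TorchAO int4 weight-only quantization through Transformers might look like this (the model id, group size, and generation prompt are illustrative, and the exact TorchAoConfig arguments depend on your transformers/torchao versions):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig

    model_id = "meta-llama/Llama-3.1-8B-Instruct"  # illustrative; requires Hub access

    quantization_config = TorchAoConfig("int4_weight_only", group_size=128)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        device_map="auto",
        quantization_config=quantization_config,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    inputs = tokenizer("Quantization reduces model size by", return_tensors="pt").to(model.device)
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))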

Quanto

Quanto is a versatile PyTorch quantization toolkit that uses linear quantization. It provides features such as weights quantization, activation quantization, and compatibility with various devices and modalities. It supports quantization-aware training and is easy to integrate with custom kernels for specific devices. More details can be found in the announcement blog, GitHub repository, and HF guide.
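
A minimal sketch of the direct Quanto workflow is shown below, assuming the optimum-quanto package; the function names follow its README and may differ between releases. A toy module stands in for a real model:

    import torch
    from optimum.quanto import quantize, freeze, qint8

    model = torch.nn.Sequential(
        torch.nn.Linear(512, 512), torch.nn.ReLU(), torch.nn.Linear(512, 512)
    )

    # Replace the Linear weights with int8 quantized versions (activations left in float here).
    quantize(model, weights=qint8, activations=None)
    # Freeze materializes the quantized weights in place so the model can be saved and served.
    freeze(model)

    print(model(torch.randn(1, 512)).shape)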

AQLM

Additive Quantization of Language Models (AQLM) is a compression method for LLMs. It quantizes multiple weights together, taking advantage of interdependencies between them: AQLM represents groups of 8 to 16 weights each as a sum of multiple vector codes. The library also integrates with HF's PEFT library, so its quantized models can be fine-tuned with Parameter-Efficient Fine-Tuning methods such as LoRA. More details can be found in the GitHub repository.
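
Loading an AQLM-quantized checkpoint follows the usual Transformers pattern once the aqlm package is installed; the repository id below is only an example of the naming used on the Hugging Face Hub, so check the Hub for currently available AQLM-quantized Llama checkpoints:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Illustrative repo id for an AQLM-quantized Llama model.
    model_id = "ISTA-DASLab/Meta-Llama-3-8B-AQLM-2Bit-1x16"
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")
    tokenizer = AutoTokenizer.from_pretrained(model_id)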

AWQ

Activation-aware Weight Quantization (AWQ) preserves a small percentage of weights that are important for LLM performance, reducing quantization loss. This allows models to run in 4-bit precision without experiencing performance degradation. Transformers supports loading models quantized with the llm-awq and autoawq libraries. More details on how to load them with the Transformers library can be found in the HF guide.
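
Loading a pre-quantized AWQ checkpoint is again a plain from_pretrained call once the autoawq package is installed; the repo id below is illustrative, so pick any AWQ-quantized Llama checkpoint from the Hugging Face Hub:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Illustrative repo id for an AWQ-quantized Llama model.
    model_id = "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4"
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")
    tokenizer = AutoTokenizer.from_pretrained(model_id)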

AutoGPTQ

The AutoGPTQ library implements the GPTQ algorithm, a post-training quantization technique where each row of the weight matrix is quantized independently. The weights are quantized to int4 but restored to fp16 on the fly during inference, reducing memory usage by roughly 4x. More details can be found in the GitHub repository.
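
Because GPTQ is a post-training method that needs calibration data, Transformers exposes it through GPTQConfig and runs the quantization while loading the model. A rough sketch (assuming the optimum and auto-gptq packages, a GPU, and an illustrative model id; quantization can take a while):

    from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

    model_id = "meta-llama/Llama-3.1-8B-Instruct"  # illustrative; requires Hub access
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # GPTQ needs a small calibration dataset; "c4" is one of the built-in options.
    gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

    # Quantization happens during loading.
    quantized_model = AutoModelForCausalLM.from_pretrained(
        model_id, device_map="auto", quantization_config=gptq_config
    )
    quantized_model.save_pretrained("llama-3.1-8b-instruct-gptq-4bit")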

BitsAndBytes

BitsAndBytes is an easy option for quantizing a model to 8-bit and 4-bit. The library supports any model in any modality, as long as it supports loading with Hugging Face Accelerate and contains torch.nn.Linear layers. It also provides features for offloading weights between the CPU and GPU to support fitting very large models into memory, adjusting the outlier threshold for 8-bit quantization, skipping module conversion for certain models, and fine-tuning with 8-bit and 4-bit weights. For 4-bit models, it allows changing the compute data type, using the Normal Float 4 (NF4) data type for weights initialized from a normal distribution, and using nested quantization to save additional memory at no additional performance cost. More details can be found in the HF guide.
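
The 4-bit options described above map directly onto BitsAndBytesConfig arguments. The sketch below enables NF4 weights, a bfloat16 compute data type, and nested ("double") quantization; the model id is illustrative:

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",                 # NF4 weights for normally distributed weights
        bnb_4bit_compute_dtype=torch.bfloat16,     # compute data type used during matmuls
        bnb_4bit_use_double_quant=True,            # nested quantization for extra memory savings
    )
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-3.1-8B-Instruct",        # illustrative model id
        quantization_config=bnb_config,
        device_map="auto",
    )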