Quantization is a technique used in machine learning to reduce the computational and memory requirements of models, making them more efficient to deploy on servers and edge devices. It involves representing model weights and activations, typically stored as 32-bit floating-point numbers, with lower-precision data types, ranging from 16 bits down to as few as 2 or 3 bits. The benefits of quantization include smaller model sizes and faster inference, which is particularly valuable in resource-constrained environments.
Quantization can be viewed as a form of lossy compression, where precision is reduced to shrink the overall model size. Neural networks are built from vast numbers of connected units, each with a weight and each producing activations during computation. Both weights and activations are typically represented as 32-bit floating-point numbers, which are very precise but also relatively large. Because neural networks are highly redundant, exact values are often not critical to producing the correct output. Quantization takes advantage of this by using fewer bits where possible, reducing the information needed to store and process the network.
This loss of precision comes with a trade-off: the fewer bits you use, the greater the reduction in model quality.
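To make the idea concrete, below is a minimal sketch of 8-bit affine quantization applied to a single tensor. It is illustrative only; production libraries apply per-channel or per-group variants of the same scheme.

```python
import torch

# A toy weight tensor in 32-bit floating point.
w = torch.randn(4, 4)

# Affine (asymmetric) quantization: map the observed [min, max] range
# of the tensor onto the 8-bit integer range [0, 255].
qmin, qmax = 0, 255
scale = (w.max() - w.min()) / (qmax - qmin)
zero_point = qmin - torch.round(w.min() / scale)

# Quantize: store only 8-bit integers plus the scale and zero point.
w_q = torch.clamp(torch.round(w / scale) + zero_point, qmin, qmax).to(torch.uint8)

# Dequantize: recover an approximation of the original weights.
w_dq = (w_q.float() - zero_point) * scale

print("max reconstruction error:", (w - w_dq).abs().max().item())
```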
Due to the substantial savings in memory and compute requirements offered by quantization, it remains an active field of research. This section provides a brief overview of the different quantization methods, and then discusses the different frameworks and tools available for quantization. For a more in-depth treatment, you can begin with a survey paper on quantization, and then explore some of the latest methods such as AWQ, GPTQ, and SpinQuant. A brief overview table of the different methods is below, followed by a more detailed description.
| Quantization Type | Description | Weight Precision | Activation Precision |
|---|---|---|---|
| Weight-only | Only the weights are quantized after training; activations remain full-precision. | INT8, INT4, INT2 | Not quantized |
| Dynamic | Weights are pre-quantized; activations are quantized on-the-fly during inference. | INT8, INT4 | INT8, FP16 |
| Static | Weights and activations are quantized ahead of time after calibration with a representative dataset. | INT8, INT4 | INT8, INT4 |
| Quantization-aware Training | Simulates quantization during training so the model adapts to reduced precision. | INT8, INT4, INT2 | INT8, INT4, FP16 |
In post-training weight-only quantization, the weights of a trained neural network are quantized to lower precision (from 32-bit or 16-bit floating point down to 8-bit integers or fewer bits) after the training process is complete. This quantization step does not require access to the original training data, as it operates only on the already-learned weights. In this approach, only the weights are quantized—activations remain in full precision during inference.
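As an illustration of the idea (not a production implementation), the sketch below quantizes only the weight matrix of a linear layer, using a symmetric INT8 scale per output channel, while activations flow through in full precision:

```python
import torch

layer = torch.nn.Linear(256, 128)

# Per-output-channel symmetric INT8 quantization of the weight matrix only.
w = layer.weight.detach()                            # [out_features, in_features]
scale = w.abs().amax(dim=1, keepdim=True) / 127.0    # one scale per output channel
w_int8 = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)

# At inference time the weights are dequantized (or consumed by INT8 kernels);
# the activations stay in full precision.
x = torch.randn(1, 256)
y = x @ (w_int8.float() * scale).t() + layer.bias
```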
In dynamic quantization, the weights of the network are quantized after training, as in weight-only quantization. Activations, however, are quantized on the fly during inference, using statistics from the batch of data being processed. This makes dynamic quantization particularly useful when the model must handle a wide range of input distributions, but it introduces some computational overhead at inference time, because the quantization parameters for the activations must be recalculated for each batch or input.
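PyTorch exposes this workflow directly; the sketch below applies dynamic INT8 quantization to the linear layers of a small model (module paths follow current torch.ao.quantization naming, which may differ in older releases):

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(256, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
).eval()

# Weights are quantized ahead of time; activation quantization parameters
# are computed on the fly from each batch at inference time.
quantized_model = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

with torch.inference_mode():
    out = quantized_model(torch.randn(8, 256))
```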
In static quantization, a calibration dataset is used to determine how best to quantize the model, especially the activations, which are more sensitive to dynamic range. Because both the weights and activations are quantized after training completes but before deployment, the model is faster at inference and potentially more accurate than with dynamic quantization, provided that the calibration dataset is representative of the data the model will see in production. The key advantage of static quantization is that all quantization parameters are fixed after calibration, eliminating on-the-fly calculations during inference. However, this method requires careful calibration with a representative dataset to ensure good performance.
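A sketch of PyTorch's eager-mode post-training static quantization flow is shown below; details such as backend selection and module fusion are omitted, and random tensors stand in for a real calibration dataset:

```python
import torch

model = torch.nn.Sequential(
    torch.ao.quantization.QuantStub(),
    torch.nn.Linear(256, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
    torch.ao.quantization.DeQuantStub(),
).eval()

# Attach a quantization configuration and insert observers.
model.qconfig = torch.ao.quantization.get_default_qconfig("fbgemm")
prepared = torch.ao.quantization.prepare(model)

# Calibrate: run representative data through the model so the observers
# can record the dynamic range of the activations.
with torch.inference_mode():
    for _ in range(32):
        prepared(torch.randn(8, 256))

# Convert to a quantized model whose quantization parameters are now fixed.
quantized = torch.ao.quantization.convert(prepared)
```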
Quantization-aware training (QAT) is a technique that simulates the effects of quantization during the training process itself. Unlike post-training quantization methods (static and dynamic quantization), QAT allows the model to learn to adapt to the reduced precision by incorporating the quantization operations directly into the training process. This approach typically results in better model performance compared to post-training quantization, as the model can learn to compensate for the precision loss.
During QAT, the model's weights and activations are quantized and dequantized in the forward pass, while the backward pass uses the full-precision gradients. This process helps the model learn to work effectively with the reduced-precision representation. Typically, applying QAT drastically slows training, and so most QAT models are trained at full-precision and then fine-tuned with QAT applied.
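Conceptually, QAT inserts "fake quantization" operations like the minimal sketch below: the forward pass sees quantize-dequantized values, while the straight-through estimator lets full-precision gradients flow in the backward pass. (PyTorch packages this machinery in its quantization-aware training utilities; the hand-rolled version here is for illustration only.)

```python
import torch

def fake_quant(x, num_bits=8):
    # Quantize and immediately dequantize so the forward pass "feels"
    # the rounding error that real quantization would introduce.
    qmax = 2 ** (num_bits - 1) - 1
    scale = x.detach().abs().amax() / qmax
    x_q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale
    # Straight-through estimator: gradients behave as if quantization
    # were the identity function.
    return x + (x_q - x).detach()

w = torch.randn(128, 128, requires_grad=True)
x = torch.randn(32, 128)
loss = (x @ fake_quant(w).t()).pow(2).mean()
loss.backward()   # w.grad is a full-precision gradient
```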
More details about these methods and how they can be applied to different types of models can be found in the official PyTorch documentation. Additionally, a blog post demonstrating an end-to-end solution for QAT compares the effectiveness of common quantization methods on Llama 3, and the results and code to evaluate can be found in this GitHub repository.
The TorchAO library offers several methods for quantization, each with different schemes for how the activations and weights are quantized. It distinguishes between two main types of quantization: weight-only quantization and dynamic quantization.
| Quantization Type | Bit Depth | Calibration Required | Accuracy Enhancement | Inference Performance |
|---|---|---|---|---|
| Weight-only | 4 or 8 | No | No | Same |
| Weight-only (GPTQ) | 4 | Yes | Yes | Same |
| Dynamic | 8 | No | No | Same |
| Dynamic (smoothquant) | 8 | Yes | Yes | Slightly slower |
The TorchAO library offers a simple API for testing different methods, along with automatic detection of the best quantization for a given model, known as autoquantization. This API chooses the fastest form of quantization from 8-bit dynamic and 8-bit weight-only quantization: it first identifies the shapes of the activations that the different linear layers see, then benchmarks those shapes across quantized and non-quantized layer variants in order to pick the fastest one. It also composes with torch.compile() to generate fast kernels.
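A sketch of autoquantization following the library's documented usage is below; because TorchAO is in beta (see the note that follows), the exact entry point may shift between releases, and the example assumes a CUDA GPU with bfloat16 support:

```python
import torch
import torchao

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024),
).to(device="cuda", dtype=torch.bfloat16)

# Wrap the compiled model; autoquant benchmarks the candidate quantization
# schemes for each linear layer and keeps the fastest option found.
model = torchao.autoquant(torch.compile(model, mode="max-autotune"))

# The benchmarking runs on the first real inputs the model sees.
x = torch.randn(16, 1024, device="cuda", dtype=torch.bfloat16)
model(x)
```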
For additional information on torch.compile(), please see this general tutorial.
Note: The TorchAO library is in beta phase and in active development; API changes are expected.
Hugging Face (HF) offers multiple ways to do LLM quantization with their transformers library. For additional guidance and examples on how to use each of these beyond the brief summary presented here, please refer to their quantization guide and the transformers quantization configuration documentation. Hugging Face also provides a quantization guide demonstrating how to quantize the Llama 3 family of models. The llama-cookbook code uses bitsandbytes 8-bit quantization to load the models, both for inference and fine-tuning.
The Hugging Face Transformers library supports TorchAO (PyTorch Architecture Optimization). As described above, TorchAO enables you to quantize and sparsify weights, gradients, optimizers, and activations, and it supports custom data types and optimizations. You can use TorchAO for both training and inference.
```python
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
```
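Building on that import, here is a minimal sketch of how the integration is typically wired together. The model id is only a placeholder, and the exact configuration arguments can vary across transformers and TorchAO versions:

```python
import torch
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B"  # placeholder; any causal LM works

# Quantize linear-layer weights to INT4 as the model is loaded.
quantization_config = TorchAoConfig("int4_weight_only", group_size=128)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=quantization_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Quantization reduces", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```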
See the Hugging Face page for more information and examples that describe their support for TorchAO.
Quanto is a versatile PyTorch quantization toolkit that uses linear quantization. It provides features such as weight quantization, activation quantization, and compatibility with various devices and modalities. It supports quantization-aware training (QAT) and is easy to integrate with custom kernels for specific devices. More details can be found in the announcement blog, GitHub repository, and HF guide.
Additive Quantization of Language Models (AQLM) is a compression method for LLMs. It quantizes multiple weights together, taking advantage of interdependencies between them: groups of 8 to 16 weights are each represented as a sum of multiple vector codes. The library also supports fine-tuning its quantized models with Parameter-Efficient Fine-Tuning (PEFT) and LoRA through an integration with HF's PEFT library. More details can be found in the GitHub repository.
Activation-aware Weight Quantization (AWQ) preserves a small percentage of weights that are important for LLM performance, reducing quantization loss. This allows models to run in 4-bit precision without experiencing significant model performance degradation. The HF Transformers library supports loading models quantized with the llm-awq and vLLM autoawq libraries. More details on how to load them with the HF Transformers library can be found in the HF guide.
The AutoGPTQ library implements the GPTQ algorithm, a post-training quantization technique where each row of the weight matrix is quantized independently. These weights are quantized to INT4, but they are restored to FP16 on the fly during inference, saving 4x in memory usage for weight storage. More details can be found in the GitHub repository.
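GPTQ quantization can also be triggered directly from the Transformers library; the sketch below quantizes a model to INT4 at load time using a calibration dataset. The model id is a placeholder, and the optimum package plus a GPTQ backend (such as auto-gptq or gptqmodel) are assumed to be installed:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "meta-llama/Llama-3.2-1B"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Quantize the weight matrices to INT4 row by row, using "c4" samples for calibration.
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=gptq_config,
)
```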
BitsAndBytes is an easy option for quantizing a model to 8-bit and 4-bit. The library supports any model in any modality, as long as it supports loading with Hugging Face Accelerate and contains torch.nn.Linear layers. It also provides features for offloading weights between the CPU and GPU to support fitting very large models into memory, adjusting the outlier threshold for 8-bit quantization, skipping module conversion for certain models, and fine-tuning with 8-bit and 4-bit weights. For 4-bit models, it allows changing the compute data type, using the Normal Float 4 (NF4) data type for weights initialized from a normal distribution, and using nested quantization to save additional memory with no performance cost. More details can be found in the HF guide.
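The sketch below loads a model with the 4-bit options just described (NF4 weights, a bfloat16 compute data type, and nested quantization); the model id is a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 weights, bfloat16 compute, and nested (double) quantization.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",  # placeholder model id
    device_map="auto",
    quantization_config=bnb_config,
)
```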
When choosing a quantization method, it is important to consider the trade-off between model accuracy and performance. Static quantization generally gives faster inference than dynamic quantization and can be more accurate when the calibration data is representative, but it takes more effort to quantize the model. Dynamic quantization is quicker to apply, but recalculating activation parameters adds overhead at inference time and may cost some accuracy. QAT typically preserves the most accuracy at low bit widths, but requires additional training time.
When evaluating quantization, it is important to consider the impact on model quality, the change in inference latency and throughput, and the reduction in memory footprint.
Performance optimization beyond quantization is a broad topic, and this section covers only a few of the most common techniques. Many of these techniques are enabled by default for major runtimes. If latency performance is a top priority, there are hosting options available that provide custom accelerated inference hardware, such as Groq and Cerebras.
Batching involves processing multiple input sequences simultaneously to maximize hardware utilization, particularly on GPUs. By grouping requests, the computational cost is spread across more data, which improves throughput (the number of requests processed per second). However, this comes at the cost of increased latency for individual requests, as the system must wait for the entire batch to be processed. For real-time applications, smaller batch sizes or even a batch size of one is preferred to minimize latency, while for offline processing or high-throughput scenarios, larger batch sizes are more efficient. Inference servers often use dynamic batching, where incoming requests are automatically grouped to balance throughput and latency. For maximum throughput, you typically want to use the largest batch size that can fit in your GPU's memory.
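For example, with the Transformers library, several prompts can be padded to a common length and generated in a single call; the model id below is a placeholder, and left padding is used because the models are decoder-only:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Process several prompts in one forward pass instead of one at a time.
prompts = [
    "Translate to French: good morning",
    "Summarize in one sentence: the quick brown fox jumps over the lazy dog.",
    "2 + 2 =",
]
batch = tokenizer(prompts, padding=True, return_tensors="pt").to(model.device)
outputs = model.generate(**batch, max_new_tokens=32)
for text in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(text)
```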
In transformer models, generating each new token requires attending to all previous tokens in the sequence. The Key (K) and Value (V) projections for each token are constant once computed. KV caching is a crucial optimization that stores these K and V tensors in GPU memory after they are computed for the first time. For subsequent token generation steps, the model can reuse these cached values instead of recomputing them, significantly reducing the amount of computation and speeding up inference. While this technique dramatically improves inference speed, the cache itself consumes a large amount of memory, which can be a limiting factor for long sequences or large batch sizes.
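The greedy decoding loop below makes the mechanism explicit: after the first step, only the newest token is fed to the model, and the cached K/V tensors for earlier tokens are reused. The model id is a placeholder; in practice, generate() manages this cache for you.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto").eval()

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids.to(model.device)

past_key_values = None
with torch.inference_mode():
    for _ in range(10):
        # With a populated cache, only the most recent token needs to be processed.
        step_input = input_ids if past_key_values is None else input_ids[:, -1:]
        out = model(step_input, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values        # reuse cached K/V next step
        next_token = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=-1)

print(tokenizer.decode(input_ids[0]))
```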
Fused kernels combine multiple individual operations (e.g., matrix multiplication, addition, and an activation function) into a single computational kernel that runs on the GPU. This optimization reduces the overhead associated with launching multiple separate kernels and minimizes data movement between the GPU's high-bandwidth memory and its on-chip memory. By keeping intermediate data within the GPU's fastest memory caches, fused kernels can provide substantial speedups. A prominent example is FlashAttention, which fuses the entire attention mechanism into a single kernel, avoiding the need to read and write the large attention matrix to and from memory. Compilers like torch.compile can automatically perform kernel fusion, simplifying the process of applying this optimization.
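The sketch below contrasts an unfused attention computation, which materializes the full attention matrix, with PyTorch's scaled_dot_product_attention, which can dispatch to a fused FlashAttention-style kernel when one is available (a CUDA GPU and half-precision inputs are assumed):

```python
import torch
import torch.nn.functional as F

# [batch, heads, sequence length, head dimension]
q = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)

# Unfused: materializes the full 1024 x 1024 attention matrix in memory.
attn = torch.softmax(q @ k.transpose(-2, -1) / 64 ** 0.5, dim=-1) @ v

# Fused: a single kernel keeps intermediate results in on-chip memory.
attn_fused = F.scaled_dot_product_attention(q, k, v)

print("max difference:", (attn - attn_fused).abs().max().item())
```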
When hosting your own models, understanding your workloads can enable you to make the appropriate tradeoffs between cost, latency, throughput, and quality. Latency in LLMs comes in two forms—overall latency (the time to return the entire response) and first token latency (the time to return the first token of the response).
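Both forms of latency can be measured with a streaming interface; the sketch below uses the Transformers TextIteratorStreamer to time the first generated chunk separately from the full response (the model id is a placeholder):

```python
import time
from threading import Thread
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

model_id = "meta-llama/Llama-3.2-1B"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Explain KV caching in one sentence.", return_tensors="pt").to(model.device)
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)

start = time.perf_counter()
Thread(target=model.generate, kwargs=dict(**inputs, streamer=streamer, max_new_tokens=64)).start()

first_token_latency = None
for chunk in streamer:                       # yields text as tokens are generated
    if first_token_latency is None and chunk:
        first_token_latency = time.perf_counter() - start
overall_latency = time.perf_counter() - start

print(f"first token: {first_token_latency:.2f}s, overall: {overall_latency:.2f}s")
```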