Quantization is a technique used in machine learning to reduce the computational and memory requirements of models, making them more efficient for deployment on servers and edge devices. It involves representing model weights and activations, typically 32-bit floating-point numbers, with lower-precision data types such as 16-bit float, 16-bit brain float, 8-bit int, or even 4/3/2/1-bit int. The benefits of quantization include smaller model sizes, faster fine-tuning, and faster inference, all of which are particularly valuable in resource-constrained environments. However, the tradeoff is a reduction in model quality due to the loss of precision.
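To make the precision tradeoff concrete, here is a minimal sketch of symmetric, per-tensor 8-bit integer quantization in PyTorch. It is an illustration of the basic idea only; production schemes typically add per-channel scales, zero points, and calibration.

```python
import torch

def quantize_int8(weights: torch.Tensor):
    """Symmetric per-tensor int8 quantization: store int8 values plus one fp32 scale."""
    scale = weights.abs().max().clamp(min=1e-8) / 127.0  # map the largest magnitude to 127
    q = torch.clamp(torch.round(weights / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximation of the original fp32 values."""
    return q.to(torch.float32) * scale

w = torch.randn(4, 4)            # stand-in for fp32 model weights
q, scale = quantize_int8(w)      # int8 storage: 4x smaller than fp32
w_hat = dequantize(q, scale)
print((w - w_hat).abs().max())   # small but non-zero: the precision loss
```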
In a follow-up to Llama 3.2, Meta released quantized versions of the Llama 3.2 lightweight models (1B instruct and 3B instruct). Each model was quantized using two techniques, for a total of four quantized models:

* Quantization-Aware Training (QAT) combined with Low-Rank Adaptation (LoRA)
* SpinQuant
We will focus next on quantization tools available for Meta Llama models. This is a constantly evolving space: the libraries and methods detailed here are the most widely used at the moment, and they are subject to change as the ecosystem matures.
llama-cookbook
The llama-cookbook fine-tuning code uses bitsandbytes 8-bit quantization to load the models, both for inference and fine-tuning. (See below for more information about using the bitsandbytes library with Llama.)

```python
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
```
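These imports are the Hugging Face Transformers entry points for quantized loading: `TorchAoConfig` configures torchao-based quantization, while bitsandbytes 8-bit loading goes through `BitsAndBytesConfig`. The sketch below shows the 8-bit bitsandbytes path described above; the model id and generation settings are illustrative assumptions, not the cookbook's actual configuration.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.2-3B-Instruct"  # illustrative; any Llama checkpoint works

# Ask Transformers to load the weights in 8-bit via bitsandbytes
# (requires the bitsandbytes and accelerate packages to be installed).
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

prompt = "Quantization reduces model size by"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The torchao path follows the same pattern: construct a `TorchAoConfig` and pass it as `quantization_config` in place of the bitsandbytes config.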