The new quantized models are substantially faster than their non-quantized (BF16) counterparts, with a much lower memory footprint and lower power consumption. At the same time, they retain nearly the same accuracy as the non-quantized versions.
In addition, because these models were trained and evaluated using Meta’s data and frameworks, they have the same levels of trust and safety as other models in the Llama collection.
The quantized models are appropriate for any use case that involves constrained memory conditions or the need to conserve power. Typical environments include phones, tablets, and other edge devices, such as smart glasses.
Quantization-Aware Training (QAT) simulates the effects of quantization during the training of the Llama 3.2 models, which enables us to optimize their performance in low-precision environments. To initialize QAT, we use BF16 Llama 3.2 model checkpoints obtained after supervised fine-tuning (SFT), then perform an additional full round of SFT training with QAT. We then freeze the backbone of the QAT model and perform another round of SFT with low-rank adaptation (LoRA) adaptors applied to all layers within the transformer block. The LoRA adaptors' weights and activations are maintained in BF16, similar to QLoRA.
Finally, we fine-tune the resulting model (both backbone and LoRA adaptors) using direct preference optimization (DPO). The result is a highly efficient model that achieves accuracy that is competitive with the original BF16 model, while maintaining speed and a memory footprint comparable to other quantization methods.
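To make "simulating the effects of quantization during training" concrete, here is a minimal, self-contained PyTorch sketch of fake quantization with a straight-through estimator. It illustrates the general QAT idea only, not Meta's training code; the symmetric 4-bit per-tensor scheme and the FakeQuantLinear module are assumptions made for this example.

import torch
import torch.nn as nn

class FakeQuantize(torch.autograd.Function):
    """Quantize-dequantize weights in the forward pass; pass gradients straight through."""

    @staticmethod
    def forward(ctx, w, n_bits=4):
        qmax = 2 ** (n_bits - 1) - 1                    # e.g. 7 for symmetric 4-bit
        scale = w.abs().max().clamp(min=1e-8) / qmax
        w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
        return w_q * scale                              # dequantized weights used downstream

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: treat rounding as identity in the backward pass
        return grad_output, None

class FakeQuantLinear(nn.Linear):
    """Linear layer whose weights are fake-quantized during training (QAT-style)."""

    def forward(self, x):
        w = FakeQuantize.apply(self.weight)             # model "feels" quantization error while training
        return nn.functional.linear(x, w, self.bias)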
A key advantage of SpinQuant is its ability to operate without requiring access to training datasets, which are often private. This makes it an attractive solution for applications where data availability or computational resources are limited.
For both quantization methods, QAT+LoRA and SpinQuant, we used the following quantization scheme:
The lightweight models share many characteristics with the Llama 3.1 text-only models. For information that is applicable across both sets of models, see the following sections on the Llama 3.1 page.
Tool-calling with the lightweight models can be done in two ways: by passing the function definitions in the system prompt along with the user query in the user prompt, or by passing both the function definitions and the query in the user prompt. The first approach is shown below.
Set the function definitions
function_definitions = """[
    {
        "name": "get_user_info",
        "description": "Retrieve details for a specific user by their unique identifier. Note that the provided function is in Python 3 syntax.",
        "parameters": {
            "type": "dict",
            "required": [
                "user_id"
            ],
            "properties": {
                "user_id": {
                    "type": "integer",
                    "description": "The unique identifier of the user. It is used to fetch the specific user details from the database."
                },
                "special": {
                    "type": "string",
                    "description": "Any special information or parameters that need to be considered while fetching user details.",
                    "default": "none"
                }
            }
        }
    }
]
"""
Set the default system prompt
system_prompt = """You are an expert in composing functions. You are given a question and a set of possible functions.
Based on the question, you will need to make one or more function/tool calls to achieve the purpose.
If none of the functions can be used, point it out. If the given question lacks the parameters required by the function,
also point it out. You should only return the function call in tools call sections.
If you decide to invoke any of the function(s), you MUST put it in the format of [func_name1(params_name1=params_value1, params_name2=params_value2...), func_name2(params)]\n
You SHOULD NOT include any other text in the response.
Here is a list of functions in JSON format that you can invoke.\n\n{functions}\n""".format(functions=function_definitions)
Set the user query
query = "Can you retrieve the details for the user with the ID 7890, who has black as their special request?"
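For reference, here is a minimal sketch of how these three pieces could be assembled into a raw prompt string like the one shown next. The build_prompt helper is hypothetical, and the <|begin_of_text|> token is often added automatically by the tokenizer, so whether to include it yourself depends on your inference stack.

def build_prompt(system_prompt: str, query: str) -> str:
    """Assemble a single-turn Llama 3 style prompt from a system prompt and a user query."""
    return (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
        f"{system_prompt}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{query}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

prompt = build_prompt(system_prompt, query)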
With the above function definitions, system prompt, and user query, the input to the LLM looks like this:
<|start_header_id|>system<|end_header_id|>
You are an expert in composing functions. You are given a question and a set of possible functions.
Based on the question, you will need to make one or more function/tool calls to achieve the purpose.
If none of the functions can be used, point it out. If the given question lacks the parameters required by the function,
also point it out. You should only return the function call in tools call sections.
If you decide to invoke any of the function(s), you MUST put it in the format of [func_name1(params_name1=params_value1, params_name2=params_value2...), func_name2(params)]
You SHOULD NOT include any other text in the response.
Here is a list of functions in JSON format that you can invoke.

[
    {
        "name": "get_user_info",
        "description": "Retrieve details for a specific user by their unique identifier. Note that the provided function is in Python 3 syntax.",
        "parameters": {
            "type": "dict",
            "required": [
                "user_id"
            ],
            "properties": {
                "user_id": {
                    "type": "integer",
                    "description": "The unique identifier of the user. It is used to fetch the specific user details from the database."
                },
                "special": {
                    "type": "string",
                    "description": "Any special information or parameters that need to be considered while fetching user details.",
                    "default": "none"
                }
            }
        }
    }
]
<|eot_id|><|start_header_id|>user<|end_header_id|>
Can you retrieve the details for the user with the ID 7890, who has black as their special request?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
And the model responds with the function call that can fulfill the user’s query:
[get_user_info(user_id=7890, special='black')]<|eot_id|>
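Because the model is instructed to reply with only the bracketed call list, the response can be parsed mechanically. Below is a minimal sketch that assumes well-formed output in the [func_name(param=value, ...)] format shown above; the parse_tool_calls helper is hypothetical and does no error handling.

import ast

def parse_tool_calls(response: str):
    """Parse a response like "[get_user_info(user_id=7890, special='black')]"
    into a list of (function_name, kwargs) pairs."""
    response = response.replace("<|eot_id|>", "").strip()
    tree = ast.parse(response, mode="eval")      # the format is valid Python syntax
    calls = []
    for call in tree.body.elts:                  # the outer [...] parses as a Python list
        name = call.func.id
        kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}
        calls.append((name, kwargs))
    return calls

print(parse_tool_calls("[get_user_info(user_id=7890, special='black')]"))
# [('get_user_info', {'user_id': 7890, 'special': 'black'})]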
You could pass everything in the user prompt as well:
<|begin_of_text|><|start_header_id|>user<|end_header_id|>
Questions: Can you retrieve the details for the user with the ID 7890, who has black as their special request?
Here is a list of functions in JSON format that you can invoke:
[
    {
        "name": "get_user_info",
        "description": "Retrieve details for a specific user by their unique identifier. Note that the provided function is in Python 3 syntax.",
        "parameters": {
            "type": "dict",
            "required": [
                "user_id"
            ],
            "properties": {
                "user_id": {
                    "type": "integer",
                    "description": "The unique identifier of the user. It is used to fetch the specific user details from the database."
                },
                "special": {
                    "type": "string",
                    "description": "Any special information or parameters that need to be considered while fetching user details.",
                    "default": "none"
                }
            }
        }
    }
]
Should you decide to return the function call(s), put it in the format of [func1(params_name=params_value, params_name2=params_value2...), func2(params)]
NO other text MUST be included.<|eot_id|><|start_header_id|>assistant<|end_header_id|>
With the same response:
[get_user_info(user_id=7890, special='black')]<|eot_id|>
The response ends with an <|eot_id|> tag indicating the end of the turn.

The Llama 3.2 Vision multimodal large language models (LLMs) are a collection of pretrained and instruction-tuned image reasoning generative models in 11B and 90B sizes (text + images in / text out). The Llama 3.2 Vision Instruct models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image.
These models use a new special token, <|image|>, which represents the passed image.

There are 4 different roles that are supported by Llama text models: system, user, ipython, and assistant.

system: Sets the context in which to interact with the AI model. It typically includes rules, guidelines, or necessary information that help the model respond effectively.
user: Represents the human interacting with the model. It includes the inputs, commands, and questions to the model.
ipython: A new role introduced in Llama 3.1. Semantically, this role means "tool". This role is used to mark messages with the output of a tool call when sent back to the model from the executor.
assistant: Represents the response generated by the AI model based on the context provided in the system, ipython, and user prompts.
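As an illustration of how these roles map onto the header tokens used in the prompts above, here is a minimal sketch of a message formatter. The format_messages helper is hypothetical; production code would normally rely on the tokenizer's chat template instead.

def format_messages(messages: list[dict]) -> str:
    """Render [{"role": ..., "content": ...}] messages into the Llama 3 header format."""
    prompt = "<|begin_of_text|>"
    for m in messages:                         # roles: system, user, ipython, assistant
        prompt += (
            f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n"
            f"{m['content']}<|eot_id|>"
        )
    # End with an assistant header so the model generates the next turn
    return prompt + "<|start_header_id|>assistant<|end_header_id|>\n\n"

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]
print(format_messages(messages))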
With the pretrained (text completion) model, pass the <|image|> tag along with the text to continue generating:

<|begin_of_text|><|image|>If I had to write a haiku for this one

With the Instruct model, include the <|image|> tag if the input includes an image to reason about:

<|begin_of_text|><|start_header_id|>user<|end_header_id|>
<|image|>Describe this image in two sentences<|eot_id|><|start_header_id|>assistant<|end_header_id|>
The model generates its response based on the <|image|> tag and the text query. The position of the <|image|> tag is important! The image immediately preceding a query is used to answer the query, so make sure the text query follows the <|image|> tag. This is controlled by the cross-attention layer mask in the model.

For more details, see vision_prompt_format.md in the meta-llama GitHub repository.
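Following the rule above that the text query must come after the <|image|> tag, here is a minimal sketch of building an Instruct-style vision prompt as a plain string. The build_vision_prompt helper is hypothetical, and the actual image bytes are supplied separately to the processor or tokenizer rather than embedded in the prompt text.

def build_vision_prompt(query: str) -> str:
    """Single-turn Instruct vision prompt: the <|image|> tag must precede the text query."""
    return (
        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
        f"<|image|>{query}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

prompt = build_vision_prompt("Describe this image in two sentences")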