Versioning, updates and migration


Overview

Meta releases Llama models following a clear versioning convention that helps you understand capabilities and plan upgrades.

This guide explains Llama's versioning system, helps you compare model capabilities using model cards, and provides migration strategies between versions. Understanding these patterns enables you to choose the right model for your use case and upgrade effectively as new releases become available.

A general ontology of Llama releases

Llama models vs Llama-based models

Llama models, such as Llama 4, represent the core line of models developed with significant architectural advancements. In contrast, Llama-based models like Llama Guard 4 are designed for specific use cases—such as safety and content moderation.

Major Versions

Major versions, such as Llama 3 and Llama 4, indicate generational changes in model architecture. For example, Llama 4 introduces a new mixture-of-experts (MoE) architecture, which offers different parameter scaling and efficiency characteristics. These changes often require users to adapt their usage strategies, such as prompt optimization, to achieve optimal results.

Minor Versions

Minor versions, like Llama 3.2, bring new functionality or improved performance to an existing model generation. These point releases focus on specific capabilities or optimizations while maintaining the core architecture, which makes it easier to migrate existing implementations between models within the same generation.

New releases

Llama 4 line

  • Llama 4 Maverick: High-capability model with 17B active parameters from 400B total (128 experts), designed for complex reasoning and multimodal tasks. Features a mixture-of-experts (MoE) architecture and supports a 1M token context window.
  • Llama 4 Scout: Efficiency-focused model with 17B active parameters from 109B total (16 experts), optimized for cost-effective deployment and large document processing. Also features an MoE architecture with multimodal support, and supports a 10M token context window.

Llama 3.x line

  • Llama 3.3: 70B parameter model with enhanced multilingual and reasoning capabilities. Effectively replaces Llama 3.1 70B. Offers improved performance and broader language support.

  • Llama 3.2: Available in multiple parameter sizes (1B, 3B, 11B, 90B) with both text-only and vision-enabled variants. Focuses on efficiency and multimodal capabilities while maintaining backward compatibility with the Llama 3 architecture.

Understanding and comparing model capabilities

Model cards provide the authoritative source for detailed specifications, performance benchmarks, and capabilities for each Llama model. These cards are essential for making informed comparisons and migration decisions.

Locating official model cards: Visit the official Llama documentation for comprehensive model cards: Llama 3.1, Llama 3.2, Llama 3.3, and Llama 4. Each model has its own dedicated card with full technical specifications, benchmark results, and implementation guidance.

Reading model cards effectively: Start with the model overview to understand the intended use cases and key capabilities. Review the technical specifications section for parameter counts, context windows, and architectural details—pay special attention to active versus total parameters for MoE models like Llama 4. Examine the benchmark results across reasoning, coding, and domain-specific tasks that align with your use case. The prompt format section provides crucial implementation details for getting optimal performance. Finally, review the limitations and considerations section to understand potential challenges for your specific application.

Using model cards for migration planning: Compare benchmark scores between your current model and potential upgrades on tasks similar to your use case. Identify new capabilities that could benefit your application, such as multimodal processing or extended context windows. Review the prompt format requirements to understand any changes needed in your implementation. Use the performance characteristics to estimate cost and latency implications for your specific workload.
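
As a lightweight illustration of this comparison, the sketch below tabulates benchmark scores copied by hand from the model cards and computes the relative change; the task names and numbers are placeholders, not published results.

```python
# Compare benchmark scores copied from the model cards for your current and
# candidate models. The numbers below are placeholders; substitute the
# published values for the benchmarks that matter to your use case.
current = {"reasoning": 76.0, "coding": 62.0, "long_context": 70.0}
candidate = {"reasoning": 81.0, "coding": 68.0, "long_context": 88.0}

for task in current:
    delta = candidate[task] - current[task]
    pct = delta / current[task] * 100
    print(f"{task:>12}: {current[task]:5.1f} -> {candidate[task]:5.1f} ({pct:+.1f}%)")
```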

When to migrate

Performance considerations

Benchmarks

Consider migrating when your accuracy requirements increase, as newer versions typically demonstrate improved performance on industry-standard benchmarks. These benchmarks are designed to measure a model's ability to handle tasks such as reasoning, information extraction, and factual accuracy in a controlled, comparable way. Stronger benchmark results are a good indicator that the model will also perform better on similar real-world tasks within your application—such as extracting more accurate information, providing higher-quality reasoning, or reducing hallucinations.

Capabilities

Migrate when you need new capabilities, such as multimodal features, improved reasoning, or specialized domain performance. Also consider migration when efficiency gains are available, whether through better performance per parameter, faster inference, or important security and safety updates.

Trade-offs

Keep in mind, however, that newer models may introduce trade-offs, including increased latency, higher operational costs, or new integration requirements. Always evaluate migration holistically, weighing both the benefits and potential challenges for your specific use case.

Accuracy vs latency trade-offs

When deciding between accuracy and latency, favor accuracy for complex reasoning tasks, high-stakes applications, quality-critical content generation, and research and analysis workflows where the quality of output is paramount. Conversely, favor latency for interactive applications, scenarios requiring real-time responses, resource-constrained environments, and high-volume automated tasks where speed takes precedence over marginal quality improvements. Always weigh these trade-offs when planning a migration between models.

When to migrate to Llama 4

Llama 4 models offer significant architectural improvements but require specific considerations. These models use a mixture-of-experts architecture for better performance per active parameter, providing MoE efficiency gains that can reduce computational costs while maintaining high performance. Both Scout and Maverick support multimodal capabilities with text and image processing, expanding the range of applications you can address. Llama 4 Scout in particular offers a context window advantage, with a 10M token context for large document processing. However, MoE models may require different prompt strategies for optimal performance, so plan for prompt optimization during your migration. As with any upgrade, base the decision to move to Llama 4 on a careful assessment of both benefits and potential trade-offs.

MoE considerations

Understanding MoE architecture is crucial for effective migration. Only a subset of parameters is active for each request (17B active out of 109B-400B total), which provides cost efficiency by using fewer computational resources while maintaining large-model performance. Different experts may activate for different types of tasks, creating specialized pathways through the model. In practice, while MoE improves compute efficiency during inference, the full set of model weights (all experts) must still be loaded into memory, so total RAM requirements remain high, often comparable to dense models of similar total size.
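
To make the memory point concrete, the sketch below estimates a weights-only serving footprint from the total (not active) parameter count; it assumes 16-bit weights and ignores KV cache, activations, and runtime overhead.

```python
# Rough memory estimate for hosting a model: all expert weights must be
# resident even though only ~17B parameters are active per token.
def weight_memory_gb(total_params_billion: float, bytes_per_param: int = 2) -> float:
    """Weights-only footprint in GB, assuming 16-bit (2-byte) parameters."""
    return total_params_billion * 1e9 * bytes_per_param / 1e9

for name, total in [("Llama 4 Scout (109B total)", 109),
                    ("Llama 4 Maverick (400B total)", 400)]:
    print(f"{name}: ~{weight_memory_gb(total):.0f} GB of weights at 16-bit precision")
```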

Migration implications require careful testing and monitoring. Test prompt sensitivity carefully, as MoE models may respond differently to prompt variations compared to traditional models. Evaluate consistency across similar tasks, since expert routing can affect output stability. Consider your batch processing patterns, as expert activation may vary between single and batch requests. Monitor latency patterns closely, as first requests may have different performance characteristics due to expert loading and routing optimization. Always consider whether the migration introduces new operational or cost challenges.
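
A simple consistency check can be sketched as follows; generate is a placeholder for whatever inference client you use, and the similarity measure is deliberately basic.

```python
# Repeat the same prompt several times and measure how similar the outputs are.
# `generate` is a stand-in for your own inference client (API call, local server, etc.).
from difflib import SequenceMatcher
from statistics import mean

def consistency_score(generate, prompt: str, runs: int = 5) -> float:
    """Average pairwise similarity of repeated completions (1.0 = identical)."""
    outputs = [generate(prompt) for _ in range(runs)]
    pairs = [(a, b) for i, a in enumerate(outputs) for b in outputs[i + 1:]]
    return mean(SequenceMatcher(None, a, b).ratio() for a, b in pairs)

# Example with a dummy generator; replace with real model calls before use.
score = consistency_score(lambda p: "The answer is 42.", "What is 6 x 7?")
print(f"consistency: {score:.2f}")
```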

Cost considerations

Newer models may have different pricing structures that require evaluation against your current costs. MoE models offer better performance per active parameter, but total parameter count still affects pricing, so you need to understand both the efficiency gains and the cost implications. Evaluate performance improvements against cost increases to ensure the migration provides value for your specific use case. Factor in additional costs for migration, testing, prompt optimization, and potential fine-tuning when calculating the total cost of migrating.
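
A minimal cost projection might look like the sketch below; the per-token prices and traffic volumes are hypothetical placeholders for your own figures.

```python
# Project monthly inference cost for each model from expected traffic.
# All prices and volumes here are hypothetical; plug in your provider's rates.
def monthly_cost(requests_per_day, in_tokens, out_tokens, price_in_per_m, price_out_per_m):
    tokens_in = requests_per_day * 30 * in_tokens
    tokens_out = requests_per_day * 30 * out_tokens
    return tokens_in / 1e6 * price_in_per_m + tokens_out / 1e6 * price_out_per_m

current = monthly_cost(50_000, 1_200, 300, price_in_per_m=0.20, price_out_per_m=0.60)
candidate = monthly_cost(50_000, 1_200, 300, price_in_per_m=0.35, price_out_per_m=0.90)
print(f"current:   ${current:,.0f}/month")
print(f"candidate: ${candidate:,.0f}/month ({(candidate - current) / current:+.0%})")
```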

Migration playbook

If it ain't broke don't fix it

Before rushing to migrate, evaluate whether migration is actually necessary. If your current model meets performance requirements and there are no critical security updates required, the migration may not be worth the effort. Consider whether migration costs outweigh the benefits, especially for stable, production-critical applications that are functioning well with existing models.

Planning your migration

Start by assessing your current performance to establish baseline metrics for your use case. Review new capabilities to identify features that would benefit your application, then test compatibility to ensure your integration works with the new model. Plan your rollout strategy by determining the deployment speed and which features to replace. This allows you to continuously monitor user behavior, evaluate the new model's performance, and decide whether to migrate all or only select features.

Implementation steps

Begin implementation by updating your model specification to change the model version in your API calls. Test core functionality to verify that existing features work as expected with the new model. Evaluate performance by comparing outputs and response quality against your baseline metrics. Monitor production metrics closely, tracking latency, accuracy, and user satisfaction to ensure the migration meets expectations. Prepare for rollback by maintaining the ability to revert to the previous model if issues arise.
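
One way to keep that rollback path open is to treat the model identifier as configuration rather than hard-coding it, as in this sketch; the model names and environment variables are illustrative only.

```python
# Read the model identifier from configuration so rollback is a config change,
# not a code change. The names and env vars below are illustrative.
import os

DEFAULT_MODEL = "llama-3.3-70b-instruct"   # known-good model currently in production
CANDIDATE_MODEL = "llama-4-scout"          # model being migrated to

def active_model() -> str:
    """Return the model to use; flipping MIGRATION_ENABLED reverts instantly."""
    if os.environ.get("MIGRATION_ENABLED", "false").lower() == "true":
        return os.environ.get("CANDIDATE_MODEL", CANDIDATE_MODEL)
    return DEFAULT_MODEL

print(active_model())
```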

Common migration patterns

A gradual rollout approach starts with non-critical applications, uses A/B testing between old and new versions, and gradually increases traffic to the new model as confidence builds. This method allows you to identify issues early and minimize risk to critical systems.
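
One common way to implement the traffic split is deterministic bucketing on a stable user or request ID, sketched below; the rollout percentage is a value you raise gradually as confidence builds, and the model names are illustrative.

```python
# Deterministically route a percentage of traffic to the new model, keyed on a
# stable identifier so the same user always sees the same variant.
import hashlib

def routed_model(user_id: str, new_model_pct: int,
                 old_model: str = "llama-3.3-70b-instruct",
                 new_model: str = "llama-4-scout") -> str:
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return new_model if bucket < new_model_pct else old_model

# Start small (e.g. 5%) and increase as metrics hold up.
print(routed_model("user-1234", new_model_pct=5))
```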

Shadow deployment runs the new model alongside your existing version, comparing outputs without affecting users. This approach lets you build confidence in the new model's performance before making the switch, providing comprehensive validation of the migration's impact.
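
A shadow comparison can be as simple as the sketch below: the user always receives the current model's answer, while the candidate's answer is logged for offline review. The two generate callables are placeholders for your own clients.

```python
# Shadow deployment: users always receive the current model's answer,
# while the candidate's answer is logged for offline comparison.
import json, logging

logging.basicConfig(level=logging.INFO)

def handle_request(prompt, generate_current, generate_candidate):
    served = generate_current(prompt)          # this is what the user sees
    try:
        shadow = generate_candidate(prompt)    # never returned to the user
        logging.info(json.dumps({"prompt": prompt, "served": served, "shadow": shadow}))
    except Exception:
        logging.exception("shadow call failed")  # shadow failures must not affect users
    return served

# Dummy clients for illustration; replace with real model calls.
print(handle_request("Summarize our refund policy.",
                     lambda p: "old-model answer", lambda p: "new-model answer"))
```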

Evaluation considerations

Model-specific testing considerations

When testing reasoning versus non-reasoning models, focus on prompt optimization since reasoning models may benefit from chain-of-thought prompting techniques. Evaluate step-by-step versus direct answer approaches to understand which works best for your use case. Assess temperature and sampling parameter sensitivity, as reasoning models may respond differently to these settings.
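
A basic parameter-sensitivity sweep might look like the following sketch; generate and score are placeholders for your own inference client and quality metric.

```python
# Sweep sampling temperature and record a quality score at each setting.
# `generate` and `score` are placeholders for your own client and metric.
def temperature_sweep(generate, score, prompt, temperatures=(0.0, 0.3, 0.7, 1.0)):
    results = {}
    for t in temperatures:
        output = generate(prompt, temperature=t)
        results[t] = score(output)
    return results

# Dummy client and metric for illustration; replace before real use.
results = temperature_sweep(lambda p, temperature: f"answer@{temperature}",
                            lambda out: len(out), "Explain step by step: ...")
print(results)
```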

For vision-enabled models, test image understanding across different formats and resolutions to ensure consistent performance. Evaluate multimodal reasoning capabilities that combine text and visual information. Compare text-only versus multimodal prompt strategies to identify the most effective approaches for your use cases. Assess performance on visual reasoning tasks that require understanding relationships between text and images.

Parameter size considerations affect your testing approach significantly. Smaller models (1B-3B) require focus on efficiency and basic task performance, as they excel in straightforward applications with resource constraints. Medium models (11B-70B) balance capability and speed for most applications, making them suitable for general-purpose use cases. Large models (90B+) need testing on complex reasoning and specialized domain performance where their additional parameters provide the most value.

Prompt optimization testing

For general-purpose models such as Llama 4 Maverick and Llama 3.3, test across diverse task types to leverage their broad capabilities effectively. Evaluate multimodal prompt strategies that combine text and images to maximize the models' versatility. Test structured output formats for complex analysis tasks where consistent formatting is crucial. Evaluate tool-calling performance if relevant to your use case, as these models often excel at integrating with external tools and APIs.

For efficiency-optimized models such as Llama 4 Scout and Llama 3.2 1B/3B models, optimize for concise, direct prompts and use few-shot learning when necessary to achieve good performance without extensive context. Test performance with minimal context to understand the models' limitations and strengths in resource-constrained scenarios. Focus on single-turn versus multi-turn efficiency to optimize your conversation patterns. Evaluate response time versus quality trade-offs to find the right balance for your application's requirements.

Evaluation metrics for model migrations

Performance

Capability assessment should focus on measuring task-specific accuracy improvements that directly impact your use case. Evaluate reasoning quality on multi-step problems to understand how well the new model handles complex logical chains. Test instruction following precision to ensure the model responds appropriately to your specific prompts and requirements. Assess domain knowledge accuracy in areas relevant to your application to verify that the model maintains or improves specialized understanding.
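
A minimal accuracy check against a small labeled set might look like this sketch; generate and the evaluation items are placeholders for your own task data.

```python
# Score exact-match accuracy on a small labeled evaluation set.
# `generate` stands in for your inference client; the items are placeholders.
eval_set = [
    {"prompt": "Extract the invoice number: 'Invoice #8841, due May 3'", "expected": "8841"},
    {"prompt": "What is 17 + 25? Answer with the number only.", "expected": "42"},
]

def accuracy(generate, items) -> float:
    correct = sum(generate(x["prompt"]).strip() == x["expected"] for x in items)
    return correct / len(items)

print(f"accuracy: {accuracy(lambda p: '42', eval_set):.0%}")  # dummy client for illustration
```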

Trust and safety

Model behavior changes require careful evaluation across multiple dimensions. Monitor response style and tone consistency to ensure the new model maintains the voice appropriate for your application. Evaluate safety and alignment behavior to verify that the model continues to meet your content standards and ethical requirements. Track hallucination rates and factual accuracy, as these can significantly impact user trust and application reliability.

Rollback criteria

Consider rolling back to the previous model if you observe degraded performance on core use case benchmarks that matter most to your application. Revert if prompt sensitivity increases significantly, requiring extensive re-optimization that outweighs the benefits of the migration. Roll back if you notice inconsistent behavior on previously stable tasks, as reliability is often more valuable than marginal performance improvements.
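
These criteria can also be encoded as an automated check, sketched below; the metric names and thresholds are illustrative and should reflect your own baselines.

```python
# Decide whether rollback is warranted by comparing candidate metrics to baseline.
# Metric names and thresholds are illustrative; tune them to your application.
def should_roll_back(baseline: dict, candidate: dict,
                     max_accuracy_drop: float = 0.02,
                     max_latency_increase: float = 0.25) -> bool:
    accuracy_drop = baseline["accuracy"] - candidate["accuracy"]
    latency_increase = (candidate["p95_latency_s"] - baseline["p95_latency_s"]) / baseline["p95_latency_s"]
    return accuracy_drop > max_accuracy_drop or latency_increase > max_latency_increase

baseline = {"accuracy": 0.91, "p95_latency_s": 1.8}
candidate = {"accuracy": 0.90, "p95_latency_s": 2.6}
print(should_roll_back(baseline, candidate))  # True: p95 latency regressed by ~44%
```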
