Deployment

Deploy and operate Llama models at scale.

Production deployment with Llama

Deploying and operating Llama models at scale requires a comprehensive approach that spans infrastructure, model selection, experimentation, security, and cost management.

The guides in this section introduce foundational concepts and decision points that are critical to successfully integrating Llama models into production enterprise environments.


Private Cloud Deployment

Private cloud deployment offers full control over infrastructure, security, and compliance.

Learn about key architectural patterns, including VPC isolation, cross-region replication, and multi-cloud strategies.
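
For illustration, the sketch below uses boto3 to provision an isolated VPC with a private subnet and an S3 gateway endpoint, so inference nodes can fetch model weights without a public internet path. The region, CIDR ranges, and resource names are placeholder assumptions, and a real deployment would add route tables, security groups, and encryption controls.

```python
# Minimal sketch: provision an isolated VPC for private Llama inference.
# Region, CIDR blocks, and names are illustrative placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

# Create the VPC; with no internet gateway attached, workloads inside
# it have no inbound or outbound internet path by default.
vpc = ec2.create_vpc(CidrBlock="10.20.0.0/16")
vpc_id = vpc["Vpc"]["VpcId"]
ec2.create_tags(Resources=[vpc_id],
                Tags=[{"Key": "Name", "Value": "llama-inference-vpc"}])

# Private subnet for GPU inference nodes.
ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.20.1.0/24",
                  AvailabilityZone="us-west-2a")

# Gateway endpoint so nodes can pull model weights from S3 without
# traversing the public internet.
ec2.create_vpc_endpoint(VpcId=vpc_id,
                        ServiceName="com.amazonaws.us-west-2.s3",
                        VpcEndpointType="Gateway")
```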

Private cloud deployment

Production Deployment Pipelines

Transitioning from experimentation to production requires robust, automated pipelines that manage the full lifecycle of Llama models in production.

Learn how to use production deployment pipelines to automate the model lifecycle, from data ingestion and validation, through fine-tuning and evaluation, to rollouts, A/B testing, and performance monitoring.
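
As a mental model, such a pipeline can be viewed as an ordered chain of gated stages. The skeleton below is a hypothetical sketch with stubbed stage logic; the stage names, metrics, and thresholds are illustrative, not a specific orchestrator's API.

```python
# Hypothetical pipeline skeleton: each stage returns artifacts for the
# next, and a failed quality gate stops the rollout. Logic is stubbed.
def ingest_and_validate(raw_data):
    # Schema checks, deduplication, and PII scrubbing would live here.
    return {"examples": raw_data}

def fine_tune(dataset):
    # Launch a fine-tuning job and return the candidate checkpoint.
    return {"checkpoint": "candidate-v2"}

def evaluate(model, eval_suite):
    # Run offline evals; return metric scores.
    return {"accuracy": 0.91, "safety_pass_rate": 0.995}

def gate(metrics, thresholds):
    return all(metrics[k] >= v for k, v in thresholds.items())

def rollout(model, traffic_fraction=0.05):
    # Canary release; monitoring and A/B testing pick up from here.
    print(f"Routing {traffic_fraction:.0%} of traffic to {model['checkpoint']}")

dataset = ingest_and_validate(raw_data=["example 1", "example 2"])
candidate = fine_tune(dataset)
metrics = evaluate(candidate, eval_suite="internal-benchmarks")
if gate(metrics, {"accuracy": 0.90, "safety_pass_rate": 0.99}):
    rollout(candidate)
```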

Production deployment pipelines

Infrastructure Migration

Migrating from external providers to Llama involves a structured methodology that reduces risk and ensures continuity.

Learn how to use continuous validation of quality, latency, and cost for smooth migration from other providers to Llama.
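
One common validation pattern is shadow traffic: mirror a sample of production prompts to the Llama candidate while the incumbent keeps serving users, then compare quality and latency offline. The sketch below stubs both clients and the judge, so every name in it is a placeholder.

```python
# Sketch of shadow validation during migration: replay production
# prompts against both providers and compare. Clients are stubs.
import time, random

def incumbent_complete(prompt):          # existing provider (stub)
    time.sleep(random.uniform(0.05, 0.1))
    return "incumbent answer"

def llama_complete(prompt):              # self-hosted Llama (stub)
    time.sleep(random.uniform(0.03, 0.08))
    return "llama answer"

def judge(prompt, a, b):
    # In practice: an LLM-as-judge or a human review queue.
    return random.choice(["a", "b", "tie"])

results = {"llama_wins": 0, "ties": 0, "latency_deltas": []}
for prompt in ["prompt 1", "prompt 2", "prompt 3"]:
    t0 = time.perf_counter(); old = incumbent_complete(prompt)
    t1 = time.perf_counter(); new = llama_complete(prompt)
    t2 = time.perf_counter()
    verdict = judge(prompt, old, new)
    results["llama_wins"] += verdict == "b"
    results["ties"] += verdict == "tie"
    results["latency_deltas"].append((t2 - t1) - (t1 - t0))
print(results)
```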

Infrastructure migration

Model Versioning and Migration

Llama models use a clear versioning system. Major releases introduce architectural changes, such as mixture-of-experts, while minor versions add targeted improvements.

Learn how to choose the right Llama model version by evaluating performance, compatibility, and operational trade-offs, using baseline measurement and compatibility testing to minimize risk.
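
A minimal form of compatibility testing replays a pinned prompt set against the candidate version and compares results with baselines recorded from the current version. In this sketch, generate() is a stub for your inference client, and exact-match comparison stands in for the semantic or rubric-based scoring you would use for free-form outputs.

```python
# Sketch of a version-compatibility regression test. The stub changes
# output per version, so this run reports a behavior change.
def generate(model_version, prompt):
    return f"[{model_version}] response to: {prompt}"  # stub client

baseline = {"What is 2+2?": "[llama-3.3] response to: What is 2+2?"}

def compatibility_report(candidate_version, baseline, min_match=0.95):
    matches = 0
    for prompt, expected in baseline.items():
        got = generate(candidate_version, prompt)
        # Exact match is the strictest check; semantic similarity or an
        # eval rubric is more realistic for free-form generations.
        matches += got == expected
    rate = matches / len(baseline)
    return {"match_rate": rate, "pass": rate >= min_match}

print(compatibility_report("llama-4", baseline))
```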

Model versioning and migration

Accelerator Management

Large language models require specialized hardware accelerators, such as GPUs or TPUs, to deliver cost-effective and low-latency inference.

Learn about key selection factors, including memory, compute power, availability, and cost, to determine hardware needs for specific models and applications. Maximize utilization through batching, caching, and job scheduling to improve cost-effectiveness.
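
A useful starting point is a back-of-the-envelope memory estimate: model weights plus KV cache must fit in accelerator memory. The sketch below assumes illustrative shapes for a 70B-parameter model served in an 8-bit format; substitute the real architecture parameters for your model.

```python
# Back-of-the-envelope accelerator sizing: weights plus KV cache must
# fit in GPU memory. Parameter counts and shapes are illustrative.
def weights_gb(params_billion, bytes_per_param):
    return params_billion * bytes_per_param  # 1e9 params * bytes -> GB

def kv_cache_gb(layers, kv_heads, head_dim, context_len, batch,
                bytes_per_value=2):
    # Factor of 2 accounts for both keys and values.
    return (2 * layers * kv_heads * head_dim * context_len * batch
            * bytes_per_value) / 1e9

# Example: a 70B model served in an 8-bit format, 4K context, batch of 8.
total = weights_gb(70, 1) + kv_cache_gb(
    layers=80, kv_heads=8, head_dim=128, context_len=4096, batch=8)
print(f"~{total:.0f} GB needed -> plan GPU count and memory accordingly")
```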

Accelerator management

Autoscaling and Resource Optimization

Autoscaling addresses fluctuating demand and high memory requirements. Horizontal scaling (e.g., Kubernetes) and vertical scaling optimize resource allocation. Monitoring queue depth, GPU utilization, and latency enables proactive scaling.

Learn techniques like quantization, dynamic batching, and using spot instances to further reduce costs.
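
As a sketch of proactive scaling, the function below derives a replica count from queue depth, GPU utilization, and p95 latency. The thresholds are illustrative assumptions to be tuned per workload, for example as an external metric feeding a Kubernetes autoscaler.

```python
# Sketch of a scaling decision driven by queue depth, GPU utilization,
# and latency; all targets and SLOs are illustrative placeholders.
import math

def desired_replicas(current, queue_depth, gpu_util, p95_latency_s,
                     target_queue_per_replica=4, target_util=0.75,
                     latency_slo_s=2.0):
    # Scale on whichever signal is most pressured.
    by_queue = queue_depth / target_queue_per_replica
    by_util = current * (gpu_util / target_util)
    desired = max(by_queue, by_util, 1)
    if p95_latency_s > latency_slo_s:      # SLO breach: scale up fast
        desired = max(desired, current + 1)
    return math.ceil(desired)

print(desired_replicas(current=4, queue_depth=40, gpu_util=0.9,
                       p95_latency_s=2.4))  # -> 10
```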

Autoscaling

Self-Hosting for Regulated Industries

For industries with strict data privacy requirements, self-hosting Llama models ensures data control and compliance.

Learn about deployment patterns, including air-gapped, private-network, and hybrid architectures. Implementing security controls, such as PHI detection, audit logging, and encryption, is essential for protecting sensitive data and meeting regulatory standards.
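
As a simplified illustration, the sketch below redacts two PHI patterns before inference and emits an append-only audit record that stores hashes and categories rather than raw values. The regexes are deliberately narrow examples; production systems rely on dedicated PHI/PII detection models and far broader coverage.

```python
# Sketch: redact obvious PHI patterns before inference and write an
# audit record. Patterns shown are illustrative, not exhaustive.
import re, json, hashlib, datetime

PHI_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "mrn": re.compile(r"\bMRN[:\s]*\d{6,10}\b", re.IGNORECASE),
}

def redact(text):
    found = []
    for label, pattern in PHI_PATTERNS.items():
        text, n = pattern.subn(f"[{label.upper()}]", text)
        if n:
            found.append(label)
    return text, found

def audit_record(user_id, prompt, found):
    return json.dumps({
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "user": hashlib.sha256(user_id.encode()).hexdigest()[:16],
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "phi_detected": found,  # log categories, never the raw values
    })

clean, found = redact("Follow-up for 123-45-6789, MRN: 00123456")
print(clean)                    # Follow-up for [SSN], [MRN]
print(audit_record("clinician-7", clean, found))
```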

Regulated industry self-hosting

Security in Production

Securing Llama deployments requires a multi-layered approach that addresses threats at the infrastructure, data, application, and operational levels.

Read about industry-standard security techniques like zero-trust and least-privilege principles. Learn how to mitigate LLM-specific threats such as prompt injection and insecure output handling via security gateways and robust input/output validation. Understand how continuous monitoring, audit logging, and incident response planning can help ensure security and compliance in production environments.
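
The sketch below shows the shape of such a gateway: screen inputs for obvious injection markers and constrain outputs before any downstream use. The heuristics are illustrative only; in practice they would sit alongside purpose-built classifiers such as Llama Prompt Guard and Llama Guard.

```python
# Sketch of a security gateway around an inference call. The regex
# heuristics are illustrative stand-ins for real classifiers.
import re

INJECTION_MARKERS = [
    re.compile(r"ignore (all |previous )*instructions", re.IGNORECASE),
    re.compile(r"reveal (the )?system prompt", re.IGNORECASE),
]

def screen_input(user_text):
    if any(p.search(user_text) for p in INJECTION_MARKERS):
        raise ValueError("blocked: possible prompt injection")
    return user_text

def screen_output(model_text):
    # Insecure output handling: never pass raw model text to a shell,
    # SQL engine, or HTML renderer. Stripping markup is one example.
    return re.sub(r"<[^>]+>", "", model_text)

def call_model(prompt):
    return "<b>safe</b> answer"  # stub for the inference endpoint

prompt = screen_input("Summarize this document")
print(screen_output(call_model(prompt)))   # -> "safe answer"
```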

Security in production

Cost Projection and Optimization

Accurate cost projection and total-cost management are essential for sustainable LLM deployments. Cost drivers include token processing (input, output, and cached tokens), GPU hardware, cloud infrastructure, and hidden factors such as compliance, monitoring, and model versioning.

Learn how to accurately forecast costs by understanding workload patterns and optimizing utilization; use batch processing, spot instances, and right-sizing to help control ongoing costs.
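
For example, a first-pass monthly forecast can be computed directly from token volumes. All prices, volumes, and the cached-token discount below are illustrative placeholders, not actual rates.

```python
# Sketch of a monthly cost forecast from token volumes. Prices are
# illustrative placeholders, not actual rates.
def monthly_token_cost(requests_per_day, in_tokens, out_tokens,
                       price_in_per_m=0.40, price_out_per_m=0.80,
                       cached_fraction=0.3, cached_discount=0.5):
    req = requests_per_day * 30
    # Cached input tokens (e.g., a shared system prompt) bill at a discount.
    cached = req * in_tokens * cached_fraction
    fresh = req * in_tokens - cached
    return (fresh * price_in_per_m
            + cached * price_in_per_m * cached_discount
            + req * out_tokens * price_out_per_m) / 1e6

# 50k requests/day, ~1,200 input and ~400 output tokens each.
print(f"${monthly_token_cost(50_000, 1200, 400):,.0f}/month")  # -> $1,092
```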

Cost projection and optimization

Comparing Costs

Choosing the optimal deployment model (managed APIs, serverless GPU, GPU rental, or bare metal ownership) depends on workload characteristics, privacy requirements, and operational priorities. Each option presents distinct trade-offs in terms of cost structure, scalability, latency, and control.

Learn how to evaluate these options by gaining a clear understanding of token throughput, utilization patterns, and the specific needs of the application.
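
The sketch below compares a per-token managed API against rented dedicated GPUs across monthly volumes; all rates, throughputs, and the always-on capacity floor are illustrative assumptions. With these numbers the API wins at low volume and dedicated capacity wins at high volume, which is the typical shape of the trade-off.

```python
# Sketch of a break-even comparison between a managed, per-token API
# and rented dedicated GPUs. All rates are illustrative assumptions.
def api_monthly(tokens_per_month, blended_price_per_m=0.60):
    return tokens_per_month / 1e6 * blended_price_per_m

def gpu_rental_monthly(tokens_per_month, gpu_hour_rate=2.50,
                       tokens_per_gpu_second=2500, utilization=0.6,
                       min_gpus=1):
    effective_tps = tokens_per_gpu_second * utilization
    hours_needed = tokens_per_month / effective_tps / 3600
    reserved_floor = min_gpus * 730 * gpu_hour_rate  # always-on capacity
    return max(hours_needed * gpu_hour_rate, reserved_floor)

for tokens_m in (100, 1_000, 10_000):     # millions of tokens/month
    t = tokens_m * 1e6
    print(f"{tokens_m:>6}M tok: API ${api_monthly(t):>8,.0f} "
          f"vs GPU ${gpu_rental_monthly(t):>8,.0f}")
```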

Cost comparison framework

A/B Testing and Experimentation

A/B testing is a critical methodology for empirically evaluating changes to Llama-powered applications. By systematically comparing variants—such as different prompts, models, or retrieval strategies—on live user traffic, teams can measure the real-world impact of changes on quality, safety, latency, and cost. Effective A/B testing requires careful experimental design, including clear hypotheses, well-defined goals and guardrail metrics, and robust sample-size calculations.

Learn how to implement A/B testing for LLM-based applications, including independent variant deployment, consistent user assignment, and comprehensive logging—and make data-driven decisions for continuous improvement.
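
Two of these building blocks fit in a few lines: deterministic assignment, so a user sees the same variant across sessions, and a standard two-proportion sample-size estimate. The experiment name, split ratio, and effect size below are illustrative.

```python
# Sketch: stable hash-based variant assignment plus a two-proportion
# sample-size estimate. Parameters are illustrative placeholders.
import hashlib, math

def assign_variant(user_id, experiment="prompt-v2", treatment_share=0.5):
    # Hash user+experiment so assignment is stable per user and
    # independent across experiments.
    bucket = int(hashlib.sha256(f"{experiment}:{user_id}".encode())
                 .hexdigest(), 16) % 10_000
    return "treatment" if bucket < treatment_share * 10_000 else "control"

def sample_size_per_arm(p_control, p_treatment, alpha=0.05, power=0.8):
    z_a, z_b = 1.96, 0.84        # two-sided alpha=0.05, power=0.8
    var = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    return math.ceil((z_a + z_b) ** 2 * var
                     / (p_control - p_treatment) ** 2)

print(assign_variant("user-123"))
# Detecting a lift from 70% to 73% task success:
print(sample_size_per_arm(0.70, 0.73))   # -> 3547 users per arm
```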

A/B testing and experimentation