

Private cloud deployment for Llama models

Overview

Private cloud deployment of Llama models addresses enterprise requirements for data sovereignty, regulatory compliance, and complete infrastructure control. Organizations handling sensitive data, from financial records to intellectual property, require AI capabilities without exposing information to public cloud services or third-party APIs.

This guide outlines advanced deployment patterns for running Llama models in private cloud environments across Amazon Web Services (AWS), Azure, and Google Cloud Platform (GCP). You'll learn how to configure network isolation, implement encryption at every layer, establish audit trails for compliance, and maintain high availability while keeping all data within your controlled infrastructure perimeter.

Not every deployment needs every technique in this guide. Included with the guide are starter Terraform scripts that demonstrate bare-bones cloud deployments on both AWS and GCP. Start from one of the example scripts and build the additional features you require on top.

Private cloud deployment differs fundamentally from public API consumption. You gain complete control over model versions, infrastructure specifications, and security configurations, but assume responsibility for capacity planning, performance optimization, and operational maintenance. This guide helps you navigate these tradeoffs with production-tested approaches that balance security, performance, and cost.

Architecture patterns for private cloud

VPC-isolated deployment

Virtual Private Cloud (VPC) isolation provides the foundation for private cloud deployments, ensuring network-level separation from the public internet.

VPC isolation ensures zero internet exposure through several mechanisms. Private subnets operate without internet gateways, preventing any direct internet connectivity. All service communications happen through private endpoints, eliminating the need for traffic to traverse public networks. Network ACLs restrict traffic to known CIDR blocks at the subnet level, while security groups provide instance-level filtering. VPC flow logs capture all network traffic for audit and forensic purposes.
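The subnet-level CIDR restriction described above can be sketched in a few lines of Python; the CIDR blocks here are hypothetical, and real network ACLs evaluate ordered allow/deny rules that this sketch omits.

```python
import ipaddress

# Hypothetical allow-list mirroring subnet-level network ACL rules.
ALLOWED_CIDRS = [ipaddress.ip_network(c) for c in ("10.0.0.0/16", "172.16.8.0/24")]

def source_allowed(src_ip: str) -> bool:
    """Return True only for traffic originating in a known CIDR block."""
    addr = ipaddress.ip_address(src_ip)
    return any(addr in net for net in ALLOWED_CIDRS)
```

For example, `source_allowed("10.0.42.7")` returns True, while a public source such as `203.0.113.5` is rejected.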

Cross-region deployment

Multi-region architectures provide disaster recovery capabilities and help meet data residency requirements. The primary region hosts your active workloads while the disaster recovery region maintains synchronized model artifacts and configuration. During normal operations, the disaster recovery region can serve local traffic to reduce latency or handle overflow during peak demand.

Key considerations for cross-region deployment include establishing secure connectivity through VPC peering or transit gateways, implementing data replication strategies for model artifacts and configurations, maintaining consistency in security policies across regions, and planning for failover scenarios with defined recovery targets.

Multi-cloud deployment

Organizations deploy across multiple cloud providers to avoid vendor lock-in, meet specific compliance requirements, or leverage best-of-breed services from each provider. This approach requires careful planning around network connectivity, identity federation, and operational complexity.

Multi-cloud strategies typically follow one of three patterns:

  • Active-active deployment runs workloads simultaneously across providers for maximum availability.
  • Active-passive maintains a warm standby in the secondary cloud for disaster recovery.
  • Workload distribution assigns specific models or use cases to different clouds based on their strengths.

Architecture patterns comparison

The table below summarizes the three private‑cloud architecture patterns and when to choose each.

| Pattern | Primary objective | Best for | Tradeoffs | Example use cases |
|---|---|---|---|---|
| VPC-isolated deployment | Maximize network isolation; zero internet exposure | Single-region workloads; strict data sovereignty; internal apps | Limited geo redundancy; careful update/egress planning; simpler operations | Internal chat/coding assistants; batch inference; PII/PHI processing within one region |
| Cross-region deployment | Resilience and disaster recovery across regions; latency locality | Enterprises needing RTO/RPO guarantees; user traffic across geos | Higher cost/operational complexity; data replication and consistency management | Public/customer-facing inference APIs; 24/7 SLAs; regulated sectors with regional data residency |
| Multi-cloud deployment | Avoid lock-in; leverage best-of-breed services; jurisdictional compliance | Large orgs with mature platform teams; differentiated workload placement | Highest complexity; identity/network federation; fragmented observability; higher cost | Split model hosting by workload; sovereign cloud + hyperscaler; cross-provider A/B testing |

AWS private deployment

Amazon Web Services (AWS) is Amazon's cloud computing platform. The largest and most well-known cloud provider, AWS offers a range of features that allow for private cloud deployments. Many of the features available on AWS are also available on other cloud providers, albeit under different names and with slightly different implementations.

Private Network

AWS VPC provides the foundation for private deployments through complete network isolation. Your private cloud environment operates within dedicated subnets that have no internet gateways, ensuring model inference traffic never reaches public networks.

Key network security components include:

  • Private subnets distributed across multiple availability zones for high availability
  • VPC endpoints that enable secure access to AWS services like S3 and Elastic Container Registry without internet routing
  • Network ACLs that control traffic at the subnet level with explicit allow and deny rules
  • Security groups that enforce least-privilege access between components

VPC flow logs capture all network activity for security monitoring and compliance auditing. Private DNS zones resolve service names internally, preventing any external DNS lookups that could leak information about your infrastructure.

For complete implementation details, use our Amazon SageMaker Terraform template which includes production-ready VPC configuration, security groups, and private endpoints.

Identity and Access

AWS IAM enforces least-privilege access across your deployment. Service roles define precisely what actions each component can perform, while resource-based policies control access to model artifacts and inference endpoints.

Essential IAM components include SageMaker execution roles with minimal permissions for model loading and inference, custom roles that restrict endpoint access to VPC-only traffic, service-linked roles for autoscaling and monitoring, and cross-service trust relationships that prevent unauthorized access between components.

Multi-factor authentication requirements for administrative access, regular access reviews to remove unused permissions, and integration with your corporate identity provider ensure comprehensive access control. The Amazon SageMaker Terraform template provides example IAM configurations for managed model access.

Encryption and Key Management

AWS KMS provides comprehensive encryption for data at rest and in transit. Customer-managed keys give you complete control over encryption policies and rotation schedules. All model artifacts, training data, and inference logs use KMS encryption with automatic key rotation.

S3 buckets storing model weights implement server-side encryption with customer-managed keys, bucket policies that prevent public access, and versioning for artifact integrity. CloudTrail integration logs all key usage for compliance auditing.
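One widely documented building block for such bucket policies is a statement that denies any request made without TLS. A minimal sketch, with a hypothetical bucket name:

```python
import json

BUCKET = "example-llama-artifacts"  # hypothetical bucket name

# Deny any S3 request that arrives over plain HTTP; this complements
# SSE-KMS default encryption and Block Public Access settings.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyInsecureTransport",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:*",
        "Resource": [f"arn:aws:s3:::{BUCKET}", f"arn:aws:s3:::{BUCKET}/*"],
        "Condition": {"Bool": {"aws:SecureTransport": "false"}},
    }],
}
policy_json = json.dumps(policy)  # attach via put_bucket_policy or Terraform
```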

Managed Model Serving

Amazon SageMaker supports fully private deployments through VPC mode, where models never access the internet. Private endpoints ensure inference traffic stays within your VPC boundaries. You can deploy using either real-time endpoints for low-latency applications or batch transform jobs for high-throughput processing.

SageMaker handles model scaling, health monitoring, and rolling updates automatically while respecting your security boundaries. Data capture capabilities log model inputs and outputs to S3 for quality monitoring and compliance, with all data remaining encrypted within your infrastructure perimeter.

Azure private deployment

Azure is Microsoft's cloud computing platform. Azure offers many features similar to AWS; its distinguishing strength is native Windows integration and ecosystem, which can make it a good fit for Windows-based enterprises and applications.

Private Network

Azure Virtual Network (VNet) provides network isolation similar to AWS VPC. Your deployment operates within private subnets that prevent direct internet access while enabling secure communication with Azure services through private endpoints and service endpoints.

Core networking components include dedicated subnets for inference workloads with no public IP assignment, Network Security Groups (NSGs) that filter traffic at the subnet and network interface levels, private endpoints for Azure services like Storage Accounts and Key Vault, and Azure Firewall or third-party NVAs for advanced threat protection.

Private DNS zones ensure internal name resolution without external queries. Azure Private Link enables secure access to Azure services and your own services without internet exposure. VNet peering connects multiple virtual networks securely for multi-region or hybrid deployments.

Identity and Access

Azure Role Based Access Control (RBAC) provides fine-grained permissions management through built-in and custom roles. Managed identities eliminate the need for storing credentials in code or configuration files.

User-assigned managed identities provide persistent identities that survive resource recreation, enabling consistent access control across deployment updates. Built-in roles like "Storage Blob Data Reader" and "Key Vault Secrets User" provide predefined permission sets for common operations. Custom roles enable precise permissions tailored to Llama model operations.

Conditional access policies can require MFA and compliant devices for administrative operations. Privileged Identity Management (PIM) provides just-in-time administrative access with approval workflows and audit trails.

Encryption and Key Management

Azure Key Vault manages encryption keys, certificates, and secrets with enterprise-grade security. Hardware Security Module (HSM) backing protects the most sensitive cryptographic operations.

Key Vault features essential for Llama deployments include customer-managed encryption keys with automatic rotation, network access restrictions to specific VNets and IP ranges, soft delete and purge protection against accidental deletion, and comprehensive audit logging for compliance requirements.

Rotation policies should automate the replacement of secrets such as API keys and certificates before they expire. Private endpoint connectivity ensures secret retrieval happens over private networks without internet access.
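A rotation policy ultimately reduces to a due-date check per secret. A minimal sketch, assuming a 90-day interval (pick the interval your compliance rules require):

```python
from datetime import datetime, timedelta, timezone

# Assumed 90-day rotation policy; adjust to your compliance requirements.
ROTATION_INTERVAL = timedelta(days=90)

def rotation_due(last_rotated: datetime, now: datetime) -> bool:
    """True when a secret has outlived the rotation policy and must be replaced."""
    return now - last_rotated >= ROTATION_INTERVAL

due = rotation_due(datetime(2025, 1, 1, tzinfo=timezone.utc),
                   now=datetime(2025, 6, 1, tzinfo=timezone.utc))  # True: 151 days old
```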

Managed Model Serving

Azure Machine Learning (Azure ML) provides managed endpoints for Llama model hosting within your private network. Managed online endpoints automatically handle scaling, health monitoring, and traffic routing while maintaining complete network isolation.

Key deployment features include private endpoint connectivity that keeps inference traffic within your VNet, managed identity authentication that eliminates credential management, automatic SSL termination and certificate management, and integrated monitoring with Azure Monitor and Application Insights.

Deployment slots enable blue-green deployments with instant traffic switching. Auto-scaling policies adjust capacity based on CPU, memory, or custom metrics. The service handles model loading, warm-up, and graceful shutdown automatically.

GCP private deployment

Google Cloud Platform (GCP) is Google's cloud computing platform. Like AWS and Azure, GCP supports private cloud deployments, and as a platform it places particular emphasis on AI features.

Private Network

Google Cloud VPC provides network isolation through custom mode networks that give you complete control over subnet creation and IP address ranges. Private Google Access enables resources in private subnets to reach Google services without public IPs.

Essential network security features include custom firewall rules that control traffic at the network level, hierarchical firewall policies for organization-wide security enforcement, VPC Service Controls that create security perimeters around sensitive data, and Private Service Connect for consuming services privately.

VPC flow logs provide network traffic visibility for security analysis and troubleshooting. Cloud NAT enables outbound connectivity for system updates while maintaining inbound isolation. Network tags and service accounts provide identity-based network security beyond IP addresses.

For production-ready configurations, use our GCP Cloud Run Terraform template which implements VPC isolation, firewall rules, and private service connectivity.

Identity and Access

Google Cloud IAM uses service accounts as the primary identity for workloads. These accounts follow the principle of least privilege with granular permissions for specific operations.

Service accounts provide workload identity without embedded credentials, enabling secure authentication between services. Predefined roles like "Storage Object Viewer" and "Vertex AI User" offer vetted permission sets for common operations. Custom roles enable precise permission boundaries tailored to Llama deployments. Workload Identity Federation allows external systems to authenticate without service account keys.

Organization policies enforce security requirements across projects, such as restricting public IPs or requiring specific encryption types. Binary Authorization ensures only approved container images run in your environment.

Encryption and Key Management

Cloud KMS manages encryption keys with hardware security module (HSM) protection for the highest security requirements. Automatic key rotation reduces the risk of key compromise while maintaining operational continuity.

Key features for Llama deployments include customer-managed encryption keys (CMEK) for complete control over data encryption, key access justifications that require reasons for administrative key access, and integration with Secret Manager for API key and credential storage.

Secret Manager provides versioned storage of secrets with automatic replication across regions. Access controls at the secret and version level enable fine-grained permission management. Private service connectivity ensures secrets are retrieved without internet exposure.

Managed Model Serving

Vertex AI provides a more fully managed model serving solution than Cloud Run or manual containerization. The platform enables model deployment within VPC-SC perimeters for complete network isolation. Private endpoints keep all inference traffic within your VPC boundaries while the service handles infrastructure management.

Key deployment capabilities include VPC network peering for private connectivity, customer-managed encryption for model artifacts and predictions, automatic scaling based on traffic patterns, and integrated monitoring through Cloud Monitoring and Cloud Logging.

The GCP Vertex Terraform template provides an initial template to get started using containerized models with automatic scaling and private ingress configuration.

Cloud differences summary

This quick reference maps core private‑cloud concepts to AWS, Azure, and GCP names.

| Concept | AWS | Azure | GCP |
|---|---|---|---|
| Network isolation | VPC (private subnets, no IGW) | VNet (no public IPs) | VPC (custom mode) |
| Private access to services | VPC Endpoints / PrivateLink | Private Endpoint / Private Link | Private Service Connect + Private Google Access |
| Identity & authorization | IAM Roles + Policies | Managed Identities + RBAC | Service Accounts + IAM Roles/Policies |
| Key & secrets management | AWS KMS | Key Vault (keys & secrets) | Cloud KMS + Secret Manager |
| Managed ML serving | SageMaker (VPC mode; Endpoints/Batch) | Azure ML (Managed Online Endpoints) | Vertex AI (Endpoints; VPC-SC) |
| Monitoring & audit | VPC Flow Logs; CloudTrail | Azure Monitor; App Insights/Activity Logs | Cloud Monitoring; Cloud Logging/Audit Logs |

Security and compliance implementation

Data encryption patterns

Private cloud deployments require comprehensive encryption at multiple layers to protect sensitive data throughout its lifecycle. Each cloud provider offers native encryption services that integrate with their key management systems.

Encryption at rest protects stored data using customer-managed keys (CMK) that you control completely. Model artifacts, training data, and inference logs all use envelope encryption where data encryption keys (DEK) are themselves encrypted by key encryption keys (KEK). This approach enables efficient key rotation without re-encrypting massive datasets. Storage services automatically handle encryption and decryption transparently to applications.

Encryption in transit uses TLS 1.3 for all network communications with perfect forward secrecy. Certificate pinning prevents man-in-the-middle attacks by validating server certificates against known good values. Mutual TLS (mTLS) provides bidirectional authentication between services. VPN or dedicated network connections add another encryption layer for highly sensitive environments.

Key management best practices include automatic key rotation every 30-90 days to limit exposure windows, key versioning to support gradual migration and rollback, hardware security module (HSM) backing for root keys, and split key custody where multiple parties must cooperate for critical operations.
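The envelope-encryption and rotation points above can be illustrated with a toy sketch. The XOR "cipher" here is a stand-in for AES-GCM so the example stays dependency-free; never use XOR in production. The point is structural: rotating the KEK only re-wraps the small DEK, leaving the bulk ciphertext untouched.

```python
import secrets

def xor_bytes(data: bytes, key: bytes) -> bytes:
    # Toy cipher standing in for AES-GCM; never use plain XOR in production.
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

# Key hierarchy: the KEK lives in KMS/HSM; a fresh DEK encrypts each object.
kek_v1 = secrets.token_bytes(32)
dek = secrets.token_bytes(32)

plaintext = b"model-weights-shard-000"
ciphertext = xor_bytes(plaintext, dek)   # bulk data encrypted once with the DEK
wrapped_dek = xor_bytes(dek, kek_v1)     # only the small DEK is wrapped by the KEK

# Rotation: re-wrap the DEK under a new KEK; the ciphertext is never touched.
kek_v2 = secrets.token_bytes(32)
wrapped_dek = xor_bytes(xor_bytes(wrapped_dek, kek_v1), kek_v2)

# Decrypt path: unwrap the DEK with the current KEK, then decrypt the data.
recovered = xor_bytes(ciphertext, xor_bytes(wrapped_dek, kek_v2))
```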

Audit logging architecture

Comprehensive audit logging provides the foundation for security monitoring, compliance reporting, and forensic analysis. Every API call, configuration change, and data access generates an audit event with sufficient context for investigation.

Essential audit fields include timestamp with microsecond precision, unique event ID for correlation, user or service identity, source IP and network path, action performed and resources affected, success or failure status with error details, and custom metadata relevant to your compliance requirements.
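The essential fields above can be assembled into a record like the following sketch; the identity and resource names are hypothetical.

```python
import uuid
from datetime import datetime, timezone

def audit_event(identity: str, action: str, resource: str,
                success: bool, **metadata) -> dict:
    """Assemble an audit record carrying the essential fields listed above."""
    return {
        "event_id": str(uuid.uuid4()),  # unique ID for cross-system correlation
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="microseconds"),
        "identity": identity,
        "action": action,
        "resource": resource,
        "success": success,
        "metadata": metadata,  # compliance-specific context
    }

evt = audit_event("svc-inference", "InvokeEndpoint",
                  "endpoints/llama-prod", True, source_ip="10.0.3.7")
```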

Log collection architecture uses local agents that capture events at the source with minimal latency. Events stream to central aggregation points that enrich, normalize, and route to appropriate destinations. Hot storage supports real-time analysis and alerting while cold archives maintain long-term compliance records. Immutable storage with cryptographic signing prevents tampering.

Compliance considerations vary by industry and jurisdiction. HIPAA requires seven-year retention for healthcare data access logs. PCI-DSS mandates daily review of security events. GDPR necessitates data lineage tracking for privacy requests. Financial services often require trade reconstruction capabilities. Design your logging architecture to meet the strictest applicable requirements.
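"Design to the strictest applicable requirement" reduces to a maximum over the regimes in scope. A sketch; the HIPAA figure comes from the text above, the others are illustrative assumptions, so confirm exact figures with your compliance team.

```python
# Retention floors in years; HIPAA value from the text above, the rest are
# illustrative assumptions.
RETENTION_YEARS = {"hipaa": 7, "pci-dss": 1, "internal": 1}

def required_retention(regimes: set[str]) -> int:
    """Design the log store to the strictest applicable requirement."""
    return max(RETENTION_YEARS[r] for r in regimes)
```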

Network security controls

Defense-in-depth network security uses multiple layers to protect against various threat vectors. No single control provides complete protection, but combined they create a robust security posture.

Perimeter security starts with eliminating internet-facing surfaces through private endpoints and service connectivity. Web application firewalls (WAF) filter malicious requests at the application layer. DDoS protection absorbs volumetric attacks before they reach your infrastructure. Intrusion detection systems (IDS) identify suspicious patterns and known attack signatures.

Microsegmentation divides your network into security zones with strict access controls between them. Zero-trust principles require explicit verification for every connection regardless of network location. Network policies define allowed communication paths at the application level. Service mesh provides encrypted service-to-service communication with fine-grained access control.
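The zero-trust, default-deny posture described above can be expressed as an explicit allow-list of communication paths; zone names and ports in this sketch are hypothetical.

```python
# Zero-trust default-deny: only explicitly listed (source zone, destination
# zone, port) paths are allowed; zone names and ports are hypothetical.
ALLOWED_PATHS = {
    ("ingress-gw", "inference", 443),
    ("inference", "model-store", 443),
    ("inference", "logging", 8443),
}

def is_allowed(src: str, dst: str, port: int) -> bool:
    """Anything not explicitly allowed is denied."""
    return (src, dst, port) in ALLOWED_PATHS
```

Note there is no path from the ingress gateway directly to the model store; traffic must pass through the inference zone.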

Access control mechanisms include IP allowlisting for known source addresses, certificate-based authentication for service identities, API keys with usage quotas and expiration, and context-aware access that considers device security posture and user behavior patterns.

High availability and disaster recovery

Multi-zone deployment

High availability requires distributing your deployment across multiple availability zones or even regions to survive infrastructure failures. Each cloud provider offers different constructs, but the principles remain consistent. While the provided Terraform scripts do not include multi-zone deployment, there are two common deployment patterns.

Active-active deployment runs identical stacks in multiple zones with load balancing distributing traffic. This approach provides the highest availability but doubles infrastructure costs. Replication must be near real time: model artifacts and configuration replicate continuously; stateful data stores either use globally consistent databases (consensus/geo-distributed SQL) or a single-writer-per-shard pattern with deterministic conflict avoidance. Choose synchronous or very low‑lag asynchronous replication to meet RPO/RTO targets; make writes idempotent and, where multi‑write is unavoidable, implement conflict resolution/merge rules. Health checks automatically route traffic away from unhealthy instances. Session affinity and sticky routing reduce cross‑zone chatter.

Active-passive deployment maintains a warm standby in a secondary zone ready for rapid activation. This reduces costs while providing reasonable recovery times. Replication is typically asynchronous or semi-synchronous from the active to the standby, targeting seconds-to-minutes recovery point objective and minutes recovery time objective. Model artifacts and configuration replicate continuously; databases use read replicas or log shipping with promotion on failover. The passive side runs minimal infrastructure to stay warm. Automated failover procedures activate the standby within minutes, and replication lag SLOs plus regular failover testing validate recovery objectives.
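The core of an automated active-passive failover decision can be sketched as follows: promote the standby only when the active is down and replication lag is within the RPO, so that failover itself never violates the recovery point objective.

```python
def choose_target(active_healthy: bool, standby_healthy: bool,
                  replication_lag_s: float, rpo_s: float) -> str:
    """Active-passive failover decision: promote the standby only when the
    active is down and replication lag is within the recovery point objective."""
    if active_healthy:
        return "active"
    if standby_healthy and replication_lag_s <= rpo_s:
        return "standby"
    raise RuntimeError("failover would violate the recovery point objective")
```

For instance, `choose_target(False, True, 12.0, 60.0)` promotes the standby, while a lag of 120 s against a 60 s RPO raises instead of silently losing data.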

Backup and recovery procedures

Comprehensive backup strategies protect against data loss and enable point-in-time recovery. Different data types require different backup approaches based on criticality and change frequency.

Model artifact backups preserve trained models that represent significant computational investment. Version control tracks model lineage and enables rollback to previous versions. Geographic redundancy protects against regional failures. Incremental backups reduce storage costs for large models. Automated verification ensures backups remain restorable.

Configuration backups capture infrastructure as code, deployment configurations, and security policies. These lightweight backups enable rapid infrastructure recreation. Daily snapshots provide recovery points for configuration drift. Encrypted storage protects sensitive configuration data. Documentation links backups to specific deployment versions.

Recovery procedures must be thoroughly tested before you need them. Recovery time objectives (RTO) define maximum acceptable downtime. Recovery point objectives (RPO) specify acceptable data loss windows. Runbooks document step-by-step recovery procedures. Regular drills validate procedures and train operators. Post-recovery validation ensures full functionality restoration.
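The "automated verification" step mentioned above is often a checksum comparison between the restored artifact and the manifest recorded at backup time. A minimal sketch:

```python
import hashlib

def verify_backup(artifact: bytes, manifest: dict) -> bool:
    """Automated restore check: recompute the digest and compare it with the
    checksum recorded when the backup was taken."""
    return hashlib.sha256(artifact).hexdigest() == manifest["sha256"]

weights = b"stand-in bytes for a model artifact"
manifest = {"version": "v3", "sha256": hashlib.sha256(weights).hexdigest()}
```

An intact restore passes the check; a single flipped byte fails it, flagging the backup as unusable before a real recovery depends on it.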

Cost management and optimization

Resource tagging strategy

Comprehensive tagging enables accurate cost attribution and optimization opportunities. Tags should capture both technical and business context for complete visibility.

Essential tag categories include:

  • Business tags: cost center, project, team, owner
  • Technical tags: environment, application, version, component
  • Operational tags: backup schedule, maintenance window, criticality
  • Compliance tags: data classification, regulatory requirements
  • Automation tags: managed by, creation date, expiration

Consistent tag enforcement requires automated validation in your deployment pipelines. Tag policies prevent resource creation without required tags. Regular audits identify and remediate untagged resources. Cost allocation reports group spending by tag dimensions.
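Automated tag validation in a deployment pipeline can be as simple as a set difference; the required keys below are illustrative and should be derived from the tag categories above.

```python
# Required tag keys are illustrative; derive yours from the categories above.
REQUIRED_TAGS = {"cost-center", "environment", "owner", "data-classification"}

def missing_tags(resource_tags: dict) -> list[str]:
    """Return required tag keys absent from a resource; a non-empty result
    means the deployment pipeline should reject the resource."""
    return sorted(REQUIRED_TAGS - resource_tags.keys())
```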

Cost monitoring and optimization

Proactive cost management prevents budget overruns and identifies optimization opportunities. Multi-layered monitoring provides visibility from infrastructure to business metrics.

Real-time monitoring tracks current spending against budgets with alerts for anomalies. Daily spend trends identify unusual activity requiring investigation. Hourly granularity reveals usage patterns for optimization. Service-level breakdowns show which components drive costs.

Predictive analytics forecast monthly spending based on current trends. Seasonal adjustments account for known usage patterns. Growth projections inform capacity planning. What-if scenarios evaluate optimization impacts.

Optimization strategies reduce costs without compromising performance:

  • Reserved capacity provides 40-60% savings for predictable workloads
  • Spot instances offer 60-90% discounts for interruptible processing
  • Right-sizing eliminates over-provisioned resources
  • Scheduling scales down non-production environments
  • Storage tiering moves cold data to cheaper storage classes
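The savings ranges above can be turned into simple projections; the hourly price here is a hypothetical placeholder, and the discounts are the midpoints of the ranges quoted above, so substitute your provider's actual pricing.

```python
HOURS_PER_MONTH = 730
ON_DEMAND_HOURLY = 4.00  # hypothetical GPU instance price; check your provider

def monthly_cost(hourly: float, discount: float = 0.0,
                 hours: float = HOURS_PER_MONTH) -> float:
    """Projected monthly cost for one always-on instance."""
    return hourly * (1 - discount) * hours

on_demand = monthly_cost(ON_DEMAND_HOURLY)                 # 2920.0
reserved = monthly_cost(ON_DEMAND_HOURLY, discount=0.45)   # mid-range reserved discount
spot = monthly_cost(ON_DEMAND_HOURLY, discount=0.70)       # mid-range spot discount
scheduled = monthly_cost(ON_DEMAND_HOURLY, hours=12 * 22)  # dev env, 12 h on weekdays
```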

Additional resources

After establishing your private cloud deployment, enhance it with these capabilities:

  • Implement the Autoscaling guide for dynamic resource management
  • Study the Pipeline guide for end-to-end MLOps integration
  • Configure Evaluations for model quality monitoring
  • Review Cost comparison framework for optimization strategies
  • Explore Infrastructure migration guidelines for cloud migration patterns

For infrastructure automation, use our Terraform templates which include pre-configured modules for AWS, Azure, and GCP private deployments.
