Autoscaling self-hosted Llama models

Overview

Autoscaling self-hosted Llama models requires balancing GPU costs, model loading times, and inference performance. Unlike API-based deployments where infrastructure is abstracted away, self-hosted models demand careful orchestration of expensive GPU resources and strategic planning for traffic patterns.

The fundamental challenge lies in the nature of large language models. A 70B parameter model requires 140 GB of VRAM in FP16 precision, taking 2-5 minutes to load from storage to GPU memory. This cold start penalty makes traditional autoscaling patterns ineffective: by the time a new instance launches, the traffic burst is often over and request queues have overflowed. Additionally, GPU instances cost $2-50 per hour, making over-provisioning prohibitively expensive while under-provisioning degrades user experience.

This guide helps you navigate these challenges by explaining autoscaling patterns, implementation strategies, and operational best practices for self-hosted Llama deployments. Whether you're running a research cluster with sporadic usage or a production service requiring sub-second latency, you'll learn how to optimize resource utilization while maintaining performance.

Understanding self-hosted autoscaling

Key metrics and scaling triggers

Effective autoscaling for LLM inference relies on understanding which metrics truly indicate scaling needs. Request queue depth provides the most reliable signal, directly correlating with user wait times. When requests queue beyond acceptable thresholds, new instances should provision before users experience timeouts.

GPU utilization, while important for cost optimization, can be misleading as a primary scaling trigger. Inference workloads typically show high GPU utilization (80-95%) even under normal load due to the computational intensity of token generation. Memory utilization provides better insights, especially when implementing dynamic batching strategies that trade memory for throughput.

LLM-specific performance metrics serve as crucial quality indicators rather than direct scaling triggers. Monitor key Service Level Objectives (SLOs) like time to first token (TTFT) to measure perceived responsiveness, and output tokens per second for generation speed. For interactive applications, target a TTFT under 500ms and a generation rate of 20-50 tokens per second per user to ensure a good user experience.
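
As a concrete illustration of these SLOs, the sketch below derives TTFT and per-request generation rate from request timestamps; the timestamp fields and the streaming client they wrap are assumptions rather than part of any particular serving framework.

# Compute TTFT and generation rate for one streaming request (hypothetical timing fields)
from dataclasses import dataclass

@dataclass
class RequestTiming:
    submitted_at: float = 0.0    # request accepted by the server
    first_token_at: float = 0.0  # first output token streamed back
    finished_at: float = 0.0     # final token streamed back
    output_tokens: int = 0

def slo_metrics(t: RequestTiming) -> dict:
    ttft = t.first_token_at - t.submitted_at
    generation_seconds = max(t.finished_at - t.first_token_at, 1e-9)
    return {
        "ttft_seconds": ttft,  # target < 0.5 s for interactive applications
        "tokens_per_second": t.output_tokens / generation_seconds,  # target 20-50 tok/s per user
    }

# Record time.monotonic() at each point in your streaming client, then call slo_metrics().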

Model sizing and GPU requirements

Selecting appropriate GPU instances requires understanding the relationship between model parameters, precision, and memory requirements:

Model Size | FP16 Memory | INT8 Memory | INT4 Memory | Minimum GPU      | Recommended GPU
-----------|-------------|-------------|-------------|------------------|-----------------
7B         | 14 GB       | 7 GB        | 3.5 GB      | 1x T4 (16 GB)    | 1x A10G (24 GB)
13B        | 26 GB       | 13 GB       | 6.5 GB      | 1x V100 (32 GB)  | 1x A100 (40 GB)
70B        | 140 GB      | 70 GB       | 35 GB       | 2x A100 (80 GB)  | 4x A100 (40 GB)
405B       | 810 GB      | 405 GB      | 203 GB      | 8x H100 (80 GB)  | 8x H200 (141 GB)

These calculations include model weights only. Add 20-30% overhead for KV cache, activations, and framework requirements. Quantization dramatically reduces memory requirements but impacts model quality, with INT4 showing noticeable degradation on complex reasoning tasks.
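
A rough sizing helper that applies this overhead figure is sketched below; it estimates weights-only memory in decimal GB and adds an assumed 25% margin, so treat the output as a planning estimate rather than a measured requirement.

# Estimate serving memory: weights plus an assumed overhead for KV cache,
# activations, and framework buffers.
def estimate_serving_memory_gb(params_billions, precision_bits, overhead=0.25):
    weights_gb = params_billions * (precision_bits / 8)  # decimal GB, dense weights only
    return weights_gb * (1 + overhead)

print(estimate_serving_memory_gb(70, 16))  # ~175 GB: FP16 70B needs headroom beyond its 140 GB of weights
print(estimate_serving_memory_gb(70, 4))   # ~44 GB for an INT4 70B deployment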

Cold start optimization strategies

Model loading dominates cold start times in LLM deployments. A 70B model loads in distinct phases: storage to system RAM (30-60 seconds), system RAM to GPU memory (60-120 seconds), and initialization of CUDA kernels and KV cache (10-30 seconds). Understanding this pipeline helps identify optimization opportunities.

Pre-loading models onto local NVMe storage reduces loading time by 50-70% compared to object storage. Maintaining warm standby instances eliminates cold starts entirely but doubles infrastructure costs. The optimal strategy depends on your traffic patterns and latency requirements.
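
One way to implement the NVMe pre-loading described above is to pull the weights onto local disk before the serving process starts. The sketch below assumes the huggingface_hub client, a hypothetical /nvme mount point, and an illustrative model repository that your token has access to.

# Pre-stage model weights onto local NVMe so the server loads from fast local
# storage instead of object storage or a remote hub at cold start.
from huggingface_hub import snapshot_download

LOCAL_WEIGHTS_DIR = "/nvme/models/llama-70b-instruct"  # hypothetical NVMe mount

snapshot_download(
    repo_id="meta-llama/Llama-3.1-70B-Instruct",  # illustrative; requires an accepted license and HF token
    local_dir=LOCAL_WEIGHTS_DIR,
    allow_patterns=["*.safetensors", "*.json", "*.model"],  # weights, config, tokenizer
)
# Point the inference server at LOCAL_WEIGHTS_DIR when it starts.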

Very large models may need to be sharded across GPUs, which introduces additional startup complexity. Pipeline parallelism minimizes communication overhead but can underutilize GPUs during the pipeline bubble. Tensor parallelism provides better utilization but requires high-bandwidth interconnects. For production deployments, tensor parallelism typically delivers superior performance despite the infrastructure requirements.
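
As an example of tensor parallelism in practice, vLLM can shard a model across GPUs with a single argument; the local path (from the pre-staging sketch above) and GPU count are illustrative.

# Shard a 70B model across 4 GPUs with tensor parallelism in vLLM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/nvme/models/llama-70b-instruct",  # hypothetical local weights path
    tensor_parallel_size=4,                   # one shard per GPU; relies on fast interconnects
    dtype="float16",
)
outputs = llm.generate(["Explain tensor parallelism in one sentence."],
                       SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)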

Implementation patterns

LLM autoscaling uses two primary patterns. Horizontal scaling adds or removes instances to handle unpredictable traffic and is the flexible, common choice for production workloads. Vertical scaling adjusts the resources of existing instances and works best for predictable load changes, like scaling down overnight.

Horizontal scaling architectures

Horizontal scaling provides the most flexibility but requires careful orchestration to manage costs and performance. Container orchestration platforms like Kubernetes provide the foundation, using Horizontal Pod Autoscalers (HPA) to trigger scaling on custom metrics like queue depth:

# Simplified HPA for queue-based scaling
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llama-inference-hpa        # placeholder name
spec:
  scaleTargetRef:                  # the Deployment running the inference server
    apiVersion: apps/v1
    kind: Deployment
    name: llama-inference          # placeholder name
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: inference_queue_depth
      target:
        type: AverageValue
        averageValue: "30"  # Scale when the queue exceeds 30 requests per pod
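
For the HPA above to see inference_queue_depth, the serving process (or a sidecar) must export it, and a metrics adapter such as prometheus-adapter or KEDA must bridge it into the Kubernetes custom metrics API. A minimal exporter sketch using the Prometheus Python client, with a placeholder queue, is shown below.

# Export the queue depth consumed by the HPA above as a Prometheus gauge.
import time
from prometheus_client import Gauge, start_http_server

queue_depth = Gauge("inference_queue_depth",
                    "Requests waiting for an inference slot")

def current_queue_size():
    return 0  # placeholder: return the real length of your server's request queue

if __name__ == "__main__":
    start_http_server(9100)  # exposes /metrics for Prometheus to scrape
    while True:
        queue_depth.set(current_queue_size())
        time.sleep(5)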

Cloud platforms offer managed solutions that simplify deployment. Amazon SageMaker provides built-in autoscaling with customizable metrics. Google Vertex AI offers similar capabilities with automatic model optimization. Azure ML supports both online endpoints and batch inference with integrated scaling. See deployment templates for production-ready configurations.

Vertical scaling patterns

Vertical scaling is most effective for scheduled, time-based resource changes. Time-based vertical scaling reduces costs during off-peak hours by scaling down to smaller instance types overnight and scaling up before business hours. This approach requires graceful handling of instance replacements and temporary capacity reduction during transitions.
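
A sketch of the time-based approach on SageMaker is shown below. It assumes two pre-created endpoint configurations (hypothetical names) that differ only in instance type; calling update_endpoint swaps instances without dropping the endpoint, with the temporary capacity reduction noted above.

# Swap a SageMaker endpoint between day and night endpoint configs on a schedule.
import datetime
import boto3

sagemaker = boto3.client("sagemaker")

DAY_CONFIG = "llama-70b-config-day"      # hypothetical: larger instance type
NIGHT_CONFIG = "llama-70b-config-night"  # hypothetical: smaller, cheaper instance type

def apply_scheduled_config(endpoint_name="llama-endpoint"):
    hour = datetime.datetime.utcnow().hour
    target = DAY_CONFIG if 7 <= hour < 22 else NIGHT_CONFIG
    sagemaker.update_endpoint(EndpointName=endpoint_name,
                              EndpointConfigName=target)

# Invoke at the schedule boundaries, e.g. from cron or an EventBridge rule.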

Dynamic vertical scaling based on load remains challenging for GPU workloads. Unlike CPU and memory, GPUs cannot be dynamically added to running instances. Some cloud providers offer GPU partitioning (MIG on NVIDIA A100/H100) that enables partial GPU allocation, though framework support varies.

Framework considerations

Choosing an inference framework

Framework selection significantly impacts autoscaling capabilities and operational complexity:

Framework    | Strengths                                                              | Limitations                             | Best For
-------------|------------------------------------------------------------------------|-----------------------------------------|---------------------------------------
vLLM         | PagedAttention optimization, high throughput, OpenAI-compatible API     | Complex configuration, memory-intensive | High-volume production services
TGI          | Production-ready, built-in quantization, speculative decoding support   | Hugging Face ecosystem lock-in          | Rapid prototyping and deployment
Ollama       | Simple deployment, automatic model management                           | Limited customization, single-node only | Development and small-scale production
TensorRT-LLM | Maximum performance, hardware optimization                              | NVIDIA-only, complex build process      | Latency-critical applications

For autoscaling, vLLM and TGI provide the best balance of performance and operational features. Both support distributed inference, custom metrics export, and graceful shutdown handling required for production autoscaling.

Quantization trade-offs

Quantization enables running larger models on smaller GPUs, directly impacting scaling economics:

# Memory calculation example (model weights only, decimal GB)
def calculate_memory_gb(params_billions, precision_bits):
    bytes_per_param = precision_bits / 8
    # params (in billions) * 1e9 params * bytes per param, divided by 1e9 bytes per GB
    return params_billions * bytes_per_param

# 70B model comparison
fp16_memory = calculate_memory_gb(70, 16)  # 140 GB
int4_memory = calculate_memory_gb(70, 4)   # 35 GB

AWQ and GPTQ maintain quality better than simple round-to-nearest quantization. For Llama models, AWQ typically preserves 98-99% of FP16 quality at INT4 precision. However, complex reasoning and mathematical tasks show more degradation. Test quantized models thoroughly against your specific use cases before production deployment.
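
Quantized checkpoints drop into the same serving stack. The sketch below loads a hypothetical AWQ-quantized checkpoint with vLLM and spot-checks a few prompts, including an arithmetic one, before the model is promoted to production traffic.

# Smoke-test an AWQ-quantized checkpoint before routing production traffic to it.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/nvme/models/llama-70b-instruct-awq",  # hypothetical AWQ checkpoint path
    quantization="awq",
    tensor_parallel_size=2,  # ~35 GB of INT4 weights fits on far fewer GPUs
)

eval_prompts = [
    "Summarize the trade-offs of INT4 quantization in two sentences.",
    "What is 17 * 24?",  # arithmetic and reasoning prompts tend to degrade first
]
for out in llm.generate(eval_prompts, SamplingParams(max_tokens=128, temperature=0)):
    print(out.prompt, "->", out.outputs[0].text.strip())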

Production operations

Monitoring and observability

Effective monitoring enables proactive scaling and rapid issue resolution. Track these essential metrics:

  • System metrics: GPU utilization, memory usage, temperature, and power consumption indicate hardware health and capacity. Monitor at both node and cluster levels to identify hot spots and imbalanced load distribution.

  • Application metrics: Request latency (P50, P95, P99), token generation rate, and queue depth directly impact user experience. Set alerts on P99 latency exceeding SLA thresholds and queue depth growing beyond normal ranges.

  • Business metrics: Cost per request, daily active models, and capacity utilization help optimize infrastructure spending. Track spot instance savings and compare against on-demand costs to validate your mixed instance strategy.

Cost optimization strategies

GPU costs dominate self-hosted LLM expenses. Implement these strategies to optimize spending without sacrificing performance:

  • Reserved instances and aggressive scale-down: Use reserved instances for baseline capacity to get 30-60% discounts, and scale with on-demand or spot instances. During low-traffic periods, scale to zero and accept cold-start penalties, as LLM inference usage is often minimal overnight.

  • Spot instances: Use spot instances for burst scaling to save 60-90%, supplementing a baseline of on-demand instances. Handle interruptions gracefully by using the 30-120 second termination notice to drain requests, redirect traffic, and ensure enough on-demand capacity to cover the reclamation (see the sketch after this list).

  • Fallback scaling: For extreme traffic bursts, maintain availability by routing overflow requests to a pool of smaller, faster, or quantized models. This "graceful degradation" serves a lower-quality response instead of an error, e.g., routing requests from a saturated 70B model to a 13B quantized model.

  • Multi-tenancy: Amortize GPU costs by serving multiple models or users from the same instances. Use request routing, priority queues, and model caching to manage resource allocation and keep frequently used models warm.
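
The sketch below illustrates the spot interruption handling referenced above, assuming AWS EC2 spot instances with IMDSv2; the drain step is a placeholder for whatever your serving framework and load balancer expose.

# Poll the EC2 instance metadata service for a spot interruption notice, then drain.
import time
import requests

IMDS = "http://169.254.169.254"

def imds_token():
    return requests.put(f"{IMDS}/latest/api/token",
                        headers={"X-aws-ec2-metadata-token-ttl-seconds": "21600"},
                        timeout=2).text

def interruption_scheduled(token):
    r = requests.get(f"{IMDS}/latest/meta-data/spot/instance-action",
                     headers={"X-aws-ec2-metadata-token": token},
                     timeout=2)
    return r.status_code == 200  # 404 means no interruption is currently scheduled

def drain():
    # Placeholder: mark this instance unhealthy in the load balancer, stop accepting
    # new requests, and let in-flight generations finish within the notice window.
    pass

while True:
    if interruption_scheduled(imds_token()):
        drain()
        break
    time.sleep(5)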

Troubleshooting common issues

Out-of-memory errors during scaling indicate insufficient headroom for batch processing. Calculate maximum batch sizes based on available memory after model loading. Implement dynamic batch sizing that adjusts based on current memory usage rather than fixed configurations.
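
One way to implement that dynamic sizing, assuming PyTorch with CUDA and a rough, workload-specific per-sequence KV-cache estimate, is sketched below.

# Cap the next batch by the GPU memory that is actually free right now.
import torch

def max_batch_size(per_seq_kv_cache_gb, safety_margin=0.2, device=0):
    free_bytes, _total_bytes = torch.cuda.mem_get_info(device)
    usable_gb = (free_bytes / 1e9) * (1 - safety_margin)
    return max(1, int(usable_gb // per_seq_kv_cache_gb))

# Example: if each sequence needs ~1.6 GB of KV cache at the configured context
# length, recompute the cap before forming each batch instead of fixing it.
print(max_batch_size(per_seq_kv_cache_gb=1.6))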

Slow scaling response times suggest overly conservative scaling policies. Reduce stabilization windows and increase scale-up percentages for faster response to load spikes. Monitor false positive scaling events and adjust thresholds accordingly.

Uneven load distribution across instances points to ineffective load balancing. Ensure your load balancer uses least-connections rather than round-robin for long-running inference requests. Implement connection draining with sufficient timeout for in-flight requests.

Practical examples

Basic autoscaling setup

Deploy a simple autoscaled Llama inference service using managed platforms:

# SageMaker endpoint with autoscaling
# (ml.g5.12xlarge = 4x A10G, which assumes a quantized 70B variant;
#  use ml.p4d.24xlarge or larger for FP16 weights)
aws sagemaker create-endpoint-config \
  --endpoint-config-name llama-70b-config \
  --production-variants VariantName=primary,ModelName=llama-70b,InstanceType=ml.g5.12xlarge,InitialInstanceCount=1

aws sagemaker create-endpoint \
  --endpoint-name llama-endpoint \
  --endpoint-config-name llama-70b-config

# Configure autoscaling
aws application-autoscaling register-scalable-target \
  --service-namespace sagemaker \
  --resource-id endpoint/llama-endpoint/variant/primary \
  --scalable-dimension sagemaker:variant:DesiredInstanceCount \
  --min-capacity 1 --max-capacity 10

Production deployment checklist

Before deploying autoscaled LLM inference to production, validate:

  • [ ] Model loads successfully on target GPU instances
  • [ ] Inference latency meets P99 SLA requirements under load
  • [ ] Autoscaling triggers on appropriate metrics (queue depth, not GPU util)
  • [ ] Graceful shutdown handles in-flight requests
  • [ ] Monitoring dashboards show all critical metrics
  • [ ] Cost alerts configured for unexpected scaling events
  • [ ] Spot instance interruption handling tested
  • [ ] Model weights cached on fast storage (NVMe preferred)
  • [ ] Health checks validate model readiness, not just container status
  • [ ] Load balancer timeout exceeds maximum inference time

Next steps

Deploy your first autoscaled Llama model using SageMaker Terraform templates or Kubernetes configurations. These provide production-ready starting points with integrated monitoring and cost optimization.

Review the accelerator management guide for detailed GPU selection criteria and performance benchmarks. Understanding hardware capabilities helps optimize instance selection and scaling thresholds.

Explore quantization techniques to reduce infrastructure costs while maintaining acceptable model quality. Start with AWQ INT4 quantization for the best quality-to-compression ratio.

For framework-specific optimizations, consult vLLM documentation for high-throughput deployments or TGI documentation for rapid prototyping.

Note: Production autoscaling requires thorough testing under realistic load patterns. Use load testing tools to validate scaling behavior and identify bottlenecks before going live. Start with conservative scaling policies and adjust based on observed metrics.
