Infrastructure migration

Overview

This guide outlines a four-phase approach for migrating from OpenAI to Llama models: assessment, proof of concept, gradual migration, and optimization.

Migration methodology

Phase 1: Assessment and planning

Current state analysis

Document your current LLM infrastructure focusing on API call patterns, token-usage volumes, and response-time requirements. Identify which OpenAI endpoints you use (chat completions, embeddings, function calling) and where they're called. Capture monthly token consumption and costs per model to establish migration baselines. Note any custom prompt chains, RAG implementations, or fine-tuned models that require equivalent Llama capabilities.

Define technical requirements specific to LLM workloads: p50/p95/p99 latency targets, tokens-per-second throughput, and model size constraints based on available GPU memory. Establish cost targets comparing current OpenAI spending to projected Llama infrastructure costs. Categorize workloads by migration complexity and risk: simple chat completions (low risk), complex function calling or tool use (medium risk), and production systems with custom fine-tuning or strict compliance requirements (high risk).
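
To make the assessment concrete, the inventory can be captured in a simple structured form; the field names below are hypothetical and should be adapted to whatever your observability stack already records.

```python
from dataclasses import dataclass

# Hypothetical inventory record for one LLM workload; adapt fields to your own telemetry.
@dataclass
class LlmWorkload:
    name: str              # e.g. "support-chat-summarizer"
    openai_endpoint: str   # "chat.completions", "embeddings", ...
    monthly_tokens: int    # prompt + completion tokens per month
    monthly_cost_usd: float
    p95_latency_ms: int    # current latency target
    risk: str              # "low" (simple chat), "medium" (tool use), "high" (fine-tuned / compliance)

workloads = [
    LlmWorkload("support-chat", "chat.completions", 120_000_000, 1800.0, 2500, "low"),
    LlmWorkload("order-agent", "chat.completions", 40_000_000, 900.0, 1500, "medium"),
]
```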

OpenAI to Llama migration considerations

Because Llama models are open source, you'll select from various inference providers (Fireworks, Together AI, Groq, Amazon Bedrock, Azure AI, GCP Vertex AI), each with different API implementations. Key differences from OpenAI include function-calling syntax (OpenAI's functions parameter vs provider-specific tool schemas), response format variations (streaming implementation, token usage reporting), and model naming conventions (gpt-4o vs llama-3.3-70b-instruct). Most providers offer OpenAI-compatible endpoints to ease migration, but verify support for specific features such as JSON mode, system prompts, and logprobs.
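
For example, with an OpenAI-compatible endpoint the official openai Python client can often be reused by changing only the base URL, API key, and model name. The base URL and model identifier below are placeholders; check your provider's documentation for the exact values and for which features are actually supported.

```python
from openai import OpenAI

# Point the standard OpenAI client at a provider's OpenAI-compatible endpoint.
# base_url and model are placeholders; substitute your provider's values.
client = OpenAI(
    base_url="https://api.example-inference-provider.com/v1",
    api_key="YOUR_PROVIDER_API_KEY",
)

response = client.chat.completions.create(
    model="llama-3.3-70b-instruct",   # provider-specific model name
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize our migration plan in one sentence."},
    ],
    temperature=0.2,
)
print(response.choices[0].message.content)
```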

If you have deployed a fine-tuned version of an OpenAI model, you can reuse the same training data, in the same format, to fine-tune a Llama model. Refer to the Fine-tuning guide for more details on how to do so.
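
As a rough sketch, OpenAI-style chat fine-tuning data stored as JSONL with a messages list can usually be reused directly or converted in a few lines; the file names below are placeholders, and the exact target schema depends on the fine-tuning framework you choose (see the Fine-tuning guide).

```python
import json

# Read OpenAI-style chat fine-tuning records ({"messages": [...]}) and re-emit them
# for a Llama fine-tuning pipeline. File names are placeholders.
with open("openai_finetune.jsonl") as src, open("llama_finetune.jsonl", "w") as dst:
    for line in src:
        record = json.loads(line)
        messages = record["messages"]  # [{"role": "system"|"user"|"assistant", "content": ...}]
        dst.write(json.dumps({"messages": messages}) + "\n")
```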

Prompt adaptation challenges

Prompt optimization represents one of the most critical and often underestimated aspects of model migration. The hand-tuning and optimization you performed for OpenAI models typically does not transfer to Llama models, requiring comprehensive prompt re-engineering rather than simple adjustments.

Use Llama Prompt Ops to automatically adapt prompts between OpenAI and Llama models as a starting point. Test your existing OpenAI prompts with minimal changes first, then iterate based on quality evaluations. Document baseline performance metrics before optimization to quantify improvements. Budget additional time for prompt engineering as part of your migration timeline.
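
A minimal way to establish that baseline is to run a fixed prompt set through your current OpenAI model and through a candidate Llama endpoint, storing both outputs for later evaluation. The sketch below assumes the OpenAI-compatible client setup shown earlier and uses placeholder model names and endpoints.

```python
import json, time
from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
llama_client = OpenAI(base_url="https://api.example-inference-provider.com/v1",
                      api_key="YOUR_PROVIDER_API_KEY")  # placeholder endpoint

def collect(client, model, prompt):
    start = time.time()
    out = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}], temperature=0
    )
    return {"model": model, "prompt": prompt,
            "output": out.choices[0].message.content,
            "latency_s": round(time.time() - start, 3)}

prompts = ["Classify this ticket: 'My order never arrived.'"]  # your real prompt set
with open("baseline.jsonl", "w") as f:
    for p in prompts:
        f.write(json.dumps(collect(openai_client, "gpt-4o", p)) + "\n")
        f.write(json.dumps(collect(llama_client, "llama-3.3-70b-instruct", p)) + "\n")
```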

Migration strategy selection

Start by assessing your setup requirements. If you need minimal setup with immediate deployment, choose Managed hosted APIs using Amazon Bedrock, GCP Vertex AI Fully Managed, or Azure AI Foundry Serverless. These services handle all infrastructure management and scaling automatically.

For moderate customization needs with some control over configurations, select Serverless GPU. Options include Amazon SageMaker JumpStart for pre-built model deployments, GCP Vertex AI Self-Deployed for containerized models, or Azure AI Foundry Managed for custom endpoints with managed compute.

When you require maximum control over infrastructure, model versions, and deployment configurations, implement GPU rental by hour. Deploy on AWS using EKS or EC2 instances, on GCP with GKE or GCE, on Azure through AKS or Azure VMs, or own bare metal for complete hardware control. This pattern provides flexibility for custom networking, specialized hardware configurations, and air-gapped deployments.

Phase 2: Proof of concept

Pilot environment setup

Select your model inference provider based on specific requirements: existing cloud providers (Amazon Bedrock, Azure AI, GCP Vertex AI) for integrated enterprise features, or specialized inference providers (Fireworks, Together AI, Groq, Replicate) for optimized performance and pricing. Evaluate providers based on regional availability, model selection, feature support (function calling, streaming, embeddings), and compliance certifications.

For managed hosted APIs, test provider-specific endpoints and authentication methods. Measure end-to-end latency including network overhead and compare token generation speeds across providers. Validate that response formats match your application's expectations since each provider may implement OpenAI compatibility differently.
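
One way to compare providers is to measure time to first token and approximate generation speed over a streaming request; the sketch below again assumes an OpenAI-compatible endpoint with placeholder base URL and model name, and uses a crude word count as a stand-in for real token counting.

```python
import time
from openai import OpenAI

client = OpenAI(
    base_url="https://api.example-inference-provider.com/v1",  # placeholder provider endpoint
    api_key="YOUR_PROVIDER_API_KEY",
)

start = time.time()
first_token_at = None
word_count = 0
stream = client.chat.completions.create(
    model="llama-3.3-70b-instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Write a 200-word product description."}],
    stream=True,
)
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content or ""
    if delta and first_token_at is None:
        first_token_at = time.time()
    word_count += len(delta.split())  # crude proxy; use a tokenizer for real token counts

elapsed = time.time() - start
if first_token_at is not None:
    ttft = first_token_at - start
    print(f"time to first token: {ttft:.2f}s")
    print(f"approx words/sec after first token: {word_count / max(elapsed - ttft, 1e-6):.1f}")
```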

Serverless GPU deployments require configuring your chosen platform's model serving infrastructure. Set up auto-scaling based on request patterns and GPU utilization. Implement VPC peering or private endpoints for secure communication. Test failover scenarios and validate monitoring integrations with your existing observability stack.

GPU rental by hour and bare metal ownership demand infrastructure automation from the start. Deploy GPU-optimized instances with appropriate CUDA versions for your chosen Llama model size. Configure model servers (vLLM, TGI, or TensorRT-LLM) with continuous batching, maximum batch sizes based on GPU memory, and KV cache allocation appropriate for your model size. Establish DevOps workflows for configuration management and implement comprehensive logging for troubleshooting inference performance.
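
For self-hosted serving with vLLM, for instance, batching and KV cache behavior are controlled through a few engine arguments; the sketch below uses the offline Python API with illustrative values, whereas production deployments more commonly run vLLM's OpenAI-compatible server with the equivalent flags.

```python
from vllm import LLM, SamplingParams

# Illustrative vLLM engine settings; tune to your GPU memory and traffic profile.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model; use the size you validated
    max_model_len=8192,                        # context length to reserve KV cache for
    gpu_memory_utilization=0.90,               # fraction of GPU memory for weights + KV cache
    max_num_seqs=64,                           # cap on concurrently batched sequences
)

params = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(["Summarize the migration runbook in two sentences."], params)
print(outputs[0].outputs[0].text)
```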

Performance validation

Validate performance through systematic testing: measure latency across model sizes and providers to find optimal configurations, evaluate output quality using the Evaluations guide methodology, and track actual costs against projections. Test with production-representative workloads including peak traffic patterns and edge cases.

Since API compatibility varies by provider, validate each use case separately. Test streaming responses, function calling, error handling, and rate limit behavior specific to your chosen provider. Verify security controls including API key rotation, network isolation, and audit logging. Run full disaster recovery drills before declaring the pilot successful.

Phase 3: Gradual migration

Pre-migration preparation

Before beginning rollout, complete these essential preparations: back up configurations and API keys, test provider authentication and network connectivity, verify monitoring captures LLM-specific metrics (token usage, model latency, error rates), validate rollback procedures with automated scripts, and schedule migrations during low-traffic periods.

Phased rollout strategy

Start with non-critical workloads: Begin with development environments, batch processing jobs, and internal tools. These low-risk migrations validate your deployment patterns, monitoring setup, and team processes without affecting production. Use A/B testing to compare Llama outputs against OpenAI baselines and track quality metrics through your evaluation framework.

Progress to pilot production workloads: Deploy to systems representing diverse use cases: simple chat interfaces, RAG applications, and function-calling implementations. Maintain parallel deployments with traffic splitting (start with a small percentage of traffic to Llama, increasing gradually) while monitoring p50/p95/p99 latency, cost, and quality metrics. Document any prompt adjustments needed for consistent quality.
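
As one illustration, the traffic split can be implemented with weighted random routing at the application layer, logging which backend served each request for later comparison; the fraction, clients, and model names below are placeholders, and in practice this logic often lives in an API gateway or feature-flag system instead.

```python
import random
from openai import OpenAI

openai_client = OpenAI()  # existing OpenAI account
llama_client = OpenAI(base_url="https://api.example-inference-provider.com/v1",
                      api_key="YOUR_PROVIDER_API_KEY")  # placeholder endpoint

LLAMA_TRAFFIC_FRACTION = 0.05  # start small, increase as quality metrics hold

def route_chat(messages):
    """Send a small fraction of traffic to Llama; report which backend served the request."""
    if random.random() < LLAMA_TRAFFIC_FRACTION:
        client, model, backend = llama_client, "llama-3.3-70b-instruct", "llama"
    else:
        client, model, backend = openai_client, "gpt-4o", "openai"
    response = client.chat.completions.create(model=model, messages=messages)
    # In production, also record latency, token usage, and eval scores tagged by backend.
    return backend, response.choices[0].message.content
```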

Complete production migration: Proceed after achieving quality parity and cost targets. Implement automated failover between providers for high availability. Optimize infrastructure through dynamic batching, model quantization where appropriate, and regional deployment strategies. Decommission OpenAI access only after a defined stability period.

Migration execution and validation

Execute migrations with continuous validation: monitor response quality through automated evals, track performance against SLAs (p50/p95/p99 latency, throughput), validate cost tracking matches provider billing, and maintain runbooks for common issues (rate limits, timeout handling, prompt format errors). Test each integration thoroughly before proceeding to the next.

Post-migration validation ensures sustainability: compare quality metrics against baseline evaluations, verify all features work correctly (streaming, function calling, embeddings), confirm costs align with projections, and update documentation with provider-specific implementation details. Archive OpenAI configurations for potential rollback within your defined retention period.

Phase 4: Optimization and scaling (ongoing)

Performance optimization

Optimize costs through systematic improvements: implement request batching to maximize GPU utilization, use smaller models (8B or 70B vs 405B) where quality permits, cache common responses to reduce redundant inference, and leverage spot instances for async workloads. Track cost-per-request across different model sizes and providers to identify optimization opportunities.

Enhance performance with targeted tuning: adjust generation parameters (temperature, top_p) per use case, implement semantic caching for similar queries, optimize prompt lengths through iterative refinement, and use model-specific optimizations (Flash Attention for 70B+ models, continuous batching for high throughput). Deploy models closer to users with edge locations when latency requirements demand it.
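
For the caching point, an exact-match cache keyed on the normalized prompt and generation parameters is a reasonable first step before investing in semantic caching (which typically keys on prompt embeddings). The sketch below is an in-process illustration; production systems usually back this with Redis or a similar shared store.

```python
import hashlib, json

_cache: dict[str, str] = {}  # in-process stand-in for Redis or another shared cache

def cache_key(model: str, messages: list, temperature: float) -> str:
    payload = json.dumps({"m": model, "msgs": messages, "t": temperature}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def cached_completion(client, model, messages, temperature=0.0):
    key = cache_key(model, messages, temperature)
    if key in _cache:
        return _cache[key]  # cache hit: no inference cost
    out = client.chat.completions.create(model=model, messages=messages,
                                         temperature=temperature)
    text = out.choices[0].message.content
    _cache[key] = text      # only safe for deterministic, low-temperature calls
    return text
```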

Operations

Build operational maturity with LLM-specific monitoring: token-usage patterns, response quality scores from automated evals, provider API errors and rate limits, and cost anomalies. Configure alerts for quality degradation (eval score drops), latency spikes, and unexpected cost increases. Automate common responses such as provider failover and request retry logic.

Maintain excellence through continuous improvement: review model updates (Llama 3.2 → 3.3), optimize prompts based on eval results, update provider configurations for new features, and share learnings across teams. Establish monthly cost reviews comparing actual usage to projections and identifying optimization opportunities through usage pattern analysis.

Risk mitigation strategies

Define your risk tolerance thresholds based on business requirements: determine retention periods for rollback capability based on your change-management policies, set budget alert thresholds based on historical spending patterns and variance tolerance, and calculate capacity buffers based on peak traffic analysis and growth projections.

Implement technical safeguards: circuit breakers for provider failures with automatic fallback, secure API key management with regular rotation (as relevant), SLAs with inference providers matching your uptime requirements, and automated security patching for inference containers. Test disaster recovery procedures quarterly and document provider-specific limitations that affect your use cases.
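
As one possible shape for the circuit-breaker safeguard, the sketch below trips to a fallback provider after consecutive primary failures and retries the primary after a cooldown. Thresholds, endpoints, and model names are illustrative; a resilience library or your API gateway's built-in failover can replace this hand-rolled version.

```python
import time
from openai import OpenAI

primary = OpenAI(base_url="https://api.primary-provider.example/v1", api_key="KEY_A")    # placeholders
fallback = OpenAI(base_url="https://api.fallback-provider.example/v1", api_key="KEY_B")

FAILURE_THRESHOLD = 3   # consecutive failures before the breaker opens
COOLDOWN_SECONDS = 60   # how long to route around the primary
_failures = 0
_opened_at = 0.0

def complete(messages, model="llama-3.3-70b-instruct"):
    global _failures, _opened_at
    breaker_open = _failures >= FAILURE_THRESHOLD and time.time() - _opened_at < COOLDOWN_SECONDS
    client = fallback if breaker_open else primary
    try:
        out = client.chat.completions.create(model=model, messages=messages)
        if client is primary:
            _failures = 0   # primary healthy again; close the breaker
        return out.choices[0].message.content
    except Exception:
        if client is primary:
            _failures += 1
            if _failures >= FAILURE_THRESHOLD:
                _opened_at = time.time()
            # one immediate retry against the fallback provider
            out = fallback.chat.completions.create(model=model, messages=messages)
            return out.choices[0].message.content
        raise
```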

Measuring success

Quality evaluations determine migration readiness. Follow the Evaluations guide to build automated evaluation suites comparing Llama outputs to OpenAI baselines. Track evaluation scores continuously and consider migration successful when Llama achieves your defined quality threshold across your use cases.
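
A minimal version of such a suite replays the baseline prompts, scores each Llama output against the stored OpenAI output with whatever judge your Evaluations setup uses, and reports the pass rate against your quality threshold. The scoring function and file layout below are placeholders for that judge and for your own results format.

```python
import json

QUALITY_THRESHOLD = 0.90  # example threshold; set from your own requirements

def score(llama_output: str, baseline_output: str) -> float:
    """Placeholder judge: substitute an LLM-as-judge or task-specific metric here."""
    return 1.0 if llama_output.strip().lower() == baseline_output.strip().lower() else 0.0

def pass_rate(results_path: str = "eval_results.jsonl") -> float:
    scores = []
    with open(results_path) as f:
        for line in f:
            row = json.loads(line)  # expects {"llama": ..., "openai": ...} per prompt
            scores.append(score(row["llama"], row["openai"]))
    return sum(scores) / len(scores)

if __name__ == "__main__":
    rate = pass_rate()
    print(f"pass rate: {rate:.2%} -> {'ready' if rate >= QUALITY_THRESHOLD else 'keep iterating'}")
```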

Technical metrics validate infrastructure decisions: p50/p95/p99 latency compared to baselines, tokens-per-second throughput under load, provider uptime and error rates, and cost per 1K tokens. Set clear thresholds based on your requirements.

Migration success ultimately depends on two key metrics: evaluation scores confirming quality parity and cost savings validating the business case. Document migration patterns and provider-specific optimizations to accelerate future transitions.

Tools and resources

Essential migration tools

Llama Prompt Ops: Automates prompt optimization when migrating prompts from OpenAI to Llama models, and can also be applied to user-written prompts.

Provider testing: Compare inference providers by running your actual workloads during pilot phase. Test latency, throughput, feature compatibility, and pricing with production-representative traffic before committing.

Infrastructure as Code: Terraform modules for Llama deployments; Docker images with pre-compiled kernels, Flash Attention support, and minimal dependencies; and Kubernetes manifests for auto-scaling configurations. Start with provider quickstart templates, then customize for production.

Monitoring solutions: OpenTelemetry instrumentation for LLM metrics, Grafana dashboards for token usage and latency tracking, and automated evaluation pipelines using evaluation and monitoring platforms.
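
For instance, per-request token counts and latency can be recorded with the OpenTelemetry metrics API as sketched below; this assumes a MeterProvider and exporter are already configured elsewhere in your service, and the metric and attribute names are illustrative.

```python
from opentelemetry import metrics

# Assumes a MeterProvider + exporter (e.g. OTLP to your collector) is configured at startup.
meter = metrics.get_meter("llm.inference")
token_counter = meter.create_counter("llm.tokens.total", unit="token",
                                     description="Prompt + completion tokens per request")
latency_hist = meter.create_histogram("llm.request.duration", unit="ms",
                                      description="End-to-end LLM request latency")

def record_request(provider: str, model: str, total_tokens: int, elapsed_ms: float) -> None:
    attrs = {"provider": provider, "model": model}  # illustrative attribute names
    token_counter.add(total_tokens, attributes=attrs)
    latency_hist.record(elapsed_ms, attributes=attrs)
```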

Next steps

After completing your migration assessment, proceed to the Cost comparison framework and basic deployment patterns guide to evaluate possible deployment patterns and compare costs.

Reference Terraform Scripts for infrastructure automation.
