Accelerator management

Introduction

Large language models (LLMs) require significantly more computational resources to run than traditional software workloads. While it is possible to deploy them on CPUs, LLMs are typically neither cost-effective nor low-latency on non-specialized hardware. Instead, models are typically deployed on accelerators, such as GPUs, which are optimized for the compute-intensive operations that LLM inference requires. Using these expensive accelerators effectively is a critical part of deploying LLMs on self-managed infrastructure; doing it poorly can drive significant, unnecessary costs.

Note: The term "GPU" is often used interchangeably with "accelerator", but because GPUs are a specific type of accelerator, we will use the broader term "accelerator" in this guide to refer to any hardware that is used to accelerate LLM inference (for example, Google's TPU).

An important part of operating GPUs and other accelerators is managing their cost and utilization. For a more in-depth treatment, read the cost projection guide.

Purpose of the guide

This guide will help you:

  • Understand key considerations when selecting and managing accelerators, including factors impacting cost-effectiveness and performance.
  • Effectively deploy and optimize accelerator usage, ensuring maximum utilization and minimal idle time for LLM inference workloads.
  • Navigate both cloud-hosted and on-premises accelerator management strategies to achieve optimal efficiency and performance.

Scope and assumptions

Focus on inference

This guide focuses mainly on using accelerators for inference--that is, deploying, operating, and maintaining an LLM that has already been trained. Training or fine-tuning your own models is outside the scope of this document.

Model agnostic

While all LLMs, both proprietary and open-source, have their own idiosyncrasies, most have similar methods of operation and thus have similar deployment patterns. For concrete examples, this guide will focus on the open-source Llama models, but the principles can be applied to any LLM.

Accelerator selection

What factors to consider when selecting an accelerator

  • Memory: The amount of video memory (VRAM) available on the accelerator is a critical factor in the size of model you can run. While it is possible to offload some of the model to system memory, in practice this reduces performance so much that it is rarely cost-effective. Additionally, some options that improve performance (like KV-caching) will consume memory.
  • Compute power: The number of floating point operations per second (FLOPS) the accelerator can perform determines the throughput and latency of your model. Some accelerators include specialized hardware optimized for low-precision operations, such as INT8 or INT4 matrix multiplications, enabling more efficient execution of quantized models.
  • Availability and cost: State-of-the-art accelerator hardware (especially for datacenter use) comes at a high cost--a single accelerator may cost tens of thousands of dollars. Accelerators are also often in high demand and can be difficult to source, both for purchase and for cloud usage.
  • Support: While most accelerators support common reference architectures, newer or less widely used models may have operations that are either unsupported or poorly optimized on some cards.

How to estimate model requirements

The amount of VRAM required for a model is a function of:

  • Model parameters: The total number of parameters in the model (e.g., 7B, 13B, 70B). Each parameter requires memory to store its weight value. Models with more parameters require proportionally more VRAM.
  • Numerical precision: The data type used to store parameters and activations (FP32, FP16, BF16, INT8, INT4). Lower precision formats reduce memory requirements--for example, FP16 uses half the memory of FP32, while INT8 uses one-quarter.
  • Sequence length: The maximum context length your application will use. Longer sequences require more memory for the key-value (KV) cache that stores attention information for previously processed tokens.
  • Batch size: The number of requests processed simultaneously. Each additional request in a batch requires memory for its activations and KV cache.

As a general rule of thumb, for FP16 models you need approximately 2 bytes per parameter for the model weights, plus additional memory for KV cache (roughly 150-300 MB per 1,000 tokens depending on model size), activations, and system overhead. For example, a 70B parameter model in FP16 would require roughly 140 GB just for weights, plus additional memory for inference operations.
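To make this rule of thumb concrete, here is a minimal back-of-the-envelope calculator. The per-1,000-token KV cache figure and the overhead factor are rough assumptions drawn from the ranges above; replace them with numbers measured for your specific model and serving stack.

```python
def estimate_vram_gb(
    params_billion: float,               # model size, e.g. 70 for a 70B model
    bytes_per_param: float = 2.0,        # 2 for FP16/BF16, 1 for INT8, 0.5 for INT4
    max_tokens: int = 8192,              # maximum sequence length per request
    batch_size: int = 8,                 # concurrent requests
    kv_mb_per_1k_tokens: float = 200.0,  # assumed midpoint of the 150-300 MB range above
    overhead_factor: float = 1.2,        # assumed ~20% for activations and runtime overhead
) -> float:
    """Rough VRAM estimate (GB) for serving an LLM; not a substitute for measurement."""
    weights_gb = params_billion * bytes_per_param  # 1e9 params x bytes, divided by 1e9 bytes/GB
    kv_cache_gb = batch_size * (max_tokens / 1000) * kv_mb_per_1k_tokens / 1000
    return (weights_gb + kv_cache_gb) * overhead_factor

# Example: a 70B-parameter model in FP16 serving 8 concurrent 8K-token requests
print(f"{estimate_vram_gb(70):.0f} GB")  # ~184 GB with these assumptions -> multiple 80 GB cards
```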

Sharding and multi-node

Many models (especially with the rise of mixture-of-experts) will be too large to fit on a single accelerator, even for the largest available cards. In these cases, it is possible to shard the model across multiple accelerators in the same machine, running part of the model on each and then combining the results. Nearly all frameworks support this out of the box, and it is the only way to run some models. It is also possible to shard the model across multiple machines (called multi-node), which has the same effect but with the added benefit of much larger possible scale. This multi-node sharding isn't supported by all frameworks.
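For illustration, here is how sharding typically looks in practice with one popular serving framework, vLLM. The model name and parallelism degrees are placeholders; other frameworks expose equivalent settings under different names.

```python
# Minimal sketch: sharding one model across 8 accelerators in a single machine with vLLM.
# The model name and parallelism degree are illustrative placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # example model; substitute your own
    tensor_parallel_size=8,  # split each layer's weights across 8 accelerators
    # pipeline_parallel_size=2,  # for multi-node sharding, if your framework/cluster supports it
)

outputs = llm.generate(
    ["Explain tensor parallelism in one sentence."],
    SamplingParams(max_tokens=64, temperature=0.0),
)
print(outputs[0].outputs[0].text)
```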

Cloud-hosted

Instance types matrix

Each cloud provider has its own instance types, with different capabilities and price points, and providers frequently adjust pricing and revamp or add instance types. The tables below provide a reference overview of the largest cloud providers at the time of writing, but always check each provider's latest pricing and instance offerings.

A100

Provider | Instance Type     | RAM       | CPU     | # GPUs
AWS      | p4de.24xlarge     | 1,152 GiB | 96 vCPU | 8× A100 80 GB
Azure    | ND96amsr A100 v4  | 1,900 GiB | 96 vCPU | 8× A100 80 GB
GCP      | a2-ultragpu-8g    | 1,360 GiB | 96 vCPU | 8× A100 80 GB

H100

Provider | Instance Type     | RAM       | CPU      | # GPUs
AWS      | p5.48xlarge       | 2,048 GiB | 192 vCPU | 8× H100 80 GB
Azure    | ND96isr H100 v5   | 1,900 GiB | 96 vCPU  | 8× H100 80 GB
GCP      | a3-highgpu-8g     | 1,872 GiB | 208 vCPU | 8× H100 80 GB

H200

Provider | Instance Type     | RAM       | CPU      | # GPUs
AWS      | p5e.48xlarge      | 2,048 GiB | 192 vCPU | 8× H200 141 GB
Azure    | ND96isr H200 v5   | 1,900 GiB | 96 vCPU  | 8× H200 141 GB
GCP      | a3-ultragpu-8g    | 2,952 GiB | 224 vCPU | 8× H200 141 GB

B200

Provider | Instance Type         | RAM       | CPU              | # GPUs
AWS      | p6-b200.48xlarge      | 2,048 GiB | 192 vCPU         | 8× B200 180 GB
Azure    | ND128isr NDR GB200 v6 | 900 GiB   | 128 vCPU (Grace) | 4× Blackwell (per VM)
GCP      | a4-highgpu-8g         | 3,968 GiB | 224 vCPU         | 8× B200 180 GB

Custom accelerators

Some cloud providers offer their own custom silicon accelerators designed for LLM workloads. These often offer a lower overall cost per token than GPUs, but may have limitations in the models they support or require you to use specialized software infrastructure (often locking you into one provider).

Google TPUs

Google's Tensor Processing Units (TPUs) are custom accelerators designed specifically for machine learning workloads. TPUs are the most mature of the custom AI accelerators, having been available for public use since 2017. They are ideal when working within Google Cloud Platform, especially for models that are optimized with TensorFlow, JAX, or frameworks that leverage XLA compilation. TPUs are generally more cost-effective at large-scale inference compared to standard GPUs but require specific optimization to fully leverage their capabilities.

AWS Inferentia

AWS Inferentia chips are designed primarily for inference workloads and optimized to deliver low latency and cost efficiency for running pre-trained LLMs. Inferentia supports INT8 and FP16 data types, enhancing performance while reducing cost per inference when using quantized models. They are tightly integrated into AWS's ecosystem, making them ideal for users already embedded within AWS infrastructure, but they may require specific use of AWS's Neuron SDK for optimal performance. Inferentia is best suited for production inference workloads with stable models.

Cerebras/Groq

Cerebras and Groq offer specialized silicon accelerators with large, wafer-scale architectures aimed at massively parallel computation. These chips feature large amounts of on-chip memory and communication bandwidth, reducing bottlenecks typically encountered in traditional GPU-based systems. They are especially suited for interactive workloads that require low latency and high throughput, usually at the expense of cost per token. These accelerators are typically offered as a more managed service than general-purpose cloud GPU instances, with simpler setup but less flexibility.

On-demand vs. reserved instances

Most cloud providers offer several levels of instance pricing. The most common types are listed below, followed by a rough cost-comparison sketch:

  • On-demand: Pay-per-use pricing (often billed by the minute) with no upfront commitment. Instances can be launched and terminated at any time, making this the most flexible option but typically the most expensive per hour. Ideal for unpredictable workloads, development/testing, or when you need immediate access to GPU resources without planning ahead.
  • Reserved: With reserved instances you commit to using specific instance types for 1-3 years in exchange for discounts (typically 30-60% off on-demand pricing). Requires upfront payment or commitment but provides predictable costs and guaranteed capacity. Best for steady-state production workloads with consistent resource requirements.
  • Spot/Preemptible: Spot/preemptible instances allow you to buy unused capacity at a reduced price, but instances can be interrupted with short notice (typically 2-30 minutes) when capacity is needed elsewhere. Ideal for fault-tolerant batch processing, development environments, or workloads that can handle interruptions gracefully.
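As a rough illustration of how these pricing models compare, the sketch below computes an effective monthly cost under assumed rates. The hourly price, discount levels, and interruption overhead are illustrative assumptions only; substitute your provider's current numbers.

```python
# Back-of-the-envelope comparison of instance pricing models.
# All rates and percentages below are illustrative assumptions, not real quotes.
HOURS_PER_MONTH = 730

on_demand_rate = 40.00    # assumed $/hour for an 8-accelerator instance
reserved_discount = 0.45  # assumed discount for a multi-year commitment
spot_discount = 0.65      # assumed discount for interruptible capacity
spot_rework = 0.10        # assumed 10% of work lost and redone due to interruptions

on_demand = on_demand_rate * HOURS_PER_MONTH
reserved = on_demand * (1 - reserved_discount)
spot = on_demand * (1 - spot_discount) * (1 + spot_rework)

for name, cost in [("on-demand", on_demand), ("reserved", reserved), ("spot", spot)]:
    print(f"{name:>10}: ${cost:,.0f}/month")
```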

On-premises

Buying your own accelerators and deploying them on-premises provides you with maximum control over your infrastructure, and at scale is typically the most cost-effective option. However, this approach requires a sufficiently large and consistent workload to ensure high utilization of the hardware, as well as the capacity to manage the full hardware lifecycle, including maintenance and utility expenses.

Accelerator selection and sourcing

Selecting the right accelerators for your workload is an even more critical decision on-prem than in the cloud. Once hardware is purchased, making changes or upgrades can be difficult and costly.

Lifecycle management

Lifecycle management includes procurement, deployment, maintenance, and eventual disposal or recycling of hardware. Planning should encompass warranties and vendor support contracts, and include a clearly defined upgrade strategy. While volatility in used GPU pricing makes amortized cost difficult to estimate, typical upgrade cycles are 2-3 years, and resale prices of previous hardware generations can provide an estimate of residual value.

Regularly scheduled maintenance, preventive checks, and monitoring for potential hardware issues will help maximize uptime and extend hardware longevity. Consider partnering with service providers who specialize in managing accelerator infrastructure if internal expertise or bandwidth is limited.

Siting considerations

Effective siting of accelerators involves careful consideration of facility power availability, cooling capacity, airflow management, physical space constraints, and environmental factors. Power and cooling are the two most important factors, as accelerators are very power-hungry and require a lot of cooling.

Colocation (Colo)

Colocation facilities offer infrastructure managed by third-party providers, including power, cooling, security, and networking. This approach lets you focus on hardware management and workload optimization rather than physical infrastructure maintenance. Key factors for selecting a colo facility include location, service level agreements (SLAs), scalability options, network connectivity, and facility certifications. Note that some colocation facilities may lack the power or cooling capacity required for dense accelerator deployments.

Self-sited

Deploying accelerators in your own facilities provides full control but requires significant upfront investment and ongoing operational responsibilities. High-density accelerator deployments typically require enhanced cooling solutions such as liquid cooling or advanced airflow management techniques to maintain optimal operating temperatures. Location choices should factor in power availability and cost, ease of physical access for maintenance, noise considerations, regulatory compliance, and scalability for future expansions.

Deploying on accelerators

Numerical correctness

With a complex system like an LLM, it is important to validate the numerical correctness of your system prior to deployment. Numerical correctness involves ensuring that the outputs of your network exactly match your expectations. In some cases, subtle errors in implementation or deployment may result in degraded output that won't be obvious without running statistical tests. One challenge to validating this correctness is that the output of an LLM is probabilistic, and thus different runs may give different outputs.

Why LLMs may give different answers to the same prompt

Large language models typically sample from a distribution over likely next tokens, so they are probabilistic models and can give different outputs for the same input prompt. This randomness can be reduced through parameter choices such as setting the temperature to zero and fixing the random seed; however, non-deterministic results may still arise (a minimal reproducibility setup is sketched after the list below). There are a few reasons why this can occur:

  • Floating-point Precision Differences: GPU kernels may use mixed-precision (e.g., FP16, BF16) for performance gains. Minor changes in floating-point math can accumulate, causing drift in logits or token selection.
  • Determinism and Parallelism: Operations in CUDA or multi-GPU environments may be non-deterministic unless explicitly configured. Some systems may apply optimizations that sacrifice determinism for speed.
  • Library and Kernel Versions: Differences in versions of cuBLAS, cuDNN, or transformer libraries may lead to slightly different outputs. Libraries may require updates to correctly serve the latest architectures. Always try to use the latest software version possible, especially for new models.
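As referenced above, a minimal setup for reducing run-to-run variation might look like the sketch below, using Hugging Face transformers with greedy decoding, a fixed seed, and deterministic kernels where available. The model name is a placeholder, and even this configuration does not guarantee identical outputs across different hardware or library versions.

```python
# Minimal sketch: reducing (not eliminating) run-to-run variation with Hugging Face
# transformers. The model name is a placeholder; some kernels may remain non-deterministic.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

set_seed(1234)                                             # fix Python/NumPy/PyTorch seeds
torch.use_deterministic_algorithms(True, warn_only=True)   # prefer deterministic kernels

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # example model; substitute your own
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

inputs = tokenizer("Briefly explain KV caching.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)  # greedy decoding
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```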

Validating numerical correctness

Validating numerical correctness is a difficult challenge that even frontier labs can struggle with. Comparing your server's results with a reference model's output is the ideal solution. However, this is often impractical due to the inability to control for things like library versions, hardware differences, optimization choices, and implementation details.

A more practical approach is to run statistical tests on the output of your model. Many LLMs include published benchmark results that you can reproduce and compare to your own results. Small differences are expected, but large differences (e.g., 5% or more) often warrant a deeper look, especially when your results are consistently worse than the published numbers.
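A lightweight version of such a statistical check is sketched below: it compares a reproduced accuracy score against a published one and flags gaps larger than roughly two standard errors. The published score, sample count, and threshold are illustrative assumptions, not real benchmark numbers.

```python
# Sketch: sanity-check a reproduced benchmark score against a published number.
# The published score, sample count, and tolerance policy are illustrative assumptions.
import math

published_accuracy = 0.82  # score reported for the model on some benchmark (assumed)
measured_accuracy = 0.79   # score measured on your deployment
num_questions = 1000       # number of benchmark items you evaluated

# Standard error of an accuracy estimate over num_questions independent items
stderr = math.sqrt(measured_accuracy * (1 - measured_accuracy) / num_questions)
gap = published_accuracy - measured_accuracy

print(f"gap = {gap:.3f}, ~2 standard errors = {2 * stderr:.3f}")
if gap > 2 * stderr:
    print("Measured score is meaningfully below the published number; investigate the stack.")
else:
    print("Difference is within expected statistical noise.")
```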

Improving utilization

Maximizing the utilization of your existing accelerators is critical for building a cost-effective platform. Adding accelerators not only costs more money; in some cases, especially for on-prem deployments, new hardware can have procurement backlogs of several months.

  • Batching requests: Grouping multiple inference requests together and processing them in a single batch can significantly improve accelerator throughput, especially for models that are optimized for batch processing. This is particularly effective for workloads with high concurrency or when slight increases in latency are acceptable.
  • Utilize caching: Enabling KV-caching can significantly improve accelerator throughput, especially for workloads with similar prompt patterns. Structure your prompts so that as much of the initial prompt as possible stays unchanged across requests--for example, do not include a timestamp early in the prompt, or you will invalidate the cached prefix for every request (see the prompt-structure sketch after this list).
  • Job scheduling: For non-interactive workloads (e.g., data processing, report generation), schedule jobs during off-peak hours to maximize accelerator usage and potentially take advantage of lower-cost compute resources.
  • Multi-tenancy: If your organization supports multiple applications or teams, consider sharing accelerator resources across workloads. Multi-tenancy can help smooth out usage spikes and improve overall utilization.
  • Model optimization: Use quantization, pruning, or distillation techniques to reduce model size and inference time, allowing each accelerator to serve more requests per second. Quantization in particular is an easy way to reduce the cost of running a model.
  • Monitoring and alerting: Continuously monitor accelerator utilization metrics and set up alerts for underutilization or overutilization. This enables you to make timely adjustments to your infrastructure and avoid unnecessary costs.
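To illustrate the caching advice above, the sketch below keeps the static instructions at the front of every prompt and pushes volatile details such as timestamps to the end, so a server with prefix caching enabled can reuse the cached prefix across requests. The helper and product names are hypothetical; only the prompt structure matters.

```python
from datetime import datetime, timezone

# Static system prompt: identical across requests, so a prefix-caching server
# (for example, vLLM with prefix caching enabled) can reuse its KV cache.
SYSTEM_PROMPT = (
    "You are a support assistant for ExampleCo. "  # hypothetical product name
    "Answer concisely and cite the relevant help article when possible."
)

def build_prompt(user_question: str) -> str:
    """Hypothetical helper: volatile content goes last so the shared prefix stays cacheable."""
    timestamp = datetime.now(timezone.utc).isoformat()
    return (
        f"{SYSTEM_PROMPT}\n\n"
        f"Question: {user_question}\n"
        f"Current time (UTC): {timestamp}\n"  # placed at the end, not the beginning
        f"Answer:"
    )

print(build_prompt("How do I reset my password?"))
```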

Additional resources

  • Cost projection guide: For a more complete guide on estimating the costs of running LLMs, see our cost projection guide.
  • Autoscaling guide: Learn more about how to automatically scale your infrastructure to handle varying loads in the autoscaling guide.