Large language models (LLMs) require significantly more computational resources to run than traditional software workloads. While it is possible to deploy LLMs on CPUs, running them on non-specialized hardware is typically neither cost-effective nor low-latency. Instead, models are usually deployed on accelerators, such as GPUs, which are optimized for the compute-intensive operations that LLM inference requires. Using these expensive accelerators effectively is a critical part of running LLMs on self-managed infrastructure, and getting it wrong can drive significant unnecessary cost.
Note: The term "GPU" is often used interchangeably with "accelerator", but because GPUs are a specific type of accelerator, we will use the broader term "accelerator" in this guide to refer to any hardware that is used to accelerate LLM inference (for example, Google's TPU).
An important part of operating resources such as GPUs or other accelerators is managing their cost and utilization. For a more in-depth treatment, read the cost projection guide.
This guide will help you:

- Understand how much accelerator memory a given model requires
- Choose between cloud instances, custom silicon, and on-premises hardware
- Validate that a deployment produces correct output
- Maximize the utilization of the accelerators you already have
This guide focuses mainly on using accelerators for inference: deploying, operating, and maintaining an LLM that has already been trained. Training or fine-tuning your own models is outside the scope of this document.
While every LLM, proprietary or open-source, has its own idiosyncrasies, most operate in similar ways and thus follow similar deployment patterns. For concrete examples, this guide focuses on the open-source Llama models, but the principles apply to any LLM.
The amount of VRAM required for a model is a function of:

- The number of parameters in the model
- The numeric precision of the weights (for example, FP16 versus a quantized format)
- The KV cache, which grows with context length and the number of concurrent requests
- Activations and framework or system overhead
As a general rule of thumb, for FP16 models you need approximately 2 bytes per parameter for the model weights, plus additional memory for KV cache (roughly 150-300 MB per 1,000 tokens depending on model size), activations, and system overhead. For example, a 70B parameter model in FP16 would require roughly 140 GB just for weights, plus additional memory for inference operations.
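To make the arithmetic concrete, here is a rough sizing sketch based on these rules of thumb. The constants (2 bytes per FP16 parameter, roughly 150-300 MB of KV cache per 1,000 tokens, a notional overhead fraction) are approximations, not exact figures for any particular model or serving framework.

```python
def estimate_vram_gb(
    params_billions: float,
    bytes_per_param: float = 2.0,               # FP16/BF16 weights
    kv_cache_mb_per_1k_tokens: float = 300.0,   # upper end of the rough 150-300 MB range
    context_tokens: int = 8_000,
    concurrent_sequences: int = 8,
    overhead_fraction: float = 0.10,            # activations, CUDA graphs, framework overhead
) -> float:
    """Back-of-the-envelope VRAM estimate; real usage depends on the serving framework."""
    weights_gb = params_billions * 1e9 * bytes_per_param / 1e9
    kv_cache_gb = (
        kv_cache_mb_per_1k_tokens / 1000        # convert MB to GB per 1k tokens per sequence
        * (context_tokens / 1000)
        * concurrent_sequences
    )
    return (weights_gb + kv_cache_gb) * (1 + overhead_fraction)

# Example: a 70B-parameter model in FP16 needs ~140 GB for weights alone,
# plus KV cache that grows with context length and concurrency.
print(f"{estimate_vram_gb(70):.0f} GB")
```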
Many models (especially with the rise of mixture-of-experts architectures) are too large to fit on a single accelerator, even on the largest available cards. In these cases, it is possible to shard the model across multiple accelerators in the same machine, running part of the model on each and combining the results. Nearly all frameworks support this out of the box, and for some models it is the only way to run them. It is also possible to shard a model across multiple machines (multi-node sharding), which has the same effect but allows much larger scale; not all frameworks support multi-node deployments.
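As an illustrative sketch, many serving frameworks expose sharding as a single configuration knob. The example below uses vLLM, with a placeholder model name and a parallelism degree you would adjust for your own hardware.

```python
# Sketch: sharding a model that does not fit on one accelerator, using vLLM
# as an example framework. tensor_parallel_size splits each layer across the
# GPUs in one machine; multi-node deployments additionally split layers
# across nodes (pipeline parallelism) where the framework supports it.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder model
    tensor_parallel_size=8,                     # shard across 8 GPUs in this node
)

outputs = llm.generate(
    ["Summarize the benefits of tensor parallelism in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```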
Each cloud provider has its own instance types, with different capabilities and price points. Providers also adjust pricing and revamp or add instance types regularly. The tables below provide a reference overview of the largest cloud providers at the time of writing, but it is important to check each provider's latest pricing and instance types.
A100 (80 GB) instances:

| Provider | Instance Type | RAM | CPU | GPUs |
|---|---|---|---|---|
| AWS | p4de.24xlarge | 1,152 GiB | 96 vCPU | 8× A100 80 GB |
| Azure | ND96amsr A100 v4 | 1,900 GiB | 96 vCPU | 8× A100 80 GB |
| GCP | a2-ultragpu-8g | 1,360 GiB | 96 vCPU | 8× A100 80 GB |
H100 (80 GB) instances:

| Provider | Instance Type | RAM | CPU | GPUs |
|---|---|---|---|---|
| AWS | p5.48xlarge | 2,048 GiB | 192 vCPU | 8× H100 80 GB |
| Azure | ND96isr H100 v5 | 1,900 GiB | 96 vCPU | 8× H100 80 GB |
| GCP | a3-highgpu-8g | 1,872 GiB | 208 vCPU | 8× H100 80 GB |
H200 (141 GB) instances:

| Provider | Instance Type | RAM | CPU | GPUs |
|---|---|---|---|---|
| AWS | p5e.48xlarge | 2,048 GiB | 192 vCPU | 8× H200 141 GB |
| Azure | ND96isr H200 v5 | 1,900 GiB | 96 vCPU | 8× H200 141 GB |
| GCP | a3-ultragpu-8g | 2,952 GiB | 224 vCPU | 8× H200 141 GB |
Blackwell (B200 / GB200) instances:

| Provider | Instance Type | RAM | CPU | GPUs |
|---|---|---|---|---|
| AWS | p6-b200.48xlarge | 2,048 GiB | 192 vCPU | 8× B200 180 GB |
| Azure | ND128isr NDR GB200 v6 | 900 GiB | 128 vCPU (Grace) | 4× Blackwell (per VM) |
| GCP | a4-highgpu-8g | 3,968 GiB | 224 vCPU | 8× B200 180 GB |
Some cloud providers offer their own custom silicon accelerators designed for LLM workloads. These often offer a lower overall cost per token than GPUs, but may have limitations in the models they support or require you to use specialized software infrastructure (often locking you into one provider).
Google's Tensor Processing Units (TPUs) are custom accelerators designed specifically for machine learning workloads. TPUs are the most mature of the custom AI accelerators, having been available for public use since 2017. They are ideal when working within Google Cloud Platform, especially for models that are optimized with TensorFlow, JAX, or frameworks that leverage XLA compilation. TPUs are generally more cost-effective at large-scale inference compared to standard GPUs but require specific optimization to fully leverage their capabilities.
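As a minimal sketch of the XLA-centric workflow TPUs expect (assuming JAX is installed; on a machine without TPUs, JAX simply falls back to CPU or GPU devices):

```python
# Sketch: confirming which accelerators JAX can see and JIT-compiling a
# simple computation through XLA. On a Cloud TPU VM this lists TPU cores;
# elsewhere it falls back to whatever devices are available.
import jax
import jax.numpy as jnp

print(jax.devices())  # e.g. a list of TpuDevice entries on a TPU VM

@jax.jit  # XLA-compile the function for the available accelerator
def matmul(a, b):
    return jnp.dot(a, b)

a = jnp.ones((1024, 1024))
b = jnp.ones((1024, 1024))
print(matmul(a, b).shape)
```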
AWS Inferentia chips are designed primarily for inference workloads and optimized to deliver low latency and cost efficiency for running pre-trained LLMs. Inferentia supports INT8 and FP16 data types, enhancing performance while reducing cost per inference when using quantized models. They are tightly integrated into AWS's ecosystem, making them ideal for users already embedded within AWS infrastructure, but they may require specific use of AWS's Neuron SDK for optimal performance. Inferentia is best suited for production inference workloads with stable models.
Cerebras and Groq offer specialized silicon accelerators with large, wafer-scale architectures aimed at massively parallel computation. These chips feature large amounts of on-chip memory and communication bandwidth, reducing bottlenecks typically encountered in traditional GPU-based systems. They are especially suited for interactive workloads that require low latency and high throughput, usually at the expense of cost per token. These chips are typically available in a more managed fashion than other cloud providers, offering simpler setup but with less flexibility.
Most cloud providers offer several levels of instance pricing. The most common types are:

- On-demand: pay-as-you-go pricing with no commitment, at the highest unit price
- Reserved or committed-use: discounted pricing in exchange for a one- to three-year commitment
- Spot or preemptible: deeply discounted spare capacity that the provider can reclaim with little warning
Buying your own accelerators and deploying them on-premises provides you with maximum control over your infrastructure, and at scale is typically the most cost-effective option. However, this approach requires a sufficiently large and consistent workload to ensure high utilization of the hardware, as well as the capacity to manage the full hardware lifecycle, including maintenance and utility expenses.
Selecting the right accelerators for your workload is an even more critical decision on-prem than in the cloud. Once hardware is purchased, making changes or upgrades can be difficult and costly.
Lifecycle management includes procurement, deployment, maintenance, and eventual disposal or recycling of hardware. Planning should encompass warranties, vendor support contracts, and a clearly defined upgrade strategy. While volatility in used-GPU pricing makes amortized cost difficult to estimate, typical upgrade cycles are 2-3 years, and previous hardware generations can provide an estimate of residual value.
Regularly scheduled maintenance, preventive checks, and monitoring for potential hardware issues will help maximize uptime and extend hardware longevity. Consider partnering with service providers who specialize in managing accelerator infrastructure if internal expertise or bandwidth is limited.
Effective siting of accelerators involves careful consideration of facility power availability, cooling capacity, airflow management, physical space constraints, and environmental factors. Power and cooling are the two most important: accelerators draw large amounts of power and generate a correspondingly large amount of heat.
Colocation facilities offer infrastructure managed by third-party providers, including power, cooling, security, and networking. This approach lets you focus on hardware management and workload optimization rather than physical infrastructure. Key factors for selecting a colo facility include location, service level agreements (SLAs), scalability options, network connectivity, and facility certifications. Note that some colocation facilities lack the power or cooling capacity required for dense accelerator deployments.
Deploying accelerators in your own facilities provides full control but requires significant upfront investment and ongoing operational responsibilities. High-density accelerator deployments typically require enhanced cooling solutions such as liquid cooling or advanced airflow management techniques to maintain optimal operating temperatures. Location choices should factor in power availability and cost, ease of physical access for maintenance, noise considerations, regulatory compliance, and scalability for future expansions.
With a complex system like an LLM, it is important to validate the numerical correctness of your system prior to deployment: that is, to confirm that the outputs of your deployed model match what the model should produce. Subtle errors in implementation or deployment can degrade output in ways that are not obvious without statistical testing. One challenge in validating correctness is that the output of an LLM is probabilistic, so different runs may give different outputs.
Large language models typically sample from a set of likely next tokens, so they are probabilistic and can give different outputs for the same input prompt. This randomness can be reduced through parameter choices such as setting the temperature to zero and fixing the random seed (see the sketch below); however, non-deterministic results may still arise for a few reasons:

- Floating-point arithmetic is not associative, so changes in how GPU kernels parallelize and order their reductions can produce slightly different logits.
- With dynamic batching, a request's results can depend on which other requests it happens to be batched with.
- Differences in hardware, drivers, and library versions change which kernels run and in what order.
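As a sketch of what those parameter choices look like in practice, the example below uses vLLM's sampling options; the model name is a placeholder. Even with greedy decoding and a fixed seed, outputs can still differ across hardware and library versions for the reasons above.

```python
# Sketch: making sampling as deterministic as the framework allows, using
# vLLM's SamplingParams as an example. Greedy decoding (temperature=0) plus
# a fixed seed removes most, but not all, run-to-run variation.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model

params = SamplingParams(
    temperature=0.0,  # greedy decoding: always pick the most likely token
    seed=1234,        # fixed seed for any remaining sampling decisions
    max_tokens=128,
)

out = llm.generate(["What is the capital of France?"], params)
print(out[0].outputs[0].text)
```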
Validating numerical correctness is a difficult challenge that even frontier labs can struggle with. Comparing your server's results with a reference model's output is the ideal solution. However, this is often impractical due to the inability to control for things like library versions, hardware differences, optimization choices, and implementation details.
A more practical approach is to run statistical tests on the output of your model. Many LLMs have published benchmark results that you can reproduce and compare against your own. Small differences are expected, but large differences (e.g., 5% or more) warrant a deeper look, especially when your results are consistently worse than the published numbers.
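As a rough illustration of such a test, the sketch below uses a binomial standard error to judge whether the gap between your measured benchmark accuracy and a published figure exceeds what sampling noise alone would explain; the accuracies and question count are made up for the example.

```python
# Sketch: a rough check of whether your measured benchmark accuracy differs
# from a published reference by more than sampling noise would explain.
# The accuracies and question count below are illustrative, not real results.
import math

def accuracy_gap_is_suspicious(
    measured_acc: float,
    published_acc: float,
    num_questions: int,
    num_std_errs: float = 3.0,
) -> bool:
    """Flag gaps larger than a few binomial standard errors of the benchmark."""
    std_err = math.sqrt(published_acc * (1 - published_acc) / num_questions)
    return abs(measured_acc - published_acc) > num_std_errs * std_err

# Example: published 0.82 on a 1,000-question benchmark, we measured 0.74.
print(accuracy_gap_is_suspicious(0.74, 0.82, 1_000))  # True -> investigate
```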
Maximizing the utilization of your existing accelerators is critical for building a cost-effective platform. Adding accelerators not only costs more money; in some cases, and especially for on-prem deployments, new hardware can be backordered for months.
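A low-effort way to keep an eye on utilization, assuming NVIDIA GPUs with the nvidia-smi CLI available, is to poll it directly; production deployments would more typically export these metrics through DCGM or their serving framework's metrics endpoint.

```python
# Sketch: polling accelerator utilization with nvidia-smi (NVIDIA GPUs only).
# In production you would more likely scrape DCGM or framework metrics into
# your monitoring stack; this is just a quick way to spot idle hardware.
import subprocess
import time

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]

while True:
    result = subprocess.run(QUERY, capture_output=True, text=True)
    for line in result.stdout.splitlines():
        idx, util, mem_used, mem_total = [field.strip() for field in line.split(",")]
        print(f"GPU {idx}: {util}% busy, {mem_used}/{mem_total} MiB")
    time.sleep(10)
```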