Cost projection

Introduction

This guide provides a comprehensive cost projection methodology for LLMs, including hosted APIs, cloud deployments, and on-premises deployments. It is designed to help you understand the total cost of operating LLMs, including both the initial setup and ongoing operational costs. This guide is generally targeted towards software and IT professionals who are responsible for making decisions about LLM usage and deployment at medium to large scale; however, the lessons within this guide can be applied to any organization, regardless of size.

Purpose of the guide

This guide will help you:

  • Understand the total cost of operating LLMs
  • Compare API vs. self-hosting costs comprehensively
  • Identify often-overlooked cost drivers

Scope and assumptions

Focus on inference costs

This guide will mainly focus on the costs of inference: that is, the cost of deploying, operating, and maintaining an LLM that has already been trained. While we will touch on a few options for cost management, such as fine-tuning a custom LLM, the costs of running large-scale training jobs are out of scope.

Model agnostic

While all LLMs, both proprietary and open-source, have their own idiosyncrasies, most have similar methods of operation and thus have similar cost patterns.

Important concepts

This section will provide an overview of critical concepts in LLM inference that will influence cost optimization.

Input vs. output tokens

The computation required to operate an LLM can be separated into two categories: input tokens and output tokens. Input tokens are the tokens that are fed into the LLM, and output tokens are the tokens that are generated by the LLM. During generation, the LLM re-runs the forward pass for each new output token, with the previous token appended to the input. Because of this autoregressive nature, each additional output token incurs a computation cost proportional to everything processed so far (the input plus all previously generated output). Thus, output tokens are more expensive than input tokens. This is reflected in pay-as-you-go API prices, as well as in the costs required to host and run your own models.

To estimate costs, you will typically need to estimate the ratios of input, output, and cached tokens. These numbers are heavily dependent on the workload, and for the best estimates it is recommended that you compute them from a sample of your actual data. However, a commonly used rule of thumb is an input:output ratio of 3:1 or 4:1.
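
As a sketch of how such an estimate might be computed from a sample of real traffic, the snippet below counts tokens in logged prompt/response pairs. It assumes the Hugging Face transformers tokenizer and uses a Llama checkpoint name purely as an example; substitute the tokenizer that matches your model.

```python
# Sketch: estimate the input:output token ratio from a sample of logged traffic.
# Assumes `transformers` is installed; the checkpoint name is just an example.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# Each record is a (prompt, response) pair pulled from your own logs.
sample = [
    ("Summarize the following support ticket: ...", "The customer reports ..."),
    ("Write a SQL query that ...", "SELECT ..."),
]

input_tokens = sum(len(tokenizer.encode(prompt)) for prompt, _ in sample)
output_tokens = sum(len(tokenizer.encode(response)) for _, response in sample)

print(f"input:output ratio ~ {input_tokens / output_tokens:.1f}:1")
```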

KV-caching

Since each output token requires processing the entire previous sequence of tokens, during the generation of a full response there is a lot of repeated computation. This computation can be cached and reused, allowing the model to generate subsequent tokens more quickly than the first one since the only newly processed data will be the latest token seen. While it requires additional memory, this KV-caching can significantly improve the overall throughput of your model.

Additionally, it is possible to reuse these cached values for incoming queries. For requests that share similar prompts, caching these values can improve the cost effectiveness of inference by up to orders of magnitude. This is especially useful for applications with a high volume of similar requests, for example those that include the same tool definitions or system prompt in every call. Hosted APIs typically process cached input tokens at a reduced price compared to normal input tokens.
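
To illustrate the scale of the savings, the sketch below compares daily input-token spend with and without prefix caching for a workload that repeats the same tool definitions in every call. The request volume, token counts, and prices are illustrative assumptions, not real rates.

```python
# Sketch: daily input-token cost with and without prompt caching.
# All numbers below are illustrative assumptions, not published prices.
requests_per_day = 100_000
shared_prefix_tokens = 2_000      # tool definitions + system prompt reused every call
unique_tokens_per_request = 300   # the part of the prompt that actually changes

price_per_mtok_input = 0.30       # $ per 1M uncached input tokens (assumed)
price_per_mtok_cached = 0.03      # $ per 1M cached input tokens (assumed)

uncached_cost = requests_per_day * (shared_prefix_tokens + unique_tokens_per_request) \
    / 1e6 * price_per_mtok_input

cached_cost = requests_per_day * (
    shared_prefix_tokens / 1e6 * price_per_mtok_cached
    + unique_tokens_per_request / 1e6 * price_per_mtok_input
)

print(f"without caching: ${uncached_cost:,.2f}/day")
print(f"with caching:    ${cached_cost:,.2f}/day")
```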

Reasoning/thinking tokens

The latest generation of LLMs is able to reason and think about the input prompt, generating "thinking tokens" that are used to guide the generation of the final output. This is a powerful feature that can be used to improve the quality of the output, but it also incurs additional computation cost. In some cases, the model explicitly outputs these intermediate reasoning steps (e.g., a "chain-of-thought" process), which are billed as output tokens. In other cases, these steps are kept internal as the model calls tools or other functions. Even though these internal tokens are not returned to the user, they still consume computational resources and may be billed by API providers, significantly increasing the cost of inference even for short final outputs.
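
A quick back-of-the-envelope sketch of the effect, using made-up token counts and an assumed output-token price:

```python
# Sketch: how hidden reasoning tokens inflate per-request cost.
# Token counts and the price are illustrative assumptions.
price_per_mtok_output = 2.00    # $ per 1M output tokens (assumed)

visible_output_tokens = 200     # what the user actually sees
reasoning_tokens = 3_000        # internal "thinking" tokens, still billed as output

cost_without_reasoning = visible_output_tokens / 1e6 * price_per_mtok_output
cost_with_reasoning = (visible_output_tokens + reasoning_tokens) / 1e6 * price_per_mtok_output

print(f"{cost_with_reasoning / cost_without_reasoning:.0f}x more expensive per request")
```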

Vision models

Some models (including Llama 3.2 and Llama 4) are capable of processing images as input tokens. To process images, the models first tokenize the image by converting the pixel values into a series of tokens that the LLM can understand. Then, these tokens are processed alongside any text input provided to generate the output auto-regressively just like any other input tokens. There are two added costs from vision models. The first is the cost of tokenizing the image, which is incurred once per image and is typically a fraction of the overall processing. The second is the cost of the added image tokens, which typically scales with image resolution. The model's image encoder divides an image into patches to be converted into tokens; higher-resolution images require more patches, resulting in a higher token count and processing cost.
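
The sketch below shows the shape of that calculation under a hypothetical 16x16-pixel patch size; real image encoders differ in patch size, tiling behavior, and maximum resolution, so consult your model's documentation for actual token counts.

```python
# Sketch: rough image token count as a function of resolution.
# Assumes a hypothetical 16x16-pixel patch; actual patch sizes and tiling
# behavior vary by model, so check your model card before relying on this.
import math

def estimate_image_tokens(width_px: int, height_px: int, patch_px: int = 16) -> int:
    return math.ceil(width_px / patch_px) * math.ceil(height_px / patch_px)

print(estimate_image_tokens(512, 512))    # 1024 tokens
print(estimate_image_tokens(1024, 1024))  # 4096 tokens -> 4x the cost of the 512px image
```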

Latency considerations

Latency, or the time it takes for a model to return a response, is a critical factor in user experience and can have direct cost implications. While not a direct cost driver like tokens or GPUs, choosing a model or hosting solution based on latency requirements often involves a trade-off with price. There are two types of latency with LLMs that are visible to the user—time to first token latency and time to full response latency. Time to first token latency is the time it takes to generate the first token of the response and is influenced by the model size and the input length. Time to full response latency is the time it takes to generate the entire output sequence; it includes the time to first token as well as the time to generate all remaining output tokens, and is therefore additionally influenced by output length and KV-caching.

For most user-facing applications, time to first token latency is the most important metric since streaming tokens allow the user to start reading even while generation occurs. For applications that require large-scale batch inference, time to full response latency directly affects throughput and thus is likely more important.

Latency is usually tied to the cost of the model, with more expensive models having higher latency of both types. If latency is critical for your application, specialized providers such as Groq or Cerebras offer low-latency hosting of existing models at a slightly higher cost.
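
If you need to compare models or providers, both latency types are straightforward to measure with a streaming request. The sketch below assumes an OpenAI-compatible endpoint and the openai Python client; the base URL, API key, and model name are placeholders.

```python
# Sketch: measure time-to-first-token and full-response latency for a
# streaming request. Assumes an OpenAI-compatible endpoint; base_url,
# api_key, and the model name are placeholders.
import time
from openai import OpenAI

client = OpenAI(base_url="https://your-provider.example/v1", api_key="YOUR_KEY")

start = time.monotonic()
first_token_at = None

stream = client.chat.completions.create(
    model="your-model-name",
    messages=[{"role": "user", "content": "Explain KV-caching in two sentences."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.monotonic()

end = time.monotonic()
print(f"time to first token:   {first_token_at - start:.2f}s")
print(f"time to full response: {end - start:.2f}s")
```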

Cost drivers: hosted API

Pricing models

Typically, hosted APIs charge a per-token cost that differs between input and output tokens. Some providers also vary their token costs by context length, charging more per input token once the context length exceeds a threshold. Many hosted APIs also offer input-token caching, charging less for tokens that are already present in the cache. Depending on the provider, it may be possible to secure lower prices via committed use or large-scale discounts, but these deals usually require a custom contract.
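
A hedged sketch of how these pricing rules combine into a per-request cost; every rate and the context-length threshold below are placeholders, not any provider's published prices.

```python
# Sketch: per-request cost under a typical hosted-API pricing model with
# cached-input discounts and a long-context surcharge. All prices and the
# 128k threshold are placeholders, not any provider's actual rates.
def request_cost(input_tokens: int, cached_tokens: int, output_tokens: int) -> float:
    long_context = input_tokens > 128_000
    price_input = 0.60 if long_context else 0.30   # $ / 1M uncached input tokens
    price_cached = 0.06 if long_context else 0.03  # $ / 1M cached input tokens
    price_output = 1.20 if long_context else 0.60  # $ / 1M output tokens

    uncached = input_tokens - cached_tokens
    return (uncached * price_input + cached_tokens * price_cached
            + output_tokens * price_output) / 1e6

print(f"${request_cost(input_tokens=20_000, cached_tokens=15_000, output_tokens=1_000):.4f}")
```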

Rate limits and quotas

All hosted APIs provide a rate limiting mechanism to avoid accidental overuse or intentional abuse of their systems.

The API may limit the number of tokens it will process in a given time, and/or the number of requests a user can make. Often these limits are applied on a per-model basis and are proportional to the cost of serving the model—that is, more expensive models will have lower rate limits.

Using this system correctly and to your advantage can help you prevent two costly mistakes.

Prevent overspending

If a bug in your code results in your application accidentally using far more tokens than designed, you may unexpectedly receive a large bill. It is highly recommended that you do not remove quotas and limits on your application, even if it is possible to do so. Instead:

  • Set realistic rate and quota limits, modeled either on historical traffic or predicted usage. A good rule of thumb is to set the limits at 2-3x your expected max traffic.
  • Use exponential backoff during automatic retries to slow token usage during error states (see the sketch after this list).
  • Include appropriate internal error logging and alerting so that you receive notifications when your application detects errors.
  • Some services allow you to receive alerts from the platform when costs rise above a certain threshold, enabling you to take action before limits are exceeded and your application experiences downtime.
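
A minimal sketch of retries with exponential backoff and jitter, wrapping whatever function performs your API call (call_llm below is a placeholder):

```python
# Sketch: exponential backoff with jitter around an LLM API call.
# `call_llm` is a placeholder for whatever client call your application makes.
import random
import time

def call_with_backoff(call_llm, max_retries: int = 5, base_delay: float = 1.0):
    for attempt in range(max_retries):
        try:
            return call_llm()
        except Exception:  # narrow this to your client's rate-limit/transient errors
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)
```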

Prevent downtime

While rate limits can help protect you from accidentally spending too much due to misconfigurations or other errors, they can also prevent your service from running during times of legitimate high-volume activity. For example, a new product launch or a viral post that drives traffic to your product could suddenly trigger rate limiting that brings down your entire application, potentially resulting in lost revenue. To mitigate this risk:

  • Maintain comfortable rate limits: Ensure that your rate limits have comfortable headroom for spikes in traffic. Since increasing rate limits may require human intervention from the service provider, it is always best to have plenty of headroom in a production app.
  • Know your traffic patterns: Similar to other best practices for deployment, understanding and forecasting times of heavy traffic is critical. Your e-commerce application likely has predictable traffic spikes around major holidays, for example.
  • Fallback to another model: Most LLM-driven applications can operate using a different model and a different provider, especially chat applications built on common interfaces. Having an easy-to-set flag or, better yet, falling back automatically to a different model (either with the same provider or a different one) in the event of rate limiting allows your service to degrade gracefully; see the fallback sketch after this list.
  • Heuristic fallbacks: Some applications may be able to fall back to a non-LLM option such as a set of heuristics or a simple search. Offering this final level of fallback can protect not only against rate limits but also general internet infrastructure failures.
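
A minimal sketch of automatic fallback across backends when the primary is rate limited; the exception type and per-backend call functions are placeholders to wire up to your actual clients.

```python
# Sketch: graceful fallback across models/providers when the primary is rate limited.
# The exception type and the per-backend call functions are placeholders; connect
# them to the actual clients and error classes of your providers.
class RateLimitError(Exception):
    pass

def answer_with_fallback(prompt: str, backends) -> str:
    """`backends` is an ordered list of (name, callable) pairs, primary first."""
    for name, call in backends:
        try:
            return call(prompt)
        except RateLimitError:
            continue  # try the next backend; add logging/metrics here
    # Final heuristic fallback, e.g. a canned response or keyword search.
    return "We're experiencing high demand. Please try again shortly."
```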

Cloud costs

Depending on your hosting provider, you may incur additional cloud costs when using a hosted API. If you are working in an existing cloud, data egress and bandwidth costs may apply. For image or video processing, you may incur storage costs for media before it is processed by the LLM.

Cost drivers: self-hosting

Self-hosting refers to running your own LLM infrastructure, typically on GPUs that are either rented from a public cloud or owned by your organization. This is an attractive option for organizations with a large amount of data or compute resources, or those with strict security requirements.

GPU costs

Large language models almost always require specialized hardware accelerators to run cost effectively, typically a graphics processing unit (GPU), although there are other options available. For self-hosting, the cost of the GPUs is the most important factor to consider. The GPU requirements are influenced by the amount of inference you expect to run as well as the size and type of the model.

Cloud hosted

In a typical cloud-hosted setup, you rent GPU-equipped machines from a company such as Amazon or Google, paying by the hour, month, or year for full control of the machine and its resources. With this setup, you pay the same amount regardless of the utilization of the machine, so maximizing the throughput of each GPU is critical for cost effectiveness. Some cloud hosting providers offer specialized services atop their GPU infrastructure to permit time sharing, allowing users to pay for smaller increments of time such as minutes or even seconds of machine usage. These services are often more cost effective for spiky workloads, but may be less flexible about which models can be hosted.
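
The sketch below shows why utilization dominates the economics of a rented GPU: the hourly rate is fixed, so cost per token scales inversely with how busy the machine is. The hourly rate and throughput figures are illustrative assumptions, not quotes.

```python
# Sketch: effective cost per million tokens on a rented GPU machine.
# The hourly rate and throughput are illustrative assumptions, not quotes.
def cost_per_million_tokens(hourly_rate: float, tokens_per_second: float,
                            utilization: float) -> float:
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return hourly_rate / tokens_per_hour * 1e6

for utilization in (0.10, 0.50, 0.90):
    cost = cost_per_million_tokens(hourly_rate=4.00, tokens_per_second=1_500,
                                   utilization=utilization)
    print(f"{utilization:.0%} utilized -> ${cost:.2f} per 1M tokens")
```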

On-premises

In an on-premises (on-prem) setup, you buy, assemble, and host GPU-equipped machines yourself. This setup offers the maximum flexibility, data control, and capital efficiency for your company when well utilized and configured, but at the cost of increased complexity/overhead to manage the servers as well as high capital outlay required to buy the GPUs outright. The major cost drivers in this setup are:

  • The cost of accelerator hardware. GPUs are expensive and depreciate quickly relative to other server hardware; this can be accounted for as an upfront fixed cost or amortized as a monthly cost.
  • The cost of the datacenter itself, which can be either leased or purchased/built.
  • The cost of operating and maintaining the servers, which is typically viewed as a monthly cost and is primarily staffing.
  • The cost of electricity, which scales with the amount of inference performed.

On-premises hosting is usually the most cost effective for companies operating at large scale with predictable workloads. For certain critical data privacy requirements, on-prem hosting may be the only option.
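
Combining these drivers into a rough monthly figure might look like the sketch below; every number is an illustrative assumption rather than a quote for real hardware, power, or staffing.

```python
# Sketch: rough monthly cost of an on-prem GPU server, combining the drivers above.
# All figures are illustrative assumptions.
hardware_cost = 250_000          # 8-GPU server, purchased outright ($)
amortization_months = 36         # assumed useful lifespan
power_draw_kw = 10.0             # server plus cooling overhead
electricity_per_kwh = 0.12       # $ per kWh (assumed)
ops_staff_monthly = 4_000        # share of staffing/maintenance attributed to this server

monthly_hardware = hardware_cost / amortization_months
monthly_power = power_draw_kw * 24 * 30 * electricity_per_kwh

total_monthly = monthly_hardware + monthly_power + ops_staff_monthly
print(f"~${total_monthly:,.0f}/month before datacenter lease or build costs")
```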

Infrastructure costs

While GPU costs nearly always dominate for a self-hosted LLM application, there are still other costs to run your service. This includes other servers, storage, load balancers, networking equipment, and all of the other necessary hardware to run an internet application. While most of these costs are out of scope for this guide, there are a couple worth mentioning briefly.

  • Electricity and cooling costs: If you are running your own servers, the cost of electricity and cooling will be far greater than you may be used to from typical web applications. Modern GPUs operate at high temperatures and wattages, and often include exotic cooling systems like closed water cooling loops. Geographic location and power costs may be an important dimension when considering datacenter options.
  • GPU availability and locations: If you are using cloud services for your GPUs, it is worth considering the regions where those GPUs are physically located. Unlike standard compute and storage products, GPUs are a scarce resource, and entire cloud regions may stock out of certain types, forcing you to run your servers in a different region from the rest of your infrastructure. Depending on your setup and your cloud provider, this can incur heavy data egress costs.

Utilization and forecasting

The cost of running an LLM is heavily influenced by the utilization of the GPUs. A typical GPU remains competitive for serving frontier models for only a few years before becoming outdated, so its cost is effectively a fixed cost spread over that lifespan. Regardless of whether you are running your own GPUs or renting them from a cloud provider, you will need to estimate the utilization of your GPUs to project costs.

Projecting utilization

Projecting the utilization of your GPUs is a complex problem that requires a deep understanding of your application and its traffic patterns. However, there are some common patterns that can be used to estimate utilization.

  • Peak vs. average load: Most applications experience variable demand throughout the day, week, or year. It is important to distinguish between peak utilization (the highest sustained load you expect) and average utilization (the typical load over time). Sizing your infrastructure for peak load ensures reliability but may result in underutilized resources during off-peak times.
  • Workload predictability: Applications with predictable, steady workloads (e.g., batch processing, scheduled jobs) are easier to forecast and can achieve higher utilization. In contrast, interactive or user-driven applications (e.g., chatbots, search) may have spiky, unpredictable demand, making it harder to keep GPUs fully utilized.
  • Historical data analysis: If you have an existing application, analyze historical usage data to identify trends, spikes, and idle periods. This data can inform your projections and help you right-size your infrastructure.
  • Buffer for growth and failover: Always include a buffer in your projections to accommodate unexpected growth, failover scenarios, or maintenance windows. A common practice is to provision 20-30% extra capacity above your expected peak, though you may overprovision more based on hardware availability and expected peaks; the sizing sketch after this list shows one way to apply such a buffer.
  • Simulation and load testing: For new applications, simulate expected traffic patterns using load testing tools. This can help you estimate the number of concurrent requests your GPUs can handle and identify bottlenecks.
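
These inputs can be combined into a first-pass fleet-size estimate, as sketched below; the peak demand and per-GPU throughput values are assumptions to be replaced with your own load-test results.

```python
# Sketch: first-pass GPU fleet sizing from projected peak demand.
# Peak demand and per-GPU throughput are assumptions; replace with load-test data.
import math

peak_tokens_per_second = 40_000     # projected peak demand across all requests
tokens_per_second_per_gpu = 1_500   # measured per-GPU throughput for your model/server
headroom = 0.25                     # 25% buffer for growth, failover, maintenance

gpus_needed = math.ceil(peak_tokens_per_second / tokens_per_second_per_gpu * (1 + headroom))
print(f"provision ~{gpus_needed} GPUs")
```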

Scaling strategies

Scaling your infrastructure effectively requires balancing cost, performance, and reliability as your workload grows. Start by choosing a GPU configuration that fits your model as efficiently as possible on a single machine. This can be challenging for models that require odd-sized GPU slices, such as needing three GPUs on a cloud that only offers servers with one or four accelerator cards.

As demand increases further, horizontal scaling will become necessary, distributing workloads across multiple machines or nodes to handle higher concurrency and provide redundancy. Automation tools and orchestration platforms, such as Kubernetes, can help manage scaling dynamically based on real-time demand, ensuring resources are allocated efficiently. Ultimately, a well-designed scaling strategy should minimize idle resources while maintaining the flexibility to respond quickly to traffic spikes. For more detailed guidance and tips, read the autoscaling guide.

Hidden cost drivers

While big-ticket costs such as per-GPU costs and GPU utilization are the most important factors to consider when projecting costs, there are also a few hidden cost drivers that are often overlooked. These costs apply to both self-hosted and hosted deployments, although they may be higher for self-hosted use cases.

Security and compliance

Ensuring the security and compliance of your LLM deployment can introduce significant, sometimes unexpected, costs. When selecting a cloud vendor, consider evaluating their ability to meet these requirements as a dimension of your cost analysis. These costs may include:

  • Data privacy and encryption: Protecting sensitive data in transit and at rest often requires implementing encryption protocols, secure storage solutions, and access controls. Some jurisdictions may have data residency requirements, and not all API or cloud providers can support these natively.
  • Audit logging: Maintaining detailed logs of data access and model usage is essential for compliance with regulations such as HIPAA, GDPR, or SOC 2. Storing, managing, and reviewing these logs can require additional storage, specialized logging infrastructure, and periodic audits.

Monitoring and observability

Effective monitoring and observability are critical for maintaining reliable and cost effective LLM operations. This is especially true in cloud environments, where massive capacity is available on demand and an accidental burst of large API or provisioning requests has a very high cost ceiling.

  • Logging and tracing: Capturing detailed logs and traces of inference requests, system performance, and errors is essential for troubleshooting and optimizing usage. This may require deploying log aggregation tools (e.g., ELK stack, Datadog, or CloudWatch) and allocating storage for large volumes of log data.
  • Cost attribution: Accurately attributing costs to specific teams, projects, or users can require additional instrumentation and reporting tools. This is especially important in multi-tenant environments or when running multiple models, as it helps identify cost centers and optimize resource allocation. Some teams may not be aware of cost-reduction best practices or of how expensive a misconfiguration can be; see the attribution sketch after this list.
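
A minimal sketch of cost attribution from usage logs, assuming each logged request carries a team tag and token counts; the log schema and prices are assumptions to adapt to your environment.

```python
# Sketch: attribute spend to teams from per-request usage logs.
# Record fields and prices are assumptions; adapt to your logging schema.
from collections import defaultdict

PRICE_INPUT = 0.30 / 1e6    # $ per input token (assumed)
PRICE_OUTPUT = 0.60 / 1e6   # $ per output token (assumed)

usage_log = [
    {"team": "search", "input_tokens": 1_200, "output_tokens": 150},
    {"team": "support-bot", "input_tokens": 4_000, "output_tokens": 600},
    {"team": "search", "input_tokens": 900, "output_tokens": 120},
]

spend = defaultdict(float)
for record in usage_log:
    spend[record["team"]] += (record["input_tokens"] * PRICE_INPUT
                              + record["output_tokens"] * PRICE_OUTPUT)

for team, dollars in sorted(spend.items(), key=lambda kv: -kv[1]):
    print(f"{team}: ${dollars:.4f}")
```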

Model updates and versioning

Keeping your LLMs up to date and managing multiple versions can introduce hidden costs that are easy to overlook:

  • Deciding when to switch: While new models are typically better than previous versions in aggregate, there may be localized regressions on specific prompts that hurt your application's overall performance. Investing in a rigorous evaluation procedure for each model release can prevent unexpected regressions or downtime.
  • Model versioning and migrations: Updating to new model versions or migrating between providers may require significant engineering effort to ensure compatibility. You may need to run multiple models at once during migration and provide rollback and recovery mechanisms in case of unexpected issues.
  • Consistency across environments: Ensuring that models behave consistently across development, staging, and production environments may require additional testing, validation, and infrastructure (e.g., model registries, CI/CD pipelines for models).

These hidden cost drivers can have a substantial impact on the total cost of ownership for LLM deployments, especially as systems scale or regulatory requirements become more stringent. Factoring them into your cost projections will help avoid surprises and ensure a more accurate understanding of your long-term operational expenses.

Use case examples

Let's look at a few common LLM use cases and how their token ratios and cost structures can be estimated.

Chatbot

Chatbots typically have a high ratio of input tokens to output tokens, since user prompts are often longer and responses are concise. For example, a user might send a 50-token message and receive a 30-token reply. In a multi-turn conversation, the input context grows as previous messages are included, increasing input token usage over time. KV-caching can help reduce repeated computation for prior context and reduce time to full response latency.

  • Typical input/output ratio: 2:1 to 4:1 (input:output)
  • Cost drivers: Input token volume (especially in long conversations), output token cost, and context window size.

Summarization

Summarization tasks usually involve a large input (the text to be summarized) and a much shorter output (the summary). For example, summarizing a 1,000-token article into a 100-token summary. Large documents may require models with larger supported context windows, which can increase the cost of the input tokens or the GPU requirements for hosting.

  • Typical input/output ratio: 10:1 or higher
  • Cost drivers: Large input token volume, relatively small output, potential for caching if summarizing similar documents.

Code generation

Code generation often involves a prompt (a problem description or code context) and a generated code snippet. In most modern coding platforms, the system includes many additional files from the project as context whenever there is existing work, so token ratios are heavily skewed toward input tokens.

  • Typical input/output ratio: 100:1
  • Cost drivers: Input token cost (especially for existing codebases), effectively using input caching, amount of agentic tool calling and context engineering used by the tool.

By understanding the token patterns and cost drivers for your specific use case, you can make more accurate cost projections and optimize your LLM deployment accordingly.
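
Putting those ratios to work, the sketch below projects monthly API spend for the three example workloads; the request volumes, output lengths, ratios chosen within the stated ranges, and prices are all assumptions to replace with your own estimates.

```python
# Sketch: monthly cost projection from the input:output ratios above.
# Volumes, token counts, and prices are illustrative assumptions.
PRICE_INPUT_PER_MTOK = 0.30
PRICE_OUTPUT_PER_MTOK = 0.60

use_cases = {
    # name: (requests per month, avg output tokens, input:output ratio)
    "chatbot":         (2_000_000, 150, 3),
    "summarization":   (300_000, 100, 10),
    "code generation": (500_000, 400, 100),
}

for name, (requests, output_tokens, ratio) in use_cases.items():
    input_tokens = output_tokens * ratio
    monthly = requests * (input_tokens / 1e6 * PRICE_INPUT_PER_MTOK
                          + output_tokens / 1e6 * PRICE_OUTPUT_PER_MTOK)
    print(f"{name}: ~${monthly:,.0f}/month")
```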

Additional resources

  • Accelerator management guide: For a more complete guide on managing your own or rented GPUs, see the accelerator management guide.
  • Autoscaling guide: Learn more about how to automatically scale your infrastructure to handle varying loads in the autoscaling guide.