Meta Llama in the Cloud
This tutorial is part of our Build with Meta Llama series, where we demonstrate the capabilities and practical applications of Llama for developers like you, so that you can leverage what Llama has to offer in your own applications. It accompanies the video Many other ways to run Llama and resources | Build with Meta Llama, which covers some of the other ways you can host or run Meta Llama models and provides resources to help you get started.
If you're interested in learning by watching or listening, check out our video on Many other ways to run Llama and resources.
Apart from running the models locally, one of the most common ways to run Meta Llama models is in the cloud. We saw an example of this using Hugging Face in our running Llama on Windows video. Let's take a look at some of the other services we can use to host and run Llama models, such as AWS, Azure, Google Cloud's Vertex AI, and Kaggle, among others.
Amazon Web Services
Amazon Web Services (AWS) provides multiple ways to host your Llama models, such as SageMaker JumpStart and Bedrock.
Bedrock is a fully managed service that lets you quickly and easily build generative AI-powered experiences. To use Meta Llama with Bedrock, check out their website that goes over how to integrate and use Meta Llama models in your applications.
You can also use AWS through SageMaker JumpStart, which lets you build, train, and deploy ML models from a broad selection of publicly available foundation models on SageMaker instances for model training and inference. Learn more about how to use Meta Llama on SageMaker on their website.
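As a sketch of what a Bedrock call can look like, the snippet below builds the native request body for Bedrock's InvokeModel API and shows the boto3 call in comments (it requires AWS credentials). The model ID and parameter names are illustrative assumptions; check the Bedrock model catalog for the IDs available in your region.

```python
import json

# Assumed model ID for illustration; verify against the Bedrock catalog.
MODEL_ID = "meta.llama3-8b-instruct-v1:0"

def build_bedrock_body(prompt: str, max_gen_len: int = 256, temperature: float = 0.5) -> str:
    """Serialize a native request body for Bedrock's InvokeModel API (assumed field names)."""
    return json.dumps({
        "prompt": prompt,
        "max_gen_len": max_gen_len,
        "temperature": temperature,
    })

body = build_bedrock_body("Explain what a foundation model is in one sentence.")

# Sending the request requires boto3 and AWS credentials:
# import boto3
# client = boto3.client("bedrock-runtime", region_name="us-east-1")
# response = client.invoke_model(modelId=MODEL_ID, body=body)
# print(json.loads(response["body"].read())["generation"])
```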
Microsoft Azure
Another way to run Meta Llama models is on Microsoft Azure. You can access Meta Llama models on Azure in two ways:
- Models as a Service (MaaS) provides access to Meta Llama hosted APIs through Azure AI Studio.
- Model as a Platform (MaaP) provides access to the Meta Llama family of models, with out-of-the-box support for fine-tuning and evaluation, through Azure Machine Learning Studio.
Please refer to our How to Guide for more details.
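For a rough idea of what calling a MaaS deployment looks like, the sketch below assembles a chat-completions request against the deployment's REST endpoint. The endpoint URL shape, header name, and payload fields are assumptions modeled on OpenAI-style chat APIs; copy the real endpoint and key from your deployment's page in Azure AI Studio.

```python
import json
import urllib.request

# Placeholders: copy your deployment's endpoint and key from Azure AI Studio.
ENDPOINT = "https://<your-deployment>.inference.ai.azure.com/v1/chat/completions"
API_KEY = "<your-api-key>"

payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Llama 3?"},
    ],
    "max_tokens": 128,
}

request = urllib.request.Request(
    ENDPOINT,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json", "api-key": API_KEY},
)

# With a real endpoint and key, send it like this:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["choices"][0]["message"]["content"])
```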
Google Cloud Platform
Google Cloud Platform (GCP) is a suite of cloud computing services offered by Google. Building on top of GCP services, Model Garden on Vertex AI offers infrastructure to jumpstart your ML project, providing a single place to discover, customize, and deploy a wide range of models. You can also use GCP to run Meta Llama models on your own managed infrastructure.
We have collaborated with Vertex AI from Google Cloud to offer Meta Llama models through easy-to-use interfaces. You can choose to use fully managed Llama APIs, or fine-tune and self-deploy Llama models.
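As a sketch of the managed-API route, the snippet below assembles a request for a Llama model served through Vertex AI. The URL shape and model name are assumptions; confirm both in the Model Garden documentation for your project and region.

```python
# Placeholders for your GCP project and region.
PROJECT = "my-gcp-project"
REGION = "us-central1"

# Assumed OpenAI-compatible chat endpoint exposed by Vertex AI; verify the
# exact path and API version in the Model Garden docs.
endpoint = (
    f"https://{REGION}-aiplatform.googleapis.com/v1beta1/"
    f"projects/{PROJECT}/locations/{REGION}/endpoints/openapi/chat/completions"
)

payload = {
    "model": "meta/llama3-8b-instruct-maas",  # hypothetical model name
    "messages": [{"role": "user", "content": "Say hello."}],
}

# Authenticate with a Google Cloud access token (e.g. from the google-auth
# library or `gcloud auth print-access-token`) and POST `payload` to `endpoint`.
```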
NVIDIA NIM
NVIDIA NIM inference microservice streamlines the deployment of Meta Llama models anywhere, including cloud, data center, and workstations. Instructions to download and run the NVIDIA-optimized models on your local and cloud environments are provided under the Docker tab on each model page in the NVIDIA API catalog, which includes Llama 3 70B Instruct and Llama 3 8B Instruct.
Additionally, you can deploy the Meta Llama models directly from Hugging Face on top of cloud platforms with just a few clicks.
You can also try the performance-optimized NVIDIA NIM, which uses industry standard APIs, for Llama 3 models from ai.nvidia.com.
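The NIM endpoints in the NVIDIA API catalog follow the familiar chat-completions schema, so querying a hosted Llama 3 model is a single authenticated POST. The URL and model name below reflect the catalog at the time of writing but may change; the API key placeholder comes from ai.nvidia.com.

```python
import json
import urllib.request

URL = "https://integrate.api.nvidia.com/v1/chat/completions"

payload = {
    "model": "meta/llama3-8b-instruct",
    "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
    "max_tokens": 64,
}

request = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer <NVIDIA_API_KEY>",  # placeholder key from ai.nvidia.com
    },
)

# With a real key:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["choices"][0]["message"]["content"])
```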
Databricks Mosaic AI
Databricks Mosaic AI has partnered with Meta to support Llama models across the full suite of AI products.
- Llama models are available in the Databricks Foundation Model API, which enables easy experimentation and a simple path to production with enterprise-grade security and scalability. See Get started querying LLMs on Databricks.
- Customization support is available for models through Mosaic AI Model Training. See the getting started guide to start customizing with data from Unity Catalog.
- Databricks supports the full GenAI app development cycle, including through AI functions for large-scale batch processing and AI Agents Framework and Evaluation for building production-ready agentic and RAG apps.
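As a sketch of querying a Llama model through the Foundation Model API, the snippet below builds the request against a serving endpoint. The workspace host and endpoint name are placeholders, and Databricks also exposes these endpoints through an OpenAI-compatible client; check your workspace's Serving page for the actual names.

```python
# Placeholders: your workspace host and the serving endpoint's registered name.
WORKSPACE = "https://<your-workspace>.cloud.databricks.com"
ENDPOINT = "databricks-meta-llama-3-70b-instruct"  # assumed endpoint name

url = f"{WORKSPACE}/serving-endpoints/{ENDPOINT}/invocations"
payload = {"messages": [{"role": "user", "content": "Summarize what Unity Catalog is."}]}

# POST `payload` to `url` with an `Authorization: Bearer <DATABRICKS_TOKEN>` header,
# e.g. via urllib, or point the `openai` client at f"{WORKSPACE}/serving-endpoints".
```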
Snowflake Cortex AI
Snowflake Cortex AI is a suite of integrated features and services that provides fully-managed LLM inference, fine-tuning, and RAG for both structured and unstructured data analysis. The platform enables quick integration of industry-leading models, both open source and proprietary, through LLM functions or REST APIs, while maintaining enterprise-grade security and governance, all within Snowflake’s secure perimeter.
For AI engineers, Cortex AI offers instant access to Meta's collection of LLMs with serverless inference and fine-tuning capabilities. Choose from various model sizes and language support, or run custom deployments via Snowflake Container Services. Snowflake is also innovating on Meta's Llama models through initiatives from its AI research team, such as SwiftKV, which reduces inference costs by up to 75% while maintaining model accuracy through rewiring and fine-tuning, enabling customers to build more cost-effective, high-performing AI solutions on Snowflake Cortex AI.
Data engineers can run LLMs directly inside Snowflake without data movement, using existing role-based access controls to secure both models and data. This native integration enables seamless analysis of unstructured data alongside structured data, making it simple to build comprehensive AI applications or easily apply custom or out-of-the-box task functions powered by Llama while maintaining consistent governance standards.
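Because Cortex AI surfaces LLMs as SQL functions, a Llama call is just a query. The sketch below builds such a query from Python; the function name (SNOWFLAKE.CORTEX.COMPLETE) follows Snowflake's documentation, but treat the model name as an assumption to verify against the models available in your account.

```python
def cortex_complete_sql(model: str, prompt: str) -> str:
    """Build a SNOWFLAKE.CORTEX.COMPLETE query; the prompt is escaped as a SQL string literal."""
    escaped = prompt.replace("'", "''")
    return f"SELECT SNOWFLAKE.CORTEX.COMPLETE('{model}', '{escaped}') AS response"

sql = cortex_complete_sql("llama3-8b", "Classify this ticket's sentiment: 'great service'")

# Run it with the Snowflake Python connector (requires credentials):
# import snowflake.connector
# conn = snowflake.connector.connect(...)
# print(conn.cursor().execute(sql).fetchone()[0])
```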
IBM watsonx
You can also use IBM's watsonx to run Meta Llama models. IBM watsonx is an advanced platform designed for AI builders, integrating generative AI capabilities, foundation models, and traditional machine learning. It provides a comprehensive suite of tools that span the AI lifecycle, enabling users to tune models with their enterprise data. The platform supports multi-model flexibility, client protection, AI governance, and hybrid, multi-cloud deployments. It offers features for extracting insights, discovering trends, generating synthetic tabular data, running Jupyter notebooks, and creating new content and code. watsonx.ai equips data scientists with the tools, pipelines, and runtimes needed to build and deploy ML models, automating the entire AI model lifecycle.
We've worked with IBM to make Llama and Code Llama models available on their platform. To evaluate Llama on watsonx, you can create a free account and test the available models through the Prompt Lab. For detailed instructions, refer to the getting started guide and the quick start tutorials.
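For programmatic access, a text-generation call with IBM's watsonx.ai Python SDK looks roughly like the sketch below; the model ID, parameter names, and credential fields are assumptions to verify against IBM's documentation.

```python
# Generation parameters (names assumed per IBM's docs; verify before use).
params = {"decoding_method": "greedy", "max_new_tokens": 200}
model_id = "meta-llama/llama-3-8b-instruct"  # illustrative watsonx model ID

# Running this requires the ibm-watsonx-ai package and an IBM Cloud API key:
# from ibm_watsonx_ai.foundation_models import ModelInference
# model = ModelInference(model_id=model_id, params=params,
#                        credentials={"url": "<region-url>", "apikey": "<api-key>"},
#                        project_id="<project-id>")
# print(model.generate_text("What is watsonx?"))
```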
Other hosting providers
You can also run Llama models using hosting providers such as Together AI, Anyscale, Replicate, Groq, Fireworks AI, and Cloudflare. Our team has put together step-by-step examples showing how to run Llama on externally hosted providers; you can find them in our llama-cookbook GitHub repo, which goes over setting up and running inference for Llama models on some of these providers.
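A practical consequence of these providers exposing OpenAI-compatible endpoints is that the same client code works across many of them by swapping the base URL and model name. The base URLs below are illustrative assumptions; check each provider's documentation for current values.

```python
# Illustrative base URLs for a few providers with OpenAI-compatible APIs.
PROVIDERS = {
    "together": "https://api.together.xyz/v1",
    "groq": "https://api.groq.com/openai/v1",
    "fireworks": "https://api.fireworks.ai/inference/v1",
}

def chat_completions_url(provider: str) -> str:
    """Return the chat-completions URL for a known provider."""
    return f"{PROVIDERS[provider]}/chat/completions"

# e.g. with the `openai` client (model names vary per provider):
# from openai import OpenAI
# client = OpenAI(base_url=PROVIDERS["groq"], api_key="<key>")
# client.chat.completions.create(model="<llama-3-model-name>", messages=[...])
```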
Running Llama on premise
Many enterprise customers prefer to deploy Llama models on-premise, on their own servers. One way to do this is with TorchServe, an easy-to-use tool for deploying PyTorch models at scale. It is cloud and environment agnostic and supports features such as multi-model serving, logging, metrics, and the creation of RESTful endpoints for application integration. To learn more about how TorchServe works, including setup, quickstart, and examples, check out the GitHub repo.
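Once a model is archived with torch-model-archiver and served (e.g. `torchserve --start --model-store model_store --models llama=llama.mar`), TorchServe exposes a REST inference endpoint, by default on port 8080. The sketch below shows that default URL shape; the model name is a placeholder for whatever you registered.

```python
def prediction_url(host: str, model_name: str, port: int = 8080) -> str:
    """TorchServe's default inference endpoint for a registered model."""
    return f"http://{host}:{port}/predictions/{model_name}"

url = prediction_url("localhost", "llama")  # "llama" is a placeholder model name

# With the server running:
# import urllib.request
# req = urllib.request.Request(url, data=b"What is TorchServe?")
# print(urllib.request.urlopen(req).read().decode())
```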
Another way to deploy Llama models on premise is with vLLM or Text Generation Inference (TGI), two leading open-source tools for deploying and serving LLMs. A detailed step-by-step tutorial on our llama-cookbook GitHub repo shows how to use Llama models with vLLM and Hugging Face TGI, and how to connect vLLM- and TGI-hosted Llama instances to LangChain, a language model integration framework for building applications with large language models.
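To make the two serving styles concrete: vLLM offers an offline Python API (shown in comments below, since it needs a GPU and the `vllm` package), while TGI runs as a server queried over HTTP. The TGI payload shape follows its `/generate` route; treat the model name and parameter values as illustrative.

```python
import json

# vLLM's offline inference API, per its quickstart (requires a GPU):
# from vllm import LLM, SamplingParams
# sampling = SamplingParams(temperature=0.7, max_tokens=128)
# llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
# for output in llm.generate(["What is speculative decoding?"], sampling):
#     print(output.outputs[0].text)

# A TGI server (e.g. launched via its Docker image) accepts requests like this
# at POST /generate:
tgi_payload = json.dumps({
    "inputs": "What is speculative decoding?",
    "parameters": {"max_new_tokens": 128, "temperature": 0.7},
})
```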
Resources
You can find various demos and examples to guide you, and to use as references when getting started with Llama models, in our llama-cookbook GitHub repo, where you'll find several examples for inference and fine-tuning, as well as for running on various API providers.
Learn more about Llama 3 and how to get started by checking out our Getting to know Llama notebook in our llama-cookbook GitHub repo. There you'll find a guided tour of Llama 3, including a comparison with Llama 2, descriptions of the different Llama 3 models, how and where to access them, generative AI and chatbot architectures, prompt engineering, RAG (Retrieval Augmented Generation), fine-tuning, and more, all implemented with starter code that you can adapt for your own Meta Llama 3 projects.
To learn more about our Llama 3 models, check out our announcement blog where you can find details about how the models work, data on performance and benchmarks, information about trust and safety, and various other resources to get you started.
Get the model source from our Llama 3 GitHub repo, where you can learn how the models work, see a minimal example of loading Llama 3 models and running inference, and find steps to download and set up the models, along with examples for running the text completion and chat models.
Dive deeper and learn more about the model in the model card, which goes over the model architecture, intended use, hardware and software requirements, training data, results, and licenses.
Check out our new Meta AI, built with Llama 3 technology, now one of the world's leading AI assistants. You can use Meta AI on Facebook, Instagram, WhatsApp, Messenger, and the web to get things done, learn, create content, and connect with the things that matter to you.
To keep up with the latest updates and releases of Llama models, check out our website, where you can find the newest models along with resources on how they work and how you can use them in your own applications.
Check out our Getting Started guide that provides information and resources to help you set up Llama including how to access the models, prompt formats, hosting, how-to and integration guides, as well as resources that you can reference to get started with your projects.
Take a look at some of our latest blogs that discuss new announcements, the latest on the Llama ecosystem, and our responsible approach to Meta AI and Meta Llama 3.
Check out the community resources on our website to help you get started with Meta Llama models, learn about performance & latency, fine tuning, and more.
Dive deeper into prompt engineering, learning best practices for prompting Meta Llama models and interacting with Meta Llama Chat, Code Llama, and Llama Guard models in our short course on Prompt Engineering with Llama 2 on DeepLearning.AI, recently updated to showcase both Llama 2 and Llama 3 models.
Check out our Community Stories that go over interesting use cases of Llama models in various fields such as in Business, Healthcare, Gaming, Pharmaceutical, and more!
Learn more about the Llama ecosystem, building product experiences with Llama, and examples that showcase how industry pioneers have adopted Llama to build and grow innovative products for users across their platforms at Connect 2023.
Also check out our Responsible Use Guide that provides developers with recommended best practices and considerations for safely building products powered by LLMs.
We hope the Build with Meta Llama videos and tutorials have given you the insights and resources you need to get started with Llama models.
We at Meta strongly believe in an open approach to AI development, democratizing access through an open platform and providing you with AI models, tools, and resources to give you the power to shape the next wave of innovation. We want to kickstart that next wave of innovation across the stack—from applications to developer tools to evals to inference optimizations and more. We can’t wait to see what you build and look forward to your feedback.