Running Meta Llama on Linux

This tutorial is part of our Build with Meta Llama series, where we demonstrate the capabilities and practical applications of Llama so that developers like you can leverage what Llama has to offer and incorporate it into your own applications. It accompanies the video Running Llama on Linux | Build with Meta Llama, and walks step by step through getting the weights and running the model locally on Linux.

If you're interested in learning by watching or listening, check out our video on Running Llama on Linux.

Introduction to Llama models

At Meta, we strongly believe in an open approach to AI development, particularly in the fast-evolving domain of generative AI. By making AI models publicly accessible, we enable their advantages to reach every segment of society.


Last year we open sourced Meta Llama 2, and this year we released the Meta Llama 3 family of models, available in 8B and 70B pretrained and instruction-tuned versions to support a wide range of applications. Making these large language models accessible to everyone lets you experiment, innovate, and scale your ideas responsibly.

Meta Llama 3 Pre-trained model performance

Running Meta Llama on Linux

Setup

On a Linux machine with a GPU that has at least 16GB of VRAM, you should be able to load the 8B Llama models in fp16 locally. If you have an NVIDIA GPU, you can confirm your setup with the NVIDIA System Management Interface tool, which shows the GPU you have, the VRAM available, and other useful information, by typing:

nvidia-smi

In our current setup we are on Pop!_OS, an Ubuntu-based distribution, with an NVIDIA RTX 4090 that has about 24GB of VRAM.

Terminal with nvidia-smi showing NVIDIA GPU Configuration

Getting the weights

To download the weights, go to the Llama website. Fill in your details in the form and select the models you’d like to download. In our case, we will download the Llama 3 models.

Select Meta Llama 3 and Meta Llama Guard 2 on the download page

Read and agree to the license agreement, then click Accept and continue. A unique URL will appear on the website; you will also receive it by email. The URL is valid for 24 hours and allows you to download each model up to 5 times. You can always request a new URL.

Download page with unique pre-signed URL

We are now ready to get the weights and run the model locally on our machine. We recommend using a Python virtual environment for this demo; we use Miniconda, but any virtual environment of your choice will work.

Open your terminal, and make a new folder called llama3-demo in your workspace. Navigate to the new folder and clone the Llama repo:

mkdir llama3-demo
cd llama3-demo 
git clone https://github.com/meta-llama/llama3.git

For this demo, we’ll need two prerequisites installed: wget and md5sum. To confirm if your distribution has these, use:

wget --version
md5sum --version

which should return the installed versions. If your distribution does not have them, you can install them on Debian or Ubuntu with:

sudo apt-get install wget coreutils

(md5sum is provided by the coreutils package.)

To make sure we have all the package dependencies installed, run the following from inside the newly cloned repo folder:

pip install -e .
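
Optionally, you can confirm that PyTorch (installed as one of the repo's dependencies) can see your GPU and report its VRAM. This is a minimal sketch of such a check, not part of the official setup:

# check_gpu.py - optional sanity check (illustrative helper, not part of the repo)
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1024**3:.1f} GB")
else:
    print("PyTorch cannot see a CUDA device")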

We are now all set to download the model weights for our local setup. Our team has created a helper script that makes this easy. In your terminal, type:

./download.sh

The script will ask for the URL from your email. Paste in the URL you received from Meta. It will then ask you to enter the list of models to download. For our example, we’ll download the 8B pretrained model and the fine-tuned 8B instruct model, so we’ll enter “8B,8B-instruct”.

Downloading the 8B models
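
The download script verifies each file against its MD5 checksum as part of the download, which is why md5sum is a prerequisite. If you ever want to re-verify a download yourself, a sketch along these lines works, assuming the model directory contains a checklist.chk file in standard md5sum format (an MD5 hash and a filename per line):

# verify_weights.py - optional manual re-check of downloaded weights (illustrative sketch)
import hashlib
from pathlib import Path

model_dir = Path("Meta-Llama-3-8B")  # adjust to your download directory
for line in (model_dir / "checklist.chk").read_text().splitlines():
    expected, name = line.split()
    digest = hashlib.md5()
    with open(model_dir / name, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # stream in 1MB chunks
            digest.update(chunk)
    print(name, "OK" if digest.hexdigest() == expected else "MISMATCH")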

Running the model

We are all set to run an example inference script to test that the model is set up correctly and works. Our team has created an example Python script called example_text_completion.py that you can use to test the model.

The script defines a main function that uses the Llama class from the llama library to generate text completions for given prompts using the pre-trained models. It takes a few arguments:

ckpt_dir: str - Directory containing the model checkpoint files.
tokenizer_path: str - Path to the model's tokenizer.
temperature: float = 0.6 - Controls the randomness of the generation process. Higher values lead to more creative but less coherent output; lower values lead to more conservative but more coherent output.
top_p: float = 0.9 - The cumulative probability threshold used for nucleus (top-p) sampling when generating tokens.
max_seq_len: int = 128 - Maximum length of the input sequence, or prompt, the model will process.
max_gen_len: int = 64 - Maximum length of the text the model is allowed to generate.
max_batch_size: int = 4 - Maximum number of prompts processed in one batch.

The main function builds an instance of the Llama class using the provided arguments, defines a list of prompts, and calls the generator.text_completion method to generate completions for them.
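
The shape of that driver is worth seeing on its own. Below is a minimal sketch modeled on example_text_completion.py; the prompt and paths are placeholders, and like the original script it must be launched with torchrun, as shown next:

# minimal_text_completion.py - illustrative sketch based on example_text_completion.py
from llama import Llama

generator = Llama.build(
    ckpt_dir="Meta-Llama-3-8B/",                       # your checkpoint directory
    tokenizer_path="Meta-Llama-3-8B/tokenizer.model",  # your tokenizer path
    max_seq_len=128,
    max_batch_size=4,
)

prompts = ["I believe the meaning of life is"]  # placeholder prompt
results = generator.text_completion(
    prompts, max_gen_len=64, temperature=0.6, top_p=0.9
)
for prompt, result in zip(prompts, results):
    print(prompt + result["generation"])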

To run the script, go back to the terminal and, while in the llama3 repo, type:

torchrun --nproc_per_node 1 example_text_completion.py --ckpt_dir Meta-Llama-3-8B/ --tokenizer_path Meta-Llama-3-8B/tokenizer.model --max_seq_len 128 --max_batch_size 4

Replace Meta-Llama-3-8B/ with the path to your checkpoint directory and tokenizer.model with the path to your tokenizer model. If you run it from the main repo directory, these paths may not need to change.

Set --nproc_per_node to the model-parallel (MP) value for the model you are using. For the 8B models, the value is 1.

Adjust the max_seq_len and max_batch_size parameters as needed. We have set them to 128 and 4 respectively.

Running the 8B model on the example text completion script

To try out the fine-tuned chat model (8B-instruct), we have a similar example called example_chat_completion.py.

torchrun --nproc_per_node 1 example_chat_completion.py --ckpt_dir Meta-Llama-3-8B-Instruct/ --tokenizer_path Meta-Llama-3-8B-Instruct/tokenizer.model --max_seq_len 512 --max_batch_size 6

Note that in this case we use the Meta-Llama-3-8B-Instruct/ model and point to the tokenizer inside the instruct model folder.

Running the 8B Instruct model on the example chat completion script
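
For reference, a minimal chat-style driver looks like the sketch below, modeled on example_chat_completion.py; the dialog content is a placeholder, and it is launched with torchrun just like the text completion example:

# minimal_chat_completion.py - illustrative sketch based on example_chat_completion.py
from llama import Llama

generator = Llama.build(
    ckpt_dir="Meta-Llama-3-8B-Instruct/",
    tokenizer_path="Meta-Llama-3-8B-Instruct/tokenizer.model",
    max_seq_len=512,
    max_batch_size=6,
)

dialogs = [
    [{"role": "user", "content": "What is the recipe for mayonnaise?"}],  # placeholder dialog
]
results = generator.chat_completion(
    dialogs, max_gen_len=None, temperature=0.6, top_p=0.9
)
for dialog, result in zip(dialogs, results):
    print(result["generation"]["content"])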

A detailed step-by-step walkthrough of this setup, along with all the helper and example scripts, can be found in our Llama 3 GitHub repo, which covers downloading the models, getting started quickly, and running inference examples.
