This tutorial is part of our Build with Meta Llama series, where we demonstrate the capabilities and practical applications of Llama for developers like you, so that you can leverage its benefits and incorporate it into your own applications. It accompanies the video Running Llama on Linux | Build with Meta Llama, a step-by-step walkthrough of getting the weights and running the model locally on Linux.
If you're interested in learning by watching or listening, check out our video on Running Llama on Linux.
At Meta, we strongly believe in an open approach to AI development, particularly in the fast-evolving domain of generative AI. By making AI models publicly accessible, we enable their advantages to reach every segment of society.
Last year, we open sourced Meta Llama 2, and this year we released the Meta Llama 3 family of models, available in both 8B and 70B pretrained and instruction-tuned versions to support a wide range of applications. Making these large language models accessible to everyone unlocks their power so you can experiment, innovate, and scale your ideas responsibly.
On a Linux machine with a GPU that has at least 16GB of VRAM, you should be able to load the 8B Llama models in fp16 locally. If you have an NVIDIA GPU, you can confirm your setup using the NVIDIA System Management Interface tool, which shows the GPU you have, the available VRAM, and other useful information, by typing:
nvidia-smi
In our current setup, we are running Pop!_OS, an Ubuntu-based distribution, and have an NVIDIA RTX 4090 with a total of about 24GB of VRAM.
To download the weights, go to the Llama website. Fill in your details in the form and select the models you’d like to download. In our case, we will download the Llama 3 models.
Read and agree to the license agreement, then click Accept and continue. You will see a unique URL on the website, and you will also receive it by email. The URL is valid for 24 hours and allows you to download each model up to 5 times. You can always request a new URL.
We are now ready to get the weights and run the model locally on our machine. We recommend using a Python virtual environment for this demo; we are using Miniconda, but you can use any virtual environment of your choice.
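With Miniconda already installed, creating and activating a fresh environment might look like the following. The environment name llama3 and the Python version here are illustrative choices, not requirements:
conda create -n llama3 python=3.10
conda activate llama3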
Open your terminal, and make a new folder called llama3-demo in your workspace. Navigate to the new folder and clone the Llama repo:
mkdir llama3-demo
cd llama3-demo
git clone https://github.com/meta-llama/llama3.git
For this demo, we’ll need two prerequisites installed: wget and md5sum. To confirm that your distribution has them, use:
wget --version
md5sum --version
Both commands should return the installed versions. If your distribution does not have them, you can install them using:
apt-get install wget
apt-get install md5sum
To make sure we have all the package dependencies installed, type the following while in the newly cloned repo folder:
pip install -e .
We are now all set to download the model weights for our local setup. Our team has created a helper script to make this easy. In your terminal, type:
./download.sh
The script will ask for the URL from your email; paste in the URL you received from Meta. It will then ask you to enter the list of models to download. For our example, we’ll download the 8B pretrained model and the fine-tuned 8B instruct (chat) model, so we’ll enter “8B,8B-instruct”.
We are all set to run the example inference script to test if our model has been set up correctly and works. Our team has created an example Python script called example_text_completion.py that you can use to test out the model.
The script defines a main function that uses the Llama class from the llama library to generate text completions for given prompts using the pre-trained models. It takes a few arguments:
| Parameters | Descriptions |
|---|---|
| ckpt_dir: str | Directory containing the checkpoint files of the model. |
| tokenizer_path: str | Path to the tokenizer of the model. |
| temperature: float = 0.6 | Controls the randomness of the generation process. Higher values may lead to more creative but less coherent outputs, while lower values may lead to more conservative but more coherent outputs. |
| top_p: float = 0.9 | Defines the maximum probability threshold for generating tokens. |
| max_seq_len: int = 128 | Defines the maximum length of the input sequence or prompt allowed for the model to process. |
| max_gen_len: int = 64 | Defines the maximum length of the generated text the model is allowed to produce. |
| max_batch_size: int = 4 | Defines the maximum number of prompts to process in one batch. |
The main function builds an instance of the Llama class using the provided arguments, then defines a list of prompts for which the generator.text_completion method generates completions.
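In simplified form, the script does roughly the following. This is a sketch rather than the exact source: the prompts and print loop are illustrative placeholders, and the real example_text_completion.py in the repo wraps main in a small CLI.

from llama import Llama

def main(
    ckpt_dir: str,
    tokenizer_path: str,
    temperature: float = 0.6,
    top_p: float = 0.9,
    max_seq_len: int = 128,
    max_gen_len: int = 64,
    max_batch_size: int = 4,
):
    # Build the tokenizer and load the model weights from the checkpoint directory.
    generator = Llama.build(
        ckpt_dir=ckpt_dir,
        tokenizer_path=tokenizer_path,
        max_seq_len=max_seq_len,
        max_batch_size=max_batch_size,
    )

    # Illustrative prompts; the repo script ships with its own list.
    prompts = [
        "I believe the meaning of life is",
        "Simply put, the theory of relativity states that",
    ]

    # Generate one completion per prompt and print the results.
    results = generator.text_completion(
        prompts,
        max_gen_len=max_gen_len,
        temperature=temperature,
        top_p=top_p,
    )
    for prompt, result in zip(prompts, results):
        print(prompt)
        print(f"> {result['generation']}")

Because Llama.build initializes the distributed process group, the script is launched with torchrun rather than plain python.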
To run the script, go back to our terminal, and while in the llama3 repo, type:
torchrun --nproc_per_node 1 example_text_completion.py --ckpt_dir Meta-Llama-3-8B/ --tokenizer_path Meta-Llama-3-8B/tokenizer.model --max_seq_len 128 --max_batch_size 4
Replace Meta-Llama-3-8B/ with the path to your checkpoint directory and Meta-Llama-3-8B/tokenizer.model with the path to your tokenizer model. If you run the command from the main repo directory, the paths may not need to change.
Set --nproc_per_node to the MP (model parallel) value for the model you are using. For the 8B models, the value is 1.
Adjust the max_seq_len and max_batch_size parameters as needed. We have set them to 128 and 4 respectively.
To try out the fine-tuned chat model (8B-instruct), we have a similar example called example_chat_completion.py:
torchrun --nproc_per_node 1 example_chat_completion.py --ckpt_dir Meta-Llama-3-8B-Instruct/ --tokenizer_path Meta-Llama-3-8B-Instruct/tokenizer.model --max_seq_len 512 --max_batch_size 6
Note that in this case, we use the Meta-Llama-3-8B-Instruct/ model and provide the correct tokenizer under the instruct model folder.
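Internally, example_chat_completion.py follows the same pattern as the text-completion script, but it passes dialogs, lists of role/content messages, to generator.chat_completion instead of raw prompt strings. A rough sketch, with illustrative dialog contents and hard-coded paths for this demo's layout, might look like this:

from llama import Llama

# Build the generator from the instruct checkpoint.
generator = Llama.build(
    ckpt_dir="Meta-Llama-3-8B-Instruct/",
    tokenizer_path="Meta-Llama-3-8B-Instruct/tokenizer.model",
    max_seq_len=512,
    max_batch_size=6,
)

# Each dialog is a list of messages with a "role" (system, user, or assistant) and "content".
dialogs = [
    [{"role": "user", "content": "What is the capital of France?"}],
    [
        {"role": "system", "content": "Always answer in one short sentence."},
        {"role": "user", "content": "Explain what fp16 means."},
    ],
]

# Generate one assistant reply per dialog.
results = generator.chat_completion(
    dialogs,
    max_gen_len=None,
    temperature=0.6,
    top_p=0.9,
)

for dialog, result in zip(dialogs, results):
    for message in dialog:
        print(f"{message['role']}: {message['content']}")
    print(f"> {result['generation']['role']}: {result['generation']['content']}")

As with the text-completion example, this code runs under torchrun so that the process group is initialized correctly.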
A detailed step-by-step walkthrough for this setup, along with all the helper and example scripts, can be found in our Llama 3 GitHub repo, which covers downloading the models, quick-start instructions, and inference examples.