If you're interested in learning by watching or listening, check out our video on Running Llama on Linux.
At Meta, we strongly believe in an open approach to AI development, particularly in the fast-evolving domain of generative AI. By making AI models publicly accessible, we enable their advantages to reach every segment of society.
Last year, we open sourced Meta Llama 2, and this year we released the Meta Llama 3 family of models, available in 8B and 70B pretrained and instruction-tuned versions to support a wide range of applications. These models put the power of large language models in everyone's hands, so you can experiment, innovate, and scale your ideas responsibly.
On a Linux machine with a GPU that has at least 16GB of VRAM, you should be able to load the 8B Llama models in fp16 locally. If you have an NVIDIA GPU, you can confirm your setup using the NVIDIA System Management Interface tool, which shows the GPU you have, the available VRAM, and other useful information, by typing:
nvidia-smi
In our current setup, we are on Ubuntu, specifically Pop OS, and have an Nvidia RTX 4090 with a total VRAM of about 24GB.
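If you already have PyTorch installed in your environment, you can also run a quick sanity check from Python (a small sketch, not part of the Llama repo):

```python
# Quick check of the visible GPU and its total memory using PyTorch.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(torch.cuda.current_device())
    print(f"GPU: {props.name}")
    print(f"Total VRAM: {props.total_memory / 1024**3:.1f} GB")
else:
    print("No CUDA-capable GPU detected.")
```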
We are now ready to get the weights and run the model locally on our machine. It is recommended to use a Python virtual environment for running this demo. In this demo, we are using Miniconda, but you can use any virtual environment of your choice.
For this demo, create a new folder called llama3-demo in your workspace. Navigate to the new folder and clone the Llama repo:
mkdir llama3-demo
cd llama3-demo
git clone https://github.com/meta-llama/llama3.git
For this demo, we need two prerequisites installed: wget and md5sum. To confirm your distribution has these, use:
wget --version
md5sum --version
which should return the installed versions. If your distribution does not have them, you can install them with:
apt-get install wget
apt-get install md5sum
To make sure we have all the package dependencies installed, while in the newly cloned repo folder, type:
pip install -e .
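To confirm the editable install worked, a quick check (just a sketch, not part of the repo) is to import the packages from Python:

```python
# Sanity check: confirm the editable install of the llama package and PyTorch import cleanly.
import torch
import llama  # provided by the editable install of the meta-llama/llama3 repo

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("llama package location:", llama.__file__)
```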
We are now all set to download the model weights for our local setup. Our team has created a helper script to make downloading them easy. In your terminal, type:
./download.sh
The script will ask for the download URL you received by email after accepting the license on the Meta Llama website, and for the list of models to fetch. For this demo, enter "8B,8B-instruct" to download both the pretrained and instruction-tuned 8B weights.
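The download script verifies checksums as it runs, but if you ever want to re-check a file by hand later, a small sketch using Python's hashlib could look like the following. The Meta-Llama-3-8B directory name and the checklist.chk filename are assumptions about where and how the script stored your download; adjust them to match your setup.

```python
# Re-verify downloaded weight files against an md5 checklist ("<md5>  <filename>" per line).
# The directory and checklist filename below are assumptions; adjust to your download.
import hashlib
from pathlib import Path

model_dir = Path("Meta-Llama-3-8B")


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 of a (potentially very large) file in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


for line in (model_dir / "checklist.chk").read_text().splitlines():
    expected, filename = line.split()
    actual = md5_of(model_dir / filename)
    print(f"{filename}: {'OK' if actual == expected else 'MISMATCH'}")
```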
Once the weights are downloaded, the repo provides an example script, example_text_completion.py, that you can use to test out the model. The script defines a main function that uses the Llama class from the llama library to generate text completions for given prompts using the pre-trained models. It takes a few arguments:

| Parameters | Descriptions |
|---|---|
| ckpt_dir: str | Directory containing the checkpoint files of the model. |
| tokenizer_path: str | Path to the tokenizer of the model. |
| temperature: float = 0.6 | This parameter controls the randomness of the generation process. Higher values may lead to more creative but less coherent outputs, while lower values may lead to more conservative but more coherent outputs. |
| top_p: float = 0.9 | This defines the maximum probability threshold for generating tokens. |
| max_seq_len: int = 128 | Defines the maximum length of the input sequence or prompt allowed for the model to process. |
| max_gen_len: int = 64 | Defines the maximum length of the generated text the model is allowed to produce. |
| max_batch_size: int = 4 | Defines the maximum number of prompts to process in one batch. |
The main function builds an instance of the Llama class using the provided arguments, then defines a list of prompts for which the model uses the generator.text_completion method to generate the completions.
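For reference, here is a minimal sketch of how those pieces fit together, modeled on example_text_completion.py. The checkpoint and tokenizer paths are assumptions about your local layout, and, like the example script, it is meant to be launched with torchrun rather than plain python.

```python
# Minimal text-completion sketch modeled on example_text_completion.py.
# Paths are assumptions about where the 8B weights were downloaded.
from llama import Llama

generator = Llama.build(
    ckpt_dir="Meta-Llama-3-8B/",
    tokenizer_path="Meta-Llama-3-8B/tokenizer.model",
    max_seq_len=128,
    max_batch_size=4,
)

prompts = [
    "I believe the meaning of life is",
    "Simply put, the theory of relativity states that",
]

# Generate a completion for each prompt in the batch.
results = generator.text_completion(
    prompts,
    max_gen_len=64,
    temperature=0.6,
    top_p=0.9,
)

for prompt, result in zip(prompts, results):
    print(prompt)
    print(f"> {result['generation']}")
```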
To run the script, from inside the llama3 repo folder, type:
torchrun --nproc_per_node 1 example_text_completion.py --ckpt_dir Meta-Llama-3-8B/ --tokenizer_path Meta-Llama-3-8B/tokenizer.model --max_seq_len 128 --max_batch_size 4
Replace Meta-Llama-3-8B/ with the path to your checkpoint directory and tokenizer.model with the path to your tokenizer model. If you run the command from this main directory, the paths may not need to change. Set --nproc_per_node to the MP (model parallel) value for the model you are using; for the 8B models, the value is 1. Adjust the max_seq_len and max_batch_size parameters as needed; we have set them to 128 and 4 respectively.

To try out the fine-tuned chat model (8B-instruct), we have a similar example called example_chat_completion.py:
torchrun --nproc_per_node 1 example_chat_completion.py --ckpt_dir Meta-Llama-3-8B-Instruct/ --tokenizer_path Meta-Llama-3-8B-Instruct/tokenizer.model --max_seq_len 512 --max_batch_size 6
Note that in this case we point the checkpoint directory to the Meta-Llama-3-8B-Instruct/ model and provide the correct tokenizer under the instruct model folder.
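For reference, a minimal sketch of a chat-style call, modeled on example_chat_completion.py, is shown below. The paths are assumptions about your local layout, and it is likewise meant to be launched with torchrun.

```python
# Minimal chat-completion sketch modeled on example_chat_completion.py.
# Paths are assumptions about where the 8B-instruct weights were downloaded.
from llama import Llama

generator = Llama.build(
    ckpt_dir="Meta-Llama-3-8B-Instruct/",
    tokenizer_path="Meta-Llama-3-8B-Instruct/tokenizer.model",
    max_seq_len=512,
    max_batch_size=6,
)

# Each dialog is a list of messages, where each message has a "role" and "content".
dialogs = [
    [{"role": "user", "content": "What is the recipe for mayonnaise?"}],
    [
        {"role": "system", "content": "Always answer with emojis."},
        {"role": "user", "content": "How do I get to the moon?"},
    ],
]

results = generator.chat_completion(
    dialogs,
    max_gen_len=None,
    temperature=0.6,
    top_p=0.9,
)

for dialog, result in zip(dialogs, results):
    print(f"{dialog[-1]['role']}: {dialog[-1]['content']}")
    print(f"> {result['generation']['role']}: {result['generation']['content']}")
```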