If you're interested in learning by watching or listening, check out our video on Running Llama on Windows.
nvidia-smi
(NVIDIA System Management Interface), which will show you the GPU you have, the VRAM available, and other useful information about your setup.
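As an optional extra check, you can confirm from Python that PyTorch detects your GPU. This is a minimal sketch and assumes PyTorch is already installed with CUDA support:
import torch
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # name of the first detected NVIDIA GPU
else:
    print("No CUDA-capable GPU detected")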
Next, go to the meta-llama repo containing the model you'd like to use. For example, we will use the Meta-Llama-3-8B-Instruct model for this demo. Read the license agreement, fill in your details, accept the terms, and click Submit. Once your request is approved, you'll be granted access to all the Llama 3 models.
For this tutorial, we will be using Meta Llama models already converted to Hugging Face format. However, if you'd like to download the original native weights, click on the "Files and versions" tab and download the contents of the original folder.
If you prefer, you can also download the original weights from the command line using the Hugging Face CLI:
pip install huggingface-hub
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct --include "original/*" --local-dir meta-llama/Meta-Llama-3-8B-Instruct
In this example, we will show how to run a Meta Llama model that has already been converted to Hugging Face format, using the pipeline class from Transformers. We recommend running this demo inside a Python virtual environment; here we use Miniconda, but any virtual environment of your choice will work.
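For example, with Miniconda you could create and activate a dedicated environment like this (the environment name llama3 and the Python version are just placeholders):
conda create -n llama3 python=3.10
conda activate llama3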
Next, install the transformers library:
pip install -U transformers
We also need the accelerate library, which enables our code to be run across any distributed configuration:
pip install accelerate
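To confirm both libraries are installed, you can print their versions (an optional quick check):
python -c "import transformers, accelerate; print(transformers.__version__, accelerate.__version__)"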
Open the editor of your choice and create a new Python script. We'll first add the imports that we need for our example:
import transformers
import torch
from transformers import AutoTokenizer
Let's define the model we'd like to use. In our demo, we will use the 8B instruct model, which is fine-tuned for chat:
model = "meta-llama/Meta-Llama-3-8B-Instruct"
Next, we load the tokenizer with the from_pretrained method of AutoTokenizer. This will download and cache the pre-trained tokenizer and return an instance of the appropriate tokenizer class:
tokenizer = AutoTokenizer.from_pretrained(model)
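If you're curious about what the tokenizer produces, you can encode and decode a sample string (purely illustrative, not required for the demo):
sample_ids = tokenizer("I have tomatoes, basil and cheese at home.")["input_ids"]
print(sample_ids)                    # token IDs produced by the Llama 3 tokenizer
print(tokenizer.decode(sample_ids))  # decodes back to the text, including the BOS special token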
To use our model for inference:
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)
Here we specify the task the pipeline should perform ("text-generation" in this case), the model that the pipeline should use to make predictions (specified by model), the precision to use with this model (torch.float16), the device on which the pipeline should run (device_map), and various other options. We'll also set the device_map argument to "auto", which means the pipeline will automatically use a GPU if one is available.
Next, we pass our prompt to the pipeline and store the results in sequences:
sequences = pipeline(
    'I have tomatoes, basil and cheese at home. What can I cook for dinner?\n',
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    truncation=True,
    max_length=400,
)
Here we set do_sample to True, which allows us to specify the decoding strategy we'd like to use to select the next token from the probability distribution over the entire vocabulary. In our example, we are using top_k sampling. With max_length, you can specify how long you'd like the generated response to be. Setting the num_return_sequences parameter to a value greater than one lets you generate more than one output.
Finally, we add the following loop to print the generated output:
for seq in sequences:
print(f"Result: {seq['generated_text']}")
Save the script as llama3-hf-demo.py. Before we run it, let's make sure we can access and interact with Hugging Face directly from the terminal. To do that, make sure you have the Hugging Face CLI installed:
pip install -U "huggingface_hub[cli]"
followed by:
huggingface-cli login
and enter your Hugging Face access token when prompted.
We're now ready to run the script:
python llama3-hf-demo.py