Running Meta Llama on Windows
This tutorial is a part of our Build with Meta Llama series, where we demonstrate the capabilities and practical applications of Llama for developers like you, so that you can leverage the benefits that Llama has to offer and incorporate it into your own applications. This tutorial supports the video Running Llama on Windows | Build with Meta Llama, where we learn how to run Llama on Windows using Hugging Face APIs, with a step-by-step tutorial to help you follow along.
If you're interested in learning by watching or listening, check out our video on Running Llama on Windows.
Setup
For this demo, we will be using a Windows machine with an RTX 4090 GPU. If you have an NVIDIA GPU, you can confirm your setup by opening the Terminal and running nvidia-smi (NVIDIA System Management Interface), which will show you the GPU you have, the VRAM available, and other useful information about your setup.
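For example, from the terminal:

```shell
# Shows the driver version, GPU model, VRAM usage, and running GPU processes.
nvidia-smi
```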
Since we will be using the Hugging Face transformers library, these steps can also be followed on other operating systems that the library supports, such as Linux or macOS, following similar steps to the ones shown in the video.
Getting the weights
To allow easy access to Meta Llama models, we are providing them on Hugging Face, where you can download the models in both transformers and native Llama 3 formats.
To download the weights, visit the meta-llama repo containing the model you’d like to use. For this demo, we will use the Llama-3.1-8B-Instruct model. Read the license agreement, fill in your details, accept the terms, and click Submit. Once your request is approved, you'll be granted access to all the Llama 3 models.
For this tutorial, we will be using Meta Llama models already converted to Hugging Face format. However, if you’d like to download the original native weights, click on the "Files and versions" tab and download the contents of the original folder.
If you prefer, you can also download the original weights from the command line using the Hugging Face CLI:
```shell
pip install huggingface-hub
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct --include "original/*" --local-dir meta-llama/Llama-3.1-8B-Instruct
```

Running the model
In this example, we will showcase how to run a Meta Llama model that has already been converted to Hugging Face format, using the pipeline class from the transformers library. We recommend running this demo in a Python virtual environment; we are using Miniconda here, but you can use any virtual environment of your choice.
Make sure to use the latest version of transformers:

```shell
pip install -U transformers
```

We will also use the accelerate library, which enables our code to run across any distributed configuration.
```shell
pip install accelerate
```

We will be using Python for our demo script. To install Python, visit the Python website, where you can choose your OS and download the version of Python you like. We will also be using PyTorch for our demo, so we need to make sure it is installed in our setup. To install PyTorch, visit the PyTorch downloads page and choose your OS and configuration to get the installation command you need. Paste that command into your terminal and press Enter.
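Once PyTorch is installed, a quick way to confirm that it can see your GPU (assuming the NVIDIA driver setup from earlier) is:

```python
import torch

# Confirm PyTorch is installed and whether it can see a CUDA-capable GPU.
gpu_available = torch.cuda.is_available()
print(f"PyTorch {torch.__version__}, CUDA available: {gpu_available}")
if gpu_available:
    # On the demo machine this reports the RTX 4090.
    print(torch.cuda.get_device_name(0))
```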
For our script, open the editor of your choice, and create a Python script. We’ll first add the imports that we need for our example:
```python
import transformers
import torch
from transformers import AutoTokenizer
```

Let's define the model we’d like to use. In our demo, we will use the 8B Instruct model, which is fine-tuned for chat:

```python
model = "meta-llama/Llama-3.1-8B-Instruct"
```

We will also instantiate the tokenizer, derived from AutoTokenizer based on the model we’ve chosen, using the from_pretrained method. This will download and cache the pre-trained tokenizer and return an instance of the appropriate tokenizer class.

```python
tokenizer = AutoTokenizer.from_pretrained(model)
```

To use our model for inference:
```python
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)
```

Hugging Face pipelines let us specify the type of task the pipeline should run (text-generation in this case), the model it should use to make predictions (model), the precision to use with this model (torch.float16), the device on which it should run (device_map), and various other options. Setting device_map to "auto" means the pipeline will automatically use a GPU if one is available.
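When called with a prompt, a text-generation pipeline returns a list of dictionaries, one per generated sequence, each holding its text under the generated_text key. As a plain-Python illustration of that shape (hard-coded sample data, no model call involved):

```python
# Hypothetical pipeline output: a list with one dict per generated sequence.
results = [
    {"generated_text": "I have tomatoes, basil and cheese at home. "
                       "What can I cook for dinner?\nYou could make a caprese salad..."},
]

# Each generated text is read out of its dict by key.
for res in results:
    print(f"Result: {res['generated_text']}")
```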
Next, let's provide a text prompt as input to our pipeline for it to use when generating a response, and store the result in a variable called sequences:
```python
sequences = pipeline(
    'I have tomatoes, basil and cheese at home. What can I cook for dinner?\n',
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    truncation=True,
    max_length=400,
)
```

We set do_sample to True, which lets us choose a sampling-based decoding strategy for selecting the next token from the probability distribution over the vocabulary. In our example, we are using top-k sampling.
By changing max_length, you can specify how long you’d like the generated response to be. Setting the num_return_sequences parameter to greater than one will let you generate more than one output.
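To build some intuition for what top_k does, here is a purely illustrative sketch in plain Python (a toy, not the transformers implementation): it keeps only the k highest-scoring tokens and renormalizes their probabilities before sampling.

```python
import math
import random

def top_k_sample(logits, k, rng=random.Random(0)):
    """Toy top-k sampling: pick a token index from the k highest-scoring logits."""
    # Indices of the k largest logits; everything else is discarded.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over only the surviving logits.
    exps = [math.exp(logits[i]) for i in top]
    probs = [e / sum(exps) for e in exps]
    # Sample one index among the survivors, weighted by renormalized probability.
    return rng.choices(top, weights=probs, k=1)[0]

# Toy "vocabulary" of 5 tokens with made-up scores.
logits = [2.0, 0.5, 3.0, -1.0, 1.0]
token = top_k_sample(logits, k=2)
print(token)  # always 0 or 2: only the two highest-scoring tokens can be chosen
```

With k=10 and a real vocabulary, the model still samples randomly, but only among its ten most likely next tokens at each step.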
Finally, we add a loop that prints each generated response:
```python
for seq in sequences:
    print(f"Result: {seq['generated_text']}")
```

Save your script as llama3-hf-demo.py and head back to the terminal. Before we run the script, let’s make sure we can access and interact with Hugging Face directly from the terminal. To do that, make sure you have the Hugging Face CLI installed:
```shell
pip install -U "huggingface_hub[cli]"
```

followed by
```shell
huggingface-cli login
```

Here, it will ask for our access token, which we can get from our Hugging Face account under Settings. Copy it and provide it in the command line. We are now all set to run our script.
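Alternatively, recent versions of huggingface_hub also read the token from the HF_TOKEN environment variable, so you can skip the interactive login (the token below is a hypothetical placeholder; substitute your own):

```shell
# Hypothetical placeholder token: replace with the one from your HF account settings.
export HF_TOKEN=hf_xxxxxxxxxxxxxxxx
# On Windows PowerShell: $env:HF_TOKEN = "hf_xxxxxxxxxxxxxxxx"
```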
```shell
python llama3-hf-demo.py
```

To check out the full example and run it on your local machine, see the detailed sample notebook in the llama-cookbook GitHub repo. There you will find an example of how to run Llama 3 models using already converted Hugging Face weights, as well as an example of converting the original weights into Hugging Face format and running with those.
We’ve also created various other demos and examples to provide guidance and reference material to help you get started with Llama models and make it easier to integrate them into your own use cases. To try these examples, check out our llama-cookbook GitHub repo or install llama-cookbook from PyPI. You’ll find complete walkthroughs for getting started with Llama models, including examples of inference, fine-tuning, and training on custom data sets. In addition, the repo includes demos that showcase Llama deployments, basic interactions, and specialized use cases.
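For example, to take the PyPI route mentioned above:

```shell
pip install llama-cookbook
```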