This tutorial is part of our Build with Meta Llama series, where we demonstrate the capabilities and practical applications of Llama for developers like you, so that you can leverage the benefits that Llama has to offer and incorporate it into your own applications. It supports the video Running Llama on Mac | Build with Meta Llama, where we learn how to run Llama on macOS using Ollama, with a step-by-step walkthrough to help you follow along.
If you're interested in learning by watching or listening, check out our video on Running Llama on Mac.
For this demo, we are using a MacBook Pro running macOS Sonoma 14.4.1 with 64GB of memory. Since we will be using Ollama, this setup can also be used on other supported operating systems, such as Linux or Windows, by following similar steps to the ones shown here.
Ollama lets you set up and run large language models, like the Llama models, locally.
The first step is to install Ollama. To do that, visit their website, where you can choose your platform, and click on “Download” to download Ollama. For our demo, we will choose macOS, and select “Download for macOS”.
Next, we will make sure that we can test-run Meta Llama 3 models on Ollama. Please note that Ollama provides Meta Llama models in a 4-bit quantized format. To test-run the model, let’s open our terminal and run ollama pull llama3 to download the 4-bit quantized Meta Llama 3 8B chat model, which has a size of about 4.7 GB.
If you’d like to download the Llama 3 70B chat model, also in 4-bit, you can instead type:
ollama pull llama3:70b
which, in quantized format, would have a size of about 39 GB.
To run our model, in your terminal, type:
ollama run llama3
We are all set to ask questions and chat with our Meta Llama 3 model. Let’s ask some questions:
"Who wrote the book godfather?"
We can see that it gives the right answer, along with more information about the book as well as the movie that was based on it. What if we just wanted the name of the author, without the extra information? Let’s adapt our prompt accordingly, specifying the kind of response we expect:
"Who wrote the book godfather? Answer with only the name."
We can see that it generates the answer in the format we requested.
You can also try running the 70B model:
ollama run llama3:70b
but the inference speed will likely be slower.
You can even run and test the Llama 3 8B model directly by using the curl command and specifying your prompt right in the command:
curl http://localhost:11434/api/chat -d '{
  "model": "llama3",
  "messages": [
    {
      "role": "user",
      "content": "who wrote the book godfather?"
    }
  ],
  "stream": false
}'
Here, we are sending a POST request to an API running on localhost. The API endpoint is for "chat", which will interact with our AI model hosted on the server. We are providing a JSON payload with three fields: model, a string specifying the name of the model to use for processing the input prompt (llama3); messages, an array holding a single message, with a string indicating the role of the message sender (user) and a string with the user's input prompt ("who wrote the book godfather?"); and stream, a boolean value indicating whether the response should be streamed or not. In our case, it is set to false, meaning the entire response will be returned at once.
As we can see, the model generated the response with the answer to our question.
This example can also be run using a Python script. To install Python, visit the Python website, where you can choose your OS and download the version of Python you like.
To run it using a Python script, open the editor of your choice and create a new file. First, let’s add the imports we will need for this demo, and define a variable called url, which will have the same value as the URL we saw in the curl demo:
import requests
import json
url = "http://localhost:11434/api/chat"
We will now add a new function called llama3, which will take in prompt as an argument:
def llama3(prompt):
    data = {
        "model": "llama3",
        "messages": [
            {
                "role": "user",
                "content": prompt
            }
        ],
        "stream": False,
    }
    headers = {
        "Content-Type": "application/json"
    }
    response = requests.post(url, headers=headers, json=data)
    return response.json()["message"]["content"]
This function constructs a JSON payload containing the specified prompt and the model name, which is "llama3". Then, it sends a POST request to the API endpoint with the JSON payload as the message body, using the requests library. Once the response is received, the function extracts the content of the response message from the JSON object returned by the API, and returns this extracted content.
Finally, we will provide the prompt and print the generated response:
response = llama3("who wrote the book godfather")
print(response)
To run the script, type python <name of script>.py and press Enter.
As we can see, it generated the response based on the prompt we provided in our script. To learn more about the complete Ollama APIs, check out their documentation.
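In all of the examples above, we set stream to false so that the entire response is returned at once. If you would rather print tokens as they are generated, you can set stream to true and read the response line by line. The sketch below is a minimal, hypothetical variation of our llama3 function (we’ve called it llama3_stream); it assumes that, when streaming, the /api/chat endpoint returns newline-delimited JSON objects that each carry a partial message and a done flag, as described in the Ollama API documentation.
import requests
import json

url = "http://localhost:11434/api/chat"

# llama3_stream is our own helper name, not part of the Ollama API
def llama3_stream(prompt):
    data = {
        "model": "llama3",
        "messages": [
            {
                "role": "user",
                "content": prompt
            }
        ],
        "stream": True,
    }
    # stream=True tells requests to read the response body incrementally
    with requests.post(url, json=data, stream=True) as response:
        for line in response.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)
            # each chunk should carry a partial assistant message until done is true
            print(chunk.get("message", {}).get("content", ""), end="", flush=True)
            if chunk.get("done"):
                print()
                break

llama3_stream("who wrote the book godfather? Answer with only the name.")
Running this should print the answer piece by piece instead of waiting for the full response.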
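Because the messages array is a list, the same endpoint can also carry on a multi-turn conversation: you append each user prompt and each assistant reply to the list and send the whole history with every request. Here is a rough sketch of that idea; the chat_history list and the ask function are our own illustrative names, not part of the Ollama API.
import requests

url = "http://localhost:11434/api/chat"
chat_history = []  # holds every user and assistant message so far

def ask(prompt):
    # add the new user turn, then send the entire conversation to the model
    chat_history.append({"role": "user", "content": prompt})
    data = {"model": "llama3", "messages": chat_history, "stream": False}
    reply = requests.post(url, json=data).json()["message"]["content"]
    # keep the assistant's reply so the next question has context
    chat_history.append({"role": "assistant", "content": reply})
    return reply

print(ask("Who wrote the book godfather? Answer with only the name."))
print(ask("What other books did the same author write?"))
With this pattern, the second question can refer back to the first answer because the model sees the full conversation on every request.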
To check out the full example and run it on your own machine, refer to the detailed sample notebook our team has developed, which can be found in the llama-cookbook GitHub repo. There you will find an example of how to run Llama 3 models on a Mac as well as on other platforms, the examples we discussed here, and other ways to use Llama 3 locally with Ollama via LangChain.
We’ve also created various other demos and examples to provide you with guidance and references to help you get started with Llama models and make it easier to integrate Llama into your own use cases. These demos and examples are also located in our llama-cookbook GitHub repo and on PyPI, where you’ll find complete walkthroughs for getting started with Llama models, including several examples for inference, fine-tuning, and training on custom datasets, as well as demos that showcase Llama deployments, basic interactions, and specialized use cases.