If you're interested in learning by watching or listening, check out our video on Running Llama on Mac.
For this demo, we are using a MacBook Pro running Sonoma 14.4.1 with 64GB of memory. Since we will be using Ollama, this setup can also be used on other supported operating systems, such as Linux or Windows, following similar steps to the ones shown here.
In your terminal, type:
ollama pull llama3
to download the 4-bit quantized Meta Llama 3 8B chat model, with a size of about 4.7 GB. If you'd like to download the Llama 3 70B chat model, also in 4-bit, you can instead type:
ollama pull llama3:70b
which, in quantized format, has a size of about 39 GB.
To run our model, in your terminal, type:
ollama run llama3
We are all set to ask questions and chat with our Meta Llama 3 model. Let’s ask some questions:
"Who wrote the book godfather?"
We can see that it gives the right answer, along with more information about the book as well as the movie that was based on it. What if we just wanted the name of the author, without the extra information? Let's adapt our prompt accordingly, specifying the kind of response we expect:
"Who wrote the book godfather? Answer with only the name."
We can see that it generates the answer in the format we requested.
You can also try running the 70B model:
ollama run llama3:70b
but the inference speed will likely be slower.
An alternative way to interact with the model is by using the curl command and specifying your prompt right in the command:
curl http://localhost:11434/api/chat -d '{
"model": "llama3",
"messages": [
{
"role": "user",
"content": "who wrote the book godfather?"
}
],
"stream": false
}'
The request body contains the model name (llama3), an array of messages with a string indicating the role of the message sender (user) and a string with the user's input prompt ("who wrote the book godfather?"), and a boolean value stream indicating whether the response should be streamed or not. In our case, it is set to false, meaning the entire response will be returned at once as a single JSON object, whose message field holds the model's reply. As we can see, the model generated the response with the answer to our question.
Now let's interact with the model using a Python script. First, we import the required libraries and define the variable url, which will have the same value as the URL we saw in the curl demo:
import requests
import json

url = "http://localhost:11434/api/chat"
Next, we define a function called llama3, which will take in prompt as an argument:
def llama3(prompt):
    # Build the request body for Ollama's /api/chat endpoint
    data = {
        "model": "llama3",
        "messages": [
            {
                "role": "user",
                "content": prompt
            }
        ],
        "stream": False,
    }
    headers = {
        "Content-Type": "application/json"
    }
    # Send the request and return only the text of the model's reply
    response = requests.post(url, headers=headers, json=data)
    return response.json()["message"]["content"]
The function sends the prompt to the model as a POST request using the requests library. Once the response is received, the function extracts the content of the response message from the JSON object returned by the API, and returns this extracted content. Finally, we will provide the prompt and print the generated response:
response = llama3("who wrote the book godfather")
print(response)
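By setting stream to True instead, you can print the answer as it is generated rather than waiting for the full response. The sketch below is a minimal illustration, assuming Ollama's newline-delimited JSON streaming for /api/chat and reusing the url, requests, and json imports from above; the helper name llama3_stream is just for illustration:

def llama3_stream(prompt):
    # Same request body as before, but with streaming enabled
    data = {
        "model": "llama3",
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }
    # With stream=True, we can iterate over the response as it arrives;
    # each line is assumed to be a JSON chunk with a partial message
    with requests.post(url, json=data, stream=True) as response:
        for line in response.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)
            print(chunk.get("message", {}).get("content", ""), end="", flush=True)
            if chunk.get("done"):
                break
    print()

llama3_stream("who wrote the book godfather")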
To run the script, in your terminal type:
python <name of script>.py
and press Enter.
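If the request fails, the most common causes are that Ollama is not running or that the model has not been pulled yet. The variation below is just an illustrative sketch (the function name llama3_checked and the timeout value are assumptions, not part of the original script) that surfaces HTTP errors before parsing the response:

def llama3_checked(prompt):
    # Same request as llama3, with a timeout and explicit error checking
    data = {
        "model": "llama3",
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    response = requests.post(url, json=data, timeout=120)
    response.raise_for_status()  # raises on HTTP errors, e.g. if the model is not available
    return response.json()["message"]["content"]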