Evaluation is the process of systematically measuring how well your Llama-powered application meets its goals. Similar to unit tests in software development, evaluations (or "evals") provide an objective framework for comparing application outputs against desired outcomes.
Unlike traditional software, LLM outputs can be non-deterministic and are highly sensitive to changes in prompts, parameters, or model versions. Systematic evaluation lets you measure the impact of these changes and improve key areas such as response quality, cost, and safety.
This guide provides a brief, hands-on example of a common classification task, followed by a comprehensive overview of evaluation strategies.
This example demonstrates a simple evaluation for a customer-support chatbot that must classify incoming queries into categories such as "Billing", "Technical", or "Account". The core process is: define a small evaluation dataset, have the model classify each query, compare its predictions against the expected categories, and compute an accuracy score.
First, ensure the Llama API client is installed:
pip install llama-api-client
Note on Inference Providers: This tutorial uses Llama API for demonstration purposes. However, you can run Llama models with any preferred inference provider. Common examples include Amazon Bedrock and Together AI.
The evaluation dataset consists of representative inputs (query) and the desired outputs (expected_category). This small set is for demonstration; a real-world evaluation would require a more comprehensive dataset, typically starting with at least a few hundred examples.
CATEGORIES = ["Billing", "Technical", "Account"]
TEST_DATA = [
{
"query": "My bill seems too high this month.",
"expected_category": "Billing",
},
{
"query": "I can't log into the dashboard.",
"expected_category": "Technical",
},
{
"query": "How do I update my email address?",
"expected_category": "Account",
},
{
"query": "What payment methods do you accept?",
"expected_category": "Billing",
},
{
"query": "The app keeps crashing on my phone.",
"expected_category": "Technical",
},
]
The classify_query function sends each query to the API. The system_prompt instructs the model to act as a classifier and respond only with a category name. We set temperature=0.0 to make the model's output more deterministic and reliable for this classification task.
import os
from llama_api_client import LlamaAPIClient
# Setup: API Key and Client
client = LlamaAPIClient(api_key=os.environ.get("LLAMA_API_KEY"))
def classify_query(query):
system_prompt = (
"Classify the user query into one of the following categories: "
f"{', '.join(CATEGORIES)}. Respond ONLY with the category name."
)
response = client.chat.completions.create(
model="Llama-4-Maverick-17B-128E-Instruct-FP8",
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": query},
],
max_completion_tokens=10,
temperature=0.0,
)
return response.completion_message.content.text.strip()
Here, we loop through the evaluation dataset and compare the model's prediction to the expected category. For this classification task, we use accuracy (the percentage of correctly classified queries) as our evaluation metric.
correct_predictions = 0
print("Running evaluation...")
for item in TEST_DATA:
predicted = classify_query(item["query"])
expected = item["expected_category"]
is_correct = predicted == expected
if is_correct:
correct_predictions += 1
print(
f"Query: '{item['query']}' | Expected: {expected} | "
f"Predicted: {predicted} | Correct: {is_correct}"
)
# Calculate final accuracy
total_items = len(TEST_DATA)
accuracy = (correct_predictions / total_items) * 100 if total_items > 0 else 0
print("\n--- Evaluation summary ---")
print(f"Total queries evaluated: {total_items}")
print(f"Correct predictions: {correct_predictions}")
print(f"Accuracy: {accuracy:.2f}%")
The output should show that the model correctly classifies all queries.
Running evaluation...
Query: 'My bill seems too high this month.' | Expected: Billing | Predicted: Billing | Correct: True
Query: 'I can't log into the dashboard.' | Expected: Technical | Predicted: Technical | Correct: True
Query: 'How do I update my email address?' | Expected: Account | Predicted: Account | Correct: True
Query: 'What payment methods do you accept?' | Expected: Billing | Predicted: Billing | Correct: True
Query: 'The app keeps crashing on my phone.' | Expected: Technical | Predicted: Technical | Correct: True
--- Evaluation summary ---
Total queries evaluated: 5
Correct predictions: 5
Accuracy: 100.00%
The preceding walkthrough offered a hands-on look at a simple classification evaluation. The following sections provide a comprehensive framework for testing your Llama application using a mix of automated and manual techniques. A systematic evaluation process enables you to measure the impact of changes, such as prompt engineering or fine-tuning, and verify that your customizations lead to meaningful performance gains. This guide focuses on the "how-to" of executing evaluations.
First, translate high-level product goals into concrete, measurable criteria. For a customer support agent, you might define criteria across several dimensions:
| Dimension | Metric | Example Target |
|---|---|---|
| Task performance | Resolution rate | >80% of billing queries resolved without human escalation. |
| Response quality | Helpfulness score (see LLM-as-judge) | Average score of >4.0/5.0. |
| Efficiency | Latency & token usage | <1.5s per turn, <500 tokens per turn |
| Safety | Harmful response rate | <0.1% failure rate using Llama Guard 4 on an internal red-teaming dataset. |
These criteria provide a clear benchmark for success.
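As an illustration, these targets can be encoded directly in your evaluation harness so that every run is checked against them automatically. The sketch below is not part of any library: the EVAL_TARGETS dictionary, metric names, and check_targets helper are illustrative assumptions.

# Hypothetical sketch: encode the targets above as thresholds that every
# evaluation run is checked against. Names and values are illustrative.
EVAL_TARGETS = {
    "resolution_rate": 0.80,        # >80% resolved without human escalation
    "helpfulness_score": 4.0,       # average LLM-as-judge score (1-5 scale)
    "latency_seconds": 1.5,         # maximum latency per turn
    "tokens_per_turn": 500,         # maximum tokens generated per turn
    "harmful_response_rate": 0.001, # maximum rate of unsafe responses
}

def check_targets(results):
    """Return pass/fail per metric, comparing measured results to targets."""
    return {
        "resolution_rate": results["resolution_rate"] >= EVAL_TARGETS["resolution_rate"],
        "helpfulness_score": results["helpfulness_score"] >= EVAL_TARGETS["helpfulness_score"],
        "latency_seconds": results["latency_seconds"] <= EVAL_TARGETS["latency_seconds"],
        "tokens_per_turn": results["tokens_per_turn"] <= EVAL_TARGETS["tokens_per_turn"],
        "harmful_response_rate": results["harmful_response_rate"] <= EVAL_TARGETS["harmful_response_rate"],
    }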
A high-quality, use-case-specific dataset is the foundation for reliable measurement. While public benchmarks offer general comparisons, custom datasets provide the most actionable insights. A strong dataset represents real-world usage, including common scenarios, edge cases, and adversarial inputs.
Each entry in your dataset is an (input, ground_truth) pair. Common methods for creating these pairs include manual curation by domain experts, sampling and labeling real queries from production logs, and synthetic generation with a teacher LLM.
These methods can be combined. For instance, a teacher LLM can provide initial labels for real-world data, which are then reviewed and corrected by human experts to combine scale with accuracy.
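As a minimal sketch of that combination, the snippet below uses the classify_query helper from the quick-start example as a stand-in teacher model to propose labels, then writes them out for human review. The unlabeled queries and the draft_eval_dataset.json filename are illustrative.

import json

# Hypothetical queries sampled from production logs (unlabeled).
unlabeled_queries = [
    "Why was I charged twice this month?",
    "The export button does nothing when I click it.",
]

# Use the model as a "teacher" to propose initial labels, then save them
# for human reviewers to confirm or correct before they join the dataset.
draft_dataset = [
    {"query": q, "proposed_category": classify_query(q), "human_verified": False}
    for q in unlabeled_queries
]

with open("draft_eval_dataset.json", "w") as f:
    json.dump(draft_dataset, f, indent=2)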
The following are examples for a customer support chatbot, stored in JSON format.
Example: Intent classification
This dataset tests the model's ability to categorize user queries. Note the inclusion of edge cases such as ambiguity, irrelevant information, and sarcasm.
[
{
"input_query": "I need to update my credit card on file.",
"ground_truth_category": "Billing"
},
{
"input_query": "The site is giving me a 500 error when I try to log in.",
"ground_truth_category": "Technical Support"
},
{
"input_query": "My bill is wrong and the login page is broken.",
"ground_truth_category": "Billing"
},
{
"input_query": "I just love spending my weekend trying to get your website to work. #not",
"ground_truth_category": "General Feedback"
}
]
Example: Summarization
This dataset evaluates the model's ability to extract key information from a support ticket thread.
[
{
"input_document": "User: My app keeps crashing. Agent: Did you try restarting it? User: Yes, multiple times. Agent: Can you send a crash log? User: It's 'log-xyz'. Agent: Thanks. Looks like a known bug in v1.2. We have a fix in v1.3, which is now available. Please update your app.",
"ground_truth_summary": "User's app crash is caused by a known bug in v1.2; updating to v1.3 will resolve the issue."
},
{
"input_document": "To Whom It May Concern, I am writing to express my profound disappointment. I ordered a premium subscription last Tuesday, and my credit card was charged twice. When I tried to log in, I got a 'password incorrect' error, and the 'Forgot Password' link didn't work. This entire experience has been a nightmare.",
"ground_truth_summary": "User was double-charged for their subscription and is unable to log in or reset their password."
}
]
Example: Chatbot response quality
For generative tasks, ground truth is often a high-quality reference response. This dataset tests for helpfulness, tone, and the ability to handle complex situations.
[
{
"conversation_context": "User: How do I reset my password?",
"ground_truth_response": "You can reset your password by navigating to Account Settings, selecting the 'Security' tab, and then clicking 'Update Password'."
},
{
"conversation_context": "User: It's not working.",
"ground_truth_response": "I'm sorry to hear that. To help me understand the problem, could you please tell me what you are trying to do and what happens when you try?"
},
{
"conversation_context": "User: I think my account has been hacked, I see logins from another country.",
"ground_truth_response": "We take security issues very seriously. I am escalating this to our security team immediately. They will investigate and reach out to you at your registered email address within the next 30 minutes."
}
]
A comprehensive strategy combines three approaches, trading off between scale and nuance: automated code-based metrics, LLM-as-judge evaluation, and human review.
Use code-based methods for tasks with clear, objective answers. They can be integrated into a CI/CD pipeline to automatically catch regressions.
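For example, a minimal pytest-style regression check might look like the sketch below; it assumes classify_query and TEST_DATA from the quick-start example are importable into the test module, and the 0.9 accuracy baseline is illustrative.

# Sketch of a CI regression check (e.g., run with pytest on every commit).
def compute_accuracy(test_data):
    correct = sum(
        1 for item in test_data
        if classify_query(item["query"]) == item["expected_category"]
    )
    return correct / len(test_data)

def test_classification_accuracy_regression():
    # Fail the pipeline if accuracy drops below the agreed baseline.
    assert compute_accuracy(TEST_DATA) >= 0.9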
An exact match metric checks if the model's output is identical to the ground truth. It is best for tasks with a single, definitive answer, such as classification.
def evaluate_exact_match(model_output, ground_truth):
return model_output.strip().lower() == ground_truth.strip().lower()
# For a classification task where the model must output "Billing"
evaluate_exact_match("Billing ", "billing") # True
A semantic similarity metric measures whether the meaning of an output is equivalent to the ground truth. It converts both texts into vector embeddings and calculates their cosine similarity; a score near 1.0 indicates strong alignment. The example below uses a lightweight sentence-transformers model for simplicity; in production, use whichever high-performance embedding model works best for your domain.
import numpy as np
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
def get_semantic_similarity(text1, text2):
embeddings = model.encode([text1, text2])
# Compute cosine similarity
return np.dot(embeddings[0], embeddings[1]) / (
np.linalg.norm(embeddings[0]) * np.linalg.norm(embeddings[1])
)
# For a chatbot response
generated = "Go to Account Settings, then 'Security' to update your password."
reference = "Reset your password via the 'Security' tab in Account Settings."
print(f"{get_semantic_similarity(generated, reference):.4f}") # e.g., 0.9315
Summarization metrics like ROUGE-L assess summary quality by measuring the longest common subsequence between a generated summary and a reference, reflecting sentence-level structure.
from rouge_score import rouge_scorer
scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)
def evaluate_summary_rouge(generated, reference):
scores = scorer.score(reference, generated)
return scores['rougeL'].fmeasure
# For a summarization task
generated = "App crash is from a v1.2 bug; update to v1.3 to fix."
reference = (
"The user's app crash is caused by a bug in v1.2; "
"updating to v1.3 will resolve it."
)
print(f"{evaluate_summary_rouge(generated, reference):.4f}") # e.g., 0.8525
Assessing subjective quality requires nuance that goes beyond algorithmic comparison. This evaluation, performed by either a human or a powerful LLM, relies on a rubric (a scoring guide that translates "good" into a measurable framework).
A good rubric provides clear, consistent guidelines. It can be defined as a reusable constant in your code:
EVALUATION_RUBRIC = """
- **Helpfulness (1-5)**: How well does the response address the user's query?
- 5: Directly and accurately answers, anticipating follow-up needs.
- 3: Partially addresses the query or provides generic information.
- 1: Provides unsafe or misleading content.
- **Clarity (1-5)**: How easy is the response to understand?
- 5: Clear, concise, and well-structured.
- 3: Mostly clear but could be more direct.
- 1: Confusing, verbose, or poorly structured.
"""
A highly effective, Llama-native approach is to use a powerful LLM as a "judge" to automate judgment at scale. This method is ideal for assessing subjective qualities such as helpfulness or tone, making it perfect for A/B testing prompt changes across thousands of examples.
The core of this method is a well-crafted prompt that instructs a powerful Llama model on how to apply your rubric.
Best Practice: Always use your most capable model as the judge. A model such as Llama-4-Maverick-17B-128E-Instruct-FP8 has the capacity to apply a rubric with high nuance and consistency. For reliable, structured output, use a low temperature and instruct the model to return JSON.
import os, json
from llama_api_client import LlamaAPIClient
client = LlamaAPIClient(api_key=os.environ.get("LLAMA_API_KEY"))
def evaluate_with_llama_judge(question, response):
"""Uses a Llama model to evaluate a response against a rubric."""
system_prompt = f"""
You are an expert evaluator. Your task is to assess a chatbot's response
based on the provided rubric and return a JSON object with your scores
and reasoning.
**Rubric**:
{EVALUATION_RUBRIC}
Provide your evaluation as a JSON object with two keys: `scores`
(containing `helpfulness` and `clarity`) and `reason`
(a single-sentence explanation).
"""
user_prompt = f"""
USER QUESTION: "{question}"
CHATBOT RESPONSE: "{response}"
"""
api_response = client.chat.completions.create(
model="Llama-4-Maverick-17B-128E-Instruct-FP8",
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt},
],
temperature=0.0,
)
return api_response.completion_message.content.text
# A good response to evaluate
judge_response = evaluate_with_llama_judge(
"How do I change my password?",
"Go to Account Settings, then 'Security'.",
)
print(judge_response)
# {
# "scores": {
# "helpfulness": 4,
# "clarity": 5
# },
# "reason": "The response is concise and clear, directly addressing the user's query."
# }
# A poor response to evaluate
judge_response = evaluate_with_llama_judge(
"How do I change my password?", "Reboot your computer."
)
print(judge_response)
# {
# "scores": {
# "helpfulness": 1,
# "clarity": 5
# },
# "reason": "The response is clear but completely unhelpful and potentially misleading."
# }
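Because the judge returns its verdict as a JSON string, you will typically parse it and aggregate scores across your whole dataset. The sketch below assumes an eval_items list of question/response pairs and judge output that parses cleanly as JSON; both are illustrative.

import json

def aggregate_judge_scores(eval_items):
    """Run the judge over (question, response) pairs and average the scores."""
    helpfulness, clarity = [], []
    for item in eval_items:
        raw = evaluate_with_llama_judge(item["question"], item["response"])
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed judge outputs rather than crashing the run
        helpfulness.append(parsed["scores"]["helpfulness"])
        clarity.append(parsed["scores"]["clarity"])
    return {
        "avg_helpfulness": sum(helpfulness) / len(helpfulness),
        "avg_clarity": sum(clarity) / len(clarity),
    }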
Human evaluation is the gold standard for assessing subjective quality and provides the most trustworthy feedback. Its primary role in a Llama-native workflow is to calibrate and validate your Llama-as-judge model. By comparing human scores to the LLM judge's scores on the same examples, you can adjust the judge's prompt or rubric to improve its alignment with human perception.
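A simple way to quantify that alignment is to compare the two sets of scores on a shared sample. The score lists in the sketch below are hypothetical; in practice they would come from your labeling tool and from parsed judge outputs.

# Sketch: measure judge/human alignment on examples scored by both.
human_scores = [5, 4, 2, 5, 3]  # hypothetical human ratings
judge_scores = [5, 4, 3, 4, 3]  # hypothetical LLM-judge ratings on the same items

agreement = sum(h == j for h, j in zip(human_scores, judge_scores)) / len(human_scores)
mean_abs_diff = sum(abs(h - j) for h, j in zip(human_scores, judge_scores)) / len(human_scores)

print(f"Exact agreement: {agreement:.0%}")               # 60%
print(f"Mean absolute difference: {mean_abs_diff:.2f}")  # 0.40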
To ensure reliable results, use a consistent framework that applies your rubric to gather ratings.
| Framework | Description |
|---|---|
| Binary pass/fail | A simple Yes/No judgment on a specific criterion (e.g., "Did the response follow instructions?"). |
| Likert scale (1-5) | Rating a response on a numerical scale based on your rubric to measure dimensions such as helpfulness or clarity. |
| Pairwise comparison | Showing two responses (e.g., from an A/B test) and asking an evaluator to choose the better one. |
A robust evaluation workflow layers these methods to leverage their respective strengths: automated metrics in CI/CD to catch regressions cheaply, an LLM judge to score subjective quality at scale, and periodic human review to calibrate the judge and validate overall quality.
This layered approach enables you to iterate quickly on your application while maintaining a high quality bar grounded in human judgment.
Evaluation scores are not an end goal; they are the start of an iterative improvement cycle. A successful evaluation process is a loop: analyze results to form a hypothesis, implement a change, and then re-evaluate to measure its impact.
An aggregate score such as "85% accuracy" can be misleading because not all errors carry the same weight. For instance, a chatbot may perform well on general queries but fail on urgent, security-related ones. A handful of such critical failures can be more damaging than a large number of minor mistakes. The first step in your analysis should be to understand the impact of different error types, not just their frequency.
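One way to make that explicit is to weight each failure by its severity rather than counting all errors equally. The categories and weights in the sketch below are hypothetical and should reflect your own assessment of business impact.

# Sketch: an impact-weighted error score instead of a raw error count.
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 25}  # illustrative weights

# Each failed case is tagged with a severity during review.
failures = [
    {"query": "What are your support hours?", "severity": "minor"},
    {"query": "I think my account has been hacked.", "severity": "critical"},
]

weighted_error = sum(SEVERITY_WEIGHTS[f["severity"]] for f in failures)
print(f"Impact-weighted error score: {weighted_error}")  # 26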
Once you have categorized your errors by impact, the next step is to find their root cause. This process, known as error analysis, moves you from a simple score to a qualitative understanding of why your application fails, allowing you to form a hypothesis and guide your improvement efforts.
This analysis turns a generic score into a set of specific problems to solve. For example, a pattern of factual inaccuracies for recent events points toward a problem in your RAG system's data pipeline, not necessarily the prompt itself.
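In practice, this often amounts to tagging each failed case with an error category during review and counting the patterns, as in the short sketch below (the categories are illustrative).

from collections import Counter

# Sketch: tag failures with an error category, then look for the dominant pattern.
failed_cases = [
    {"query": "Who won the award last night?", "error_category": "factual_inaccuracy"},
    {"query": "What changed in yesterday's release?", "error_category": "factual_inaccuracy"},
    {"query": "Cancel my subscription.", "error_category": "wrong_tone"},
]

print(Counter(case["error_category"] for case in failed_cases))
# Counter({'factual_inaccuracy': 2, 'wrong_tone': 1})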
Once you have a hypothesis, implement a fix and use your evaluation suite to measure its impact. Below are some approaches that may help improve your application.
Prompt engineering is the fastest and most cost-effective way to improve performance in the Llama ecosystem and should always be your first step. Use your evaluation suite to A/B test prompt variations and objectively measure their impact.
# Before: A generic instruction
system_prompt_A = "You are a helpful customer support agent."
# After: A refined instruction with specific guidance
system_prompt_B = """
You are a technical support specialist. When a user needs help, follow these steps:
1. Ask clarifying questions to understand the problem.
2. Provide a solution as a numbered list.
3. If uncertain, escalate to a human agent. Never guess.
"""
If prompt engineering falls short, a more capable or specialized model may be necessary. An evaluation-driven approach is critical for making an objective, data-backed decision on which Llama model offers the best trade-off for your task.
Your goal is to find the optimal balance of performance, latency, and cost that aligns with your product requirements.
- **Hypothesis**: The current model (Llama-4-Scout-17B-16E-Instruct) is not powerful enough to identify the most salient points in a long, jargon-filled conversation.
- **Experiment**: Run an A/B test against a more capable model (Llama-4-Maverick-17B-128E-Instruct). Compare them on quality, latency, and cost to make a holistic decision.

| Model | Avg. helpfulness (LLM-as-judge) | Avg. latency (ms) | Est. cost / 1M tokens |
|---|---|---|---|
| Llama-4-Scout-17B-16E | 3.8 / 5.0 | 800 | $0.30 |
| Llama-4-Maverick-17B-128E | 4.6 / 5.0 | 1400 | $0.65 |
In this scenario, Llama 4 Maverick offers a significant quality improvement for a moderate increase in cost and latency. Armed with this data, your team can decide whether the better user experience is worth the trade-off.
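Quality and latency numbers for a comparison like this can be collected with a small harness. The sketch below assumes the client and evaluate_with_llama_judge helper from earlier, measures latency client-side, and relies on judge output that parses as JSON; cost estimates come from your provider's published pricing, not from the harness itself.

import time, json

def profile_model(model_name, queries):
    """Sketch: average latency (ms) and judge-scored helpfulness for one model."""
    latencies, scores = [], []
    for q in queries:
        start = time.perf_counter()
        response = client.chat.completions.create(
            model=model_name,
            messages=[{"role": "user", "content": q}],
        )
        latencies.append(time.perf_counter() - start)
        reply = response.completion_message.content.text
        verdict = json.loads(evaluate_with_llama_judge(q, reply))
        scores.append(verdict["scores"]["helpfulness"])
    return {
        "avg_latency_ms": 1000 * sum(latencies) / len(latencies),
        "avg_helpfulness": sum(scores) / len(scores),
    }

# Example: profile_model("Llama-4-Scout-17B-16E-Instruct", eval_queries)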
In RAG systems, poor performance is often a retrieval problem, not a generation problem. If your evaluation reveals factual inaccuracies, always investigate the retrieval step—checking document chunks and queries—before modifying the generation prompt.
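As a sketch of evaluating retrieval in isolation, a simple hit-rate check tells you whether the right context was even available to the model before generation. The retrieve callable, document IDs, and evaluation pairs below are assumptions standing in for your own retriever and data.

# Sketch: does the gold document appear in the top-k retrieved chunks?
retrieval_eval_set = [
    {"query": "What payment methods do you accept?", "gold_doc_id": "billing_faq_02"},
    {"query": "How do I update my email address?", "gold_doc_id": "account_settings_01"},
]

def retrieval_hit_rate(retrieve, eval_set, k=5):
    """`retrieve` is a placeholder for your retriever: query -> list of {'id': ...} docs."""
    hits = 0
    for item in eval_set:
        retrieved_ids = [doc["id"] for doc in retrieve(item["query"], k=k)]
        hits += item["gold_doc_id"] in retrieved_ids
    return hits / len(eval_set)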
When other methods are insufficient, fine-tuning can specialize a Llama model for a unique task where general-purpose models fall short.
For example, use your Llama-as-judge evaluation suite to quantify the improvement in a dimension such as "brand voice adherence" between the base and fine-tuned models.

Follow these principles to build a reliable evaluation framework for your Llama application.