Evaluations

Evaluation is the process of systematically measuring how well your Llama-powered application meets its goals. Similar to unit tests in software development, evaluations (or "evals") provide an objective framework to compare application outputs against desired outcomes.

Evaluating outputs is an essential step in measuring and improving the performance of your AI application. Unlike traditional software, LLM outputs can be non-deterministic and are highly sensitive to changes in prompts, parameters, or model versions. Systematic evaluation lets you measure the impact of these changes and improve key areas such as response quality, cost, and safety.

This guide provides a brief, hands-on example of a common classification task, followed by a comprehensive overview of evaluation strategies.

A practical example: Query intent classification

This example demonstrates a simple evaluation for a customer-support chatbot that must classify incoming queries into categories such as "Billing", "Technical", or "Account". The core process is:

  1. Define a small test set of queries and their correct categories, the "ground truth".
  2. Instruct a Llama model to act as a classifier.
  3. Run each query through the model using Llama API.
  4. Compare the model's prediction against the ground truth to calculate accuracy.

Prerequisites

First, ensure the Llama API client is installed:

pip install llama-api-client

Note on Inference Providers: This tutorial uses Llama API for demonstration purposes. However, you can run Llama models with any preferred inference provider. Common examples include Amazon Bedrock and Together AI.

Prepare the evaluation data

The evaluation dataset consists of representative inputs (query) and the desired outputs (expected_category). This small set is for demonstration; a real-world evaluation would require a more comprehensive dataset, typically starting with at least a few hundred examples.

CATEGORIES = ["Billing", "Technical", "Account"]
TEST_DATA = [
    {
        "query": "My bill seems too high this month.",
        "expected_category": "Billing",
    },
    {
        "query": "I can't log into the dashboard.",
        "expected_category": "Technical",
    },
    {
        "query": "How do I update my email address?",
        "expected_category": "Account",
    },
    {
        "query": "What payment methods do you accept?",
        "expected_category": "Billing",
    },
    {
        "query": "The app keeps crashing on my phone.",
        "expected_category": "Technical",
    },
]

Define the classification function

The classify_query function sends each query to the API. The system_prompt instructs the model to act as a classifier and respond only with a category name. We set temperature=0.0 to make the model's output more deterministic and reliable for this classification task.

import os
from llama_api_client import LlamaAPIClient

# Setup: API Key and Client
client = LlamaAPIClient(api_key=os.environ.get("LLAMA_API_KEY"))

def classify_query(query):
    system_prompt = (
        "Classify the user query into one of the following categories: "
        f"{', '.join(CATEGORIES)}. Respond ONLY with the category name."
    )

    response = client.chat.completions.create(
        model="Llama-4-Maverick-17B-128E-Instruct-FP8",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": query},
        ],
        max_completion_tokens=10,
        temperature=0.0,
    )

    return response.completion_message.content.text.strip()

Run the evaluation

Here, we loop through the evaluation dataset and compare the model's prediction to the expected category. For this classification task, we use accuracy (the percentage of correctly classified queries) as our evaluation metric.

correct_predictions = 0
print("Running evaluation...")
for item in TEST_DATA:
    predicted = classify_query(item["query"])
    expected = item["expected_category"]
    is_correct = predicted == expected
    if is_correct:
        correct_predictions += 1
    print(
        f"Query: '{item['query']}' | Expected: {expected} | "
        f"Predicted: {predicted} | Correct: {is_correct}"
    )

# Calculate final accuracy
total_items = len(TEST_DATA)
accuracy = (correct_predictions / total_items) * 100 if total_items > 0 else 0

print("\n--- Evaluation summary ---")
print(f"Total queries evaluated: {total_items}")
print(f"Correct predictions: {correct_predictions}")
print(f"Accuracy: {accuracy:.2f}%")

The output should show that the model correctly classifies all queries.

Running evaluation...
Query: 'My bill seems too high this month.' | Expected: Billing | Predicted: Billing | Correct: True
Query: 'I can't log into the dashboard.' | Expected: Technical | Predicted: Technical | Correct: True
Query: 'How do I update my email address?' | Expected: Account | Predicted: Account | Correct: True
Query: 'What payment methods do you accept?' | Expected: Billing | Predicted: Billing | Correct: True
Query: 'The app keeps crashing on my phone.' | Expected: Technical | Predicted: Technical | Correct: True

--- Evaluation summary ---
Total queries evaluated: 5
Correct predictions: 5
Accuracy: 100.00%

Evaluations deep dive

The preceding example offered a hands-on look at a simple classification eval. The following sections provide a comprehensive framework for testing your Llama application using a mix of automated and manual techniques. A systematic evaluation process enables you to measure the impact of changes, such as prompt engineering or fine-tuning, and verify that your customizations lead to meaningful performance gains. This guide focuses on the "how-to" of executing evaluations.

Define evaluation criteria

First, translate high-level product goals into concrete, measurable criteria. For a customer support agent, you might define criteria across several dimensions:

Dimension | Metric | Example target
Task performance | Resolution rate | >80% of billing queries resolved without human escalation.
Response quality | Helpfulness score (see LLM-as-judge) | Average score of >4.0/5.0.
Efficiency | Latency & token usage | <1.5s per turn, 500 tokens per turn.
Safety | Harmful response rate | <0.1% failure rate using Llama Guard 4 on an internal red-teaming dataset.

These criteria provide a clear benchmark for success.
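
To make these targets actionable, you can encode them directly in your evaluation harness so every run is checked against the same thresholds. A minimal sketch (the metric names and thresholds simply mirror the table above and are illustrative, not part of any Llama tooling):

# Illustrative targets mirroring the criteria table; adjust to your product goals.
EVAL_TARGETS = {
    "resolution_rate": {"target": 0.80, "higher_is_better": True},
    "helpfulness_score": {"target": 4.0, "higher_is_better": True},
    "latency_seconds": {"target": 1.5, "higher_is_better": False},
    "harmful_response_rate": {"target": 0.001, "higher_is_better": False},
}

def check_targets(results):
    """Compare measured metrics against the agreed targets.

    `results` maps metric name to a measured value, e.g. {"resolution_rate": 0.84}.
    Returns a dict of metric -> True/False (target met).
    """
    outcome = {}
    for metric, spec in EVAL_TARGETS.items():
        value = results.get(metric)
        if value is None:
            continue  # Metric not measured in this run
        if spec["higher_is_better"]:
            outcome[metric] = value >= spec["target"]
        else:
            outcome[metric] = value <= spec["target"]
    return outcome

print(check_targets({"resolution_rate": 0.84, "latency_seconds": 1.2}))
# {'resolution_rate': True, 'latency_seconds': True}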

Prepare a dataset

A high-quality, use-case-specific dataset is the foundation for reliable measurement. While public benchmarks offer general comparisons, custom datasets provide the most actionable insights. A strong dataset represents real-world usage, including common scenarios, edge cases, and adversarial inputs.

Building a dataset

Each entry in your dataset is an (input, ground_truth) pair. Common methods for creating these pairs include:

  • Human-labeled real-world data: Use anonymized production logs as inputs and have domain experts provide ground truth. This offers the highest relevance to user behavior.
  • Synthetic generation: Use a powerful "teacher" LLM to generate a large-scale dataset of diverse inputs and "golden" outputs. For details, see the Llama cookbook on Generating synthetic data for evals.
  • Manual authoring: Have domain experts write high-quality pairs from scratch to ensure coverage of critical scenarios.

These methods can be combined. For instance, a teacher LLM can provide initial labels for real-world data, which are then reviewed and corrected by human experts to combine scale with accuracy.

Example datasets

The following are examples for a customer support chatbot, stored in JSON format.

Example: Intent classification This dataset tests the model's ability to categorize user queries. Note the inclusion of edge cases such as ambiguity, irrelevant information, and sarcasm.

[
    {
        "input_query": "I need to update my credit card on file.",
        "ground_truth_category": "Billing"
    },
    {
        "input_query": "The site is giving me a 500 error when I try to log in.",
        "ground_truth_category": "Technical Support"
    },
    {
        "input_query": "My bill is wrong and the login page is broken.",
        "ground_truth_category": "Billing"
    },
    {
        "input_query": "I just love spending my weekend trying to get your website to work. #not",
        "ground_truth_category": "General Feedback"
    }
]

Example: Summarization This dataset evaluates the model's ability to extract key information from a support ticket thread.

[
    {
        "input_document": "User: My app keeps crashing. Agent: Did you try restarting it? User: Yes, multiple times. Agent: Can you send a crash log? User: It's 'log-xyz'. Agent: Thanks. Looks like a known bug in v1.2. We have a fix in v1.3, which is now available. Please update your app.",
        "ground_truth_summary": "User's app crash is caused by a known bug in v1.2; updating to v1.3 will resolve the issue."
    },
    {
        "input_document": "To Whom It May Concern, I am writing to express my profound disappointment. I ordered a premium subscription last Tuesday, and my credit card was charged twice. When I tried to log in, I got a 'password incorrect' error, and the 'Forgot Password' link didn't work. This entire experience has been a nightmare.",
        "ground_truth_summary": "User was double-charged for their subscription and is unable to log in or reset their password."
    }
]

Example: Chatbot response quality For generative tasks, ground truth is often a high-quality reference response. This dataset tests for helpfulness, tone, and the ability to handle complex situations.

[
    {
        "conversation_context": "User: How do I reset my password?",
        "ground_truth_response": "You can reset your password by navigating to Account Settings, selecting the 'Security' tab, and then clicking 'Update Password'."
    },
    {
        "conversation_context": "User: It's not working.",
        "ground_truth_response": "I'm sorry to hear that. To help me understand the problem, could you please tell me what you are trying to do and what happens when you try?"
    },
    {
        "conversation_context": "User: I think my account has been hacked, I see logins from another country.",
        "ground_truth_response": "We take security issues very seriously. I am escalating this to our security team immediately. They will investigate and reach out to you at your registered email address within the next 30 minutes."
    }
]

Dataset design principles

  • Be task-specific: Mirror your application's real-world task distribution.
  • Include edge cases: Test for ambiguous, irrelevant, or adversarial inputs. Caveat: While important, avoid over-focusing on solving every edge case, as achieving 100% accuracy is often an unrealistic goal.
  • Use focused test sets: Create smaller test sets targeting specific capabilities (such as tone, factuality, jargon).
  • Prevent data contamination: Never use evaluation data for training or fine-tuning. This invalidates results, as the model has already "seen" the answers.
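
A simple guard against contamination is to check for verbatim overlap between training and evaluation inputs before each fine-tuning run. A minimal sketch (the sample queries are illustrative; exact matching only catches literal duplicates, so consider a semantic similarity check for near-duplicates):

def find_contamination(train_inputs, eval_inputs):
    """Return evaluation inputs that also appear verbatim in the training data."""
    normalize = lambda text: text.strip().lower()
    train_set = {normalize(q) for q in train_inputs}
    return [q for q in eval_inputs if normalize(q) in train_set]

# Illustrative in-memory lists; in practice, load these from your dataset files.
train_inputs = ["My bill seems too high this month.", "How do I cancel my plan?"]
eval_inputs = ["My bill seems too high this month.", "The app keeps crashing on my phone."]
print(find_contamination(train_inputs, eval_inputs))
# ['My bill seems too high this month.']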

Executing evaluations

A comprehensive strategy combines three approaches, trading off between scale and nuance:

  • Code-based: Fast, scalable, and objective. Ideal for automated regression testing.
  • Model-based (Llama-as-judge): The recommended Llama-native approach. Blends the scalability of code with the nuance of human-like judgment.
  • Human-based: The gold standard for subjective quality, used to calibrate model-based evaluations.

Code-based evaluation for objective tasks

Use code-based methods for tasks with clear, objective answers. They can be integrated into a CI/CD pipeline to automatically catch regressions.

An exact match metric checks if the model's output is identical to the ground truth. It is best for tasks with a single, definitive answer, such as classification.

def evaluate_exact_match(model_output, ground_truth):
    return model_output.strip().lower() == ground_truth.strip().lower()

# For a classification task where the model must output "Billing"
evaluate_exact_match("Billing ", "billing") # True

A semantic similarity metric measures whether the meaning of an output is equivalent to the ground truth. It converts both texts into vector embeddings and calculates their cosine similarity; a score near 1.0 indicates strong alignment. We recommend using a high-performance, Meta-native model for generating embeddings.

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2') 

def get_semantic_similarity(text1, text2):
    embeddings = model.encode([text1, text2])
    # Compute cosine similarity
    return np.dot(embeddings[0], embeddings[1]) / (
        np.linalg.norm(embeddings[0]) * np.linalg.norm(embeddings[1])
    )

# For a chatbot response
generated = "Go to Account Settings, then 'Security' to update your password."
reference = "Reset your password via the 'Security' tab in Account Settings."
print(f"{get_semantic_similarity(generated, reference):.4f}") # e.g., 0.9315

Summarization metrics like ROUGE-L assess summary quality by measuring the longest common subsequence between a generated summary and a reference, reflecting sentence-level structure.

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)

def evaluate_summary_rouge(generated, reference):
    scores = scorer.score(reference, generated)
    return scores['rougeL'].fmeasure

# For a summarization task
generated = "App crash is from a v1.2 bug; update to v1.3 to fix."
reference = (
    "The user's app crash is caused by a bug in v1.2; "
    "updating to v1.3 will resolve it."
)
print(f"{evaluate_summary_rouge(generated, reference):.4f}") # e.g., 0.8525

Judgment-based evaluation (for subjective quality)

Assessing subjective quality requires nuance that goes beyond algorithmic comparison. This evaluation, performed by either a human or a powerful LLM, relies on a rubric (a scoring guide that translates "good" into a measurable framework).

A good rubric provides clear, consistent guidelines. It can be defined as a reusable constant in your code:

EVALUATION_RUBRIC = """
- **Helpfulness (1-5)**: How well does the response address the user's query?
    - 5: Directly and accurately answers, anticipating follow-up needs.
    - 3: Partially addresses the query or provides generic information.
    - 1: Provides unsafe or misleading content.
- **Clarity (1-5)**: How easy is the response to understand?
    - 5: Clear, concise, and well-structured.
    - 3: Mostly clear but could be more direct.
    - 1: Confusing, verbose, or poorly structured.
"""

Llama-as-judge

A highly effective, Llama-native approach is to use a powerful LLM as a "judge" to automate judgment at scale. This method is ideal for assessing subjective qualities such as helpfulness or tone, making it perfect for A/B testing prompt changes across thousands of examples.

The core of this method is a well-crafted prompt that instructs a powerful Llama model on how to apply your rubric.

Best Practice: Always use your most capable model as the judge. A model such as Llama-4-Maverick-17B-128E-Instruct-FP8 has the capacity to apply a rubric with high nuance and consistency. For reliable, structured output, use a low temperature and instruct the model to return JSON.

import os, json
from llama_api_client import LlamaAPIClient 

client = LlamaAPIClient(api_key=os.environ.get("LLAMA_API_KEY"))

def evaluate_with_llama_judge(question, response):
    """Uses a Llama model to evaluate a response against a rubric."""

    system_prompt = f"""
    You are an expert evaluator. Your task is to assess a chatbot's response
    based on the provided rubric and return a JSON object with your scores
    and reasoning.

    **Rubric**:
    {EVALUATION_RUBRIC}

    Provide your evaluation as a JSON object with two keys: `scores`
    (containing `helpfulness` and `clarity`) and `reason`
    (a single-sentence explanation).
    """

    user_prompt = f"""
    USER QUESTION: "{question}"
    CHATBOT RESPONSE: "{response}"
    """

    api_response = client.chat.completions.create(
        model="Llama-4-Maverick-17B-128E-Instruct-FP8",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        temperature=0.0,
    )
    return api_response.completion_message.content.text

# A good response to evaluate
judge_response = evaluate_with_llama_judge(
    "How do I change my password?",
    "Go to Account Settings, then 'Security'.",
)
print(judge_response)
# {
#   "scores": {
#     "helpfulness": 4,
#     "clarity": 5
#   },
#   "reason": "The response is concise and clear, directly addressing the user's query."
# }

# A poor response to evaluate
judge_response = evaluate_with_llama_judge(
    "How do I change my password?", "Reboot your computer."
)
print(judge_response)
# {
#   "scores": {
#     "helpfulness": 1,
#     "clarity": 5
#   },
#   "reason": "The response is clear but completely unhelpful and potentially misleading."
# }
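
Because the judge is instructed to return JSON, its output can be parsed with json.loads and aggregated across an entire dataset. A minimal sketch that reuses the judge_response value above and assumes the judge returns valid JSON (in practice, guard against malformed output):

def aggregate_judge_scores(judge_outputs):
    """Parse raw judge responses and average each rubric dimension."""
    totals = {"helpfulness": 0, "clarity": 0}
    count = 0
    for raw in judge_outputs:
        try:
            scores = json.loads(raw)["scores"]
        except (json.JSONDecodeError, KeyError):
            continue  # Skip malformed judge output
        totals["helpfulness"] += scores["helpfulness"]
        totals["clarity"] += scores["clarity"]
        count += 1
    return {name: total / count for name, total in totals.items()} if count else {}

print(aggregate_judge_scores([judge_response]))
# e.g., {'helpfulness': 1.0, 'clarity': 5.0}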

Human evaluation

Human evaluation is the gold standard for assessing subjective quality and provides the most trustworthy feedback. Its primary role in a Llama-native workflow is to calibrate and validate your Llama-as-judge model. By comparing human scores to the LLM judge's scores on the same examples, you can adjust the judge's prompt or rubric to improve its alignment with human perception.

To ensure reliable results, use a consistent framework that applies your rubric to gather ratings.

Framework | Description
Binary pass/fail | A simple Yes/No judgment on a specific criterion (e.g., "Did the response follow instructions?").
Likert scale (1-5) | Rating a response on a numerical scale based on your rubric to measure dimensions such as helpfulness or clarity.
Pairwise comparison | Showing two responses (e.g., from an A/B test) and asking an evaluator to choose the better one.
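
One simple way to quantify that alignment is to correlate the judge's Likert scores with human scores on the same examples. A minimal sketch with hypothetical scores:

import numpy as np

def judge_human_agreement(judge_scores, human_scores):
    """Pearson correlation between judge and human Likert scores (1-5)."""
    return float(np.corrcoef(np.array(judge_scores, dtype=float),
                             np.array(human_scores, dtype=float))[0, 1])

# Hypothetical scores for the same ten examples
judge_scores = [4, 5, 3, 4, 2, 5, 4, 3, 5, 4]
human_scores = [4, 4, 3, 5, 2, 5, 4, 2, 5, 4]
print(f"Judge-human correlation: {judge_human_agreement(judge_scores, human_scores):.2f}")

A low correlation is a signal to revise the judge's rubric or prompt before relying on its scores at scale.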

Combined evaluation strategies

A robust evaluation workflow layers these methods to leverage their respective strengths:

  1. Code-based tests for CI/CD: Integrate fast, objective tests (such as exact match for classification) into your CI/CD pipeline to provide an automated backstop against regressions.
  2. Llama-as-judge for iteration: Use Llama-as-judge to rapidly A/B test changes to prompts, RAG systems, or compare a base model against a fine-tuned version.
  3. Human-based for calibration: Periodically, use human evaluators to score a sample of outputs. Use these results to validate that your Llama-as-judge scores are aligned with human perception.

This layered approach enables you to iterate quickly on your application while maintaining a high quality bar grounded in human judgment.

Analyzing results and driving improvements

Evaluation scores are not an end goal; they are the start of an iterative improvement cycle. A successful evaluation process is a loop: analyze results to form a hypothesis, implement a change, and then re-evaluate to measure its impact.

Finding failure patterns

An aggregate score such as "85% accuracy" can be misleading because not all errors carry the same weight. For instance, a chatbot may perform well on general queries but fail on urgent, security-related ones. A handful of such critical failures can be more damaging than a large number of minor mistakes. The first step in your analysis should be to understand the impact of different error types, not just their frequency.

Once you have categorized your errors by impact, the next step is to find their root cause. This process, known as error analysis, moves you from a simple score to a qualitative understanding of why your application fails, allowing you to form a hypothesis and guide your improvement efforts.

A quick guide to error analysis:
  1. Isolate failures: Using your evaluation results (especially from Llama-as-judge), collect all examples that produced a wrong, low-quality, or unsafe response.
  2. Identify failure patterns: Review the failures to find common patterns. This moves you from a single score to a qualitative understanding of the issues. Common patterns include:
    • Factual inaccuracies: The model provides incorrect information.
    • Tone and style violations: The response does not adhere to brand voice guidelines or specific formatting instructions (such as failing to use a numbered list).
    • Instruction-following issues: The model disregards a specific constraint from the prompt.
    • False refusals: The model incorrectly refuses to answer a safe and valid query.
    • Safety issues: The model generates harmful, biased, or otherwise undesirable content.

This analysis turns a generic score into a set of specific problems to solve. For example, a pattern of factual inaccuracies for recent events points toward a problem in your RAG system's data pipeline, not necessarily the prompt itself.
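
A minimal sketch of this triage step (the score threshold, example records, and failure_pattern labels are illustrative; it assumes each evaluated example carries parsed judge scores and, where known, a failure tag):

from collections import Counter

def summarize_failures(evaluated_examples, min_helpfulness=3):
    """Collect low-scoring examples and tally their failure patterns."""
    failures = [
        example for example in evaluated_examples
        if example["scores"]["helpfulness"] < min_helpfulness
    ]
    pattern_counts = Counter(
        example.get("failure_pattern", "uncategorized") for example in failures
    )
    return failures, pattern_counts

# Hypothetical evaluated examples
examples = [
    {"query": "Reset my password", "scores": {"helpfulness": 5}},
    {"query": "Why was I charged twice?", "scores": {"helpfulness": 1},
     "failure_pattern": "factual inaccuracy"},
    {"query": "Is my account secure?", "scores": {"helpfulness": 2},
     "failure_pattern": "false refusal"},
]
failures, patterns = summarize_failures(examples)
print(f"{len(failures)} failures: {dict(patterns)}")
# 2 failures: {'factual inaccuracy': 1, 'false refusal': 1}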

Data-driven iteration

Once you have a hypothesis, implement a fix and use your evaluation suite to measure its impact. Below are some approaches that may help improve your application.

Prompt engineering

Prompt engineering is the fastest and most cost-effective way to improve performance in the Llama ecosystem and should always be your first step. Use your evaluation suite to A/B test prompt variations and objectively measure their impact.

  • Finding: A chatbot struggles with troubleshooting scenarios.
  • Hypothesis: The system prompt is too generic.
  • Action: Create a new, more specific system prompt and A/B test it against the original.
# Before: A generic instruction
system_prompt_A = "You are a helpful customer support agent."

# After: A refined instruction with specific guidance
system_prompt_B = """
You are a technical support specialist. When a user needs help, follow these steps:
1. Ask clarifying questions to understand the problem.
2. Provide a solution as a numbered list.
3. If uncertain, escalate to a human agent. Never guess.
"""

Model selection

If prompt engineering falls short, a more capable or specialized model may be necessary. An evaluation-driven approach is critical for making an objective, data-backed decision on which Llama model offers the best trade-off for your task.

Your goal is to find the optimal balance of performance, latency, and cost that aligns with your product requirements.

  • Finding: An agent's summaries of complex support tickets often miss critical technical details.
  • Hypothesis: The current model (Llama-4-Scout-17B-16E-Instruct) is not powerful enough to identify the most salient points in a long, jargon-filled conversation.
  • Action: Run the evaluation suite on both the current model and a more powerful alternative (such as Llama-4-Maverick-17B-128E-Instruct). Compare them on quality, latency, and cost to make a holistic decision.
Example evaluation: Model comparison for summarization

Model | Avg. helpfulness (LLM-as-judge) | Avg. latency (ms) | Est. cost / 1M tokens
Llama-4-Scout-17B-16E | 3.8 / 5.0 | 800 | $0.30
Llama-4-Maverick-17B-128E | 4.6 / 5.0 | 1400 | $0.65

In this scenario, Llama 4 Maverick offers a significant quality improvement for a moderate increase in cost and latency. Armed with this data, your team can decide if the better user experience is worth the trade-off.

Retrieval Augmented Generation (RAG)

In RAG systems, poor performance is often a retrieval problem, not a generation problem. If your evaluation reveals factual inaccuracies, always investigate the retrieval step—checking document chunks and queries—before modifying the generation prompt.
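
One quick diagnostic is to measure how often the retriever surfaces the document that actually contains the answer. A minimal sketch, assuming each evaluation item records an expected_doc_id and your retriever exposes a retrieve(query, k) function returning documents with an "id" field (both are assumptions, not a standard API):

def retrieval_hit_rate(eval_items, retrieve, k=5):
    """Fraction of queries whose expected source document appears in the top-k results."""
    hits = 0
    for item in eval_items:
        retrieved_ids = [doc["id"] for doc in retrieve(item["query"], k=k)]
        if item["expected_doc_id"] in retrieved_ids:
            hits += 1
    return hits / len(eval_items) if eval_items else 0.0

# A low hit rate points to chunking, embedding, or query issues,
# not to the generation prompt.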

Fine-tuning

When other methods are insufficient, fine-tuning can specialize a Llama model for a unique task where general-purpose models fall short.

  • Finding: After prompt engineering and testing more capable models, a financial chatbot still fails to adopt the precise, cautious, and disclaimer-heavy tone required for the industry.
  • Hypothesis: The model requires deep stylistic adaptation that only supervised fine-tuning can provide.
  • Action:
    1. Fine-tune a base Llama model on a high-quality dataset of several hundred examples demonstrating the target tone.
    2. Use your Llama-as-judge evaluation suite to quantify the improvement in "brand voice adherence" between the base and fine-tuned models.

Best practices

Follow these principles to build a reliable evaluation framework for your Llama application.

Key recommendations

  • Define goals upfront: Link evaluation criteria directly to product goals before writing code.
  • Build use-case-specific datasets: Prioritize custom datasets that mirror real-world usage, including edge cases and adversarial inputs.
  • Layer evaluation methods: Combine code-based tests (for CI/CD), Llama-as-judge (for rapid iteration), and human evaluation (for calibration).
  • Use Llama-as-judge for scale: Use a powerful Llama model with a clear rubric to automate the scoring of subjective qualities such as helpfulness or tone.
  • Analyze failure patterns with Llama-as-judge: Go beyond aggregate scores. Use your judge's outputs to diagnose root causes by finding patterns in low-scoring examples.
  • Test for safety and efficiency: Use tools such as Llama Guard to evaluate for potential harms, and always measure performance trade-offs such as latency and cost.

Common pitfalls

  • Don't contaminate your data: Never train or fine-tune on your evaluation dataset.
  • Don't rely on a single metric: An aggregate score is misleading. Use a mix of metrics for a complete picture.
  • Don't treat evaluation as a one-time task: Continuously evaluate throughout development and monitor performance in production.
  • Don't underestimate the effort: Building high-quality datasets and robust evaluation workflows requires a significant investment, but the effort is worthwhile.