At Meta, we’re pioneering an open source approach to generative AI development enabling everyone to benefit from our models and their powerful capabilities, while making it as easy as possible to build and innovate with trust by design. Our comprehensive system-level protections framework proactively identifies and mitigates potential risks, empowering developers to more easily deploy generative AI responsibly.
Llama Guard 4 is an 12B-parameter high-performance multimodal input and output moderation model designed to support developers to detect various common types of violating content.
It was built by pruning pre-trained Llama 4 Scout and finetuned and optimized to support the detection of the MLCommons standard taxonomy of hazards, catering to a range of developer use cases.
It supports 12 languages and works across modalities to detect and filter policy-violating inputs and outputs across text and images.
Llama Guard is designed to be usable across Llama model sizes, including Llama 4 Scout and Llama 4 Maverick.
Resources to get started with Llama Guard, such as how to fine-tune it for your own use case, are available in our Llama-recipe github repository.
Prompt Guard is a powerful tool for protecting LLM powered applications from malicious prompts to ensure their security and integrity.
Prompt Guard is designed to perform strongly and generalize well to new settings and distributions of adversarial attacks. We use a multilingual base model that significantly enhances the model's ability to recognize prompt attacks in non-English languages, providing comprehensive protection for your application.
We’re releasing two versions of Prompt Guard 2 as open source so you can fine tune them to your specific application and use cases:
Prompt Guard 2 86M is an even more effective and robust classifier for detecting malicious prompts with reduced instances of false positives.
Prompt Guard 2 22M is a smaller, faster tool with substantially lower latency with minimal performance trade-offs outside of multilingual tasks.
LlamaFirewall is a security guardrail tool designed to enable building of secure AI systems. LlamaFirewall can orchestrate across guard models and work with our suite of protection tools to detect and prevent risks such as prompt injection, insecure code and risky tool interactions.
It’s designed to be agnostic to the agentic framework.
LlamaFirewall includes integrations to the suite of Llama Protections like Llama Guard, Prompt Guard, and CodeShield.
Code Shield provides support for inference-time filtering of insecure code produced by LLMs. This offers mitigation of insecure code suggestions risk and secure command execution for 7 programming languages with an average latency of 200ms.
In line with the principles outlined in our Developer Use Guide: AI Protections, we recommend thorough checking and filtering of all inputs to and outputs from LLMs based on your unique content guidelines for your intended use case and audience.
There is no one-size-fits-all guardrail detection to prevent all risks. This is why we encourage users to combine all our system level safety tools with other guardrails for your use cases.
Evaluations
We are sharing new updates to the industry’s first and most comprehensive set of open source cybersecurity safety evaluations for large language models (LLMs).
Cybersec Eval 4 expands on its predecessor by augmenting the suite of benchmarks to measure not only the risks, but also the defensive cybersecurity capabilities of AI systems. These new tests include a benchmark (AutoPatchBench) to evaluate an AI system’s capability to automatically patch security vulnerabilities in native code as well as a set of benchmarks (CyberSOCEval) that evaluate its ability to help run a security operations center (SOC) by accurately reasoning about security incidents, recognizing complex malicious activity in system logs, and reasoning about information extracted from threat intelligence reports.
Prompt injection attacks of LLM-based applications are attempts to cause the LLM to behave in undesirable ways. The Prompt Injection tests evaluate the ability to recognize which part of an input is untrusted and the level of resilience against common text and image based prompt injection techniques.
In Cybersec Eval 1 we introduced tests to measure an LLM's propensity to help carry out cyberattacks as defined in the industry standard MITRE Enterprise ATT&CK ontology of cyberattack methods. Cybersec Eval 2 added tests to measure the false rejection rate of confusingly benign prompts. These prompts are similar to the cyber attack compliance tests in that they cover a wide variety of topics including cyberdefense, but they are explicitly benign, even if they may appear malicious.
This suite consists of capture-the-flag style security test cases that simulate program exploitation. We then use an LLM as the security tool to determine whether it can reach a specific point in the program where a security issue has been intentionally inserted. In some of these tests we explicitly check if the tool can execute basic exploits such as SQL injections and buffer overflows.
Cybersec Eval 3 adds evaluations for LLM ability to conduct (1) multi-turn spear phishing campaigns and (2) autonomous offensive cyber operations.
Code interpreters allow LLMs to run code in a sandboxed environment. This set of prompts try to manipulate an LLM into executing malicious code to either gain access to the system that runs the LLM, gather helpful information about the system, craft and execute social engineering attacks, or gather information about the external infrastructure of the host environment.
In fostering a collaborative approach, we have partnered with ML Commons, the newly formed AI alliance, and a number of leading AI companies.
New open source benchmarks to evaluate the efficacy of AI systems to automate and scale security operation center (SOC) operations were developed in collaboration and partnership with CrowdStrike.