Self-hosted Llama deployments for regulated industries

Overview

Organizations that work in regulated industries like healthcare require complete control over user data to meet regulatory requirements, such as HIPAA, GDPR, and state privacy laws. Self-hosting Llama models eliminates data leakage risks by keeping protected health information (PHI) within your facility's secure perimeter while providing advanced AI capabilities for clinical decision support, medical documentation, and patient care optimization.

When you deploy Llama models on your own infrastructure, you download model weights directly from Meta and run them using optimized inference servers. This approach eliminates dependency on external APIs, addressing healthcare IT teams' primary concerns about data sovereignty, privacy compliance, and intellectual property protection. Your patient data never leaves your controlled environment, and you maintain complete visibility into all model interactions through comprehensive audit logging.

Self-hosted deployment provides deterministic performance with dedicated GPU resources. You control the entire stack from model weights to inference server configuration, ensuring stability for validated clinical workflows while maintaining flexibility to adopt new optimizations as they become available.

Architecture patterns

Healthcare deployments require choosing an architecture pattern that balances security requirements with operational complexity. Each pattern provides complete data isolation while supporting different operational models.

Air-gapped deployment

Air-gapped environments provide maximum security by completely isolating your infrastructure from external networks. Model weights transfer via physical media following chain-of-custody procedures. Your inference servers operate without internet connectivity, ensuring PHI never has a path to external systems. Updates happen through controlled processes where new model versions undergo validation in isolated test environments before promotion to production.
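Integrity of the transferred weights can be confirmed against a checksum manifest generated on the source system before the media leaves it. The sketch below is illustrative only; the manifest format and paths are assumptions, not part of Meta's distribution tooling:

# Verify weight integrity after physical-media transfer (conceptual)
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_weights(weights_dir: str, manifest_file: str) -> bool:
    # Each manifest line: "<sha256>  <relative/path>", produced on the source system
    ok = True
    for line in Path(manifest_file).read_text().splitlines():
        expected, rel_path = line.split(maxsplit=1)
        if sha256_of(Path(weights_dir) / rel_path) != expected:
            print(f"MISMATCH: {rel_path}")
            ok = False
    return ok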

This pattern suits environments with the highest security requirements, such as genomics research facilities or military medical centers. The operational overhead of managing air-gapped systems is the trade-off for a dramatic reduction in attack surface, particularly from external threats.

Private network deployment

Private network deployment balances security with operational efficiency using network isolation rather than complete air-gapping. Your Llama models run on servers within private subnets with no direct internet routing. Model weights are stored in your internal artifact repository, and inference servers expose endpoints only to authorized clinical applications.

External communication happens through carefully controlled gateways that allow specific outbound connections for security updates and monitoring data stripped of PHI. This pattern suits most hospital systems that need strong security while maintaining manageable operations.

# Network isolation configuration (conceptual)
network:
  type: private_vpc
  subnets:
    inference: 10.0.1.0/24  # No internet gateway
    management: 10.0.2.0/24  # NAT gateway for updates
  firewall_rules:
    - allow: clinical_apps -> inference:8000
    - deny: inference -> internet
    - allow: management -> security_updates

Hybrid deployment

Hybrid architectures separate workloads based on data sensitivity. PHI processing stays on self-hosted infrastructure while general medical queries use cloud resources. A classification system internal to the private network inspects requests to detect protected health information and routes them appropriately. This optimizes costs by using cloud elasticity for non-sensitive tasks while maintaining compliance for patient data.
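As a sketch of the routing step, assuming a PHI classifier with the interface described later in this guide and two hypothetical endpoint URLs:

# PHI-aware routing for a hybrid deployment (conceptual)
SELF_HOSTED_URL = "https://llama.internal.example.org/v1"  # private-network inference (hypothetical)
CLOUD_URL = "https://llama-cloud.example.com/v1"           # elastic capacity for non-PHI work (hypothetical)

def route_request(prompt: str, phi_detector) -> str:
    # Any detected PHI keeps the request on self-hosted infrastructure.
    masked_text, findings = phi_detector.detect_and_mask(prompt)
    return SELF_HOSTED_URL if findings else CLOUD_URL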

Model selection and infrastructure

Selecting the appropriate Llama model depends on your use cases and available GPU resources. Each model size offers different tradeoffs between capability and resource requirements.

Model sizing guide

Model         | Use Cases                        | GPU Requirements       | Memory
Llama 3.3 70B | Complex reasoning, research      | 2× A100 80 GB or H100  | 140 GB (FP16)
Llama 3.2 11B | Clinical notes, decision support | 1× A100 40 GB or L40S  | 22 GB (FP16)
Llama 3.2 3B  | Transcription, simple Q&A        | 1× A10 24 GB or T4     | 6 GB (FP16)

For detailed GPU selection and cost analysis, refer to the Accelerator management guide. Consider Quantization techniques to reduce memory requirements while maintaining accuracy.
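A rough capacity check is parameter count multiplied by bytes per parameter. The snippet below reproduces the FP16 column in the table above and shows the effect of 8-bit and 4-bit quantization; activation and KV-cache memory come on top of these figures.

# Approximate weight memory by precision (conceptual)
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(params_billions: float, precision: str = "fp16") -> float:
    return params_billions * BYTES_PER_PARAM[precision]

print(weight_memory_gb(70))          # ~140 GB, matches Llama 3.3 70B at FP16
print(weight_memory_gb(70, "int4"))  # ~35 GB after 4-bit quantization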

Inference server selection

Choose your inference server based on performance requirements and operational complexity:

vLLM provides the highest throughput for production deployments with continuous batching, PagedAttention for memory efficiency, and OpenAI-compatible APIs. It integrates seamlessly with Hugging Face model formats and supports various quantization methods.

TensorRT-LLM offers maximum performance on NVIDIA GPUs through deep optimization and kernel fusion. While requiring more setup effort, it provides the lowest latency for real-time clinical decision support.

llama.cpp targets ultra-portable, on-device inference with minimal dependencies. It natively supports the efficient GGUF format and runs fully offline on CPUs, Apple Silicon, and modest GPUs. While it typically does not match the performance of vLLM or TensorRT-LLM on high-capacity servers, it is a good fit for laptops, edge devices, and embedded applications.

# vLLM configuration example (conceptual)
config = {
  "model": "meta-llama/Llama-3.2-11B-Instruct",
  "tensor_parallel_size": 1,
  "gpu_memory_utilization": 0.9,
  "max_model_len": 8192,
  "download_dir": "/models/weights",
  "trust_remote_code": False
}
# Additional security and performance settings...
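With the vLLM Python API, the settings above map onto the engine arguments; a minimal offline-inference sketch, assuming a recent vLLM release and weights already present in /models/weights:

# Loading the engine from the configuration above (conceptual)
from vllm import LLM, SamplingParams

llm = LLM(**config)  # config dict from the example above
outputs = llm.generate(
    ["Summarize the discharge instructions for a patient with type 2 diabetes."],
    SamplingParams(max_tokens=256, temperature=0.2),
)
print(outputs[0].outputs[0].text)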

Security and compliance

Healthcare deployments require comprehensive security controls that protect PHI while maintaining clinical usability. Every component implements defense-in-depth against both external threats and insider risks.

PHI detection and protection

Implement automatic PHI detection at the request ingestion layer using pattern matching and named entity recognition. The system identifies potential PHI including medical record numbers, Social Security numbers, and patient names. Detected PHI undergoes tokenization with reversible encryption, allowing the model to maintain context while preventing exposure in logs or error messages.

# PHI detection patterns (conceptual)
import re

class PHIDetector:
    patterns = {
        'ssn': r'\b\d{3}-\d{2}-\d{4}\b',
        'mrn': r'\bMRN:\s*\d{6,10}\b',
        'dob': r'\b\d{1,2}/\d{1,2}/\d{4}\b'
    }

    def detect_and_mask(self, text: str) -> tuple:
        # Returns masked text plus the type and span of each detected item
        findings = []
        masked = text
        for phi_type, pattern in self.patterns.items():
            for match in re.finditer(pattern, text):
                findings.append((phi_type, match.span()))
            masked = re.sub(pattern, f'[{phi_type.upper()}]', masked)
        return masked, findings
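A quick usage example under the same assumptions; the masked text is what may appear in logs or error messages, while reversible tokenization of the matched spans happens downstream:

# Example (conceptual)
detector = PHIDetector()
masked, findings = detector.detect_and_mask("MRN: 1234567, DOB 03/14/1982")
print(masked)    # "[MRN], DOB [DOB]"
print(findings)  # [('mrn', (0, 12)), ('dob', (18, 28))]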

Audit logging architecture

Comprehensive audit logging captures metadata about every model interaction while excluding PHI from log entries. Each request generates an audit record containing timestamp, user identity, department, request type, model version, and response status. These logs feed into your security information and event management (SIEM) system for real-time security monitoring and are used to generate compliance reports for regulatory audits.

{
  "timestamp": "2024-01-15T14:30:00Z",
  "user_id": "usr_12345",
  "department": "radiology",
  "model": "llama-3.2-11b",
  "request_hash": "sha256_hash",
  "tokens_used": 1250,
  "phi_detected": true,
  "latency_ms": 450
}
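A sketch of how such a record might be assembled, hashing the raw request so that no PHI reaches the log itself (field names follow the example above; shipping the record to your SIEM is out of scope here):

# Building a PHI-free audit record (conceptual)
import hashlib
from datetime import datetime, timezone

def audit_record(user_id, department, model, request_text, tokens_used, phi_detected, latency_ms):
    return {
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "user_id": user_id,
        "department": department,
        "model": model,
        # Only a hash of the request is stored, never the request text itself.
        "request_hash": hashlib.sha256(request_text.encode("utf-8")).hexdigest(),
        "tokens_used": tokens_used,
        "phi_detected": phi_detected,
        "latency_ms": latency_ms,
    }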

For general security patterns applicable beyond healthcare, see the Security in production guide.

Data retention and encryption

HIPAA requires audit documentation to be retained for at least six years (many organizations standardize on seven to satisfy stricter state requirements), enforced through automated lifecycle management. Model outputs remain available for quality review before automatic deletion. All storage uses encryption at rest with keys managed through your HSM infrastructure. Network traffic uses mTLS for internal communications with certificate rotation every 90 days.
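In object stores, retention is usually enforced with lifecycle rules; for file-based audit logs, a periodic sweep like the sketch below (paths and the seven-year window are illustrative) can feed the automated deletion workflow:

# Retention enforcement sweep for file-based audit logs (conceptual)
from datetime import datetime, timedelta, timezone
from pathlib import Path

RETENTION = timedelta(days=365 * 7)  # seven-year window used as the example here

def expired_logs(log_dir: str) -> list[Path]:
    now = datetime.now(timezone.utc)
    expired = []
    for path in Path(log_dir).glob("*.log"):
        modified = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
        if now - modified > RETENTION:
            expired.append(path)  # candidates for the automated deletion step
    return expired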

Deployment approach

Choose your deployment method based on operational requirements and existing infrastructure.

Containerized deployment

Docker containers provide consistency across environments and simplify deployment. Build minimal base images with only required dependencies, run containers as non-root users, and implement security scanning in your CI/CD pipeline.

# Secure container configuration (conceptual)
FROM nvidia/cuda:12.1-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y --no-install-recommends \
    python3.11 python3-pip && \
    rm -rf /var/lib/apt/lists/*
# Copy model weights and inference code
USER 1000:1000
EXPOSE 8000
# Security configurations and health checks...

Kubernetes orchestration

Kubernetes provides production-grade orchestration with automatic scaling and self-healing. Deploy using Helm charts configured for healthcare compliance, implement pod security policies, and use network policies to control traffic between namespaces.

# Kubernetes deployment structure (conceptual)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-inference
spec:
  replicas: 2
  template:
    spec:
      securityContext:
        runAsNonRoot: true
      containers:
      - name: vllm
        resources:
          limits:
            nvidia.com/gpu: 1
        env:
        - name: TRANSFORMERS_OFFLINE
          value: "1"  # Prevent downloads

High availability patterns

Implement high availability through multiple inference replicas behind a load balancer. Use health checks to detect and route around failed instances. Maintain warm standby models to minimize cold start latency. Configure automatic failover between primary and secondary sites for disaster recovery.
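For the health-check piece, the load balancer or a small probe loop only needs a cheap liveness signal from each replica. The sketch below assumes each inference server exposes a /health endpoint (vLLM's OpenAI-compatible server does; verify for your stack) and uses hypothetical replica URLs:

# Replica health probing for failover decisions (conceptual)
import urllib.request

REPLICAS = [
    "http://llama-a.internal:8000",  # hypothetical primary replica
    "http://llama-b.internal:8000",  # hypothetical warm standby
]

def healthy_replicas(timeout_s: float = 2.0) -> list[str]:
    live = []
    for base in REPLICAS:
        try:
            with urllib.request.urlopen(f"{base}/health", timeout=timeout_s) as resp:
                if resp.status == 200:
                    live.append(base)
        except OSError:
            pass  # timeouts and connection errors count as unhealthy
    return live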

Monitoring and operations

Effective monitoring ensures performance, compliance, and cost optimization without exposing PHI.

Key metrics

Monitor system health through GPU utilization, memory usage, and inference latency. Track request patterns by department to identify usage trends and capacity needs. Measure model performance through tokens per second and queue depth. Generate compliance metrics including PHI detection rates and audit log completeness.

# Prometheus metrics configuration (conceptual)
metrics:
  - name: llama_request_duration_seconds
    type: histogram
    labels: [department, model]
  - name: llama_gpu_utilization_percent
    type: gauge
    labels: [gpu_id]
  - name: llama_phi_detections_total
    type: counter
    labels: [department, phi_type]
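If the request gateway is written in Python, the same metrics can be exposed with the prometheus_client library; a minimal sketch (the port and label usage are assumptions):

# Exposing the metrics above with prometheus_client (conceptual)
from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUEST_DURATION = Histogram("llama_request_duration_seconds",
                             "End-to-end request latency", ["department", "model"])
GPU_UTILIZATION = Gauge("llama_gpu_utilization_percent",
                        "GPU utilization sampled per device", ["gpu_id"])
PHI_DETECTIONS = Counter("llama_phi_detections_total",
                         "Requests in which PHI was detected", ["department", "phi_type"])

start_http_server(9400)  # scrape target for Prometheus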

Compliance validation

Schedule quarterly compliance audits validating encryption, access controls, and audit log integrity. Test disaster recovery procedures monthly including failover and data restoration. Document all security controls with evidence for regulatory reviews. Maintain runbooks for incident response and security breach procedures.

For cost optimization strategies, refer to the Cost projection guide which includes TCO calculations for self-hosted deployments.

Additional resources

After establishing your base deployment, enhance your implementation with these advanced capabilities:

  • Review the Fine-tuning guide to customize models for medical terminology
  • Implement the Evaluations framework for clinical accuracy testing
  • Explore Quantization techniques to optimize resource utilization
  • Study the prompting guide for prompt engineering tips that may be useful in clinical contexts
  • Consult the Accelerator management guide for detailed GPU configuration