RAG Deployment Strategies
Overview
Deploying RAG systems at scale requires careful consideration of containerization strategies, orchestration patterns, versioning approaches, and CI/CD integration. This page explores production-ready deployment patterns for enterprise RAG implementations on Azure Arc.
Container-Based RAG Deployment
RAG Component Architecture
A production RAG system consists of multiple containerized services:
graph TB
subgraph App[Application Layer]
UI[Chat Interface<br/>React/Vue]
end
subgraph Orch[Orchestration Layer]
Orchestrator[RAG Orchestration Service<br/>Query routing<br/>Context assembly<br/>Response formatting]
end
subgraph Core[Core Services]
LLM[LLM Service<br/>Ollama/vLLM<br/>Quantized Models]
Vector[Vector Search<br/>Weaviate/Qdrant/Milvus]
Data[Data Layer<br/>Postgres<br/>MongoDB<br/>Elastic]
end
UI --> Orchestrator
Orchestrator --> LLM
Orchestrator --> Vector
Orchestrator --> Data
style App fill:#E8F4FD,stroke:#0078D4,stroke-width:2px,color:#000
style Orch fill:#FFF4E6,stroke:#FF8C00,stroke-width:2px,color:#000
style Core fill:#F3E8FF,stroke:#7B3FF2,stroke-width:2px,color:#000
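
To make the orchestration layer's role concrete, here is a minimal Python sketch of context assembly and response formatting. The names `Passage`, `assemble_context`, and `answer` are hypothetical, and the `search` and `generate` callables stand in for whatever clients the vector search and LLM services expose; the character budget is an assumed simplification of a token budget.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Passage:
    text: str
    score: float  # similarity score returned by the vector search service


def assemble_context(passages: List[Passage], max_chars: int = 2000) -> str:
    """Concatenate the highest-scoring passages until the character budget is spent."""
    parts, used = [], 0
    for p in sorted(passages, key=lambda p: p.score, reverse=True):
        if used + len(p.text) > max_chars:
            break  # stop at the first passage that no longer fits
        parts.append(p.text)
        used += len(p.text)
    return "\n---\n".join(parts)


def answer(
    query: str,
    search: Callable[[str], List[Passage]],
    generate: Callable[[str], str],
) -> dict:
    """Retrieve passages, build a grounded prompt, and wrap the model output."""
    passages = search(query)
    context = assemble_context(passages)
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return {"answer": generate(prompt).strip(), "retrieved": len(passages)}
```

Passing the service clients in as callables keeps the orchestrator independent of any specific LLM or vector database, which matches the separation shown in the diagram.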
Kubernetes Deployment Manifest
Structure for RAG on AKS Arc:
apiVersion: v1
kind: Namespace
metadata:
  name: rag-system
---
# LLM Service Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-service
  namespace: rag-system
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llm-service
  template:
    metadata:
      labels:
        app: llm-service
    spec:
      containers:
      - name: ollama
        # The Docker Hub repository is ollama/ollama; pin a specific tag in production
        image: ollama/ollama:latest
        resources:
          requests:
            memory: "24Gi"
            nvidia.com/gpu: "1"
          limits:
            memory: "32Gi"
            nvidia.com/gpu: "1"
        volumeMounts:
        - name: model-cache
          mountPath: /root/.ollama
      volumes:
      # Shared by all replicas, so the PVC needs an access mode that allows it (e.g. ReadWriteMany)
      - name: model-cache
        persistentVolumeClaim:
          claimName: llm-models-pvc
---
# Vector Database Deployment (simplified; Pattern 2 recommends a StatefulSet for this tier)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vector-db
  namespace: rag-system
spec:
  replicas: 3
  selector:
    matchLabels:
      app: vector-db
  template:
    metadata:
      labels:
        app: vector-db
    spec:
      containers:
      - name: weaviate
        # The Docker Hub repository is semitechnologies/weaviate
        image: semitechnologies/weaviate:latest
        resources:
          requests:
            memory: "16Gi"
          limits:
            memory: "24Gi"
        volumeMounts:
        - name: vector-data
          mountPath: /var/lib/weaviate
      volumes:
      - name: vector-data
        persistentVolumeClaim:
          claimName: vector-db-pvc
Multi-Container Service Mesh
For large deployments, use a service mesh for advanced traffic management:
graph TB
Ingress[Ingress LoadBalancer]
subgraph ServiceMesh[Istio/Linkerd Service Mesh]
Control[Control Plane]
end
LLM1[LLM Service v1.0<br/>80% Traffic]
LLM2[LLM Service v1.1<br/>20% Traffic - Canary]
Vector[Vector DB Service]
DataConn[Data Connector Service]
Ingress --> Control
Control --> LLM1
Control --> LLM2
Control --> Vector
Control --> DataConn
style Ingress fill:#E8F4FD,stroke:#0078D4,stroke-width:2px,color:#000
style ServiceMesh fill:#FFF4E6,stroke:#FF8C00,stroke-width:2px,color:#000
style LLM1 fill:#D4E9D7,stroke:#107C10,stroke-width:2px,color:#000
style LLM2 fill:#FFFACD,stroke:#FF8C00,stroke-width:2px,stroke-dasharray: 5 5,color:#000
Benefits:
- Canary deployments (new models with 20% traffic)
- Circuit breaking (fail-safe degradation)
- Distributed tracing (latency visibility)
- Automatic retry policies
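
The 80/20 canary split in the diagram can be illustrated with a small Python sketch. In Istio or Linkerd the split is declared in routing configuration rather than written in application code, so this only shows the idea. The function name `pick_version` and the choice to hash a request ID (so that a given ID always lands on the same version) are assumptions of this sketch.

```python
import hashlib


def pick_version(request_id: str, canary_percent: int = 20) -> str:
    """Map a request ID to a stable bucket in [0, 100) and route by bucket."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    # Buckets below the canary percentage go to the new version.
    return "v1.1" if bucket < canary_percent else "v1.0"
```

Hashing instead of drawing a random number keeps routing deterministic per request ID, which makes canary behavior easier to reproduce when debugging.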
Kubernetes Orchestration Patterns
Pattern 1: Stateless LLM Service
Use Case: Horizontally scaled inference servers
Configuration:
- Replicas: 3-10 depending on load
- Each pod: 1-2 GPUs, 24-32GB VRAM
- Shared model cache (volume)
- Load balancing: Round-robin or least-connection
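
Because inference requests can vary widely in duration, least-connection balancing may spread load more evenly than round-robin. A minimal Python sketch of the selection rule follows; the function name and the replica names in the usage example are hypothetical, and in practice the load balancer or service mesh makes this decision.

```python
from typing import Dict


def pick_least_loaded(in_flight: Dict[str, int]) -> str:
    """Return the replica with the fewest in-flight requests (ties go to the lowest name)."""
    return min(sorted(in_flight), key=lambda name: in_flight[name])
```

For example, `pick_least_loaded({"llm-0": 4, "llm-1": 1, "llm-2": 3})` selects `llm-1`.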
Scaling Trigger:
- CPU > 80% → Add replica
- GPU memory > 90% → Add replica
- Latency p95 > 1s → Add replica
- Response queue > 50 → Add replica
Pattern 2: Stateful Vector Database
Use Case: Persistent vector storage with replication
Configuration:
- StatefulSet (not Deployment)
- Persistent volumes per pod
- Multi-replica for HA
- Backup strategy: Nightly snapshots
Topology:
- Pod 0: Primary (read/write)
- Pod 1: Replica 1 (read-only)
- Pod 2: Replica 2 (read-only)
Write operations → Pod 0
Read operations → Round-robin (Pod 0, 1, 2)
Pattern 3: Data Connector Service
Use Case: Real-time data synchronization
Configuration:
- CronJob for scheduled imports
- Service for real-time connectors
- Init containers for schema setup
Workflow:
1. Fetch from source (DB, API, file)
2. Process/tokenize
3. Generate embeddings
4. Index in vector DB
5. Update metadata
Versioning & Model Management
Model Versioning Strategy
File Structure:
/models
├── llm
│   ├── phi-3-4b
│   │   ├── v1.0 (current production)
│   │   │   ├── model.gguf (quantized 4-bit)
│   │   │   ├── tokenizer.model
│   │   │   └── config.json
│   │   ├── v1.1 (canary testing)
│   │   └── v0.9 (previous stable)
│   └── mistral-7b
│       ├── v1.0
│       └── v1.1
│
├── embeddings
│   ├── bge-base-en
│   │   ├── v1.0 (current)
│   │   └── v1.1 (canary)
│   └── all-minilm-l6-v2
│       └── v1.0
│
└── indices
    ├── vector-index-v1
    ├── vector-index-v2
    └── vector-index-v3
Blue-Green Deployment
Minimize downtime during model updates:
Current State (Blue):
┌──────────────────┐
│ LLM v1.0 (Prod)  │
│ Vector DB v1     │
│ Embeddings v1    │
└──────────────────┘
     ↑ Traffic

New State (Green - Staging):
┌──────────────────┐
│ LLM v1.1 (Test)  │
│ Vector DB v2     │
│ Embeddings v2    │
└──────────────────┘
     (offline)

Switchover:
1. Test Green fully
2. Redirect traffic to Green
3. Keep Blue as instant rollback
4. Blue becomes staging for next version
Canary Deployment
Reduce risk with gradual rollout:
Timeline:
  Hour 0: Deploy v1.1 (10% traffic)
  Hour 1: If metrics good, 25% traffic
  Hour 2: If metrics good, 50% traffic
  Hour 4: If metrics good, 100% traffic
Metrics to Monitor:
- Error rate < 1%
- Latency p95 < 1s
- Model confidence > 0.7
- User satisfaction > 4.5/5
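
The timeline and metric targets above amount to a simple promotion gate, sketched here in Python. The function names are hypothetical, and holding at the current traffic level when a target is missed is an assumption of this sketch; a real rollout controller would also apply its own rollback rules.

```python
STEPS = [10, 25, 50, 100]  # traffic percentages from the timeline


def metrics_healthy(error_rate: float, p95_latency_s: float,
                    confidence: float, satisfaction: float) -> bool:
    """Check the four promotion targets: error rate, p95 latency, confidence, satisfaction."""
    return (error_rate < 0.01 and p95_latency_s < 1.0
            and confidence > 0.7 and satisfaction > 4.5)


def next_traffic_percent(current: int, healthy: bool) -> int:
    """Advance one step when healthy; otherwise hold at the current level.

    Raises ValueError if `current` is not one of STEPS.
    """
    if not healthy or current == STEPS[-1]:
        return current
    return STEPS[STEPS.index(current) + 1]
```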
Rollback Trigger:
- Error rate > 2%
- Latency p95 > 2s
- Hallucination rate > 10%
CI/CD for RAG Systems
Build Pipeline
Source Code (Git)
    │
    ▼
1. Lint & Test
   - Code quality checks
   - Unit tests for prompt templates
   - Embedding validation
    │
    ▼
2. Build Containers
   - LLM service image
   - Vector DB image
   - Data connector image
    │
    ▼
3. Security Scanning
   - Vulnerability scan (Trivy)
   - Secret detection
   - Image signing
    │
    ▼
4. Push to Registry
   - Container registry (ACR)
   - Tag: :v1.2.3, :latest
    │
    ▼
5. Deploy to Dev
   - Automated deployment
   - Smoke tests
   - E2E tests
    │
    ▼
6. Deploy to Staging
   - Full test suite
   - Performance testing
   - User acceptance testing
    │
    ▼
7. Deploy to Production
   - Canary rollout (10% → 50% → 100%)
   - Continuous monitoring
   - Instant rollback capability
GitOps Workflow
Infrastructure as Code approach:
1. Developer commits model update
   git commit -m "Update phi-3 to v1.1"

2. PR triggers CI pipeline
   - Build new container
   - Run tests
   - Generate deployment manifests

3. Merge PR to main branch
   - ArgoCD detects change
   - Applies manifests to cluster
   - Monitors deployment health

4. Continuous Compliance
   - Policy checks (no unapproved images)
   - Quota validation (GPU, memory)
   - Audit logging
Edge-Specific Deployment Considerations
Offline-First Deployment
For disconnected/air-gapped environments:
- Prepare on-premises:
  - Download model artifacts
  - Pre-build container images
  - Create deployment manifests
  - Generate offline documentation
- Deploy to edge:
  - Load container images locally
  - Mount model files from NAS/storage
  - Configure with local DNS
  - Validate in air-gapped environment
- Updates:
  - Prepare update package offline
  - Validate on staging cluster first
  - Deploy with manual approval
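
One way to validate an offline package after it has been carried into the air-gapped site is to ship a checksum manifest with it. The Python sketch below assumes the package is a plain directory of model files and manifests; the function names are hypothetical, and the source does not prescribe this mechanism.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_file(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large model files are not loaded into memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root: Path) -> Dict[str, str]:
    """Run on the connected side: map each relative file path to its digest."""
    return {
        p.relative_to(root).as_posix(): sha256_file(p)
        for p in sorted(root.rglob("*")) if p.is_file()
    }


def verify_manifest(root: Path, manifest: Dict[str, str]) -> List[str]:
    """Run at the edge: return files that are missing or whose digest differs."""
    bad = []
    for rel, expected in manifest.items():
        p = root / rel
        if not p.is_file() or sha256_file(p) != expected:
            bad.append(rel)
    return bad
```

An empty list from `verify_manifest` indicates the package arrived intact; any entries could block the manual approval step.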
Resource Constraints
Optimize for edge hardware (vs. cloud):
Cloud Deployment:        Edge Deployment:
─────────────────        ────────────────
- 8+ GPUs available      - 1-2 GPUs per node
- 128GB+ VRAM            - 24-32GB VRAM
- Unlimited scaling      - Fixed hardware
- High availability      - Local resilience
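
To see how the 24-32GB edge VRAM budget constrains model choice, a rough Python estimate of weight size helps. This sketch counts weights only; the 20% headroom for KV cache and activations is an assumed placeholder, and real usage depends on context length and batch size.

```python
def weight_size_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: int,
         vram_gb: float, headroom: float = 0.2) -> bool:
    """Check whether the weights plus an assumed headroom fit in the given VRAM."""
    return weight_size_gb(params_billion, bits_per_weight) * (1 + headroom) <= vram_gb
```

Under these assumptions a 7B model needs about 14 GB at 16-bit and 3.5 GB at 4-bit (a 75% reduction), while a 70B model at 4-bit still needs about 35 GB and does not fit on a single 32 GB edge GPU.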
Strategies:
1. Model quantization (4-bit vs. 16-bit weights → ~75% size reduction)
2. Smaller base models (7B vs. 70B)
3. Local caching (reduce network calls)
4. Batch processing (amortize overhead)
5. GPU sharing (time-slicing for multiple models)
Low-Bandwidth Considerations
When network bandwidth is limited:
- Model Delivery:
  - Pre-download models to edge
  - Use compression (20-50% reduction)
  - Incremental updates (delta sync)
- Data Ingestion:
  - Batch imports (daily/weekly vs. real-time)
  - Lossy compression for non-critical data
  - Local processing before cloud sync
- Monitoring:
  - Local metrics collection
  - Batch telemetry upload (hourly)
  - Low-bandwidth observability (sampling)
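
The monitoring items above (local collection, batched upload, sampling) can be combined in a small buffer, sketched here in Python. The class name `TelemetryBuffer`, the rule of always keeping error events, and the gzip-compressed JSON Lines payload are assumptions of this sketch; the actual upload on an hourly schedule is left to the caller.

```python
import gzip
import json
import random
from typing import Dict, List, Optional


class TelemetryBuffer:
    """Keep errors, sample routine events, and emit one compressed batch per flush."""

    def __init__(self, sample_rate: float = 0.1, seed: Optional[int] = None) -> None:
        self.sample_rate = sample_rate
        self._rng = random.Random(seed)
        self._events: List[Dict] = []

    def record(self, event: Dict) -> None:
        # Always keep errors; keep other events with probability sample_rate.
        if event.get("level") == "error" or self._rng.random() < self.sample_rate:
            self._events.append(event)

    def flush(self) -> bytes:
        """Return buffered events as gzip-compressed JSON lines and clear the buffer."""
        payload = "\n".join(json.dumps(e, sort_keys=True) for e in self._events)
        self._events.clear()
        return gzip.compress(payload.encode("utf-8"))
```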
Troubleshooting Deployment Issues
Common Issues & Solutions
| Issue | Cause | Solution |
|---|---|---|
| High latency (>2s) | Model too large | Use lower-bit quantization, use a smaller model |
| OOM errors | Insufficient VRAM | Enable disk offloading, reduce batch size |
| Vector search slow | Poor index | Rebuild the index with HNSW, reduce the number of vectors |
| Models not loaded | Network timeout | Pre-download models, check storage |
| High error rate | Hallucination | Increase retrieval context, improve prompt |
Health Checks
Kubernetes liveness & readiness probes:
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 60
  periodSeconds: 10
  timeoutSeconds: 5

readinessProbe:
  httpGet:
    path: /ready
    port: 8000
  initialDelaySeconds: 30
  periodSeconds: 5
  successThreshold: 1
Health Endpoints:
GET /health
  Response: {"status": "healthy", "uptime": 3600}

GET /ready
  Response: {"ready": true, "models": 3, "vectors": 1000000}

GET /metrics
  Response: Prometheus metrics (latency, throughput, errors)
Related Topics
- Main Page: Edge RAG Implementation
- Vector Databases: Vector Databases & Indexing
- LLM Optimization: LLM Inference Optimization
- Operations: RAG Operations & Monitoring
- Assessment: RAG Implementation Knowledge Check
Last Updated: October 21, 2025