# RAG Deployment Strategies

## Overview

Deploying RAG systems at scale requires careful consideration of containerization strategies, orchestration patterns, versioning approaches, and CI/CD integration. This page explores production-ready deployment patterns for enterprise RAG implementations on Azure Arc.
## Container-Based RAG Deployment

### RAG Component Architecture

A production RAG system consists of multiple containerized services:
```mermaid
graph TB
    subgraph App[Application Layer]
        UI[Chat Interface<br/>React/Vue]
    end
    subgraph Orch[Orchestration Layer]
        Orchestrator[RAG Orchestration Service<br/>Query routing<br/>Context assembly<br/>Response formatting]
    end
    subgraph Core[Core Services]
        LLM[LLM Service<br/>Ollama/vLLM<br/>Quantized Models]
        Vector[Vector Search<br/>Weaviate/Qdrant/Milvus]
        Data[Data Layer<br/>Postgres<br/>MongoDB<br/>Elastic]
    end

    UI --> Orchestrator
    Orchestrator --> LLM
    Orchestrator --> Vector
    Orchestrator --> Data

    style App fill:#E8F4FD,stroke:#0078D4,stroke-width:2px,color:#000
    style Orch fill:#FFF4E6,stroke:#FF8C00,stroke-width:2px,color:#000
    style Core fill:#F3E8FF,stroke:#7B3FF2,stroke-width:2px,color:#000
```
### Kubernetes Deployment Manifest

A representative manifest structure for RAG on AKS Arc:
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: rag-system
---
# LLM Service Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-service
  namespace: rag-system
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llm-service
  template:
    metadata:
      labels:
        app: llm-service
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest   # pin a tested tag in production
          resources:
            requests:
              memory: "24Gi"
              nvidia.com/gpu: "1"
            limits:
              memory: "32Gi"
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: model-cache
              mountPath: /root/.ollama
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: llm-models-pvc
---
# Vector Database Deployment
# Note: for production persistence, prefer a StatefulSet (see Pattern 2 below).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vector-db
  namespace: rag-system
spec:
  replicas: 3
  selector:
    matchLabels:
      app: vector-db
  template:
    metadata:
      labels:
        app: vector-db
    spec:
      containers:
        - name: weaviate
          image: semitechnologies/weaviate:latest   # pin a tested tag in production
          resources:
            requests:
              memory: "16Gi"
            limits:
              memory: "24Gi"
          volumeMounts:
            - name: vector-data
              mountPath: /var/lib/weaviate
      volumes:
        - name: vector-data
          persistentVolumeClaim:
            claimName: vector-db-pvc
```
### Multi-Container Service Mesh

For large deployments, use a service mesh for advanced traffic management:
```mermaid
graph TB
    Ingress[Ingress LoadBalancer]
    subgraph ServiceMesh[Istio/Linkerd Service Mesh]
        Control[Control Plane]
    end
    LLM1[LLM Service v1.0<br/>80% Traffic]
    LLM2[LLM Service v1.1<br/>20% Traffic - Canary]
    Vector[Vector DB Service]
    DataConn[Data Connector Service]

    Ingress --> Control
    Control --> LLM1
    Control --> LLM2
    Control --> Vector
    Control --> DataConn

    style Ingress fill:#E8F4FD,stroke:#0078D4,stroke-width:2px,color:#000
    style ServiceMesh fill:#FFF4E6,stroke:#FF8C00,stroke-width:2px,color:#000
    style LLM1 fill:#D4E9D7,stroke:#107C10,stroke-width:2px,color:#000
    style LLM2 fill:#FFFACD,stroke:#FF8C00,stroke-width:2px,stroke-dasharray: 5 5,color:#000
```
Benefits:
- Canary deployments (new models with 20% traffic)
- Circuit breaking (fail-safe degradation)
- Distributed tracing (latency visibility)
- Automatic retry policies
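The 80/20 canary split in the diagram can be expressed declaratively. A minimal Istio sketch, assuming subsets `v1-0` and `v1-1` are defined in a matching DestinationRule (service and subset names are illustrative):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: llm-service
  namespace: rag-system
spec:
  hosts:
    - llm-service
  http:
    - route:
        - destination:
            host: llm-service
            subset: v1-0     # stable release
          weight: 80
        - destination:
            host: llm-service
            subset: v1-1     # canary release
          weight: 20
```

Shifting traffic is then just a matter of editing the two `weight` values, which fits the GitOps workflow described later on this page.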
## Kubernetes Orchestration Patterns

### Pattern 1: Stateless LLM Service

**Use Case:** Horizontally scaled inference servers

**Configuration:**
- Replicas: 3-10 depending on load
- Each pod: 1-2 GPUs, 24-32GB VRAM
- Shared model cache (volume)
- Load balancing: Round-robin or least-connection
**Scaling Triggers:**
- CPU > 80% → Add replica
- GPU memory > 90% → Add replica
- Latency p95 > 1s → Add replica
- Response queue > 50 → Add replica
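The CPU trigger above maps directly onto a HorizontalPodAutoscaler; a minimal sketch follows. GPU memory, latency, and queue-depth triggers require custom metrics (e.g. via the Prometheus adapter or KEDA) and are not shown:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-service-hpa
  namespace: rag-system
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-service
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80   # matches the "CPU > 80%" trigger
```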
### Pattern 2: Stateful Vector Database

**Use Case:** Persistent vector storage with replication

**Configuration:**

- StatefulSet (not Deployment)
- Persistent volumes per pod
- Multi-replica for HA
- Backup strategy: nightly snapshots

**Topology:**
- Pod 0: Primary (read/write)
- Pod 1: Replica 1 (read-only)
- Pod 2: Replica 2 (read-only)
Write operations go to Pod 0; read operations are load-balanced round-robin across Pods 0, 1, and 2.
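A skeleton StatefulSet for this topology; each replica gets its own PersistentVolumeClaim via `volumeClaimTemplates` (image tag and storage size are placeholders):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: vector-db
  namespace: rag-system
spec:
  serviceName: vector-db-headless
  replicas: 3
  selector:
    matchLabels:
      app: vector-db
  template:
    metadata:
      labels:
        app: vector-db
    spec:
      containers:
        - name: weaviate
          image: semitechnologies/weaviate:latest   # pin a tested tag in production
          volumeMounts:
            - name: vector-data
              mountPath: /var/lib/weaviate
  volumeClaimTemplates:            # one PVC per pod, retained across restarts
    - metadata:
        name: vector-data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi         # placeholder size
```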
### Pattern 3: Data Connector Service

**Use Case:** Real-time data synchronization

**Configuration:**

- CronJob for scheduled imports
- Service for real-time connectors
- Init containers for schema setup

**Workflow:**
1. Fetch from source (DB, API, file)
2. Process/tokenize
3. Generate embeddings
4. Index in vector DB
5. Update metadata
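The five workflow steps can be sketched in a few lines of Python. This is a toy illustration: `embed` is a deterministic stand-in for a real embedding model, and the "index" is an in-memory dict rather than a vector database.

```python
import hashlib
from dataclasses import dataclass, field

def embed(text: str, dim: int = 8) -> list[float]:
    """Step 3 (stand-in): deterministic toy embedding from a hash digest."""
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255 for b in digest[:dim]]

def chunk(text: str, size: int = 50) -> list[str]:
    """Step 2: split the fetched document into fixed-size chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]

@dataclass
class VectorIndex:
    """Steps 4-5 (stand-in): in-memory vector store with metadata."""
    vectors: dict = field(default_factory=dict)
    metadata: dict = field(default_factory=dict)

    def upsert(self, doc_id: str, chunks: list[str], source: str) -> None:
        for n, c in enumerate(chunks):
            self.vectors[f"{doc_id}:{n}"] = embed(c)   # step 3: embed
        self.metadata[doc_id] = {"source": source, "chunks": len(chunks)}

def ingest(index: VectorIndex, doc_id: str, text: str, source: str) -> dict:
    """Step 1 is assumed done (text already fetched); run steps 2-5."""
    index.upsert(doc_id, chunk(text), source)
    return index.metadata[doc_id]

idx = VectorIndex()
info = ingest(idx, "doc1", "x" * 120, "postgres")
print(info)  # {'source': 'postgres', 'chunks': 3}
```

In a real connector, `embed` would call an embedding model and `VectorIndex.upsert` would issue batch writes to Weaviate, Qdrant, or Milvus.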
## Versioning & Model Management

### Model Versioning Strategy

**File Structure:**
```
/models
├── llm
│   ├── phi-3-4b
│   │   ├── v1.0              (current production)
│   │   │   ├── model.gguf    (quantized 4-bit)
│   │   │   ├── tokenizer.model
│   │   │   └── config.json
│   │   ├── v1.1              (canary testing)
│   │   └── v0.9              (previous stable)
│   └── mistral-7b
│       ├── v1.0
│       └── v1.1
│
├── embeddings
│   ├── bge-base-en
│   │   ├── v1.0              (current)
│   │   └── v1.1              (canary)
│   └── all-minilm-l6-v2
│       └── v1.0
│
└── indices
    ├── vector-index-v1
    ├── vector-index-v2
    └── vector-index-v3
```
### Blue-Green Deployment

Minimize downtime during model updates:
```
Current State (Blue):
┌──────────────────┐
│ LLM v1.0 (Prod)  │
│ Vector DB v1     │
│ Embeddings v1    │
└──────────────────┘
         ↑
      Traffic

New State (Green - Staging):
┌──────────────────┐
│ LLM v1.1 (Test)  │
│ Vector DB v2     │
│ Embeddings v2    │
└──────────────────┘
      (offline)
```
**Switchover:**
1. Test Green fully
2. Redirect traffic to Green
3. Keep Blue as instant rollback
4. Blue becomes staging for next version
### Canary Deployment

Reduce risk with a gradual rollout:

**Timeline:**

- Hour 0: Deploy v1.1 with 10% of traffic
- Hour 1: If metrics hold, 25% of traffic
- Hour 2: If metrics hold, 50% of traffic
- Hour 4: If metrics hold, 100% of traffic
**Metrics to Monitor:**
- Error rate < 1%
- Latency p95 < 1s
- Model confidence > 0.7
- User satisfaction > 4.5/5
**Rollback Triggers:**
- Error rate > 2%
- Latency p95 > 2s
- Hallucination rate > 10%
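The promotion and rollback rules above can be encoded as a pure decision function. Thresholds are copied from the two lists; user satisfaction is omitted since it is rarely available in real time, and the metric names are illustrative:

```python
def canary_decision(error_rate: float, latency_p95: float,
                    confidence: float, hallucination_rate: float) -> str:
    """Return 'rollback', 'promote', or 'hold' for a canary release."""
    # Rollback triggers take precedence over promotion criteria.
    if error_rate > 0.02 or latency_p95 > 2.0 or hallucination_rate > 0.10:
        return "rollback"
    # Promote only when all monitored metrics are within target.
    if error_rate < 0.01 and latency_p95 < 1.0 and confidence > 0.7:
        return "promote"
    return "hold"  # inconclusive: keep the current traffic share

print(canary_decision(0.005, 0.8, 0.85, 0.02))  # promote
print(canary_decision(0.03, 0.8, 0.85, 0.02))   # rollback
```

A real pipeline would evaluate this function on each timeline step (Hour 0, 1, 2, 4) against metrics pulled from Prometheus before increasing the canary's traffic share.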
## CI/CD for RAG Systems

### Build Pipeline
```
Source Code (Git)
      │
      ▼
1. Lint & Test
   - Code quality checks
   - Unit tests for prompt templates
   - Embedding validation
      │
      ▼
2. Build Containers
   - LLM service image
   - Vector DB image
   - Data connector image
      │
      ▼
3. Security Scanning
   - Vulnerability scan (Trivy)
   - Secret detection
   - Image signing
      │
      ▼
4. Push to Registry
   - Container registry (ACR)
   - Tags: :v1.2.3, :latest
      │
      ▼
5. Deploy to Dev
   - Automated deployment
   - Smoke tests
   - E2E tests
      │
      ▼
6. Deploy to Staging
   - Full test suite
   - Performance testing
   - User acceptance testing
      │
      ▼
7. Deploy to Production
   - Canary rollout (10% → 50% → 100%)
   - Continuous monitoring
   - Instant rollback capability
```
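Stages 1-4 might look like the following in a GitHub Actions workflow. Registry name, paths, and tool choices (ruff, pytest, Trivy) are illustrative placeholders:

```yaml
name: rag-build
on:
  pull_request:
  push:
    branches: [main]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Lint and test              # stage 1
        run: |
          ruff check .
          pytest tests/
      - name: Build container            # stage 2
        run: docker build -t myregistry.azurecr.io/rag/llm-service:${{ github.sha }} services/llm
      - name: Scan image                 # stage 3: fail on high/critical CVEs
        run: trivy image --exit-code 1 --severity HIGH,CRITICAL myregistry.azurecr.io/rag/llm-service:${{ github.sha }}
      - name: Push to ACR                # stage 4: main branch only
        if: github.ref == 'refs/heads/main'
        run: |
          az acr login --name myregistry
          docker push myregistry.azurecr.io/rag/llm-service:${{ github.sha }}
```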
### GitOps Workflow

Infrastructure-as-Code approach:

1. **Developer commits a model update**
   - `git commit -m "Update phi-3 to v1.1"`
2. **PR triggers the CI pipeline**
   - Build new container
   - Run tests
   - Generate deployment manifests
3. **Merge PR to main branch**
   - ArgoCD detects the change
   - Applies manifests to the cluster
   - Monitors deployment health
4. **Continuous compliance**
   - Policy checks (no unapproved images)
   - Quota validation (GPU, memory)
   - Audit logging
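The ArgoCD side of this workflow is typically a single `Application` resource pointing at the manifest repository. A minimal sketch with a placeholder repo URL and path:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: rag-system
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/rag-manifests   # placeholder repo
    targetRevision: main
    path: overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: rag-system
  syncPolicy:
    automated:
      prune: true      # remove resources deleted from Git
      selfHeal: true   # revert manual drift back to the Git state
```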
## Edge-Specific Deployment Considerations

### Offline-First Deployment

For disconnected or air-gapped environments:
- **Prepare on-premises:**
  - Download model artifacts
  - Pre-build container images
  - Create deployment manifests
  - Generate offline documentation
- **Deploy to edge:**
  - Load container images locally
  - Mount model files from NAS/storage
  - Configure with local DNS
  - Validate in the air-gapped environment
- **Updates:**
  - Prepare the update package offline
  - Validate on a staging cluster first
  - Deploy with manual approval
### Resource Constraints

Optimize for edge hardware (vs. cloud):

| Cloud Deployment | Edge Deployment |
|---|---|
| 8+ GPUs available | 1-2 GPUs per node |
| 128GB+ VRAM | 24-32GB VRAM |
| Unlimited scaling | Fixed hardware |
| High availability | Local resilience |

**Strategies:**
1. Model quantization (4-bit vs. fp16 → ~75% smaller weights)
2. Smaller base models (7B vs. 70B)
3. Local caching (reduce network calls)
4. Batch processing (amortize overhead)
5. GPU sharing (time-slicing for multiple models)
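Strategy 1 is easy to sanity-check with back-of-envelope arithmetic on weight size (weights only; the KV cache and activations add more on top):

```python
def weight_size_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate model weight footprint in GB (decimal)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

fp16_7b = weight_size_gb(7, 16)   # 14.0 GB at fp16
q4_7b = weight_size_gb(7, 4)      # 3.5 GB at 4-bit
print(f"7B fp16: {fp16_7b} GB, 4-bit: {q4_7b} GB")
print(f"reduction: {1 - q4_7b / fp16_7b:.0%}")  # 75%
```

This is why a 4-bit 7B model fits comfortably in the 24-32GB VRAM budget of a typical edge node, while the same model at fp16 leaves little headroom.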
### Low-Bandwidth Considerations

When network bandwidth is limited:

- **Model delivery:**
  - Pre-download models to the edge
  - Use compression (20-50% reduction)
  - Incremental updates (delta sync)
- **Data ingestion:**
  - Batch imports (daily/weekly vs. real-time)
  - Lossy compression for non-critical data
  - Local processing before cloud sync
- **Monitoring:**
  - Local metrics collection
  - Batch telemetry upload (hourly)
  - Low-bandwidth observability (sampling)
## Troubleshooting Deployment Issues

### Common Issues & Solutions
| Issue | Likely Cause | Solution |
|---|---|---|
| High latency (>2s) | Model too large | Quantize more aggressively or use a smaller model |
| OOM errors | Insufficient VRAM | Enable disk offloading, reduce batch size |
| Slow vector search | Poorly tuned index | Rebuild with HNSW, reduce vector dimensionality |
| Models not loading | Network timeout | Pre-download models, check storage |
| High error rate | Hallucination | Increase retrieval context, improve prompts |
### Health Checks

Kubernetes liveness and readiness probes:
```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 60
  periodSeconds: 10
  timeoutSeconds: 5
readinessProbe:
  httpGet:
    path: /ready
    port: 8000
  initialDelaySeconds: 30
  periodSeconds: 5
  successThreshold: 1
```
**Health Endpoints:**

```
GET /health
Response: {"status": "healthy", "uptime": 3600}

GET /ready
Response: {"ready": true, "models": 3, "vectors": 1000000}

GET /metrics
Response: Prometheus metrics (latency, throughput, errors)
```
## Related Topics
- Main Page: Edge RAG Implementation
- Vector Databases: Vector Databases & Indexing
- LLM Optimization: LLM Inference Optimization
- Operations: RAG Operations & Monitoring
- Assessment: RAG Implementation Knowledge Check
*Last Updated: October 21, 2025*