
RAG Deployment Strategies

Deploying RAG systems at scale requires careful consideration of containerization strategies, orchestration patterns, versioning approaches, and CI/CD integration. This page explores production-ready deployment patterns for enterprise RAG implementations on Azure Arc.


A production RAG system consists of multiple containerized services:

graph TB
    subgraph App[Application Layer]
        UI[Chat Interface<br/>React/Vue]
    end

    subgraph Orch[Orchestration Layer]
        Orchestrator[RAG Orchestration Service<br/>Query routing<br/>Context assembly<br/>Response formatting]
    end

    subgraph Core[Core Services]
        LLM[LLM Service<br/>Ollama/vLLM<br/>Quantized Models]
        Vector[Vector Search<br/>Weaviate/Qdrant/Milvus]
        Data[Data Layer<br/>Postgres<br/>MongoDB<br/>Elastic]
    end

    UI --> Orchestrator
    Orchestrator --> LLM
    Orchestrator --> Vector
    Orchestrator --> Data

    style App fill:#E8F4FD,stroke:#0078D4,stroke-width:2px,color:#000
    style Orch fill:#FFF4E6,stroke:#FF8C00,stroke-width:2px,color:#000
    style Core fill:#F3E8FF,stroke:#7B3FF2,stroke-width:2px,color:#000

Structure for RAG on AKS Arc:

apiVersion: v1
kind: Namespace
metadata:
  name: rag-system
---
# LLM Service Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-service
  namespace: rag-system
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llm-service
  template:
    metadata:
      labels:
        app: llm-service
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest  # pin a specific tag in production
          resources:
            requests:
              memory: "24Gi"
              nvidia.com/gpu: "1"
            limits:
              memory: "32Gi"
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: model-cache
              mountPath: /root/.ollama
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            # Shared by all replicas, so the volume must support ReadWriteMany
            claimName: llm-models-pvc
---
# Vector Database Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vector-db
  namespace: rag-system
spec:
  replicas: 3
  selector:
    matchLabels:
      app: vector-db
  template:
    metadata:
      labels:
        app: vector-db
    spec:
      containers:
        - name: weaviate
          image: semitechnologies/weaviate:latest  # pin a specific tag in production
          resources:
            requests:
              memory: "16Gi"
            limits:
              memory: "24Gi"
          volumeMounts:
            - name: vector-data
              mountPath: /var/lib/weaviate
      volumes:
        - name: vector-data
          persistentVolumeClaim:
            # All replicas share this claim; for per-pod storage use a StatefulSet
            claimName: vector-db-pvc

For large deployments, use a service mesh for advanced traffic management:

graph TB
    Ingress[Ingress LoadBalancer]

    subgraph ServiceMesh[Istio/Linkerd Service Mesh]
        Control[Control Plane]
    end

    LLM1[LLM Service v1.0<br/>80% Traffic]
    LLM2[LLM Service v1.1<br/>20% Traffic - Canary]
    Vector[Vector DB Service]
    DataConn[Data Connector Service]

    Ingress --> Control
    Control --> LLM1
    Control --> LLM2
    Control --> Vector
    Control --> DataConn

    style Ingress fill:#E8F4FD,stroke:#0078D4,stroke-width:2px,color:#000
    style ServiceMesh fill:#FFF4E6,stroke:#FF8C00,stroke-width:2px,color:#000
    style LLM1 fill:#D4E9D7,stroke:#107C10,stroke-width:2px,color:#000
    style LLM2 fill:#FFFACD,stroke:#FF8C00,stroke-width:2px,stroke-dasharray: 5 5,color:#000

Benefits:

  • Canary deployments (new models with 20% traffic)
  • Circuit breaking (fail-safe degradation)
  • Distributed tracing (latency visibility)
  • Automatic retry policies
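
In practice the 80/20 split shown in the diagram is configured in the mesh itself. Purely to illustrate the routing idea, here is a minimal Python sketch that assigns each request to a version by hashing a request ID into 100 buckets. The function name `pick_version` and the version labels are hypothetical, not part of Istio or Linkerd.

```python
import hashlib


def pick_version(request_id: str, canary_percent: int = 20) -> str:
    """Return 'v1.1' for roughly canary_percent% of request IDs, else 'v1.0'."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the hash to a stable bucket in 0..99 so a given ID always
    # lands on the same version while the percentage is unchanged.
    bucket = int.from_bytes(digest[:4], "big") % 100
    return "v1.1" if bucket < canary_percent else "v1.0"
```

Hashing rather than random selection keeps a given caller on one version for the duration of the canary, which makes before/after comparisons easier.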

Use Case: Horizontally scaled inference servers

Configuration:
- Replicas: 3-10 depending on load
- Each pod: 1-2 GPUs, 24-32 GB VRAM
- Shared model cache (volume)
- Load balancing: Round-robin or least-connection

Scaling Triggers:
- CPU > 80% → Add replica
- GPU memory > 90% → Add replica
- Latency p95 > 1s → Add replica
- Response queue > 50 → Add replica
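
The scaling triggers above can be sketched as a small decision function. This is an illustration only, assuming the four metrics are already collected elsewhere; `PodMetrics`, `scale_out_reasons`, and `desired_replicas` are hypothetical names, and in a cluster this logic would normally live in an autoscaler rather than application code.

```python
from dataclasses import dataclass


@dataclass
class PodMetrics:
    cpu_percent: float
    gpu_memory_percent: float
    latency_p95_seconds: float
    queue_depth: int


def scale_out_reasons(m: PodMetrics) -> list[str]:
    """List which of the documented thresholds are currently exceeded."""
    reasons = []
    if m.cpu_percent > 80:
        reasons.append("cpu")
    if m.gpu_memory_percent > 90:
        reasons.append("gpu_memory")
    if m.latency_p95_seconds > 1.0:
        reasons.append("latency_p95")
    if m.queue_depth > 50:
        reasons.append("queue_depth")
    return reasons


def desired_replicas(current: int, m: PodMetrics,
                     min_replicas: int = 3, max_replicas: int = 10) -> int:
    """Add one replica if any trigger fires, clamped to the 3-10 range."""
    target = current + 1 if scale_out_reasons(m) else current
    return max(min_replicas, min(max_replicas, target))
```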

Use Case: Persistent vector storage with replication

Configuration:
- StatefulSet (not Deployment)
- Persistent volumes per pod
- Multi-replica for HA
- Backup strategy: Nightly snapshots

Topology:
- Pod 0: Primary (read/write)
- Pod 1: Replica 1 (read-only)
- Pod 2: Replica 2 (read-only)

Write operations → Pod 0
Read operations → Round-robin (Pod 0, 1, 2)
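
The read/write split in this topology can be sketched in a few lines of Python. `ReplicaRouter` is a hypothetical helper for illustration; the pod names in the usage note follow the StatefulSet `<name>-<ordinal>` convention and assume the set is called `vector-db`.

```python
import itertools


class ReplicaRouter:
    """Send writes to the primary (pod 0) and spread reads across all pods."""

    def __init__(self, pods: list[str]):
        if not pods:
            raise ValueError("at least one pod is required")
        self.primary = pods[0]
        self._read_cycle = itertools.cycle(pods)

    def route(self, operation: str) -> str:
        if operation == "write":
            return self.primary
        if operation == "read":
            return next(self._read_cycle)  # round-robin over pods 0, 1, 2, ...
        raise ValueError(f"unknown operation: {operation!r}")
```

For example, `ReplicaRouter(["vector-db-0", "vector-db-1", "vector-db-2"])` returns `vector-db-0` for every write and cycles through all three pods for reads.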

Use Case: Real-time data synchronization

Configuration:
- CronJob for scheduled imports
- Service for real-time connectors
- Init containers for schema setup

Workflow:
1. Fetch from source (DB, API, file)
2. Process/tokenize
3. Generate embeddings
4. Index in vector DB
5. Update metadata
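
The workflow above can be sketched end to end in plain Python. Everything here is a stand-in: `toy_embedding` is a hash-based placeholder with no semantic meaning (a real pipeline would call an embedding model), and the "vector DB" and metadata store are ordinary dictionaries. All function names are hypothetical.

```python
import hashlib


def chunk_text(text: str, max_words: int = 50) -> list[str]:
    """Step 2: split a document into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def toy_embedding(chunk: str, dims: int = 8) -> list[float]:
    """Step 3 placeholder: deterministic vector derived from a hash (not semantic)."""
    digest = hashlib.sha256(chunk.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:dims]]


def ingest(documents: dict[str, str], index: dict, metadata: dict) -> int:
    """Run steps 2-5 for already-fetched documents; return the number of chunks indexed."""
    total = 0
    for doc_id, text in documents.items():
        chunks = chunk_text(text)
        for n, chunk in enumerate(chunks):
            index[f"{doc_id}#{n}"] = toy_embedding(chunk)  # step 4: index
        metadata[doc_id] = {"chunks": len(chunks)}          # step 5: metadata
        total += len(chunks)
    return total
```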

File Structure:

/models
├── llm
│   ├── phi-3-4b
│   │   ├── v1.0 (current production)
│   │   │   ├── model.gguf (quantized 4-bit)
│   │   │   ├── tokenizer.model
│   │   │   └── config.json
│   │   ├── v1.1 (canary testing)
│   │   └── v0.9 (previous stable)
│   └── mistral-7b
│       ├── v1.0
│       └── v1.1
├── embeddings
│   ├── bge-base-en
│   │   ├── v1.0 (current)
│   │   └── v1.1 (canary)
│   └── all-minilm-l6-v2
│       └── v1.0
└── indices
    ├── vector-index-v1
    ├── vector-index-v2
    └── vector-index-v3
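
A layout like this makes it straightforward to resolve a model path from its kind, name, and version. The helpers below are a hypothetical sketch that assumes version labels of the form `vMAJOR.MINOR`; they only build and compare paths and do not touch the filesystem.

```python
from pathlib import PurePosixPath


def model_dir(kind: str, name: str, version: str, root: str = "/models") -> PurePosixPath:
    """Build the directory for one model version, e.g. /models/llm/phi-3-4b/v1.0."""
    return PurePosixPath(root) / kind / name / version


def parse_version(label: str) -> tuple[int, ...]:
    """Turn 'v1.10' into (1, 10) so versions compare numerically, not as strings."""
    return tuple(int(part) for part in label.lstrip("v").split("."))


def newest(versions: list[str]) -> str:
    """Return the highest version label from a list such as ['v0.9', 'v1.0', 'v1.1']."""
    return max(versions, key=parse_version)
```

Comparing parsed tuples avoids the string-ordering trap where `v1.9` would sort after `v1.10`.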

Minimize downtime during model updates:

Current State (Blue): serving live traffic
┌──────────────────┐
│ LLM v1.0 (Prod)  │
│ Vector DB v1     │
│ Embeddings v1    │
└──────────────────┘

New State (Green - Staging): offline
┌──────────────────┐
│ LLM v1.1 (Test)  │
│ Vector DB v2     │
│ Embeddings v2    │
└──────────────────┘

Switchover:
1. Test Green fully
2. Redirect traffic to Green
3. Keep Blue as instant rollback
4. Blue becomes staging for next version
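
The switchover steps can be modeled as a tiny state machine. This is a sketch of the bookkeeping only, assuming some external mechanism (load balancer, Service selector, or mesh route) actually moves the traffic; the `BlueGreen` class is hypothetical.

```python
class BlueGreen:
    """Track which environment is live and which is idle (staging/rollback)."""

    def __init__(self, live: str = "blue", idle: str = "green"):
        self.live = live
        self.idle = idle

    def switch(self, idle_tests_passed: bool) -> str:
        """Steps 1-2: promote the idle environment only after it has been tested."""
        if not idle_tests_passed:
            raise RuntimeError("refusing to switch: idle environment has not passed tests")
        self.live, self.idle = self.idle, self.live
        return self.live

    def rollback(self) -> str:
        """Step 3: the previous environment is still running, so swap back."""
        self.live, self.idle = self.idle, self.live
        return self.live
```

After a successful `switch`, the old live environment becomes the idle one, which matches step 4: it is the staging target for the next version.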

Reduce risk with gradual rollout:

Timeline:
Hour 0: Deploy v1.1 (10% traffic)
Hour 1: If metrics good, 25% traffic
Hour 2: If metrics good, 50% traffic
Hour 4: If metrics good, 100% traffic

Metrics to Monitor:
- Error rate < 1%
- Latency p95 < 1s
- Model confidence > 0.7
- User satisfaction > 4.5/5

Rollback Trigger:
- Error rate > 2%
- Latency p95 > 2s
- Hallucination rate > 10%
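
The promotion and rollback rules above can be written down as a pair of checks plus a step function. This is a minimal sketch assuming rates are expressed as fractions (0.01 for 1%) and that the metrics are measured elsewhere; the function names are hypothetical.

```python
TRAFFIC_STEPS = [10, 25, 50, 100]  # percentages from the timeline


def is_healthy(error_rate: float, latency_p95_s: float,
               confidence: float, satisfaction: float) -> bool:
    """All 'metrics to monitor' must be within their targets to promote."""
    return (error_rate < 0.01 and latency_p95_s < 1.0
            and confidence > 0.7 and satisfaction > 4.5)


def should_rollback(error_rate: float, latency_p95_s: float,
                    hallucination_rate: float) -> bool:
    """Any single rollback trigger is enough to abort the canary."""
    return error_rate > 0.02 or latency_p95_s > 2.0 or hallucination_rate > 0.10


def next_traffic_percent(current: int, healthy: bool, rollback: bool) -> int:
    """Return 0 on rollback, the next step when healthy, otherwise hold."""
    if rollback:
        return 0
    if not healthy or current not in TRAFFIC_STEPS:
        return current
    i = TRAFFIC_STEPS.index(current)
    return TRAFFIC_STEPS[min(i + 1, len(TRAFFIC_STEPS) - 1)]
```

Note the gap between the two sets of thresholds: an error rate of 1.5% is neither healthy enough to promote nor bad enough to roll back, so the canary holds at its current share.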

Source Code (Git)

1. Lint & Test
   - Code quality checks
   - Unit tests for prompt templates
   - Embedding validation
2. Build Containers
   - LLM service image
   - Vector DB image
   - Data connector image
3. Security Scanning
   - Vulnerability scan (Trivy)
   - Secret detection
   - Image signing
4. Push to Registry
   - Container registry (ACR)
   - Tag: :v1.2.3, :latest
5. Deploy to Dev
   - Automated deployment
   - Smoke tests
   - E2E tests
6. Deploy to Staging
   - Full test suite
   - Performance testing
   - User acceptance testing
7. Deploy to Production
   - Canary rollout (10% → 50% → 100%)
   - Continuous monitoring
   - Instant rollback capability
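
The key property of this pipeline is that each stage gates the next: a failed security scan must stop the image from being pushed. A minimal sketch of that fail-fast behavior, with stages represented as plain callables (the `run_pipeline` helper is hypothetical and stands in for whatever CI system is used):

```python
from typing import Callable, Optional


def run_pipeline(stages: list[tuple[str, Callable[[], bool]]]) -> tuple[list[str], Optional[str]]:
    """Run stages in order and stop at the first one that returns False.

    Returns (names of completed stages, name of the failed stage or None).
    """
    completed = []
    for name, step in stages:
        if not step():
            return completed, name  # later stages never run
        completed.append(name)
    return completed, None
```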

Infrastructure as Code approach:

1. Developer commits model update
   git commit -m "Update phi-3 to v1.1"
2. PR triggers CI pipeline
   - Build new container
   - Run tests
   - Generate deployment manifests
3. Merge PR to main branch
   - Argo CD detects change
   - Applies manifests to cluster
   - Monitors deployment health
4. Continuous Compliance
   - Policy checks (no unapproved images)
   - Quota validation (GPU, memory)
   - Audit logging
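
The compliance checks in step 4 are usually enforced by an admission or policy controller in the cluster. To show what such rules amount to, here is a small Python sketch; the registry prefix in the usage note and both function names are hypothetical.

```python
def unapproved_images(images: list[str], approved_prefixes: list[str]) -> list[str]:
    """Return the images that do not come from an approved registry prefix."""
    prefixes = tuple(approved_prefixes)
    return [image for image in images if not image.startswith(prefixes)]


def fits_quota(requested_gpus: int, requested_memory_gib: int,
               gpu_quota: int, memory_quota_gib: int) -> bool:
    """Check a deployment's GPU and memory requests against namespace quotas."""
    return requested_gpus <= gpu_quota and requested_memory_gib <= memory_quota_gib
```

For example, with `approved_prefixes=["contoso.azurecr.io/"]`, an image pulled directly from a public registry would be reported as unapproved.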

For disconnected/air-gapped environments:

  1. Prepare on-premises:

    • Download model artifacts
    • Pre-build container images
    • Create deployment manifests
    • Generate offline documentation
  2. Deploy to edge:

    • Load container images locally
    • Mount model files from NAS/storage
    • Configure with local DNS
    • Validate in air-gapped environment
  3. Updates:

    • Prepare update package offline
    • Validate on staging cluster first
    • Deploy with manual approval

Optimize for edge hardware (vs. cloud):

Cloud Deployment:        Edge Deployment:
─────────────────        ────────────────
- 8+ GPUs available      - 1-2 GPUs per node
- 128GB+ VRAM            - 24-32GB VRAM
- Unlimited scaling      - Fixed hardware
- High availability      - Local resilience

Strategies:
1. Model quantization (4-bit weights are about 75% smaller than 16-bit)
2. Smaller base models (7B vs. 70B)
3. Local caching (reduce network calls)
4. Batch processing (amortize overhead)
5. GPU sharing (time-slicing for multiple models)
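
The size reduction from quantization is simple arithmetic on bits per weight. The sketch below assumes a 16-bit baseline (the text does not name one) and counts weights only, ignoring the KV cache, activations, and runtime overhead, so real VRAM usage will be higher.

```python
def model_size_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB: parameters x bits / 8 bits per byte."""
    return params_billion * bits_per_weight / 8


def size_reduction(from_bits: int, to_bits: int) -> float:
    """Fraction of weight storage saved by moving from from_bits to to_bits."""
    return 1 - to_bits / from_bits
```

Under these assumptions a 7B model needs about 14 GB at 16 bits and about 3.5 GB at 4 bits, which is why it fits on a single 24 GB edge GPU while a 70B model (about 35 GB even at 4 bits) does not.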

When network bandwidth is limited:

  1. Model Delivery:

    • Pre-download models to edge
    • Use compression (20-50% reduction)
    • Incremental updates (delta sync)
  2. Data Ingestion:

    • Batch imports (daily/weekly vs. real-time)
    • Lossy compression for non-critical data
    • Local processing before cloud sync
  3. Monitoring:

    • Local metrics collection
    • Batch telemetry upload (hourly)
    • Low-bandwidth observability (sampling)
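
Sampling and batched upload can be combined in a small buffer that keeps one event in every N and hands back the accumulated batch when the upload window opens. `TelemetryBuffer` is a hypothetical sketch; the upload itself (and the hourly schedule) is left to the caller.

```python
class TelemetryBuffer:
    """Keep every Nth event locally and release them in batches."""

    def __init__(self, sample_every: int = 10, max_batch: int = 1000):
        if sample_every < 1:
            raise ValueError("sample_every must be at least 1")
        self.sample_every = sample_every
        self.max_batch = max_batch
        self._seen = 0
        self._pending: list[dict] = []

    def record(self, event: dict) -> bool:
        """Return True if the event was kept, False if sampled out or dropped."""
        self._seen += 1
        if (self._seen - 1) % self.sample_every != 0:
            return False  # sampled out: keeps the 1st, (N+1)th, (2N+1)th, ...
        if len(self._pending) >= self.max_batch:
            return False  # buffer full: drop rather than grow without bound
        self._pending.append(event)
        return True

    def flush(self) -> list[dict]:
        """Return the pending batch for upload and start a new one."""
        batch, self._pending = self._pending, []
        return batch
```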

| Issue | Cause | Solution |
|---|---|---|
| High latency (>2s) | Model too large | Quantize to fewer bits or use a smaller model |
| OOM errors | Insufficient VRAM | Enable disk offloading, use smaller batches |
| Vector search slow | Poor index | Rebuild with HNSW, reduce the number of vectors |
| Models not loaded | Network timeout | Pre-download models, check storage |
| High error rate | Hallucination | Increase retrieval context, improve prompt |

Kubernetes liveness & readiness probes:

livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 60
  periodSeconds: 10
  timeoutSeconds: 5
readinessProbe:
  httpGet:
    path: /ready
    port: 8000
  initialDelaySeconds: 30
  periodSeconds: 5
  successThreshold: 1

Health Endpoints:

GET /health
Response: {"status": "healthy", "uptime": 3600}
GET /ready
Response: {"ready": true, "models": 3, "vectors": 1000000}
GET /metrics
Response: Prometheus metrics (latency, throughput, errors)
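
A service can expose the `/health` and `/ready` endpoints with nothing more than the Python standard library. The sketch below is one possible shape, not the actual service code: `STATE`, the payload helpers, and `make_server` are hypothetical, readiness is assumed to mean "at least one model loaded", and `/metrics` is omitted because it would normally come from a Prometheus client library.

```python
import json
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

START_TIME = time.monotonic()
STATE = {"models": 0, "vectors": 0}  # the service updates this once loading finishes


def health_payload() -> dict:
    """Liveness: the process is up and able to answer."""
    return {"status": "healthy", "uptime": int(time.monotonic() - START_TIME)}


def ready_payload() -> dict:
    """Readiness: only true once at least one model is loaded (an assumption)."""
    return {"ready": STATE["models"] > 0, "models": STATE["models"], "vectors": STATE["vectors"]}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            self._send(200, health_payload())
        elif self.path == "/ready":
            body = ready_payload()
            # A non-2xx status makes the readiness probe fail until models load.
            self._send(200 if body["ready"] else 503, body)
        else:
            self._send(404, {"error": "not found"})

    def _send(self, status: int, body: dict) -> None:
        data = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def make_server(port: int = 8000) -> HTTPServer:
    """Build the server on the probe port; the caller decides when to run it."""
    return HTTPServer(("0.0.0.0", port), HealthHandler)
```

Calling `make_server().serve_forever()` starts the endpoints on port 8000, matching the probe configuration above. Keeping liveness and readiness separate means a pod that is still loading models is taken out of rotation without being restarted.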


Last Updated: October 21, 2025