Lab 3: Edge RAG Setup

🚧 Lab Under Development
The lab content is complete, but the hands-on exercises are still being validated and refined.
Expected Release: Q1 2026
You can review the lab steps and prepare your environment in advance.

Objective

Deploy a complete Edge RAG (Retrieval-Augmented Generation) solution on Azure Local, including a vector database, an embedding model, an LLM inference engine, and the RAG pipeline that connects them. This is the most comprehensive lab in the series, demonstrating AI at the edge end to end.


Pre-Lab Checklist

PREREQUISITES
═════════════════════════════════════════════════════════════

Required:
☐ Completion of Lab 1 (Azure Local) and Lab 2 (Arc)
☐ Azure subscription with resources from prior labs
☐ 8+ GB RAM available for containers
☐ 50+ GB disk space for models
☐ Docker/Podman installed locally
☐ Python 3.10+ (for RAG script)
☐ curl or Postman for API testing

Optional but Recommended:
☐ GPU (NVIDIA/AMD) for model acceleration
☐ Prior experience with LLMs
☐ Vector database knowledge (Weaviate/Qdrant)
☐ REST API debugging tools

Estimated Time: 3-4 hours
Difficulty: Advanced
Cost: $50-100 Azure credits (GPU usage)

Lab Architecture

EDGE RAG SYSTEM ARCHITECTURE
═════════════════════════════════════════════════════════════

Azure Local (On-Premises)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Edge RAG Solution                                       β”‚
β”‚                                                         β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ RAG Application Layer                              β”‚ β”‚
β”‚ β”‚ β”œβ”€ FastAPI RAG Endpoint (:8000)                    β”‚ β”‚
β”‚ β”‚ β”œβ”€ Document Ingestion Service                      β”‚ β”‚
β”‚ β”‚ └─ Query Processing Pipeline                       β”‚ β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”‚          ↓           ↓           ↓                      β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”        β”‚
β”‚ β”‚ Embedding   β”‚ Vector Store  β”‚ LLM Inferenceβ”‚        β”‚
β”‚ β”‚ Model       β”‚ (Weaviate)    β”‚ Engine       β”‚        β”‚
β”‚ β”‚ (MiniLM     β”‚ (:8080)       β”‚ (Ollama      β”‚        β”‚
β”‚ β”‚ Sentence-   β”‚               β”‚  :11434)     β”‚        β”‚
β”‚ β”‚ Transformer)β”‚               β”‚              β”‚        β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜        β”‚
β”‚                                                         β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ Storage & Persistence                              β”‚ β”‚
β”‚ β”‚ β”œβ”€ Volume: /data/weaviate (vector store)          β”‚ β”‚
β”‚ β”‚ β”œβ”€ Volume: /data/ollama (model cache)             β”‚ β”‚
β”‚ β”‚ └─ Volume: /data/documents (ingested docs)        β”‚ β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”‚                                                         β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ Monitoring & Logging                               β”‚ β”‚
β”‚ β”‚ β”œβ”€ Prometheus metrics (:9090)                      β”‚ β”‚
β”‚ β”‚ β”œβ”€ Loki logs aggregation (:3100)                   β”‚ β”‚
β”‚ β”‚ └─ Grafana dashboards (:3000)                      β”‚ β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
          ↓ (Arc Integration)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Azure (Monitoring & Backup)                             β”‚
β”‚ β”œβ”€ Azure Monitor ingests metrics                        β”‚
β”‚ β”œβ”€ Log Analytics receives logs                          β”‚
β”‚ └─ Storage Account backs up embeddings                  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Lab Steps

Step 1: Prepare Edge RAG Environment

Objective: Set up prerequisites and namespace for RAG system

Step 1.1: Create Namespace

# Create dedicated namespace for RAG
kubectl create namespace edge-rag

# Label namespace for monitoring
kubectl label namespace edge-rag monitoring=enabled

# Verify namespace
kubectl get namespace edge-rag

Expected Output: Namespace "edge-rag" created

Step 1.2: Create Storage for Models and Data

# Create PVC for persistent storage
@"
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rag-data-pvc
  namespace: edge-rag
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: weaviate-pvc
  namespace: edge-rag
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 30Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-pvc
  namespace: edge-rag
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
"@ | kubectl apply -f -

# Verify PVCs
kubectl get pvc -n edge-rag

Expected Output: Three PVCs created and bound (assumes the cluster has a default StorageClass)

Step 1.3: Create ConfigMap for RAG Configuration

# Create configuration for RAG pipeline
@"
apiVersion: v1
kind: ConfigMap
metadata:
  name: rag-config
  namespace: edge-rag
data:
  rag-settings.yaml: |
    vector_store:
      type: weaviate
      url: http://weaviate:8080
      batch_size: 50
      consistency_level: ALL
    
    embeddings:
      model: sentence-transformers/all-MiniLM-L6-v2
      device: cpu
      batch_size: 32
    
    llm:
      engine: ollama
      url: http://ollama:11434
      model: mistral
      temperature: 0.7
      max_tokens: 512
    
    retrieval:
      top_k: 5
      similarity_threshold: 0.7
      
    ingestion:
      chunk_size: 512
      chunk_overlap: 50
      document_path: /data/documents
"@ | kubectl apply -f -

# Verify ConfigMap
kubectl get configmap -n edge-rag

Expected Output: ConfigMap "rag-config" created
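
Note: the manifests later in this lab bake config.yaml into the container image (Step 4.1) rather than reading this ConfigMap. If you prefer the pods to consume the ConfigMap directly, the sketch below shows the extra entries for the rag-api Deployment in Step 4.4, alongside its existing volumeMounts and volumes (a hypothetical alternative, assuming the app reads /app/config.yaml):

        # Hypothetical addition: mount the rag-settings.yaml key as /app/config.yaml
        volumeMounts:
        - name: rag-config
          mountPath: /app/config.yaml
          subPath: rag-settings.yaml
      volumes:
      - name: rag-config
        configMap:
          name: rag-config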


Step 2: Deploy Vector Database (Weaviate)

Objective: Set up Weaviate vector database for embedding storage

Step 2.1: Deploy Weaviate Service

# Deploy Weaviate vector database
@"
apiVersion: apps/v1
kind: Deployment
metadata:
  name: weaviate
  namespace: edge-rag
spec:
  replicas: 1
  selector:
    matchLabels:
      app: weaviate
  template:
    metadata:
      labels:
        app: weaviate
    spec:
      containers:
      - name: weaviate
        image: semitechnologies/weaviate:1.18.0
        ports:
        - containerPort: 8080
          name: graphql
        - containerPort: 50051
          name: grpc
        env:
        - name: AUTHENTICATION_APIKEY_ENABLED
          value: "false"
        - name: PERSISTENCE_DATA_PATH
          value: /var/lib/weaviate
        # The RAG pipeline supplies its own vectors, so no vectorizer modules are needed
        - name: ENABLE_MODULES
          value: ""
        - name: DEFAULT_VECTORIZER_MODULE
          value: "none"
        resources:
          requests:
            memory: "2Gi"
            cpu: "1000m"
          limits:
            memory: "4Gi"
            cpu: "2000m"
        livenessProbe:
          httpGet:
            path: /v1/.well-known/live
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /v1/.well-known/ready
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
        volumeMounts:
        - name: weaviate-storage
          mountPath: /var/lib/weaviate
      volumes:
      - name: weaviate-storage
        persistentVolumeClaim:
          claimName: weaviate-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: weaviate
  namespace: edge-rag
spec:
  type: ClusterIP
  ports:
  - port: 8080
    targetPort: 8080
    name: graphql
  - port: 50051
    targetPort: 50051
    name: grpc
  selector:
    app: weaviate
"@ | kubectl apply -f -

# Wait for deployment
Write-Host "Weaviate deploying (2-3 minutes)..."
kubectl wait --for=condition=ready pod -l app=weaviate -n edge-rag --timeout=300s

# Verify service
kubectl get svc -n edge-rag
kubectl get pods -n edge-rag

Expected Output: Weaviate pod running, service created

Step 2.2: Verify Weaviate Health

# Port-forward to test locally (optional)
# kubectl port-forward -n edge-rag svc/weaviate 8080:8080 &

# Get Weaviate pod IP for testing
$weaviatePod = kubectl get pods -n edge-rag -l app=weaviate -o jsonpath='{.items[0].metadata.name}'
$weaviateIP = kubectl get pod $weaviatePod -n edge-rag -o jsonpath='{.status.podIP}'

Write-Host "Weaviate Pod: $weaviatePod"
Write-Host "Weaviate IP: $weaviateIP"

# Test connectivity from another pod
kubectl run -it --rm debug --image=curlimages/curl -n edge-rag -- sh
# Inside pod: curl http://weaviate:8080/v1/.well-known/ready

Expected Output: Weaviate is ready and accessible


Step 3: Deploy LLM Inference Engine (Ollama)

Objective: Set up Ollama for local LLM inference

Step 3.1: Deploy Ollama Service

# Deploy Ollama for LLM inference
@"
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: edge-rag
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - containerPort: 11434
          name: api
        env:
        - name: OLLAMA_MODELS
          value: /root/.ollama/models
        resources:
          requests:
            memory: "4Gi"
            cpu: "2000m"
          limits:
            memory: "8Gi"
            cpu: "4000m"
        livenessProbe:
          httpGet:
            path: /api/tags
            port: 11434
          initialDelaySeconds: 60
          periodSeconds: 30
        readinessProbe:
          httpGet:
            path: /api/tags
            port: 11434
          initialDelaySeconds: 30
          periodSeconds: 10
        volumeMounts:
        - name: ollama-storage
          mountPath: /root/.ollama
      volumes:
      - name: ollama-storage
        persistentVolumeClaim:
          claimName: ollama-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: edge-rag
spec:
  type: ClusterIP
  ports:
  - port: 11434
    targetPort: 11434
    name: api
  selector:
    app: ollama
"@ | kubectl apply -f -

# Wait for deployment
Write-Host "Ollama deploying (1-2 minutes)..."
kubectl wait --for=condition=ready pod -l app=ollama -n edge-rag --timeout=300s

Expected Output: Ollama pod running, service created

Step 3.2: Pull and Verify Model

# Get Ollama pod name
$ollamaPod = kubectl get pods -n edge-rag -l app=ollama -o jsonpath='{.items[0].metadata.name}'

# Pull lightweight model (Mistral 7B)
# Note: First time takes 5-10 minutes for download
Write-Host "Pulling Mistral model (this may take several minutes)..."
kubectl exec -it $ollamaPod -n edge-rag -- ollama pull mistral

# Verify model is available
kubectl exec $ollamaPod -n edge-rag -- ollama list

# Test model responsiveness
kubectl exec $ollamaPod -n edge-rag -- ollama run mistral "Hello, what is retrieval augmented generation?"

Expected Output: Model pulled and responding to queries
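
The RAG pipeline in Step 4 calls Ollama over its REST API rather than the CLI, so it is worth confirming the /api/generate endpoint also responds. A quick in-cluster check, reusing the debug-pod approach from Step 2.2:

# Call the HTTP API the same way the RAG pipeline will
kubectl run -it --rm debug --image=curlimages/curl -n edge-rag -- sh
# Inside pod:
# curl http://ollama:11434/api/generate -d '{"model": "mistral", "prompt": "What is RAG?", "stream": false}'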


Step 4: Deploy RAG Application

Objective: Deploy the RAG pipeline connecting embeddings, vector store, and LLM

Step 4.1: Create RAG Application Image (Local Build)

# Create Dockerfile for RAG application
$dockerfile = @"
FROM python:3.10-slim

WORKDIR /app

# Install dependencies
RUN pip install --no-cache-dir \
    fastapi==0.104.1 \
    uvicorn==0.24.0 \
    requests==2.31.0 \
    weaviate-client==3.25.0 \
    sentence-transformers==2.2.2 \
    torch==2.1.0 \
    PyYAML==6.0 \
    pydantic==2.5.0

# Copy RAG application
COPY app.py /app/
COPY rag_pipeline.py /app/
COPY config.yaml /app/

EXPOSE 8000

CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
"@

$dockerfile | Set-Content -Path Dockerfile

Write-Host "Dockerfile created"

Step 4.2: Create RAG Pipeline Code

# Save as rag_pipeline.py
@'
import logging
import uuid
from typing import Dict, List

import requests
from sentence_transformers import SentenceTransformer

logger = logging.getLogger(__name__)

class RAGPipeline:
    def __init__(self, config: Dict):
        self.config = config
        self.weaviate_url = config['vector_store']['url']
        self.ollama_url = config['llm']['url']
        self.top_k = config['retrieval']['top_k']
        # Load the embedding model once at startup instead of once per request
        self.embedder = SentenceTransformer(config['embeddings']['model'])

    def embed_text(self, text: str) -> List[float]:
        """Generate an embedding vector for the given text"""
        try:
            embedding = self.embedder.encode(text)
            return embedding.tolist()
        except Exception as e:
            logger.error(f"Embedding error: {e}")
            raise
    
    def store_document(self, doc_id: str, text: str, metadata: Dict) -> bool:
        """Store document in Weaviate"""
        try:
            embedding = self.embed_text(text)

            payload = {
                "class": "Document",
                # Weaviate object IDs must be UUIDs, so derive one
                # deterministically from the caller-supplied doc_id
                "id": str(uuid.uuid5(uuid.NAMESPACE_DNS, doc_id)),
                "properties": {
                    "content": text,
                    "doc_id": doc_id,
                    "source": metadata.get("source", "unknown"),
                    "timestamp": metadata.get("timestamp", ""),
                },
                "vector": embedding
            }
            
            response = requests.post(
                f"{self.weaviate_url}/v1/objects",
                json=payload
            )
            return response.status_code == 200
        except Exception as e:
            logger.error(f"Storage error: {e}")
            raise
    
    def retrieve_context(self, query: str) -> List[str]:
        """Retrieve relevant documents for query"""
        try:
            query_embedding = self.embed_text(query)
            
            graphql_query = f"""
            {{
              Get {{
                Document(
                  nearVector: {{
                    vector: {query_embedding}
                  }}
                  limit: {self.top_k}
                ) {{
                  content
                  source
                }}
              }}
            }}
            """
            
            response = requests.post(
                f"{self.weaviate_url}/v1/graphql",
                json={"query": graphql_query}
            )
            
            if response.status_code == 200:
                results = response.json().get("data", {}).get("Get", {}).get("Document", [])
                return [doc["content"] for doc in results]
            return []
        except Exception as e:
            logger.error(f"Retrieval error: {e}")
            raise
    
    def generate_answer(self, query: str, context: List[str]) -> str:
        """Generate answer using LLM with context"""
        try:
            context_text = "\n".join(context)
            prompt = f"""Context:
{context_text}

Question: {query}

Answer:"""
            
            response = requests.post(
                f"{self.ollama_url}/api/generate",
                json={
                    "model": "mistral",
                    "prompt": prompt,
                    "stream": False
                }
            )
            
            if response.status_code == 200:
                return response.json()["response"]
            raise Exception(f"Generation error: {response.status_code}")
        except Exception as e:
            logger.error(f"Generation error: {e}")
            raise
    
    def query(self, query: str) -> Dict:
        """Full RAG query pipeline"""
        try:
            context = self.retrieve_context(query)
            answer = self.generate_answer(query, context)
            return {
                "query": query,
                "answer": answer,
                "context_documents": len(context),
                "sources": [doc[:100] + "..." for doc in context]
            }
        except Exception as e:
            logger.error(f"Query error: {e}")
            return {
                "query": query,
                "error": str(e),
                "answer": "Unable to generate answer"
            }
'@ | Set-Content -Path rag_pipeline.py

Step 4.3: Create FastAPI Application

# Save as app.py
@'
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import yaml
import logging
from rag_pipeline import RAGPipeline

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

app = FastAPI(title="Edge RAG Service", version="1.0.0")

# Load configuration
with open("config.yaml", "r") as f:
    config = yaml.safe_load(f)

# Initialize RAG pipeline
rag_pipeline = RAGPipeline(config)

class QueryRequest(BaseModel):
    query: str

class DocumentRequest(BaseModel):
    doc_id: str
    content: str
    source: str = "unknown"

@app.get("/health")
async def health_check():
    return {"status": "healthy"}

@app.get("/config")
async def get_config():
    return config

@app.post("/query")
async def query_endpoint(request: QueryRequest):
    try:
        result = rag_pipeline.query(request.query)
        return result
    except Exception as e:
        logger.error(f"Query failed: {e}")
        raise HTTPException(status_code=500, detail=str(e))

@app.post("/ingest")
async def ingest_document(request: DocumentRequest):
    try:
        success = rag_pipeline.store_document(
            request.doc_id,
            request.content,
            {"source": request.source}
        )
        return {"success": success, "doc_id": request.doc_id}
    except Exception as e:
        logger.error(f"Ingestion failed: {e}")
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/stats")
async def get_stats():
    return {
        "vector_store": config['vector_store']['url'],
        "llm_model": config['llm']['model'],
        "embeddings_model": config['embeddings']['model'],
        "retrieval_top_k": config['retrieval']['top_k']
    }

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
'@ | Set-Content -Path app.py
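
The Dockerfile from Step 4.1 copies a config.yaml into the image, and the Step 4.4 manifest uses imagePullPolicy: Never, so both the file and the image must exist before deploying. A minimal sketch: the config below mirrors the rag-config ConfigMap from Step 1.3, while how you make the image visible to the cluster nodes (local registry, image import) depends on your Kubernetes distribution:

# Write config.yaml (same settings as the rag-config ConfigMap)
@'
vector_store:
  type: weaviate
  url: http://weaviate:8080
  batch_size: 50
  consistency_level: ALL

embeddings:
  model: sentence-transformers/all-MiniLM-L6-v2
  device: cpu
  batch_size: 32

llm:
  engine: ollama
  url: http://ollama:11434
  model: mistral
  temperature: 0.7
  max_tokens: 512

retrieval:
  top_k: 5
  similarity_threshold: 0.7

ingestion:
  chunk_size: 512
  chunk_overlap: 50
  document_path: /data/documents
'@ | Set-Content -Path config.yaml

# Build the image locally; because imagePullPolicy is Never, it must then be
# made available on the cluster nodes (push to a local registry or import it
# into the node container runtime -- the exact command varies by distribution)
docker build -t rag-service:latest .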

Step 4.4: Deploy RAG Service

# Deploy RAG application
@"
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rag-api
  namespace: edge-rag
spec:
  replicas: 2
  selector:
    matchLabels:
      app: rag-api
  template:
    metadata:
      labels:
        app: rag-api
    spec:
      containers:
      - name: rag-api
        image: rag-service:latest
        imagePullPolicy: Never
        ports:
        - containerPort: 8000
          name: http
        env:
        - name: WEAVIATE_URL
          value: http://weaviate:8080
        - name: OLLAMA_URL
          value: http://ollama:11434
        resources:
          requests:
            memory: "2Gi"
            cpu: "1000m"
          limits:
            memory: "4Gi"
            cpu: "2000m"
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 10
          periodSeconds: 5
        volumeMounts:
        - name: rag-data
          mountPath: /data
      volumes:
      - name: rag-data
        persistentVolumeClaim:
          claimName: rag-data-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: rag-api
  namespace: edge-rag
spec:
  type: LoadBalancer
  ports:
  - port: 8000
    targetPort: 8000
    name: http
  selector:
    app: rag-api
"@ | kubectl apply -f -

# Wait for deployment
Write-Host "RAG API deploying..."
kubectl wait --for=condition=ready pod -l app=rag-api -n edge-rag --timeout=300s

Expected Output: RAG API pods running
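
Once the pods are ready, a quick smoke test of the /health endpoint defined in app.py confirms the API is serving (this assumes the LoadBalancer service has been assigned an external IP, retrieved the same way Step 5.1 does):

# Smoke test: the endpoint should return {"status": "healthy"}
$ragApiIP = kubectl get service rag-api -n edge-rag -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
Invoke-RestMethod -Uri "http://${ragApiIP}:8000/health"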


Step 5: Test RAG Pipeline

Objective: Validate end-to-end RAG functionality

Step 5.1: Ingest Sample Documents

# Get RAG API service IP
$ragApiIP = kubectl get service rag-api -n edge-rag -o jsonpath='{.status.loadBalancer.ingress[0].ip}'

Write-Host "RAG API available at: http://$ragApiIP:8000"

# Create sample documents
$doc1 = @{
    doc_id = "doc-001"
    content = "Azure Local is Microsoft's edge computing platform for sovereign cloud deployments. It enables organizations to run cloud services on-premises with guaranteed data residency and compliance."
    source = "Azure Local Overview"
}

$doc2 = @{
    doc_id = "doc-002"
    content = "Retrieval-Augmented Generation (RAG) combines the power of large language models with targeted document retrieval. This approach improves accuracy and reduces hallucinations by grounding responses in actual data."
    source = "RAG Fundamentals"
}

$doc3 = @{
    doc_id = "doc-003"
    content = "Azure Arc enables unified management of resources across on-premises, edge, and cloud environments. It provides policy enforcement, monitoring, and governance at scale for hybrid infrastructure."
    source = "Azure Arc Overview"
}

# Ingest documents
foreach ($doc in @($doc1, $doc2, $doc3)) {
    $response = Invoke-RestMethod -Uri "http://${ragApiIP}:8000/ingest" `
        -Method Post `
        -ContentType "application/json" `
        -Body ($doc | ConvertTo-Json)
    
    Write-Host "Ingested: $($doc.doc_id) - $($response.success)"
}

Expected Output: Documents ingested successfully

Step 5.2: Query RAG System

# Test RAG queries
$queries = @(
    "What is Azure Local?",
    "How does RAG work?",
    "Tell me about Azure Arc"
)

foreach ($query in $queries) {
    Write-Host "`nQuery: $query"
    Write-Host "─" * 60
    
    $response = Invoke-RestMethod -Uri "http://$ragApiIP:8000/query" `
        -Method Post `
        -ContentType "application/json" `
        -Body (@{ query = $query } | ConvertTo-Json)
    
    Write-Host "Answer: $($response.answer)"
    Write-Host "Sources: $($response.context_documents) documents"
}

Expected Output: RAG system returning contextual answers
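
If answers come back with zero context documents, verify that the documents actually landed in Weaviate by querying its REST API directly (run the port-forward from Step 2.2 in a separate shell first):

# List stored Document objects straight from Weaviate
# (requires: kubectl port-forward -n edge-rag svc/weaviate 8080:8080)
Invoke-RestMethod -Uri "http://localhost:8080/v1/objects?class=Document&limit=3" | ConvertTo-Json -Depth 5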

Step 5.3: Monitor Performance

# Check pod logs
kubectl logs -n edge-rag -l app=rag-api --tail=50

# Monitor resource usage
kubectl top nodes
kubectl top pods -n edge-rag

# Get RAG API stats
$stats = Invoke-RestMethod -Uri "http://${ragApiIP}:8000/stats" -Method Get
Write-Host "RAG System Configuration:"
Write-Host ($stats | ConvertTo-Json -Depth 3)

Expected Output: All services healthy with reasonable resource usage


Step 6: Configure Monitoring

Objective: Set up observability for RAG system

Step 6.1: Add Prometheus Metrics

# Deploy Prometheus for metrics collection
@"
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: edge-rag
data:
  prometheus.yml: |
    global:
      scrape_interval: 30s
    scrape_configs:
    - job_name: 'rag-api'
      static_configs:
      - targets: ['rag-api:8000']
    - job_name: 'weaviate'
      static_configs:
      - targets: ['weaviate:8080']
    - job_name: 'kubernetes'
      kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
          - edge-rag
"@ | kubectl apply -f -

Write-Host "Prometheus ConfigMap created"

Step 6.2: Deploy Observability Stack

# Deploy Prometheus, Loki, and Grafana
Write-Host "In production, use Helm for full monitoring stack"
Write-Host "For this lab, monitoring is simplified via Kubernetes metrics"

# Verify metrics are available
kubectl top pods -n edge-rag

Step 7: Validation and Performance Testing

Objective: Verify RAG system meets performance requirements

Step 7.1: Load Testing

# Simple performance test
$queries = @(
    "What is sovereign cloud?",
    "Explain data residency",
    "What is edge computing?",
    "How does AI inference work?",
    "What is vector similarity?"
)

Write-Host "Running performance test..."
Write-Host "─" * 60

$results = @()

foreach ($query in $queries) {
    $start = Get-Date
    
    $response = Invoke-RestMethod -Uri "http://${ragApiIP}:8000/query" `
        -Method Post `
        -ContentType "application/json" `
        -Body (@{ query = $query } | ConvertTo-Json)
    
    $duration = ((Get-Date) - $start).TotalMilliseconds
    
    $results += [PSCustomObject]@{
        Query = $query
        DurationMs = [math]::Round($duration, 2)
        Success = -not $response.error
    }
}

# Display results
$results | Format-Table -AutoSize
$avgTime = ($results.DurationMs | Measure-Object -Average).Average
Write-Host "`nAverage Response Time: $([math]::Round($avgTime, 2))ms"

Expected Output: Response times under 5 seconds, high success rate

Step 7.2: Resource Efficiency Check

# Check resource efficiency
$podMetrics = kubectl top pods -n edge-rag --no-headers

Write-Host "Resource Usage Summary:"
Write-Host ("─" * 60)

# kubectl top emits plain text, so split each line into columns
$podMetrics | ForEach-Object {
    $fields = $_ -split '\s+'
    Write-Host "$($fields[0]): CPU=$($fields[1]), Memory=$($fields[2])"
}

# Check storage usage
kubectl exec -it $(kubectl get pods -n edge-rag -l app=rag-api -o jsonpath='{.items[0].metadata.name}') -n edge-rag -- df -h /data

Expected Output: Efficient resource utilization


Step 8: Next Steps and Scaling

Objective: Plan for production deployment

Step 8.1: Document System Capacity

# Get current deployment info
Write-Host "Current RAG System Configuration:"
Write-Host "═" * 60

$deploymentInfo = kubectl get deployment -n edge-rag -o jsonpath='{.items[*].spec.replicas}' | Measure-Object -Sum
Write-Host "Total Replicas: $($deploymentInfo.Sum)"

$serviceInfo = kubectl get service -n edge-rag
Write-Host "Services: $($serviceInfo.Count - 1)"

$pvcInfo = kubectl get pvc -n edge-rag
Write-Host "Storage Allocated: $(($pvcInfo | Measure-Object).Count) PVCs"

Write-Host "`nScaling Recommendations:"
Write-Host "- RAG API: Current 2 replicas, can scale to 5+"
Write-Host "- Weaviate: Requires persistent storage, single instance optimal"
Write-Host "- Ollama: Consider GPU-enabled node for better performance"

Step 8.2: Export Configuration for Lab 4

# Export current RAG setup for reference in Lab 4
kubectl get all -n edge-rag -o yaml > edge-rag-backup.yaml

Write-Host "Configuration exported to edge-rag-backup.yaml"
Write-Host "This will be referenced in Lab 4 for policy governance"

Learning Outcomes

What You Learned

βœ“ Edge RAG architecture and components
βœ“ Vector database deployment (Weaviate)
βœ“ LLM inference at the edge (Ollama)
βœ“ Embedding generation and vector search
βœ“ RAG pipeline implementation
βœ“ API endpoint design for ML workloads
βœ“ Performance monitoring for AI applications
βœ“ Resource optimization for inference

Skills Gained

βœ“ Deploy production-grade vector databases
βœ“ Configure local LLM inference engines
βœ“ Build RAG applications with Python/FastAPI
βœ“ Manage AI model lifecycle at the edge
βœ“ Monitor and optimize ML workload performance
βœ“ Design scalable inference architectures
βœ“ Integrate AI with existing infrastructure

Knowledge Applied From Previous Modules

βœ“ Module 1 (Azure Local): Deployed on Azure Local compute
βœ“ Module 2 (Arc): Integrated with Arc management in Lab 2
βœ“ Module 3 (Edge RAG): Core content for this lab


Troubleshooting

Issue                               Solution
─────────────────────────────────────────────────────────────
Ollama model pull timeout           Increase the timeout or use a smaller model (tinyllama)
Weaviate connection errors          Check pod IPs: kubectl get pods -n edge-rag -o wide
RAG API pods crashing               Check logs: kubectl logs <pod> -n edge-rag
Out-of-memory errors                Reduce the model size or increase pod memory limits
Slow embedding generation           Consider a GPU or batch processing
Vector search returns no results    Verify documents were ingested and check Weaviate logs

Last Updated: October 21, 2025