Edge RAG Architecture

Table of Contents

  1. Edge RAG Reference Architecture
    1. High-Level Components
    2. Component Descriptions
  2. Local LLM Deployment Considerations
    1. Model Size vs. Quality Trade-Off
    2. Quantization Techniques
    3. Inference Optimization
  3. Vector Database at the Edge
    1. Database Options
    2. Indexing Strategies
  4. Data Ingestion and Indexing Pipeline
    1. Document Processing Steps
    2. Continuous Ingestion
  5. Query Processing and Response Generation
    1. Query Flow
    2. Prompt Engineering for RAG
    3. Retrieval Strategies
  6. Orchestration and Response Generation
    1. RAG Orchestration Tools
    2. Streaming Responses
  7. Caching and Performance Optimization
    1. Caching Strategies
    2. Performance Optimizations
  8. Integration with Azure Services (Connected Mode)
    1. Optional Cloud Integration
  9. Disconnected Operation Scenarios
    1. Fully Air-Gapped Deployment
    2. Update Management
  10. Monitoring and Maintenance at the Edge
    1. Key Metrics to Monitor
    2. Maintenance Tasks
  11. Hardware Requirements
  12. Next Steps

Edge RAG Reference Architecture

High-Level Components

graph TB
    subgraph EdgeRAG[Edge RAG System]
        Query[Query API<br/>REST/gRPC] --> Embed[Embedding Model]
        Embed --> VectorDB[Vector Database]
        
        VectorDB --> LLM[LLM Local Deployment<br/>7B-70B Parameters<br/>GPU Accelerated]
        Query --> LLM
        
        LLM --> Response[Response with Sources]
        
        Ingest[Document Ingestion<br/>PDF, DOCX, TXT, HTML] --> Chunk[Chunking & Processing]
        Chunk --> Embed2[Embedding Generation]
        Embed2 --> VectorDB
    end
    
    User[User Query] --> Query
    Response --> User
    Docs[Documents] --> Ingest
    
    style EdgeRAG fill:#F8F8F8,stroke:#666,stroke-width:2px,color:#000
    style Query fill:#E8F4FD,stroke:#0078D4,stroke-width:2px,color:#000
    style LLM fill:#FFF4E6,stroke:#FF8C00,stroke-width:2px,color:#000
    style VectorDB fill:#F3E8FF,stroke:#7B3FF2,stroke-width:2px,color:#000
    style Response fill:#D4E9D7,stroke:#107C10,stroke-width:2px,color:#000

Component Descriptions

Query API:

  • REST or gRPC interface
  • Authentication and authorization
  • Rate limiting
  • Request logging

Embedding Model:

  • Converts queries and documents to vectors
  • Must embed queries into the same vector space used for indexing (typically the same model)
  • Typically 384-1536 dimensions

Vector Database:

  • Stores document embeddings
  • Fast similarity search
  • Metadata filtering

LLM (Large Language Model):

  • Generates final answer
  • Uses retrieved context
  • 7B to 70B parameters (larger = better but slower)

Document Ingestion:

  • Processes new documents
  • Chunks text appropriately
  • Generates and stores embeddings
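
As a sketch of how these components meet at the Query API, a minimal REST endpoint might look like the following; FastAPI is just one option for the REST interface, and answer_query() is a hypothetical helper that wraps retrieval and generation:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Edge RAG Query API")

class QueryRequest(BaseModel):
    question: str
    top_k: int = 5

@app.post("/query")
def query(request: QueryRequest):
    # answer_query() is a hypothetical function that embeds the question,
    # searches the vector database, and calls the local LLM
    answer, sources = answer_query(request.question, request.top_k)
    return {"answer": answer, "sources": sources}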

Local LLM Deployment Considerations

Model Size vs. Quality Trade-Off

7B Parameter Models:

  • Examples: LLaMA 2 7B, Mistral 7B
  • Memory: 14-16 GB VRAM (float16)
  • Quality: Good for simple Q&A
  • Speed: Fast (30-50 tokens/sec)
  • Use Case: High-throughput scenarios

13B-15B Parameter Models:

  • Examples: LLaMA 2 13B, Vicuna 13B
  • Memory: 26-30 GB VRAM
  • Quality: Better reasoning, more coherent
  • Speed: Moderate (20-30 tokens/sec)
  • Use Case: Balanced scenarios

30B-40B Parameter Models:

  • Examples: Falcon 40B, Code Llama 34B
  • Memory: 60-80 GB VRAM (requires multi-GPU or quantization)
  • Quality: Strong performance
  • Speed: Slower (10-15 tokens/sec)
  • Use Case: Quality-critical applications

70B Parameter Models:

  • Examples: LLaMA 2 70B
  • Memory: 140+ GB VRAM (multi-GPU required)
  • Quality: Excellent, approaches GPT-3.5
  • Speed: Slow (5-10 tokens/sec)
  • Use Case: Highest quality requirements

Quantization Techniques

What is Quantization: Reducing the numeric precision of model weights (e.g., FP16 → INT8/INT4) to save memory and improve inference speed.

INT8 Quantization:

  • 8-bit integers instead of 16-bit floats
  • 2x memory reduction
  • Minimal quality loss (< 1%)
  • Example: 70B model fits in 70 GB instead of 140 GB

INT4 / GPTQ / GGML:

  • 4-bit quantization
  • 4x memory reduction
  • Some quality degradation (5-10%)
  • 70B model fits in 35 GB

Recommendations:

  • INT8 for production (good quality/size trade-off)
  • INT4 for resource-constrained edge hardware (see the loading sketch below)
  • Unquantized FP16 only if hardware allows
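
A hedged loading sketch using Hugging Face Transformers with bitsandbytes 4-bit quantization; the checkpoint name is only an example, and load_in_8bit=True gives the INT8 variant instead:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # example checkpoint; substitute your model

# 4-bit NF4 quantization via bitsandbytes (use load_in_8bit=True for INT8)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # spread layers across available GPUs
)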

Inference Optimization

vLLM (Fast Inference):

  • PagedAttention algorithm
  • 2-4x higher throughput than standard Transformers inference
  • Lower memory usage

Text Generation Inference (TGI):

  • Hugging Face solution
  • Production-ready
  • Good scaling

llama.cpp / GGML:

  • C++ implementation
  • CPU-friendly
  • Apple Silicon optimized
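
A minimal vLLM offline-inference sketch; the model name is an example, and any checkpoint vLLM supports can be substituted:

from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2", dtype="float16")
params = SamplingParams(temperature=0.2, max_tokens=256)

# vLLM batches prompts internally, so passing several at once is efficient
outputs = llm.generate(["Summarize the maintenance procedure for pump P-101."], params)
print(outputs[0].outputs[0].text)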

Vector Database at the Edge

Database Options

Chroma:

  • Pros: Simple, Python-native, easy setup
  • Cons: Less scalable (< 1M vectors)
  • Best For: POCs, small deployments

Milvus:

  • Pros: Highly scalable, production-ready, feature-rich
  • Cons: More complex setup
  • Best For: Enterprise deployments

Qdrant:

  • Pros: Rust-based (fast), modern API, good filtering
  • Cons: Newer (less mature ecosystem)
  • Best For: Performance-critical applications

FAISS (Facebook AI Similarity Search):

  • Pros: Very fast, library (not database), well-tested
  • Cons: No metadata filtering, no built-in persistence or server layer
  • Best For: In-memory search, prototyping
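
For example, a small Chroma deployment might look like the sketch below; the storage path, collection name, and documents are placeholders, and Chroma applies its default embedding function unless one is supplied:

import chromadb

client = chromadb.PersistentClient(path="/data/chroma")          # local on-disk store
collection = client.get_or_create_collection(name="edge_docs")

collection.add(
    ids=["overview-chunk-0"],
    documents=["Azure Local runs validated hardware on premises..."],
    metadatas=[{"source": "overview.pdf", "section": "intro"}],
)

results = collection.query(query_texts=["What is Azure Local?"], n_results=3)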

Indexing Strategies

Flat Index:

  • Exact search
  • Slow for large datasets
  • 100% recall

IVF (Inverted File):

  • Cluster vectors, search nearest clusters
  • Fast, good recall (95-99%)
  • Good balance

HNSW (Hierarchical Navigable Small World):

  • Graph-based index
  • Very fast queries
  • High memory usage

PQ (Product Quantization):

  • Compress vectors
  • Lower memory usage
  • Some accuracy loss

Recommendation: HNSW for most cases; IVF when memory is constrained (both appear in the FAISS sketch below).
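
A FAISS sketch of the flat, IVF, and HNSW index types on synthetic embeddings; the dimension and index parameters are examples to tune against real data:

import numpy as np
import faiss

d = 384                                                 # embedding dimension (example)
xb = np.random.random((10_000, d)).astype("float32")    # stand-in for real chunk embeddings

# Flat: exact search, 100% recall, linear scan
flat = faiss.IndexFlatL2(d)
flat.add(xb)

# IVF: cluster into 256 cells, probe 8 of them per query
quantizer = faiss.IndexFlatL2(d)
ivf = faiss.IndexIVFFlat(quantizer, d, 256)
ivf.train(xb)
ivf.add(xb)
ivf.nprobe = 8

# HNSW: graph-based, very fast queries, higher memory (32 = graph connectivity M)
hnsw = faiss.IndexHNSWFlat(d, 32)
hnsw.add(xb)

query = np.random.random((1, d)).astype("float32")
distances, ids = hnsw.search(query, 5)                  # top-5 nearest neighbours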


Data Ingestion and Indexing Pipeline

Document Processing Steps

1. Document Loading:

  • Supported formats: PDF, DOCX, TXT, HTML, MD, CSV
  • Tools: PyPDF2, python-docx, Beautiful Soup, Unstructured.io

2. Text Extraction:

  • Extract clean text
  • Preserve structure (headings, lists)
  • Handle tables and images (OCR if needed)

3. Chunking:

  • Split into overlapping chunks
  • Typical: 500-1000 tokens per chunk
  • Overlap: 100-200 tokens

4. Metadata Extraction:

  • Document title, author, date
  • Section/chapter
  • Tags/categories
  • Source URL or file path

5. Embedding Generation:

  • Run embedding model on each chunk
  • Batch processing for efficiency
  • Store embeddings with metadata

6. Vector Database Indexing:

  • Insert embeddings into vector DB
  • Build search index
  • Enable metadata filters
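
As a rough illustration of steps 3 and 5, the sketch below chunks text with overlap and batch-generates embeddings with sentence-transformers; the model name, chunk sizes, and file path are examples only:

from sentence_transformers import SentenceTransformer

def chunk_text(text: str, chunk_size: int = 800, overlap: int = 150) -> list[str]:
    """Split text into overlapping chunks by whitespace tokens (a rough proxy for model tokens)."""
    words = text.split()
    chunks, step = [], chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # example model
chunks = chunk_text(open("manual.txt", encoding="utf-8").read())
embeddings = embedder.encode(chunks, batch_size=64)      # batching keeps the GPU busy
# The embeddings, chunk text, and metadata are then inserted into the vector database (step 6)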

Continuous Ingestion

Watch Folder Pattern:

1. Monitor directory for new files
2. Detect new/changed documents
3. Process and index automatically
4. Update vector database
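
A minimal sketch of this pattern using the watchdog library; index_document() is a hypothetical wrapper around the chunk/embed/insert pipeline above, and the watched path is an example:

import time
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

class NewDocumentHandler(FileSystemEventHandler):
    def on_created(self, event):
        if not event.is_directory:
            index_document(event.src_path)  # hypothetical: chunk, embed, insert into vector DB

observer = Observer()
observer.schedule(NewDocumentHandler(), path="/data/incoming", recursive=True)
observer.start()
try:
    while True:
        time.sleep(5)
finally:
    observer.stop()
    observer.join()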

Scheduled Batch Jobs:

  • Run nightly or weekly
  • Process accumulated documents
  • Rebuild indices if needed

Event-Driven:

  • Triggered by application events
  • Real-time indexing
  • Good for dynamic content

Query Processing and Response Generation

Query Flow

1. User submits natural language query
   ↓
2. Query → Embedding Model → Query Vector
   ↓
3. Vector Database Search (top K similar documents)
   ↓
4. Retrieved documents + Query → Prompt for LLM
   ↓
5. LLM generates answer with citations
   ↓
6. Response returned to user
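
The same flow written out by hand; embed(), vector_db.search(), and llm.generate() are hypothetical stand-ins for the embedding model, vector database client, and local LLM:

def answer_query(user_query: str, top_k: int = 5) -> str:
    query_vector = embed(user_query)                      # step 2: query -> vector
    hits = vector_db.search(query_vector, top_k=top_k)    # step 3: top-K similar chunks
    context = "\n\n".join(hit.text for hit in hits)       # step 4: assemble context
    prompt = (
        f"Context: {context}\n\n"
        f"Question: {user_query}\n\n"
        "Answer using only the context above and cite your sources.\n\nAnswer:"
    )
    return llm.generate(prompt)                           # step 5: generate the answer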

Prompt Engineering for RAG

Template:

Context: {retrieved_documents}

Question: {user_query}

Instructions:
- Answer the question based only on the context provided
- If the context doesn't contain enough information, say so
- Cite the source for each fact you use
- Be concise but complete

Answer:

Best Practices:

  • Clear instructions to use context
  • Request citations
  • Specify answer format/length
  • Include examples if helpful

Retrieval Strategies

Top-K Retrieval:

  • Return K most similar documents (typically 3-5)
  • Simple and effective

MMR (Maximal Marginal Relevance):

  • Balance similarity and diversity
  • Avoid redundant results

Re-ranking:

  • Initial retrieval: top 20-50
  • Re-rank with more sophisticated model
  • Return top 5 to LLM

Hybrid Search:

  • Combine vector and keyword search
  • Weighted fusion of results
  • More robust
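
The re-ranking strategy above can be prototyped with a sentence-transformers cross-encoder; the checkpoint name is one common public re-ranker, and retrieve_top_k() is a hypothetical initial vector search:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "What is Azure Local?"
candidates = retrieve_top_k(query, k=30)   # hypothetical: top 20-50 chunks from vector search

# Score each (query, chunk) pair and keep the 5 best for the LLM prompt
scores = reranker.predict([(query, doc) for doc in candidates])
ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
top5 = [doc for _, doc in ranked[:5]]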

Orchestration and Response Generation

RAG Orchestration Tools

LangChain:

from langchain.chains import RetrievalQA

# local_llm and vector_store are assumed to be configured elsewhere
# (e.g., a quantized local model and one of the vector databases above)
qa_chain = RetrievalQA.from_chain_type(
    llm=local_llm,
    retriever=vector_store.as_retriever(),
    return_source_documents=True  # keep retrieved chunks for citations
)

result = qa_chain({"query": "What is Azure Local?"})
answer = result['result']               # generated answer
sources = result['source_documents']    # chunks used as context

LlamaIndex:

from llama_index import VectorStoreIndex, ServiceContext

# documents: list of llama_index Document objects loaded earlier;
# local_llm: the locally deployed model wrapper
service_context = ServiceContext.from_defaults(llm=local_llm)
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

query_engine = index.as_query_engine()
response = query_engine.query("What is Azure Local?")

Streaming Responses

Why Streaming:

  • Better user experience (see tokens as generated)
  • Lower perceived latency
  • Ability to cancel long responses

Implementation:

# llm.stream() yields tokens as they are generated instead of waiting for the full answer
for token in llm.stream("Query..."):
    print(token, end="", flush=True)

Caching and Performance Optimization

Caching Strategies

Query Caching:

  • Cache common queries and responses
  • Instant response for repeated questions
  • Use semantic similarity for cache lookup

Embedding Caching:

  • Cache document embeddings
  • Avoid re-computing for unchanged documents

Retrieved Context Caching:

  • Cache retrieval results for popular queries
  • Reduces vector DB load
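
A minimal semantic query cache sketch; embed() is a hypothetical embedding call, and the 0.95 similarity threshold is an assumption to tune per deployment:

import numpy as np

class SemanticQueryCache:
    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.entries = []  # list of (query_vector, cached_response)

    def get(self, query_vector):
        for cached_vector, response in self.entries:
            similarity = float(np.dot(query_vector, cached_vector) /
                               (np.linalg.norm(query_vector) * np.linalg.norm(cached_vector)))
            if similarity >= self.threshold:
                return response        # near-duplicate question: reuse the cached answer
        return None

    def put(self, query_vector, response):
        self.entries.append((query_vector, response))

# Usage: cached = cache.get(embed(question)); fall through to the full RAG pipeline on a miss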

Performance Optimizations

Batch Processing:

  • Process multiple documents at once
  • Batch embed generation
  • More efficient GPU utilization

Asynchronous Processing:

  • Non-blocking I/O
  • Process queries in parallel
  • Better throughput
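
A sketch of concurrent query handling with asyncio; embed_async(), vector_db.search_async(), build_prompt(), and llm.generate_async() are hypothetical async wrappers around the components described earlier:

import asyncio

async def answer(query: str) -> str:
    vector = await embed_async(query)
    hits = await vector_db.search_async(vector, top_k=5)
    return await llm.generate_async(build_prompt(query, hits))

async def main(queries: list[str]) -> list[str]:
    # Independent queries run concurrently instead of one at a time
    return await asyncio.gather(*(answer(q) for q in queries))

results = asyncio.run(main(["What is Azure Local?", "How do I patch a node?"]))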

Load Balancing:

  • Multiple LLM instances
  • Distribute queries
  • Horizontal scaling

Integration with Azure Services (Connected Mode)

Optional Cloud Integration

When deployed in Connected Mode on Azure Local:

Azure OpenAI (for comparison/fallback):

  • Use cloud LLM for complex queries
  • Hybrid approach (edge for sensitive, cloud for general)
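
One way to implement this hybrid approach is a simple router that keeps sensitive content on the edge; is_sensitive(), build_prompt(), local_llm, and cloud_llm are hypothetical stand-ins for a data-classification check and the two model clients:

def route_query(question: str, context: str) -> str:
    prompt = build_prompt(question, context)
    if is_sensitive(question) or is_sensitive(context):
        return local_llm.generate(prompt)    # sensitive data never leaves the edge
    return cloud_llm.generate(prompt)        # general queries may use Azure OpenAI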

Azure AI Search (formerly Azure Cognitive Search):

  • Offload some vector search to cloud
  • Hybrid local + cloud knowledge base

Azure Monitor:

  • Query performance telemetry
  • Model performance metrics
  • Usage analytics

Azure Blob Storage:

  • Backup vector database
  • Archive old documents
  • Disaster recovery

Important: Only sync approved, non-sensitive data to cloud!


Disconnected Operation Scenarios

Fully Air-Gapped Deployment

Requirements:

  • All models pre-downloaded
  • Vector database fully local
  • No internet dependency

Challenges:

  • Model updates manual
  • Document ingestion from local sources only
  • No cloud-based monitoring

Solutions:

  • USB/physical media for updates
  • Local monitoring dashboards
  • Comprehensive local logging

Update Management

Model Updates:

  • Download new models in secure environment
  • Transfer to air-gapped network
  • Test before deploying
  • Rollback capability

Document Updates:

  • Scheduled ingestion from local repositories
  • Version control for knowledge base
  • Incremental indexing

Monitoring and Maintenance at the Edge

Key Metrics to Monitor

Query Performance:

  • Query latency (end-to-end)
  • Retrieval time
  • LLM inference time
  • Queries per second

Quality Metrics:

  • User satisfaction scores
  • Answer accuracy (if ground truth available)
  • Citation accuracy

Resource Utilization:

  • GPU utilization
  • Memory usage
  • Storage usage
  • CPU utilization

Failure Metrics:

  • Query errors
  • Timeout rate
  • Model load failures
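
One option for exposing these metrics locally is prometheus_client; the metric names and port are illustrative, and answer_query() is a hypothetical pipeline entry point:

from prometheus_client import Counter, Histogram, start_http_server

QUERY_LATENCY = Histogram("rag_query_latency_seconds", "End-to-end query latency")
RETRIEVAL_LATENCY = Histogram("rag_retrieval_latency_seconds", "Vector search latency")
QUERY_ERRORS = Counter("rag_query_errors_total", "Failed queries")

start_http_server(9100)  # scrape endpoint for a local Prometheus/Grafana stack

@QUERY_LATENCY.time()
def handle_query(question: str) -> str:
    try:
        return answer_query(question)   # hypothetical retrieval + generation pipeline
    except Exception:
        QUERY_ERRORS.inc()
        raise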

Maintenance Tasks

Regular:

  • Monitor disk space
  • Review slow queries
  • Update monitoring dashboards

Weekly:

  • Analyze usage patterns
  • Review user feedback
  • Plan capacity

Monthly:

  • Evaluate model performance
  • Consider index rebuild
  • Review and update documents

Hardware Requirements

Minimum (7B Model):

  • GPU: NVIDIA RTX 4090 (24 GB) or A10 (24 GB)
  • RAM: 32 GB
  • Storage: 500 GB SSD
  • CPU: 8+ cores

Recommended (13B Model):

  • GPU: NVIDIA A100 (40 GB) or A30 (24 GB, with INT8 quantization)
  • RAM: 64 GB
  • Storage: 1 TB NVMe SSD
  • CPU: 16+ cores

Enterprise (70B Model):

  • GPU: 2x NVIDIA A100 (80 GB) or H100
  • RAM: 256 GB
  • Storage: 2 TB NVMe SSD
  • CPU: 32+ cores

Next Steps


Last Updated: October 2025