Edge RAG Architecture
Table of Contents
- Edge RAG Reference Architecture
 - Local LLM Deployment Considerations
 - Vector Database at the Edge
 - Data Ingestion and Indexing Pipeline
 - Query Processing and Response Generation
 - Orchestration and Response Generation
 - Caching and Performance Optimization
 - Integration with Azure Services (Connected Mode)
 - Disconnected Operation Scenarios
 - Monitoring and Maintenance at the Edge
 - Hardware Requirements
 - Next Steps
 
Edge RAG Reference Architecture
High-Level Components
```mermaid
graph TB
    subgraph EdgeRAG[Edge RAG System]
        Query[Query API<br/>REST/gRPC] --> Embed[Embedding Model]
        Embed --> VectorDB[Vector Database]
        
        VectorDB --> LLM[LLM Local Deployment<br/>7B-70B Parameters<br/>GPU Accelerated]
        Query --> LLM
        
        LLM --> Response[Response with Sources]
        
        Ingest[Document Ingestion<br/>PDF, DOCX, TXT, HTML] --> Chunk[Chunking & Processing]
        Chunk --> Embed2[Embedding Generation]
        Embed2 --> VectorDB
    end
    
    User[User Query] --> Query
    Response --> User
    Docs[Documents] --> Ingest
    
    style EdgeRAG fill:#F8F8F8,stroke:#666,stroke-width:2px,color:#000
    style Query fill:#E8F4FD,stroke:#0078D4,stroke-width:2px,color:#000
    style LLM fill:#FFF4E6,stroke:#FF8C00,stroke-width:2px,color:#000
    style VectorDB fill:#F3E8FF,stroke:#7B3FF2,stroke-width:2px,color:#000
    style Response fill:#D4E9D7,stroke:#107C10,stroke-width:2px,color:#000
```
Component Descriptions
Query API:
- REST or gRPC interface
 - Authentication and authorization
 - Rate limiting
 - Request logging
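
A minimal sketch of such an API using FastAPI; the route name, request fields, and the `rag_pipeline` helper are illustrative assumptions rather than a prescribed interface:
```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class QueryRequest(BaseModel):
    question: str
    top_k: int = 5

@app.post("/query")
def query(request: QueryRequest) -> dict:
    # rag_pipeline is a hypothetical function wiring retrieval + generation together.
    answer, sources = rag_pipeline(request.question, request.top_k)
    return {"answer": answer, "sources": sources}
```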
 
Embedding Model:
- Converts queries and documents to vectors
 - Must be the same model used at indexing time, so query and document vectors share one embedding space
 - Typically 384-1536 dimensions
 
Vector Database:
- Stores document embeddings
 - Fast similarity search
 - Metadata filtering
 
LLM (Large Language Model):
- Generates final answer
 - Uses retrieved context
 - 7B to 70B parameters (larger = better but slower)
 
Document Ingestion:
- Processes new documents
 - Chunks text appropriately
 - Generates and stores embeddings
 
Local LLM Deployment Considerations
Model Size vs. Quality Trade-Off
7B Parameter Models:
- Examples: LLaMA 2 7B, Mistral 7B
 - Memory: 14-16 GB VRAM (float16)
 - Quality: Good for simple Q&A
 - Speed: Fast (30-50 tokens/sec)
 - Use Case: High-throughput scenarios
 
13B-15B Parameter Models:
- Examples: LLaMA 2 13B, Vicuna 13B
 - Memory: 26-30 GB VRAM
 - Quality: Better reasoning, more coherent
 - Speed: Moderate (20-30 tokens/sec)
 - Use Case: Balanced scenarios
 
30B-40B Parameter Models:
- Examples: Falcon 40B, Code Llama 34B
 - Memory: 60-80 GB VRAM (requires multi-GPU or quantization)
 - Quality: Strong performance
 - Speed: Slower (10-15 tokens/sec)
 - Use Case: Quality-critical applications
 
70B Parameter Models:
- Examples: LLaMA 2 70B
 - Memory: 140+ GB VRAM (multi-GPU required)
 - Quality: Excellent, approaches GPT-3.5
 - Speed: Slow (5-10 tokens/sec)
 - Use Case: Highest quality requirements
 
Quantization Techniques
What is Quantization: Reducing the numeric precision of model weights to save memory and speed up inference.
INT8 Quantization:
- 8-bit integers instead of 16-bit floats
 - 2x memory reduction
 - Minimal quality loss (< 1%)
 - Example: 70B model fits in 70 GB instead of 140 GB
 
INT4 / GPTQ / GGML:
- 4-bit quantization
 - 4x memory reduction
 - Some quality degradation (5-10%)
 - 70B model fits in 35 GB
 
Recommendations:
- INT8 for production (good quality/size trade-off)
 - INT4 for resource-constrained edge
 - Full precision (FP16) only if hardware allows
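
For illustration, a minimal sketch of loading a 4-bit (NF4) quantized model with Hugging Face transformers and bitsandbytes; the model ID and parameter choices are assumptions to adapt to your hardware:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Hypothetical model ID; substitute the model approved for your deployment.
model_id = "mistralai/Mistral-7B-Instruct-v0.2"

# 4-bit (NF4) quantization roughly quarters the memory footprint vs. FP16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # spread layers across available GPUs
)
```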
 
Inference Optimization
vLLM (Fast Inference):
- PagedAttention algorithm
 - 2-4x higher throughput than standard Hugging Face inference (see the sketch at the end of this subsection)
 - Lower memory usage
 
Text Generation Inference (TGI):
- Hugging Face solution
 - Production-ready
 - Good scaling
 
llama.cpp / GGML:
- C++ implementation
 - CPU-friendly
 - Apple Silicon optimized
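
As referenced above, a minimal sketch of offline batch inference with vLLM; the model name and sampling parameters are assumptions:
```python
from vllm import LLM, SamplingParams

# Any locally available Hugging Face-format checkpoint can be used here.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
params = SamplingParams(temperature=0.2, max_tokens=256)

outputs = llm.generate(["Summarize the shutdown procedure for line 3."], params)
print(outputs[0].outputs[0].text)
```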
 
Vector Database at the Edge
Database Options
Chroma:
- Pros: Simple, Python-native, easy setup
 - Cons: Less scalable (< 1M vectors)
 - Best For: POCs, small deployments
 
Milvus:
- Pros: Highly scalable, production-ready, feature-rich
 - Cons: More complex setup
 - Best For: Enterprise deployments
 
Qdrant:
- Pros: Rust-based (fast), modern API, good filtering
 - Cons: Newer (less mature ecosystem)
 - Best For: Performance-critical applications
 
FAISS (Facebook AI Similarity Search):
- Pros: Very fast, library (not database), well-tested
 - Cons: No metadata filtering, no built-in persistence layer (indexes must be saved and loaded manually)
 - Best For: In-memory search, prototyping
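
To make the options concrete, here is a minimal Chroma sketch; the collection name, path, toy vectors, and metadata fields are assumptions, and the same add/query pattern carries over conceptually to Milvus and Qdrant:
```python
import chromadb

# Persistent local store; the path is an assumption for this sketch.
client = chromadb.PersistentClient(path="./edge_rag_store")
collection = client.get_or_create_collection("site_documents")

# Upsert pre-computed embeddings with metadata for filtering.
collection.add(
    ids=["doc1-chunk0"],
    embeddings=[[0.12, 0.34, 0.56]],  # toy 3-dim vector; real models emit 384-1536 dims
    documents=["Pump A requires quarterly seal inspection."],
    metadatas=[{"source": "maintenance_manual.pdf", "section": "4.2"}],
)

results = collection.query(query_embeddings=[[0.11, 0.33, 0.55]], n_results=1)
```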
 
Indexing Strategies
Flat Index:
- Exact search
 - Slow for large datasets
 - 100% recall
 
IVF (Inverted File):
- Cluster vectors, search nearest clusters
 - Fast, good recall (95-99%)
 - Good balance
 
HNSW (Hierarchical Navigable Small World):
- Graph-based index
 - Very fast queries
 - High memory usage
 
PQ (Product Quantization):
- Compress vectors
 - Lower memory usage
 - Some accuracy loss
 
Recommendation: HNSW for most cases, IVF for memory-constrained.
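
A small FAISS sketch contrasting a flat (exact) index with HNSW; the dimensionality, data, and parameters are placeholder assumptions:
```python
import numpy as np
import faiss

dim = 384                                                 # embedding dimensionality (model-dependent)
vectors = np.random.rand(10_000, dim).astype("float32")   # placeholder embeddings

# Exact (flat) index: 100% recall, linear scan over all vectors.
flat = faiss.IndexFlatL2(dim)
flat.add(vectors)

# HNSW index: approximate, much faster at query time, higher memory use.
hnsw = faiss.IndexHNSWFlat(dim, 32)   # 32 = graph connectivity parameter (M)
hnsw.add(vectors)

query = np.random.rand(1, dim).astype("float32")
distances, ids = hnsw.search(query, 5)   # top-5 nearest neighbors
```
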
Data Ingestion and Indexing Pipeline
Document Processing Steps
1. Document Loading:
- Supported formats: PDF, DOCX, TXT, HTML, MD, CSV
 - Tools: PyPDF2, python-docx, Beautiful Soup, Unstructured.io
2. Text Extraction:
- Extract clean text
 - Preserve structure (headings, lists)
 - Handle tables and images (OCR if needed)
 
3. Chunking:
- Split text into overlapping chunks (a code sketch follows this pipeline)
 - Typical: 500-1000 tokens per chunk
 - Overlap: 100-200 tokens
 
4. Metadata Extraction:
- Document title, author, date
 - Section/chapter
 - Tags/categories
 - Source URL or file path
 
5. Embedding Generation:
- Run embedding model on each chunk
 - Batch processing for efficiency
 - Store embeddings with metadata
 
6. Vector Database Indexing:
- Insert embeddings into vector DB
 - Build search index
 - Enable metadata filters
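
As referenced in step 3, a minimal chunking sketch using LangChain's RecursiveCharacterTextSplitter; note that chunk_size here counts characters rather than tokens, so treat the values as stand-ins for the token guidance above, and the file path is hypothetical:
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Character-based splitting for brevity; token-based splitting is more precise.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " "],
)

with open("maintenance_manual.txt", encoding="utf-8") as f:  # hypothetical input file
    text = f.read()

chunks = splitter.split_text(text)
```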
 
Continuous Ingestion
Watch Folder Pattern:
1. Monitor directory for new files
2. Detect new/changed documents
3. Process and index automatically
4. Update vector database
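
A minimal sketch of the watch folder pattern using the Python watchdog library; the directory path and the index_document function are assumptions standing in for your ingestion pipeline:
```python
import time
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

class NewDocumentHandler(FileSystemEventHandler):
    def on_created(self, event):
        if not event.is_directory:
            index_document(event.src_path)   # hypothetical ingestion function

observer = Observer()
observer.schedule(NewDocumentHandler(), path="/data/incoming", recursive=True)
observer.start()
try:
    while True:
        time.sleep(5)   # keep the watcher alive
finally:
    observer.stop()
    observer.join()
```
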
Scheduled Batch Jobs:
- Run nightly or weekly
 - Process accumulated documents
 - Rebuild indices if needed
 
Event-Driven:
- Triggered by application events
 - Real-time indexing
 - Good for dynamic content
 
Query Processing and Response Generation
Query Flow
```text
1. User submits natural language query
   ↓
2. Query → Embedding Model → Query Vector
   ↓
3. Vector Database Search (top K similar documents)
   ↓
4. Retrieved documents + Query → Prompt for LLM
   ↓
5. LLM generates answer with citations
   ↓
6. Response returned to user
```
Prompt Engineering for RAG
Template:
```text
Context: {retrieved_documents}
Question: {user_query}
Instructions:
- Answer the question based only on the context provided
- If the context doesn't contain enough information, say so
- Cite the source for each fact you use
- Be concise but complete
Answer:
```
Best Practices:
- Clear instructions to use context
 - Request citations
 - Specify answer format/length
 - Include examples if helpful
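
A small helper that fills the template above from retrieved chunks; the expected dict keys ("text", "source") are assumptions about how retrieval results are structured:
```python
def build_rag_prompt(user_query: str, retrieved_docs: list[dict]) -> str:
    """Fill the RAG template with retrieved chunks and their sources."""
    context = "\n\n".join(
        f"[{i + 1}] (source: {d['source']})\n{d['text']}"
        for i, d in enumerate(retrieved_docs)
    )
    return (
        f"Context:\n{context}\n\n"
        f"Question: {user_query}\n\n"
        "Instructions:\n"
        "- Answer the question based only on the context provided\n"
        "- If the context doesn't contain enough information, say so\n"
        "- Cite the source for each fact you use\n"
        "- Be concise but complete\n\n"
        "Answer:"
    )
```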
 
Retrieval Strategies
Top-K Retrieval:
- Return K most similar documents (typically 3-5)
 - Simple and effective
 
MMR (Maximal Marginal Relevance):
- Balance similarity and diversity
 - Avoid redundant results
 
Re-ranking:
- Initial retrieval: top 20-50
 - Re-rank with more sophisticated model
 - Return top 5 to LLM
 
Hybrid Search:
- Combine vector and keyword search
 - Weighted fusion of results
 - More robust
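
One common way to fuse vector and keyword results is reciprocal rank fusion (RRF); a minimal sketch operating on two ranked lists of document IDs:
```python
def reciprocal_rank_fusion(vector_hits: list[str], keyword_hits: list[str], k: int = 60) -> list[str]:
    """Fuse two ranked lists of document IDs into a single ranking."""
    scores: dict[str, float] = {}
    for hits in (vector_hits, keyword_hits):
        for rank, doc_id in enumerate(hits):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

fused = reciprocal_rank_fusion(
    vector_hits=["doc3", "doc1", "doc7"],
    keyword_hits=["doc1", "doc9", "doc3"],
)
```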
 
Orchestration and Response Generation
RAG Orchestration Tools
LangChain:
```python
# Assumes `local_llm` (a LangChain-compatible local LLM) and `vector_store`
# (a populated LangChain vector store) are already initialized.
from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(
    llm=local_llm,
    retriever=vector_store.as_retriever(),
    return_source_documents=True,  # keep the retrieved chunks for citations
)

result = qa_chain({"query": "What is Azure Local?"})
answer = result["result"]
sources = result["source_documents"]
```
LlamaIndex:
```python
# Assumes `local_llm` and a list of loaded `documents` are already available.
from llama_index import VectorStoreIndex, ServiceContext

service_context = ServiceContext.from_defaults(llm=local_llm)
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

query_engine = index.as_query_engine()
response = query_engine.query("What is Azure Local?")
```
Streaming Responses
Why Streaming:
- Better user experience (see tokens as generated)
 - Lower perceived latency
 - Ability to cancel long responses
 
Implementation:
```python
for token in llm.stream("Query..."):
    print(token, end="", flush=True)
```
Caching and Performance Optimization
Caching Strategies
Query Caching:
- Cache common queries and responses
 - Instant response for repeated questions
 - Use semantic similarity for cache lookup
 
Embedding Caching:
- Cache document embeddings
 - Avoid re-computing for unchanged documents
 
Retrieved Context Caching:
- Cache retrieval results for popular queries
 - Reduces vector DB load
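
A minimal semantic query cache sketch; embed_fn is an assumed callable that maps text to a NumPy vector, and the similarity threshold is a tunable assumption:
```python
import numpy as np

class SemanticQueryCache:
    """Return a cached answer when a new query is close enough in embedding space."""

    def __init__(self, embed_fn, threshold: float = 0.92):
        self.embed_fn = embed_fn        # assumed callable: str -> np.ndarray
        self.threshold = threshold
        self.entries: list[tuple[np.ndarray, str]] = []

    def get(self, query: str):
        q = self.embed_fn(query)
        q = q / np.linalg.norm(q)
        for vec, answer in self.entries:
            if float(np.dot(q, vec)) >= self.threshold:
                return answer               # cache hit: skip retrieval and generation
        return None

    def put(self, query: str, answer: str):
        q = self.embed_fn(query)
        self.entries.append((q / np.linalg.norm(q), answer))
```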
 
Performance Optimizations
Batch Processing:
- Process multiple documents at once
 - Batch embed generation
 - More efficient GPU utilization
 
Asynchronous Processing:
- Non-blocking I/O
 - Process queries in parallel
 - Better throughput
 
Load Balancing:
- Multiple LLM instances
 - Distribute queries
 - Horizontal scaling
 
Integration with Azure Services (Connected Mode)
Optional Cloud Integration
When deployed in Connected Mode on Azure Local:
Azure OpenAI (for comparison/fallback):
- Use cloud LLM for complex queries
 - Hybrid approach (edge for sensitive, cloud for general)
 
Azure Cognitive Search:
- Offload some vector search to cloud
 - Hybrid local + cloud knowledge base
 
Azure Monitor:
- Query performance telemetry
 - Model performance metrics
 - Usage analytics
 
Azure Blob Storage:
- Backup vector database
 - Archive old documents
 - Disaster recovery
 
Important: Sync only approved, non-sensitive data to the cloud!
Disconnected Operation Scenarios
Fully Air-Gapped Deployment
Requirements:
- All models pre-downloaded
 - Vector database fully local
 - No internet dependency
 
Challenges:
- Model updates manual
 - Document ingestion from local sources only
 - No cloud-based monitoring
 
Solutions:
- USB/physical media for updates
 - Local monitoring dashboards
 - Comprehensive local logging
 
Update Management
Model Updates:
- Download new models in secure environment
 - Transfer to air-gapped network
 - Test before deploying
 - Rollback capability
 
Document Updates:
- Scheduled ingestion from local repositories
 - Version control for knowledge base
 - Incremental indexing
 
Monitoring and Maintenance at the Edge
Key Metrics to Monitor
Query Performance:
- Query latency (end-to-end)
 - Retrieval time
 - LLM inference time
 - Queries per second
 
Quality Metrics:
- User satisfaction scores
 - Answer accuracy (if ground truth available)
 - Citation accuracy
 
Resource Utilization:
- GPU utilization
 - Memory usage
 - Storage usage
 - CPU utilization
 
Failure Metrics:
- Query errors
 - Timeout rate
 - Model load failures
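
One way to expose these metrics locally is the Prometheus Python client; in this sketch the metric names, port, and the retrieve/generate functions are assumptions:
```python
import time
from prometheus_client import Histogram, Counter, start_http_server

RETRIEVAL_SECONDS = Histogram("rag_retrieval_seconds", "Vector search latency")
INFERENCE_SECONDS = Histogram("rag_llm_inference_seconds", "LLM generation latency")
QUERY_ERRORS = Counter("rag_query_errors_total", "Failed queries")

start_http_server(9100)  # scrape endpoint for a local Prometheus/Grafana stack

def answer_query(query: str) -> str:
    try:
        with RETRIEVAL_SECONDS.time():
            docs = retrieve(query)          # hypothetical retrieval function
        with INFERENCE_SECONDS.time():
            return generate(query, docs)    # hypothetical generation function
    except Exception:
        QUERY_ERRORS.inc()
        raise
```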
 
Maintenance Tasks
Daily:
- Monitor disk space
 - Review slow queries
 - Update monitoring dashboards
 
Weekly:
- Analyze usage patterns
 - Review user feedback
 - Plan capacity
 
Monthly:
- Evaluate model performance
 - Consider index rebuild
 - Review and update documents
 
Hardware Requirements
Minimum (7B Model):
- GPU: NVIDIA RTX 4090 (24 GB) or A10 (24 GB)
 - RAM: 32 GB
 - Storage: 500 GB SSD
 - CPU: 8+ cores
 
Recommended (13B Model):
- GPU: NVIDIA A100 (40 GB) or A30 (24 GB)
 - RAM: 64 GB
 - Storage: 1 TB NVMe SSD
 - CPU: 16+ cores
 
Enterprise (70B Model):
- GPU: 2x NVIDIA A100 (80 GB) or H100
 - RAM: 256 GB
 - Storage: 2 TB NVMe SSD
 - CPU: 32+ cores
 
Next Steps
Last Updated: October 2025