Edge RAG Architecture
Table of Contents
- Edge RAG Reference Architecture
 - Local LLM Deployment Considerations
 - Vector Database at the Edge
 - Data Ingestion and Indexing Pipeline
 - Query Processing and Response Generation
 - Orchestration and Response Generation
 - Caching and Performance Optimization
 - Integration with Azure Services (Connected Mode)
 - Disconnected Operation Scenarios
 - Monitoring and Maintenance at the Edge
 - Hardware Requirements
 - Next Steps
 
Edge RAG Reference Architecture
High-Level Components
```mermaid
graph TB
    subgraph EdgeRAG[Edge RAG System]
        Query[Query API<br/>REST/gRPC] --> Embed[Embedding Model]
        Embed --> VectorDB[Vector Database]
        
        VectorDB --> LLM[LLM Local Deployment<br/>7B-70B Parameters<br/>GPU Accelerated]
        Query --> LLM
        
        LLM --> Response[Response with Sources]
        
        Ingest[Document Ingestion<br/>PDF, DOCX, TXT, HTML] --> Chunk[Chunking & Processing]
        Chunk --> Embed2[Embedding Generation]
        Embed2 --> VectorDB
    end
    
    User[User Query] --> Query
    Response --> User
    Docs[Documents] --> Ingest
    
    style EdgeRAG fill:#F8F8F8,stroke:#666,stroke-width:2px,color:#000
    style Query fill:#E8F4FD,stroke:#0078D4,stroke-width:2px,color:#000
    style LLM fill:#FFF4E6,stroke:#FF8C00,stroke-width:2px,color:#000
    style VectorDB fill:#F3E8FF,stroke:#7B3FF2,stroke-width:2px,color:#000
    style Response fill:#D4E9D7,stroke:#107C10,stroke-width:2px,color:#000
```
Component Descriptions
Query API:
- REST or gRPC interface
 - Authentication and authorization
 - Rate limiting
 - Request logging
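
A minimal sketch of such an API using FastAPI; the route name, request fields, and the `rag_pipeline` helper are illustrative assumptions rather than a prescribed interface:
```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class QueryRequest(BaseModel):
    question: str
    top_k: int = 5

@app.post("/query")
def query(request: QueryRequest) -> dict:
    # rag_pipeline is a hypothetical function wiring retrieval + generation together.
    answer, sources = rag_pipeline(request.question, request.top_k)
    return {"answer": answer, "sources": sources}
```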
 
Embedding Model:
- Converts queries and documents to vectors
 - Must be the same model used at indexing time, so query and document vectors share one embedding space
 - Typically 384-1536 dimensions
 
Vector Database:
- Stores document embeddings
 - Fast similarity search
 - Metadata filtering
 
LLM (Large Language Model):
- Generates final answer
 - Uses retrieved context
 - 7B to 70B parameters (larger = better but slower)
 
Document Ingestion:
- Processes new documents
 - Chunks text appropriately
 - Generates and stores embeddings
 
Local LLM Deployment Considerations
Model Size vs. Quality Trade-Off
7B Parameter Models:
- Examples: LLaMA 2 7B, Mistral 7B
 - Memory: 14-16 GB VRAM (float16)
 - Quality: Good for simple Q&A
 - Speed: Fast (30-50 tokens/sec)
 - Use Case: High-throughput scenarios
 
13B-15B Parameter Models:
- Examples: LLaMA 2 13B, Vicuna 13B
 - Memory: 26-30 GB VRAM
 - Quality: Better reasoning, more coherent
 - Speed: Moderate (20-30 tokens/sec)
 - Use Case: Balanced scenarios
 
30B-40B Parameter Models:
- Examples: Falcon 40B, Code Llama 34B
 - Memory: 60-80 GB VRAM (requires multi-GPU or quantization)
 - Quality: Strong performance
 - Speed: Slower (10-15 tokens/sec)
 - Use Case: Quality-critical applications
 
70B Parameter Models:
- Examples: LLaMA 2 70B
 - Memory: 140+ GB VRAM (multi-GPU required)
 - Quality: Excellent, approaches GPT-3.5
 - Speed: Slow (5-10 tokens/sec)
 - Use Case: Highest quality requirements
 
Quantization Techniques
What is Quantization: Reducing the numeric precision of model weights to save memory and speed up inference.
INT8 Quantization:
- 8-bit integers instead of 16-bit floats
 - 2x memory reduction
 - Minimal quality loss (< 1%)
 - Example: 70B model fits in 70 GB instead of 140 GB
 
INT4 / GPTQ / GGML:
- 4-bit quantization
 - 4x memory reduction
 - Some quality degradation (5-10%)
 - 70B model fits in 35 GB
 
Recommendations:
- INT8 for production (good quality/size trade-off)
 - INT4 for resource-constrained edge
 - Full precision (FP16) only if hardware allows
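
For illustration, a minimal sketch of loading a 4-bit (NF4) quantized model with Hugging Face transformers and bitsandbytes; the model ID and parameter choices are assumptions to adapt to your hardware:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Hypothetical model ID; substitute the model approved for your deployment.
model_id = "mistralai/Mistral-7B-Instruct-v0.2"

# 4-bit (NF4) quantization roughly quarters the memory footprint vs. FP16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # spread layers across available GPUs
)
```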
 
Inference Optimization
vLLM (Fast Inference):
- PagedAttention algorithm
 - 2-4x higher throughput than standard Hugging Face inference (see the sketch at the end of this subsection)
 - Lower memory usage
 
Text Generation Inference (TGI):
- Hugging Face solution
 - Production-ready
 - Good scaling
 
llama.cpp / GGML:
- C++ implementation
 - CPU-friendly
 - Apple Silicon optimized
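
As referenced above, a minimal sketch of offline batch inference with vLLM; the model name and sampling parameters are assumptions:
```python
from vllm import LLM, SamplingParams

# Any locally available Hugging Face-format checkpoint can be used here.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
params = SamplingParams(temperature=0.2, max_tokens=256)

outputs = llm.generate(["Summarize the shutdown procedure for line 3."], params)
print(outputs[0].outputs[0].text)
```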
 
Vector Database at the Edge
Database Options
Chroma:
- Pros: Simple, Python-native, easy setup
 - Cons: Less scalable (< 1M vectors)
 - Best For: POCs, small deployments
 
Milvus:
- Pros: Highly scalable, production-ready, feature-rich
 - Cons: More complex setup
 - Best For: Enterprise deployments
 
Qdrant:
- Pros: Rust-based (fast), modern API, good filtering
 - Cons: Newer (less mature ecosystem)
 - Best For: Performance-critical applications
 
FAISS (Facebook AI Similarity Search):
- Pros: Very fast, library (not database), well-tested
 - Cons: No metadata filtering, no built-in persistence layer (indexes must be saved and loaded manually)
 - Best For: In-memory search, prototyping
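
To make the options concrete, here is a minimal Chroma sketch; the collection name, path, toy vectors, and metadata fields are assumptions, and the same add/query pattern carries over conceptually to Milvus and Qdrant:
```python
import chromadb

# Persistent local store; the path is an assumption for this sketch.
client = chromadb.PersistentClient(path="./edge_rag_store")
collection = client.get_or_create_collection("site_documents")

# Upsert pre-computed embeddings with metadata for filtering.
collection.add(
    ids=["doc1-chunk0"],
    embeddings=[[0.12, 0.34, 0.56]],  # toy 3-dim vector; real models emit 384-1536 dims
    documents=["Pump A requires quarterly seal inspection."],
    metadatas=[{"source": "maintenance_manual.pdf", "section": "4.2"}],
)

results = collection.query(query_embeddings=[[0.11, 0.33, 0.55]], n_results=1)
```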
 
Indexing Strategies
Flat Index:
- Exact search
 - Slow for large datasets
 - 100% recall
 
IVF (Inverted File):
- Cluster vectors, search nearest clusters
 - Fast, good recall (95-99%)
 - Good balance
 
HNSW (Hierarchical Navigable Small World):
- Graph-based index
 - Very fast queries
 - High memory usage
 
PQ (Product Quantization):
- Compress vectors
 - Lower memory usage
 - Some accuracy loss
 
Recommendation: HNSW for most cases, IVF for memory-constrained.
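
A small FAISS sketch contrasting a flat (exact) index with HNSW; the dimensionality, data, and parameters are placeholder assumptions:
```python
import numpy as np
import faiss

dim = 384                                                 # embedding dimensionality (model-dependent)
vectors = np.random.rand(10_000, dim).astype("float32")   # placeholder embeddings

# Exact (flat) index: 100% recall, linear scan over all vectors.
flat = faiss.IndexFlatL2(dim)
flat.add(vectors)

# HNSW index: approximate, much faster at query time, higher memory use.
hnsw = faiss.IndexHNSWFlat(dim, 32)   # 32 = graph connectivity parameter (M)
hnsw.add(vectors)

query = np.random.rand(1, dim).astype("float32")
distances, ids = hnsw.search(query, 5)   # top-5 nearest neighbors
```
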
Data Ingestion and Indexing Pipeline
Document Processing Steps
1. Document Loading:
- Supported formats: PDF, DOCX, TXT, HTML, MD, CSV
 - Tools: PyPDF2, python-docx, Beautiful Soup, Unstructured.io
2. Text Extraction:
- Extract clean text
 - Preserve structure (headings, lists)
 - Handle tables and images (OCR if needed)
 
3. Chunking:
- Split text into overlapping chunks (a code sketch follows this pipeline)
 - Typical: 500-1000 tokens per chunk
 - Overlap: 100-200 tokens
 
4. Metadata Extraction:
- Document title, author, date
 - Section/chapter
 - Tags/categories
 - Source URL or file path
 
5. Embedding Generation:
- Run embedding model on each chunk
 - Batch processing for efficiency
 - Store embeddings with metadata
 
6. Vector Database Indexing:
- Insert embeddings into vector DB
 - Build search index
 - Enable metadata filters
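
As referenced in step 3, a minimal chunking sketch using LangChain's RecursiveCharacterTextSplitter; note that chunk_size here counts characters rather than tokens, so treat the values as stand-ins for the token guidance above, and the file path is hypothetical:
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Character-based splitting for brevity; token-based splitting is more precise.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " "],
)

with open("maintenance_manual.txt", encoding="utf-8") as f:  # hypothetical input file
    text = f.read()

chunks = splitter.split_text(text)
```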
 
Continuous Ingestion
Watch Folder Pattern:
1. Monitor directory for new files
2. Detect new/changed documents
3. Process and index automatically
4. Update vector database
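
A minimal sketch of the watch folder pattern using the Python watchdog library; the directory path and the index_document function are assumptions standing in for your ingestion pipeline:
```python
import time
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

class NewDocumentHandler(FileSystemEventHandler):
    def on_created(self, event):
        if not event.is_directory:
            index_document(event.src_path)   # hypothetical ingestion function

observer = Observer()
observer.schedule(NewDocumentHandler(), path="/data/incoming", recursive=True)
observer.start()
try:
    while True:
        time.sleep(5)   # keep the watcher alive
finally:
    observer.stop()
    observer.join()
```
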
Scheduled Batch Jobs:
- Run nightly or weekly
 - Process accumulated documents
 - Rebuild indices if needed
 
Event-Driven:
- Triggered by application events
 - Real-time indexing
 - Good for dynamic content
 
Query Processing and Response Generation
Query Flow
```text
1. User submits natural language query
   ↓
2. Query → Embedding Model → Query Vector
   ↓
3. Vector Database Search (top K similar documents)
   ↓
4. Retrieved documents + Query → Prompt for LLM
   ↓
5. LLM generates answer with citations
   ↓
6. Response returned to user
```
Prompt Engineering for RAG
Template:
```text
Context: {retrieved_documents}
Question: {user_query}
Instructions:
- Answer the question based only on the context provided
- If the context doesn't contain enough information, say so
- Cite the source for each fact you use
- Be concise but complete
Answer:
```
Best Practices:
- Clear instructions to use context
 - Request citations
 - Specify answer format/length
 - Include examples if helpful
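
A small helper that fills the template above from retrieved chunks; the expected dict keys ("text", "source") are assumptions about how retrieval results are structured:
```python
def build_rag_prompt(user_query: str, retrieved_docs: list[dict]) -> str:
    """Fill the RAG template with retrieved chunks and their sources."""
    context = "\n\n".join(
        f"[{i + 1}] (source: {d['source']})\n{d['text']}"
        for i, d in enumerate(retrieved_docs)
    )
    return (
        f"Context:\n{context}\n\n"
        f"Question: {user_query}\n\n"
        "Instructions:\n"
        "- Answer the question based only on the context provided\n"
        "- If the context doesn't contain enough information, say so\n"
        "- Cite the source for each fact you use\n"
        "- Be concise but complete\n\n"
        "Answer:"
    )
```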
 
Retrieval Strategies
Top-K Retrieval:
- Return K most similar documents (typically 3-5)
 - Simple and effective
 
MMR (Maximal Marginal Relevance):
- Balance similarity and diversity
 - Avoid redundant results
 
Re-ranking:
- Initial retrieval: top 20-50
 - Re-rank with more sophisticated model
 - Return top 5 to LLM
 
Hybrid Search:
- Combine vector and keyword search
 - Weighted fusion of results
 - More robust
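
One common way to fuse vector and keyword results is reciprocal rank fusion (RRF); a minimal sketch operating on two ranked lists of document IDs:
```python
def reciprocal_rank_fusion(vector_hits: list[str], keyword_hits: list[str], k: int = 60) -> list[str]:
    """Fuse two ranked lists of document IDs into a single ranking."""
    scores: dict[str, float] = {}
    for hits in (vector_hits, keyword_hits):
        for rank, doc_id in enumerate(hits):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

fused = reciprocal_rank_fusion(
    vector_hits=["doc3", "doc1", "doc7"],
    keyword_hits=["doc1", "doc9", "doc3"],
)
```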
 
Orchestration and Response Generation
RAG Orchestration Tools
LangChain:
```python
# Assumes `local_llm` (a LangChain-compatible local LLM) and `vector_store`
# (a populated LangChain vector store) are already initialized.
from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(
    llm=local_llm,
    retriever=vector_store.as_retriever(),
    return_source_documents=True,  # keep the retrieved chunks for citations
)

result = qa_chain({"query": "What is Azure Local?"})
answer = result["result"]
sources = result["source_documents"]
```
LlamaIndex:
```python
# Assumes `local_llm` and a list of loaded `documents` are already available.
from llama_index import VectorStoreIndex, ServiceContext

service_context = ServiceContext.from_defaults(llm=local_llm)
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

query_engine = index.as_query_engine()
response = query_engine.query("What is Azure Local?")
```
Streaming Responses
Why Streaming:
- Better user experience (see tokens as generated)
 - Lower perceived latency
 - Ability to cancel long responses
 
Implementation:
```python
for token in llm.stream("Query..."):
    print(token, end="", flush=True)
```
Caching and Performance Optimization
Caching Strategies
Query Caching:
- Cache common queries and responses
 - Instant response for repeated questions
 - Use semantic similarity for cache lookup
 
Embedding Caching:
- Cache document embeddings
 - Avoid re-computing for unchanged documents
 
Retrieved Context Caching:
- Cache retrieval results for popular queries
 - Reduces vector DB load
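
A minimal semantic query cache sketch; embed_fn is an assumed callable that maps text to a NumPy vector, and the similarity threshold is a tunable assumption:
```python
import numpy as np

class SemanticQueryCache:
    """Return a cached answer when a new query is close enough in embedding space."""

    def __init__(self, embed_fn, threshold: float = 0.92):
        self.embed_fn = embed_fn        # assumed callable: str -> np.ndarray
        self.threshold = threshold
        self.entries: list[tuple[np.ndarray, str]] = []

    def get(self, query: str):
        q = self.embed_fn(query)
        q = q / np.linalg.norm(q)
        for vec, answer in self.entries:
            if float(np.dot(q, vec)) >= self.threshold:
                return answer               # cache hit: skip retrieval and generation
        return None

    def put(self, query: str, answer: str):
        q = self.embed_fn(query)
        self.entries.append((q / np.linalg.norm(q), answer))
```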
 
Performance Optimizations
Batch Processing:
- Process multiple documents at once
 - Batch embed generation
 - More efficient GPU utilization
 
Asynchronous Processing:
- Non-blocking I/O
 - Process queries in parallel
 - Better throughput
 
Load Balancing:
- Multiple LLM instances
 - Distribute queries
 - Horizontal scaling
 
Integration with Azure Services (Connected Mode)
Optional Cloud Integration
When deployed in Connected Mode on Azure Local:
Azure OpenAI (for comparison/fallback):
- Use cloud LLM for complex queries
 - Hybrid approach (edge for sensitive, cloud for general)
 
Azure Cognitive Search:
- Offload some vector search to cloud
 - Hybrid local + cloud knowledge base
 
Azure Monitor:
- Query performance telemetry
 - Model performance metrics
 - Usage analytics
 
Azure Blob Storage:
- Backup vector database
 - Archive old documents
 - Disaster recovery
 
Important: Sync only approved, non-sensitive data to the cloud!
Disconnected Operation Scenarios
Fully Air-Gapped Deployment
Requirements:
- All models pre-downloaded
 - Vector database fully local
 - No internet dependency
 
Challenges:
- Model updates manual
 - Document ingestion from local sources only
 - No cloud-based monitoring
 
Solutions:
- USB/physical media for updates
 - Local monitoring dashboards
 - Comprehensive local logging
 
Update Management
Model Updates:
- Download new models in secure environment
 - Transfer to air-gapped network
 - Test before deploying
 - Rollback capability
 
Document Updates:
- Scheduled ingestion from local repositories
 - Version control for knowledge base
 - Incremental indexing
 
Monitoring and Maintenance at the Edge
Key Metrics to Monitor
Query Performance:
- Query latency (end-to-end)
 - Retrieval time
 - LLM inference time
 - Queries per second
 
Quality Metrics:
- User satisfaction scores
 - Answer accuracy (if ground truth available)
 - Citation accuracy
 
Resource Utilization:
- GPU utilization
 - Memory usage
 - Storage usage
 - CPU utilization
 
Failure Metrics:
- Query errors
 - Timeout rate
 - Model load failures
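
One way to expose these metrics locally is the Prometheus Python client; in this sketch the metric names, port, and the retrieve/generate functions are assumptions:
```python
import time
from prometheus_client import Histogram, Counter, start_http_server

RETRIEVAL_SECONDS = Histogram("rag_retrieval_seconds", "Vector search latency")
INFERENCE_SECONDS = Histogram("rag_llm_inference_seconds", "LLM generation latency")
QUERY_ERRORS = Counter("rag_query_errors_total", "Failed queries")

start_http_server(9100)  # scrape endpoint for a local Prometheus/Grafana stack

def answer_query(query: str) -> str:
    try:
        with RETRIEVAL_SECONDS.time():
            docs = retrieve(query)          # hypothetical retrieval function
        with INFERENCE_SECONDS.time():
            return generate(query, docs)    # hypothetical generation function
    except Exception:
        QUERY_ERRORS.inc()
        raise
```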
 
Maintenance Tasks
Daily:
- Monitor disk space
 - Review slow queries
 - Update monitoring dashboards
 
Weekly:
- Analyze usage patterns
 - Review user feedback
 - Plan capacity
 
Monthly:
- Evaluate model performance
 - Consider index rebuild
 - Review and update documents
 
Hardware Requirements
Minimum (7B Model):
- GPU: NVIDIA RTX 4090 (24 GB) or A10 (24 GB)
 - RAM: 32 GB
 - Storage: 500 GB SSD
 - CPU: 8+ cores
 
Recommended (13B Model):
- GPU: NVIDIA A100 (40 GB) or A30 (24 GB)
 - RAM: 64 GB
 - Storage: 1 TB NVMe SSD
 - CPU: 16+ cores
 
Enterprise (70B Model):
- GPU: 2x NVIDIA A100 (80 GB) or H100
 - RAM: 256 GB
 - Storage: 2 TB NVMe SSD
 - CPU: 32+ cores
 
Next Steps
Last Updated: October 2025