Skip to content

Edge RAG Architecture


graph TB
    subgraph EdgeRAG[Edge RAG System]
        Query[Query API<br/>REST/gRPC] --> Embed[Embedding Model]
        Embed --> VectorDB[Vector Database]

        VectorDB --> LLM[LLM Local Deployment<br/>7B-70B Parameters<br/>GPU Accelerated]
        Query --> LLM

        LLM --> Response[Response with Sources]

        Ingest[Document Ingestion<br/>PDF, DOCX, TXT, HTML] --> Chunk[Chunking & Processing]
        Chunk --> Embed2[Embedding Generation]
        Embed2 --> VectorDB
    end

    User[User Query] --> Query
    Response --> User
    Docs[Documents] --> Ingest

    style EdgeRAG fill:#F8F8F8,stroke:#666,stroke-width:2px,color:#000
    style Query fill:#E8F4FD,stroke:#0078D4,stroke-width:2px,color:#000
    style LLM fill:#FFF4E6,stroke:#FF8C00,stroke-width:2px,color:#000
    style VectorDB fill:#F3E8FF,stroke:#7B3FF2,stroke-width:2px,color:#000
    style Response fill:#D4E9D7,stroke:#107C10,stroke-width:2px,color:#000

Query API:

  • REST or gRPC interface
  • Authentication and authorization
  • Rate limiting
  • Request logging

Embedding Model:

  • Converts queries and documents to vectors
  • Can be same as used for indexing
  • Typically 384-1536 dimensions

Vector Database:

  • Stores document embeddings
  • Fast similarity search
  • Metadata filtering

LLM (Large Language Model):

  • Generates final answer
  • Uses retrieved context
  • 7B to 70B parameters (larger = better but slower)

Document Ingestion:

  • Processes new documents
  • Chunks text appropriately
  • Generates and stores embeddings

7B Parameter Models:

  • Examples: LLaMA 2 7B, Mistral 7B
  • Memory: 14-16 GB VRAM (float16)
  • Quality: Good for simple Q&A
  • Speed: Fast (30-50 tokens/sec)
  • Use Case: High-throughput scenarios

13B-15B Parameter Models:

  • Examples: LLaMA 2 13B, Vicuna 13B
  • Memory: 26-30 GB VRAM
  • Quality: Better reasoning, more coherent
  • Speed: Moderate (20-30 tokens/sec)
  • Use Case: Balanced scenarios

30B-40B Parameter Models:

  • Examples: Falcon 40B, Code Llama 34B
  • Memory: 60-80 GB VRAM (requires multi-GPU or quantization)
  • Quality: Strong performance
  • Speed: Slower (10-15 tokens/sec)
  • Use Case: Quality-critical applications

70B Parameter Models:

  • Examples: LLaMA 2 70B
  • Memory: 140+ GB VRAM (multi-GPU required)
  • Quality: Excellent, approaches GPT-3.5
  • Speed: Slow (5-10 tokens/sec)
  • Use Case: Highest quality requirements

What is Quantization: Reduce model precision to save memory and improve speed.

INT8 Quantization:

  • 8-bit integers instead of 16-bit floats
  • 2x memory reduction
  • Minimal quality loss (< 1%)
  • Example: 70B model fits in 70 GB instead of 140 GB

INT4 / GPTQ / GGML:

  • 4-bit quantization
  • 4x memory reduction
  • Some quality degradation (5-10%)
  • 70B model fits in 35 GB

Recommendations:

  • INT8 for production (good quality/size trade-off)
  • INT4 for resource-constrained edge
  • Full precision (FP16) only if hardware allows

vLLM (Fast Inference):

  • PagedAttention algorithm
  • 2-4x faster than standard
  • Lower memory usage

Text Generation Inference (TGI):

  • Hugging Face solution
  • Production-ready
  • Good scaling

llama.cpp / GGML:

  • C++ implementation
  • CPU-friendly
  • Apple Silicon optimized

Chroma:

  • Pros: Simple, Python-native, easy setup
  • Cons: Less scalable (< 1M vectors)
  • Best For: POCs, small deployments

Milvus:

  • Pros: Highly scalable, production-ready, feature-rich
  • Cons: More complex setup
  • Best For: Enterprise deployments

Qdrant:

  • Pros: Rust-based (fast), modern API, good filtering
  • Cons: Newer (less mature ecosystem)
  • Best For: Performance-critical applications

FAISS (Facebook AI Similarity Search):

  • Pros: Very fast, library (not database), well-tested
  • Cons: No metadata filtering, not persistent
  • Best For: In-memory search, prototyping

Flat Index:

  • Exact search
  • Slow for large datasets
  • 100% recall

IVF (Inverted File):

  • Cluster vectors, search nearest clusters
  • Fast, good recall (95-99%)
  • Good balance

HNSW (Hierarchical Navigable Small World):

  • Graph-based index
  • Very fast queries
  • High memory usage

PQ (Product Quantization):

  • Compress vectors
  • Lower memory usage
  • Some accuracy loss

Recommendation: HNSW for most cases, IVF for memory-constrained.


1. Document Loading:

Supported formats: PDF, DOCX, TXT, HTML, MD, CSV
Tools: PyPDF2, python-docx, Beautiful Soup, Unstructured.io

2. Text Extraction:

  • Extract clean text
  • Preserve structure (headings, lists)
  • Handle tables and images (OCR if needed)

3. Chunking:

  • Split into overlapping chunks
  • Typical: 500-1000 tokens per chunk
  • Overlap: 100-200 tokens

4. Metadata Extraction:

  • Document title, author, date
  • Section/chapter
  • Tags/categories
  • Source URL or file path

5. Embedding Generation:

  • Run embedding model on each chunk
  • Batch processing for efficiency
  • Store embeddings with metadata

6. Vector Database Indexing:

  • Insert embeddings into vector DB
  • Build search index
  • Enable metadata filters

Watch Folder Pattern:

1. Monitor directory for new files
2. Detect new/changed documents
3. Process and index automatically
4. Update vector database

Scheduled Batch Jobs:

  • Run nightly or weekly
  • Process accumulated documents
  • Rebuild indices if needed

Event-Driven:

  • Triggered by application events
  • Real-time indexing
  • Good for dynamic content

1. User submits natural language query
2. Query → Embedding Model → Query Vector
3. Vector Database Search (top K similar documents)
4. Retrieved documents + Query → Prompt for LLM
5. LLM generates answer with citations
6. Response returned to user

Template:

Context: {retrieved_documents}
Question: {user_query}
Instructions:
- Answer the question based only on the context provided
- If the context doesn't contain enough information, say so
- Cite the source for each fact you use
- Be concise but complete
Answer:

Best Practices:

  • Clear instructions to use context
  • Request citations
  • Specify answer format/length
  • Include examples if helpful

Top-K Retrieval:

  • Return K most similar documents (typically 3-5)
  • Simple and effective

MMR (Maximal Marginal Relevance):

  • Balance similarity and diversity
  • Avoid redundant results

Re-ranking:

  • Initial retrieval: top 20-50
  • Re-rank with more sophisticated model
  • Return top 5 to LLM

Hybrid Search:

  • Combine vector and keyword search
  • Weighted fusion of results
  • More robust

LangChain:

from langchain.chains import RetrievalQA
qa_chain = RetrievalQA.from_chain_type(
llm=local_llm,
retriever=vector_store.as_retriever(),
return_source_documents=True
)
result = qa_chain({"query": "What is Azure Local?"})
answer = result['result']
sources = result['source_documents']

LlamaIndex:

from llama_index import VectorStoreIndex, ServiceContext
service_context = ServiceContext.from_defaults(llm=local_llm)
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
query_engine = index.as_query_engine()
response = query_engine.query("What is Azure Local?")

Why Streaming:

  • Better user experience (see tokens as generated)
  • Lower perceived latency
  • Ability to cancel long responses

Implementation:

for token in llm.stream("Query..."):
print(token, end="", flush=True)

Query Caching:

  • Cache common queries and responses
  • Instant response for repeated questions
  • Use semantic similarity for cache lookup

Embedding Caching:

  • Cache document embeddings
  • Avoid re-computing for unchanged documents

Retrieved Context Caching:

  • Cache retrieval results for popular queries
  • Reduces vector DB load

Batch Processing:

  • Process multiple documents at once
  • Batch embed generation
  • More efficient GPU utilization

Asynchronous Processing:

  • Non-blocking I/O
  • Process queries in parallel
  • Better throughput

Load Balancing:

  • Multiple LLM instances
  • Distribute queries
  • Horizontal scaling

Integration with Azure Services (Connected Mode)

Section titled “Integration with Azure Services (Connected Mode)”

When deployed in Connected Mode on Azure Local:

Azure OpenAI (for comparison/fallback):

  • Use cloud LLM for complex queries
  • Hybrid approach (edge for sensitive, cloud for general)

Azure Cognitive Search:

  • Offload some vector search to cloud
  • Hybrid local + cloud knowledge base

Azure Monitor:

  • Query performance telemetry
  • Model performance metrics
  • Usage analytics

Azure Blob Storage:

  • Backup vector database
  • Archive old documents
  • Disaster recovery

Important: Only sync approved, non-sensitive data to cloud!


Requirements:

  • All models pre-downloaded
  • Vector database fully local
  • No internet dependency

Challenges:

  • Model updates manual
  • Document ingestion from local sources only
  • No cloud-based monitoring

Solutions:

  • USB/physical media for updates
  • Local monitoring dashboards
  • Comprehensive local logging

Model Updates:

  • Download new models in secure environment
  • Transfer to air-gapped network
  • Test before deploying
  • Rollback capability

Document Updates:

  • Scheduled ingestion from local repositories
  • Version control for knowledge base
  • Incremental indexing

Query Performance:

  • Query latency (end-to-end)
  • Retrieval time
  • LLM inference time
  • Queries per second

Quality Metrics:

  • User satisfaction scores
  • Answer accuracy (if ground truth available)
  • Citation accuracy

Resource Utilization:

  • GPU utilization
  • Memory usage
  • Storage usage
  • CPU utilization

Failure Metrics:

  • Query errors
  • Timeout rate
  • Model load failures

Regular:

  • Monitor disk space
  • Review slow queries
  • Update monitoring dashboards

Weekly:

  • Analyze usage patterns
  • Review user feedback
  • Plan capacity

Monthly:

  • Evaluate model performance
  • Consider index rebuild
  • Review and update documents

Minimum (7B Model):

  • GPU: NVIDIA RTX 4090 (24 GB) or A10 (24 GB)
  • RAM: 32 GB
  • Storage: 500 GB SSD
  • CPU: 8+ cores

Recommended (13B Model):

  • GPU: NVIDIA A100 (40 GB) or A30 (24 GB)
  • RAM: 64 GB
  • Storage: 1 TB NVMe SSD
  • CPU: 16+ cores

Enterprise (70B Model):

  • GPU: 2x NVIDIA A100 (80 GB) or H100
  • RAM: 256 GB
  • Storage: 2 TB NVMe SSD
  • CPU: 32+ cores


Last Updated: October 2025