Vector Databases for Edge
Overview
Vector databases are the foundation of RAG systems, enabling fast similarity search across millions of embeddings. This page explores vector database selection, indexing strategies, and optimization techniques for enterprise edge deployments where performance and resource efficiency are critical.
Vector Database Selection Matrix
Evaluation Criteria
- Performance
  - Query latency: <50ms for 1M+ vectors
  - Throughput: 1,000+ queries per second
  - Recall accuracy: >95% for top-k results
- Scalability
  - Support for 100M+ vectors
  - Horizontal sharding
  - Memory efficiency (<5KB per vector)
- Enterprise Features
  - RBAC & authentication
  - Encryption at rest/transit
  - Replication & failover
  - Backup & recovery
- Kubernetes Integration
  - Container-native deployment
  - Helm charts available
  - Persistent volumes support
  - StatefulSet ready
Edge-Ready Vector Databases
Weaviate
Profile:
- Language: Go
- Model: Modular (pluggable ML modules)
- Scaling: Horizontal sharding
- Enterprise: ✅ Full support
Characteristics:
- Vector Capacity: <50M vectors (single), <500M (sharded)
- Query Latency: <20ms (1M vectors)
- Memory per Vector: ~2KB
- Deployment Size: ~500MB container
- GPU Support: Yes (CUDA/ROCm)
- ML Framework: Pluggable (Hugging Face, Cohere)
Best For:
- Hybrid search (vector + keyword)
- Rapid prototyping
- Multi-tenant deployments
Kubernetes Example:
    helm install weaviate weaviate/weaviate \
      --values values.yaml \
      --set persistence.enabled=true \
      --set persistence.size=100Gi
Qdrant
Profile:
- Language: Rust
- Model: High-performance, optimized
- Scaling: Distributed clusters
- Enterprise: ✅ Strong
Characteristics:
- Vector Capacity: <100M vectors (single), >1B (distributed)
- Query Latency: <30ms (1M vectors)
- Memory per Vector: ~1.5KB
- Deployment Size: ~150MB container
- GPU Support: Limited (CPU optimized)
- Index Type: HNSW, IVF
Best For:
- Large-scale deployments
- High throughput requirements
- Cost-sensitive environments
Performance Comparison:
- 1M vectors: <30ms latency
- 10M vectors: <50ms latency
- 100M vectors: <200ms latency (distributed)
Milvus
Profile:
- Language: C++
- Model: Distributed, cloud-native
- Scaling: Kubernetes-first design
- Enterprise: ✅ Strong
Characteristics:
- Vector Capacity: >1B vectors (distributed)
- Query Latency: <50ms (100M vectors)
- Memory per Vector: ~2KB
- Deployment Size: ~400MB container
- GPU Support: Full (CUDA)
- Cloud Native: Kubernetes operator available
Best For:
- Massive scale (>100M vectors)
- Cloud-native deployments
- Complex filtering requirements
Chroma
Profile:
- Language: Python
- Model: Simple, developer-friendly
- Scaling: Single machine or basic clustering
- Enterprise: ⚠️ Limited
Characteristics:
- Vector Capacity: <10M vectors
- Query Latency: <15ms (1M vectors)
- Memory per Vector: ~2KB
- Deployment Size: ~100MB container
- GPU Support: No
- Best Use: Development, small scale
Comparison Table
| Database | Max Scale (vectors) | Query Latency | Memory per Vector | Enterprise | Edge-Ready |
|---|---|---|---|---|---|
| Weaviate | <500M (sharded) | <20ms (1M vectors) | ~2KB | ✅ Strong | ✅ Optimal |
| Qdrant | >1B (distributed) | <30ms (1M vectors) | ~1.5KB | ✅ Strong | ✅ Optimal |
| Milvus | >1B (distributed) | <50ms (100M vectors) | ~2KB | ✅ Strong | ✅ Good |
| Chroma | <10M | <15ms (1M vectors) | ~2KB | ⚠️ Limited | ✅ Dev |
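
The per-vector memory column translates directly into a rough RAM budget. The sketch below is only an illustration of that arithmetic: the function name `estimate_memory_gb` and the `headroom` multiplier are hypothetical, and the per-vector figures are the approximate values quoted in the table, not measurements.

```python
# Approximate per-vector memory in KB, copied from the comparison table.
MEMORY_KB_PER_VECTOR = {
    "Weaviate": 2.0,
    "Qdrant": 1.5,
    "Milvus": 2.0,
    "Chroma": 2.0,
}


def estimate_memory_gb(database: str, num_vectors: int, headroom: float = 1.0) -> float:
    """Rough RAM estimate in GB (decimal units: 1 GB = 1,000,000 KB).

    `headroom` is a hypothetical safety multiplier for caches, metadata,
    and indexing spikes; 1.0 means no extra margin.
    """
    total_kb = MEMORY_KB_PER_VECTOR[database] * num_vectors * headroom
    return total_kb / 1_000_000


if __name__ == "__main__":
    for name in MEMORY_KB_PER_VECTOR:
        gb = estimate_memory_gb(name, 10_000_000)
        print(f"{name}: about {gb:.0f} GB for 10M vectors")
```

On an edge node the result is best compared against the memory actually available to the pod, not the host total.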
Vector Indexing Strategies
Index Types & Trade-offs
HNSW (Hierarchical Navigable Small World)
Best for: Most edge deployments
Characteristics:
- Search: O(log n) complexity
- Insert: O(log n) complexity
- Memory: ~2KB per vector
- Latency: <10ms for 1M vectors
- Accuracy: 99%+
Configuration:
- ef_construction: 200-400 (higher = better quality, slower build)
- max_connections: 16-64 (higher = better search, more memory)
- ef_search: 100-200 (higher = better accuracy, slower search)
Use When:
- Sub-100ms latency required
- Memory can absorb the graph overhead (~2KB per vector, versus ~1KB for IVF)
- Real-time indexing needed
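
The ~2KB per vector figure can be sanity-checked with a common rule of thumb: raw vector bytes plus the layer-0 neighbour links. The sketch below assumes Float32 components, 4-byte link IDs, and up to `2 × m` links per vector on layer 0; it ignores upper layers and allocator overhead, so treat the result as a lower bound. The function name is hypothetical.

```python
def hnsw_bytes_per_vector(
    dims: int,
    m: int,
    bytes_per_component: int = 4,  # Float32
    bytes_per_link: int = 4,       # assumed 32-bit neighbour IDs
) -> int:
    """Approximate HNSW memory per vector: raw data plus layer-0 links."""
    vector_bytes = dims * bytes_per_component
    # Layer 0 typically keeps up to 2 * m neighbours per node.
    link_bytes = 2 * m * bytes_per_link
    return vector_bytes + link_bytes


if __name__ == "__main__":
    for m in (16, 32, 64):
        size = hnsw_bytes_per_vector(dims=384, m=m)
        print(f"m={m}: about {size} bytes per 384-dim vector")
```

For 384 dimensions, most of the footprint is the vector itself, which is why reducing precision usually saves more memory than lowering `m`.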
Example:
    {
      "indexType": "hnsw",
      "hnsw": {
        "m": 32,
        "efConstruction": 200,
        "efSearch": 200
      }
    }
IVF (Inverted File)
Best for: Large-scale deployments (>100M vectors)
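Characteristics:
- Search: roughly O(nlist + nprobe × n/nlist) distance computations (compare against the centroids, then scan only the probed partitions)
- Insert: O(nlist) (assign the vector to its nearest centroid)
- Memory: ~1KB per vector
- Latency: <50ms for 100M vectors
- Accuracy: 95-98%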
Configuration:
- nlist: 100-1000 (number of partitions)
- nprobe: 10-100 (partitions to search)
- trainable: Training vectors (>100K)
Use When:
- Scale >100M vectors
- Memory constraints
- Batch indexing acceptable
Trade-offs:
- Slower search than HNSW
- Better memory efficiency
- Requires training phase
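
To make the roles of `nlist` and `nprobe` concrete, here is a minimal pure-Python sketch of an inverted-file index. It is a toy under stated assumptions, not how any of the databases above implement IVF: the training phase is a few rounds of plain k-means, distances are Euclidean, and all function names are hypothetical.

```python
import math
import random
from typing import List

Vector = List[float]


def nearest_centroid(vector: Vector, centroids: List[Vector]) -> int:
    """Index of the centroid closest to `vector`."""
    return min(range(len(centroids)), key=lambda c: math.dist(vector, centroids[c]))


def train_centroids(vectors: List[Vector], nlist: int,
                    iterations: int = 5, seed: int = 0) -> List[Vector]:
    """Training phase: a few k-means rounds yielding `nlist` centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, nlist)
    for _ in range(iterations):
        buckets: List[List[Vector]] = [[] for _ in range(nlist)]
        for v in vectors:
            buckets[nearest_centroid(v, centroids)].append(v)
        for c, members in enumerate(buckets):
            if members:  # keep the previous centroid if a partition is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def build_ivf(vectors: List[Vector], centroids: List[Vector]) -> List[List[int]]:
    """Assign every vector index to the inverted list of its nearest centroid."""
    lists: List[List[int]] = [[] for _ in centroids]
    for idx, v in enumerate(vectors):
        lists[nearest_centroid(v, centroids)].append(idx)
    return lists


def ivf_search(query: Vector, vectors: List[Vector], centroids: List[Vector],
               lists: List[List[int]], nprobe: int, k: int) -> List[int]:
    """Scan only the `nprobe` partitions whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: math.dist(query, centroids[c]))
    candidates = [i for c in order[:nprobe] for i in lists[c]]
    return sorted(candidates, key=lambda i: math.dist(query, vectors[i]))[:k]


if __name__ == "__main__":
    rng = random.Random(42)
    data = [[rng.random() for _ in range(8)] for _ in range(1000)]
    centroids = train_centroids(data, nlist=16)
    lists = build_ivf(data, centroids)
    print(ivf_search(data[0], data, centroids, lists, nprobe=4, k=5))
```

Raising `nprobe` toward `nlist` scans more partitions and approaches exact search, which is the accuracy-versus-speed trade-off listed above.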
Flat Search
Best for: Small datasets or maximum accuracy
Characteristics:
- Search: O(n) complexity (linear)
- Insert: O(1) complexity (instant)
- Memory: No index overhead
- Latency: Proportional to dataset size
- Accuracy: 100%
Use When:
- <1M vectors
- Accuracy critical
- Preprocessing acceptable
Multi-Index Strategy for Edge
Optimize for your workload:
| Dataset Size | Primary Index | Secondary | Rationale |
|---|---|---|---|
| <1M vectors | Flat Search | - | Fast, accurate |
| 1-10M vectors | HNSW | Flat | Balance speed/memory |
| 10-100M vectors | HNSW+IVF | - | Distributed search |
| >100M vectors | IVF+Sharding | HNSW | Partition by shard |
Embedding Model Selection
Embedding Model Characteristics
| Model | Dimensions | Size | Speed | Quality | Edge-Ready |
|---|---|---|---|---|---|
| all-minilm-l6-v2 | 384 | 90MB | Fast | Good | ✅ Best |
| bge-base-en | 768 | 440MB | Medium | Optimal | ✅ Good |
| multilingual-e5-m | 384 | 120MB | Fast | Good | ✅ Good |
| OpenAI text-embed | 1536 | API | Slow | Optimal | ❌ Cloud |
| Cohere embed-en | 1024 | API | Slow | Optimal | ❌ Cloud |
Embedding Generation Pipeline
    Raw Text
     │
     ├─ Tokenization (text → token IDs)
     │    • BPE, WordPiece, or SentencePiece
     │    • Token limit: 512 typical
     │
     ├─ Embedding Computation (tokens → vector)
     │    • Transformer model inference
     │    • Output: Float32 vector (384-1536 dims)
     │
     └─ Vector Normalization
          • L2 normalization (improves similarity comparison)
          • Output: unit-length vector (each component in [-1, 1])
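
The normalization step is worth seeing in code, because it is what lets a plain dot product stand in for cosine similarity. This is a minimal pure-Python sketch; the helper names are hypothetical, and a production pipeline would typically do this on arrays inside the embedding library.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(dot(vector, vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


if __name__ == "__main__":
    a, b = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
    # After normalization the dot product matches cosine similarity.
    print(dot(l2_normalize(a), l2_normalize(b)))
    print(cosine_similarity(a, b))
```

Normalizing once at indexing time means each query only pays for a dot product, not two norm computations per comparison.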
Performance:
- Single embedding: 5-50ms (CPU), 1-5ms (GPU)
- Batch 100: 100-200ms (CPU), 10-20ms (GPU)
- Batch 1000: 500-1000ms (CPU), 50-100ms (GPU)
Embedding Storage & Retrieval
Vector Format Optimization:
Standard (Float32):
- Size: 384 dims × 4 bytes = 1.5 KB per vector
- Accuracy: 100%
- Recommended: Production
Half-Precision (Float16):
- Size: 384 dims × 2 bytes = 768 B per vector
- Accuracy: 99.5% (minimal impact)
- Recommended: When memory constrained
- Savings: 50% storage, 2x faster search
Quantized (Int8):
- Size: 384 dims × 1 byte = 384 B per vector
- Accuracy: 98% (acceptable)
- Recommended: Extreme memory constraints
- Savings: 75% storage, 4x faster search
Similarity Search Tuning
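
Storage precision is itself a tuning knob for similarity search, and its cost can be measured instead of assumed. The sketch below rounds vectors to Float16 and Int8 and reports recall@k against the full-precision ranking. It rests on assumptions: the Int8 scheme (multiply by 127 and round) is one simple choice that only suits L2-normalized vectors, the data is random, and all function names are hypothetical. Real databases use their own quantizers, so the numbers it prints are not the accuracy figures quoted on this page.

```python
import math
import random
import struct
from typing import List, Sequence


def to_float16(vector: Sequence[float]) -> List[float]:
    """Round each component to IEEE half precision and back."""
    return [struct.unpack("<e", struct.pack("<e", x))[0] for x in vector]


def to_int8(vector: Sequence[float]) -> List[int]:
    """Scalar-quantize components assumed to lie in [-1, 1] to [-127, 127]."""
    return [max(-127, min(127, round(x * 127))) for x in vector]


def top_k(query: Sequence[float], vectors: Sequence[Sequence[float]], k: int) -> List[int]:
    """Indices of the k vectors with the highest dot product against the query."""
    scores = [sum(q * x for q, x in zip(query, v)) for v in vectors]
    return sorted(range(len(vectors)), key=lambda i: -scores[i])[:k]


def recall_at_k(expected: Sequence[int], actual: Sequence[int]) -> float:
    """Fraction of the expected results that also appear in the actual results."""
    return len(set(expected) & set(actual)) / len(expected)


if __name__ == "__main__":
    rng = random.Random(0)

    def unit(dims: int) -> List[float]:
        v = [rng.gauss(0.0, 1.0) for _ in range(dims)]
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    data = [unit(64) for _ in range(500)]
    query = unit(64)
    baseline = top_k(query, data, 10)
    half = top_k(to_float16(query), [to_float16(v) for v in data], 10)
    int8 = top_k(to_int8(query), [to_int8(v) for v in data], 10)
    print("Float16 recall@10:", recall_at_k(baseline, half))
    print("Int8 recall@10:", recall_at_k(baseline, int8))
```

Running the same measurement on a sample of real embeddings and real queries gives a more trustworthy picture than any generic accuracy percentage.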
Search Quality vs. Performance
| Search Strategy | Accuracy | Speed | Memory | Use Case |
|---|---|---|---|---|
| Exact (Flat) | 100% | Slow | Low | Small data |
| HNSW (ef=200) | 99% | Medium | Medium | Balanced |
| HNSW (ef=500) | 99.5% | Slower | Medium | High quality |
| IVF (nprobe=10) | 95% | Fast | Low | Large scale |
| IVF (nprobe=100) | 98% | Medium | Low | Balanced |
Query Optimization
Retrieve most relevant context:
Basic Query:
    SELECT *
    FROM vectors
    WHERE distance(query_vec, embedding) < threshold
    ORDER BY distance(query_vec, embedding) ASC
    LIMIT 5
Result: 5 closest vectors, ~50ms
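
For readers who prefer code to SQL, the basic query can be sketched as a brute-force scan. This is an illustration only: Euclidean distance is an assumption (the query does not say which metric `distance` uses), and `basic_query` and the in-memory dictionary are hypothetical stand-ins for a real database call.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple


def basic_query(
    query_vec: Sequence[float],
    embeddings: Dict[str, Sequence[float]],
    threshold: float,
    limit: int = 5,
) -> List[Tuple[float, str]]:
    """Return up to `limit` (distance, id) pairs under `threshold`, closest first."""
    scored = ((math.dist(query_vec, vec), vec_id) for vec_id, vec in embeddings.items())
    within = (pair for pair in scored if pair[0] < threshold)  # WHERE ... < threshold
    return heapq.nsmallest(limit, within)  # ORDER BY distance ASC LIMIT n


if __name__ == "__main__":
    store = {"a": [0.0, 0.0], "b": [1.0, 0.0], "c": [3.0, 4.0], "d": [0.0, 0.5]}
    for distance, vec_id in basic_query([0.0, 0.0], store, threshold=2.0):
        print(vec_id, distance)
```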
Optimized Query (with reranking):
1. BM25 keyword search (100 results) → 10ms
2. Vector search (top 50) → 30ms
3. LLM-based reranking (top 5) → 100ms
4. Return top 5 results
Result: More relevant results, ~140ms total
Metadata Filtering
Efficient filtering strategies:
High-Quality Filtering:
    SELECT *
    FROM vectors
    WHERE date_updated > '2024-01-01'
      AND source = 'internal_docs'
      AND status = 'approved'
    ORDER BY distance(query_vec, embedding) ASC
    LIMIT 10
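
The same query can be sketched in Python as a metadata filter followed by a distance ranking over the survivors. Everything here is illustrative: the `Record` class and `filtered_search` function are hypothetical, Euclidean distance is assumed, and a real database would apply the filter through its own metadata indexes and not a list comprehension.

```python
import heapq
import math
from dataclasses import dataclass
from datetime import date
from typing import List, Sequence


@dataclass
class Record:
    id: str
    embedding: List[float]
    date_updated: date
    source: str
    status: str


def filtered_search(query_vec: Sequence[float], records: List[Record],
                    limit: int = 10) -> List[Record]:
    """Apply the metadata predicates first, then rank only what passes."""
    cutoff = date(2024, 1, 1)
    candidates = [
        r for r in records
        if r.date_updated > cutoff
        and r.source == "internal_docs"
        and r.status == "approved"
    ]
    return heapq.nsmallest(
        limit, candidates, key=lambda r: math.dist(query_vec, r.embedding)
    )
```

The distance function only runs on records that pass the filter, so a selective filter shrinks the expensive part of the query.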
Optimization:
- Create indexes on date_updated, source, status
- Filter first (reduce vector search space)
- Then similarity search on filtered subset
- Impact: up to 10x faster queries when the metadata filter is selective
Scaling Vector Databases
Section titled “Scaling Vector Databases”Horizontal Scaling (Sharding)
Distribution strategy:
Logical Shards (by hash):
- Vectors 0-333M: Shard 1 (Server 1)
- Vectors 333M-666M: Shard 2 (Server 2)
- Vectors 666M-1B: Shard 3 (Server 3)
Query Flow:
1. Parse query vector
2. Send to all 3 shards in parallel
3. Collect top-10 from each shard (30 candidates)
4. Re-rank globally
5. Return top-10
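
The query flow above is a scatter-gather pattern, sketched below under simplifying assumptions: each shard is an in-memory dictionary searched by brute force with Euclidean distance, and a thread pool stands in for parallel network calls to separate servers. The function names are hypothetical.

```python
import heapq
import math
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Sequence, Tuple

Shard = Dict[str, Sequence[float]]
Hit = Tuple[float, str]  # (distance, vector id)


def search_shard(shard: Shard, query_vec: Sequence[float], k: int) -> List[Hit]:
    """Local top-k for one shard (brute force stands in for the shard's index)."""
    return heapq.nsmallest(
        k, ((math.dist(query_vec, vec), vec_id) for vec_id, vec in shard.items())
    )


def sharded_search(shards: List[Shard], query_vec: Sequence[float], k: int = 10) -> List[Hit]:
    """Fan the query out to every shard, then re-rank the merged candidates."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda s: search_shard(s, query_vec, k), shards))
    candidates = (hit for partial in partials for hit in partial)
    return heapq.nsmallest(k, candidates)  # global top-k
```

Each shard must return its full local top-k, because in the worst case all k global winners live on a single shard.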
Performance:
- Sharded latency: ~50ms (3 shards searched in parallel, so the slowest shard sets the total)
- vs. single server: 200ms
- Improvement: 4x faster
Replication for High Availability
Master-Replica Setup:
           ┌──────────────┐
           │ Master Node  │  (Write operations)
           │  (Primary)   │
           └──────┬───────┘
                  │
              ┌───┴───┐
              │       │
          ┌───▼──┐ ┌──▼───┐
          │Rep 1 │ │Rep 2 │  (Read operations)
          └──────┘ └──────┘
Redundancy:
- Replica 1: Full copy
- Replica 2: Full copy
- Any single node down: service continues on the remaining nodes
- RPO: 0 (no data loss, assuming synchronous replication)
- RTO: <5s (automatic failover)
Vector Database Operations
Section titled “Vector Database Operations”Backup & Recovery
Enterprise backup strategy:
Backup Schedule:
- Snapshots: Every 6 hours
- Full export: Daily (off-peak)
- WAL (Write-Ahead Logs): Continuous
Recovery Options:
1. Point-in-time restore (< 5 minutes)
2. Full restore from backup (< 30 minutes)
3. Incremental restore (< 10 minutes)
Storage:
- Local: For rapid recovery
- Remote: For disaster recovery
- Cloud storage: Long-term retention
Monitoring & Health
Key metrics to track:
Search Performance:
- Query latency (p50, p95, p99)
- Throughput (queries/second)
- Cache hit rate
Resource Utilization:
- CPU usage
- Memory consumption
- Disk I/O
- Network bandwidth
Data Health:
- Total vectors indexed
- Index fragmentation
- Missing or corrupt entries
Related Topics
- Main Page: Edge RAG Implementation
- Deployment: RAG Deployment Strategies
- LLM Optimization: LLM Inference Optimization
- Operations: RAG Operations & Monitoring
- Assessment: RAG Implementation Knowledge Check
Last Updated: October 21, 2025