RAG Fundamentals Deep Dive
Figure 1: How text is converted to vector embeddings for RAG retrieval
Table of Contents
- Traditional LLMs and Their Limitations
- Information Retrieval Basics
- Embedding Models and Representation Learning
- Vector Databases and Similarity Search
- Fine-Tuning vs. RAG Trade-Offs
- Evaluation Metrics for RAG Systems
- RAG Components and Data Flow
- Practical RAG Components and Tools
- Best Practices
- Next Steps
Traditional LLMs and Their Limitations
How LLMs Work
Large Language Models (LLMs) are neural networks trained on massive text corpora to predict the next token in a sequence.
Training Process:
- Ingest billions of words from the internet, books, and code
- Learn patterns, grammar, facts, reasoning
- Generate coherent, contextual responses
Strengths:
- General knowledge across many domains
- Strong language understanding
- Creative text generation
- Code generation and explanation
Key Limitations
1. Knowledge Cutoff:
- Training data has a cutoff date (e.g., September 2021)
- No awareness of events after training
- Cannot access proprietary organizational data
2. Hallucinations:
- May generate plausible but false information
- Confident incorrect answers
- No way to verify claims without external validation
3. Lack of Source Attribution:
- Cannot cite sources for information
- Difficult to verify accuracy
- No audit trail
4. Static Knowledge:
- Knowledge frozen at training time
- Cannot be easily updated
- Retraining is expensive (often millions of dollars)
5. Context Window Limitations:
- Limited context size (4K-32K tokens typical)
- Cannot process entire large documents
- Loses context in long conversations
Information Retrieval Basics
What is Information Retrieval?
Process of finding relevant documents from a collection based on a query.
Traditional Search:
- Keyword matching (TF-IDF, BM25)
- Boolean queries (AND, OR, NOT)
- Fast but limited by exact matches
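To make the keyword approach concrete, here is a minimal sketch using the open-source rank_bm25 package (the corpus sentences are invented for illustration):

```python
# Minimal BM25 keyword-ranking sketch with the rank_bm25 package
from rank_bm25 import BM25Okapi

corpus = [
    "Azure Local provides sovereign cloud capabilities",
    "Vector databases store high-dimensional embeddings",
    "BM25 ranks documents by keyword overlap",
]
tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

query = "sovereign cloud".lower().split()
print(bm25.get_scores(query))              # one relevance score per document
print(bm25.get_top_n(query, corpus, n=1))  # best keyword match
```

Note the limitation: a query like "data residency" scores zero against every document here despite being conceptually related, which is exactly the gap semantic search closes.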
Semantic Search:
- Understanding meaning, not just keywords
- Finds conceptually similar documents
- Better handles synonyms and paraphrasing
Vector Search Fundamentals
Core Concept: Represent text as numerical vectors (embeddings) in high-dimensional space. Similar meanings → similar vectors.
Process:
- Convert query to vector
- Find nearest vectors in database
- Return corresponding documents
Advantages:
- Semantic similarity
- Language-independent (with multilingual models)
- Efficient at scale
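The three-step process can be shown end to end with plain NumPy; the vectors below are toy values standing in for real embeddings:

```python
import numpy as np

# Toy document vectors (in practice, produced by an embedding model)
doc_vectors = np.array([
    [0.9, 0.1, 0.0],   # "cloud infrastructure"
    [0.1, 0.8, 0.1],   # "cooking recipes"
    [0.8, 0.2, 0.1],   # "sovereign data centers"
])
documents = ["cloud infrastructure", "cooking recipes", "sovereign data centers"]

# 1. Convert query to a vector (toy value here)
query_vector = np.array([0.85, 0.15, 0.05])

# 2. Find nearest vectors via cosine similarity
similarities = doc_vectors @ query_vector / (
    np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query_vector)
)

# 3. Return corresponding documents, best match first
for idx in np.argsort(similarities)[::-1]:
    print(f"{similarities[idx]:.3f}  {documents[idx]}")
```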
Embedding Models and Representation Learning
What Are Embeddings?
Definition: Dense numerical vector representations of text that capture semantic meaning.
Example:
"king" → [0.2, 0.5, -0.1, ..., 0.3] (768 dimensions)
"queen" → [0.18, 0.52, -0.09, ..., 0.29] (similar vector!)
Key Property: Semantically similar text has similar embeddings.
Popular Embedding Models
OpenAI text-embedding-ada-002:
- 1536 dimensions
- General purpose
- Cloud API
- Superseded by text-embedding-3-small and text-embedding-3-large in newer OpenAI releases
sentence-transformers (Open-Source):
- all-MiniLM-L6-v2: Fast, 384 dimensions
- all-mpnet-base-v2: Better quality, 768 dimensions
- Can run locally
Domain-Specific Models:
- BioBERT for biomedical text
- FinBERT for financial documents
- CodeBERT for source code
Embedding Generation Process
```python
# Example with sentence-transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate an embedding for one sentence
text = "Azure Local provides sovereign cloud capabilities"
embedding = model.encode(text)
print(embedding.shape)  # (384,)
```
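A quick way to verify the "similar meaning → similar vector" property is to compare embeddings directly. This sketch reuses the same model with the util.cos_sim helper from sentence-transformers; the example sentences are arbitrary:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = [
    "The king ruled the country",
    "The queen ruled the country",
    "I like pizza",
]
embeddings = model.encode(sentences)

# Pairwise cosine similarities: the two 'ruler' sentences score much higher
# with each other than either does with the unrelated sentence
print(util.cos_sim(embeddings, embeddings))
```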
Vector Databases and Similarity Search
What is a Vector Database?
Specialized database optimized for storing and querying high-dimensional vectors.
Key Capabilities:
- Store millions to billions of vectors
- Fast similarity search (nearest neighbors)
- Metadata filtering
- Hybrid search (vector + keyword)
Popular Vector Databases
Cloud:
- Azure AI Search (formerly Azure Cognitive Search, with vector support)
- Pinecone
- Weaviate Cloud
Self-Hosted / Edge:
- Chroma
- Milvus
- Qdrant
- FAISS (Facebook AI Similarity Search)
Similarity Metrics
Cosine Similarity:
- Measures angle between vectors
- Range: -1 to 1 (higher = more similar)
- Most common for text
Euclidean Distance:
- Straight-line distance
- Lower = more similar
- Sensitive to magnitude
Dot Product:
- Inner product of vectors
- Higher = more similar
- Fast to compute
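All three metrics are one-liners in NumPy; the toy 3-dimensional vectors below echo the earlier "king"/"queen" example:

```python
import numpy as np

a = np.array([0.2, 0.5, -0.1])
b = np.array([0.18, 0.52, -0.09])

dot = np.dot(a, b)                                      # dot product: higher = more similar
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))  # cosine: angle only, ignores magnitude
euclidean = np.linalg.norm(a - b)                       # euclidean: lower = more similar

print(f"dot={dot:.4f}  cosine={cosine:.4f}  euclidean={euclidean:.4f}")
```

For unit-normalized vectors the three metrics produce the same ranking: cosine equals the dot product, and squared Euclidean distance is 2 − 2·cosine. This is why many vector databases normalize embeddings and simply use dot products.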
Search Algorithms
Exact Search:
- Brute force comparison
- 100% accurate
- Slow for large datasets (O(n))
Approximate Nearest Neighbor (ANN):
- Trade accuracy for speed
- 95-99% recall typical
- Much faster (sub-linear time)
- Algorithms: HNSW, IVF, LSH
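Both regimes can be sketched with FAISS, using random vectors in place of real embeddings (IndexFlatL2 is the exhaustive brute-force index; IndexHNSWFlat is the HNSW-based approximate index):

```python
import faiss
import numpy as np

d = 384                                            # embedding dimension
xb = np.random.rand(10_000, d).astype('float32')   # database vectors
xq = np.random.rand(5, d).astype('float32')        # query vectors

# Exact search: brute-force L2 comparison against every vector
flat = faiss.IndexFlatL2(d)
flat.add(xb)
D_exact, I_exact = flat.search(xq, 4)

# Approximate search: HNSW graph index trades a little recall for speed
hnsw = faiss.IndexHNSWFlat(d, 32)                  # 32 = graph connectivity (M)
hnsw.add(xb)
D_ann, I_ann = hnsw.search(xq, 4)

print(I_exact[0], I_ann[0])                        # neighbor IDs largely agree
```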
Fine-Tuning vs. RAG Trade-Offs
```mermaid
graph TB
    Decision{Knowledge Update<br/>Approach?}
    Decision -->|Style/Behavior| FT[Fine-Tuning]
    Decision -->|Factual Knowledge| RAG[RAG System]
    Decision -->|Both| Hybrid[Hybrid Approach]
    FT --> FTPros[Pros:<br/>• Deep domain learning<br/>• No retrieval latency<br/>• Behavior changes]
    FT --> FTCons[Cons:<br/>• Expensive<br/>• Stale knowledge<br/>• Large data needed]
    RAG --> RAGPros[Pros:<br/>• No retraining<br/>• Easy updates<br/>• Citations<br/>• Cost-effective]
    RAG --> RAGCons[Cons:<br/>• Retrieval latency<br/>• Context limits<br/>• Quality critical]
    Hybrid --> HybridDesc[Combine:<br/>1. Fine-tune for style<br/>2. RAG for facts<br/>Best of both worlds]
    style Decision fill:#0078D4,stroke:#004578,stroke-width:3px,color:#fff
    style FT fill:#FFF4E6,stroke:#FF8C00,stroke-width:2px,color:#000
    style RAG fill:#E8F4FD,stroke:#0078D4,stroke-width:2px,color:#000
    style Hybrid fill:#D4E9D7,stroke:#107C10,stroke-width:2px,color:#000
```
Fine-Tuning
What It Is: Further training an LLM on domain-specific data.
Pros:
- Model learns domain knowledge deeply
- No retrieval latency
- Can change model behavior/style
Cons:
- Expensive (compute, time, expertise)
- Risk of catastrophic forgetting
- Knowledge becomes stale
- Requires significant data (thousands of examples)
Best For:
- Changing model style/tone
- Learning specialized vocabulary
- Task-specific optimization
RAG (Retrieval-Augmented Generation)
What It Is: Retrieve relevant documents and provide them as context to the LLM.
Pros:
- No model retraining needed
- Easy to update knowledge (just add documents)
- Grounded responses with citations
- Cost-effective
- Transparent and auditable
Cons:
- Retrieval latency added
- Limited by context window
- Retrieval quality critical
Best For:
- Factual question answering
- Knowledge-intensive tasks
- Frequently changing information
- Providing citations
Hybrid Approach
Combine Both:
- Fine-tune LLM on domain for style and vocabulary
- Use RAG for up-to-date factual information
Example: Fine-tune on medical language, retrieve latest research papers.
Evaluation Metrics for RAG Systems
Retrieval Metrics
Precision:
- Of retrieved documents, how many are relevant?
- Precision = Relevant Retrieved / Total Retrieved
Recall:
- Of all relevant documents, how many were retrieved?
- Recall = Relevant Retrieved / Total Relevant
F1 Score:
- Harmonic mean of precision and recall
- F1 = 2 × (Precision × Recall) / (Precision + Recall)
Mean Reciprocal Rank (MRR):
- How high does the first relevant result rank?
- Reciprocal rank for one query = 1 / rank_of_first_relevant_document; MRR is the mean over all queries
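These retrieval metrics are straightforward to compute per query; a minimal sketch with made-up document IDs:

```python
def retrieval_metrics(retrieved: list[str], relevant: set[str]) -> dict:
    """Compute precision, recall, F1, and reciprocal rank for one query."""
    hits = [doc for doc in retrieved if doc in relevant]
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    rr = 0.0
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            rr = 1 / rank          # reciprocal rank of the first relevant hit
            break
    return {"precision": precision, "recall": recall, "f1": f1, "rr": rr}

# Example: 2 of 4 retrieved docs are relevant; first hit appears at rank 2
print(retrieval_metrics(["d7", "d2", "d9", "d4"], {"d2", "d4", "d5"}))
```

Averaging the rr values over a full query set gives the system's MRR.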
Generation Metrics
Faithfulness:
- Is the generated answer supported by retrieved documents?
- Manually evaluated or with LLM-as-judge
Answer Relevance:
- Does the answer address the question?
Groundedness:
- Is every claim in the answer backed by sources?
BLEU / ROUGE (for reference answers):
- Compare generated answer to gold standard
- Measures n-gram overlap
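ROUGE overlap can be computed with Google's rouge-score package; the reference and generated answers below are invented for illustration:

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
reference = "Azure Local enables sovereign cloud deployments on premises"
generated = "Azure Local supports sovereign on-premises cloud deployments"

# score(target, prediction) returns precision/recall/fmeasure per ROUGE type
scores = scorer.score(reference, generated)
print(scores['rouge1'].fmeasure, scores['rougeL'].fmeasure)
```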
End-to-End Metrics
Answer Accuracy:
- Is the final answer factually correct?
- Requires gold standard QA pairs
Latency:
- Time from query to answer
- Target: < 2-5 seconds for good UX
Cost:
- Per-query cost (LLM API, compute)
- Important for ROI calculations
RAG Components and Data Flow
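A RAG system has two data flows:
- Ingestion: load documents → split into chunks → embed each chunk → store the vectors
- Query: embed the user query → retrieve the nearest chunks → insert them into the LLM prompt → generate a grounded answer
Each stage maps one-to-one onto a component in the minimal implementation shown later in this section.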
Practical RAG Components and Tools
RAG Framework Comparison
LangChain:
- Most popular RAG framework
- Python and JavaScript
- Extensive integrations
- Modular and flexible
LlamaIndex (formerly GPT Index):
- Specialized for RAG
- Better for complex queries
- Strong data connectors
- Query optimization built-in
Haystack:
- Production-ready
- Good for Q&A systems
- Pipeline architecture
- Open-source
Complete RAG Stack Example
Document Processing:
- Unstructured.io for document parsing
- LangChain document loaders
Embedding:
- Sentence-Transformers (local)
- OpenAI embeddings (cloud)
Vector Database:
- Chroma (local, simple)
- Milvus (production, scalable)
LLM:
- LLaMA 2 (open-source, local)
- GPT-4 (cloud, highest quality)
Framework:
- LangChain for orchestration
Minimal RAG Implementation
```python
# Minimal RAG pipeline with LangChain
# (newer LangChain releases relocate most of these imports to langchain_community /
# langchain_text_splitters and prefer .invoke() over .run())
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from langchain.llms import LlamaCpp
from langchain.chains import RetrievalQA

# 1. Load documents
loader = DirectoryLoader('./documents')
documents = loader.load()

# 2. Split into overlapping chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
texts = text_splitter.split_documents(documents)

# 3. Create embeddings (defaults to a sentence-transformers model)
embeddings = HuggingFaceEmbeddings()

# 4. Store chunks in a local vector database
vectorstore = Chroma.from_documents(texts, embeddings)

# 5. Create retriever
retriever = vectorstore.as_retriever()

# 6. Load a local LLM
llm = LlamaCpp(model_path="./llama-2-7b.gguf")

# 7. Create RAG chain
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)

# 8. Query
answer = qa_chain.run("What are the benefits of Azure Local?")
print(answer)
```
Best Practices
1. Chunk Sizing:
- Balance between too small (loss of context) and too large (irrelevant content)
- Typical: 500-1000 tokens
- Use overlap (100-200 tokens) to preserve context
2. Metadata Filtering:
- Add metadata to documents (date, author, category)
- Enable filtered retrieval
- Improves precision
3. Hybrid Search:
- Combine vector similarity with keyword matching
- Best of both worlds
- More robust
4. Reranking:
- Initial retrieval: top 20-50 results
- Rerank with a more sophisticated model (see the sketch after this list)
- Return top 3-5 to LLM
5. Prompt Engineering:
- Clear instructions to LLM
- Ask for citations
- Specify answer format
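As referenced in practice 4, here is a minimal reranking sketch using a cross-encoder from sentence-transformers; the candidate snippets are invented, and in practice they would come from the initial vector search:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
query = "What are the benefits of Azure Local?"
candidates = [
    "Azure Local runs Azure services in your own datacenter",
    "The cafeteria menu changes weekly",
    "Azure Local supports disconnected, sovereign operation",
]

# Score every (query, document) pair jointly, then keep the best few for the LLM
scores = reranker.predict([(query, doc) for doc in candidates])
top_docs = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)[:2]]
print(top_docs)
```

Cross-encoders read the query and document together, so they are far more accurate than embedding similarity alone, but too slow to score the whole corpus; hence the two-stage retrieve-then-rerank pattern.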
Next Steps
Last Updated: October 2025