LLM Inference Optimization
Overview
```mermaid
flowchart LR
    subgraph Input["Input Processing"]
        REQ[Request] --> TOK[Tokenizer]
        TOK --> EMB[Embedding]
    end
    subgraph Opt["Optimization Layer"]
        EMB --> QUANT{Quantization<br/>INT4/INT8}
        QUANT --> BATCH[Batch<br/>Processing]
        BATCH --> CACHE[KV Cache]
    end
    subgraph Inference["Model Inference"]
        CACHE --> GPU[GPU<br/>Compute]
        GPU --> SPEC[Speculative<br/>Decoding]
        SPEC --> OUT[Output<br/>Tokens]
    end
    subgraph Output["Response"]
        OUT --> STREAM[Streaming<br/>Response]
        STREAM --> CLIENT[Client]
    end
    style Input fill:#E8F4FD,stroke:#0078D4,stroke-width:2px,color:#000
    style Opt fill:#FFF4E6,stroke:#FF8C00,stroke-width:2px,color:#000
    style Inference fill:#F3E8FF,stroke:#7B3FF2,stroke-width:2px,color:#000
    style Output fill:#D4E9D7,stroke:#107C10,stroke-width:2px,color:#000
```
Figure 1: LLM inference optimization pipeline for edge deployments
LLM inference optimization is critical for edge RAG systems where hardware is constrained and latency requirements are strict. This page covers quantization techniques, prompt engineering, batch processing, and hardware-aware optimization strategies to maximize throughput and minimize latency.
Quantization Techniques
Quantization Fundamentals
Problem: LLMs are memory-intensive
- Llama 2 7B in FP32 = 28 GB VRAM
- Llama 2 70B in FP32 = 280 GB VRAM
- Most edge hardware: 24-80 GB VRAM
Solution: Reduce numerical precision while keeping quality loss small
Quantization Levels
| Precision Type | Bits | Size (7B Model) | Memory Reduction | Quality Loss |
|---|---|---|---|---|
| FP32 (Full) | 32 | 28 GB | 0% | None |
| FP16 (Half) | 16 | 14 GB | 50% | <1% |
| INT8 | 8 | 7 GB | 75% | 2-3% |
| INT4 | 4 | 3.5 GB | 87.5% | 5-8% |
| Binary | 1 | 0.875 GB | 96.875% | 20%+ (not practical) |
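The sizes in this table are simply parameter count × bytes per weight; they ignore activation memory and the KV cache, which add to the real footprint. A minimal sketch of that back-of-the-envelope arithmetic (the helper name is illustrative):

```python
# Back-of-the-envelope weight memory: parameters x bits per weight / 8,
# reported in decimal GB. Activations, KV cache, and framework overhead
# are deliberately ignored, matching the table above.
def model_memory_gb(num_params: float, bits_per_weight: float) -> float:
    return num_params * bits_per_weight / 8 / 1e9

# 7B-parameter model at the precisions listed above
for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4), ("Binary", 1)]:
    print(f"{name:>6}: {model_memory_gb(7e9, bits):6.3f} GB")
```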
INT4 Quantization (Recommended for Edge)
Advantages:
- 87.5% memory reduction
- 2-4x faster inference
- Minimal quality loss (5-8%)
- Fits 70B models on single 80GB GPU
Example Configuration:
```json
{
  "quantization": {
    "type": "int4",
    "method": "GPTQ",
    "groupSize": 128,
    "actOrder": true
  }
}
```
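The JSON above describes a GPTQ-style setup. If you load models through Hugging Face transformers, a comparable ~4-bit footprint can be had with bitsandbytes NF4 loading; this is a hedged sketch rather than the configuration format above, and the model ID is an assumption:

```python
# Sketch: 4-bit (NF4) loading via bitsandbytes in Hugging Face transformers.
# This is an alternative to GPTQ with a similar memory footprint; the model
# ID and exact settings are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # 4-bit NormalFloat
    bnb_4bit_compute_dtype=torch.float16,  # compute in FP16
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
)

model_id = "meta-llama/Llama-2-7b-hf"      # assumed; substitute your model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```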
Trade-offs:
FP32 → INT4 Conversion:
Speed Gain: 2.5-4x faster
Latency: ~200ms → ~67ms per token
Quality Impact: 5-8% (measured by benchmarks)
Tokens/Second: 5 → 15 tokens/sec
Best Use: Production deployments
Recommended: Edge with latency constraints
INT8 Quantization (Balanced Option)
Advantages:
- 75% memory reduction
- Minimal quality loss (2-3%)
- Easier post-training quantization
- Good balance point
Use When:
- You have 24-32GB VRAM
- Quality is critical
- You can tolerate 2-3% accuracy loss
Dynamic vs. Static Quantization
Static Quantization:
- Calibration phase: Run model on sample data
- Fixed scaling factors
- Performance: Fast, predictable
- Quality: High
- Setup: 1-2 hours
Dynamic Quantization:
- No calibration needed
- On-the-fly scaling
- Performance: Slightly slower
- Quality: Comparable to static
- Setup: 5 minutes
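Dynamic quantization is the easiest to try because it needs no calibration data. A hedged sketch using PyTorch's post-training dynamic quantization, with a stand-in module in place of a real transformer:

```python
# Sketch: post-training dynamic INT8 quantization in PyTorch (CPU inference).
# Weights are stored as INT8; activation scales are computed on the fly,
# so no calibration pass is needed. The model here is a stand-in.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(4096, 11008),
    nn.ReLU(),
    nn.Linear(11008, 4096),
)

quantized = torch.quantization.quantize_dynamic(
    model,
    {nn.Linear},          # layer types to quantize
    dtype=torch.qint8,
)

x = torch.randn(1, 4096)
with torch.no_grad():
    y = quantized(x)
print(y.shape)  # torch.Size([1, 4096])
```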
Prompt Optimization
Prompt Structure & Performance
Inefficient Prompt (1,500 tokens):
"Tell me about machine learning. Include history,
current applications, future trends, challenges,
and detailed examples of neural networks..."
Tokens: 150 (question) + 1,350 (response)
Time: 2.7 seconds
Quality: Generic
Optimized Prompt (530 tokens):
[SYSTEM]
You are an expert ML engineer. Provide concise, accurate information.
Answer in structured format: Definition, Key Points, Example.
[QUERY]
Define machine learning with 2 key points and 1 example.
Tokens: 50 (system) + 80 (query) + 400 (response)
Time: 0.8 seconds
Quality: Precise, structured
Prompt Templates for RAG
Structure for consistent output:
Template Format:
────────────────────────────────
[CONTEXT]
{relevant_documents}
[QUESTION]
{user_query}
[INSTRUCTION]
Answer based on context. If not in context, say "not found".
Format: 1-2 sentences max.
[ANSWER]
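Applying the template programmatically keeps prompts consistent across requests. A minimal sketch; the helper name and the character-based context budget are assumptions:

```python
# Sketch: fill the RAG prompt template above. The truncation limit and
# helper name are illustrative, not part of the template itself.
def build_rag_prompt(documents: list[str], query: str, max_context_chars: int = 2400) -> str:
    context = "\n\n".join(documents)[:max_context_chars]  # crude context budget
    return (
        "[CONTEXT]\n"
        f"{context}\n"
        "[QUESTION]\n"
        f"{query}\n"
        "[INSTRUCTION]\n"
        'Answer based on context. If not in context, say "not found".\n'
        "Format: 1-2 sentences max.\n"
        "[ANSWER]\n"
    )

prompt = build_rag_prompt(
    ["Edge GPUs for inference typically ship with 24-80 GB of VRAM."],
    "How much VRAM do edge GPUs usually have?",
)
print(prompt)
```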
Token Savings:
Without Template:
Query: 50 tokens
Context: 1000 tokens
Instruction (implicit): 200 tokens
Response: 800 tokens
Total: 2050 tokens = 2.7s
With Template:
Context: 600 tokens (optimized)
Instruction: 50 tokens (explicit)
Response: 300 tokens (forced brevity)
Total: 950 tokens ≈ 1.25s
Improvement: ~2.2x faster (2.7s → 1.25s)
Few-Shot vs. Zero-Shot
Zero-Shot (No examples):
Prompt: "Classify: positive or negative. 'Great product!'"
Tokens: 10
Quality: 85%
Time: 10ms
Few-Shot (2 examples):
Prompt: "Positive: 'Amazing!'
Negative: 'Terrible'
Classify: 'Great product!'"
Tokens: 40
Quality: 95%
Time: 20ms
Decision: Zero-shot for speed, few-shot for quality
Batch Processing for Throughput
Single Request vs. Batching
Single Request Flow:
Request 1 → LLM → Response 1 (600ms)
Request 2 → LLM → Response 2 (600ms)
Request 3 → LLM → Response 3 (600ms)
Request 4 → LLM → Response 4 (600ms)
Request 5 → LLM → Response 5 (600ms)
────────────────────────────────
Total: 3000ms (5 requests)
Throughput: 1.67 req/sec
Batched (5 at once):
Requests 1-5 → LLM (batched) → Responses 1-5 (800ms)
────────────────────────────────
Total: 800ms (5 requests)
Throughput: 6.25 req/sec
Improvement: 3.75x
Optimal Batch Size
| Batch Size | VRAM Used | Latency | Throughput | Efficiency |
|---|---|---|---|---|
| 1 | 8GB | 400ms | 2.5 req/s | Baseline |
| 2 | 12GB | 500ms | 4.0 req/s | 1.6x |
| 4 | 16GB | 700ms | 5.7 req/s | 2.3x |
| 8 | 24GB | 1000ms | 8.0 req/s | 3.2x |
| 16 | 32GB | 1300ms | 12.3 req/s | 4.9x |
| 32 | 48GB | 1800ms | 17.7 req/s | 7.1x |
Hardware: Single 80GB GPU
Model: Llama 2 7B quantized
Recommendation for Edge:
- Batch size 4-8 (balance latency/throughput)
- Monitor VRAM, keep headroom (80% utilization max)
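One way to enforce the 80% headroom rule is to check free VRAM before admitting more requests into a batch. A hedged sketch using PyTorch's CUDA memory query; the threshold and helper name are assumptions:

```python
# Sketch: admit new requests into the batch only while GPU memory
# utilization stays under a headroom threshold (0.8 mirrors the 80%
# recommendation above).
import torch

def has_vram_headroom(max_utilization: float = 0.8) -> bool:
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    used_fraction = 1.0 - free_bytes / total_bytes
    return used_fraction < max_utilization

if torch.cuda.is_available() and has_vram_headroom():
    pass  # safe to grow the in-flight batch by one more request
```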
Dynamic Batching
Queue requests, batch when ready:
| Time | Event | Queue State | Action |
|---|---|---|---|
| 0ms | Request 1 arrives | [R1] | Wait (5ms) |
| 2ms | Request 2 arrives | [R1, R2] | Wait (3ms) |
| 3ms | Request 3 arrives | [R1, R2, R3] | Wait (2ms) |
| 4ms | Request 4 arrives | [R1-R4] | Wait (1ms) |
| 5ms | Request 5 arrives | [R1-R5] | Process immediately |
Batch size: 5
Inference time: 1000ms
Output: Responses 1-5
Results:
- Request 1: 1005ms (5ms batched wait + 1000ms inference)
- Request 5: 1000ms (triggered the batch, no wait)
- Average: ~1002ms
- Without batching (sequential): 5 × 400ms = 2000ms to clear the queue
- Savings: ~50% of the total time to serve all five requests
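The wait-then-flush policy above maps naturally onto an async queue that flushes when the batch is full or when the oldest request has waited long enough. A minimal sketch; the batch size, wait window, and `run_model` stub are assumptions:

```python
# Sketch: dynamic batching with a max batch size and a max wait window.
# run_model() stands in for the real batched inference call.
import asyncio

MAX_BATCH = 5
MAX_WAIT_S = 0.005  # 5 ms collection window, as in the timeline above

async def run_model(prompts: list[str]) -> list[str]:
    await asyncio.sleep(1.0)  # stand-in for ~1000 ms of batched inference
    return [f"response to: {p}" for p in prompts]

async def batcher(queue: asyncio.Queue) -> None:
    """Collect requests until the batch is full or the wait window expires."""
    loop = asyncio.get_running_loop()
    while True:
        batch = [await queue.get()]                 # block for the first request
        deadline = loop.time() + MAX_WAIT_S
        while len(batch) < MAX_BATCH:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        outputs = await run_model([prompt for prompt, _ in batch])
        for (_, future), output in zip(batch, outputs):
            future.set_result(output)

async def submit(queue: asyncio.Queue, prompt: str) -> str:
    future = asyncio.get_running_loop().create_future()
    await queue.put((prompt, future))
    return await future

async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue()
    worker = asyncio.create_task(batcher(queue))
    print(await asyncio.gather(*(submit(queue, f"request {i}") for i in range(5))))
    worker.cancel()

asyncio.run(main())
```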
Latency Optimization
Inference Pipeline Components
Total Latency Breakdown (Llama 2 7B, 100 tokens output):

| Stage | Time | Share |
|---|---|---|
| Tokenization | 10ms | 2% |
| Embedding Lookup | 20ms | 4% |
| Model Inference | 480ms | 87% |
| De-tokenization | 10ms | 2% |
| Post-processing | 30ms | 5% |
| Total | 550ms | 100% |
Optimization Targets:
1. Model Inference (87%) → Quantization, batch size tuning
2. Post-processing (5%) → Caching, async operations
3. Tokenization (2%) → Pre-tokenization, caching
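Before optimizing, it helps to confirm where the time actually goes on your hardware. A hedged sketch that times each stage; it assumes `tokenizer` and `model` were loaded as in the earlier 4-bit loading sketch:

```python
# Sketch: per-stage latency instrumentation. Assumes `tokenizer` and `model`
# are already loaded (e.g. via the 4-bit loading sketch above).
import time
import torch

def timed_generate(prompt: str, max_new_tokens: int = 100) -> dict[str, float]:
    timings: dict[str, float] = {}

    t0 = time.perf_counter()
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    timings["tokenization_ms"] = (time.perf_counter() - t0) * 1000

    t0 = time.perf_counter()
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    timings["inference_ms"] = (time.perf_counter() - t0) * 1000

    t0 = time.perf_counter()
    _text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    timings["detokenization_ms"] = (time.perf_counter() - t0) * 1000

    timings["total_ms"] = sum(timings.values())
    return timings

print(timed_generate("Define machine learning with 2 key points and 1 example."))
```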
Token Generation Optimization
Greedy Decoding (Fastest):
- Each step: Pick highest probability token
- Time: O(n) where n = output tokens
- Quality: Lower (greedy choices)
- 100 tokens: 400ms
Top-k Sampling (Medium):
- Each step: Sample from top-k tokens
- Time: O(n) steps, with a small per-step overhead for top-k selection
- Quality: Better diversity
- 100 tokens: 420ms
Beam Search (Slower):
- Maintains multiple hypotheses
- Time: O(n × beam_width)
- Quality: Highest (explore alternatives)
- 100 tokens: 600ms
Recommendation: Greedy for speed, top-k for balance
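These strategies map directly onto `generate()` arguments in Hugging Face transformers. A hedged sketch, reusing `model` and `tokenizer` from the earlier loading example; the specific parameter values are assumptions:

```python
# Sketch: greedy decoding, top-k sampling, and beam search via transformers'
# generate(). Assumes `model` and `tokenizer` are already loaded.
import torch

inputs = tokenizer("The quick brown", return_tensors="pt").to(model.device)

with torch.no_grad():
    greedy = model.generate(**inputs, max_new_tokens=100, do_sample=False)             # fastest
    top_k = model.generate(**inputs, max_new_tokens=100, do_sample=True, top_k=50)     # balanced
    beam = model.generate(**inputs, max_new_tokens=100, num_beams=4, do_sample=False)  # slowest

for name, ids in [("greedy", greedy), ("top-k", top_k), ("beam", beam)]:
    print(name, tokenizer.decode(ids[0], skip_special_tokens=True))
```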
Speculative Decoding
Predict multiple tokens ahead:
Standard Decoding:
Input: "The quick brown"
Token 1: "fox" (inference)
Token 2: "jumps" (inference)
Token 3: "over" (inference)
Time: 3 × 200ms = 600ms
Speculative Decoding:
Input: "The quick brown"
Draft model proposes: "fox jumps over" (~30ms with a small draft model)
Main model verifies the drafted tokens in a single forward pass (~200ms)
Accepted tokens are kept; the first rejected position is replaced by the main model's own token from that same pass
Best case here: 3 tokens in ~230ms instead of 600ms; frequent rejections shrink the gain back toward standard decoding
Benefit: Works well for predictable text
Savings: Up to 40% or more when the draft model is usually right
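Hugging Face transformers exposes this pattern as assisted generation: a small draft model proposes tokens and the main model verifies them. A hedged sketch; both model IDs are assumptions, and the draft must share the main model's tokenizer:

```python
# Sketch: speculative (assisted) decoding with a small draft model.
# Model IDs are illustrative; the draft model must use the same
# tokenizer/vocabulary as the main model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

main_id = "meta-llama/Llama-2-7b-hf"              # assumed main model
draft_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"   # assumed draft model

tokenizer = AutoTokenizer.from_pretrained(main_id)
model = AutoModelForCausalLM.from_pretrained(main_id, device_map="auto", torch_dtype=torch.float16)
draft = AutoModelForCausalLM.from_pretrained(draft_id, device_map="auto", torch_dtype=torch.float16)

inputs = tokenizer("The quick brown", return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=50, assistant_model=draft)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```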
Throughput Maximization
Multi-Model Serving
Serve multiple models on same GPU:
Scenario:
- Model A: 7B (8GB), used 30% of time
- Model B: 3B (4GB), used 70% of time
- GPU: 80GB total
Without Sharing:
Model A: Allocated 8GB (idle 70% of time)
Model B: Allocated 4GB (loaded when needed)
Total: 12GB used
With Time-Slicing:
Slot 1: Model A (8GB, 100ms)
Slot 2: Model B (4GB, 300ms)
Cycle: 400ms
Result:
- Model A worst-case latency: 100ms inference + up to 300ms waiting for Model B's slot ≈ 400ms
- Model B worst-case latency: 300ms inference + up to 100ms waiting for Model A's slot
- Time-slicing suits the 70/30 usage split: both models stay resident without dedicating a GPU to each
Request Coalescing
Merge similar requests:
Without Coalescing:
User 1: "Summarize document A" (500ms)
User 2: "Summarize document B" (500ms)
Total: 1000ms
With Coalescing:
Batch: "Summarize documents A and B" (600ms)
Split responses
Total: 600ms for both
Savings: 40%
Use Case: Shared queries, bulk operations
Hardware-Aware Optimization
GPU vs. CPU Inference
Model: Llama 7B-INT4 in both cases

| Metric | GPU (RTX 4090) | CPU (64 cores) |
|---|---|---|
| Time to first token | 20ms | 500ms |
| Latency (100 tokens) | 400ms | 15,000ms |
| Throughput | ~250 tokens/sec | ~6.7 tokens/sec |
| VRAM / system memory | 8GB | 60GB |
| Power draw | 300W | 120W |
| Hardware cost | $2,000 | $500 |
| Cost per token | ~$0.0000015 | ~$0.0003 |
Recommendation:
- GPU: >1,000 req/day (amortizes the hardware cost)
- CPU: <100 req/day (simple, no GPU available)
- Edge: Always GPU if available
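A common startup pattern on edge boxes is to probe for a GPU and fall back to a smaller model on CPU when none is present. A minimal sketch; the model choices are assumptions:

```python
# Sketch: choose hardware at startup and size the model accordingly.
# Model identifiers are illustrative.
import torch

if torch.cuda.is_available():
    device = "cuda"
    model_id = "meta-llama/Llama-2-7b-hf"   # 7B INT4 fits comfortably in 8 GB VRAM
else:
    device = "cpu"
    model_id = "microsoft/phi-2"            # ~3B-class model keeps CPU latency tolerable

print(f"Serving {model_id} on {device}")
```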
Mixed Precision (INT8 + FP16 + FP32)
Strategy:
- Layers 1-24: INT8 (fast)
- Layers 25-30: FP16 (more precise)
- Output: FP32 (highest precision)
Result:
- Latency: ~400ms (about 95% of INT4 speed)
- Quality: Better than INT4 (~3% loss vs. 5-8%)
- VRAM: Higher than pure INT4 (roughly 8GB for a 7B model with this layer split)
Best for: Quality-critical applications
Cost Optimization
Cost Per Query Analysis
Model Deployment (Llama 2 7B-INT4)
──────────────────────────────────
Cloud API (OpenAI):
- Cost: $0.01 per 1K tokens
- 100 tokens: $0.001
- 1M queries: $1000/month
Edge Deployment:
- Hardware: $2000 (GPU)
- Amortized hardware: $333/month ($2,000 over 6 months)
- Electricity: $100/month
- Maintenance: $50/month
- Total: ~$483/month for unlimited queries
Break-even: ~483,000 queries/month (≈16K queries/day)
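The break-even point is just the monthly edge cost divided by the per-query cloud cost; a minimal sketch of the arithmetic using the figures above:

```python
# Sketch: cloud-vs-edge break-even using the cost figures above.
CLOUD_COST_PER_QUERY = 0.001           # $0.01 per 1K tokens, ~100 tokens per query
EDGE_MONTHLY_COST = 333 + 100 + 50     # amortized GPU + electricity + maintenance

break_even = EDGE_MONTHLY_COST / CLOUD_COST_PER_QUERY
print(f"Break-even: {break_even:,.0f} queries/month (~{break_even / 30:,.0f}/day)")
# -> Break-even: 483,000 queries/month (~16,100/day)
```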
Recommendation:
- <5K queries/day: Use cloud
- >10K queries/day: Deploy edge
- >50K queries/day: Multi-GPU edge
Energy Efficiency
Power Consumption (Llama 7B):
GPU (RTX 4090):
- Idle: 50W
- Inference: 350W
- 1M queries: 350W × 110 hours = 38.5 kWh
- Cost: $5.80 @ $0.15/kWh
CPU (64 cores):
- Idle: 20W
- Inference: 200W
- 1M queries: 200W × ~4,200 hours = ~830 kWh
- Cost: ~$125 @ $0.15/kWh
GPU is roughly 22x more energy-efficient per query
Best Practices & Trade-offs
Selection Matrix
| Use Case | Model | Quantization | Batch Size | Hardware | Latency | Quality |
|---|---|---|---|---|---|---|
| Real-time Chat | 7B | INT4 | 1-2 | GPU | 100ms | 95% |
| Bulk Processing | 13B | INT8 | 16 | GPU | 8s | 98% |
| Cost-Sensitive | 3B | INT4 | 4 | CPU | 2s | 90% |
| High Quality | 70B | FP16 | 2 | Multi-GPU | 500ms | 99%+ |
Related Topics
- Main Page: Edge RAG Implementation
- Deployment: RAG Deployment Strategies
- Vector Databases: Vector Databases for Edge
- Operations: RAG Operations & Monitoring
- Assessment: RAG Implementation Knowledge Check
Last Updated: October 21, 2025