Solution Sizing & Planning

Overview

Solution sizing translates customer requirements into specific hardware configurations, software components, and deployment architectures. This page covers workload assessment, sizing calculators, capacity planning, and validation techniques.


Solution Sizing Methodology

Three-Phase Sizing Process

PHASE 1: REQUIREMENTS EXTRACTION (1-2 weeks)
├─ Workload characterization
├─ Performance requirements
├─ Scale and growth projections
└─ Operational constraints

PHASE 2: CAPACITY CALCULATION (1 week)
├─ CPU/Memory sizing
├─ Storage sizing
├─ Network requirements
└─ GPU/accelerator needs

PHASE 3: VALIDATION & OPTIMIZATION (1 week)
├─ Benchmark against requirements
├─ Cost optimization analysis
├─ Risk mitigation planning
└─ Final recommendation & presentation

Workload Classification & Profiling

Workload Categories

CATEGORY 1: Inference (AI/ML prediction)
├─ Characteristics:
│  ├─ High throughput (100-10000 QPS)
│  ├─ Low latency requirement (<500ms)
│  ├─ GPU-intensive
│  └─ Stateless operations
└─ Sizing Focus:
   ├─ GPU count and memory
   ├─ Batch processing capability
   └─ Network throughput

CATEGORY 2: Analytics/Reporting
├─ Characteristics:
│  ├─ Batch processing
│  ├─ High data volume (TB-PB)
│  ├─ Flexible timing
│  └─ CPU-intensive
└─ Sizing Focus:
   ├─ Storage capacity
   ├─ CPU cores
   └─ Network bandwidth

CATEGORY 3: Transactional Database
├─ Characteristics:
│  ├─ Consistent throughput (1000-100K TPS)
│  ├─ Low latency (<10ms)
│  ├─ High availability
│  └─ ACID compliance
└─ Sizing Focus:
   ├─ Memory (cache efficiency)
   ├─ IOPS (disk speed)
   └─ Network latency

CATEGORY 4: Vector Search (RAG)
├─ Characteristics:
│  ├─ Moderate throughput (100-10000 QPS)
│  ├─ Similarity search
│  ├─ Memory-resident indexes
│  └─ GPU-optional
└─ Sizing Focus:
   ├─ Memory (index size)
   ├─ Network I/O
   └─ CPU cores for search

Workload Profile Template

WORKLOAD NAME: [Name]
WORKLOAD TYPE: [Classification]
BUSINESS CONTEXT: [Why this matters]

SCALE CHARACTERISTICS:
├─ Concurrent Users: [Number]
├─ Queries/Transactions per Second:
│  ├─ Average: [X QPS]
│  ├─ Peak: [Y QPS, Z% of time]
│  └─ 99th percentile: [W QPS]
└─ Data Volume:
   ├─ Current: [Size]
   ├─ Year 1: [Size]
   ├─ Year 3: [Size]
   └─ Growth Rate: [% per year]

PERFORMANCE REQUIREMENTS:
├─ Latency:
│  ├─ p50: [ms]
│  ├─ p95: [ms]
│  └─ p99: [ms]
├─ Throughput: [Transactions/sec]
├─ Availability: [SLA %]
└─ Error Rate: [Target %]

INFRASTRUCTURE CONSTRAINTS:
├─ Hardware Available: [Description]
├─ Power Budget: [kW]
├─ Network Bandwidth: [Mbps]
├─ Physical Space: [sq ft]
└─ Environmental: [Temperature, humidity, etc.]

OPERATIONAL CONSTRAINTS:
├─ Maintenance Windows: [Frequency]
├─ Failover Requirements: [RPO/RTO]
├─ Compliance Audits: [Frequency]
└─ Support Model: [24/7, business hours, etc.]
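
The profile template above maps naturally to a structured record, so profiles can be stored, versioned, and compared across engagements. A minimal Python sketch (field names and example values are illustrative, not a standard schema):

  from dataclasses import dataclass

  @dataclass
  class WorkloadProfile:
      """Structured capture of the workload profile template above."""
      name: str
      workload_type: str         # e.g. "inference", "analytics", "oltp", "vector-search"
      avg_qps: float
      peak_qps: float
      data_gb_current: float
      growth_rate_yearly: float  # e.g. 0.30 for 30% per year
      latency_p95_ms: float
      availability_sla: float    # e.g. 99.9
      rpo_minutes: float = 0.0
      rto_minutes: float = 0.0

      def data_gb_in_year(self, year: int) -> float:
          """Project data volume with compound annual growth."""
          return self.data_gb_current * (1 + self.growth_rate_yearly) ** year

  rag = WorkloadProfile(
      name="support-rag", workload_type="vector-search",
      avg_qps=1.15, peak_qps=500, data_gb_current=0.2,
      growth_rate_yearly=0.30, latency_p95_ms=200, availability_sla=99.9,
  )
  print(f"Year-3 data volume: {rag.data_gb_in_year(3):.2f} GB")  # ~0.44 GB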

Hardware Sizing Calculators

CPU Sizing Calculator

STEP 1: Estimate CPU Requirements
─────────────────────────────────
Per-Query CPU Time: [ms]
  (from benchmarking or vendor data)

Queries per Second (peak): [QPS]

Cores Needed = (QPS × CPU_time_ms × num_threads) / (core_ms_per_sec × efficiency)
  where core_ms_per_sec = 1000 (each core supplies 1000 ms of compute per second)

Example:
  - 500 QPS peak
  - 50ms CPU time per query
  - 4 threads per query
  - 85% CPU efficiency
  
  Cores = (500 × 50 × 4) / (1000 × 0.85) ≈ 118 cores needed
  
  → Raw compute requirement: 118 cores (system overhead is added in STEP 2)

STEP 2: Add Overhead for System Services
────────────────────────────────────
  - Operating system: 15% of cores
  - Monitoring/logging: 10% of cores
  - Cache/buffers: 5% of cores
  - Headroom for peaks: 10% of cores
  
  Total Overhead: ~40% of cores
  
  Adjusted cores needed = 118 / 0.6 ≈ 197 cores

STEP 3: Select Hardware
──────────────────────
  - Dual-socket servers, 32 cores each = 64 cores/server
  - 197 / 64 = 3.08 → Recommend 4 servers
  
RESULT: 4 servers with dual 32-core processors
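
The three steps above reduce to a short calculation. A sketch in Python using the example's inputs (the 85% efficiency and 40% overhead are the planning assumptions from STEPS 1-2):

  import math

  def cpu_cores_needed(peak_qps: float, cpu_ms_per_query: float,
                       threads_per_query: int, efficiency: float = 0.85,
                       overhead_fraction: float = 0.40) -> int:
      """Raw cores from the STEP 1 formula, grossed up for STEP 2 overhead."""
      raw = (peak_qps * cpu_ms_per_query * threads_per_query) / (1000 * efficiency)
      return math.ceil(raw / (1 - overhead_fraction))

  def servers_required(total_cores: int, cores_per_server: int = 64) -> int:
      """Round up to whole dual-socket servers (2 x 32 cores assumed)."""
      return math.ceil(total_cores / cores_per_server)

  cores = cpu_cores_needed(peak_qps=500, cpu_ms_per_query=50, threads_per_query=4)
  print(cores, servers_required(cores))  # 197 cores -> 4 servers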

Memory Sizing Calculator

STEP 1: Identify Memory Components
──────────────────────────────────
Working Set Size:
  - Active data in use: [GB]
  - Example: LLM model weights (7B params in INT4) ≈ 3.5GB

Cache Layers:
  - Application cache: [GB]
  - Database buffer pool: [GB]
  - OS page cache: [GB]
  - Total cache: [GB]

Index Structures:
  - Vector index (HNSW, 1M vectors): ~2KB per vector = 2GB
  - B-tree indexes: varies
  - Hash tables: varies
  - Total: [GB]

Temporary Allocations:
  - Query processing: [GB]
  - Sort buffers: [GB]
  - Join buffers: [GB]
  - Total: [GB]

STEP 2: Calculate Total Memory
──────────────────────────────
Total = Working Set + Cache + Indexes + Temp + Headroom (25%)

Example:
  - Working Set (LLM weights, 7B INT4): 3.5GB
  - Cache: 64GB
  - Indexes (1M vectors): 2GB
  - Temp buffers: 10GB
  - Headroom (25% of the above): ~20GB
  
  Total: ~99GB → Round to 128GB per server

STEP 3: Memory Layout per Server
─────────────────────────────────
With 384GB server (enterprise hardware):
  - LLM model (host-resident copy): ~4GB
  - Vector index: 2GB
  - Buffer pool: 256GB
  - OS + overhead: 78GB
  
  Utilization: ~340/384 ≈ 89%
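
The same arithmetic as a reusable helper, with the example's component sizes as inputs (sizes are illustrative):

  def total_memory_gb(working_set: float, cache: float, indexes: float,
                      temp: float, headroom: float = 0.25) -> float:
      """Sum memory components (STEP 1) and add planning headroom (STEP 2)."""
      base = working_set + cache + indexes + temp
      return base * (1 + headroom)

  need = total_memory_gb(working_set=3.5, cache=64, indexes=2, temp=10)
  print(f"{need:.1f} GB")  # ~99.4 GB -> round up to a 128GB configuration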

GPU Sizing Calculator

STEP 1: Assess GPU Requirements
───────────────────────────────
Model Requirements:
  - Llama 2 7B in INT4: ~4GB VRAM (3.5GB weights + runtime overhead)
  - Mistral 7B in INT8: ~8GB VRAM
  - Llama 2 13B in INT8: ~14GB VRAM

Inference Requirements:
  - QPS: 500 peak
  - Latency budget: 200ms p95 per query
  - Batch size: 4 (batching amortizes per-query GPU cost)
  - GPU time per batch of 4: ~6ms (illustrative figure from benchmarking)

GPU Memory Needed:
  - Model: ~4GB (INT4 quantized, with runtime overhead)
  - Batch buffers: 1GB
  - Embeddings: 0.5GB
  - Total per GPU: ~5.5GB
  
  Utilization per GPU: 5.5GB / 16GB ≈ 34%
  
STEP 2: Calculate GPU Count
───────────────────────────
With Batch Processing:
  - Per-GPU capacity: (1000ms / 6ms per batch) × 4 queries ≈ 667 QPS
  - Need 500 QPS: 1 GPU sufficient mathematically
  - Add redundancy: 2 GPUs
  - Add failure headroom: 3 GPUs
  
  → 3x T4 GPUs (16GB each)

STEP 3: Validate Load Distribution
──────────────────────────────────
  - 500 QPS across 3 GPUs ≈ 167 QPS per GPU
  - Batch size 4 → ~42 batches/second per GPU
  - Capacity: 1000ms / 6ms ≈ 167 batches/second (~667 QPS) per GPU ✓
  
  Result: 3 GPUs provide ~4x aggregate headroom (≈2000 QPS capacity vs 500 QPS peak)

STEP 4: CPU Support for GPU Operations
──────────────────────────────────
  - 1 CPU core per GPU for host processing
  - 2-4 CPU cores for driver and utilities
  - With 3 GPUs: allocate 6 cores minimum
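
STEPS 1-3 combine into a throughput check. A sketch assuming the illustrative ~6ms per-batch GPU time used above (substitute measured figures from your own benchmarking):

  import math

  def gpus_required(peak_qps: float, batch_size: int, batch_ms: float,
                    min_gpus_for_ha: int = 3) -> int:
      """GPU count from batch throughput, floored at an HA minimum."""
      per_gpu_qps = (1000 / batch_ms) * batch_size
      return max(math.ceil(peak_qps / per_gpu_qps), min_gpus_for_ha)

  def headroom_factor(peak_qps: float, gpus: int, batch_size: int,
                      batch_ms: float) -> float:
      """Aggregate capacity relative to peak demand."""
      return gpus * (1000 / batch_ms) * batch_size / peak_qps

  g = gpus_required(peak_qps=500, batch_size=4, batch_ms=6)
  print(g, f"{headroom_factor(500, g, 4, 6):.1f}x")  # 3 GPUs, ~4.0x headroom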

Storage Sizing Calculator

STEP 1: Categorize Storage Needs
────────────────────────────────

Hot Storage (NVMe SSD, sub-millisecond latency):
  - Active vector indexes: 2GB
  - LLM model cache: 4GB
  - Database working set: 50GB
  - Total hot: 56GB

Warm Storage (SATA SSD, <10ms latency):
  - Full embeddings backup: 6GB
  - Model variants: 10GB
  - Query logs (1 week): 100GB
  - Total warm: 116GB

Cold Storage (HDD/Archive, <1s latency):
  - Historical query logs: 500GB
  - Data backups: 200GB
  - Total cold: 700GB

STEP 2: Calculate Total with Replication
────────────────────────────────────────
Replication Factor: 2x (high availability)

  - Hot storage: 56GB × 2 = 112GB NVMe SSD
  - Warm storage: 116GB × 2 = 232GB SATA SSD
  - Cold storage: 700GB × 2 = 1.4TB HDD
  
  Total: ~1.8TB raw storage

STEP 3: Add Growth Buffer (30% per year)
─────────────────────────────────────
Year 1: 1.8TB
Year 2: 1.8TB × 1.3 = 2.3TB
Year 3: 2.3TB × 1.3 = 3TB

Recommendation: Procure 3TB usable with tiering
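
The tiering, replication, and growth steps chain together; a sketch with the example's tier sizes:

  def storage_plan_tb(hot_gb: float, warm_gb: float, cold_gb: float,
                      replication: int = 2, growth: float = 0.30,
                      years: int = 2) -> float:
      """Replicated raw capacity (STEP 2) grown at a compound rate (STEP 3)."""
      raw_gb = (hot_gb + warm_gb + cold_gb) * replication
      return raw_gb * (1 + growth) ** years / 1000

  print(f"{storage_plan_tb(hot_gb=56, warm_gb=116, cold_gb=700):.1f} TB")
  # ~2.9 TB -> procure ~3TB usable with tiering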

Network Sizing Calculator

STEP 1: Calculate Bandwidth Requirements
─────────────────────────────────────────

Ingress (Data flowing in):
  - Query traffic: 500 QPS × 2KB query = 1 MB/sec (≈8 Mbps)
  - Data ingestion: 100 MB/hour (≈0.2 Mbps)
  - Replication sync: 50 Mbps (peak)
  - Total ingress: ~60 Mbps peak; plan for 150 Mbps with burst margin

Egress (Data flowing out):
  - Query responses: 500 QPS × 5KB response = 2.5 MB/sec (≈20 Mbps)
  - Replication egress: 50 Mbps
  - Monitoring/logging: 20 Mbps
  - Total egress: ~90 Mbps peak; plan for 200 Mbps with burst margin

Cluster Internal:
  - Node-to-node replication: 50 Mbps
  - Storage access: 100 Mbps
  - Monitoring: 10 Mbps
  - Total internal: ~160 Mbps sustained; plan for 400 Mbps with burst margin

STEP 2: Size Network Interfaces
───────────────────────────────
External-facing (to users/services):
  - 200 Mbps peak required
  - Recommend: Dual 10Gbps uplinks (20 Gbps total)
  - Utilization: <1%

Cluster-internal (node-to-node):
  - 400 Mbps peak required
  - Recommend: 25Gbps fabric interconnect
  - Utilization: <2%

Per-server network:
  - External: 10Gbps NIC (50x the 200 Mbps planning peak)
  - Internal: 25Gbps interconnect

STEP 3: Network Services Configuration
──────────────────────────────────────
Load Balancer:
  - Capacity: 20 Gbps minimum
  - Connection tracking: 100K sessions
  - NAT throughput: 5 Gbps

Firewall:
  - Stateful inspection: 20 Gbps minimum
  - Connection limit: 1M concurrent

Switch Fabric:
  - Non-blocking: Full bandwidth between all ports
  - Redundancy: Dual switches, active-active

Backup Network:
  - Dedicated for replication
  - 25Gbps capacity
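
Bandwidth sizing is mostly unit conversion and summing named flows. A sketch over the egress flows listed in STEP 1 (values illustrative):

  MBPS_PER_MBYTE_PER_SEC = 8  # 1 MB/sec = 8 Mbps

  def total_mbps(flows_mbps: dict[str, float]) -> float:
      """Sum named flows already expressed in Mbps."""
      return sum(flows_mbps.values())

  egress = {
      "query_responses": 2.5 * MBPS_PER_MBYTE_PER_SEC,  # 500 QPS x 5KB = 2.5 MB/sec
      "replication": 50,
      "monitoring": 20,
  }
  peak_mbps = total_mbps(egress)
  link_gbps = 10
  print(f"{peak_mbps:.0f} Mbps peak, link utilization "
        f"{peak_mbps / (link_gbps * 1000):.1%}")  # 90 Mbps peak, 0.9%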

Capacity Planning & Growth

Growth Projection Framework

CURRENT STATE (Year 0):
├─ Users: 50 concurrent
├─ Queries: 100K/day (1.15 QPS avg)
├─ Data: 100K vectors (200MB)
└─ Hardware: 2 nodes + 1 GPU

YEAR 1 PROJECTION (30% growth):
├─ Users: 65 concurrent
├─ Queries: 130K/day (1.5 QPS avg)
├─ Data: 130K vectors (260MB)
└─ Action: Monitor, no changes needed

YEAR 2 PROJECTION (30% growth):
├─ Users: 85 concurrent
├─ Queries: 169K/day (2 QPS avg)
├─ Data: 169K vectors (338MB)
└─ Action: Prepare for expansion

YEAR 3 PROJECTION (30% growth):
├─ Users: 110 concurrent
├─ Queries: 220K/day (2.5 QPS avg)
├─ Data: 220K vectors (440MB)
└─ Action: Add additional replicas

YEAR 5 PROJECTION (upside scenario, ~50% annual growth):
├─ Users: 200 concurrent
├─ Queries: 500K/day (5.8 QPS avg)
├─ Data: 500K vectors (1GB)
└─ Action: Consider multi-cluster sharding
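
Each projection row is compound growth applied to the Year-0 baseline; a sketch reproducing the query figures:

  def project(baseline: float, rate: float, years: int) -> float:
      """Compound annual growth from a Year-0 baseline."""
      return baseline * (1 + rate) ** years

  for year in (1, 2, 3):
      q = project(100_000, 0.30, year)  # queries/day at 30% per year
      print(f"Year {year}: {q:,.0f} queries/day ({q / 86_400:.1f} QPS avg)")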

Capacity Monitoring & Triggers

METRIC: CPU Utilization
├─ Yellow (75%): Plan upgrade timeline
├─ Red (90%): Implement optimization or upgrade
└─ Critical (95%): Escalate immediately

METRIC: Memory Utilization
├─ Yellow (80%): Monitor swap activity
├─ Red (90%): Plan hardware upgrade
└─ Critical (95%): Implement emergency measures

METRIC: Storage I/O (IOPS)
├─ Yellow (70% of limit): Monitor workload patterns
├─ Red (85% of limit): Add storage or optimize queries
└─ Critical (95% of limit): Escalate, may impact SLA

METRIC: Network Bandwidth
├─ Yellow (60% of capacity): Monitor usage patterns
├─ Red (80% of capacity): Consider network upgrade
└─ Critical (95% of capacity): Escalate immediately

TRIGGER ACTIONS:
├─ Yellow State: Send notification to ops team
├─ Red State: Create upgrade planning ticket
└─ Critical State: Execute emergency response playbook
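
The thresholds above map directly to a severity lookup. A sketch (thresholds copied from the tables; the comments summarize the trigger actions):

  THRESHOLDS = {  # metric: (yellow, red, critical) utilization fractions
      "cpu":     (0.75, 0.90, 0.95),
      "memory":  (0.80, 0.90, 0.95),
      "iops":    (0.70, 0.85, 0.95),
      "network": (0.60, 0.80, 0.95),
  }

  def severity(metric: str, utilization: float) -> str:
      """Classify a utilization reading against the capacity triggers."""
      yellow, red, critical = THRESHOLDS[metric]
      if utilization >= critical:
          return "critical"  # execute emergency response playbook
      if utilization >= red:
          return "red"       # create upgrade planning ticket
      if utilization >= yellow:
          return "yellow"    # notify ops team
      return "ok"

  print(severity("cpu", 0.92))  # red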

Sizing Validation Techniques

Benchmark-Based Validation

STEP 1: Identify Reference Workload
───────────────────────────────────
Similar customer workload:
  - 500 QPS peak load
  - 1M vector embeddings
  - Llama 2 7B model

Benchmark Results from Reference:
  - 3x T4 GPUs: 2000 QPS capacity
  - 4x CPU nodes: 4000 req/sec processing
  - Total storage: 2TB with HA

STEP 2: Scale to Your Requirements
──────────────────────────────────
Your Peak Load: 500 QPS
Reference Benchmark: 2000 QPS capacity
Scaling Factor: 500 / 2000 = 0.25x

Your Hardware Needs:
  - GPUs: 3 × 0.25 = 0.75 → round up to 1 GPU
  - But add redundancy: use 2 GPUs
  - CPU nodes: scale similarly
  - Storage: scale proportionally

STEP 3: Validation
──────────────────
Performance Model Prediction:
  - 2 T4 GPUs: 1333 QPS capacity
  - Expected latency: 180ms p95
  - Storage needs: 500GB

Customer Requirement Alignment:
  - Required QPS: 500 ✓ (1333 QPS capacity ≈ 2.7x requirement)
  - Required latency: <200ms ✓ (predicted 180ms p95)
  - Required storage: <500GB ✓ (meets requirement)

VERDICT: Sizing meets requirements with >2x throughput headroom
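
STEP 2's scaling is a load ratio plus rounding and a redundancy floor; a sketch:

  import math

  def scale_from_reference(required_qps: float, reference_qps: float,
                           reference_units: int, min_units: int = 2) -> int:
      """Scale reference hardware by load ratio, keeping a redundancy floor."""
      factor = required_qps / reference_qps  # 500 / 2000 = 0.25
      return max(math.ceil(reference_units * factor), min_units)

  gpus = scale_from_reference(required_qps=500, reference_qps=2000,
                              reference_units=3)
  print(gpus)  # 1 mathematically -> 2 with the redundancy floor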

Load Testing Validation

PHASE 1: Baseline Testing (Week 1)
──────────────────────────────────
Deploy proposed hardware
Run representative workload:
  - Ramp up from 10% to 100% of expected load
  - Measure latency, throughput, CPU, memory
  - Collect performance data for 6 hours

Expected Results:
  - p50 latency: <100ms
  - p95 latency: <200ms
  - CPU utilization: <70%
  - Memory utilization: <80%

PHASE 2: Peak Load Testing (Week 2)
──────────────────────────────────
Run spike tests to peak expected load:
  - Instantaneous ramp to 100% load
  - Sustain for 30 minutes
  - Measure recovery time

Expected Results:
  - Latency degradation <20%
  - No request timeouts
  - Recovery to baseline within 2 minutes

PHASE 3: Stress Testing (Week 3)
───────────────────────────────
Run 150% of peak expected load:
  - Identify breaking point
  - Test failover and recovery
  - Verify graceful degradation

Expected Results:
  - System remains responsive
  - Graceful throttling (load shed cleanly, no collapse) at 150%
  - No data loss during failover

PASS/FAIL CRITERIA:
✓ PASS if all phases meet expected results
✗ FAIL if any metric significantly misses its target
  → If failed: Re-size and re-test
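
Evaluating collected latency samples against the pass/fail criteria can be automated. A sketch using the Phase 1 limits (sample values illustrative):

  def percentile(samples: list[float], pct: float) -> float:
      """Nearest-rank percentile over collected latency samples."""
      ordered = sorted(samples)
      index = max(0, round(pct / 100 * len(ordered)) - 1)
      return ordered[index]

  def baseline_passes(latencies_ms: list[float]) -> bool:
      """Phase 1 criteria: p50 < 100ms and p95 < 200ms."""
      return (percentile(latencies_ms, 50) < 100
              and percentile(latencies_ms, 95) < 200)

  samples = [60, 70, 80, 85, 90, 95, 120, 180]  # illustrative measurements
  print(baseline_passes(samples))  # True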

Sizing Output Document

Executive Summary Section

PROPOSED SOLUTION SUMMARY
─────────────────────────
Deployment Model: [Single/Multi-region Hub-Spoke, etc.]
Hardware: [Component list]
Software: [Platform and tools]
Estimated Cost: [$ figure]
Implementation Timeline: [Weeks/months]

CAPACITY PROFILE
────────────────
Peak Throughput: [X QPS or TPS]
Concurrent Users: [N users]
Data Volume: [Size]
Storage Capacity: [Size]
Network Bandwidth: [Mbps]

PERFORMANCE TARGETS
───────────────────
Latency p95: [ms]
Availability: [SLA %]
Throughput: [ops/sec]
Cost per Transaction: [$]

VALIDATION
──────────
Benchmark Match: [Yes/No]
Load Test Results: [Pass/Fail]
Risk Assessment: [Low/Medium/High]
Confidence Level: [%]


Last Updated: October 21, 2025