High Availability Patterns
This section covers advanced strategies for building highly available and resilient Azure Local deployments. Understanding these patterns is essential for designing mission-critical systems.
Cluster Quorum Options
Quorum prevents split-brain scenarios where cluster nodes become isolated and make conflicting decisions.
Node Majority Quorum
How It Works:
- Requires more than half of nodes to be healthy
 - 3-node cluster: 2 nodes needed for quorum
 - 5-node cluster: 3 nodes needed for quorum
 - Completely distributed, no external dependency
 
Advantages:
- Simple to understand and operate
 - No external infrastructure required
 - Automatic decision making
 - Scales well to any size
 
Limitations:
- Best suited to an odd number of nodes (3, 5, 7, …); even node counts should add a witness
 - 2-node cluster cannot use node quorum alone
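To check or change the quorum model, the FailoverClusters PowerShell module can be used from any cluster node; for example:
# View the current quorum configuration and witness (if any)
Get-ClusterQuorum

# Switch a cluster with an odd node count to node majority (no witness)
Set-ClusterQuorum -NodeMajority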
 
File Share Witness
How It Works:
- External server hosts witness file share
 - Provides tiebreaker for 2-node clusters
 - Any node plus witness can form quorum
 - Witness share can be hosted on any SMB file server outside the cluster (SMB 2.0 or later)
 
Configuration:
- Must be reachable from all cluster nodes over a reliable network path (witness traffic is minimal)
 - Witness file share needs at least 5 MB of free space
 - Separate from cluster nodes (different server)
 - Can be on different site or cloud
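Assuming the share already exists and the cluster name account has write permission, the witness can be configured with a single cmdlet (the UNC path below is illustrative):
# Point the cluster at an SMB share hosted outside the cluster
Set-ClusterQuorum -FileShareWitness "\\witness-server\ClusterWitness"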
 
Best Practices:
- Place witness at third location for geo-redundancy
 - Use Azure Storage file share for cloud-based witness
 - Monitor witness connectivity regularly
 - Test failover with witness offline
 
Cloud Witness (Azure Storage)
How It Works:
- Uses an Azure Storage account as the quorum witness
 - A small blob in the storage account acts as the arbitration point
 - Suitable for hybrid deployments with reliable outbound connectivity
 - Requires outbound HTTPS (port 443) access to the Azure Blob service
 
Configuration:
# Configure a cloud witness (replace the account name and access key with your own values)
Set-ClusterQuorum -CloudWitness -AccountName "mystorageaccount" `
  -AccessKey "storagekey123..."
Considerations:
- Requires reliable, low-latency outbound connectivity to Azure
 - Internet (outbound HTTPS) connectivity required
 - Cost: Minimal (~$1-2/month for the witness storage account)
 - Not suitable for fully disconnected edge sites; use a file share witness there
 
Multi-Node Failure Scenarios
Single Node Failure (Automatic Recovery)
What Happens:
- Cluster detects node timeout (5-30 seconds)
 - Cluster declares node offline
 - VMs on the failed node restart on surviving nodes
 - Storage mirrors rebuild data from surviving copies
 - Services resume within 5-15 minutes
 
Recovery Timeline:
- Detection: 5-30 seconds
 - VM failover: 30-120 seconds
 - VM application start: 1-5 minutes
 - Storage rebuild: Hours (dependent on data volume)
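Failover progress can be followed from any surviving node; a quick check along these lines shows which roles are still coming online and how the storage repair is going:
# Cluster roles (VMs) that have not yet returned to the Online state
Get-ClusterGroup | Where-Object State -ne 'Online'

# Storage repair jobs triggered by the node failure
Get-StorageJob | Where-Object JobState -ne 'Completed'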
 
Data Loss:
- Zero data loss for writes committed to mirrored storage (2-way or 3-way)
 - Cluster shared volumes remain available from the surviving copies
 - VM disk state is preserved through the mirror; in-memory state is lost when the VM restarts
 
Two-Node Failure (Planned Maintenance)
Scenario: One node down for maintenance, second node fails unexpectedly
With 3-Node Cluster:
- Node 1 down (maintenance)
 - Node 2 fails
 - Node 3 survives with quorum
 - Cluster continues (degraded mode)
 - VMs resume on Node 3
 - Storage rebuilds automatically
 
With 2-Node Cluster:
- Node 1 down (maintenance)
 - Node 2 fails
 - Loss of quorum
 - Cluster stops (automatic failover not possible)
 - Manual intervention required
 
Mitigation:
- Never perform maintenance on more than one node at a time
 - Keep witness online during maintenance
 - Test failover scenarios regularly
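Before planned maintenance, draining the node keeps the impact predictable; a minimal sequence (node name is illustrative) looks like this:
# Drain all roles off the node before maintenance
Suspend-ClusterNode -Name "Node1" -Drain

# ... perform maintenance and reboot ...

# Bring the node back and fail roles back to it
Resume-ClusterNode -Name "Node1" -Failback Immediate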
 
Network Partition (Split-Brain Prevention)
Scenario: Network link between cluster nodes fails
Detection:
- Nodes cannot communicate within heartbeat timeout
 - Quorum mechanism activates
 - Nodes without quorum stop services
 - Nodes with quorum continue
 
With 3-Node Cluster (Network Split):
Scenario 1: 2 nodes on side A, 1 node on side B
- Side A: 2/3 nodes = quorum (continues)
- Side B: 1/3 nodes = no quorum (stops VMs)
Scenario 2: Equal split (not possible with an odd node count)
Result: Automatic prevention of split-brain
Recovery:
- Reconnect network
 - Stopped nodes automatically rejoin the cluster
 - Storage automatically synchronizes
 - No data loss with synchronous replication
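Once the link is restored, node and storage state can be verified from any node that kept quorum; for example:
# Confirm every node has rejoined the cluster
Get-ClusterNode | Select-Object Name, State

# Confirm storage resynchronization jobs have completed
Get-StorageJob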
 
Failover Mechanisms
Automatic VM Failover
Prerequisites:
- VMs stored on cluster shared storage (not local)
 - Cluster monitoring enabled
 - Health policy configured
 
Failover Process:
- Source node fails
 - Cluster detects failure
 - Cluster runs HA policy for VMs
 - VM marked as “failed-over”
 - Surviving node starts VM from storage
 - VM connections resume after restart
 
Application Considerations:
- Connection timeout: 2-5 minutes typical
 - Application must handle restart
 - Stateless services recover automatically
 - Stateful services may need recovery scripts
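Restart order after a failover can be influenced through cluster group priority; a small sketch (the VM name is illustrative) is:
# Give a business-critical VM a higher restart priority after failover
# (3000 = High, 2000 = Medium, 1000 = Low, 0 = No Auto Start)
(Get-ClusterGroup -Name "SQL-VM01").Priority = 3000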
 
Storage Rebuild
Automatic Start:
- Begins immediately when node fails
 - Lower priority than live workloads
 - Can be paused if performance impact too high
 
Rebuild Performance:
- Typical: 200-500 MB/s read from mirrors
 - Write: 100-300 MB/s to replacement location
- Time to complete: hours to days, depending on data volume
   - 1 TB of data: roughly 2-5 hours
   - 10 TB of data: roughly 20-50 hours
 
 
Operational Impact:
- Performance degradation during rebuild
 - Reduced fault tolerance during rebuild
 - No redundancy if second failure occurs during rebuild
 - Monitor rebuild progress
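Rebuild progress and the resulting fault tolerance can be tracked with the Storage module; for example:
# Repair/rebuild jobs with progress and remaining work
Get-StorageJob | Select-Object Name, JobState, PercentComplete, BytesProcessed, BytesTotal

# Verify volumes report healthy again once the jobs finish
Get-VirtualDisk | Select-Object FriendlyName, HealthStatus, OperationalStatus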
 
Optimization:
- Schedule during low-traffic windows
 - Keep spare capacity in the pool so rebuilds can start immediately
 - Rebuild within same node group if possible
 
Multi-Site Active-Active
Architecture:
- Two clusters at different locations
 - Synchronized storage replication
 - Independent VM workloads
 - Coordinated through Azure Arc management
 
Benefits:
- Workload distribution across sites
 - Local compute for latency-sensitive apps
 - Automated failover per site
 - No single point of failure
 
Challenges:
- Consistent configuration across sites
 - Network latency between sites
 - Replication overhead
 - Debugging failures across geography
 
Multi-Site Active-Passive
Architecture:
- Primary cluster handles all workloads
 - Secondary cluster on standby
 - Asynchronous replication
 - Manual or automated failover
 
Deployment:
Primary Site:
  - 3-node Azure Local cluster
  - All VMs and data
Secondary Site:
  - 3-node Azure Local cluster (idle)
  - Replica of primary data
  - No VMs running
Replication:
  - 1-hour RPO (asynchronous)
  - Bandwidth: 100 Mbps minimum
  - Via WAN or dedicated link
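One way to implement the asynchronous replication sketched above is Storage Replica; the following is only a sketch, every computer, replication group, and volume name is illustrative, and Azure Local deployments may instead configure replication through their own tooling:
# Create an asynchronous cluster-to-cluster partnership with a 1-hour RPO (3600 seconds)
New-SRPartnership `
    -SourceComputerName "primary-clus" -SourceRGName "rg-primary" `
    -SourceVolumeName "C:\ClusterStorage\Volume1" -SourceLogVolumeName "L:" `
    -DestinationComputerName "secondary-clus" -DestinationRGName "rg-secondary" `
    -DestinationVolumeName "C:\ClusterStorage\Volume1" -DestinationLogVolumeName "L:" `
    -ReplicationMode Asynchronous -AsyncRPO 3600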
Failover Process:
- Primary site fails
 - Ops team initiates failover
 - Secondary cluster promoted to primary
 - VMs started on secondary
 - Applications resume from replicas
 - RTO: 15-30 minutes (manual)
 - RPO: 1 hour (data loss risk)
 
Advantages:
- Simpler than active-active
 - Lower replication bandwidth
 - Lower cost (secondary can be smaller)
 - Well-understood recovery process
 
Disaster Recovery Runbooks
Node Failure Recovery
1. Detection Phase:
- Monitor: Check the cluster event log
- Diagnosis: Is the node hardware down, or is it a network issue?
- Decision: Restart the node or replace hardware?
2. Recovery Options:
Option A: Restart Failed Node
1. Power cycle server
2. Wait for node to rejoin cluster
3. Monitor storage rebuild
4. Verify cluster health
Option B: Hardware Replacement
1. Remove failed node from cluster
2. Replace hardware
3. Rejoin node to cluster (fresh install)
4. Monitor storage rebuild
5. Verify all disks recognized
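In a plain failover cluster, steps 1 and 3 map to evicting and re-adding the node; Azure Local deployments may instead use the platform's own add/remove server workflow, so treat this as a sketch (node name is illustrative):
# Evict the failed node from the cluster
Remove-ClusterNode -Name "Node2" -Force

# ... replace hardware, reinstall and configure the OS ...

# Re-add the node once it has been rebuilt
Add-ClusterNode -Name "Node2"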
3. Post-Recovery:
- Verify all VMs running
 - Check storage pool status
 - Monitor rebuild progress
 - Document timeline and issues
 
Storage Pool Degradation
Detection:
Alert triggers: Storage pool showing "degraded" status
- Cluster event: "Physical disk offline"
- Alarm: Low redundancy warning
Investigation:
1. Which physical disk failed?
   Get-StoragePool | Get-PhysicalDisk | Where-Object HealthStatus -ne 'Healthy'
   
2. Is it NVMe, SSD, or HDD?
3. Can it be recovered (reseat connector)?
4. Is spare capacity available?
Recovery:
1. If disk can be recovered:
   - Reseat connection
   - Monitor for reintegration
   
2. If disk must be replaced:
   - Remove old disk
   - Insert new disk
   - Monitor automatic rebuild
   - Verify: Get-PhysicalDisk
3. Monitor rebuild:
   - Check: Get-StorageJob
   - Expected time: 1-5 hours per TB
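If the disk must be replaced, a hedged sketch of the retire-and-repair flow looks like this (with Storage Spaces Direct, a newly inserted drive is normally added to the pool automatically):
# Retire the failed disk so Storage Spaces stops allocating to it
$failed = Get-PhysicalDisk | Where-Object HealthStatus -ne 'Healthy'
$failed | Set-PhysicalDisk -Usage Retired

# Kick off repair of any virtual disks that used the failed drive
Get-VirtualDisk | Where-Object HealthStatus -ne 'Healthy' | Repair-VirtualDisk

# Watch the rebuild
Get-StorageJob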
Network Partition Recovery
Detection:
- Cluster stops responding
- VMs appear frozen
- Management interfaces inaccessible
Investigation:
1. Check network connectivity between nodes
   ping node2 (from node1)
   ping node3 (from node1)
2. Check network switch:
   - Verify ports active
   - Check VLAN configuration
   - Verify QoS settings
3. Check node state and quorum configuration:
   Get-ClusterNode
   Get-ClusterQuorum
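4. If basic connectivity looks healthy but the cluster still misbehaves, run the network portion of cluster validation (node names are illustrative):
   Test-Cluster -Node "node1","node2","node3" -Include "Network"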
Recovery:
1. Restore network connection:
   - Fix network switch
   - Reconnect cable
   - Verify VLAN is correct
2. Monitor cluster recovery:
   - Cluster should automatically rejoin
   - VMs restart
   - Storage synchronizes
3. If manual recovery needed:
   - See network troubleshooting section
Complete Site Failure
Recovery Procedures:
Step 1: Assess Damage
- Determine if site is recoverable
 - Estimate recovery timeline
 - Check backup/replica status
 
Step 2: Failover to Secondary
- Confirm primary site completely down
 - Promote secondary cluster to primary
 - Update DNS/network routing
 - Start VMs on secondary
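Where Storage Replica carries the cross-site replication, promoting the secondary is done by reversing the replication direction; the cluster and replication group names below are illustrative, and this should only be run once the primary is confirmed down:
# Make the secondary cluster the replication source (its copy becomes writable)
Set-SRPartnership -NewSourceComputerName "secondary-clus" -SourceRGName "rg-secondary" `
    -DestinationComputerName "primary-clus" -DestinationRGName "rg-primary"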
 
Step 3: Validate Applications
- Test critical business functions
 - Verify data integrity
 - Check for replication lag issues
 - Notify stakeholders
 
Step 4: Plan Site Recovery
- Assess what failed (hardware, power, or network?)
 - Repair infrastructure
 - Restore from backups if needed
 - Test before returning to production
 
Example Timeline:
T+0:00    Primary site goes offline
T+0:05    Alert received and investigated
T+0:15    Decision made to failover to secondary
T+0:20    Secondary promoted to primary
T+0:30    First VM starts on secondary
T+1:00    All critical VMs running on secondary
T+2:00    All applications validated
T+4:00    Begin primary site recovery
T+8:00    Primary site infrastructure repaired
T+10:00   Data synchronized back to primary
T+12:00   Failback to primary completed
Testing HA and DR
Regular Failover Exercises
Monthly Test Procedure:
- Notify operations team (not real emergency)
 - Simulate node failure (e.g., stop the cluster service on one node; see the drill sketch after this list)
 - Monitor cluster response
 - Verify VMs restart
 - Check alert notifications
 - Document any issues
 - Resume normal operation
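A lightweight way to time the drill is to wrap the simulated failure in timestamps; the node name is illustrative:
# Record how long the cluster takes to bring roles back online during the drill
$t0 = Get-Date
Stop-ClusterNode -Name "Node3"                      # simulate the node failure
# ... wait until all VMs are running on the surviving nodes ...
Get-ClusterGroup | Where-Object State -ne 'Online'  # should eventually return nothing
Start-ClusterNode -Name "Node3"                     # end the drill
"Failover drill duration: $((Get-Date) - $t0)"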
 
Quarterly Full DR Test:
- Test secondary site startup
 - Verify data replicas are current
 - Perform application testing
 - Measure actual RTO (recovery time)
 - Document results
 - Update runbooks
 
RTO/RPO Validation
RTO Measurement:
RTO = Time from failure start to service restoration
Measure:
- T0: Failure event
- T1: Detection (cluster recognizes failure)
- T2: VM restart initiated
- T3: VM operating system boots
- T4: Application services start
- T5: Applications accepting requests
Typical breakdown for 3-node cluster:
- T1-T0: 30 seconds (failure detection)
- T2-T1: 5 seconds (orchestration)
- T3-T2: 2 minutes (OS boot)
- T4-T3: 2 minutes (application startup)
- T5-T4: 30 seconds (warming up)
Total RTO: ~5 minutes
RPO Measurement:
RPO = Maximum acceptable data loss, expressed as a window of time
Depends on:
- Storage mirror type (synchronous = 0)
- Replication strategy
- Backup frequency
Example:
- 3-way mirror + synchronous: RPO = 0 (no data loss)
- Async replication to secondary: RPO = 1 hour
- Nightly backups only: RPO = 24 hours
Key Takeaways
- Quorum Selection: Node majority for 3+ nodes, file share witness for 2-node
 - Failure Scenarios: Plan for single node, double node, and network partition
 - Automatic Recovery: Most scenarios recover automatically; understand which don’t
 - Testing: Regularly test failover and measure RTO/RPO
 - Documentation: Maintain current runbooks for all failure scenarios