High Availability Patterns
This section covers advanced strategies for building highly available and resilient Azure Local deployments. Understanding these patterns is essential for designing mission-critical systems.
Cluster Quorum Options
Quorum prevents split-brain scenarios where cluster nodes become isolated and make conflicting decisions.
Node Majority Quorum
How It Works:
- Requires more than half of nodes to be healthy
 - 3-node cluster: 2 nodes needed for quorum
 - 5-node cluster: 3 nodes needed for quorum
 - Completely distributed, no external dependency
 
Advantages:
- Simple to understand and operate
 - No external infrastructure required
 - Automatic decision making
 - Scales well to any size
 
Limitations:
- Best suited to an odd number of nodes (3, 5, 7, …); even node counts should add a witness
 - 2-node cluster cannot use node quorum alone
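To check or change the quorum model, the FailoverClusters PowerShell module can be used from any cluster node; for example:
# View the current quorum configuration and witness (if any)
Get-ClusterQuorum

# Switch a cluster with an odd node count to node majority (no witness)
Set-ClusterQuorum -NodeMajority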
 
File Share Witness
How It Works:
- External server hosts witness file share
 - Provides tiebreaker for 2-node clusters
 - Any node plus witness can form quorum
 - Witness share can be hosted on any SMB file server outside the cluster (SMB 2.0 or later)
 
Configuration:
- Must be reachable from all cluster nodes over a reliable network path (witness traffic is minimal)
 - Witness file share needs at least 5 MB of free space
 - Separate from cluster nodes (different server)
 - Can be on different site or cloud
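Assuming the share already exists and the cluster name account has write permission, the witness can be configured with a single cmdlet (the UNC path below is illustrative):
# Point the cluster at an SMB share hosted outside the cluster
Set-ClusterQuorum -FileShareWitness "\\witness-server\ClusterWitness"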
 
Best Practices:
- Place witness at third location for geo-redundancy
 - Use Azure Storage file share for cloud-based witness
 - Monitor witness connectivity regularly
 - Test failover with witness offline
 
Cloud Witness (Azure Storage)
How It Works:
- Uses an Azure Storage account as the quorum witness
 - A small blob in the storage account acts as the arbitration point
 - Suitable for hybrid deployments with reliable outbound connectivity
 - Requires outbound HTTPS (port 443) access to the Azure Blob service
 
Configuration:
# Configure a cloud witness (replace the account name and access key with your own values)
Set-ClusterQuorum -CloudWitness -AccountName "mystorageaccount" `
  -AccessKey "storagekey123..."
Considerations:
- Requires reliable, low-latency outbound connectivity to Azure
 - Internet (outbound HTTPS) connectivity required
 - Cost: Minimal (~$1-2/month for the witness storage account)
 - Not suitable for fully disconnected edge sites; use a file share witness there
 
Multi-Node Failure Scenarios
Single Node Failure (Automatic Recovery)
What Happens:
- Cluster detects node timeout (5-30 seconds)
 - Cluster declares node offline
 - VMs on the failed node restart on surviving nodes
 - Storage mirrors rebuild data from surviving copies
 - Services resume within 5-15 minutes
 
Recovery Timeline:
- Detection: 5-30 seconds
 - VM failover: 30-120 seconds
 - VM application start: 1-5 minutes
 - Storage rebuild: Hours (dependent on data volume)
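Failover progress can be followed from any surviving node; a quick check along these lines shows which roles are still coming online and how the storage repair is going:
# Cluster roles (VMs) that have not yet returned to the Online state
Get-ClusterGroup | Where-Object State -ne 'Online'

# Storage repair jobs triggered by the node failure
Get-StorageJob | Where-Object JobState -ne 'Completed'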
 
Data Loss:
- Zero data loss for writes committed to mirrored storage (2-way or 3-way)
 - Cluster shared volumes remain available from the surviving copies
 - VM disk state is preserved through the mirror; in-memory state is lost when the VM restarts
 
Two-Node Failure (Planned Maintenance)
Scenario: One node down for maintenance, second node fails unexpectedly
With 3-Node Cluster:
- Node 1 down (maintenance)
 - Node 2 fails
 - Node 3 survives with quorum
 - Cluster continues (degraded mode)
 - VMs resume on Node 3
 - Storage rebuilds automatically
 
With 2-Node Cluster:
- Node 1 down (maintenance)
 - Node 2 fails
 - Loss of quorum
 - Cluster stops (automatic failover not possible)
 - Manual intervention required
 
Mitigation:
- Never perform maintenance on more than one node at a time
 - Keep witness online during maintenance
 - Test failover scenarios regularly
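Before planned maintenance, draining the node keeps the impact predictable; a minimal sequence (node name is illustrative) looks like this:
# Drain all roles off the node before maintenance
Suspend-ClusterNode -Name "Node1" -Drain

# ... perform maintenance and reboot ...

# Bring the node back and fail roles back to it
Resume-ClusterNode -Name "Node1" -Failback Immediate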
 
Network Partition (Split-Brain Prevention)
Scenario: Network link between cluster nodes fails
Detection:
- Nodes cannot communicate within heartbeat timeout
 - Quorum mechanism activates
 - Nodes without quorum stop services
 - Nodes with quorum continue
 
With 3-Node Cluster (Network Split):
Scenario 1: 2 nodes on side A, 1 node on side B
- Side A: 2/3 nodes = quorum (continues)
- Side B: 1/3 nodes = no quorum (stops VMs)
Scenario 2: Equal split (not possible with an odd node count)
Result: Automatic prevention of split-brain
Recovery:
- Reconnect network
 - Stopped nodes automatically rejoin the cluster
 - Storage automatically synchronizes
 - No data loss with synchronous replication
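Once the link is restored, node and storage state can be verified from any node that kept quorum; for example:
# Confirm every node has rejoined the cluster
Get-ClusterNode | Select-Object Name, State

# Confirm storage resynchronization jobs have completed
Get-StorageJob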
 
Failover Mechanisms
Automatic VM Failover
Prerequisites:
- VMs stored on cluster shared storage (not local)
 - Cluster monitoring enabled
 - Health policy configured
 
Failover Process:
- Source node fails
 - Cluster detects failure
 - Cluster runs HA policy for VMs
 - VM marked as “failed-over”
 - Surviving node starts VM from storage
 - VM connections resume after restart
 
Application Considerations:
- Connection timeout: 2-5 minutes typical
 - Application must handle restart
 - Stateless services recover automatically
 - Stateful services may need recovery scripts
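Restart order after a failover can be influenced through cluster group priority; a small sketch (the VM name is illustrative) is:
# Give a business-critical VM a higher restart priority after failover
# (3000 = High, 2000 = Medium, 1000 = Low, 0 = No Auto Start)
(Get-ClusterGroup -Name "SQL-VM01").Priority = 3000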
 
Storage Rebuild
Automatic Start:
- Begins immediately when node fails
 - Lower priority than live workloads
 - Can be paused if performance impact too high
 
Rebuild Performance:
- Typical: 200-500 MB/s read from mirrors
 - Write: 100-300 MB/s to replacement location
- Time to complete: hours to days, depending on data volume
   - 1 TB of data: roughly 2-5 hours
   - 10 TB of data: roughly 20-50 hours
 
 
Operational Impact:
- Performance degradation during rebuild
 - Reduced fault tolerance during rebuild
 - No redundancy if second failure occurs during rebuild
 - Monitor rebuild progress
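Rebuild progress and the resulting fault tolerance can be tracked with the Storage module; for example:
# Repair/rebuild jobs with progress and remaining work
Get-StorageJob | Select-Object Name, JobState, PercentComplete, BytesProcessed, BytesTotal

# Verify volumes report healthy again once the jobs finish
Get-VirtualDisk | Select-Object FriendlyName, HealthStatus, OperationalStatus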
 
Optimization:
- Schedule during low-traffic windows
 - Keep spare capacity in the pool so rebuilds can start immediately
 - Rebuild within same node group if possible
 
Multi-Site Active-Active
Architecture:
- Two clusters at different locations
 - Synchronized storage replication
 - Independent VM workloads
 - Coordinated through Azure Arc management
 
Benefits:
- Workload distribution across sites
 - Local compute for latency-sensitive apps
 - Automated failover per site
 - No single point of failure
 
Challenges:
- Consistent configuration across sites
 - Network latency between sites
 - Replication overhead
 - Debugging failures across geography
 
Multi-Site Active-Passive
Architecture:
- Primary cluster handles all workloads
 - Secondary cluster on standby
 - Asynchronous replication
 - Manual or automated failover
 
Deployment:
Primary Site:
  - 3-node Azure Local cluster
  - All VMs and data
Secondary Site:
  - 3-node Azure Local cluster (idle)
  - Replica of primary data
  - No VMs running
Replication:
  - 1-hour RPO (asynchronous)
  - Bandwidth: 100 Mbps minimum
  - Via WAN or dedicated link
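One way to implement the asynchronous replication sketched above is Storage Replica; the following is only a sketch, every computer, replication group, and volume name is illustrative, and Azure Local deployments may instead configure replication through their own tooling:
# Create an asynchronous cluster-to-cluster partnership with a 1-hour RPO (3600 seconds)
New-SRPartnership `
    -SourceComputerName "primary-clus" -SourceRGName "rg-primary" `
    -SourceVolumeName "C:\ClusterStorage\Volume1" -SourceLogVolumeName "L:" `
    -DestinationComputerName "secondary-clus" -DestinationRGName "rg-secondary" `
    -DestinationVolumeName "C:\ClusterStorage\Volume1" -DestinationLogVolumeName "L:" `
    -ReplicationMode Asynchronous -AsyncRPO 3600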
Failover Process:
- Primary site fails
 - Ops team initiates failover
 - Secondary cluster promoted to primary
 - VMs started on secondary
 - Applications resume from replicas
 - RTO: 15-30 minutes (manual)
 - RPO: 1 hour (data loss risk)
 
Advantages:
- Simpler than active-active
 - Lower replication bandwidth
 - Lower cost (secondary can be smaller)
 - Well-understood recovery process
 
Disaster Recovery Runbooks
Node Failure Recovery
1. Detection Phase:
- Monitor: Check the cluster event log
- Diagnosis: Is the node hardware down, or is it a network issue?
- Decision: Restart the node or replace hardware?
2. Recovery Options:
Option A: Restart Failed Node
1. Power cycle server
2. Wait for node to rejoin cluster
3. Monitor storage rebuild
4. Verify cluster health
Option B: Hardware Replacement
1. Remove failed node from cluster
2. Replace hardware
3. Rejoin node to cluster (fresh install)
4. Monitor storage rebuild
5. Verify all disks recognized
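In a plain failover cluster, steps 1 and 3 map to evicting and re-adding the node; Azure Local deployments may instead use the platform's own add/remove server workflow, so treat this as a sketch (node name is illustrative):
# Evict the failed node from the cluster
Remove-ClusterNode -Name "Node2" -Force

# ... replace hardware, reinstall and configure the OS ...

# Re-add the node once it has been rebuilt
Add-ClusterNode -Name "Node2"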
3. Post-Recovery:
- Verify all VMs running
 - Check storage pool status
 - Monitor rebuild progress
 - Document timeline and issues
 
Storage Pool Degradation
Detection:
Alert triggers: Storage pool showing "degraded" status
- Cluster event: "Physical disk offline"
- Alarm: Low redundancy warning
Investigation:
1. Which physical disk failed?
   Get-StoragePool | Get-PhysicalDisk | Where-Object HealthStatus -ne 'Healthy'
   
2. Is it NVMe, SSD, or HDD?
3. Can it be recovered (reseat connector)?
4. Is spare capacity available?
Recovery:
1. If disk can be recovered:
   - Reseat connection
   - Monitor for reintegration
   
2. If disk must be replaced:
   - Remove old disk
   - Insert new disk
   - Monitor automatic rebuild
   - Verify: Get-PhysicalDisk
3. Monitor rebuild:
   - Check: Get-StorageJob
   - Expected time: 1-5 hours per TB
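If the disk must be replaced, a hedged sketch of the retire-and-repair flow looks like this (with Storage Spaces Direct, a newly inserted drive is normally added to the pool automatically):
# Retire the failed disk so Storage Spaces stops allocating to it
$failed = Get-PhysicalDisk | Where-Object HealthStatus -ne 'Healthy'
$failed | Set-PhysicalDisk -Usage Retired

# Kick off repair of any virtual disks that used the failed drive
Get-VirtualDisk | Where-Object HealthStatus -ne 'Healthy' | Repair-VirtualDisk

# Watch the rebuild
Get-StorageJob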
Network Partition Recovery
Detection:
- Cluster stops responding
- VMs appear frozen
- Management interfaces inaccessible
Investigation:
1. Check network connectivity between nodes
   ping node2 (from node1)
   ping node3 (from node1)
2. Check network switch:
   - Verify ports active
   - Check VLAN configuration
   - Verify QoS settings
3. Check node state and quorum configuration:
   Get-ClusterNode
   Get-ClusterQuorum
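4. If basic connectivity looks healthy but the cluster still misbehaves, run the network portion of cluster validation (node names are illustrative):
   Test-Cluster -Node "node1","node2","node3" -Include "Network"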
Recovery:
1. Restore network connection:
   - Fix network switch
   - Reconnect cable
   - Verify VLAN is correct
2. Monitor cluster recovery:
   - Cluster should automatically rejoin
   - VMs restart
   - Storage synchronizes
3. If manual recovery needed:
   - See network troubleshooting section
Complete Site Failure
Recovery Procedures:
Step 1: Assess Damage
- Determine if site is recoverable
 - Estimate recovery timeline
 - Check backup/replica status
 
Step 2: Failover to Secondary
- Confirm primary site completely down
 - Promote secondary cluster to primary
 - Update DNS/network routing
 - Start VMs on secondary
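Where Storage Replica carries the cross-site replication, promoting the secondary is done by reversing the replication direction; the cluster and replication group names below are illustrative, and this should only be run once the primary is confirmed down:
# Make the secondary cluster the replication source (its copy becomes writable)
Set-SRPartnership -NewSourceComputerName "secondary-clus" -SourceRGName "rg-secondary" `
    -DestinationComputerName "primary-clus" -DestinationRGName "rg-primary"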
 
Step 3: Validate Applications
- Test critical business functions
 - Verify data integrity
 - Check for replication lag issues
 - Notify stakeholders
 
Step 4: Plan Site Recovery
- Assess what failed (hardware, power, or network?)
 - Repair infrastructure
 - Restore from backups if needed
 - Test before returning to production
 
Example Timeline:
T+0:00    Primary site goes offline
T+0:05    Alert received and investigated
T+0:15    Decision made to failover to secondary
T+0:20    Secondary promoted to primary
T+0:30    First VM starts on secondary
T+1:00    All critical VMs running on secondary
T+2:00    All applications validated
T+4:00    Begin primary site recovery
T+8:00    Primary site infrastructure repaired
T+10:00   Data synchronized back to primary
T+12:00   Failback to primary completed
Testing HA and DR
Regular Failover Exercises
Monthly Test Procedure:
- Notify operations team (not real emergency)
 - Simulate node failure (e.g., stop the cluster service on one node; see the drill sketch after this list)
 - Monitor cluster response
 - Verify VMs restart
 - Check alert notifications
 - Document any issues
 - Resume normal operation
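A lightweight way to time the drill is to wrap the simulated failure in timestamps; the node name is illustrative:
# Record how long the cluster takes to bring roles back online during the drill
$t0 = Get-Date
Stop-ClusterNode -Name "Node3"                      # simulate the node failure
# ... wait until all VMs are running on the surviving nodes ...
Get-ClusterGroup | Where-Object State -ne 'Online'  # should eventually return nothing
Start-ClusterNode -Name "Node3"                     # end the drill
"Failover drill duration: $((Get-Date) - $t0)"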
 
Quarterly Full DR Test:
- Test secondary site startup
 - Verify data replicas are current
 - Perform application testing
 - Measure actual RTO (recovery time)
 - Document results
 - Update runbooks
 
RTO/RPO Validation
RTO Measurement:
RTO = Time from failure start to service restoration
Measure:
- T0: Failure event
- T1: Detection (cluster recognizes failure)
- T2: VM restart initiated
- T3: VM operating system boots
- T4: Application services start
- T5: Applications accepting requests
Typical breakdown for 3-node cluster:
- T1-T0: 30 seconds (failure detection)
- T2-T1: 5 seconds (orchestration)
- T3-T2: 2 minutes (OS boot)
- T4-T3: 2 minutes (application startup)
- T5-T4: 30 seconds (warming up)
Total RTO: ~5 minutes
RPO Measurement:
RPO = Maximum acceptable data loss, expressed as a window of time
Depends on:
- Storage mirror type (synchronous = 0)
- Replication strategy
- Backup frequency
Example:
- 3-way mirror + synchronous: RPO = 0 (no data loss)
- Async replication to secondary: RPO = 1 hour
- Nightly backups only: RPO = 24 hours
Key Takeaways
- Quorum Selection: Node majority for 3+ nodes, file share witness for 2-node
 - Failure Scenarios: Plan for single node, double node, and network partition
 - Automatic Recovery: Most scenarios recover automatically; understand which don’t
 - Testing: Regularly test failover and measure RTO/RPO
 - Documentation: Maintain current runbooks for all failure scenarios