Version : 1.0
Date : 2026-04-15
Environment : Development
Region : swedencentral
| Item | Value |
| --- | --- |
| Primary Region | swedencentral |
| Resource Group | rg-malta-catering-dev |
| Support Contact | platform@example.com |
| Escalation Path | Platform contact → workload owner → Azure support |

| Resource | Name | Resource Group | Severity |
| --- | --- | --- | --- |
| Web front end | app-malta-catering-dev | rg-malta-catering-dev | P1 |
| Storage persistence | stmaltadevb6lg3l | rg-malta-catering-dev | P1 |
| Secret store | kv-malta-dev-b6lg3l | rg-malta-catering-dev | P2 |
| Container registry | acrmaltadevb6lg3l | rg-malta-catering-dev | P2 |
Morning Health Check:
Confirm App Service plan state is Running and the production site state is Running.
Verify that curl -I against the production and staging endpoints no longer returns 503.
Review Key Vault, Storage, and ACR private endpoint provisioning state.
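The private endpoint review in the last step can be scripted. The following is a minimal sketch, assuming an authenticated az session and that all private endpoints live in rg-malta-catering-dev; the function names report_unhealthy_endpoints and list_private_endpoints are hypothetical and not part of the original runbook.

```shell
# Hypothetical helper: reads "name<TAB>provisioningState" lines on stdin,
# prints the names whose state is not Succeeded, and exits non-zero if any exist.
report_unhealthy_endpoints() {
  awk -F '\t' '$2 != "Succeeded" { print $1; bad = 1 } END { exit bad }'
}

# Lists the private endpoints in the resource group as tab-separated
# name/state pairs (requires a prior az login).
list_private_endpoints() {
  az network private-endpoint list \
    -g rg-malta-catering-dev \
    --query "[].[name, provisioningState]" -o tsv
}
```

Running `list_private_endpoints | report_unhealthy_endpoints` prints any endpoint that is still provisioning or has failed, and its exit code can gate the rest of the morning check.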
KQL Query — System Health Overview:
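// AppRequests: workspace-based Application Insights request table (assumed from the column names)
AppRequests
| where TimeGenerated > ago(24h)
| summarize Requests = count(), Failures = countif(Success == false), P95 = percentile(DurationMs, 95) by bin(TimeGenerated, 1h)
| order by TimeGenerated desc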
Priority Logs to Review:
| Log Source | Query Focus | Action Threshold |
| --- | --- | --- |
| Application Insights | Request failures and startup exceptions | Any sustained 5xx rate above 5% |
| App Service platform logs | Container pull or startup failures | Any failed container start |
| Budget notifications | Forecast threshold crossings | Any forecast ≥ 80% |

| Severity | Definition | Response Time |
| --- | --- | --- |
| P1 | Customer-facing site unavailable or returning 503 | 30 minutes |
| P2 | Core dependency degraded but site partially available | 4 hours |
| P3 | Non-blocking drift or documentation/config issue | 1 business day |
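The 5% action threshold for Application Insights request failures can be checked against the Requests and Failures counts returned by the health query. This is a minimal sketch with a hypothetical function name; it evaluates a single time window only, so judging whether a rate is "sustained" would still require comparing several consecutive windows.

```shell
# Hypothetical helper: succeeds (exit 0) when the failure share of requests
# is strictly above the threshold percentage (default 5, from the table).
exceeds_5xx_threshold() {
  failures=$1
  requests=$2
  threshold_pct=${3:-5}
  # No traffic means there is no measurable failure rate.
  [ "$requests" -gt 0 ] || return 1
  # Integer form of failures/requests > threshold_pct/100.
  [ $((failures * 100)) -gt $((requests * threshold_pct)) ]
}
```

For example, `exceeds_5xx_threshold 6 100 && echo "review Application Insights"` prints the reminder, while 5 failures in 100 requests does not cross the threshold.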
%%{init: {'theme':'neutral'}}%%
flowchart LR
D["Detect"] --> T["Triage"]
T --> E["Escalate"]
E --> R["Resolve"]
R --> P["Postmortem"]
style D fill:#D83B01,color:#fff
style R fill:#107C10,color:#fff
| Alert | Runbook | Owner |
| --- | --- | --- |
| Production or slot 503 | Restart + container verification | Platform owner |
| Image pull failure | Registry and RBAC check | Platform owner |
| Key Vault reference failure | Verify secret and Key Vault Secrets User assignment | Platform owner |
| Budget threshold triggered | Review spend and suppress nonessential usage | Platform owner |
az webapp restart -g rg-malta-catering-dev -n app-malta-catering-dev
az webapp restart -g rg-malta-catering-dev -n app-malta-catering-dev --slot staging
az appservice plan update -g rg-malta-catering-dev -n asp-malta-catering-dev --sku P1v3
az appservice plan update -g rg-malta-catering-dev -n asp-malta-catering-dev --number-of-workers 2
az webapp config container show -g rg-malta-catering-dev -n app-malta-catering-dev -o json
az webapp config appsettings list -g rg-malta-catering-dev -n app-malta-catering-dev -o table
| Task | Schedule | Duration |
| --- | --- | --- |
| Workload maintenance window | Sunday 02:00-06:00 | 4 hours |
| Container refresh and validation | Monthly | 1 hour |
| DR procedure walkthrough | Quarterly | 2 hours |
%%{init: {'theme':'neutral'}}%%
gantt
title Maintenance Schedule
dateFormat YYYY-MM-DD
section Routine
Weekly maintenance window :a1, 2026-04-19, 1d
Monthly image refresh :a2, 2026-05-01, 1d
section DR
DR walkthrough :b1, 2026-06-01, 1d
| Role | Contact | Phone | On-Call Rotation |
| --- | --- | --- | --- |
| Platform Owner | platform@example.com | N/A | Business hours |
| Workload Owner | Malta Catering demo owner | N/A | Ad hoc |
| Azure Escalation | Azure Support | N/A | N/A |
%%{init: {'theme':'neutral'}}%%
flowchart TD
L1["L1: Platform Owner"] --> L2["L2: Workload Owner"]
L2 --> L3["L3: Azure Support"]
L3 --> MGMT["Management"]
Generated : 2026-04-15
Version : 1.0
Environment : Development
Primary Region : swedencentral
Secondary Region : germanywestcentral (planned failover target only)
| Metric | Current | Target |
| --- | --- | --- |
| RPO | Best-effort for order data | 12 hours |
| RTO | 24 hours via IaC redeploy | 24 hours |
| Availability | Single-region deployment | 99.0% |

| Tier | RTO Target | Services |
| --- | --- | --- |
| Critical | 24 hours | App Service plan, production site, storage account |
| Important | 24 hours | Key Vault, ACR, private endpoints, DNS zones |
| Standard | 48 hours | Monitoring workspace, Application Insights, budget |

| Data Type | RPO Target | Backup Strategy |
| --- | --- | --- |
| App configuration | Rebuild from IaC | Bicep + parameter file + app settings |
| Key Vault secrets | 7 days recoverability | Soft delete and purge protection |
| Order data in Table Storage | Best-effort | No automated export deployed |
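During a DR walkthrough, the measured recovery time can be compared against the tier targets above. The sketch below encodes the RTO values from the tier table; the function names are hypothetical, and elapsed time is assumed to be recorded in whole hours.

```shell
# RTO targets in hours, copied from the tier table.
rto_target_hours() {
  case "$1" in
    Critical|Important) echo 24 ;;
    Standard)           echo 48 ;;
    *) echo "unknown tier: $1" >&2; return 1 ;;
  esac
}

# Hypothetical helper: succeeds when the measured recovery time (whole hours)
# is within the RTO target of the given tier.
within_rto() {
  target=$(rto_target_hours "$1") || return 1
  [ "$2" -le "$target" ]
}
```

A call such as `within_rto Critical 20 || echo "RTO missed"` can be appended to the walkthrough notes so that a missed target is flagged at the time of the exercise.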
%%{init: {'theme':'neutral'}}%%
gantt
title RPO / RTO Targets by Tier
dateFormat HH:mm
axisFormat %H:%M
section Critical
RPO :crit, rpo1, 00:00, 12h
RTO :crit, rto1, 00:00, 24h
section Important
RPO :rpo2, 00:00, 7h
RTO :rto2, 00:00, 24h
section Standard
RPO :rpo3, 00:00, 24h
RTO :rto3, 00:00, 48h
| Setting | Configuration |
| --- | --- |
| Backup Type | None deployed |
| Retention | Application-managed only |
| Geo-Redundancy | Not enabled (Standard_LRS) |
| Gap | No automated Table Storage export or restore path |

| Setting | Configuration |
| --- | --- |
| Soft Delete | Enabled |
| Purge Protection | Enabled |

| Setting | Configuration |
| --- | --- |
| Tier | Premium |
| Retention Policy | 15 days for untagged manifests |
| Geo-Redundancy | Not configured |
Confirm a regional service event or unrecoverable platform issue in swedencentral.
Select germanywestcentral as the recovery region for an EU-hosted redeploy.
Re-run the Bicep deployment with region overrides and compliant resource-group tags.
Re-push or re-import the required container image into a recovery ACR if registry access is unavailable.
Reconfigure application secrets and validate container startup before exposing the site.
Restore order data only if an out-of-band export exists; otherwise communicate best-effort data loss.
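The redeploy step in the failover sequence above might look like the following sketch. Only the recovery region comes from this plan; the recovery resource group name, the template path, and the name of the location parameter are placeholders that must be replaced with the values used by the actual Bicep project. The function prints the command rather than running it, so it can be reviewed before execution.

```shell
# Region taken from the plan; the other values are placeholder assumptions.
RECOVERY_REGION="germanywestcentral"
RECOVERY_RG="${RECOVERY_RG:-rg-malta-catering-dr}"   # hypothetical resource group name
TEMPLATE_FILE="${TEMPLATE_FILE:-main.bicep}"         # hypothetical template path
LOCATION_PARAM="${LOCATION_PARAM:-location}"         # hypothetical Bicep parameter name

# Prints the redeploy command instead of executing it (dry run).
print_failover_deploy() {
  printf 'az deployment group create -g %s --template-file %s --parameters %s=%s\n' \
    "$RECOVERY_RG" "$TEMPLATE_FILE" "$LOCATION_PARAM" "$RECOVERY_REGION"
}
```

Calling `print_failover_deploy` shows the exact command for sign-off. The recovery resource group, with its compliant tags, has to exist before the printed command is run.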
Validate that swedencentral is stable again.
Compare recovery-region configuration and app settings with the source-controlled Bicep state.
Deploy the canonical workload back into the primary region.
Repoint DNS or user access paths to the primary region endpoint.
Decommission temporary recovery resources after verification.
| Test Type | Frequency | Last Test | Next Test |
| --- | --- | --- | --- |
| App configuration redeploy | Quarterly | Not yet run | 2026-07-01 |
| Secret recovery validation | Quarterly | Not yet run | 2026-07-01 |
| Full DR walkthrough | Semi-annual | Not yet run | 2026-10-01 |
%%{init: {'theme':'neutral'}}%%
gantt
title DR Testing Schedule
dateFormat YYYY-MM-DD
section Backup Validation
Key Vault restore validation :a1, 2026-07-01, 1d
section Failover
Regional failover walkthrough :b1, 2026-10-01, 1d
section Full DR
Full DR exercise :c1, 2027-01-15, 2d
| Audience | Channel | Template |
| --- | --- | --- |
| Demo stakeholders | Email / Teams | Incident update and service restoration ETA |
| Platform owner | Direct escalation | Technical failure summary |
| Management | Escalation mail | Business impact and recovery decision |

| Role | Team | Responsibility |
| --- | --- | --- |
| Platform Owner | Demo Platform | Execute redeploy and infrastructure recovery |
| Application Owner | Malta Catering | Validate application behavior after recovery |
| Stakeholder Contact | Demo sponsor | Approve fallback decisions if data loss occurs |

| Dependency | Impact | Mitigation |
| --- | --- | --- |
| Container image in ACR | App cannot start without image pull | Keep canonical image tagged and document import procedure |
| Key Vault secret appinsights-connection-string | Missing secret breaks telemetry wiring | Recover from Key Vault soft delete or recreate from App Insights |
| Table Storage order data | No automated restore path today | Document best-effort recovery and prioritize export automation |
| Private DNS and private endpoints | Backend connectivity failure if missing | Redeploy network phase before compute validation |

| Scenario | Runbook | Owner |
| --- | --- | --- |
| Production endpoint returns 503 | Restart app, verify container image and app settings | Platform Owner |
| Secret deletion or Key Vault lockout | Recover secret or rehydrate from source metadata | Platform Owner |
| Regional outage | Rebuild to secondary EU region from Bicep | Platform Owner |
Trigger : Production or staging endpoint returns HTTP 503
Estimated Duration : 30-60 minutes
Confirm App Service plan and site are in Running state.
Verify linuxFxVersion, ACR pull settings, and Key Vault reference app settings.
Restart the site and slot, then re-run the health probes.
If the issue persists, inspect application/container logs and validate image availability.
Validation:
curl -I -L --max-time 20 https://app-malta-catering-dev.azurewebsites.net
curl -I -L --max-time 20 https://app-malta-catering-dev-staging.azurewebsites.net
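The two validation probes can be wrapped so that the result is a single pass/fail signal. This is a sketch rather than part of the original runbook: the URLs and curl options are taken from the commands above, while the function names and the verdict labels are hypothetical.

```shell
# Endpoints copied from the Validation commands in the runbook.
PROD_URL="https://app-malta-catering-dev.azurewebsites.net"
STAGING_URL="https://app-malta-catering-dev-staging.azurewebsites.net"

# Hypothetical helper: maps an HTTP status code to a verdict label.
classify_status() {
  case "$1" in
    2??|3??) echo "OK" ;;
    503)     echo "UNAVAILABLE" ;;
    000)     echo "NO_RESPONSE" ;;
    *)       echo "CHECK" ;;
  esac
}

# Prints the final status code after redirects; curl reports 000 when
# no HTTP response was received.
probe() {
  curl -s -o /dev/null -I -L --max-time 20 -w '%{http_code}' "$1"
}

# Hypothetical wrapper: probes both endpoints and fails if either is not OK.
validate_endpoints() {
  rc=0
  for url in "$PROD_URL" "$STAGING_URL"; do
    code=$(probe "$url") || true
    verdict=$(classify_status "$code")
    echo "$url -> $code ($verdict)"
    [ "$verdict" = "OK" ] || rc=1
  done
  return "$rc"
}
```

Running `validate_endpoints` after the restart step prints one line per endpoint and returns non-zero while either endpoint still answers 503 or does not respond.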