Operations Runbook

Version: 1.0 | Date: 2026-04-15 | Environment: Development | Region: swedencentral

| Item | Value |
| --- | --- |
| Primary Region | swedencentral |
| Resource Group | rg-malta-catering-dev |
| Support Contact | platform@example.com |
| Escalation Path | Platform contact → workload owner → Azure support |

| Resource | Name | Resource Group | Severity |
| --- | --- | --- | --- |
| Web front end | app-malta-catering-dev | rg-malta-catering-dev | P1 |
| Storage persistence | stmaltadevb6lg3l | rg-malta-catering-dev | P1 |
| Secret store | kv-malta-dev-b6lg3l | rg-malta-catering-dev | P2 |
| Container registry | acrmaltadevb6lg3l | rg-malta-catering-dev | P2 |

Morning Health Check:

  1. Confirm App Service plan state is Running and the production site state is Running.
  2. Run curl -I against the production and staging endpoints and verify that neither returns 503.
  3. Confirm that the Key Vault, Storage, and ACR private endpoints report a provisioning state of Succeeded.
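
The endpoint check in step 2 can be scripted so that both sites are probed in one pass. The following is a minimal sketch, not an existing tool: the function names `classify_status` and `probe` are illustrative, and the verdict labels are an assumption layered on the P1 definition in this runbook.

```sh
#!/bin/sh
# Minimal sketch: probe endpoints passed as arguments and flag 503 responses.
# Function names and verdict labels are illustrative, not existing tooling.

# Map an HTTP status code to a coarse verdict.
classify_status() {
  case "$1" in
    2??|3??) echo "ok" ;;
    503)     echo "unavailable" ;;   # matches the P1 definition in this runbook
    000)     echo "unreachable" ;;   # curl reports 000 when no response arrived
    *)       echo "investigate" ;;
  esac
}

# Send a HEAD request (following redirects, 20 s timeout) and print
# "<url> <status code> <verdict>".
probe() {
  probe_code="$(curl -s -o /dev/null -I -L --max-time 20 -w '%{http_code}' "$1")" || probe_code="000"
  printf '%s %s %s\n' "$1" "$probe_code" "$(classify_status "$probe_code")"
}

# Probe every URL given on the command line; with no arguments nothing runs.
for url in "$@"; do
  probe "$url"
done
```

Saved under a hypothetical name such as `healthcheck.sh`, it could be run as `sh healthcheck.sh https://app-malta-catering-dev.azurewebsites.net https://app-malta-catering-dev-staging.azurewebsites.net`.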

KQL Query — System Health Overview:

AppRequests
| where TimeGenerated > ago(24h)
| summarize Requests=count(), Failures=countif(Success == false), P95=percentile(DurationMs, 95) by bin(TimeGenerated, 1h)
| order by TimeGenerated desc

Priority Logs to Review:

| Log Source | Query Focus | Action Threshold |
| --- | --- | --- |
| Application Insights | Request failures and startup exceptions | Any sustained 5xx rate above 5% |
| App Service platform logs | Container pull or startup failures | Any failed container start |
| Budget notifications | Forecast threshold crossings | Any forecast ≥ 80% |
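
The 5% action threshold can be checked against the Requests and Failures columns returned by the health overview query. This is a minimal sketch under two assumptions: the function names are illustrative, and the Failures count (every request with Success == false) may include non-5xx responses, so the result is an approximation of the 5xx rate.

```sh
#!/bin/sh
# Minimal sketch: compare a failure count against the 5% action threshold.
# Inputs are the Requests and Failures values for one time bucket.

# Print the failure rate as a percentage with one decimal place.
failure_rate() {
  awk -v r="$1" -v f="$2" 'BEGIN { if (r == 0) print "0.0"; else printf "%.1f\n", 100 * f / r }'
}

# Succeed (exit 0) when failures are strictly above the threshold percentage.
# Arguments: requests, failures, optional threshold in percent (default 5).
# Integer arithmetic avoids floating-point comparison in the shell.
breaches_threshold() {
  [ "$1" -gt 0 ] && [ $(( $2 * 100 )) -gt $(( $1 * ${3:-5} )) ]
}
```

For example, `breaches_threshold 1200 90 && echo "above threshold: $(failure_rate 1200 90)%"` reports a rate of 7.5%. A single bucket over the limit is not yet "sustained"; the check would need to hold across consecutive buckets before acting.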

| Severity | Definition | Response Time |
| --- | --- | --- |
| P1 | Customer-facing site unavailable or returning 503 | 30 minutes |
| P2 | Core dependency degraded but site partially available | 4 hours |
| P3 | Non-blocking drift or documentation/config issue | 1 business day |

%%{init: {'theme':'neutral'}}%%
flowchart LR
    D["Detect"] --> T["Triage"]
    T --> E["Escalate"]
    E --> R["Resolve"]
    R --> P["Postmortem"]
    style D fill:#D83B01,color:#fff
    style R fill:#107C10,color:#fff

| Alert | Runbook | Owner |
| --- | --- | --- |
| Production or slot 503 | Restart + container verification | Platform owner |
| Image pull failure | Registry and RBAC check | Platform owner |
| Key Vault reference failure | Verify secret and Key Vault Secrets User assignment | Platform owner |
| Budget threshold triggered | Review spend and suppress nonessential usage | Platform owner |

Restart the production site and the staging slot:

az webapp restart -g rg-malta-catering-dev -n app-malta-catering-dev
az webapp restart -g rg-malta-catering-dev -n app-malta-catering-dev --slot staging

Scale the App Service plan up (SKU) or out (worker count):

az appservice plan update -g rg-malta-catering-dev -n asp-malta-catering-dev --sku P1v3
az appservice plan update -g rg-malta-catering-dev -n asp-malta-catering-dev --number-of-workers 2

Inspect the container configuration and app settings:

az webapp config container show -g rg-malta-catering-dev -n app-malta-catering-dev -o json
az webapp config appsettings list -g rg-malta-catering-dev -n app-malta-catering-dev -o table

| Task | Schedule | Duration |
| --- | --- | --- |
| Workload maintenance window | Sunday 02:00-06:00 | 4 hours |
| Container refresh and validation | Monthly | 1 hour |
| DR procedure walkthrough | Quarterly | 2 hours |

%%{init: {'theme':'neutral'}}%%
gantt
    title Maintenance Schedule
    dateFormat YYYY-MM-DD
    section Routine
        Weekly maintenance window :a1, 2026-04-19, 1d
        Monthly image refresh     :a2, 2026-05-01, 1d
    section DR
        DR walkthrough            :b1, 2026-06-01, 1d

| Role | Contact | Phone | On-Call Rotation |
| --- | --- | --- | --- |
| Platform Owner | platform@example.com | N/A | Business hours |
| Workload Owner | Malta Catering demo owner | N/A | Ad hoc |
| Azure Escalation | Azure Support | N/A | N/A |

%%{init: {'theme':'neutral'}}%%
flowchart TD
    L1["L1: Platform Owner"] --> L2["L2: Workload Owner"]
    L2 --> L3["L3: Azure Support"]
    L3 --> MGMT["Management"]

Generated: 2026-04-15 | Version: 1.0 | Environment: Development | Primary Region: swedencentral | Secondary Region: germanywestcentral (planned failover target only)

| Metric | Current | Target |
| --- | --- | --- |
| RPO | Best-effort for order data | 12 hours |
| RTO | 24 hours via IaC redeploy | 24 hours |
| Availability | Single-region deployment | 99.0% |

| Tier | RTO Target | Services |
| --- | --- | --- |
| Critical | 24 hours | App Service plan, production site, storage account |
| Important | 24 hours | Key Vault, ACR, private endpoints, DNS zones |
| Standard | 48 hours | Monitoring workspace, Application Insights, budget |

| Data Type | RPO Target | Backup Strategy |
| --- | --- | --- |
| App configuration | Rebuild from IaC | Bicep + parameter file + app settings |
| Key Vault secrets | 7 days recoverability | Soft delete and purge protection |
| Order data in Table Storage | Best-effort | No automated export deployed |

%%{init: {'theme':'neutral'}}%%
gantt
    title RPO / RTO Targets by Tier
    dateFormat HH:mm
    axisFormat %H:%M
    section Critical
        RPO :crit, rpo1, 00:00, 12h
        RTO :crit, rto1, 00:00, 24h
    section Important
        RPO :rpo2, 00:00, 7h
        RTO :rto2, 00:00, 24h
    section Standard
        RPO :rpo3, 00:00, 24h
        RTO :rto3, 00:00, 48h

Storage account backup:

| Setting | Configuration |
| --- | --- |
| Backup Type | None deployed |
| Retention | Application-managed only |
| Geo-Redundancy | Not enabled (Standard_LRS) |
| Gap | No automated Table Storage export or restore path |

Key Vault:

| Setting | Configuration |
| --- | --- |
| Soft Delete | Enabled |
| Purge Protection | Enabled |
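
With soft delete enabled, a deleted secret can be listed and recovered from the CLI. The sketch below wraps the two `az keyvault secret` commands in a dry-run helper; the `run` and `recover_secret` names and the `DRY_RUN` switch are illustrative assumptions, and the caller is assumed to have both recover permission and network access to the vault.

```sh
#!/bin/sh
# Minimal sketch: recover a soft-deleted Key Vault secret.
# Set DRY_RUN=1 to print the az commands instead of running them.

VAULT_NAME="${VAULT_NAME:-kv-malta-dev-b6lg3l}"   # vault name from the resource inventory

# Run a command, or print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# List soft-deleted secrets for review, then recover the named secret.
recover_secret() {
  run az keyvault secret list-deleted --vault-name "$VAULT_NAME" -o table
  run az keyvault secret recover --vault-name "$VAULT_NAME" --name "$1"
}
```

Setting `DRY_RUN=1` and calling `recover_secret appinsights-connection-string` prints both commands for review before anything is changed.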

Container registry:

| Setting | Configuration |
| --- | --- |
| Tier | Premium |
| Retention Policy | 15 days for untagged manifests |
| Geo-Redundancy | Not configured |

Failover:

  1. Confirm a regional service event or unrecoverable platform issue in swedencentral.
  2. Select germanywestcentral as the recovery region for an EU-hosted redeploy.
  3. Re-run the Bicep deployment with region overrides and compliant resource-group tags.
  4. Re-push or re-import the required container image into a recovery ACR if registry access is unavailable.
  5. Reconfigure application secrets and validate container startup before exposing the site.
  6. Restore order data only if an out-of-band export exists; otherwise communicate best-effort data loss.
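
Steps 2 and 3 can be sketched as a resource-group creation followed by a Bicep deployment with a region override. Apart from the recovery region, every default below is a hypothetical placeholder: the recovery resource-group name, the template and parameter file paths, the tag value, and the assumption that the template exposes a parameter named `location`.

```sh
#!/bin/sh
# Minimal sketch of failover steps 2-3: redeploy the Bicep template into the recovery region.
# Set DRY_RUN=1 to print the az commands instead of running them.

RECOVERY_REGION="${RECOVERY_REGION:-germanywestcentral}"   # from this runbook
RECOVERY_RG="${RECOVERY_RG:-rg-malta-catering-dr}"         # hypothetical name
TEMPLATE_FILE="${TEMPLATE_FILE:-main.bicep}"               # hypothetical path
PARAMETER_FILE="${PARAMETER_FILE:-main.parameters.json}"   # hypothetical path
RG_TAGS="${RG_TAGS:-Environment=Development}"              # placeholder; use the required tag set

# Run a command, or print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Create the tagged recovery resource group, then deploy with the region overridden.
# The parameter name "location" is an assumption about the template.
failover_redeploy() {
  run az group create --name "$RECOVERY_RG" --location "$RECOVERY_REGION" --tags "$RG_TAGS"
  run az deployment group create \
    --resource-group "$RECOVERY_RG" \
    --template-file "$TEMPLATE_FILE" \
    --parameters "@${PARAMETER_FILE}" \
    --parameters location="$RECOVERY_REGION"
}
```

Running `failover_redeploy` with `DRY_RUN=1` first gives a reviewable command list, which helps when the redeploy is being approved under incident pressure.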

Failback:

  1. Validate that swedencentral is stable again.
  2. Compare recovery-region configuration and app settings with the source-controlled Bicep state.
  3. Deploy the canonical workload back into the primary region.
  4. Repoint DNS or user access paths to the primary region endpoint.
  5. Decommission temporary recovery resources after verification.

| Test Type | Frequency | Last Test | Next Test |
| --- | --- | --- | --- |
| App configuration redeploy | Quarterly | Not yet run | 2026-07-01 |
| Secret recovery validation | Quarterly | Not yet run | 2026-07-01 |
| Full DR walkthrough | Semi-annual | Not yet run | 2026-10-01 |

%%{init: {'theme':'neutral'}}%%
gantt
    title DR Testing Schedule
    dateFormat YYYY-MM-DD
    section Backup Validation
        Key Vault restore validation :a1, 2026-07-01, 1d
    section Failover
        Regional failover walkthrough :b1, 2026-10-01, 1d
    section Full DR
        Full DR exercise              :c1, 2027-01-15, 2d

| Audience | Channel | Template |
| --- | --- | --- |
| Demo stakeholders | Email / Teams | Incident update and service restoration ETA |
| Platform owner | Direct escalation | Technical failure summary |
| Management | Escalation mail | Business impact and recovery decision |

| Role | Team | Responsibility |
| --- | --- | --- |
| Platform Owner | Demo Platform | Execute redeploy and infrastructure recovery |
| Application Owner | Malta Catering | Validate application behavior after recovery |
| Stakeholder Contact | Demo sponsor | Approve fallback decisions if data loss occurs |

| Dependency | Impact | Mitigation |
| --- | --- | --- |
| Container image in ACR | App cannot start without image pull | Keep canonical image tagged and document import procedure |
| Key Vault secret appinsights-connection-string | Missing secret breaks telemetry wiring | Recover from Key Vault soft delete or recreate from App Insights |
| Table Storage order data | No automated restore path today | Document best-effort recovery and prioritize export automation |
| Private DNS and private endpoints | Backend connectivity failure if missing | Redeploy network phase before compute validation |

| Scenario | Runbook | Owner |
| --- | --- | --- |
| Production endpoint returns 503 | Restart app, verify container image and app settings | Platform Owner |
| Secret deletion or Key Vault lockout | Recover secret or rehydrate from source metadata | Platform Owner |
| Regional outage | Rebuild to secondary EU region from Bicep | Platform Owner |

Trigger: Production or staging endpoint returns HTTP 503
Estimated Duration: 30-60 minutes

  1. Confirm App Service plan and site are in Running state.
  2. Verify linuxFxVersion, ACR pull settings, and Key Vault reference app settings.
  3. Restart the site and slot, then re-run the health probes.
  4. If the issue persists, inspect application/container logs and validate image availability.
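
For step 2, the linuxFxVersion value can be split into its parts to confirm that the site pulls from the expected registry. This is a minimal sketch: the function names are illustrative, and the value is assumed to have the form `DOCKER|<registry>/<repository>:<tag>`.

```sh
#!/bin/sh
# Minimal sketch: split a linuxFxVersion value into runtime kind, registry, and image reference.
# Assumes the form "DOCKER|<registry>/<repository>:<tag>".

# Print "<kind> <registry> <image>"; also leaves the parts in fx_kind, fx_registry, fx_image.
parse_linux_fx_version() {
  fx_kind="${1%%|*}"              # text before the first "|"
  fx_image="${1#*|}"              # text after the first "|"
  fx_registry="${fx_image%%/*}"   # first path segment, normally the registry host
  printf '%s %s %s\n' "$fx_kind" "$fx_registry" "$fx_image"
}

# Succeed when the value is a DOCKER image pulled from the expected registry login server.
uses_expected_registry() {
  parse_linux_fx_version "$1" >/dev/null
  [ "$fx_kind" = "DOCKER" ] && [ "$fx_registry" = "$2" ]
}
```

The input can be read with `az webapp config show -g rg-malta-catering-dev -n app-malta-catering-dev --query linuxFxVersion -o tsv` and compared against `acrmaltadevb6lg3l.azurecr.io`, the login server implied by the registry name in this runbook.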

Validation:

curl -I -L --max-time 20 https://app-malta-catering-dev.azurewebsites.net
curl -I -L --max-time 20 https://app-malta-catering-dev-staging.azurewebsites.net
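
After a restart the site can take some time to come back, so the validation probes may need repeating. The following is a minimal polling sketch: the function names, the `STATUS_CMD` override, and the default of 10 attempts 30 seconds apart are illustrative assumptions, and only 2xx or 3xx responses are treated as healthy.

```sh
#!/bin/sh
# Minimal sketch: poll an endpoint after a restart until it answers with 2xx or 3xx.
# Attempt count and delay defaults are arbitrary; tune them to the incident.

# Print the HTTP status code for a URL (HEAD request, redirects followed, 20 s timeout).
http_status() {
  curl -s -o /dev/null -I -L --max-time 20 -w '%{http_code}' "$1"
}

# Arguments: url, optional max attempts (default 10), optional delay in seconds (default 30).
# STATUS_CMD may name a replacement for http_status, for example a stub in tests.
wait_until_healthy() {
  wait_url="$1"; wait_max="${2:-10}"; wait_delay="${3:-30}"; wait_attempt=1
  while [ "$wait_attempt" -le "$wait_max" ]; do
    wait_code="$("${STATUS_CMD:-http_status}" "$wait_url")" || wait_code="000"
    echo "attempt ${wait_attempt}: ${wait_url} -> ${wait_code}"
    case "$wait_code" in
      2??|3??) return 0 ;;
    esac
    if [ "$wait_attempt" -lt "$wait_max" ]; then
      sleep "$wait_delay"
    fi
    wait_attempt=$(( wait_attempt + 1 ))
  done
  return 1
}
```

For example, `wait_until_healthy https://app-malta-catering-dev.azurewebsites.net || echo "still failing, continue with step 4"` keeps the escalation decision explicit once the attempts are exhausted.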