Backup and disaster recovery runbook

Scope: Heritage Community Hub — production environment on Azure
Last reviewed: 2026-06-18
Owner: Platform owner (Kristopher Turner)
Related ADRs: ADR 0004, ADR 0005, ADR 0008, ADR 0024

Overview

This runbook covers the backup strategy, recovery procedures, and disaster recovery (DR) scenarios for the Heritage Community Hub platform. The platform runs on Azure and stores three categories of data that must be protected:

Data category	Service	Notes
Relational data	Azure Database for PostgreSQL Flexible Server	Members, family groups, content, audit log, approval workflows
Media files	Azure Blob Storage (behind `StorageProvider` interface)	Sermon audio/video, images, attachments
Secrets and configuration	Azure Key Vault (`kv-hcs-vault-01`)	Connection strings, API keys, app configuration

The API compute layer (Azure Container Apps) and web shell (Azure Static Web Apps) are stateless. Container images are rebuilt from source on every deployment; the web shell is deployed from the repository. Neither requires a backup procedure — recovery for compute is a redeployment, covered in the DR scenarios below.

Recovery objectives

The following targets are sized for a small faith community (~200 members) where data freshness and recovery speed matter, but the platform is not life-safety critical.

Objective	Target	Rationale
Recovery point objective (RPO)	1 hour	Flexible Server backs up transaction logs (WAL) roughly every 5 minutes, comfortably inside this target; losing more than one hour of community data (new signups, published content, messages) is unacceptable
Recovery time objective (RTO)	4 hours	A small team can complete a Postgres point-in-time restore and verify application health within 4 hours during business hours

If a future compliance requirement (for example, a COPPA audit) demands tighter objectives, revisit the Flexible Server backup configuration and consider adding transaction-log shipping.

Database backups

Azure DB for PostgreSQL Flexible Server automated backups

Azure Database for PostgreSQL Flexible Server (the managed host chosen in ADR 0024) takes automated backups continuously. Key characteristics:

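Snapshot backup: taken once per day automatically by Azure; the first snapshot is taken immediately after the server is created.
Transaction log (WAL) backup: WAL files are archived continuously as they fill, so the newest restorable point is normally no more than about 5 minutes old. This enables point-in-time restore (PITR) to any moment within the retention period, and it is what satisfies the 1-hour RPO target.
Default retention period: 7 days (configurable from 7 to 35 days). Retention controls how far back a restore can reach, not how much recent data can be lost.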
Recommended retention period: set to 14 days in the Flexible Server configuration to provide a two-week recovery window for latent data-corruption scenarios.
Geo-redundant backup: disabled by default; must be explicitly enabled at server creation. Enable geo-redundant backup in the production server Bicep template so that backup files are replicated to the Azure paired region. This protects against an Azure-region-level outage and is the only automated off-region copy.

To verify or change the retention period:

bash

# Show current backup retention and geo-redundancy settings
az postgres flexible-server show \
  --resource-group <rg-name> \
  --name <server-name> \
  --query "{retention: backup.backupRetentionDays, geoRedundant: backup.geoRedundantBackup}"

To update retention to 14 days (run once; idempotent):

bash

az postgres flexible-server update \
  --resource-group <rg-name> \
  --name <server-name> \
  --backup-retention 14
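The two settings above can also be checked from a script. The following is a minimal sketch, not part of the platform's tooling: check_backup_settings is a hypothetical helper that takes the retention days and geo-redundancy value reported by the show command and compares them with this runbook's recommendations (14 days, geo-redundant backup enabled).

```shell
# Hypothetical helper: compare reported backup settings with the runbook's
# recommendations. Arguments: <retention-days> <geoRedundantBackup value>.
check_backup_settings() {
  local retention_days="$1" geo_redundant="$2" status=0
  if [ "$retention_days" -lt 14 ]; then
    echo "WARN: retention is ${retention_days} days; recommended is 14"
    status=1
  fi
  if [ "$geo_redundant" != "Enabled" ]; then
    echo "WARN: geo-redundant backup is ${geo_redundant}; recommended is Enabled"
    status=1
  fi
  if [ "$status" -eq 0 ]; then
    echo "OK: retention=${retention_days} geoRedundant=${geo_redundant}"
  fi
  return "$status"
}

# Example with the Flexible Server defaults (7 days, geo-redundancy off)
check_backup_settings 7 Disabled
```

The function returns a non-zero status when either setting falls short, so it can gate a scheduled check or a deployment pipeline step.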

Manual backup procedure

Run a manual pg_dump backup before any significant change (schema migration, bulk import, major infrastructure update), and also on the first day of each calendar month as a supplemental archive.

Prerequisites:

pg_dump installed locally (PostgreSQL client tools 15+; the client's major version must be the same as or newer than the server's, because pg_dump cannot dump from a newer server).
Database connection string retrieved from Key Vault (see "Secrets and configuration" section).
Destination: the backups container, under the postgres/ prefix, in the same storage account used by the platform.

Naming convention:

heritage-hub-<environment>-<YYYYMMDD>-<HHMMSS>.dump

Example: heritage-hub-prod-20260618-140000.dump
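As an illustration of the convention, the sketch below builds and validates a filename. Both function names are hypothetical, and the validation assumes environment names are lowercase letters only (the runbook does not say which characters are allowed).

```shell
# Hypothetical helpers illustrating the naming convention above.
# backup_filename <environment> [<YYYYMMDD-HHMMSS>]; defaults to the current UTC time.
backup_filename() {
  local environment="$1" timestamp="${2:-$(date -u +%Y%m%d-%H%M%S)}"
  printf 'heritage-hub-%s-%s.dump\n' "$environment" "$timestamp"
}

# Succeeds only when the name matches the convention.
# Assumption: environment names consist of lowercase letters.
is_valid_backup_filename() {
  printf '%s\n' "$1" | grep -Eq '^heritage-hub-[a-z]+-[0-9]{8}-[0-9]{6}\.dump$'
}

backup_filename prod 20260618-140000
# prints heritage-hub-prod-20260618-140000.dump
```

Validating names before upload keeps the postgres/ prefix sortable by date, which makes it easier to find the right dump during a restore.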

Procedure:

bash

# Stop on the first failure so that a partial dump is never uploaded
set -euo pipefail

# 1. Retrieve the connection string from Key Vault
DB_URL=$(az keyvault secret show \
  --vault-name kv-hcs-vault-01 \
  --name heritage-hub-db-connection-string \
  --query value -o tsv)

# 2. Create the dump in custom format (-Fc): compressed and restorable with pg_restore
TIMESTAMP=$(date -u +%Y%m%d-%H%M%S)  # UTC, so filenames sort consistently
FILENAME="heritage-hub-prod-${TIMESTAMP}.dump"

pg_dump \
  --dbname="${DB_URL}" \
  --format=custom \
  --compress=9 \
  --file="/tmp/${FILENAME}"

# 3. Upload to Blob Storage (requires az cli with Storage Blob Data Contributor on the container)
az storage blob upload \
  --account-name <storage-account-name> \
  --container-name backups \
  --name "postgres/${FILENAME}" \
  --file "/tmp/${FILENAME}" \
  --auth-mode login

# 4. Remove the local copy
rm "/tmp/${FILENAME}"

Manual backups should be retained for 90 days. Azure Blob Storage lifecycle management rules can enforce this automatically; set a rule on the backups/postgres/ prefix with a 90-day delete action.
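A lifecycle rule for the 90-day retention might look like the sketch below. The rule name and the output path are illustrative assumptions. Note that a storage account has a single management policy, so if one already exists, merge this rule into it rather than replacing the policy.

```shell
# Sketch: write a lifecycle policy that deletes manual dumps 90 days after
# their last modification. Rule name and file path are illustrative.
write_backup_lifecycle_policy() {
  local out_file="$1"
  cat > "$out_file" <<'JSON'
{
  "rules": [
    {
      "enabled": true,
      "name": "delete-postgres-dumps-after-90-days",
      "type": "Lifecycle",
      "definition": {
        "filters": {
          "blobTypes": ["blockBlob"],
          "prefixMatch": ["backups/postgres/"]
        },
        "actions": {
          "baseBlob": {
            "delete": { "daysAfterModificationGreaterThan": 90 }
          }
        }
      }
    }
  ]
}
JSON
}

policy_file="${TMPDIR:-/tmp}/backup-lifecycle-policy.json"
write_backup_lifecycle_policy "$policy_file"
echo "wrote ${policy_file}"

# To apply it (requires the az CLI and rights on the storage account):
# az storage account management-policy create \
#   --resource-group <rg-name> \
#   --account-name <storage-account-name> \
#   --policy @"$policy_file"
```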

Restore procedure

Use this procedure to restore from either an automated PITR backup or a manual pg_dump backup.

Option A: point-in-time restore (Azure PITR)

Use this path for most recovery scenarios. Azure manages the restore entirely.

Preconditions:

Know the approximate timestamp of the last known-good state.
The target time must fall within the server's retention window (14 days if configured as above).

Steps:

bash

# 1. Restore to a new server instance (do not overwrite the original until verified)
az postgres flexible-server restore \
  --resource-group <rg-name> \
  --name heritage-hub-prod-restored \
  --source-server <source-server-name> \
  --restore-time "2026-06-18T12:00:00Z"

# 2. Connect to the restored server and verify row counts and data integrity
# (see verification steps below)

# 3. If verified, update the connection string in Key Vault to point to the
#    restored server, or swap DNS / failover the application.

# 4. Delete the original corrupted server only after the application is confirmed
#    healthy on the restored instance.
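Before starting a PITR, it can help to confirm that the chosen restore time falls inside the retention window. The sketch below is a hypothetical pre-flight check that assumes GNU date (for the -d option) and a 14-day window by default. It is only an approximation: the true earliest restore point is whatever Azure reports for the server, which can be later than the calculated one (for example, on a recently created server).

```shell
# Hypothetical pre-flight check. Assumes GNU date; timestamps are ISO 8601 UTC.
# restore_time_in_window <restore-time> <now> [<retention-days>]
restore_time_in_window() {
  local restore_time="$1" now="$2" retention_days="${3:-14}"
  local restore_epoch now_epoch earliest_epoch
  restore_epoch=$(date -u -d "$restore_time" +%s) || return 2
  now_epoch=$(date -u -d "$now" +%s) || return 2
  earliest_epoch=$(( now_epoch - retention_days * 86400 ))
  # The restore time must be neither older than the window nor in the future
  [ "$restore_epoch" -ge "$earliest_epoch" ] && [ "$restore_epoch" -le "$now_epoch" ]
}

if restore_time_in_window "2026-06-18T12:00:00Z" "2026-06-20T09:00:00Z"; then
  echo "restore time is inside the retention window"
else
  echo "restore time is outside the retention window"
fi
```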

Option B: restore from pg_dump

Use this path when the automated backup is unavailable or when restoring to a different host (see "Portability note" section).

Steps:

bash

# 1. Download the target backup from Blob Storage
az storage blob download \
  --account-name <storage-account-name> \
  --container-name backups \
  --name "postgres/<filename>.dump" \
  --file "/tmp/<filename>.dump" \
  --auth-mode login

# 2. Create a fresh target database (must be empty)
# Connect to the target Postgres server and run:
# CREATE DATABASE heritage_hub;

# 3. Restore using pg_restore
pg_restore \
  --dbname="<target-connection-string>" \
  --format=custom \
  --no-owner \
  --no-privileges \
  --exit-on-error \
  "/tmp/<filename>.dump"

# 4. Verify (see steps below)
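A quick sanity check on the downloaded file can catch a truncated or wrong-format download before pg_restore runs. The sketch below uses a hypothetical helper name and relies on the fact that custom-format archives begin with the bytes PGDMP. It is only a first-pass check; running pg_restore --list against the file gives a fuller one.

```shell
# Hypothetical helper: succeeds when the file is non-empty and starts with the
# "PGDMP" magic bytes that pg_dump writes at the start of a custom-format archive.
is_custom_format_dump() {
  local dump_file="$1"
  [ -s "$dump_file" ] || return 1
  [ "$(head -c 5 "$dump_file")" = "PGDMP" ]
}

# Example with a stand-in file (a real dump comes from pg_dump --format=custom)
sample=$(mktemp)
printf 'PGDMP-stand-in-bytes' > "$sample"
if is_custom_format_dump "$sample"; then
  echo "looks like a custom-format dump"
fi
rm -f "$sample"
```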

Restore verification steps

After any restore, run the following checks before routing traffic to the restored database:

sql

-- 1. Spot-check row counts against expected ranges
SELECT COUNT(*) FROM "Users";
SELECT COUNT(*) FROM "FamilyGroups";
SELECT COUNT(*) FROM "AuditLog";

-- 2. Confirm the most recent AuditLog entry is within the expected RPO window
SELECT MAX("occurredAt") FROM "AuditLog";

-- 3. Check for orphaned family group memberships
SELECT COUNT(*) FROM "FamilyGroupMembers" m
WHERE NOT EXISTS (SELECT 1 FROM "Users" u WHERE u.id = m."userId");

4. Confirm Prisma migrations are current. Run from the repo root after pointing DATABASE_URL at the restored server:

bash

npx prisma migrate status

If prisma migrate status reports unapplied migrations, the dump predates a migration. Apply migrations before routing traffic:

bash

npx prisma migrate deploy
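Check 2 above (the newest AuditLog entry against the RPO window) is simple arithmetic that can be scripted. The following is a minimal sketch, assuming GNU date and a hypothetical helper name. It takes the timestamp returned by the MAX("occurredAt") query and the incident time, and reports whether the gap is within the 1-hour RPO.

```shell
# Hypothetical helper. Assumes GNU date; timestamps are ISO 8601 UTC.
# rpo_gap_ok <newest-record-time> <incident-time> [<rpo-seconds>]
rpo_gap_ok() {
  local last_record="$1" incident="$2" rpo_seconds="${3:-3600}"
  local last_epoch incident_epoch gap
  last_epoch=$(date -u -d "$last_record" +%s) || return 2
  incident_epoch=$(date -u -d "$incident" +%s) || return 2
  gap=$(( incident_epoch - last_epoch ))
  echo "gap=${gap}s rpo=${rpo_seconds}s"
  [ "$gap" -le "$rpo_seconds" ]
}

# Example: newest restored record is 40 minutes older than the incident
rpo_gap_ok "2026-06-18T11:20:00Z" "2026-06-18T12:00:00Z"
# prints gap=2400s rpo=3600s
```

A non-zero exit status means the restore lost more data than the RPO allows, which is worth recording in the incident notes even if the restore proceeds.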

Media storage backups

Media files (sermon audio/video, images, attachments) are stored via the StorageProvider interface. The current Azure adapter uses Azure Blob Storage.

Current approach: Azure Blob Storage redundancy

Locally redundant storage (LRS): the default; three copies within a single Azure datacenter. Protects against hardware failure, not against datacenter or region outage.
Geo-redundant storage (GRS): recommended for production. Replicates asynchronously to the Azure paired region. Enable GRS on the media storage account in the Bicep template.
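Versioning: enable Blob versioning on the media storage account (it is an account-level setting) so that overwritten or deleted files can be recovered. Azure keeps previous versions until they are explicitly deleted, so add a lifecycle management rule that deletes versions older than 30 days.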

To verify the redundancy SKU:

bash

az storage account show \
  --resource-group <rg-name> \
  --name <storage-account-name> \
  --query "sku.name"
# Expected for production: "Standard_GRS"
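The SKU check can be turned into a pass/fail test for a script. The sketch below uses a hypothetical helper name. Accepting the RA-GRS, GZRS, and RA-GZRS variants alongside Standard_GRS is an assumption, on the grounds that they also replicate to a second region; the runbook itself only names Standard_GRS.

```shell
# Hypothetical helper: succeeds when the SKU replicates to a second region.
# Assumption: the RA-GRS, GZRS, and RA-GZRS variants are as acceptable as GRS.
is_geo_redundant_sku() {
  case "$1" in
    Standard_GRS|Standard_RAGRS|Standard_GZRS|Standard_RAGZRS) return 0 ;;
    *) return 1 ;;
  esac
}

# Example: feed it the value returned by the "sku.name" query above
if is_geo_redundant_sku "Standard_LRS"; then
  echo "OK: storage account is geo-redundant"
else
  echo "WARN: storage account is not geo-redundant"
fi
```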

To enable Blob versioning:

bash

az storage account blob-service-properties update \
  --resource-group <rg-name> \
  --account-name <storage-account-name> \
  --enable-versioning true

Future: Cloudflare R2 via StorageProvider adapter

ADR 0024 defines the StorageProvider interface (upload(), delete(), signedUrl()). If the storage adapter is later changed to Cloudflare R2, the backup strategy in this section must be updated to use whatever replication options R2 offers at that time. No application code changes are required; only the adapter module is replaced. This runbook's media recovery steps remain the same at the interface level.

Secrets and configuration

All secrets (database connection strings, Clerk API keys, Twilio credentials, third-party API tokens) are stored in Azure Key Vault (kv-hcs-vault-01).

Soft delete and purge protection

Key Vault soft delete and purge protection must be enabled on kv-hcs-vault-01. These features ensure that a deleted secret is retained for a recovery window rather than immediately destroyed.

Soft delete: enabled. Deleted secrets are retained for 90 days and can be recovered.
Purge protection: enabled. Prevents permanent deletion of a soft-deleted secret (including by an administrator) until the retention period expires. This protects against accidental or malicious permanent secret loss.

Verify both settings:

bash

az keyvault show \
  --name kv-hcs-vault-01 \
  --query "{softDelete: properties.enableSoftDelete, purgeProtection: properties.enablePurgeProtection}"
# Expected: {"softDelete": true, "purgeProtection": true}

If purge protection is not yet enabled, enable it. Note that purge protection cannot be turned off once it is on:

bash

az keyvault update \
  --name kv-hcs-vault-01 \
  --enable-purge-protection true

Recovering a deleted secret

If a secret is accidentally deleted, recover it within the 90-day soft-delete window:

bash

# List deleted secrets to find the target
az keyvault secret list-deleted --vault-name kv-hcs-vault-01

# Recover a specific secret
az keyvault secret recover \
  --vault-name kv-hcs-vault-01 \
  --name <secret-name>

Key Vault access loss: see DR scenario below

If access to kv-hcs-vault-01 is lost entirely (vault deleted, subscription suspended, or RBAC lock-out), follow the scenario in "Scenario: Key Vault access lost" below.

Disaster recovery scenarios

Scenario: database corruption

Trigger: Application errors indicate data inconsistency; queries return unexpected results; Prisma ORM errors suggest schema/data mismatch; accidental bulk delete or update.

Preconditions:

Know the approximate time of the last known-good state.
Have access to the Azure portal or az CLI with Contributor rights on the resource group.

Steps:

Isolate the corrupted database. Deactivate the active Container App revision to prevent further writes that could complicate recovery. Scaling to zero is not a reliable stop: Container Apps requires a maximum replica count of at least 1, and an app with a minimum of 0 replicas starts again on the next request.

bash

# Find the active revision, then deactivate it
az containerapp revision list \
  --resource-group <rg-name> \
  --name <container-app-name> \
  --query "[?properties.active].name" -o tsv

az containerapp revision deactivate \
  --resource-group <rg-name> \
  --name <container-app-name> \
  --revision <active-revision-name>

Identify the target restore time. Check the AuditLog table (if accessible) or application logs in Azure Monitor / Application Insights to determine the last known-good timestamp.
Initiate PITR restore following Option A in the restore procedure above. Target a time 5–10 minutes before the first observed error.
Verify the restored instance using the verification steps in the restore procedure.

Update the connection string in Key Vault to point to the restored server:

bash

az keyvault secret set \
  --vault-name kv-hcs-vault-01 \
  --name heritage-hub-db-connection-string \
  --value "<new-connection-string>"

Restart the Container App by reactivating the revision that was deactivated in the first step:

bash

az containerapp revision activate \
  --resource-group <rg-name> \
  --name <container-app-name> \
  --revision <active-revision-name>

Smoke-test key application paths: member sign-in, family group page, sermon listing.
Delete the corrupted server only after the restored instance is confirmed healthy in production. Retain the corrupted server for at least 24 hours in case a forensic review is needed.
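The restore target in this scenario (5 to 10 minutes before the first observed error) can be computed rather than worked out by hand under pressure. The sketch below assumes GNU date and uses a hypothetical helper name. It subtracts a safety margin from the first-error timestamp and prints the result in the format expected by --restore-time.

```shell
# Hypothetical helper. Assumes GNU date; input is an ISO 8601 UTC timestamp.
# restore_target_time <first-error-time> [<margin-minutes>]
restore_target_time() {
  local first_error="$1" margin_minutes="${2:-10}" error_epoch
  error_epoch=$(date -u -d "$first_error" +%s) || return 2
  date -u -d "@$(( error_epoch - margin_minutes * 60 ))" +%Y-%m-%dT%H:%M:%SZ
}

# Example: first error seen at 12:07 UTC, default 10-minute margin
restore_target_time "2026-06-18T12:07:00Z"
# prints 2026-06-18T11:57:00Z
```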

Scenario: Container App deployment failure

Trigger: A new container image is deployed and the application fails health checks, returns 5xx errors, or crashes on startup.

Note: Azure Container Apps retains previous revisions. A rollback is a revision re-activation, not a database operation. No data recovery is required.

Steps:

List recent revisions to identify the last stable one:

bash

az containerapp revision list \
  --resource-group <rg-name> \
  --name <container-app-name> \
  --query "[].{name:name, active:properties.active, created:properties.createdTime, traffic:properties.trafficWeight}" \
  --output table

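Reactivate the previous stable revision if it is inactive, then direct 100% of traffic to it:

bash

az containerapp revision activate \
  --resource-group <rg-name> --name <container-app-name> \
  --revision <previous-stable-revision-name>

az containerapp ingress traffic set \
  --resource-group <rg-name> \
  --name <container-app-name> \
  --revision-weight <previous-stable-revision-name>=100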

Deactivate the failing revision:

bash

az containerapp revision deactivate \
  --resource-group <rg-name> \
  --name <container-app-name> \
  --revision <failing-revision-name>

Verify the application is healthy by checking the Container App health endpoint and confirming no new 5xx errors in Application Insights.

Investigate the failing image before re-deploying. Review the revision logs:

bash

az containerapp logs show \
  --resource-group <rg-name> \
  --name <container-app-name> \
  --revision <failing-revision-name> \
  --type system
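Choosing the rollback target from the revision list can be scripted. The following is a minimal sketch with a hypothetical helper name. It reads tab-separated name and creation-time pairs on standard input and prints the newest revision other than the failing one. Treating "newest other revision" as "last stable" is an assumption; confirm the choice against the deployment history before routing traffic.

```shell
# Hypothetical helper: reads "name<TAB>createdTime" lines on stdin and prints
# the most recently created revision that is not the failing one.
# Assumes ISO 8601 timestamps with the same UTC offset, so they sort as text.
previous_stable_revision() {
  local failing="$1"
  awk -F '\t' -v failing="$failing" '$1 != failing { print $2 "\t" $1 }' \
    | sort -r \
    | head -n 1 \
    | cut -f 2
}

# Example with illustrative revision names
printf '%s\t%s\n' \
  "api--rev1" "2026-06-10T09:00:00+00:00" \
  "api--rev2" "2026-06-15T09:00:00+00:00" \
  "api--rev3" "2026-06-18T09:00:00+00:00" \
  | previous_stable_revision "api--rev3"
# prints api--rev2
```

The input can be produced from the revision list command above by querying only the name and creation time and requesting tab-separated output.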

Scenario: Key Vault access lost

Trigger: The platform cannot read secrets from kv-hcs-vault-01; Container App returns connection errors; az keyvault secret show returns 403 or the vault is not found.

Possible causes:

RBAC role assignment removed from the managed identity.
Key Vault firewall rule blocked the Container App's outbound IPs.
Key Vault soft-deleted (recoverable within 90 days).
Azure subscription suspended.

Steps by cause:

Cause: RBAC role assignment missing

bash

# Identify the Container App's managed identity
az containerapp show \
  --resource-group <rg-name> \
  --name <container-app-name> \
  --query "identity.principalId" -o tsv

# Re-assign Key Vault Secrets User role
az role assignment create \
  --role "Key Vault Secrets User" \
  --assignee <principal-id> \
  --scope /subscriptions/<subscription-id>/resourceGroups/<rg-name>/providers/Microsoft.KeyVault/vaults/kv-hcs-vault-01

Cause: firewall rule blocking access

bash

# Add the Container App's outbound IP (or allow Azure services)
az keyvault network-rule add \
  --name kv-hcs-vault-01 \
  --ip-address <outbound-ip>
# OR allow all trusted Azure services (less precise, acceptable at this scale):
az keyvault update \
  --name kv-hcs-vault-01 \
  --bypass AzureServices

Cause: vault soft-deleted

bash

# List soft-deleted vaults
az keyvault list-deleted

# Recover the vault (soft delete must be enabled; see "Secrets and configuration" section)
az keyvault recover --name kv-hcs-vault-01

Cause: subscription suspended

Contact Microsoft support to restore the subscription. While the subscription is suspended, the platform is fully offline. No in-place recovery is possible. If the outage extends beyond the RTO (4 hours), initiate a restore to a new Azure subscription or an alternative provider — this is the portability scenario described in the "Portability note" section.

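Portability note

Per ADR 0024, the platform database engine is PostgreSQL hosted on Azure Database for PostgreSQL Flexible Server. PostgreSQL behaves the same across managed providers and self-hosted installations, apart from provider-specific extensions and roles.

A pg_dump backup produced from Azure Database for PostgreSQL Flexible Server, restored with --no-owner and --no-privileges as in Option B, should restore without modification to any of the following, provided the target runs the same or a newer major version and offers any extensions the schema uses: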

AWS RDS for PostgreSQL
GCP Cloud SQL for PostgreSQL
Supabase (managed Postgres)
Any self-hosted Postgres 15+ instance (on-premises or VPS)

No schema translation, query rewriting, or ORM reconfiguration is required. The connection string is the only change.

This means a provider-migration DR scenario (Azure unavailable long-term) is a connection-string update after restoring the pg_dump to an alternative host. The Container App image (OCI-compliant Docker, ADR 0024) deploys unchanged to AWS Fargate/ECS, GCP Cloud Run, or any other OCI host. The SecretsProvider and StorageProvider interfaces (ADR 0024) are the two adapter modules that need updating — all business logic is unaffected.

Related ADRs

ADR	Title	Relevance
ADR 0004	Cloud/hosting stack, CI/CD, and free-tier path	Defines Azure services in scope; Key Vault, Blob Storage, SWA, GitHub Actions
ADR 0024	Cloud portability and provider abstraction	Confirms Postgres on Azure DB for PostgreSQL Flexible Server; defines `StorageProvider` and `SecretsProvider` interfaces; containerized API on Azure Container Apps
ADR 0005	Observability model	Application Insights + Azure Monitor for alerting on failures that trigger DR scenarios; `AuditLog` table (Layer 1) is a key verification point during restore
ADR 0008	Platform composition	Confirms compute is stateless (Container Apps); all state lives in the database and Blob Storage
