
Observability runbook

Audience: Platform operators (Plane-1 identities — see ADR 0006)
ADO work item: AB#3182 (Operations documentation — Sprint 2026-Q3-S4)
Last updated: 2026-06-18


Overview

The platform implements the four-layer observability model decided in ADR 0005. Each layer covers a distinct concern; together they answer the questions the minister and platform operators need answered: who accessed the platform, when, for how long, what happened, and whether the infrastructure is healthy.

| Layer | What it covers | Tool (Azure path) | Responsibility |
| --- | --- | --- | --- |
| 1: App audit log | Security-relevant events: login, logout, session duration, approval actions, content publish | AuditLog table (Postgres) | Application team |
| 2: App telemetry | API errors, performance traces, request/response timing, custom events | Azure Application Insights | Application team |
| 3: Infrastructure monitoring | Container Apps resource health, Postgres availability, storage, cost | Azure Monitor + Log Analytics workspace | Platform operator |
| 4: Auth-provider logs | Clerk session events, sign-in attempts, suspicious activity | Clerk dashboard | Platform operator |

Infrastructure vs application split. Layers 1 and 2 are implemented by the application (code in apps/api, apps/web, apps/mobile). Layers 3 and 4 are operator-managed — no code, only configuration and dashboard access.

Non-Azure fallback. If the platform relocates away from Azure (see ADR 0024 — portability posture), layers 2 and 3 shift to OpenTelemetry + Grafana Cloud free tier + Sentry for error tracking. Layer 1 (Postgres AuditLog table) and layer 4 (Clerk dashboard) are unaffected by a host change.

Mobile telemetry note (ADR 0005 risk). Azure Application Insights does not have an official React Native SDK. apps/mobile uses either Sentry (React Native SDK) or the OpenTelemetry React Native package for error tracking and custom events, regardless of the stack choice. Layer-3 infrastructure metrics are unaffected.


Structured logging

Log format

All log entries emitted by apps/api must be valid JSON written to stdout. Azure Container Apps (ADR 0024) captures stdout and routes it to the Log Analytics workspace automatically.

Required fields on every log entry:

| Field | Type | Notes |
| --- | --- | --- |
| timestamp | ISO 8601 string | UTC, e.g. "2026-06-18T14:30:00.000Z" |
| level | string | One of: error, warn, info, debug |
| service | string | Emitting service name, e.g. "api", "notifications", "mobile" |
| traceId | string | Correlation identifier across a single request chain |
| message | string | Human-readable summary |
| userId | string or null | Platform Users.id if the request is authenticated; null for unauthenticated requests |

Optional but strongly encouraged fields:

| Field | Type | When to include |
| --- | --- | --- |
| requestId | string | Unique ID for a single HTTP request |
| endpoint | string | HTTP method + path, e.g. "POST /announcements" |
| statusCode | integer | HTTP response status |
| durationMs | integer | Response time in milliseconds |
| errorCode | string | Application error code for known error conditions |

Example (authenticated API error):

json
{
  "timestamp": "2026-06-18T14:30:00.000Z",
  "level": "error",
  "service": "api",
  "traceId": "abc123",
  "requestId": "req-789",
  "userId": "usr_abc",
  "endpoint": "POST /announcements",
  "statusCode": 500,
  "durationMs": 120,
  "message": "Unexpected error saving announcement draft",
  "errorCode": "ANN_SAVE_FAILED",
  "stack": "Error: connection refused\n    at ..."
}

Example (unauthenticated request):

json
{
  "timestamp": "2026-06-18T14:31:00.000Z",
  "level": "info",
  "service": "api",
  "traceId": "def456",
  "requestId": "req-790",
  "userId": null,
  "endpoint": "POST /auth/session",
  "statusCode": 401,
  "durationMs": 45,
  "message": "Authentication failed — invalid Clerk token"
}

The telemetry facade (packages/shared-utils/telemetry) wraps the SDK so that switching from Application Insights to OpenTelemetry is isolated to that module (ADR 0005 consequence).
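The required and optional fields above can be captured in a type so that a missing required field fails at compile time. The following is a minimal sketch, not the facade's actual code: the names `LogEntry`, `LogInput`, and `formatLogEntry` are hypothetical, and it assumes the caller writes the returned line to stdout.

```typescript
// Sketch only: type and function names are illustrative, not taken from the codebase.
type LogLevel = "error" | "warn" | "info" | "debug";

interface LogEntry {
  // Required fields
  timestamp: string; // ISO 8601, UTC
  level: LogLevel;
  service: string;
  traceId: string;
  message: string;
  userId: string | null; // null for unauthenticated requests
  // Optional but strongly encouraged fields
  requestId?: string;
  endpoint?: string;
  statusCode?: number;
  durationMs?: number;
  errorCode?: string;
}

// Callers supply everything except the fields the logger fills in itself.
type LogInput = Omit<LogEntry, "timestamp" | "service">;

// Builds a single JSON line for one log entry.
function formatLogEntry(service: string, input: LogInput, now: Date = new Date()): string {
  const entry: LogEntry = { timestamp: now.toISOString(), service, ...input };
  return JSON.stringify(entry);
}

// Usage: one JSON object per line on stdout.
console.log(
  formatLogEntry("api", {
    level: "info",
    traceId: "def456",
    requestId: "req-790",
    userId: null,
    endpoint: "POST /auth/session",
    statusCode: 401,
    durationMs: 45,
    message: "Authentication failed: invalid Clerk token",
  }),
);
```

Because `userId` is typed as `string | null` rather than optional, an unauthenticated request has to pass `null` explicitly, which matches the second example above.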

Log levels

| Level | When to use |
| --- | --- |
| error | Unexpected failures that require investigation: unhandled exceptions, database connection failures, provider API errors (Twilio, SendGrid), 5xx responses |
| warn | Recoverable conditions or unusual but non-fatal events: retry attempt before success, stale push token removed, soft-limit threshold approached |
| info | Normal operational events worth retaining: authentication events, approval workflow state transitions, notification fan-out initiated, container startup/shutdown |
| debug | Detailed diagnostic data useful during active development or incident investigation; must be disabled in production by default |
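One way to enforce "debug disabled in production by default" is a severity threshold checked before an entry is written. This is a sketch under assumptions: the function names and the environment value `"production"` are hypothetical, and the real facade may take its minimum level from configuration instead.

```typescript
type Level = "error" | "warn" | "info" | "debug";

// Lower number means more severe.
const severity: Record<Level, number> = { error: 0, warn: 1, info: 2, debug: 3 };

// True when an entry at `level` should be written given the configured minimum.
function isEnabled(level: Level, minLevel: Level): boolean {
  return severity[level] <= severity[minLevel];
}

// Assumed default: production stops at info, every other environment allows debug.
function defaultMinLevel(environment: string): Level {
  return environment === "production" ? "info" : "debug";
}
```

During an incident an operator could temporarily lower the threshold to `debug` and restore it afterwards; how that override is delivered is outside this sketch.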

What to log

Log the following events explicitly. Do not rely on framework-level logging alone for these.

Request/response — every inbound API request:

  • Method, path, response status code, duration in milliseconds, and traceId.
  • Do not log request or response bodies. Log only metadata.
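The request metadata above can be assembled without touching the body. The sketch below is framework-agnostic and hypothetical (`buildRequestMeta` is not an existing helper). Dropping the query string is an extra assumption on top of the rule above, made because query parameters can carry tokens or personal data.

```typescript
interface RequestLogMeta {
  endpoint: string; // HTTP method + path
  statusCode: number;
  durationMs: number;
  traceId: string;
}

// Builds metadata-only request log fields. Bodies are never passed in.
function buildRequestMeta(
  method: string,
  rawUrl: string,
  statusCode: number,
  startedAtMs: number,
  finishedAtMs: number,
  traceId: string,
): RequestLogMeta {
  // Assumption: keep only the path, since query strings may contain sensitive values.
  const queryStart = rawUrl.indexOf("?");
  const path = queryStart === -1 ? rawUrl : rawUrl.slice(0, queryStart);
  return {
    endpoint: `${method.toUpperCase()} ${path}`,
    statusCode,
    durationMs: Math.max(0, Math.round(finishedAtMs - startedAtMs)),
    traceId,
  };
}
```

In an HTTP middleware, `startedAtMs` would be captured when the request arrives and `finishedAtMs` when the response finishes, so the duration covers the whole handler.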

Errors — every error-level event:

  • The message, errorCode, and the stack trace.
  • The userId (if authenticated) and the endpoint.
  • External provider errors must include the provider's error code or status string (e.g. Twilio error code, SendGrid rejection reason, Postgres error class).

Auth events — these feed both layer-2 telemetry and the layer-1 AuditLog table:

  • Successful sign-in (Clerk session validated, Users row resolved).
  • Failed sign-in (invalid token, user not found, status = pending_approval).
  • Sign-out or session expiry.
  • Role change (admin updates Users.role).

Approval workflow transitions (ADR 0023) — every state change in the announcement authoring workflow:

  • Draft created, submitted to queue, approved, rejected, published.
  • Include the announcementId, the actor's userId and role, and the new state.

Notification fan-out (ADR 0013):

  • Fan-out initiated: include announcementId, recipient count, channels active.
  • Per-channel delivery attempt result: success, provider error code, or hard bounce.
  • Stale push token removed.
  • SMS opt-out evaluated and suppressed for a recipient.

Child sub-account access (ADR 0007):

  • Any access to sections restricted to parent-approved scope should log the child userId and the section accessed.

What NOT to log

Never include the following in any log entry at any level, in any environment:

  • Passwords, PINs, or password hashes (including Argon2id hashes for child credentials — ADR 0003/0007).
  • Clerk session tokens, JWTs, or API keys for any provider.
  • Full OAuth tokens (access, refresh, or ID tokens).
  • Personal data beyond userId: names, email addresses, phone numbers, physical addresses, family member details.
  • Full request or response bodies.
  • Key Vault secret values.
  • Any field that could constitute a COPPA-regulated record for a child account.

If a log entry would otherwise include any of the above, replace the value with a truncated identifier or omit it entirely.
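A deny-list filter applied just before serialization can act as a last line of defence for the rules above. This is a sketch only: the key names in `FORBIDDEN_KEYS` are hypothetical examples, the function assumes plain JSON-like data, and it cannot catch sensitive values stored under innocuous keys, so it complements code review rather than replacing it.

```typescript
// Hypothetical deny-list, compared case-insensitively against property names.
const FORBIDDEN_KEYS = new Set([
  "password", "pin", "passwordhash",
  "token", "accesstoken", "refreshtoken", "idtoken", "jwt",
  "apikey", "authorization", "secret",
  "name", "email", "phone", "address",
  "body",
]);

// Returns a copy of `value` with forbidden keys removed at every depth.
function stripForbidden(value: unknown): unknown {
  if (Array.isArray(value)) {
    return value.map(stripForbidden);
  }
  if (value !== null && typeof value === "object") {
    const clean: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value as Record<string, unknown>)) {
      if (FORBIDDEN_KEYS.has(key.toLowerCase())) continue; // drop the field entirely
      clean[key] = stripForbidden(inner);
    }
    return clean;
  }
  return value; // primitives pass through unchanged
}
```

For example, `stripForbidden({ userId: "usr_abc", email: "a@example.org" })` keeps `userId` and drops `email`.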


Metrics

Application Insights and Azure Monitor collect the standard request and resource metrics automatically once the SDK and Container Apps integration are configured; metrics sourced from custom events need explicit instrumentation in apps/api. The list below defines what operators and the application team should verify is flowing and being retained.

Key application metrics

These come from layer-2 (Application Insights SDK) instrumentation in apps/api:

| Metric | Source | Why it matters |
| --- | --- | --- |
| Request rate (requests/min per endpoint) | HTTP middleware | Baseline for anomaly detection; unexpected spikes indicate a bug or abuse |
| Error rate (% of 5xx per endpoint) | HTTP middleware | Primary health signal for the API |
| Response time p50/p95/p99 per endpoint | HTTP middleware | SLO reference; slow endpoints indicate DB or provider issues |
| Notification fan-out delivery rate per channel | Custom event | Identifies a degraded channel (e.g. rising SMS failures) before it becomes an incident |
| Approval queue depth (count of status = 'queued' announcements) | Periodic query or custom event on state change | Signals a backlog if approvers are inactive |
| Auth failure rate | Custom event on failed token validation | Detects credential stuffing or misconfigured client |
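To make the error-rate and percentile rows concrete, the sketch below computes them from raw samples using the nearest-rank percentile definition. It is illustrative only: in practice the monitoring backend aggregates these values, its percentile method may differ, and the function names here are hypothetical.

```typescript
// Nearest-rank percentile over an ascending-sorted, non-empty array.
function percentile(sortedMs: number[], p: number): number {
  if (sortedMs.length === 0) throw new Error("no samples");
  const rank = Math.ceil((p * sortedMs.length) / 100); // 1-based rank
  const index = Math.min(sortedMs.length, Math.max(1, rank)) - 1;
  return sortedMs[index];
}

// p50/p95/p99 for one endpoint's response times in milliseconds.
function summarizeLatency(durationsMs: number[]): { p50: number; p95: number; p99: number } {
  const sorted = [...durationsMs].sort((a, b) => a - b);
  return {
    p50: percentile(sorted, 50),
    p95: percentile(sorted, 95),
    p99: percentile(sorted, 99),
  };
}

// Percentage of responses with a 5xx status code.
function errorRatePercent(statusCodes: number[]): number {
  if (statusCodes.length === 0) return 0;
  const failures = statusCodes.filter((s) => s >= 500 && s <= 599).length;
  return (failures / statusCodes.length) * 100;
}
```

The gap between p50 and p99 is often more informative than either number alone: a stable median with a climbing p99 points at a slow dependency affecting a minority of requests.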

Key infrastructure metrics

These come from layer-3 (Azure Monitor, automatically collected from Azure resources):

| Metric | Resource | Threshold of concern |
| --- | --- | --- |
| CPU utilization | Azure Container Apps (API container) | Sustained > 80% |
| Memory utilization | Azure Container Apps (API container) | Sustained > 80% |
| Replica count | Azure Container Apps | Unexpected scale-to-zero during active hours |
| Active connections | Azure DB for PostgreSQL Flexible Server | Approaching the instance connection limit |
| Storage utilization | Azure DB for PostgreSQL Flexible Server | > 80% of provisioned storage |
| Connection success rate | Azure DB for PostgreSQL Flexible Server | Drop below 100% indicates network or auth issue |
| Blob storage transactions | Azure Blob Storage | Unexpected spike indicates a runaway process or external scraping |
| Cost (daily/monthly) | Azure Cost Management | > 15% above the expected baseline |

Alerting

Alert rules are configured in Azure Monitor and routed to the on-call channel through Action Groups. All alert rules are defined in Bicep IaC (infrastructure/); no click-ops configuration.

Critical alerts

Critical alerts require immediate response (target response time: within 15 minutes during ministry hours, within 1 hour outside them).

| Alert | Condition | Action |
| --- | --- | --- |
| API error rate critical | 5xx error rate > 5% over a 5-minute window | Page on-call operator; check Application Insights failures blade |
| Container App unhealthy | Container Apps health probe failing; replica count drops to 0 unexpectedly | Page on-call operator; check Container Apps logs in Log Analytics |
| Postgres unreachable | Connection success rate < 100% for 2 consecutive minutes | Page on-call operator; check Flexible Server availability metrics |
| Authentication provider unavailable | Clerk API returning 5xx; verify via Clerk status page (status.clerk.com) | Page on-call operator; surface a maintenance notice in the web shell |
| Security anomaly: auth failure spike | Failed auth events > 20 per minute (layer-2 custom metric) | Page on-call operator; review Clerk dashboard (layer 4) for suspicious sessions |
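The production rules live in Bicep, but the window semantics of the error-rate condition are easy to misread, so here is a small sketch of how "5xx error rate > 5% over a 5-minute window" could be evaluated over raw samples. The types and function name are hypothetical, and treating an empty window as "not breached" is an assumption.

```typescript
interface RequestSample {
  timestampMs: number; // epoch milliseconds
  statusCode: number;
}

// True when the share of 5xx responses inside the window exceeds the threshold.
function errorRateBreached(
  samples: RequestSample[],
  nowMs: number,
  windowMs: number = 5 * 60 * 1000,
  thresholdPercent: number = 5,
): boolean {
  const recent = samples.filter(
    (s) => s.timestampMs > nowMs - windowMs && s.timestampMs <= nowMs,
  );
  if (recent.length === 0) return false; // assumption: no traffic means no breach
  const failures = recent.filter((s) => s.statusCode >= 500 && s.statusCode <= 599).length;
  // Cross-multiplied comparison avoids floating-point error at the boundary.
  return failures * 100 > thresholdPercent * recent.length;
}
```

Note that the comparison is strict: exactly 5% does not fire, matching the "> 5%" wording in the table.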

Warning alerts

Warning alerts require investigation within the next business day and may indicate a drift toward a critical condition.

| Alert | Condition | Action |
| --- | --- | --- |
| API response time: p95 elevated | p95 latency > 2 seconds across all endpoints, sustained 15 minutes | Investigate slow queries in Application Insights; check Postgres wait stats |
| Approval queue backlog | Announcements with status = 'queued' older than 48 hours | Notify the admin role holder; not a system failure, but a process gap |
| SMS delivery failure rate elevated | > 10% of Twilio send attempts return an error in a 1-hour window | Check Twilio console; review SmsProvider adapter logs; if A2P 10DLC registration is the issue, escalate to platform owner |
| Cost anomaly | Daily Azure cost > 130% of the 7-day rolling average | Review Cost Management breakdown; identify the unexpected resource |
| Storage approaching capacity | Postgres or Blob Storage > 80% of provisioned capacity | Plan a storage tier increase; do not wait for 100% |
| Push token removal rate elevated | > 5 stale push token removals per hour | Indicates many reinstalls or a token rotation issue; review Expo push logs |
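As a worked example of the cost anomaly condition, the sketch below compares one day's cost with 130% of the average of the preceding seven days. Whether the rolling average includes the current day is not stated above; excluding it is an assumption here, and the function name is hypothetical.

```typescript
// True when today's cost exceeds `ratio` times the average of the last 7 prior days.
function isCostAnomaly(todayCost: number, previousDailyCosts: number[], ratio: number = 1.3): boolean {
  const lastSeven = previousDailyCosts.slice(-7); // most recent days are last
  if (lastSeven.length === 0) return false; // no baseline yet
  const average = lastSeven.reduce((sum, cost) => sum + cost, 0) / lastSeven.length;
  return todayCost > average * ratio;
}
```

With a steady baseline of 10 per day the alert fires above 13, so a day costing 14 triggers it and a day costing 12 does not.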

Azure Monitor integration

Log flow

apps/api (Container Apps) stdout  →  Log Analytics workspace
apps/web (Azure SWA)              →  Log Analytics workspace (SWA diagnostic settings)
Azure Container Apps metrics      →  Azure Monitor (automatic)
Azure DB for PostgreSQL metrics   →  Azure Monitor (automatic)
Azure Blob Storage metrics        →  Azure Monitor (automatic)

All resources write to a single Log Analytics workspace provisioned in Bicep. The workspace name and resource ID are defined in infrastructure/environments/<env>/main.bicepparam. Retention is 7 days on the Log Analytics free tier; extend it to 30 days on the paid tier once Nonprofits credits are active. If compliance requires longer retention, configure a diagnostic settings export to the Azure Blob Storage cold tier.

Querying logs

Logs are queried with Kusto Query Language (KQL) in the Log Analytics workspace or from the Application Insights portal blade.

Find all errors in the last hour:

kql
// UNVALIDATED — workspace name and table names must match the provisioned Bicep configuration
ContainerAppConsoleLogs_CL
| where TimeGenerated > ago(1h)
| where Log_s contains '"level":"error"'
| project TimeGenerated, Log_s
| order by TimeGenerated desc

Find auth failures for a specific user:

kql
// UNVALIDATED
AppTraces
| where TimeGenerated > ago(24h)
| where Message contains "Authentication failed"
| extend userId = tostring(parse_json(Message).userId)
| where userId == "<target-user-id>"
| project TimeGenerated, Message

Request rate per endpoint over the last 24 hours:

kql
// UNVALIDATED
AppRequests
| where TimeGenerated > ago(24h)
| summarize count() by Name, bin(TimeGenerated, 5m)
| render timechart

Notification fan-out delivery result by channel:

kql
// UNVALIDATED
AppEvents
| where TimeGenerated > ago(7d)
| where Name == "notification.delivery"
| extend channel = tostring(Properties.channel), result = tostring(Properties.result)
| summarize count() by channel, result

Application Insights SDK setup

The telemetry facade in packages/shared-utils/telemetry wraps the Application Insights Node.js SDK (applicationinsights npm package). The connection string is stored in Key Vault under the secret name appinsights-connection-string and injected into the container at runtime via the SecretsProvider adapter (ADR 0024). Do not hard-code the connection string.

For apps/mobile, use the Sentry React Native SDK or the OpenTelemetry React Native package — not the Application Insights SDK, which has no official React Native support (ADR 0005 risk item).
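The value of the facade is that call sites depend on a small interface rather than on a vendor SDK. The sketch below shows one possible shape. Everything in it is hypothetical (the `Telemetry` interface, the in-memory backend, and `recordDelivery`); the real module in packages/shared-utils/telemetry may look different. An Application Insights, OpenTelemetry, or Sentry backend would implement the same interface.

```typescript
// Hypothetical facade contract that application code depends on.
interface Telemetry {
  trackEvent(name: string, properties?: Record<string, string>): void;
  trackException(error: Error, properties?: Record<string, string>): void;
}

interface RecordedItem {
  kind: "event" | "exception";
  name: string;
  properties: Record<string, string>;
}

// In-memory backend: useful in unit tests and as a stand-in for a vendor SDK here.
class InMemoryTelemetry implements Telemetry {
  readonly items: RecordedItem[] = [];

  trackEvent(name: string, properties: Record<string, string> = {}): void {
    this.items.push({ kind: "event", name, properties });
  }

  trackException(error: Error, properties: Record<string, string> = {}): void {
    // Only the error name is recorded; messages may contain personal data.
    this.items.push({ kind: "exception", name: error.name, properties });
  }
}

// Call site: emits the custom event that the delivery-by-channel query reads.
function recordDelivery(telemetry: Telemetry, channel: string, result: string): void {
  telemetry.trackEvent("notification.delivery", { channel, result });
}
```

Swapping backends then means constructing a different `Telemetry` implementation at startup; `recordDelivery` and other call sites stay unchanged.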


On-call and incident response

Who gets paged

The platform operates with a single on-call rotation during the foundation phase. All critical alerts route to the platform owner (Kristopher Turner — kris@hybridsolutions.cloud) via the Azure Monitor Action Group configured in Bicep.

As the team grows, update the Action Group to route by severity:

  • Critical → pager (SMS + email, immediate).
  • Warning → email digest, next business day.

Escalation path

  1. On-call operator receives the alert. Verifies using the relevant metric/log query above.
  2. If the issue is resolved in under 15 minutes with no data impact, log it in the AuditLog or an ADO task and close.
  3. If unresolved within 15 minutes or if member data may be affected, escalate to the platform owner.
  4. If Postgres is unavailable and a restore is required, follow the point-in-time restore (PITR) procedure documented in the DR runbook (to be authored in Sprint 2026-Q3-S4, AB#3182).
  5. If a third-party provider (Clerk, Twilio, SendGrid) is the root cause, verify on the provider's status page, open a support ticket, and post a maintenance notice in the admin portal.
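The escalation steps above can be read as a small decision function, which is handy for tabletop exercises. This sketch is an interpretation, not policy: the field and action names are hypothetical, and returning all applicable actions (rather than ranking them) is an assumption.

```typescript
interface IncidentState {
  minutesSinceAlert: number;
  resolved: boolean;
  memberDataMayBeAffected: boolean;
  postgresRestoreRequired: boolean;
  thirdPartyRootCause: boolean;
}

type EscalationAction =
  | "record-and-close"
  | "escalate-to-platform-owner"
  | "follow-dr-runbook-pitr"
  | "provider-status-page-ticket-and-notice"
  | "continue-investigating";

// Maps the current incident state to the actions named in steps 2 to 5.
function nextActions(state: IncidentState): EscalationAction[] {
  // Step 2: quick resolution with no data impact.
  if (state.resolved && state.minutesSinceAlert < 15 && !state.memberDataMayBeAffected) {
    return ["record-and-close"];
  }
  const actions: EscalationAction[] = [];
  // Step 3: the 15-minute mark has passed, or member data may be affected.
  if (state.minutesSinceAlert >= 15 || state.memberDataMayBeAffected) {
    actions.push("escalate-to-platform-owner");
  }
  // Step 4: a Postgres restore is required.
  if (state.postgresRestoreRequired) actions.push("follow-dr-runbook-pitr");
  // Step 5: a third-party provider is the root cause.
  if (state.thirdPartyRootCause) actions.push("provider-status-page-ticket-and-notice");
  return actions.length > 0 ? actions : ["continue-investigating"];
}
```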

Incident record

Every critical alert that pages on-call must result in an ADO task (linked under Epic AB#3074 or the relevant feature Epic) recording:

  • Alert triggered, time and date.
  • Root cause.
  • Resolution steps taken.
  • Any follow-up work items created.

This record feeds the retrospective and informs alert threshold tuning.

Key contacts and resources

| Resource | Location |
| --- | --- |
| Log Analytics workspace | Azure portal → Heritage Community Hub resource group |
| Application Insights | Azure portal → Heritage Community Hub resource group |
| Clerk dashboard | dashboard.clerk.com |
| Twilio console | console.twilio.com |
| SendGrid dashboard | app.sendgrid.com |
| Key Vault | kv-hcs-vault-01 |
| Twilio spend alert | Configure in Twilio console billing settings; recommended cap: $30/month |

| ADR | Title | Relevance |
| --- | --- | --- |
| ADR 0003 | Authentication: Clerk (social login) | Layer-4 auth-provider logs; Clerk dashboard |
| ADR 0004 | Cloud/hosting stack, CI/CD, and free-tier path | Azure SWA, Key Vault, GitHub Actions; diagnostic settings |
| ADR 0005 | Observability model | Source of truth for the four-layer model; free-tier retention limits; mobile SDK constraint |
| ADR 0006 | Two-plane RBAC | Defines who has Plane-1 operator access to Azure Monitor and Log Analytics |
| ADR 0007 | Account & Family-Group identity | Child account logging constraints (COPPA); AuditLog session tracking requirement |
| ADR 0013 | Notification transport | Every delivery attempt logged to Application Insights; SMS cost anomaly alert |
| ADR 0024 | Cloud portability & provider abstraction | Container Apps log routing; SecretsProvider adapter for telemetry connection strings; non-Azure fallback |

Heritage Community Hub — Internal. Access restricted via Cloudflare Access + Entra ID.