Observability runbook

Audience: Platform operators (Plane-1 identities — see ADR 0006)
ADO work item: AB#3182 (Operations documentation — Sprint 2026-Q3-S4)
Last updated: 2026-06-18

Overview

The platform implements a four-layer observability model decided in ADR 0005. Each layer covers a distinct concern; together they answer the questions the minister and platform operators need answered: who accessed the platform, when, for how long, what happened, and whether the infrastructure is healthy.

Layer	What it covers	Tool (Azure path)	Responsibility
1 — App audit log	Security-relevant events: login, logout, session duration, approval actions, content publish	`AuditLog` table (Postgres)	Application team
2 — App telemetry	API errors, performance traces, request/response timing, custom events	Azure Application Insights	Application team
3 — Infrastructure monitoring	Container Apps resource health, Postgres availability, storage, cost	Azure Monitor + Log Analytics workspace	Platform operator
4 — Auth-provider logs	Clerk session events, sign-in attempts, suspicious activity	Clerk dashboard	Platform operator

Infrastructure vs application split. Layers 1 and 2 are implemented by the application (code in apps/api, apps/web, apps/mobile). Layers 3 and 4 are operator-managed — no code, only configuration and dashboard access.

Non-Azure fallback. If the platform relocates away from Azure (see ADR 0024 — portability posture), layers 2 and 3 shift to OpenTelemetry + Grafana Cloud free tier + Sentry for error tracking. Layer 1 (Postgres AuditLog table) and layer 4 (Clerk dashboard) are unaffected by a host change.

Mobile telemetry note (ADR 0005 risk). Azure Application Insights does not have an official React Native SDK. apps/mobile therefore uses either Sentry (React Native SDK) or the OpenTelemetry React Native package for error tracking and custom events, whether the platform runs on the Azure path or the non-Azure fallback. Layer-3 infrastructure metrics are unaffected.

Structured logging

Log format

All log entries emitted by apps/api must be valid JSON written to stdout. Azure Container Apps (ADR 0024) captures stdout and routes it to the Log Analytics workspace automatically.

Required fields on every log entry:

Field	Type	Notes
`timestamp`	ISO 8601 string	UTC, e.g. `"2026-06-18T14:30:00.000Z"`
`level`	string	One of: `error`, `warn`, `info`, `debug`
`service`	string	Emitting service name, e.g. `"api"`, `"notifications"`, `"mobile"`
`traceId`	string	Correlation identifier across a single request chain
`message`	string	Human-readable summary
`userId`	string or null	Platform `Users.id` if the request is authenticated; `null` for unauthenticated requests

Optional but strongly encouraged fields:

Field	Type	When to include
`requestId`	string	Unique ID for a single HTTP request
`endpoint`	string	HTTP method + path, e.g. `"POST /announcements"`
`statusCode`	integer	HTTP response status
`durationMs`	integer	Response time in milliseconds
`errorCode`	string	Application error code for known error conditions

Example (authenticated API error):

```json
{
  "timestamp": "2026-06-18T14:30:00.000Z",
  "level": "error",
  "service": "api",
  "traceId": "abc123",
  "requestId": "req-789",
  "userId": "usr_abc",
  "endpoint": "POST /announcements",
  "statusCode": 500,
  "durationMs": 120,
  "message": "Unexpected error saving announcement draft",
  "errorCode": "ANN_SAVE_FAILED",
  "stack": "Error: connection refused\n    at ..."
}
```

Example (unauthenticated request):

```json
{
  "timestamp": "2026-06-18T14:31:00.000Z",
  "level": "info",
  "service": "api",
  "traceId": "def456",
  "requestId": "req-790",
  "userId": null,
  "endpoint": "POST /auth/session",
  "statusCode": 401,
  "durationMs": 45,
  "message": "Authentication failed: invalid Clerk token"
}
```

The telemetry facade (packages/shared-utils/telemetry) wraps the SDK so that switching from Application Insights to OpenTelemetry is isolated to that module (ADR 0005 consequence).
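
The field tables above can be expressed as a type so that a missing required field fails at compile time. The following is a minimal sketch, not the facade's actual API: the names `LogEntry`, `LogInput`, `formatLogEntry`, and `writeLog` are illustrative assumptions.

```typescript
// Sketch only: these names are hypothetical, not taken from packages/shared-utils/telemetry.
type LogLevel = "error" | "warn" | "info" | "debug";

interface LogEntry {
  // Required on every entry
  timestamp: string; // ISO 8601, UTC
  level: LogLevel;
  service: string;
  traceId: string;
  message: string;
  userId: string | null; // null for unauthenticated requests
  // Optional but strongly encouraged
  requestId?: string;
  endpoint?: string;
  statusCode?: number;
  durationMs?: number;
  errorCode?: string;
}

// Callers supply everything except the timestamp.
type LogInput = Omit<LogEntry, "timestamp">;

// Serializes one entry as a single-line JSON object.
function formatLogEntry(input: LogInput, now: Date = new Date()): string {
  const entry: LogEntry = { timestamp: now.toISOString(), ...input };
  return JSON.stringify(entry);
}

// console.log writes to stdout, which Container Apps captures.
function writeLog(input: LogInput): void {
  console.log(formatLogEntry(input));
}
```

Passing the clock as a parameter keeps `formatLogEntry` deterministic in tests; production callers would use `writeLog` and accept the default.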

Log levels

Level	When to use
`error`	Unexpected failures that require investigation: unhandled exceptions, database connection failures, provider API errors (Twilio, SendGrid), 5xx responses
`warn`	Recoverable conditions or unusual but non-fatal events: retry attempt before success, stale push token removed, soft-limit threshold approached
`info`	Normal operational events worth retaining: authentication events, approval workflow state transitions, notification fan-out initiated, container startup/shutdown
`debug`	Detailed diagnostic data useful during active development or incident investigation; must be disabled in production by default
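
One way to enforce "debug disabled in production by default" is a severity threshold checked before each write. This sketch assumes a hypothetical environment string and an optional override for incident investigation; neither is defined in the source.

```typescript
type Level = "error" | "warn" | "info" | "debug";

// Lower rank means more severe.
const LEVEL_RANK: Record<Level, number> = { error: 0, warn: 1, info: 2, debug: 3 };

// Assumption: production defaults to "info"; every other environment defaults to "debug".
// An explicit override wins, for example during an active incident.
function resolveMinLevel(environment: string, override?: Level): Level {
  if (override !== undefined) return override;
  return environment === "production" ? "info" : "debug";
}

// True when an entry at `level` should be emitted under the given threshold.
function shouldLog(level: Level, minLevel: Level): boolean {
  return LEVEL_RANK[level] <= LEVEL_RANK[minLevel];
}
```

With these defaults, `shouldLog("debug", resolveMinLevel("production"))` is false, while `error`, `warn`, and `info` entries still pass.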

What to log

Log the following events explicitly. Do not rely on framework-level logging alone for these.

Request/response — every inbound API request:

Method, path, response status code, duration in milliseconds, and traceId.
Do not log request or response bodies. Log only metadata.
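
The metadata-only rule can be sketched as a function that reads the request but copies only the allowed fields. The shapes `InboundRequest` and `RequestLog` are assumptions for illustration, and the level mapping (5xx as `error`, everything else as `info`) follows the log-level table and the examples above.

```typescript
// Hypothetical request shape; a real middleware would adapt its framework's types.
interface InboundRequest {
  method: string;
  path: string;
  traceId: string;
  userId: string | null;
  body?: unknown; // present on the request, never copied into the log
}

interface RequestLog {
  timestamp: string;
  level: "error" | "info";
  service: string;
  traceId: string;
  userId: string | null;
  endpoint: string;
  statusCode: number;
  durationMs: number;
  message: string;
}

// Builds the per-request entry from metadata only.
function buildRequestLog(
  req: InboundRequest,
  statusCode: number,
  startedAtMs: number,
  finishedAtMs: number
): RequestLog {
  const endpoint = `${req.method} ${req.path}`;
  return {
    timestamp: new Date(finishedAtMs).toISOString(),
    level: statusCode >= 500 ? "error" : "info",
    service: "api",
    traceId: req.traceId,
    userId: req.userId,
    endpoint,
    statusCode,
    durationMs: Math.round(finishedAtMs - startedAtMs),
    message: `${endpoint} completed with ${statusCode}`,
  };
}
```

Because the function constructs a new object rather than spreading the request, a body or header added to the request later cannot leak into the log by accident.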

Errors — every error-level event:

The message, errorCode, and the stack trace.
The userId (if authenticated) and the endpoint.
External provider errors must include the provider's error code or status string (e.g. Twilio error code, SendGrid rejection reason, Postgres error class).

Auth events — these feed both layer-2 telemetry and the layer-1 AuditLog table:

Successful sign-in (Clerk session validated, Users row resolved).
Failed sign-in (invalid token, user not found, status = pending_approval).
Sign-out or session expiry.
Role change (admin updates Users.role).

Approval workflow transitions (ADR 0023) — every state change in the announcement authoring workflow:

Draft created, submitted to queue, approved, rejected, published.
Include the announcementId, the actor's userId and role, and the new state.

Notification fan-out (ADR 0013):

Fan-out initiated: include announcementId, recipient count, channels active.
Per-channel delivery attempt result: success, provider error code, or hard bounce.
Stale push token removed.
SMS opt-out evaluated and suppressed for a recipient.
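
A per-channel delivery result might be recorded as a custom event like the sketch below. The event name `notification.delivery` and the `channel` and `result` properties match the KQL query later in this runbook; the channel names, result values, and the `deliveryEvent` helper are assumptions.

```typescript
// Assumed channel and result vocabularies; adjust to the real adapters.
type Channel = "push" | "sms" | "email";
type DeliveryResult = "success" | "provider_error" | "hard_bounce";

interface DeliveryEvent {
  name: "notification.delivery";
  properties: {
    announcementId: string;
    channel: Channel;
    result: DeliveryResult;
    providerErrorCode?: string;
  };
}

// Builds the event payload. Recipient phone numbers and email addresses are
// deliberately absent: they are personal data and must not be logged.
function deliveryEvent(
  announcementId: string,
  channel: Channel,
  result: DeliveryResult,
  providerErrorCode?: string
): DeliveryEvent {
  if (result === "provider_error" && !providerErrorCode) {
    throw new Error("provider_error results must include the provider's error code");
  }
  const properties: DeliveryEvent["properties"] = { announcementId, channel, result };
  if (providerErrorCode !== undefined) {
    properties.providerErrorCode = providerErrorCode;
  }
  return { name: "notification.delivery", properties };
}
```

Rejecting a `provider_error` without a code enforces the rule that external provider errors carry the provider's own error code or status string.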

Child sub-account access (ADR 0007):

For any access to a section restricted to parent-approved scope, log the child's userId and the section accessed.

What NOT to log

Never include the following in any log entry at any level, in any environment:

Passwords, PINs, or password hashes (including Argon2id hashes for child credentials — ADR 0003/0007).
Clerk session tokens, JWTs, or API keys for any provider.
Full OAuth tokens (access, refresh, or ID tokens).
Personal data beyond userId: names, email addresses, phone numbers, physical addresses, family member details.
Full request or response bodies.
Key Vault secret values.
Any field that could constitute a COPPA-regulated record for a child account.

If a log entry would otherwise need one of these values, replace the value with a truncated identifier or omit it entirely.
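
As a safety net behind code review, a logger could drop forbidden keys from any context object before serializing it. The key list and the `redactContext` helper below are illustrative assumptions, not an exhaustive or official denylist.

```typescript
// Hypothetical denylist, compared after lowercasing and stripping punctuation.
const FORBIDDEN_KEYS = new Set<string>([
  "password", "pin", "passwordhash", "token", "sessiontoken", "accesstoken",
  "refreshtoken", "idtoken", "jwt", "apikey", "authorization", "secret",
  "firstname", "lastname", "fullname", "email", "phone", "address",
  "body", "requestbody", "responsebody",
]);

function normalizeKey(key: string): string {
  return key.toLowerCase().replace(/[^a-z0-9]/g, "");
}

// Returns a copy of plain JSON-like data with forbidden keys removed at every depth.
function redactContext(value: unknown): unknown {
  if (Array.isArray(value)) {
    return value.map(redactContext);
  }
  if (value !== null && typeof value === "object") {
    const source = value as Record<string, unknown>;
    const cleaned: Record<string, unknown> = {};
    for (const key of Object.keys(source)) {
      if (FORBIDDEN_KEYS.has(normalizeKey(key))) continue; // omit entirely
      cleaned[key] = redactContext(source[key]);
    }
    return cleaned;
  }
  return value;
}
```

Key-name matching cannot catch personal data embedded in a free-text `message`, so it complements the rules above rather than replacing them.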

Metrics

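Application Insights and Azure Monitor collect the standard request and resource metrics automatically once the SDK and Container Apps integration are configured; metrics sourced from custom events must be emitted explicitly by application code. The lists below define what operators and the application team should verify is flowing and being retained.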

Key application metrics

These come from layer-2 (Application Insights SDK) instrumentation in apps/api:

Metric	Source	Why it matters
Request rate (requests/min per endpoint)	HTTP middleware	Baseline for anomaly detection; unexpected spikes indicate a bug or abuse
Error rate (% of 5xx per endpoint)	HTTP middleware	Primary health signal for the API
Response time p50/p95/p99 per endpoint	HTTP middleware	SLO reference; slow endpoints indicate DB or provider issues
Notification fan-out delivery rate per channel	Custom event	Identifies a degraded channel (e.g. rising SMS failures) before it becomes an incident
Approval queue depth (count of `status = 'queued'` announcements)	Periodic query or custom event on state change	Signals a backlog if approvers are inactive
Auth failure rate	Custom event on failed token validation	Detects credential stuffing or misconfigured client
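
To make the error-rate and percentile rows concrete, the sketch below computes both from a list of request samples. It uses the nearest-rank percentile definition for clarity; Application Insights computes its own aggregates, so this illustrates what the numbers mean rather than how the service derives them. The names are hypothetical.

```typescript
interface RequestSample {
  statusCode: number;
  durationMs: number;
}

// Share of requests that returned a 5xx status, as a percentage.
function errorRatePercent(samples: RequestSample[]): number {
  if (samples.length === 0) return 0;
  const failures = samples.filter((s) => s.statusCode >= 500).length;
  return (failures * 100) / samples.length;
}

// Nearest-rank percentile: the smallest value with at least p% of values at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("percentile needs at least one value");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1] as number;
}
```

For 20 requests with durations 10, 20, ..., 200 ms, the p95 under this definition is the 19th sorted value, 190 ms, and the p50 is the 10th, 100 ms.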

Key infrastructure metrics

These come from layer-3 (Azure Monitor, automatically collected from Azure resources):

Metric	Resource	Threshold of concern
CPU utilization	Azure Container Apps (API container)	Sustained > 80%
Memory utilization	Azure Container Apps (API container)	Sustained > 80%
Replica count	Azure Container Apps	Unexpected scale-to-zero during active hours
Active connections	Azure DB for PostgreSQL Flexible Server	Approaching the instance connection limit
Storage utilization	Azure DB for PostgreSQL Flexible Server	> 80% of provisioned storage
Connection success rate	Azure DB for PostgreSQL Flexible Server	Drop below 100% indicates network or auth issue
Blob storage transactions	Azure Blob Storage	Unexpected spike indicates a runaway process or external scraping
Cost (daily/monthly)	Azure Cost Management	> 15% above the expected baseline

Alerting

Alerts are configured in Azure Monitor Action Groups and routed to the on-call channel. All alert rules are defined in Bicep IaC (infrastructure/) — no click-ops configuration.

Critical alerts

Critical alerts require immediate response (target response time: within 15 minutes during ministry hours, within 1 hour outside them).

Alert	Condition	Action
API error rate critical	5xx error rate > 5% over a 5-minute window	Page on-call operator; check Application Insights failures blade
Container App unhealthy	Container Apps health probe failing; replica count drops to 0 unexpectedly	Page on-call operator; check Container Apps logs in Log Analytics
Postgres unreachable	Connection success rate < 100% for 2 consecutive minutes	Page on-call operator; check Flexible Server availability metrics
Authentication provider unavailable	Clerk API returning 5xx; verify via Clerk status page (status.clerk.com)	Page on-call operator; surface a maintenance notice in the web shell
Security anomaly: auth failure spike	Failed auth events > 20 per minute (layer-2 custom metric)	Page on-call operator; review Clerk dashboard (layer 4) for suspicious sessions
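
The real alert rules live in Bicep and are evaluated by Azure Monitor. The sketch below only illustrates the threshold semantics of two rows in the table, the 5% error rate over 5 minutes and the 20 auth failures per minute, so that the strict "greater than" comparison is unambiguous. All names are hypothetical.

```typescript
interface TimedEvent {
  atMs: number; // event time, epoch milliseconds
  failed: boolean; // true for a 5xx response
}

const FIVE_MINUTES_MS = 5 * 60 * 1000;
const ONE_MINUTE_MS = 60 * 1000;

// Failure percentage among events inside the window ending at nowMs.
function failureRateInWindow(events: TimedEvent[], nowMs: number, windowMs: number): number {
  const recent = events.filter((e) => e.atMs > nowMs - windowMs && e.atMs <= nowMs);
  if (recent.length === 0) return 0;
  const failed = recent.filter((e) => e.failed).length;
  return (failed * 100) / recent.length;
}

// "5xx error rate > 5% over a 5-minute window": exactly 5% does not fire.
function apiErrorRateCritical(events: TimedEvent[], nowMs: number): boolean {
  return failureRateInWindow(events, nowMs, FIVE_MINUTES_MS) > 5;
}

// "Failed auth events > 20 per minute": exactly 20 does not fire.
function authFailureSpike(failureTimesMs: number[], nowMs: number): boolean {
  const lastMinute = failureTimesMs.filter((t) => t > nowMs - ONE_MINUTE_MS && t <= nowMs);
  return lastMinute.length > 20;
}
```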

Warning alerts

Warning alerts require investigation within the next business day and may indicate a drift toward a critical condition.

Alert	Condition	Action
API response time — p95 elevated	p95 latency > 2 seconds across all endpoints, sustained 15 minutes	Investigate slow queries in Application Insights; check Postgres wait stats
Approval queue backlog	Announcements with `status = 'queued'` older than 48 hours	Notify the `admin` role holder; not a system failure, but a process gap
SMS delivery failure rate elevated	> 10% of Twilio send attempts return an error in a 1-hour window	Check Twilio console; review `SmsProvider` adapter logs; if A2P 10DLC registration is the issue, escalate to platform owner
Cost anomaly	Daily Azure cost > 130% of the 7-day rolling average	Review Cost Management breakdown; identify the unexpected resource
Storage approaching capacity	Postgres or Blob Storage > 80% of provisioned capacity	Plan a storage tier increase; do not wait for 100%
Push token removal rate elevated	> 5 stale push token removals per hour	Indicates many reinstalls or a token rotation issue; review Expo push logs
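
Two of the warning conditions reduce to simple arithmetic, sketched below. The source does not say whether the 7-day rolling average includes the current day; this sketch assumes it covers the seven days before it. The function names are hypothetical, and the production rules remain the Bicep-defined alerts.

```typescript
// "Daily Azure cost > 130% of the 7-day rolling average".
// Assumption: the average covers the seven days preceding today.
function isCostAnomaly(previousSevenDays: number[], today: number): boolean {
  if (previousSevenDays.length !== 7) {
    throw new Error("expected exactly 7 daily totals");
  }
  const total = previousSevenDays.reduce((sum, cost) => sum + cost, 0);
  const average = total / 7;
  // Compare in percent units to avoid multiplying by the inexact constant 1.3.
  return today * 100 > average * 130;
}

const FORTY_EIGHT_HOURS_MS = 48 * 60 * 60 * 1000;

// Counts announcements that have been queued for longer than 48 hours.
function overdueApprovals(queuedAtMs: number[], nowMs: number): number {
  return queuedAtMs.filter((t) => nowMs - t > FORTY_EIGHT_HOURS_MS).length;
}
```

With a steady baseline of 10 per day, a day costing 13 sits exactly at 130% and does not fire, while 13.5 does.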

Azure Monitor integration

Log flow

apps/api (Container Apps) stdout  →  Log Analytics workspace
apps/web (Azure SWA)              →  Log Analytics workspace (SWA diagnostic settings)
Azure Container Apps metrics      →  Azure Monitor (automatic)
Azure DB for PostgreSQL metrics   →  Azure Monitor (automatic)
Azure Blob Storage metrics        →  Azure Monitor (automatic)

All resources write to a single Log Analytics workspace provisioned in Bicep. The workspace name and resource ID are defined in infrastructure/environments/<env>/main.bicepparam. Retention is 7 days on the Log Analytics free tier; extend it to 30 days on the paid tier once Nonprofits credits are active. If compliance requires longer retention, configure a diagnostic settings export to Azure Blob Storage cold tier.

Querying logs

Logs are queried with Kusto Query Language (KQL) in the Log Analytics workspace or from the Application Insights portal blade.

Find all errors in the last hour:

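```kql
// UNVALIDATED: workspace name and table names must match the provisioned Bicep configuration
ContainerAppConsoleLogs_CL
| where TimeGenerated > ago(1h)
// Parse the JSON line instead of substring matching, which breaks if the serializer adds whitespace
| extend entry = parse_json(Log_s)
| where tostring(entry.level) == "error"
| project TimeGenerated, Log_s
| order by TimeGenerated desc
```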

Find auth failures for a specific user:

```kql
// UNVALIDATED
AppTraces
| where TimeGenerated > ago(24h)
| where Message contains "Authentication failed"
| extend userId = tostring(parse_json(Message).userId)
| where userId == "<target-user-id>"
| project TimeGenerated, Message
```

Request rate per endpoint over the last 24 hours:

```kql
// UNVALIDATED
AppRequests
| where TimeGenerated > ago(24h)
| summarize count() by Name, bin(TimeGenerated, 5m)
| render timechart
```

Notification fan-out delivery result by channel:

```kql
// UNVALIDATED
AppEvents
| where TimeGenerated > ago(7d)
| where Name == "notification.delivery"
| extend channel = tostring(Properties.channel), result = tostring(Properties.result)
| summarize count() by channel, result
```

Application Insights SDK setup

The telemetry facade in packages/shared-utils/telemetry wraps the Application Insights Node.js SDK (applicationinsights npm package). The connection string is stored in Key Vault under the secret name appinsights-connection-string and injected into the container at runtime via the SecretsProvider adapter (ADR 0024). Do not hard-code the connection string.
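
A facade of this kind could look like the sketch below, which keeps the SDK behind a small interface so the swap to OpenTelemetry touches one adapter. Every name here (`TelemetrySink`, `Telemetry`, `MemorySink`, `requireConnectionString`) is a hypothetical illustration, and the secrets map stands in for whatever the SecretsProvider adapter actually returns. A production sink would forward these calls to the `applicationinsights` client; that adapter is omitted so the example runs without the package.

```typescript
// The only surface application code would depend on.
interface TelemetrySink {
  trackEvent(eventName: string, properties: Record<string, string>): void;
  trackException(error: Error, properties: Record<string, string>): void;
}

class Telemetry {
  private readonly sink: TelemetrySink;
  private readonly service: string;

  constructor(sink: TelemetrySink, service: string) {
    this.sink = sink;
    this.service = service;
  }

  // Every event is stamped with the emitting service name.
  event(eventName: string, properties: Record<string, string> = {}): void {
    this.sink.trackEvent(eventName, { ...properties, service: this.service });
  }

  exception(error: Error, properties: Record<string, string> = {}): void {
    this.sink.trackException(error, { ...properties, service: this.service });
  }
}

// In-memory sink for tests and local runs.
class MemorySink implements TelemetrySink {
  readonly events: Array<{ name: string; properties: Record<string, string> }> = [];
  readonly exceptions: Array<{ message: string; properties: Record<string, string> }> = [];

  trackEvent(eventName: string, properties: Record<string, string>): void {
    this.events.push({ name: eventName, properties });
  }

  trackException(error: Error, properties: Record<string, string>): void {
    this.exceptions.push({ message: error.message, properties });
  }
}

// Fails fast at startup if the secret was not injected; never echoes the value.
function requireConnectionString(secrets: Record<string, string | undefined>): string {
  const value = secrets["appinsights-connection-string"];
  if (value === undefined || value === "") {
    throw new Error("Secret appinsights-connection-string is missing");
  }
  return value;
}
```

Injecting the sink also gives unit tests a way to assert on emitted telemetry without network access.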

For apps/mobile, use the Sentry React Native SDK or the OpenTelemetry React Native package — not the Application Insights SDK, which has no official React Native support (ADR 0005 risk item).

On-call and incident response

Who gets paged

The platform operates with a single on-call rotation during the foundation phase. All critical alerts route to the platform owner (Kristopher Turner — kris@hybridsolutions.cloud) via the Azure Monitor Action Group configured in Bicep.

As the team grows, update the Action Group to route by severity:

Critical → pager (SMS + email, immediate).
Warning → email digest, next business day.

Escalation path

The on-call operator receives the alert and verifies it using the relevant metric or log query above.
If the issue is resolved in under 15 minutes with no data impact, log it in the AuditLog or an ADO task and close.
If unresolved within 15 minutes or if member data may be affected, escalate to the platform owner.
If Postgres is unavailable and a restore is required, follow the point-in-time restore (PITR) procedure documented in the DR runbook (to be authored in Sprint 2026-Q3-S4, AB#3182).
If a third-party provider (Clerk, Twilio, SendGrid) is the root cause, verify on the provider's status page, open a support ticket, and post a maintenance notice in the admin portal.

Incident record

Every critical alert that pages on-call must result in an ADO task (linked under Epic AB#3074 or the relevant feature Epic) recording:

Alert triggered, time and date.
Root cause.
Resolution steps taken.
Any follow-up work items created.

This record feeds the retrospective and informs alert threshold tuning.

Key contacts and resources

Resource	Location
Log Analytics workspace	Azure portal → Heritage Community Hub resource group
Application Insights	Azure portal → Heritage Community Hub resource group
Clerk dashboard	dashboard.clerk.com
Twilio console	console.twilio.com
SendGrid dashboard	app.sendgrid.com
Key Vault	`kv-hcs-vault-01`
Twilio spend alert	Configure in Twilio console billing settings; recommended cap: $30/month

ADR	Title	Relevance
ADR 0003	Authentication — Clerk (social login)	Layer-4 auth-provider logs; Clerk dashboard
ADR 0004	Cloud/hosting stack, CI/CD, and free-tier path	Azure SWA, Key Vault, GitHub Actions — diagnostic settings
ADR 0005	Observability model	Source of truth for the four-layer model; free-tier retention limits; mobile SDK constraint
ADR 0006	Two-plane RBAC	Defines who has Plane-1 operator access to Azure Monitor and Log Analytics
ADR 0007	Account & Family-Group identity	Child account logging constraints (COPPA); `AuditLog` session tracking requirement
ADR 0013	Notification transport	Every delivery attempt logged to Application Insights; SMS cost anomaly alert
ADR 0024	Cloud portability & provider abstraction	Container Apps log routing; `SecretsProvider` adapter for telemetry connection strings; non-Azure fallback