Observability runbook
Audience: Platform operators (Plane-1 identities — see ADR 0006)
ADO work item: AB#3182 (Operations documentation — Sprint 2026-Q3-S4)
Last updated: 2026-06-18
Overview
The platform implements a four-layer observability model decided in ADR 0005. Each layer covers a distinct concern; together they answer the questions the minister and platform operators need answered: who accessed the platform, when, for how long, what happened, and whether the infrastructure is healthy.
| Layer | What it covers | Tool (Azure path) | Responsibility |
|---|---|---|---|
| 1 — App audit log | Security-relevant events: login, logout, session duration, approval actions, content publish | AuditLog table (Postgres) | Application team |
| 2 — App telemetry | API errors, performance traces, request/response timing, custom events | Azure Application Insights | Application team |
| 3 — Infrastructure monitoring | Container Apps resource health, Postgres availability, storage, cost | Azure Monitor + Log Analytics workspace | Platform operator |
| 4 — Auth-provider logs | Clerk session events, sign-in attempts, suspicious activity | Clerk dashboard | Platform operator |
Infrastructure vs application split. Layers 1 and 2 are implemented by the application (code in apps/api, apps/web, apps/mobile). Layers 3 and 4 are operator-managed — no code, only configuration and dashboard access.
Non-Azure fallback. If the platform relocates away from Azure (see ADR 0024 — portability posture), layers 2 and 3 shift to OpenTelemetry + Grafana Cloud free tier + Sentry for error tracking. Layer 1 (Postgres AuditLog table) and layer 4 (Clerk dashboard) are unaffected by a host change.
Mobile telemetry note (ADR 0005 risk). Azure Application Insights does not have an official React Native SDK. apps/mobile uses either Sentry (React Native SDK) or the OpenTelemetry React Native package for error tracking and custom events, regardless of the stack choice. Layer-3 infrastructure metrics are unaffected.
Structured logging
Log format
All log entries emitted by apps/api must be valid JSON written to stdout. Azure Container Apps (ADR 0024) captures stdout and routes it to the Log Analytics workspace automatically.
Required fields on every log entry:
| Field | Type | Notes |
|---|---|---|
| `timestamp` | ISO 8601 string | UTC, e.g. "2026-06-18T14:30:00.000Z" |
| `level` | string | One of: error, warn, info, debug |
| `service` | string | Emitting service name, e.g. "api", "notifications", "mobile" |
| `traceId` | string | Correlation identifier across a single request chain |
| `message` | string | Human-readable summary |
| `userId` | string or null | Platform Users.id if the request is authenticated; null for unauthenticated requests |
Optional but strongly encouraged fields:
| Field | Type | When to include |
|---|---|---|
| `requestId` | string | Unique ID for a single HTTP request |
| `endpoint` | string | HTTP method + path, e.g. "POST /announcements" |
| `statusCode` | integer | HTTP response status |
| `durationMs` | integer | Response time in milliseconds |
| `errorCode` | string | Application error code for known error conditions |
Example (authenticated API error):
```json
{
  "timestamp": "2026-06-18T14:30:00.000Z",
  "level": "error",
  "service": "api",
  "traceId": "abc123",
  "requestId": "req-789",
  "userId": "usr_abc",
  "endpoint": "POST /announcements",
  "statusCode": 500,
  "durationMs": 120,
  "message": "Unexpected error saving announcement draft",
  "errorCode": "ANN_SAVE_FAILED",
  "stack": "Error: connection refused\n at ..."
}
```
Example (unauthenticated request):
```json
{
  "timestamp": "2026-06-18T14:31:00.000Z",
  "level": "info",
  "service": "api",
  "traceId": "def456",
  "requestId": "req-790",
  "userId": null,
  "endpoint": "POST /auth/session",
  "statusCode": 401,
  "durationMs": 45,
  "message": "Authentication failed — invalid Clerk token"
}
```
The telemetry facade (packages/shared-utils/telemetry) wraps the SDK so that switching from Application Insights to OpenTelemetry is isolated to that module (ADR 0005 consequence).
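The log format above can be captured as a type so that a missing required field becomes a compile-time error. The following is a minimal sketch, not the actual contents of packages/shared-utils/telemetry; the names `LogEntry`, `buildLogEntry`, `toLogLine`, and `emit` are hypothetical.

```typescript
// Sketch only: field names follow the tables above; all function names are hypothetical.
type LogLevel = "error" | "warn" | "info" | "debug";

interface LogEntry {
  timestamp: string; // ISO 8601, UTC
  level: LogLevel;
  service: string;
  traceId: string;
  message: string;
  userId: string | null; // null for unauthenticated requests
  // Optional but strongly encouraged fields
  requestId?: string;
  endpoint?: string;
  statusCode?: number;
  durationMs?: number;
  errorCode?: string;
}

// Callers supply everything except the timestamp, which is stamped here.
type LogInput = Omit<LogEntry, "timestamp">;

function buildLogEntry(input: LogInput, now: Date = new Date()): LogEntry {
  // Date.prototype.toISOString always returns UTC with millisecond precision.
  return { timestamp: now.toISOString(), ...input };
}

// One JSON object per line, with no embedded line breaks.
function toLogLine(entry: LogEntry): string {
  return JSON.stringify(entry);
}

// Writes the entry to stdout, where Container Apps picks it up.
function emit(input: LogInput): void {
  console.log(toLogLine(buildLogEntry(input)));
}
```

Making `userId` a required `string | null` rather than an optional field forces each call site to state explicitly whether the request was authenticated.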
Log levels
| Level | When to use |
|---|---|
| `error` | Unexpected failures that require investigation: unhandled exceptions, database connection failures, provider API errors (Twilio, SendGrid), 5xx responses |
| `warn` | Recoverable conditions or unusual but non-fatal events: retry attempt before success, stale push token removed, soft-limit threshold approached |
| `info` | Normal operational events worth retaining: authentication events, approval workflow state transitions, notification fan-out initiated, container startup/shutdown |
| `debug` | Detailed diagnostic data useful during active development or incident investigation; must be disabled in production by default |
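One way to honor the "debug disabled in production by default" rule is a severity threshold checked before each write. This is a sketch under assumptions: the environment name "production" and the function names are hypothetical, and the real facade may configure levels differently.

```typescript
type LogLevel = "error" | "warn" | "info" | "debug";

// Lower number = more severe. An entry is written when it is at least as
// severe as the configured minimum level.
const SEVERITY: Record<LogLevel, number> = { error: 0, warn: 1, info: 2, debug: 3 };

// Assumption: the deployment exposes an environment name such as "production".
function defaultMinLevel(environment: string): LogLevel {
  return environment === "production" ? "info" : "debug";
}

function shouldLog(level: LogLevel, minLevel: LogLevel): boolean {
  return SEVERITY[level] <= SEVERITY[minLevel];
}
```

During an incident, an operator could temporarily lower the minimum level to `debug` and restore it afterwards.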
What to log
Log the following events explicitly. Do not rely on framework-level logging alone for these.
Request/response — every inbound API request:
- Method, path, response status code, duration in milliseconds, and `traceId`.
- Do not log request or response bodies. Log only metadata.
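The request/response rule can be sketched as a pure function that a piece of HTTP middleware would call once the response has been sent. The function name and parameters are hypothetical. Dropping the query string is an extra precaution assumed here, not something the runbook mandates.

```typescript
interface RequestLogFields {
  endpoint: string; // HTTP method + path, e.g. "POST /announcements"
  statusCode: number;
  durationMs: number;
  traceId: string;
}

// Builds the metadata-only fields for one request. Bodies are never passed
// in, so they cannot leak into the log entry.
function requestLogFields(
  method: string,
  path: string,
  statusCode: number,
  startedAtMs: number,
  finishedAtMs: number,
  traceId: string,
): RequestLogFields {
  // Assumption: query strings may carry personal data, so only the path is kept.
  const queryStart = path.indexOf("?");
  const cleanPath = queryStart === -1 ? path : path.slice(0, queryStart);
  return {
    endpoint: `${method.toUpperCase()} ${cleanPath}`,
    statusCode,
    durationMs: Math.round(finishedAtMs - startedAtMs),
    traceId,
  };
}
```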
Errors — every error-level event:
- The `message`, `errorCode`, and the stack trace.
- The `userId` (if authenticated) and the `endpoint`.
- External provider errors must include the provider's error code or status string (e.g. Twilio error code, SendGrid rejection reason, Postgres error class).
Auth events — these feed both layer-2 telemetry and the layer-1 AuditLog table:
- Successful sign-in (Clerk session validated, `Users` row resolved).
- Failed sign-in (invalid token, user not found, `status = pending_approval`).
- Sign-out or session expiry.
- Role change (admin updates `Users.role`).
Approval workflow transitions (ADR 0023) — every state change in the announcement authoring workflow:
- Draft created, submitted to queue, approved, rejected, published.
- Include the `announcementId`, the actor's `userId` and `role`, and the new state.
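A workflow transition entry might be assembled as below. This is a sketch under assumptions: only the `queued` status value appears in this runbook, so the other state names are placeholders for whatever ADR 0023 defines, and `transitionLog` is a hypothetical helper.

```typescript
// Assumption: state names other than "queued" are illustrative placeholders.
type AnnouncementState = "draft" | "queued" | "approved" | "rejected" | "published";

interface TransitionLog {
  level: "info"; // state transitions are info-level per the log levels table
  message: string;
  announcementId: string;
  userId: string;
  role: string;
  newState: AnnouncementState;
}

function transitionLog(
  announcementId: string,
  actor: { userId: string; role: string },
  newState: AnnouncementState,
): TransitionLog {
  return {
    level: "info",
    message: `Announcement moved to ${newState}`,
    announcementId,
    userId: actor.userId,
    role: actor.role,
    newState,
  };
}
```

The entry carries identifiers only; the announcement title and body stay out of the log.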
Notification fan-out (ADR 0013):
- Fan-out initiated: include `announcementId`, recipient count, channels active.
- Per-channel delivery attempt result: success, provider error code, or hard bounce.
- Stale push token removed.
- SMS opt-out evaluated and suppressed for a recipient.
Child sub-account access (ADR 0007):
- Any access to sections restricted to parent-approved scope should log the child `userId` and the section accessed.
What NOT to log
Never include the following in any log entry at any level, in any environment:
- Passwords, PINs, or password hashes (including Argon2id hashes for child credentials — ADR 0003/0007).
- Clerk session tokens, JWTs, or API keys for any provider.
- Full OAuth tokens (access, refresh, or ID tokens).
- Personal data beyond `userId`: names, email addresses, phone numbers, physical addresses, family member details.
- Full request or response bodies.
- Key Vault secret values.
- Any field that could constitute a COPPA-regulated record for a child account.
If a log entry would otherwise include any of the above, replace the value with a truncated identifier or omit it entirely.
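One way to enforce the "omit it entirely" rule is an allowlist: only fields named in the log format tables (plus `stack`, which appears in the error example) ever reach the logger. This is a sketch, not the project's actual safeguard; `LOGGABLE_KEYS` and `pickLoggable` are hypothetical names.

```typescript
// Keys taken from the required and optional field tables, plus "stack".
const LOGGABLE_KEYS: ReadonlySet<string> = new Set([
  "timestamp", "level", "service", "traceId", "message", "userId",
  "requestId", "endpoint", "statusCode", "durationMs", "errorCode", "stack",
]);

// Returns a shallow copy containing only allowlisted keys. Anything else
// (email, token, request body, and so on) is silently dropped.
function pickLoggable(context: Record<string, unknown>): Record<string, unknown> {
  const safe: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(context)) {
    if (LOGGABLE_KEYS.has(key)) {
      safe[key] = value;
    }
  }
  return safe;
}
```

An allowlist fails closed when a new field is introduced, whereas a denylist would let it through until someone remembers to add it. The filter does not inspect values, so a `message` or `stack` string that embeds personal data still needs review at the call site.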
Metrics
Application Insights and Azure Monitor collect these automatically once the SDK and Container Apps integration are configured. The list below defines what operators and the application team should verify are flowing and being retained.
Key application metrics
These come from layer-2 (Application Insights SDK) instrumentation in apps/api:
| Metric | Source | Why it matters |
|---|---|---|
| Request rate (requests/min per endpoint) | HTTP middleware | Baseline for anomaly detection; unexpected spikes indicate a bug or abuse |
| Error rate (% of 5xx per endpoint) | HTTP middleware | Primary health signal for the API |
| Response time p50/p95/p99 per endpoint | HTTP middleware | SLO reference; slow endpoints indicate DB or provider issues |
| Notification fan-out delivery rate per channel | Custom event | Identifies a degraded channel (e.g. rising SMS failures) before it becomes an incident |
| Approval queue depth (count of `status = 'queued'` announcements) | Periodic query or custom event on state change | Signals a backlog if approvers are inactive |
| Auth failure rate | Custom event on failed token validation | Detects credential stuffing or misconfigured client |
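For readers unfamiliar with the p50/p95/p99 notation in the table, the sketch below computes a percentile from raw `durationMs` samples using the nearest-rank method. It illustrates what the metric means; the monitoring backend computes its own percentiles and may use a different interpolation.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) {
    throw new Error("percentile requires at least one sample");
  }
  if (p <= 0 || p > 100) {
    throw new RangeError("p must be in the range (0, 100]");
  }
  const sorted = [...samples].sort((a, b) => a - b);
  // Multiply before dividing so whole-number percentiles stay exact.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}
```

With ten samples of 10, 20, ..., 100 ms, the p50 is 50 ms while the p95 is 100 ms, which is why a single slow request shows up in the tail percentiles long before it moves the median.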
Key infrastructure metrics
These come from layer-3 (Azure Monitor, automatically collected from Azure resources):
| Metric | Resource | Threshold of concern |
|---|---|---|
| CPU utilization | Azure Container Apps (API container) | Sustained > 80% |
| Memory utilization | Azure Container Apps (API container) | Sustained > 80% |
| Replica count | Azure Container Apps | Unexpected scale-to-zero during active hours |
| Active connections | Azure DB for PostgreSQL Flexible Server | Approaching the instance connection limit |
| Storage utilization | Azure DB for PostgreSQL Flexible Server | > 80% of provisioned storage |
| Connection success rate | Azure DB for PostgreSQL Flexible Server | Drop below 100% indicates network or auth issue |
| Blob storage transactions | Azure Blob Storage | Unexpected spike indicates a runaway process or external scraping |
| Cost (daily/monthly) | Azure Cost Management | > 15% above the expected baseline |
Alerting
Alert rules are defined in Azure Monitor and routed to the on-call channel through Action Groups. Every rule lives in Bicep IaC (infrastructure/), with no click-ops configuration.
Critical alerts
Critical alerts require immediate response (target response time: within 15 minutes during ministry hours, within 1 hour outside them).
| Alert | Condition | Action |
|---|---|---|
| API error rate critical | 5xx error rate > 5% over a 5-minute window | Page on-call operator; check Application Insights failures blade |
| Container App unhealthy | Container Apps health probe failing; replica count drops to 0 unexpectedly | Page on-call operator; check Container Apps logs in Log Analytics |
| Postgres unreachable | Connection success rate < 100% for 2 consecutive minutes | Page on-call operator; check Flexible Server availability metrics |
| Authentication provider unavailable | Clerk API returning 5xx; verify via Clerk status page (status.clerk.com) | Page on-call operator; surface a maintenance notice in the web shell |
| Security anomaly: auth failure spike | Failed auth events > 20 per minute (layer-2 custom metric) | Page on-call operator; review Clerk dashboard (layer 4) for suspicious sessions |
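The "5xx error rate > 5% over a 5-minute window" condition can be checked against sample data before the Bicep rule is tuned. This sketch only illustrates the arithmetic; the real evaluation happens inside Azure Monitor, and the type and function names are hypothetical.

```typescript
interface RequestSample {
  timestampMs: number;
  statusCode: number;
}

const FIVE_MINUTES_MS = 5 * 60 * 1000;

// Percentage of requests inside the window that returned a 5xx status.
// Returns 0 when the window is empty, so an idle API does not alert.
function errorRatePercent(samples: RequestSample[], nowMs: number, windowMs: number): number {
  const recent = samples.filter(
    (s) => s.timestampMs > nowMs - windowMs && s.timestampMs <= nowMs,
  );
  if (recent.length === 0) {
    return 0;
  }
  const failures = recent.filter((s) => s.statusCode >= 500 && s.statusCode <= 599).length;
  return (failures * 100) / recent.length;
}

function isErrorRateCritical(samples: RequestSample[], nowMs: number): boolean {
  return errorRatePercent(samples, nowMs, FIVE_MINUTES_MS) > 5;
}
```

The comparison is strict, matching the table: a rate of exactly 5% does not page.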
Warning alerts
Warning alerts require investigation within the next business day and may indicate a drift toward a critical condition.
| Alert | Condition | Action |
|---|---|---|
| API response time — p95 elevated | p95 latency > 2 seconds across all endpoints, sustained 15 minutes | Investigate slow queries in Application Insights; check Postgres wait stats |
| Approval queue backlog | Announcements with status = 'queued' older than 48 hours | Notify the admin role holder; not a system failure, but a process gap |
| SMS delivery failure rate elevated | > 10% of Twilio send attempts return an error in a 1-hour window | Check Twilio console; review SmsProvider adapter logs; if A2P 10DLC registration is the issue, escalate to platform owner |
| Cost anomaly | Daily Azure cost > 130% of the 7-day rolling average | Review Cost Management breakdown; identify the unexpected resource |
| Storage approaching capacity | Postgres or Blob Storage > 80% of provisioned capacity | Plan a storage tier increase; do not wait for 100% |
| Push token removal rate elevated | > 5 stale push token removals per hour | Indicates many reinstalls or a token rotation issue; review Expo push logs |
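The cost anomaly condition is simple arithmetic and can be checked by hand. The sketch below assumes the 7-day rolling average covers the seven days before the day under test; the table does not say whether the current day is included, so confirm against the actual alert rule.

```typescript
// True when today's cost exceeds 130% of the average of the previous 7 days.
function isCostAnomaly(todayCost: number, previousSevenDays: number[]): boolean {
  if (previousSevenDays.length !== 7) {
    throw new Error("expected exactly 7 daily totals");
  }
  const average = previousSevenDays.reduce((sum, cost) => sum + cost, 0) / 7;
  return todayCost > 1.3 * average;
}
```

For example, if each of the previous seven days cost 10 units, the threshold is 13: a day costing 13.5 triggers the warning and a day costing 12 does not.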
Azure Monitor integration
Log flow
```
apps/api (Container Apps) stdout → Log Analytics workspace
apps/web (Azure SWA) → Log Analytics workspace (SWA diagnostic settings)
Azure Container Apps metrics → Azure Monitor (automatic)
Azure DB for PostgreSQL metrics → Azure Monitor (automatic)
Azure Blob Storage metrics → Azure Monitor (automatic)
```
All resources write to a single Log Analytics workspace provisioned in Bicep. The workspace name and resource ID are defined in infrastructure/environments/<env>/main.bicepparam. Retention depends on the pricing tier: the Log Analytics free tier retains data for 7 days; extend retention to 30 days on the paid tier once Nonprofits credits are active. If compliance requires longer retention, configure a diagnostic settings export to Azure Blob Storage cold tier.
Querying logs
Logs are queried with Kusto Query Language (KQL) in the Log Analytics workspace or from the Application Insights portal blade.
Find all errors in the last hour:
```kql
// UNVALIDATED — workspace name and table names must match the provisioned Bicep configuration
ContainerAppConsoleLogs_CL
| where TimeGenerated > ago(1h)
| where Log_s contains '"level":"error"'
| project TimeGenerated, Log_s
| order by TimeGenerated desc
```
Find auth failures for a specific user:
```kql
// UNVALIDATED
AppTraces
| where TimeGenerated > ago(24h)
| where Message contains "Authentication failed"
| extend userId = tostring(parse_json(Message).userId)
| where userId == "<target-user-id>"
| project TimeGenerated, Message
```
Request rate per endpoint over the last 24 hours:
```kql
// UNVALIDATED
AppRequests
| where TimeGenerated > ago(24h)
| summarize count() by Name, bin(TimeGenerated, 5m)
| render timechart
```
Notification fan-out delivery result by channel:
```kql
// UNVALIDATED
AppEvents
| where TimeGenerated > ago(7d)
| where Name == "notification.delivery"
| extend channel = tostring(Properties.channel), result = tostring(Properties.result)
| summarize count() by channel, result
```
Application Insights SDK setup
The telemetry facade in packages/shared-utils/telemetry wraps the Application Insights Node.js SDK (applicationinsights npm package). The connection string is stored in Key Vault under the secret name appinsights-connection-string and injected into the container at runtime via the SecretsProvider adapter (ADR 0024). Do not hard-code the connection string.
For apps/mobile, use the Sentry React Native SDK or the OpenTelemetry React Native package — not the Application Insights SDK, which has no official React Native support (ADR 0005 risk item).
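The facade idea can be sketched as a small interface that application code depends on, with the vendor SDK hidden behind it. This is an illustration, not the actual contents of packages/shared-utils/telemetry: the names `TelemetryBackend`, `InMemoryBackend`, and `Telemetry` are hypothetical, and the in-memory backend stands in for a real one that would delegate to Application Insights, OpenTelemetry, or Sentry.

```typescript
// The only surface application code sees. Swapping vendors means writing
// a new implementation of this interface, not touching call sites.
interface TelemetryBackend {
  trackEvent(name: string, properties: Record<string, string>): void;
}

// Stand-in backend for tests and local development; it records events in memory.
class InMemoryBackend implements TelemetryBackend {
  readonly events: { name: string; properties: Record<string, string> }[] = [];

  trackEvent(name: string, properties: Record<string, string>): void {
    this.events.push({ name, properties });
  }
}

class Telemetry {
  private readonly backend: TelemetryBackend;

  constructor(backend: TelemetryBackend) {
    this.backend = backend;
  }

  // Event name and property keys match the "notification.delivery" KQL query.
  notificationDelivery(channel: string, result: string): void {
    this.backend.trackEvent("notification.delivery", { channel, result });
  }
}
```

Because apps/api and apps/mobile would each construct `Telemetry` with a different backend, the same call sites work on both, which is how the mobile SDK constraint stays contained.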
On-call and incident response
Who gets paged
The platform operates with a single on-call rotation during the foundation phase. All critical alerts route to the platform owner (Kristopher Turner — kris@hybridsolutions.cloud) via the Azure Monitor Action Group configured in Bicep.
As the team grows, update the Action Group to route by severity:
- Critical → pager (SMS + email, immediate).
- Warning → email digest, next business day.
Escalation path
- The on-call operator receives the alert and verifies it using the relevant metric or log query above.
- If the issue is resolved in under 15 minutes with no data impact, log it in the `AuditLog` or an ADO task and close.
- If unresolved within 15 minutes or if member data may be affected, escalate to the platform owner.
- If Postgres is unavailable and a restore is required, follow the point-in-time restore (PITR) procedure documented in the DR runbook (to be authored in Sprint 2026-Q3-S4, AB#3182).
- If a third-party provider (Clerk, Twilio, SendGrid) is the root cause, verify on the provider's status page, open a support ticket, and post a maintenance notice in the admin portal.
Incident record
Every critical alert that pages on-call must result in an ADO task (linked under Epic AB#3074 or the relevant feature Epic) recording:
- Alert triggered, time and date.
- Root cause.
- Resolution steps taken.
- Any follow-up work items created.
This record feeds the retrospective and informs alert threshold tuning.
Key contacts and resources
| Resource | Location |
|---|---|
| Log Analytics workspace | Azure portal → Heritage Community Hub resource group |
| Application Insights | Azure portal → Heritage Community Hub resource group |
| Clerk dashboard | dashboard.clerk.com |
| Twilio console | console.twilio.com |
| SendGrid dashboard | app.sendgrid.com |
| Key Vault | kv-hcs-vault-01 |
| Twilio spend alert | Configure in Twilio console billing settings; recommended cap: $30/month |
Related ADRs
| ADR | Title | Relevance |
|---|---|---|
| ADR 0003 | Authentication — Clerk (social login) | Layer-4 auth-provider logs; Clerk dashboard |
| ADR 0004 | Cloud/hosting stack, CI/CD, and free-tier path | Azure SWA, Key Vault, GitHub Actions — diagnostic settings |
| ADR 0005 | Observability model | Source of truth for the four-layer model; free-tier retention limits; mobile SDK constraint |
| ADR 0006 | Two-plane RBAC | Defines who has Plane-1 operator access to Azure Monitor and Log Analytics |
| ADR 0007 | Account & Family-Group identity | Child account logging constraints (COPPA); AuditLog session tracking requirement |
| ADR 0013 | Notification transport | Every delivery attempt logged to Application Insights; SMS cost anomaly alert |
| ADR 0024 | Cloud portability & provider abstraction | Container Apps log routing; SecretsProvider adapter for telemetry connection strings; non-Azure fallback |