0045 — Comprehensive observability and alerting

Status: Accepted

Date: 2026-06-27

ADO work item: AB#TBD

Deciders: Kristopher Turner (platform owner)

Amends: ADR 0005 (observability model) — extends it to cover all platform surfaces including SWA, Clerk, YouTube Live, Cloudflare DNS, Key Vault, and cost. ADR 0005 remains the canonical source for the four-layer model definition; this ADR operationalises each layer with concrete implementation requirements.

Context

ADR 0005 defined a four-layer observability model and was accepted when the platform was in early planning. The model is sound but the implementation coverage is incomplete:

Azure Static Web App (SWA) — no diagnostic settings, no CDN error visibility. The listen.heritageva.app blank-screen incident lasted multiple sessions because there were no logs showing a wrong SWA was serving content.
Clerk — no webhook integration, no alerting on auth failures, domain misconfigurations, or sign-in anomalies landing in Azure.
YouTube Live — no monitoring of stream health, viewer counts, or ingest errors when the platform uses YouTube as the live streaming provider (ADR 0010).
Cloudflare DNS — no alerting on DNS resolution failures or custom domain issues.
Azure Key Vault — KV secret-access failures are silent; the appinsights-connection-string secret was missing and telemetry was silently off for an unknown period.
Azure App Insights connection string — the Container App referenced a KV secret that did not exist, leaving Layer 2 API telemetry disabled in production with no alert.
Cost monitoring — cost alerts defined in the runbook but no Bicep alert rule deployed.
AuditLog completeness — Layer 1 is implemented but no alert fires when audit writes fail or fall silent.

This ADR defines what must be monitored, how, and what fires when something goes wrong across every surface of the Heritage Community Hub platform.

Decision

We will implement complete observability across all seven platform surfaces: (1) Azure infrastructure, (2) Azure Static Web App, (3) Azure Container Apps API, (4) Azure Database for PostgreSQL, (5) Clerk, (6) YouTube Live, and (7) Cloudflare DNS. Every surface has a defined collection path, a destination, and an alert that fires on failure. Nothing is silent.

Surface-by-surface specification

1. Azure infrastructure (cross-cutting)

Resource: Log Analytics workspace log-hch-prod-eus, App Insights appi-heritageva-prod, Action Group ag-hch-prod-eus.

Collection:

All Azure resources emit diagnostic logs to the Log Analytics workspace via Bicep diagnosticSettings.
Metrics flow to Azure Monitor automatically from all ARM resources.
Single Log Analytics workspace — all surfaces write to the same workspace so cross-surface KQL queries are possible.

Key implementation gap to close: Ensure every resource has an explicit Microsoft.Insights/diagnosticSettings Bicep block pointing at the workspace. Do not rely on Azure defaults.

Alerts (existing — verify deployed):

API error rate > 5% over 5 min → Severity 1.
API p99 response time > 2 s → Severity 2.

New alerts to add:

Monthly cost anomaly: Azure Cost Management budget alert at 130% of rolling 7-day baseline → email.
Key Vault secret access failure: any failed SecretGet operation (for example HTTP 403 Forbidden or 404 SecretNotFound) in KV diagnostic logs → Severity 1 (indicates a broken secret reference, as seen with appinsights-connection-string).
Key Vault secret missing on startup: if Container App emits [telemetry] App Insights DISABLED in startup logs → Severity 1.

2. Azure Static Web App (SWA)

Resource: swa-heritageva-prod-eus2 in rg-heritageva-prod-eus. Serves both heritageva.app and (after fix) listen.heritageva.app.

What can fail:

CDN serving stale content from wrong SWA (root cause of the listen blank-screen incident).
Custom domain binding missing or in Failed state.
Build/deployment failure leaving old assets live.
JavaScript 404 (asset hash mismatch after deployment).
HTTP 4xx/5xx from the SWA CDN edge.

Collection:

Enable SWA diagnostic settings → Log Analytics: AppServiceHTTPLogs, AppServiceConsoleLogs, FunctionAppLogs.
Bicep Microsoft.Insights/diagnosticSettings on the SWA resource.

Alerts:

Custom domain status not Ready: query az staticwebapp hostname list in a Logic App or Azure Function on a daily schedule; alert if any domain is not in Ready state → Severity 1.
HTTP 5xx rate from SWA CDN: if available in SWA HTTP logs, alert on > 1% 5xx over 15 min → Severity 2.
Deployment failure: GitHub Actions Deploy workflow failure notifies via existing GitHub email; additionally, the workflow should write a structured log entry to Log Analytics on failure through the Azure Monitor Logs Ingestion API (Bicep-defined DCE/DCR endpoint).

Manual checks (until automated):

After every push: verify curl -sI https://heritageva.app | grep -i last-modified matches the expected build date.
After every push: verify curl -sI https://listen.heritageva.app | grep -i last-modified matches the expected build date.
Custom domain health check: az staticwebapp hostname list --name swa-heritageva-prod-eus2 --resource-group rg-heritageva-prod-eus.
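
The freshness comparison behind these manual checks could be automated along the following lines. This is a minimal sketch, not the workflow's actual implementation: `isAssetStale` and `checkFreshness` are hypothetical names, the one-hour default mirrors the threshold in the implementation checklist, and the network probe assumes a runtime with a global `fetch` (Node 18 or later).

```typescript
// Returns true when the Last-Modified value is missing, unparseable,
// or older than maxAgeMs relative to nowMs.
function isAssetStale(
  lastModified: string | null,
  nowMs: number,
  maxAgeMs: number = 60 * 60 * 1000, // one hour
): boolean {
  if (!lastModified) return true;
  const modifiedMs = Date.parse(lastModified);
  if (Number.isNaN(modifiedMs)) return true;
  return nowMs - modifiedMs > maxAgeMs;
}

// Hypothetical post-deploy probe: HEAD the site and evaluate the header.
// Header lookup through the Headers API is case-insensitive.
async function checkFreshness(url: string): Promise<boolean> {
  const res = await fetch(url, { method: "HEAD" });
  return !isAssetStale(res.headers.get("last-modified"), Date.now());
}
```

Treating a missing or malformed header as stale is a deliberate choice here: a check that passes when it cannot read the header would repeat the silent-failure pattern this ADR is trying to remove.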

3. Azure Container Apps — API

Resource: ca-heritageva-api-prod-eus in rg-heritageva-prod-eus.

What can fail:

Container fails to start (Prisma migrate error, missing env var, schema mismatch).
Container scales to zero unexpectedly.
API returns 5xx (unhandled exception, DB connection lost).
App Insights telemetry silently disabled (missing KV secret — confirmed gap).
Migration failure on startup (known historical incident: P3009 failed migration blocking deploy).

Collection:

Container stdout → Log Analytics (wired via logAnalyticsConfiguration in containerapp.bicep — verified present).
App Insights SDK (applicationinsights npm package) via apps/api/src/lib/telemetry.ts — requires APPLICATIONINSIGHTS_CONNECTION_STRING env var pulled from KV secret appinsights-connection-string.
Gap to close: Create the KV secret appinsights-connection-string with the value from az monitor app-insights component show --app appi-heritageva-prod --resource-group rg-heritageva-prod-eus --query connectionString.
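
The silent-telemetry gap comes down to an unset environment variable going unnoticed at startup. A minimal sketch of a startup guard follows, under the assumption that telemetry.ts is free to adopt something like it. `resolveTelemetryStatus` and `TelemetryStatus` are hypothetical names; the environment variable and the `[telemetry]` log markers are the ones this ADR already references. The check for an `InstrumentationKey=` segment is an assumption about what counts as a usable connection string.

```typescript
interface TelemetryStatus {
  enabled: boolean;
  logLine: string; // emitted to stdout so the startup alert can match it
}

// Decide whether telemetry can be enabled from the given environment map.
function resolveTelemetryStatus(
  env: Record<string, string | undefined>,
): TelemetryStatus {
  const raw = (env.APPLICATIONINSIGHTS_CONNECTION_STRING ?? "").trim();
  // Assumption: a usable connection string carries an InstrumentationKey segment.
  const enabled = raw.includes("InstrumentationKey=");
  return {
    enabled,
    logLine: enabled
      ? "[telemetry] App Insights ENABLED"
      : "[telemetry] App Insights DISABLED",
  };
}

// Example: an empty environment produces the DISABLED marker.
const example = resolveTelemetryStatus({});
console.log(example.logLine);
```

In the API process the function would be called with `process.env`. Logging the marker unconditionally matters more than the check itself, because the Severity 1 alert keys on that exact string.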

Log queries (KQL):

kql

// API errors in last hour
ContainerAppConsoleLogs_CL
| where TimeGenerated > ago(1h)
| where Log_s contains '"level":"error"'
| project TimeGenerated, Log_s
| order by TimeGenerated desc

// Migration failure on startup
ContainerAppConsoleLogs_CL
| where TimeGenerated > ago(2h)
| where Log_s contains "P3009" or Log_s contains "migrate deploy" or Log_s contains "ERROR"
| project TimeGenerated, Log_s

// Startup telemetry status check
ContainerAppConsoleLogs_CL
| where TimeGenerated > ago(2h)
| where Log_s contains "App Insights"
| project TimeGenerated, Log_s

Alerts:

Container replica count drops to 0 during ministry hours (06:00–22:00 ET) → Severity 1.
Container restart count > 2 in 10 min → Severity 1 (indicates crash-loop).
P3009 or migrate resolve in startup logs → Severity 1.
[telemetry] App Insights DISABLED in startup logs → Severity 1.
5xx error rate > 5% over 5 min → Severity 1 (existing alert — verify active).
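
The replica-count alert above only applies inside the ministry-hours window, and that window is defined in Eastern Time rather than UTC. A sketch of a daylight-saving-aware window check is shown below. It assumes "ET" means the `America/New_York` time zone and that the window is inclusive of 06:00 and exclusive of 22:00; `isMinistryHours` is a hypothetical helper name.

```typescript
// True when the instant falls within 06:00 (inclusive) to 22:00 (exclusive)
// in America/New_York, accounting for daylight saving time.
function isMinistryHours(at: Date): boolean {
  const parts = new Intl.DateTimeFormat("en-US", {
    timeZone: "America/New_York",
    hour: "numeric",
    hour12: false,
  }).formatToParts(at);
  const hourPart = parts.find((p) => p.type === "hour");
  // Some engines render midnight as "24" in 24-hour mode, hence the modulo.
  const hour = Number(hourPart ? hourPart.value : NaN) % 24;
  return hour >= 6 && hour < 22;
}
```

A fixed UTC offset in the alert rule would drift by an hour twice a year, which is why the sketch resolves the local hour through the time zone database instead.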

4. Azure Database for PostgreSQL Flexible Server

Resource: psql-heritageva-prod-eus (or equivalent) in rg-heritageva-prod-eus.

What can fail:

Connection limit reached.
Storage at capacity.
Slow queries blocking the API.
Migration rollback state (failed_migration record in _prisma_migrations).
Backup failure.

Collection:

Enable PostgreSQL diagnostic settings → Log Analytics: PostgreSQLLogs, QueryStoreRuntimeStatistics, QueryStoreWaitStatistics.
Azure Monitor metric alerts on the Flexible Server resource.

Key metrics:

Metric	Threshold	Severity
`active_connections`	> 80% of instance max	2
`storage_percent`	> 80%	2
`cpu_percent`	Sustained > 80% for 10 min	2
`network_bytes_egress`	Unexpected spike (> 10× baseline)	1
Server availability	< 100% for 2 min	1

Log queries (KQL):

kql

// PostgreSQL error and failure log entries (includes failed migrations)
AzureDiagnostics
| where Category == "PostgreSQLLogs"
| where Message_s contains "failed" or Message_s contains "error"
| project TimeGenerated, Message_s
| order by TimeGenerated desc

// Slow queries (> 1 second)
AzureDiagnostics
| where Category == "QueryStoreRuntimeStatistics"
| where total_time_d > 1000
| project TimeGenerated, query_sql_text_s, total_time_d
| order by total_time_d desc

AuditLog completeness check (Layer 1):

kql

// Detect if audit writes have gone silent (no events in 24h during active hours)
// Run as a scheduled query alert with a lookback of at least 48h
AppTraces
| where Message contains "AuditLog"
| summarize lastEvent = max(TimeGenerated)
// An empty lookback window yields a null maximum, which must also fire
| where isnull(lastEvent) or lastEvent < ago(24h)

5. Clerk — authentication provider

What can fail:

Domain not registered in Clerk → blank screen on portal subdomains (root cause of listen.heritageva.app incident).
Clerk API down or degraded → all logins fail.
Suspicious sign-in activity (credential stuffing, brute force on child PIN endpoints).
JWT signing key rotation → all sessions invalidated unexpectedly.
Rate limiting on Clerk API → intermittent auth failures.

Collection — four paths:

Path A — Clerk dashboard (Layer 4, always-on):

Visit dashboard.clerk.com → Logs tab → filter by event type.
Shows: sign-in attempts, failures, session creates/destroys, domain errors.
Retention: varies by Clerk plan; check current plan at dashboard.clerk.com/settings/billing.

Path B — Clerk webhooks → Azure (recommended — implement this):

Configure a Clerk webhook endpoint in the Clerk dashboard → deliver events to an Azure Function or Container Apps ingress endpoint at POST /api/v1/webhooks/clerk.
Verify webhook signature using CLERK_WEBHOOK_SECRET (store in KV as heritage-clerk-webhook-secret).
The webhook handler writes structured events to Log Analytics via the trackEvent telemetry facade.
Events to subscribe to: user.created, user.updated, session.created, session.ended, session.removed, organization.updated, email.created.
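
The signature check in Path B could look roughly like the sketch below. Clerk delivers webhooks through Svix, which signs the string `{svix-id}.{svix-timestamp}.{body}` with HMAC-SHA256 and sends the result in the `svix-signature` header. The function name `verifyClerkWebhook`, the `SvixHeaders` shape, and the five-minute tolerance are assumptions for illustration; in production the official `svix` package's verifier is likely the safer choice.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

interface SvixHeaders {
  id: string;        // value of the svix-id header
  timestamp: string; // value of svix-timestamp (Unix seconds)
  signature: string; // value of svix-signature, e.g. "v1,<base64> v1,<base64>"
}

const TOLERANCE_SECONDS = 5 * 60; // assumed replay window

function verifyClerkWebhook(
  secret: string,
  headers: SvixHeaders,
  rawBody: string,
  nowSeconds: number = Math.floor(Date.now() / 1000),
): boolean {
  const ts = Number(headers.timestamp);
  if (!Number.isFinite(ts) || Math.abs(nowSeconds - ts) > TOLERANCE_SECONDS) {
    return false; // stale or malformed timestamp
  }
  // Signing secrets look like "whsec_<base64 key>"; the key is the decoded suffix.
  const key = Buffer.from(secret.replace(/^whsec_/, ""), "base64");
  const expected = createHmac("sha256", key)
    .update(`${headers.id}.${headers.timestamp}.${rawBody}`)
    .digest();
  // The header may carry several space-separated signatures during rotation.
  return headers.signature.split(" ").some((entry) => {
    const [version, value] = entry.split(",");
    if (version !== "v1" || !value) return false;
    const candidate = Buffer.from(value, "base64");
    return candidate.length === expected.length && timingSafeEqual(candidate, expected);
  });
}
```

The handler must verify against the raw request body, before any JSON parsing, because re-serialising the payload can change byte order or whitespace and invalidate the signature.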

Path C — API auth failure logging (already wired, verify active):

Every failed Clerk JWT validation in apps/api/src/middleware/requireAuth.ts calls trackException or trackEvent.
Verify these events appear in App Insights → Custom Events.

Path D — Clerk status page monitoring:

Subscribe to email notifications at status.clerk.com.
Add status.clerk.com to an Azure Monitor standard availability test, which alerts if the Clerk status page itself becomes unreachable.

Alerts:

Auth failure rate > 20/min across all endpoints → Severity 1 (App Insights custom metric alert).
Clerk webhook delivery failure (5xx from our endpoint) → Severity 2 (Clerk will retry; log and alert on repeated failure).
domain not recognized error in Clerk logs (webhook event or API error) → Severity 1 (causes portal blank screens).
Clerk API returning 5xx (detected via app auth failure spike) → Severity 1, escalate to status.clerk.com.

Log queries (KQL — after webhook integration):

kql

// Auth failures by type in last hour
AppEvents
| where TimeGenerated > ago(1h)
| where Name == "auth.failure"
| summarize count() by tostring(Properties.reason)
| order by count_ desc

// Clerk domain errors
AppEvents
| where Name == "clerk.domain_error"
| project TimeGenerated, Properties
| order by TimeGenerated desc

6. YouTube Live — streaming provider

When active: When the platform uses the YouTubeEmbedLiveProvider (ADR 0010 / apps/api/src/adapters/live/YouTubeEmbedLiveProvider.ts) — a steward creates a live event with a YouTube embed URL.

What can fail:

YouTube stream goes offline mid-service (OBS disconnect, network failure).
YouTube embed URL expires or is made private by accident.
YouTube API quota exceeded (if using Data API for metadata).
Viewer count spikes beyond expected range (unusual access pattern).
RTMP ingest rejected by YouTube (stream key revoked or expired).

Collection:

YouTube Studio (operator-managed, manual during stream):

YouTube Studio → Live → Stream Health shows encoder status, ingest bitrate, frame drops.
Stream operator (the person running OBS) monitors this during service.
YouTube auto-archives the recording; steward publishes it via the portal.

Application-layer health check (implement this):

The WatchLivePage.tsx already polls /api/v1/media/live/current every 15 seconds.
Extend the API to check YouTube embed URL reachability (HTTP HEAD request to the embed URL) on each poll.
If the embed URL returns non-200, the API sets live_event.status = 'degraded' and logs a trackEvent('youtube.embed_unreachable').
The Watch Live page shows a "Stream temporarily unavailable — reconnecting" banner rather than a blank embed.
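
The state handling for this health check can be sketched as a small tracker that turns individual poll results into a stream status. This is an illustration under assumptions: `EmbedHealthTracker`, `StreamStatus`, and `probeEmbed` are hypothetical names, the `down` state corresponds to the stream-down alert condition of more than two consecutive failed polls, and the probe assumes a runtime with a global `fetch`.

```typescript
type StreamStatus = "live" | "degraded" | "down";

class EmbedHealthTracker {
  private consecutiveFailures = 0;
  private readonly downAfter: number;

  constructor(downAfter: number = 2) {
    this.downAfter = downAfter;
  }

  // Record one poll result and return the resulting status.
  record(reachable: boolean): StreamStatus {
    if (reachable) {
      this.consecutiveFailures = 0;
      return "live";
    }
    this.consecutiveFailures += 1;
    // "down" only after more than `downAfter` failures in a row.
    return this.consecutiveFailures > this.downAfter ? "down" : "degraded";
  }
}

// Hypothetical probe: a HEAD request that treats any error as unreachable.
async function probeEmbed(url: string): Promise<boolean> {
  try {
    const res = await fetch(url, { method: "HEAD" });
    return res.status === 200;
  } catch {
    return false;
  }
}
```

Keeping the counter separate from the probe means a single dropped request shows the banner without paging anyone, while a sustained outage escalates.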

YouTube Data API v3 (optional — implement if needed):

Use YouTube Data API v3 to query live broadcast status by video ID.
API key stored in KV as heritage-youtube-data-api-key.
Poll every 60 seconds during an active live event; write result to App Insights as a custom metric youtube.concurrent_viewers.
Alert if concurrent_viewers drops from > 0 to 0 mid-stream (stream ended unexpectedly) → Severity 2 → notify steward.

Alerts:

YouTube embed URL HEAD check fails for more than 2 consecutive polls (over 30 s at the 15-second poll interval) → trackEvent('youtube.stream_down') → Log Analytics alert → Severity 1 during active service.
concurrent_viewers drops to 0 during a scheduled live event window → Severity 2.
YouTube Data API quota exhausted (quotaExceeded error in API logs) → Severity 2.

Log queries (KQL):

kql

// YouTube stream health events
AppEvents
| where TimeGenerated > ago(4h)
| where Name startswith "youtube."
| project TimeGenerated, Name, Properties
| order by TimeGenerated desc

// Stream-down events
AppEvents
| where Name == "youtube.stream_down"
| project TimeGenerated, Properties

7. Cloudflare DNS

What can fail:

CNAME record pointing at wrong Azure resource (root cause of listen.heritageva.app incident).
DNS propagation delay after record change.
Cloudflare API token revoked → automated DNS operations fail silently.
DDoS or rate-limiting affecting DNS resolution.

Collection:

Cloudflare API polling (implement via Azure Function on schedule):

Every 10 minutes, query GET /client/v4/zones/{zone_id}/dns_records?name=listen.heritageva.app and ?name=heritageva.app.
Compare content field against expected values (stored as Azure Function app settings).
If content doesn't match expected SWA hostname → write trackEvent('dns.cname_mismatch') → Log Analytics alert → Severity 1.
Cloudflare API token stored in KV as hcs-platform-cloudflare-api-token (already present).

Expected DNS record values to monitor:

Hostname	Expected record content
`listen.heritageva.app`	`salmon-beach-0c664ab0f.7.azurestaticapps.net`
`heritageva.app`	Azure SWA IP (A record via Cloudflare proxy)
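
The comparison step of the DNS monitor can be sketched as a pure function over the records returned by the Cloudflare API. The names `DnsRecord`, `Mismatch`, and `findMismatches` are hypothetical; the `name`, `type`, and `content` fields follow the DNS record objects that the Cloudflare list endpoint returns inside its `result` array. Fetching the records and emitting `dns.cname_mismatch` are left out.

```typescript
interface DnsRecord {
  name: string;
  type: string;
  content: string;
}

interface Mismatch {
  name: string;
  expected: string;
  actual: string | null; // null when no record exists for the hostname
}

// Compare live records against expected content, ignoring case,
// surrounding whitespace, and a trailing dot.
function findMismatches(
  records: DnsRecord[],
  expected: Record<string, string>,
): Mismatch[] {
  const normalize = (v: string): string => v.trim().toLowerCase().replace(/\.$/, "");
  const mismatches: Mismatch[] = [];
  for (const [name, want] of Object.entries(expected)) {
    const found = records.find((r) => normalize(r.name) === normalize(name));
    if (!found || normalize(found.content) !== normalize(want)) {
      mismatches.push({ name, expected: want, actual: found ? found.content : null });
    }
  }
  return mismatches;
}
```

A missing record is reported as a mismatch with `actual: null`, so a deleted CNAME raises the same alert as a redirected one.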

Azure Monitor URL availability tests:

Add standard availability tests in App Insights for:
- https://heritageva.app — expect HTTP 200, response < 3 s.
- https://listen.heritageva.app — expect HTTP 200, response < 3 s.
- https://api.heritageva.app/health (or equivalent API health endpoint).
Test from 5 Azure Monitor test locations globally.
Alert if > 2 locations fail → Severity 1.

Log queries (KQL):

kql

// DNS mismatch events
AppEvents
| where Name == "dns.cname_mismatch"
| project TimeGenerated, Properties
| order by TimeGenerated desc

// Availability test failures
AppAvailabilityResults
| where TimeGenerated > ago(1h)
| where Success == false
| project TimeGenerated, Name, Location, Message
| order by TimeGenerated desc

Implementation checklist

The following items must be completed to make this ADR fully implemented. Each is a discrete ADO story.

Immediate (fixes broken monitoring)

[ ] Fix API telemetry (CRITICAL): Retrieve App Insights connection string from az monitor app-insights component show --app appi-heritageva-prod and store in KV as appinsights-connection-string. Verify [telemetry] App Insights ENABLED appears in Container App startup logs.
[ ] Fix listen.heritageva.app SWA hostname: Remove listen.heritageva.app from old SWA (rg-heritageva-prod-eus2); register it on production SWA (rg-heritageva-prod-eus).
[ ] Set alertEmail in Bicep deployment params: Ensure kris@hybridsolutions.cloud is passed as alertEmail in the production deployment so the Action Group and alert rules actually fire to someone.

Infrastructure (Bicep additions)

[ ] SWA diagnostic settings: Add Microsoft.Insights/diagnosticSettings to swa.bicep routing to Log Analytics.
[ ] PostgreSQL diagnostic settings: Add diagnostic settings routing PostgreSQLLogs and QueryStoreRuntimeStatistics to Log Analytics.
[ ] Key Vault diagnostic settings: Add diagnostic settings routing AuditEvent logs to Log Analytics (catches failed SecretGet operations).
[ ] Container App crash-loop alert: Add metric alert rule on Container App RestartCount > 2 in 10 min.
[ ] Cost budget alert: Add Microsoft.Consumption/budgets Bicep resource with email alert at 100% and 130% of monthly budget.
[ ] App Insights URL availability tests: Add Microsoft.Insights/webtests Bicep resources for heritageva.app, listen.heritageva.app, and the API health endpoint.

Application (code additions)

[ ] Clerk webhook handler: Add POST /api/v1/webhooks/clerk route to apps/api. Verify signature with CLERK_WEBHOOK_SECRET (KV). Write events to trackEvent. Subscribe at dashboard.clerk.com.
[ ] Auth failure telemetry: Verify requireAuth.ts calls trackEvent('auth.failure', { reason }) on every failed JWT validation. Add if missing.
[ ] YouTube embed health check: Extend GET /api/v1/media/live/current to HEAD-check the embed URL and return embedReachable: boolean. Have WatchLivePage.tsx show degraded banner when false.
[ ] YouTube Data API viewer count: Implement GET /api/v1/media/live/current/stats that queries YouTube Data API v3 by video ID during active events. Store API key in KV as heritage-youtube-data-api-key.

Automation

[ ] DNS monitor Function: Azure Function (Timer trigger, 10 min) that validates Cloudflare CNAMEs against expected values and writes dns.cname_mismatch events to App Insights on mismatch.
[ ] SWA deployment asset freshness check: Add a step to the GitHub Actions Deploy workflow that curls Last-Modified from https://heritageva.app and https://listen.heritageva.app post-deploy and logs a warning if either is > 1 hour old.

What each stakeholder sees

Stakeholder	Tool	What they watch
Platform operator (Kris)	Azure Portal → App Insights → Failures	API errors, auth failures, YouTube stream events
Platform operator (Kris)	Azure Portal → Log Analytics	Cross-surface KQL queries; migration failures; KV access denied
Platform operator (Kris)	Azure Portal → Monitor → Alerts	Active alert rules firing to email
Platform operator (Kris)	`dashboard.clerk.com` → Logs	Auth session events; domain errors; suspicious sign-ins
Platform operator (Kris)	`studio.youtube.com` → Live	Stream health during service (manual)
Platform operator (Kris)	Email (`kris@hybridsolutions.cloud`)	Critical and warning alert emails from Action Group
Platform operator (Kris)	`status.clerk.com`	Clerk platform status (subscribed to email updates)
Minister	Admin portal → Audit Log	Member activity, approval actions (Layer 1)

Consequences

Positive

Every platform surface has defined collection, queries, and alerts — no more silent failures.
The listen.heritageva.app-class incidents (wrong CDN, missing KV secret, unregistered Clerk domain) all have alerts that would have caught them in minutes.
YouTube stream failures surface as an in-app banner rather than a blank embed.
DNS drift is detected automatically within 10 minutes of a record change.

Negative / trade-offs

The DNS monitor Function and YouTube Data API integration add operational surface area to maintain.
YouTube Data API v3 has a daily quota of 10,000 units. Polling every 60 s during a 2-hour stream = 120 requests × ~2 quota units each = ~240 units/stream. Well within quota.
Clerk webhook secret (CLERK_WEBHOOK_SECRET) must be rotated if compromised — add to secret rotation runbook (docs/internal/operations/secret-rotation.md).
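
The YouTube quota estimate in the list above can be reproduced with a short calculation, which also makes it easy to re-check if the poll interval changes. `estimateQuotaUnits` is a hypothetical helper, and the cost of 2 units per request is the approximation used in this ADR rather than a measured figure.

```typescript
// Estimated YouTube Data API quota units consumed by polling during one stream.
function estimateQuotaUnits(
  streamMinutes: number,
  pollSeconds: number,
  unitsPerRequest: number,
): number {
  const requests = Math.ceil((streamMinutes * 60) / pollSeconds);
  return requests * unitsPerRequest;
}

const DAILY_QUOTA_UNITS = 10_000;
const perStream = estimateQuotaUnits(120, 60, 2); // 2-hour stream, 60 s polls
const streamsPerDay = Math.floor(DAILY_QUOTA_UNITS / perStream);
console.log(perStream, streamsPerDay); // prints: 240 41
```

Under these assumptions the daily quota covers about 41 two-hour streams, so quota pressure would come from other API consumers rather than from this poller.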

Risks

Cloudflare token rotation: If hcs-platform-cloudflare-api-token rotates without updating KV, the DNS monitor Function will fail silently. Add a KV expiry policy for that secret.
Log Analytics retention cost: With all diagnostic settings enabled, log volume will increase. Monitor cost in the first month; adjust retention tiers if it spikes.
YouTube Data API dependency: If YouTube deprecates or changes the Live Broadcasts endpoint, the viewer count metric will fail. Design the endpoint to fail gracefully and return null for concurrentViewers rather than erroring.

References

Document	Relevance
ADR 0005 — Observability model	Foundation four-layer model; this ADR extends it
ADR 0003 — Authentication: Clerk	Clerk Layer 4; webhook integration
ADR 0004 — Cloud/hosting stack	Azure resource names and resource groups
ADR 0010 — Sermons and music hub	YouTube Live provider; stream health
ADR 0024 — Cloud portability	Provider abstraction; non-Azure fallback for telemetry
ADR 0026 — Production domain and DNS	Cloudflare DNS; custom domain config
ADR 0028 — Per-portal subdomains	listen.heritageva.app SWA binding
ADR 0029 — CI/CD and Azure auth	GitHub Actions deployment; workflow failure detection
Observability runbook	KQL queries; on-call; incident response
Secret rotation runbook	Clerk webhook secret; Cloudflare token rotation