Appearance
0045 — Comprehensive observability and alerting ​
Status: Accepted
Date: 2026-06-27
ADO work item: AB#TBD
Deciders: Kristopher Turner (platform owner)
Amends: ADR 0005 (observability model) — extends it to cover all platform surfaces including SWA, Clerk, YouTube Live, Cloudflare DNS, Key Vault, and cost. ADR 0005 remains the canonical source for the four-layer model definition; this ADR operationalises each layer with concrete implementation requirements.
Context ​
ADR 0005 defined a four-layer observability model and was accepted when the platform was in early planning. The model is sound but the implementation coverage is incomplete:
- Azure Static Web App (SWA) — no diagnostic settings, no CDN error visibility. The
listen.heritageva.appblank-screen incident lasted multiple sessions because there were no logs showing a wrong SWA was serving content. - Clerk — no webhook integration, no alerting on auth failures, domain misconfigurations, or sign-in anomalies landing in Azure.
- YouTube Live — no monitoring of stream health, viewer counts, or ingest errors when the platform uses YouTube as the live streaming provider (ADR 0010).
- Cloudflare DNS — no alerting on DNS resolution failures or custom domain issues.
- Azure Key Vault — KV secret-access failures are silent; the
appinsights-connection-stringsecret was missing and telemetry was silently off for an unknown period. - Azure App Insights connection string — the Container App referenced a KV secret that did not exist, leaving Layer 2 API telemetry disabled in production with no alert.
- Cost monitoring — cost alerts defined in the runbook but no Bicep alert rule deployed.
- AuditLog completeness — Layer 1 is implemented but no alert fires when audit writes fail or fall silent.
This ADR defines what must be monitored, how, and what fires when something goes wrong across every surface of the Heritage Community Hub platform.
Decision ​
We will implement complete observability across all seven platform surfaces: (1) Azure infrastructure, (2) Azure Static Web App, (3) Azure Container Apps API, (4) Azure Database for PostgreSQL, (5) Clerk, (6) YouTube Live, and (7) Cloudflare DNS. Every surface has defined: what is collected, where it goes, and what alert fires. Nothing is silent.
Surface-by-surface specification ​
1. Azure infrastructure (cross-cutting) ​
Resource: Log Analytics workspace log-hch-prod-eus, App Insights appi-heritageva-prod, Action Group ag-hch-prod-eus.
Collection:
- All Azure resources emit diagnostic logs to the Log Analytics workspace via Bicep
diagnosticSettings. - Metrics flow to Azure Monitor automatically from all ARM resources.
- Single Log Analytics workspace — all surfaces write to the same workspace so cross-surface KQL queries are possible.
Key implementation gap to close: Ensure every resource has an explicit Microsoft.Insights/diagnosticSettings Bicep block pointing at the workspace. Do not rely on Azure defaults.
Alerts (existing — verify deployed):
- API error rate > 5% over 5 min → Severity 1.
- API p99 response time > 2 s → Severity 2.
New alerts to add:
- Monthly cost anomaly: Azure Cost Management budget alert at 130% of rolling 7-day baseline → email.
- Key Vault access denied: any
SecretGetFailedevent in KV diagnostic logs → Severity 1 (indicates a broken secret reference, as seen withappinsights-connection-string). - Key Vault secret missing on startup: if Container App emits
[telemetry] App Insights DISABLEDin startup logs → Severity 1.
2. Azure Static Web App (SWA) ​
Resource: swa-heritageva-prod-eus2 in rg-heritageva-prod-eus. Serves both heritageva.app and (after fix) listen.heritageva.app.
What can fail:
- CDN serving stale content from wrong SWA (root cause of the listen blank-screen incident).
- Custom domain binding missing or in
Failedstate. - Build/deployment failure leaving old assets live.
- JavaScript 404 (asset hash mismatch after deployment).
- HTTP 4xx/5xx from the SWA CDN edge.
Collection:
- Enable SWA diagnostic settings → Log Analytics:
AppServiceHTTPLogs,AppServiceConsoleLogs,FunctionAppLogs. - Bicep
Microsoft.Insights/diagnosticSettingson the SWA resource.
Alerts:
- Custom domain status not Ready: query
az staticwebapp hostname listin a Logic App or Azure Function on a daily schedule; alert if any domain is not inReadystate → Severity 1. - HTTP 5xx rate from SWA CDN: if available in SWA HTTP logs, alert on > 1% 5xx over 15 min → Severity 2.
- Deployment failure: GitHub Actions
Deployworkflow failure notifies via existing GitHub email; additionally, the workflow should write a structured log entry to Log Analytics on failure using aSend-AzMonitorLogstep (Bicep-defined DCE/DCR endpoint).
Manual checks (until automated):
- After every push: verify
curl -sI https://heritageva.app | grep Last-Modifiedmatches the expected build date. - After every push: verify
curl -sI https://listen.heritageva.app | grep Last-Modifiedmatches. - Custom domain health check:
az staticwebapp hostname list --name swa-heritageva-prod-eus2 --resource-group rg-heritageva-prod-eus.
3. Azure Container Apps — API ​
Resource: ca-heritageva-api-prod-eus in rg-heritageva-prod-eus.
What can fail:
- Container fails to start (Prisma migrate error, missing env var, schema mismatch).
- Container scales to zero unexpectedly.
- API returns 5xx (unhandled exception, DB connection lost).
- App Insights telemetry silently disabled (missing KV secret — confirmed gap).
- Migration failure on startup (known historical incident:
P3009 failed migration blocking deploy).
Collection:
- Container stdout → Log Analytics (wired via
logAnalyticsConfigurationincontainerapp.bicep— verified present). - App Insights SDK (
applicationinsightsnpm package) viaapps/api/src/lib/telemetry.ts— requiresAPPLICATIONINSIGHTS_CONNECTION_STRINGenv var pulled from KV secretappinsights-connection-string. - Gap to close: Create the KV secret
appinsights-connection-stringwith the value fromaz monitor app-insights component show --app appi-heritageva-prod --resource-group rg-heritageva-prod-eus --query connectionString.
Log queries (KQL):
kql
// API errors in last hour
ContainerAppConsoleLogs_CL
| where TimeGenerated > ago(1h)
| where Log_s contains '"level":"error"'
| project TimeGenerated, Log_s
| order by TimeGenerated desc
// Migration failure on startup
ContainerAppConsoleLogs_CL
| where TimeGenerated > ago(2h)
| where Log_s contains "P3009" or Log_s contains "migrate deploy" or Log_s contains "ERROR"
| project TimeGenerated, Log_s
// Startup telemetry status check
ContainerAppConsoleLogs_CL
| where TimeGenerated > ago(2h)
| where Log_s contains "App Insights"
| project TimeGenerated, Log_sAlerts:
- Container replica count drops to 0 during ministry hours (06:00–22:00 ET) → Severity 1.
- Container restart count > 2 in 10 min → Severity 1 (indicates crash-loop).
P3009ormigrate resolvein startup logs → Severity 1.[telemetry] App Insights DISABLEDin startup logs → Severity 1.- 5xx error rate > 5% over 5 min → Severity 1 (existing alert — verify active).
4. Azure Database for PostgreSQL Flexible Server ​
Resource: psql-heritageva-prod-eus (or equivalent) in rg-heritageva-prod-eus.
What can fail:
- Connection limit reached.
- Storage at capacity.
- Slow queries blocking the API.
- Migration rollback state (
failed_migrationrecord in_prisma_migrations). - Backup failure.
Collection:
- Enable PostgreSQL diagnostic settings → Log Analytics:
PostgreSQLLogs,QueryStoreRuntimeStatistics,QueryStoreWaitStatistics. - Azure Monitor metric alerts on the Flexible Server resource.
Key metrics:
| Metric | Threshold | Severity |
|---|---|---|
active_connections | > 80% of instance max | 2 |
storage_percent | > 80% | 2 |
cpu_percent | Sustained > 80% for 10 min | 2 |
network_bytes_egress | Unexpected spike (> 10× baseline) | 1 |
| Server availability | < 100% for 2 min | 1 |
Log queries (KQL):
kql
// Failed migrations
AzureDiagnostics
| where Category == "PostgreSQLLogs"
| where Message_s contains "failed" or Message_s contains "error"
| project TimeGenerated, Message_s
| order by TimeGenerated desc
// Slow queries (> 1 second)
AzureDiagnostics
| where Category == "QueryStoreRuntimeStatistics"
| where total_time_d > 1000
| project TimeGenerated, query_sql_text_s, total_time_d
| order by total_time_d descAuditLog completeness check (Layer 1):
kql
// Detect if audit writes have gone silent (no events in 24h during active hours)
// Run as a scheduled query alert
let lastEvent = AppTraces
| where Message contains "AuditLog"
| summarize max(TimeGenerated);
lastEvent
| where max_TimeGenerated < ago(24h)5. Clerk — authentication provider ​
What can fail:
- Domain not registered in Clerk → blank screen on portal subdomains (root cause of listen.heritageva.app incident).
- Clerk API down or degraded → all logins fail.
- Suspicious sign-in activity (credential stuffing, brute force on child PIN endpoints).
- JWT signing key rotation → all sessions invalidated unexpectedly.
- Rate limiting on Clerk API → intermittent auth failures.
Collection — four paths:
Path A — Clerk dashboard (Layer 4, always-on):
- Visit
dashboard.clerk.com→ Logs tab → filter by event type. - Shows: sign-in attempts, failures, session creates/destroys, domain errors.
- Retention: varies by Clerk plan; check current plan at
dashboard.clerk.com/settings/billing.
Path B — Clerk webhooks → Azure (recommended — implement this):
- Configure a Clerk webhook endpoint in the Clerk dashboard → deliver events to an Azure Function or Container Apps ingress endpoint at
POST /api/v1/webhooks/clerk. - Verify webhook signature using
CLERK_WEBHOOK_SECRET(store in KV asheritage-clerk-webhook-secret). - The webhook handler writes structured events to Log Analytics via the
trackEventtelemetry facade. - Events to subscribe to:
user.created,user.updated,session.created,session.ended,session.removed,organization.updated,email.created.
Path C — API auth failure logging (already wired, verify active):
- Every failed Clerk JWT validation in
apps/api/src/middleware/requireAuth.tscallstrackExceptionortrackEvent. - Verify these events appear in App Insights → Custom Events.
Path D — Clerk status page monitoring:
- Subscribe to email notifications at
status.clerk.com. - Add
status.clerk.comto Azure Monitor URL ping test (availability test) — alerts if Clerk status page itself goes unhealthy.
Alerts:
- Auth failure rate > 20/min across all endpoints → Severity 1 (App Insights custom metric alert).
- Clerk webhook delivery failure (5xx from our endpoint) → Severity 2 (Clerk will retry; log and alert on repeated failure).
domain not recognizederror in Clerk logs (webhook event or API error) → Severity 1 (causes portal blank screens).- Clerk API returning 5xx (detected via app auth failure spike) → Severity 1, escalate to
status.clerk.com.
Log queries (KQL — after webhook integration):
kql
// Auth failures by type in last hour
AppEvents
| where TimeGenerated > ago(1h)
| where Name == "auth.failure"
| summarize count() by tostring(Properties.reason)
| order by count_ desc
// Clerk domain errors
AppEvents
| where Name == "clerk.domain_error"
| project TimeGenerated, Properties
| order by TimeGenerated desc6. YouTube Live — streaming provider ​
When active: When the platform uses the YouTubeEmbedLiveProvider (ADR 0010 / apps/api/src/adapters/live/YouTubeEmbedLiveProvider.ts) — a steward creates a live event with a YouTube embed URL.
What can fail:
- YouTube stream goes offline mid-service (OBS disconnect, network failure).
- YouTube embed URL expires or is made private by accident.
- YouTube API quota exceeded (if using Data API for metadata).
- Viewer count spikes beyond expected range (unusual access pattern).
- RTMP ingest rejected by YouTube (stream key revoked or expired).
Collection:
YouTube Studio (operator-managed, manual during stream):
- YouTube Studio → Live → Stream Health shows encoder status, ingest bitrate, frame drops.
- Stream operator (the person running OBS) monitors this during service.
- YouTube auto-archives the recording; steward publishes it via the portal.
Application-layer health check (implement this):
- The
WatchLivePage.tsxalready polls/api/v1/media/live/currentevery 15 seconds. - Extend the API to check YouTube embed URL reachability (HTTP HEAD request to the embed URL) on each poll.
- If the embed URL returns non-200, the API sets
live_event.status = 'degraded'and logs atrackEvent('youtube.embed_unreachable'). - The Watch Live page shows a "Stream temporarily unavailable — reconnecting" banner rather than a blank embed.
YouTube Data API v3 (optional — implement if needed):
- Use YouTube Data API v3 to query live broadcast status by video ID.
- API key stored in KV as
heritage-youtube-data-api-key. - Poll every 60 seconds during an active live event; write result to App Insights as a custom metric
youtube.concurrent_viewers. - Alert if
concurrent_viewersdrops from > 0 to 0 mid-stream (stream ended unexpectedly) → Severity 2 → notify steward.
Alerts:
- YouTube embed URL HEAD check fails for > 2 consecutive polls (30 s) →
trackEvent('youtube.stream_down')→ Log Analytics alert → Severity 1 during active service. concurrent_viewersdrops to 0 during a scheduled live event window → Severity 2.- YouTube Data API quota nearing limit (
quotaExceedederror in API logs) → Severity 2.
Log queries (KQL):
kql
// YouTube stream health events
AppEvents
| where TimeGenerated > ago(4h)
| where Name startswith "youtube."
| project TimeGenerated, Name, Properties
| order by TimeGenerated desc
// Stream-down events
AppEvents
| where Name == "youtube.stream_down"
| project TimeGenerated, Properties7. Cloudflare DNS ​
What can fail:
- CNAME record pointing at wrong Azure resource (root cause of listen.heritageva.app incident).
- DNS propagation delay after record change.
- Cloudflare API token revoked → automated DNS operations fail silently.
- DDoS or rate-limiting affecting DNS resolution.
Collection:
Cloudflare API polling (implement via Azure Function on schedule):
- Every 10 minutes, query
GET /client/v4/zones/{zone_id}/dns_records?name=listen.heritageva.appand?name=heritageva.app. - Compare
contentfield against expected values (stored as Azure Function app settings). - If
contentdoesn't match expected SWA hostname → writetrackEvent('dns.cname_mismatch')→ Log Analytics alert → Severity 1. - Cloudflare API token stored in KV as
hcs-platform-cloudflare-api-token(already present).
Expected DNS record values to monitor:
| Hostname | Expected CNAME content |
|---|---|
listen.heritageva.app | salmon-beach-0c664ab0f.7.azurestaticapps.net |
heritageva.app | Azure SWA IP (A record via Cloudflare proxy) |
Azure Monitor URL availability tests:
- Add ping tests in App Insights for:
https://heritageva.app— expect HTTP 200, response < 3 s.https://listen.heritageva.app— expect HTTP 200, response < 3 s.https://api.heritageva.app/health(or equivalent API health endpoint).
- Test from 5 Azure Monitor test locations globally.
- Alert if > 2 locations fail → Severity 1.
Log queries (KQL):
kql
// DNS mismatch events
AppEvents
| where Name == "dns.cname_mismatch"
| project TimeGenerated, Properties
| order by TimeGenerated desc
// Availability test failures
AppAvailabilityResults
| where TimeGenerated > ago(1h)
| where Success == false
| project TimeGenerated, Name, Location, Message
| order by TimeGenerated descImplementation checklist ​
The following items must be completed to make this ADR fully implemented. Each is a discrete ADO story.
Immediate (fixes broken monitoring) ​
- [ ] Fix API telemetry (CRITICAL): Retrieve App Insights connection string from
az monitor app-insights component show --app appi-heritageva-prodand store in KV asappinsights-connection-string. Verify[telemetry] App Insights ENABLEDappears in Container App startup logs. - [ ] Fix listen.heritageva.app SWA hostname: Remove
listen.heritageva.appfrom old SWA (rg-heritageva-prod-eus2); register it on production SWA (rg-heritageva-prod-eus). - [ ] Set alertEmail in Bicep deployment params: Ensure
kris@hybridsolutions.cloudis passed asalertEmailin the production deployment so the Action Group and alert rules actually fire to someone.
Infrastructure (Bicep additions) ​
- [ ] SWA diagnostic settings: Add
Microsoft.Insights/diagnosticSettingstoswa.biceprouting to Log Analytics. - [ ] PostgreSQL diagnostic settings: Add diagnostic settings routing
PostgreSQLLogsandQueryStoreRuntimeStatisticsto Log Analytics. - [ ] Key Vault diagnostic settings: Add diagnostic settings routing
AuditEventlogs to Log Analytics (catchesSecretGetFailed). - [ ] Container App crash-loop alert: Add metric alert rule on Container App
RestartCount > 2 in 10 min. - [ ] Cost budget alert: Add
Microsoft.Consumption/budgetsBicep resource with email alert at 100% and 130% of monthly budget. - [ ] App Insights URL availability tests: Add
Microsoft.Insights/webtestsBicep resources forheritageva.app,listen.heritageva.app, and the API health endpoint.
Application (code additions) ​
- [ ] Clerk webhook handler: Add
POST /api/v1/webhooks/clerkroute toapps/api. Verify signature withCLERK_WEBHOOK_SECRET(KV). Write events totrackEvent. Subscribe atdashboard.clerk.com. - [ ] Auth failure telemetry: Verify
requireAuth.tscallstrackEvent('auth.failure', { reason })on every failed JWT validation. Add if missing. - [ ] YouTube embed health check: Extend
GET /api/v1/media/live/currentto HEAD-check the embed URL and returnembedReachable: boolean. HaveWatchLivePage.tsxshow degraded banner whenfalse. - [ ] YouTube Data API viewer count: Implement
GET /api/v1/media/live/current/statsthat queries YouTube Data API v3 by video ID during active events. Store API key in KV asheritage-youtube-data-api-key.
Automation ​
- [ ] DNS monitor Function: Azure Function (Timer trigger, 10 min) that validates Cloudflare CNAMEs against expected values and writes
dns.cname_mismatchevents to App Insights on mismatch. - [ ] SWA deployment asset freshness check: Add a step to the GitHub Actions
Deployworkflow that curlsLast-Modifiedfromhttps://heritageva.appandhttps://listen.heritageva.apppost-deploy and logs a warning if either is > 1 hour old.
What each stakeholder sees ​
| Stakeholder | Tool | What they watch |
|---|---|---|
| Platform operator (Kris) | Azure Portal → App Insights → Failures | API errors, auth failures, YouTube stream events |
| Platform operator (Kris) | Azure Portal → Log Analytics | Cross-surface KQL queries; migration failures; KV access denied |
| Platform operator (Kris) | Azure Portal → Monitor → Alerts | Active alert rules firing to email |
| Platform operator (Kris) | dashboard.clerk.com → Logs | Auth session events; domain errors; suspicious sign-ins |
| Platform operator (Kris) | studio.youtube.com → Live | Stream health during service (manual) |
| Platform operator (Kris) | Email (kris@hybridsolutions.cloud) | Critical and warning alert emails from Action Group |
| Platform operator (Kris) | status.clerk.com | Clerk platform status (subscribed to email updates) |
| Minister | Admin portal → Audit Log | Member activity, approval actions (Layer 1) |
Consequences ​
Positive ​
- Every platform surface has defined collection, queries, and alerts — no more silent failures.
- The
listen.heritageva.app-class incidents (wrong CDN, missing KV secret, unregistered Clerk domain) all have alerts that would have caught them in minutes. - YouTube stream failures surface as an in-app banner rather than a blank embed.
- DNS drift is detected automatically within 10 minutes of a record change.
Negative / trade-offs ​
- The DNS monitor Function and YouTube Data API integration add operational surface area to maintain.
- YouTube Data API v3 has a daily quota of 10,000 units. Polling every 60 s during a 2-hour stream = 120 requests × ~2 quota units each = ~240 units/stream. Well within quota.
- Clerk webhook secret (
CLERK_WEBHOOK_SECRET) must be rotated if compromised — add to secret rotation runbook (docs/internal/operations/secret-rotation.md).
Risks ​
- Cloudflare token rotation: If
hcs-platform-cloudflare-api-tokenrotates without updating KV, the DNS monitor Function will fail silently. Add a KV expiry policy for that secret. - Log Analytics retention cost: With all diagnostic settings enabled, log volume will increase. Monitor cost in the first month; adjust retention tiers if it spikes.
- YouTube Data API dependency: If YouTube deprecates or changes the Live Broadcasts endpoint, the viewer count metric will fail. Design the endpoint to fail gracefully and return
nullforconcurrentViewersrather than erroring.
References ​
| Document | Relevance |
|---|---|
| ADR 0005 — Observability model | Foundation four-layer model; this ADR extends it |
| ADR 0003 — Authentication: Clerk | Clerk Layer 4; webhook integration |
| ADR 0004 — Cloud/hosting stack | Azure resource names and resource groups |
| ADR 0010 — Sermons and music hub | YouTube Live provider; stream health |
| ADR 0024 — Cloud portability | Provider abstraction; non-Azure fallback for telemetry |
| ADR 0026 — Production domain and DNS | Cloudflare DNS; custom domain config |
| ADR 0028 — Per-portal subdomains | listen.heritageva.app SWA binding |
| ADR 0029 — CI/CD and Azure auth | GitHub Actions deployment; workflow failure detection |
| Observability runbook | KQL queries; on-call; incident response |
| Secret rotation runbook | Clerk webhook secret; Cloudflare token rotation |