Troubleshooting Runbook
Use this runbook when: A live Wait Experience + Admin Email deployment is misbehaving (caller complaint, missing email, broken CRM pop, or unexplained config drift). Find the symptom in the section list and follow the diagnostic steps. If multiple symptoms apply, start with the most caller-visible one.
1. Role / Authority
Run as CTO-Connie. Diagnostic commands are read-only and safe to run without CEO approval. Any fix that touches live config, redeploys serverless, or modifies a Studio Flow requires CEO approval first. No exceptions, even when the fix seems trivial.
If you suspect cross-account contamination (Symptom: "CCT/DevSandbox config changed unexpectedly"), stop all other work, escalate to CEO immediately, and follow the high-severity path in that section.
2. Required Parameters
Gather before diagnosing:
| Parameter | Why you need it |
|---|---|
| Connie client account name | Selects the right Twilio profile (twilio profiles:use <ClientName>) |
| Reported symptom | Match to a section below; if no match, see §7 If Your Variant Differs |
| When it started | Narrows the timeline; correlate with deploy log and git log |
| Sample call SID (if available) | Lets you pull exact Studio execution + serverless logs for that call |
| Reporting party | CEO, client admin, agent, automated alert? Determines escalation path. |
| Reproducibility | One-off, intermittent, or 100% repro? Affects diagnostic strategy. |
If "when it started" coincides with a recent deploy or change-request, review the dev-log for that deploy first; most live issues trace to the most recent deploy.
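Before pasting a SID from a report into a command, it can help to confirm it is the kind of SID the command expects. The sketch below classifies a SID by its two-letter prefix. The helper name `sid_kind` is hypothetical (it is not part of the Twilio CLI), and the prefix list covers only the resource types this runbook touches.

```shell
# sid_kind SID
# Prints the resource type implied by a Twilio SID prefix, or "invalid".
# Hypothetical helper for sanity-checking reported SIDs; not a Twilio CLI command.
sid_kind() {
  sk_sid=$1
  # Twilio SIDs are a two-letter prefix followed by 32 hexadecimal characters.
  if ! printf '%s\n' "$sk_sid" | grep -Eq '^[A-Z]{2}[0-9a-fA-F]{32}$'; then
    echo "invalid"
    return 1
  fi
  case $sk_sid in
    AC*) echo "account" ;;
    CA*) echo "call" ;;
    FW*) echo "studio-flow" ;;
    RE*) echo "recording" ;;
    WW*) echo "workflow" ;;
    ZS*) echo "serverless-service" ;;
    *)   echo "other" ;;
  esac
}
```

For example, a value that classifies as `workflow` belongs in the Wait URL's `WorkflowSid` param, not in `--flow-sid`.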
3. Read First
- ~/projects/connie/rtc/basecamp-v26.02/CLAUDE.md: Flex Configuration Safety Protocol applies if you end up making fixes.
- The dev-log for the most recent deploy/change in ~/projects/connie/rtc/dev-logs/wait-experience-<client>-* (the most likely cause is whatever changed last).
- ~/projects/connie/rtc/PAC.md: current SIDs, Mailgun domain, env file paths for the target client.
- Setup runbook: for understanding what should be in place.
4. Safety Rails for This Change
Troubleshooting is read-only by default. The moment you cross from diagnose to fix, the Setup-runbook safety rails apply in full:
- Diagnostic commands are safe (curl GETs, twilio api:* list, twilio serverless:logs --tail, jq-style filters). Run freely.
- Fix commands are NOT safe without CEO approval. Redeploy, config write, Studio Flow republish, env file edit: all require explicit go-ahead.
- Capture a PRE-fix snapshot before any change, even an "obvious" one. The fix may have unintended consequences worth diffing.
- NEVER write directly to the Flex Configuration API. Drift fixes go through the deploy pipeline or /template-admin.
- Cross-account contamination is HIGH severity. Stop investigating anything else, escalate to CEO, and follow the dedicated section below before further commands.
- Check the right profile is active. Run twilio profiles:use <ClientName> before every command. Wrong profile = wrong account = potential second incident.
5. Procedure
Diagnostic toolkit (run these first)
# Confirm correct profile
twilio profiles:list
twilio profiles:use <ClientName>
# Last 5 calls
twilio api:core:calls:list --limit 5
# Studio Flow recent executions
twilio api:studio:v2:flows:executions:list --flow-sid <flow-sid> --limit 5
# Serverless function logs (live tail)
twilio serverless:logs --service-sid <service-sid> --tail
# Live config snapshot
curl -u "$API_KEY:$API_SECRET" https://flex-api.twilio.com/v1/Configuration | jq '.ui_attributes.custom_data.features'
Symptom: Caller hears nothing / dead air after greeting
Most likely cause: Wait URL is broken, malformed, or unreachable.
Diagnose:
- Open the Studio Flow → click the Send to Flex widget → check the Wait URL.
- The URL should look like: https://<serverless-domain>/features/callback-and-voicemail-with-email/studio/wait-experience?WorkflowSid=<workflow-sid>
- Run curl -X POST <wait-url> from your terminal; expect a TwiML response (XML), not 404 or 5xx.
- Check serverless logs for errors at the time of the reported call.
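The curl check above can be made less ambiguous by looking at the HTTP status and the body together. The following is a minimal sketch: `classify_wait_response` is a hypothetical helper, and `WAIT_URL` in the usage comment is a placeholder for the URL copied from the widget. One assumption is flagged in the code: because the function file is named `.protected.js`, an unsigned request from a terminal may be rejected with 401 or 403 even when the URL itself is fine.

```shell
# classify_wait_response STATUS BODY
# Hypothetical helper: interprets the result of POSTing to the Wait URL.
#
# Usage (WAIT_URL is a placeholder):
#   status=$(curl -s -o /tmp/wait-body.xml -w '%{http_code}' -X POST "$WAIT_URL")
#   classify_wait_response "$status" "$(cat /tmp/wait-body.xml)"
classify_wait_response() {
  cw_status=$1
  cw_body=$2
  case $cw_status in
    404) echo "not-found: function is probably not deployed at this URL"; return 1 ;;
    # Assumption: protected Functions reject requests that lack a valid Twilio
    # signature, so this result alone does not prove the URL is broken.
    401|403) echo "auth-rejected: unsigned request refused; confirm via serverless logs"; return 1 ;;
    5??) echo "server-error: check serverless logs for this time window"; return 1 ;;
    2??) ;;
    *) echo "unexpected-status: $cw_status"; return 1 ;;
  esac
  # A healthy wait experience returns TwiML, whose root element is <Response>.
  case $cw_body in
    *"<Response"*) echo "ok: TwiML returned" ;;
    *) echo "not-twiml: 2xx status but no <Response> element in body"; return 1 ;;
  esac
}
```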
Common fixes:
- Serverless function not deployed in this environment → redeploy.
- Serverless service deleted → restore from git, redeploy.
- DNS/networking blip → retry; if persistent, check Twilio status page.
Symptom: Callbacks/voicemails arrive in the wrong queue
Most likely cause: Missing or wrong WorkflowSid query param in the Wait URL.
Diagnose:
- Open the Studio Flow → Send to Flex widget → Wait URL.
- Confirm ?WorkflowSid=<workflow-sid> is present and matches the queue you want callbacks/voicemails to land in.
Fix:
- Edit the Wait URL on the widget. Append/correct the WorkflowSid query param.
- Publish the flow (don't forget: unpublished changes don't take effect).
Background: wait-experience.protected.js falls back to a hardcoded H2H workflow SID when no query param is provided. Per-flow override is required. See GitHub issue WO-001 followup #2 for the architectural fix.
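The WorkflowSid check from the Diagnose steps can be sketched as two small shell helpers. Both names (`wait_url_workflow_sid`, `check_wait_url`) are hypothetical, and the expected SID is whatever PAC.md lists for the target queue's workflow.

```shell
# wait_url_workflow_sid URL
# Prints the value of the WorkflowSid query parameter, or fails if it is absent.
wait_url_workflow_sid() {
  wu_url=$1
  case $wu_url in
    *\?*) ;;            # has a query string
    *) return 1 ;;      # no query string at all
  esac
  wu_query=${wu_url#*\?}        # everything after the first "?"
  wu_query=${wu_query%%"#"*}    # drop any fragment
  wu_sid=$(printf '%s\n' "$wu_query" | tr '&' '\n' | sed -n 's/^WorkflowSid=//p' | head -n 1)
  [ -n "$wu_sid" ] || return 1
  printf '%s\n' "$wu_sid"
}

# check_wait_url URL EXPECTED_WORKFLOW_SID
# Reports whether the URL routes to the expected workflow.
check_wait_url() {
  if ! cw_actual=$(wait_url_workflow_sid "$1"); then
    # Without the param, the function uses its hardcoded fallback workflow.
    echo "missing"
    return 1
  fi
  if [ "$cw_actual" != "$2" ]; then
    echo "mismatch: URL has $cw_actual, expected $2"
    return 1
  fi
  echo "ok"
}
```

A `missing` result corresponds to the fallback case described in the background note; a `mismatch` means the param is present but points at a different workflow.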
Symptom: Admin email not arriving
Most likely causes (in order of frequency):
- Mailgun delay (1 to 6 hours is normal for newly added recipient addresses).
- Wrong API key: domain-scope vs master-scope mismatch.
- Email landing in spam.
- Domain not verified in Mailgun.
- ADMIN_EMAIL env var empty or malformed.
Diagnose:
- Check Mailgun dashboard → Logs for the relevant domain → filter by recipient.
- If "delivered" in Mailgun: it's the recipient's spam folder, mail server, or filtering. Out of Connie's control.
- If "rejected" or "failed": read the error reason. Most commonly invalid recipient or domain not verified.
- If no log entry at all: the function never called Mailgun. Check serverless logs for errors during the relevant call.
Common fixes:
- 401/403 from Mailgun → wrong key. Verify it's a domain-scoped sending key, not the account-master key.
- 400 from Mailgun → domain not verified. Check Mailgun dashboard for DNS verification status.
- No serverless log → check ADMIN_EMAIL env var. Empty string disables email.
- Recording attachment 404 on first attempt → normal. The function retries. If it fails repeatedly, check ACCOUNT_SID/AUTH_TOKEN for recording fetch.
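The "empty or malformed ADMIN_EMAIL" case is quick to rule out before digging into Mailgun. Below is a minimal sketch with a hypothetical helper, `check_admin_email`. It assumes the variable holds a single address and applies only a loose shape check (one `@`, no whitespace, a dotted domain); it does not prove the mailbox exists. The `.env` path in the usage comment is a placeholder; PAC.md lists the real env file for each client.

```shell
# check_admin_email VALUE
# Hypothetical helper: flags an empty or obviously malformed ADMIN_EMAIL.
#
# Usage (the env file path is a placeholder):
#   check_admin_email "$(grep '^ADMIN_EMAIL=' .env | cut -d= -f2-)"
check_admin_email() {
  ce_value=$1
  if [ -z "$ce_value" ]; then
    # Per this runbook, an empty string disables admin email entirely.
    echo "empty: email sending is disabled"
    return 1
  fi
  # Loose shape check only; assumes a single address, not a list.
  if printf '%s\n' "$ce_value" | grep -Eq '^[^@[:space:]]+@[^@[:space:]]+\.[^@[:space:]]+$'; then
    echo "ok"
  else
    echo "malformed: $ce_value"
    return 1
  fi
}
```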
Symptom: Voicemail recording is missing or empty in admin email
Most likely cause: The function's recording fetch happened before Twilio finished writing the recording.
Diagnose:
- Find the relevant call in twilio api:core:recordings:list.
- Confirm the recording has a non-zero duration and a valid uri.
- Check serverless logs for "fetch recording" errors: 404 or 401.
Common fixes:
- 404 → first-attempt timing issue. Function retries. If retries fail, increase the retry delay in the function source.
- 401 → wrong ACCOUNT_SID/AUTH_TOKEN. Note: recording fetch uses account credentials, not API key/secret. Confirm .env has the correct token.
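To reproduce the timing issue from a terminal, the retry-with-delay pattern can be sketched in shell. This is not the function's actual retry code (that lives in the JavaScript source, which is not shown here); `retry_with_delay` is a hypothetical helper, and `RECORDING_SID` in the usage comment is a placeholder. The usage fetches the recording resource with account credentials, so a persistent 401 points at the token and a persistent 404 points at the recording itself.

```shell
# retry_with_delay ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted, sleeping between tries.
#
# Usage (RECORDING_SID is a placeholder; -f makes curl fail on HTTP errors):
#   retry_with_delay 4 5 curl -sf -o /dev/null -u "$ACCOUNT_SID:$AUTH_TOKEN" \
#     "https://api.twilio.com/2010-04-01/Accounts/$ACCOUNT_SID/Recordings/$RECORDING_SID.json"
retry_with_delay() {
  rw_attempts=$1
  rw_delay=$2
  shift 2
  rw_n=1
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$rw_n" -ge "$rw_attempts" ]; then
      echo "giving up after $rw_n attempts" >&2
      return 1
    fi
    rw_n=$((rw_n + 1))
    sleep "$rw_delay"
  done
}
```

If the fetch only succeeds on a late attempt, that supports the "increase the retry delay" fix rather than a credentials problem.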
Symptom: Transcription text missing from admin email
Most likely cause: Twilio transcription service is async; the email may have been sent before transcription completed.
Diagnose:
- Wait 2 minutes after the original call. Check Mailgun for a follow-up email or check the agent task; the transcript may have arrived later.
- Check Twilio Voice → Transcriptions for the recording SID.
Common fixes:
- Transcription disabled → re-enable in the Studio Record Voicemail widget settings or the function recording config.
- Polly NTTS or transcription quota issues → check Twilio status / billing.
- Transcription completed but email already sent → expected behavior. The agent task gets the transcript on its async update; the admin email is a one-shot.
Symptom: Caller's CRM screen doesn't pop / shows generic page
Most likely cause: profile_url is not set on the task attributes, OR the CRM container url is hardcoded instead of using the Liquid template.
Diagnose:
- Open the agent dashboard during a test call. Right-click the task → Copy Attributes JSON. Look for profile_url.
- If profile_url is missing: the task creation logic isn't populating it. This is the upstream problem captured in GitHub issue #4.
- If profile_url is present but CRM still shows generic: check enhanced_crm_container.url in the live ui_attributes. It MUST be {{task.attributes.profile_url}}, a literal string with the Liquid braces.
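The "literal string" requirement in the last step is easy to misread, so here is a small sketch that checks it. `check_crm_url` is a hypothetical helper. The jq path in the usage comment is an assumption based on the toolkit's `.ui_attributes.custom_data.features` filter; adjust it to wherever `enhanced_crm_container` actually sits in your configuration.

```shell
# check_crm_url VALUE
# Hypothetical helper: verifies the CRM container url is the literal template.
#
# Usage (the jq path is an assumption, not confirmed by this runbook):
#   check_crm_url "$(curl -s -u "$API_KEY:$API_SECRET" \
#     https://flex-api.twilio.com/v1/Configuration \
#     | jq -r '.ui_attributes.custom_data.features.enhanced_crm_container.url')"
check_crm_url() {
  cc_url=$1
  # Single quotes keep the braces literal; nothing here is expanded by the shell.
  cc_expected='{{task.attributes.profile_url}}'
  case $cc_url in
    "$cc_expected") echo "ok: literal template" ;;
    ""|null) echo "missing: url is empty or not set"; return 1 ;;
    http://*|https://*) echo "hardcoded: $cc_url"; return 1 ;;
    *) echo "unexpected: $cc_url"; return 1 ;;
  esac
}
```

A `hardcoded` result maps to the first two fixes below, depending on whether the hardcoded value is in source, in live, or in both.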
Common fixes:
- Source file has a hardcoded URL → edit ui_attributes.<client>.json, set url to {{task.attributes.profile_url}}, redeploy with OVERWRITE_CONFIG=true.
- Live has hardcoded URL but source is correct → use /template-admin to fix live, then verify source matches.
- profile_url missing on task attributes → refer to GitHub #4 for the upstream fix; in the meantime callers see the no-task fallback URL.
Symptom: Configuration drift (source files don't match live)
Most likely cause: Someone wrote directly to the Configuration API, or made changes via /template-admin that weren't reflected in source files.
Diagnose:
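# Snapshot live custom_data with sorted keys so the diff ignores key order
curl -s -u "$API_KEY:$API_SECRET" https://flex-api.twilio.com/v1/Configuration | jq -S '.ui_attributes.custom_data' > live.json
# Compare the source file for this client against the live snapshot
diff <(jq -S '.custom_data' flex-config/ui_attributes.<client>.json) live.json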
Fix:
- Update source files to match live (the source of truth has implicitly become live). Commit.
- Or, if the live state is wrong: deploy with OVERWRITE_CONFIG=true to make source win.
- Either way, document why the drift happened so it doesn't recur.
Prevention: The deploy pipeline's merge({}, common, env, current) semantics make live always win on routine deploys. This is intentional. But it means source files can silently drift. A periodic drift check should be part of the monthly Connie audit.
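To see why live wins under merge({}, common, env, current), the merge order can be imitated with jq's recursive object merge (`*`), where the right-hand side takes precedence. This is an illustration only: it assumes the pipeline's merge is a deep, later-argument-wins merge, and jq replaces arrays wholesale, which may differ from the pipeline's behavior. `merge_config` and the sample keys are hypothetical.

```shell
# merge_config COMMON_JSON ENV_JSON CURRENT_JSON
# Illustrates later-argument-wins deep merging; not the pipeline's actual code.
merge_config() {
  jq -n -c -S \
    --argjson common "$1" \
    --argjson environment "$2" \
    --argjson current "$3" \
    '{} * $common * $environment * $current'
}

# Sample inputs (made-up keys): source says enabled=true, live says enabled=false.
common='{"features":{"callback":{"enabled":true,"retries":3}}}'
environment='{"features":{"callback":{"retries":5}}}'
current='{"features":{"callback":{"enabled":false}}}'

# The live value (enabled=false) survives; the env override (retries=5) applies
# only because live has no value for that key.
merge_config "$common" "$environment" "$current"
```

With these inputs the result keeps `enabled` as `false`, the live value, which is the silent drift the prevention note warns about: a source edit to a key that already exists in live has no effect on a routine deploy.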
Symptom: Cross-account contamination (CCT/DevSandbox config changed unexpectedly)
Severity: HIGH. Stop investigating other things and chase this first.
Most likely cause: A deploy ran with the wrong Twilio profile active.
Diagnose:
- Check git log for recent deploys.
- Compare CCT or DevSandbox PRE/POST snapshots from the most recent deploy.
- Identify which keys changed.
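The "which keys changed" step can be sketched with jq by flattening each snapshot into `path=value` lines and comparing the two lists. This assumes the PRE and POST snapshots are JSON files; the function names and file names are hypothetical.

```shell
# flatten_json FILE
# Prints one "dotted.path=value" line per leaf value, sorted.
flatten_json() {
  # An explicit type test keeps false and null leaves, which paths(scalars) would drop.
  jq -r 'paths(type != "object" and type != "array") as $p
         | "\($p | map(tostring) | join("."))=\(getpath($p) | tojson)"' "$1" | sort
}

# changed_keys PRE_FILE POST_FILE
# Prints the key paths whose value was added, removed, or modified.
changed_keys() {
  ck_pre=$(mktemp)
  ck_post=$(mktemp)
  flatten_json "$1" > "$ck_pre"
  flatten_json "$2" > "$ck_post"
  # comm -3 keeps lines unique to either file; strip its column indent, keep the key.
  comm -3 "$ck_pre" "$ck_post" | sed 's/^[[:space:]]*//' | cut -d= -f1 | sort -u
  rm -f "$ck_pre" "$ck_post"
}

# Usage (file names are placeholders):
#   changed_keys cct-pre.json cct-post.json
```

An empty result means the two snapshots carry the same leaf values, which suggests the contamination happened in a different deploy window than the one being compared.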
Fix:
- Revert the affected account by re-deploying its source files with OVERWRITE_CONFIG=true, OR by manually correcting via /template-admin.
- Document the incident in ~/projects/connie/rtc/docs/incidents/.
- Update CLAUDE.md if a new safety check is needed to prevent recurrence.
Prevention: The Phase 0.3 forensic-baseline pattern in Setup catches this before it reaches a live caller. Never skip baselines.
When to escalate
Escalate to CEO immediately if:
- Cross-account contamination is confirmed.
- Live agents are receiving tasks with wrong attributes (caller IDs swapped, queue routing wrong).
- Voicemail recordings are corrupted or missing across multiple calls.
- Mailgun is returning systemic 5xx errors that don't resolve in a retry.
- Any change request involves modifying the H2H direct-to-voicemail flow on NSS production (+17259999678).
For lower-severity issues, file a GitHub issue on ConnieML/basecamp-v26.02 and reference this troubleshooting guide.