SQL, a Head-of-Product dashboard, and an at-risk agent. Every decision points at the same thing: engagement that converts into the outcomes a health plan pays for.
A health plan pays for engagement that converts into what it is scored on: HEDIS/Star quality and avoidable medical cost, which drives retention. So all three parts answer one question in sequence:
The brief had three deliverables, but treating them as three disconnected exercises misses the point. The same business reality runs through all of them, so I let Part 1's analysis define Part 2's dashboard and Part 3's flag logic. The convergence at the end (two independent methods landing on the same 32 members) is the proof that the through-line is real, not imposed.
event_id), and two independent audit passes re-derived and reconciled every figure against a measured baseline. Defects logged to data/*_dq_report.md, cleaning re-derived in audit_*.md — cleaning as its own auditable deliverable, not a silent preprocessing step.orphan, and exclude them from roster metrics rather than silently dropping them.event_type still counts as one engagement; a NULL risk_score is never treated as high. Zero-filling would have quietly distorted every average.The messy data is the test. A clean-looking number built on silently dropped rows or zero-filled NULLs is the failure mode, so I treated cleaning as a first-class, independently re-derived deliverable. Every downstream number then traces back to a defensible call — and these two judgments are exactly what make the ED denominator 207 (not 210) and the cost averages honest on the next slide.
orphan and excluded from roster metrics. This is why my ED denominator is 207, not 210 (3 orphans: M0507/M0513/M0518). I declare it rather than letting the number look wrong.allowed_amount is unknown cost, skipped by SUM/AVG, never zero-filled; a NULL event_type still counts as one engagement; a NULL risk_score is never treated as high risk.(as_of − 1yr) + 1 day so 2024's leap day can't drop a New Year's event. A validator recomputes every headline number against an answer key: 36/36 PASS.Silently dropping rows or zero-filling NULLs produces a clean-looking number you can't defend. Because the dataset is deliberately seeded and scored, the messy data is the test, so the right move is to surface each call in the SQL comments and let a reviewer audit it. A number I can defend beats a tidier number I can't.
expected_annual_cost ≈ 2,733·risk + 4,355·ED − 1,760 an OLS fit of actual 2024 cost on risk + ED (the 484 scored members) — interpretable dollar drivers, not a black box. Plus Q3: 44.4% of ED claimants (92/207) had no engagement in the prior 30 days.
I built the single-axis engagement tier they asked for first — then the data killed it: cost is flat across tiers (1.2×), so engagement volume doesn't predict cost. So I kept engagement as the reachability axis and re-tiered on the outcome: a Cost-to-Serve regression that scores any member in dollars a clinician can interrogate. The move is the point — answer the literal question, prove it's insufficient, then build the one worth acting on.
Caveats stated up front, because the method is the deliverable: R²≈0.22 on synthetic data (real HCC/ACG models reach 0.4–0.5+), and same-period ED is a concurrent feature, so production would use trailing-period features (the cleaner v2 treats risk as the leakage-free stakes axis and ED as the action trigger, which is exactly what Part 3 keys on).
One honesty note worth saying out loud: Cost-to-Serve is only the avoidable-cost leg of plan value. The HEDIS/Star quality leg can’t be built on synthetic data, it needs the plan’s real gaps-in-care, ADT, and refill feeds. I name that gap rather than fake it.
Each tile is one hop on engagement → care-gaps closed → ED avoided → retention. Deliberately left off: DAU, total events, feature clicks, age/plan breakdowns. They don't move retention and they crowd the scan.
The Head of Product has one screen and 30 seconds, so the scarce resource is attention, not data. A metric earns a tile only if it sits on the path to a plan's scorecard and implies an action. Outcomes over output: I'd rather show three numbers that drive a decision than twelve that fill a grid.
Sensitivity vs. alert fatigue: hero rates move slowly, so a >5pt weekly drop is a real regression, not noise; but the ED money-metric flags amber on any rise, because under-reacting there is the expensive mistake. A data-freshness tile always sits in the corner, a stale dashboard is worse than none.
Two windows on purpose: the "engaged" hero uses 30 days (standard active window); the at-risk worklist uses a stricter 60-day "gone-dark" window. Different jobs, both labeled. And though the brief said no build was required, I shipped a self-contained HTML dashboard too, because the role's bias is to build, not just spec.
The brief said no build was required; I shipped a self-contained HTML dashboard anyway. The three heroes (63.4% / 32 / 44.4%), the Cost-to-Serve panel (13× spread), and the cost × 60-day-reach worklist (priority cell = 32) — the same numbers, on one screen, with the alert rules along the bottom.
reason_codes per member. 32 flagged. No model in this step.risk_score, days-since-engagement, last acute claim, events in 60d. No name, address, or diagnosis text.review_queue.json. Nothing leaves until a human runs --release-approved.A deterministic orchestrator, not an autonomous loop: five fixed contracts mean every step logs, tests, and evaluates in isolation. "What to do next" was never the hard part here, so the loop's freedom would add risk without value.
risk_score ≥ p75 (≈3.37) AND zero engagement in the last 60 days. High-risk members who have gone dark.p90 (≈$16,719), in the last 60 days with zero engagement after it. Acute events nobody followed up on.No single signal is enough: engagement alone is flat on cost, risk alone over-flags members who are already engaged. The actionable list is the intersection of clinical risk, behavioral disengagement, and acute utilization. Percentile thresholds make it reproducible and let a medical director argue p80/p95 against a clear reference instead of a hand-picked number.
days_since_last_acute_claim, not the engagement gap; no unsupported "no claims" absence.{pass, needs_human_review, violations}. Three states: pass / uncertain / fail. Catches invented facts, clinical advice, second-person address, tone.fail feeds violations back for exactly one repair attempt. uncertain never retries; it passes but elevates for human review.LLMs are unreliable at exact equality. The real bug, caught in testing: a note said "32 days since the claim" using the engagement field. An LLM critic asked "is the number present?" passes it. The regex guard catches it because 32 ≠ days_since_last_acute_claim. Same split as a reconciliation agent I shipped: LLM reasons, exact matching does the math, ~70% → 99.2%.
Then the eval earned its keep: QC caught two real fact-fidelity bugs the critic had been passing, a day-count on the wrong field and a "no claims" overclaim on history-only members. Both are now blocked by deterministic guards. The lesson: LLM fact-fidelity is long-tail, deterministic guards cover the known classes, an eval harness plus mandatory human review cover the unknown ones.
normal · queued, releasable after routine approvalblocked · not releasable until a human rewrites itelevated · releasable, but scrutinize first--notify + webhook, else stagedSilent degradation, or an agent autonomously contacting a fragile member, is the most expensive failure mode. So the footer is re-appended programmatically if dropped, a critic error becomes critic_unavailable instead of a crash or silent pass, and dry-run is the default (zero network, zero cost). The footer is a reminder; the queue is the gate. "The agent decided to contact a member" is structurally impossible. I also mirrored the whole pipeline as an importable n8n workflow (19 nodes, with a Slack send-and-wait-for-approval node, authored not run, Python is the tested source of truth) — so the same human-approval gate is expressible as a no-code workflow a non-engineer can run, not just Python. In production it runs in a BAA-covered environment on minimum-necessary fields only — no name, address, or diagnosis text.
The two 32-member sets overlap on 24. The same reality, high stakes and gone dark, surfaced two independent ways.
This is the payoff of the through-line. I didn't tune one method to match the other; they were built separately, a regression in SQL and behavioral rules in Python, and they converge because they're both reading a real pattern in the data. Strong agreement is evidence the cohort is right; the 8-member disagreement is the honest edge where the two definitions differ, which is itself a useful thing to review.
--live; the key is read at runtime only; PHI is minimum-necessary; production runs under a BAA against a read-replica.
✗ Rejected live-by-default: an agent that can spend or send on its first run is a footgun.
flagged → alerted → dispositioned) so a run drafts only on new transitions; a spike circuit-breaker (one digest, not N pages); read-only against a replica.The demo proves the judgment split; production needs the data and execution scaffolding so repeated runs don't re-alert the standing cohort or add load to the operational database.
The real bottleneck isn't compute. It's ground truth (no labeled "right action"), per-member context, and keeping a human in the loop affordably as volume grows. The roadmap is ordered by that list.
“Put the probabilistic model where judgment helps. Keep determinism where correctness is non-negotiable. Gate the spend, and the outreach, with people.”
SQL found the worklist, the dashboard made it a Monday decision, the agent acted on it safely. That boundary judgment is how I'd build the Next Best Action engine.