01 / 19
HEALTHCARE AGENT · AI PRODUCT · DESIGN NOTES

Three parts.
One through-line.

SQL, a Head-of-Product dashboard, and an at-risk agent. Every decision points at the same thing: engagement that converts into the outcomes a health plan pays for.

The through-line

Engagement is the lever, not the scoreboard.

A health plan pays for engagement that converts into what it is scored on: HEDIS/Star quality and avoidable medical cost, which drives retention. So all three parts answer one question in sequence:

1SQL — find the worklistRead the data honestly, then tier members by value at stake, not by activity. Output: the 32 who matter.
2Dashboard — make it a Monday decisionOne screen, 30 seconds: where does the Head of Product point the team this week?
3Agent — act on it safelyFlag deterministically, draft the why with an LLM, and send nothing without a human.
Why · the thinking

The brief had three deliverables, but treating them as three disconnected exercises misses the point. The same business reality runs through all of them, so I let Part 1's analysis define Part 2's dashboard and Part 3's flag logic. The convergence at the end (two independent methods landing on the same 32 members) is the proof that the through-line is real, not imposed.

Part 1 · SQL · cleaning came first

Before a single query: find every defect, fix none of it silently.

·The mess (deliberately seeded)Three tables — members (500), claims (1,109), events (12,617 raw) — seeded with 3 date formats, category casing/whitespace variants, 23 duplicate events (12,617→12,594 after a lossless dedup), 50 orphan foreign keys (18 IDs past the roster, M0502–M0519), and NULLs across cost / type / risk (33 / 343 / 16).
·Clean + audit, cross-checkedThree cleaning passes normalized it (month-first date parsing — verified, not assumed: 148 of 242 non-ISO dates have a day-field > 12, so day-first wouldn't parse; category normalization; dedup by event_id), and two independent audit passes re-derived and reconciled every figure against a measured baseline. Defects logged to data/*_dq_report.md, cleaning re-derived in audit_*.md — cleaning as its own auditable deliverable, not a silent preprocessing step.
·First judgment: orphans aren't dirtMember IDs past the roster ceiling (M0502–M0519) are post-snapshot enrollees, not bad data — so I keep them, flag them orphan, and exclude them from roster metrics rather than silently dropping them.
·First judgment: missing ≠ 0A NULL cost is unknown, not zero; a NULL event_type still counts as one engagement; a NULL risk_score is never treated as high. Zero-filling would have quietly distorted every average.
Why · the thinking

The messy data is the test. A clean-looking number built on silently dropped rows or zero-filled NULLs is the failure mode, so I treated cleaning as a first-class, independently re-derived deliverable. Every downstream number then traces back to a defensible call — and these two judgments are exactly what make the ED denominator 207 (not 210) and the cost averages honest on the next slide.

Part 1 · SQL · the calls inside the queries

Then every judgment that moves a number, declared in the SQL.

·Orphans kept and flagged, not droppedClaim/event rows for members outside the 500 roster (M0502–M0519) are labeled orphan and excluded from roster metrics. This is why my ED denominator is 207, not 210 (3 orphans: M0507/M0513/M0518). I declare it rather than letting the number look wrong.
·NULL ≠ 0A NULL allowed_amount is unknown cost, skipped by SUM/AVG, never zero-filled; a NULL event_type still counts as one engagement; a NULL risk_score is never treated as high risk.
·Breakpoints from the dataEngagement tertiles (Low ≤13 / Medium 14–23 / High ≥24) are the actual cut points in the data, not round numbers I picked.
·Leap-year-safe windows · self-checkedThe 12-month window is (as_of − 1yr) + 1 day so 2024's leap day can't drop a New Year's event. A validator recomputes every headline number against an answer key: 36/36 PASS.
Why · the thinking

Silently dropping rows or zero-filling NULLs produces a clean-looking number you can't defend. Because the dataset is deliberately seeded and scored, the messy data is the test, so the right move is to surface each call in the SQL comments and let a reviewer audit it. A number I can defend beats a tidier number I can't.

Part 1 · SQL · the analytical move

Answer the literal question, then upgrade it to one worth acting on.

① Engagement tierthe literal Q2 ask · single axis (activity)cost flat 1.2× · $5.3k / 6.4k / 5.6k / 6.1k
② Cost-to-Serve tierre-tier on outcome · risk + EDcost spreads 13× ($1,598 → $20,808)
③ Action matrixtwo axes · cost × 60-day reachabilitypriority cell = 32

expected_annual_cost ≈ 2,733·risk + 4,355·ED − 1,760   an OLS fit of actual 2024 cost on risk + ED (the 484 scored members) — interpretable dollar drivers, not a black box. Plus Q3: 44.4% of ED claimants (92/207) had no engagement in the prior 30 days.

Why · the thinking

I built the single-axis engagement tier they asked for first — then the data killed it: cost is flat across tiers (1.2×), so engagement volume doesn't predict cost. So I kept engagement as the reachability axis and re-tiered on the outcome: a Cost-to-Serve regression that scores any member in dollars a clinician can interrogate. The move is the point — answer the literal question, prove it's insufficient, then build the one worth acting on.

Caveats stated up front, because the method is the deliverable: R²≈0.22 on synthetic data (real HCC/ACG models reach 0.4–0.5+), and same-period ED is a concurrent feature, so production would use trailing-period features (the cleaner v2 treats risk as the leakage-free stakes axis and ED as the action trigger, which is exactly what Part 3 keys on).

One honesty note worth saying out loud: Cost-to-Serve is only the avoidable-cost leg of plan value. The HEDIS/Star quality leg can’t be built on synthetic data, it needs the plan’s real gaps-in-care, ADT, and refill feeds. I name that gap rather than fake it.

Part 2 · Dashboard · three heroes, and what I cut

One screen, 30 seconds: where do I point the team this week?

63.4%Engaged-Member Rate (317/500, 30d) — top of funnel; falls first, everything lags it
32At-Risk Cohort — the literal call list this week
44.4%ED, no prior engagement (92/207) — the money metric, clearest line to avoidable cost

Each tile is one hop on engagement → care-gaps closed → ED avoided → retention. Deliberately left off: DAU, total events, feature clicks, age/plan breakdowns. They don't move retention and they crowd the scan.

Why · the thinking

The Head of Product has one screen and 30 seconds, so the scarce resource is attention, not data. A metric earns a tile only if it sits on the path to a plan's scorecard and implies an action. Outcomes over output: I'd rather show three numbers that drive a decision than twelve that fill a grid.

Part 2 · Dashboard · alerts, layout, trade-offs

When a tile flags, and the calls behind the layout.

Engaged-Member Rateamber down 2–5 pts · red down > 5 pts
At-Risk Cohortamber up 1–10% · red up > 10%
ED, no prior engagementamber on any increase · red if cohort also red
Data freshnessamber rows shift > 2% · red any refresh failure
Why · the trade-offs

Sensitivity vs. alert fatigue: hero rates move slowly, so a >5pt weekly drop is a real regression, not noise; but the ED money-metric flags amber on any rise, because under-reacting there is the expensive mistake. A data-freshness tile always sits in the corner, a stale dashboard is worse than none.

Two windows on purpose: the "engaged" hero uses 30 days (standard active window); the at-risk worklist uses a stricter 60-day "gone-dark" window. Different jobs, both labeled. And though the brief said no build was required, I shipped a self-contained HTML dashboard too, because the role's bias is to build, not just spec.

Part 2 · Dashboard · what I shipped

Not just a brief. A working dashboard, rendered from the cleaned data.

The Monday dashboard — three heroes, Cost-to-Serve panel, and the cost-by-reach worklist

The brief said no build was required; I shipped a self-contained HTML dashboard anyway. The three heroes (63.4% / 32 / 44.4%), the Cost-to-Serve panel (13× spread), and the cost × 60-day-reach worklist (priority cell = 32) — the same numbers, on one screen, with the alert rules along the bottom.

Part 3 · Agent · the key call

The LLM never decides who is at risk. Rules do.

Datamembers · claims · events
RulesR1 / R2 flag the cohort
LLMdrafts the why only
Supervisorguard + critic
Gatehuman releases
For a vulnerable population the cohort must be auditable and arguable, so the model is kept off the decision. Paths considered: Let the LLM decide who's at risk — the cohort becomes a black box you can't reproduce, and a hallucinated inclusion is a real member getting a real call. A pure ML risk score — a single opaque number a medical director can't argue with; here the cutoffs are percentiles a clinician can move. ✓ Deterministic percentile rules pick the cohort; the LLM only writes the human-readable why; a structural gate means a person always sends. Correctness in code, language in the model, the decision with a human.
Part 3 · Agent · the pipeline

Five steps, each with a clear contract.

1Ingest (read-only)Orphans excluded, rows restricted to the roster, missing ≠ 0. The agent never holds write access.
2Flag (deterministic, reason-coded)Two percentile rules produce reason_codes per member. 32 flagged. No model in this step.
3Structured facts (minimum-necessary)Only risk_score, days-since-engagement, last acute claim, events in 60d. No name, address, or diagnosis text.
4Draft (gpt-4o-mini)Care-coordinator audience, never member-facing; 3–4 sentences; the why from given facts only; one next-best-action; no clinical advice.
5Supervise + gateTwo-layer QC, then review_queue.json. Nothing leaves until a human runs --release-approved.
Why · the thinking

A deterministic orchestrator, not an autonomous loop: five fixed contracts mean every step logs, tests, and evaluates in isolation. "What to do next" was never the hard part here, so the loop's freedom would add risk without value.

Part 3 · Agent · the cohort

Two transparent rules. 32 members. Numbers a clinician can move.

R1dormant_high_risk → 26 membersrisk_score ≥ p75 (≈3.37) AND zero engagement in the last 60 days. High-risk members who have gone dark.
R2post_acute_no_followup → 11 membersAn ED claim, or a claim ≥ p90 (≈$16,719), in the last 60 days with zero engagement after it. Acute events nobody followed up on.
32flagged (union; 5 fire both)
p75 / p90thresholds from the data, arguable
6.4%of the 500-member roster
Why · the thinking

No single signal is enough: engagement alone is flat on cost, risk alone over-flags members who are already engaged. The actionable list is the intersection of clinical risk, behavioral disengagement, and acute utilization. Percentile thresholds make it reproducible and let a medical director argue p80/p95 against a clear reference instead of a hand-picked number.

Part 3 · Agent · who owns correctness

Two reviewers, split by what each is actually good at.

Deterministic guard owns exact, rule-based invariants · LLM critic owns semantic fidelity
GDeterministic guard — always on, zero networkFooter present; no cost-editorialization on an un-sourced dollar amount; a "N days since the claim" count must equal days_since_last_acute_claim, not the engagement gap; no unsupported "no claims" absence.
CLLM critic — independent passReturns {pass, needs_human_review, violations}. Three states: pass / uncertain / fail. Catches invented facts, clinical advice, second-person address, tone.
Bounded self-heal — one retryA hard fail feeds violations back for exactly one repair attempt. uncertain never retries; it passes but elevates for human review.
Why · the thinking

LLMs are unreliable at exact equality. The real bug, caught in testing: a note said "32 days since the claim" using the engagement field. An LLM critic asked "is the number present?" passes it. The regex guard catches it because 32 ≠ days_since_last_acute_claim. Same split as a reconciliation agent I shipped: LLM reasons, exact matching does the math, ~70% → 99.2%.

Part 3 · Agent · eval, the calibration that was the real work

A critic is only useful once it stops crying wolf.

32 flagged notes critic = independent pass negative tests retained recall held constant
Critic v1 · uncalibrated 100% false-hold rate, every valid note flagged 3 false-positive classes: required outreach read as clinical advice · third-person read as second-person · date formats read as invented
Critic v3 · calibrated 0/32 false-holds, with recall intact Negative tests still catch invented facts, clinical advice, second-person address, and a bare "high risk" with no grounding.
Why · the thinking

Then the eval earned its keep: QC caught two real fact-fidelity bugs the critic had been passing, a day-count on the wrong field and a "no claims" overclaim on history-only members. Both are now blocked by deterministic guards. The lesson: LLM fact-fidelity is long-tail, deterministic guards cover the known classes, an eval harness plus mandatory human review cover the unknown ones.

Part 3 · Agent · the human gate, enforced

The run sends nothing. A human always does.

QC passednormal · queued, releasable after routine approval
held_failedblocked · not releasable until a human rewrites it
flagged_for_review / critic_unavailableelevated · releasable, but scrutinize first
release + notifySlack fires only with --notify + webhook, else staged
Why · the thinking

Silent degradation, or an agent autonomously contacting a fragile member, is the most expensive failure mode. So the footer is re-appended programmatically if dropped, a critic error becomes critic_unavailable instead of a crash or silent pass, and dry-run is the default (zero network, zero cost). The footer is a reminder; the queue is the gate. "The agent decided to contact a member" is structurally impossible. I also mirrored the whole pipeline as an importable n8n workflow (19 nodes, with a Slack send-and-wait-for-approval node, authored not run, Python is the tested source of truth) — so the same human-approval gate is expressible as a no-code workflow a non-engineer can run, not just Python. In production it runs in a BAA-covered environment on minimum-necessary fields only — no name, address, or diagnosis text.

Self-consistency · the parts agree

Two independent methods, one worklist.

Part 1 · a dollar formula 32 high-cost × disengaged (Cost-to-Serve × 60-day reach)
Part 3 · two behavioral rules 32 R1 dormant-high-risk ∪ R2 post-acute-no-followup

The two 32-member sets overlap on 24. The same reality, high stakes and gone dark, surfaced two independent ways.

Why · the thinking

This is the payoff of the through-line. I didn't tune one method to match the other; they were built separately, a regression in SQL and behavioral rules in Python, and they converge because they're both reading a real pattern in the data. Strong agreement is evidence the cohort is right; the 8-member disagreement is the honest edge where the two definitions differ, which is itself a useful thing to review.

Tech choices

Every choice, with what it rejected.

StackDuckDB + pandas + Python 3.14 Runs over the cleaned CSVs with zero infrastructure; a self-checking harness recomputes every headline number (36/36 pass). ✗ Rejected a warehouse / notebook stack for a build this size: the complexity budget belongs in the analysis, not the plumbing.
Modelgpt-4o-mini · temp 0.2 · 320 tok · 1 retry The drafting job is language, not judgment, so a small model with a tight token cap is the right cost. ✗ Rejected a frontier model everywhere: writing a 3-sentence note from given facts doesn't need it, and cost scales with the member base.
QCDeterministic guard + independent LLM critic Split by failure mode: the guard owns exact invariants, the critic owns semantics. ✗ Rejected "ask the critic to check everything": LLMs miss exact-equality bugs, which is exactly where member trust breaks.
SafetyDry-run default · key never logged · BAA path No network or spend unless you pass --live; the key is read at runtime only; PHI is minimum-necessary; production runs under a BAA against a read-replica. ✗ Rejected live-by-default: an agent that can spend or send on its first run is a footgun.
Demo → production · I

Foundations: a learned model, real-time triggers, incremental runs.

Learned cost model
ADT triggers
Incremental state
Outcome eval
Tiered autonomy
NBA engine
1Learned cost modelSwap the OLS score for a gradient-boosted / survival model on trailing-period features (prior-year risk, ED, diagnosis mix). This removes the demo's concurrent-ED leakage; real HCC/ACG models reach 0.4–0.5+ R².
2Real-time ADT triggersFire R2 at the hospital discharge event, the highest-leverage follow-up window, instead of a weekly batch.
3Incremental state machine + alert controlAn ingest watermark plus per-member alert state (flagged → alerted → dispositioned) so a run drafts only on new transitions; a spike circuit-breaker (one digest, not N pages); read-only against a replica.
Why · the thinking

The demo proves the judgment split; production needs the data and execution scaffolding so repeated runs don't re-alert the standing cohort or add load to the operational database.

Demo → production · II

Then eval, autonomy tiers, and the bridge to the Next Best Action engine.

Learned cost model
ADT triggers
Incremental state
Outcome eval
Tiered autonomy
NBA engine
4Outcome backtesting + trace / versioningLabel whether flagged members later had a gap-in-care closure or ED revisit, and tune thresholds against outcomes, not by eye. Add trace observability, prompt/version pinning, and a regression set any change must pass before shipping.
5Tiered autonomy at scaleHand-approving every send doesn't scale. Auto-send low-risk nudges, force human review on clinical-adjacent ones, gated by critic confidence. Keep the human where the stakes are.
6The same shape powers a production next-best-action engineIngest claims / gaps-in-care / refill / ADT → make it legible for an agent → the agent picks the next best action per member → eval against outcomes → human in the loop where it matters. This project is that pattern at small scale.
Why · the thinking

The real bottleneck isn't compute. It's ground truth (no labeled "right action"), per-member context, and keeping a human in the loop affordably as volume grows. The roadmap is ordered by that list.

“Put the probabilistic model where judgment helps. Keep determinism where correctness is non-negotiable. Gate the spend, and the outreach, with people.”

SQL found the worklist, the dashboard made it a Monday decision, the agent acted on it safely. That boundary judgment is how I'd build the Next Best Action engine.

Healthcare Agent · Design Notes · Matt Zheng · 2026