Architecture

Cross-cutting descriptive WHY for imperfect-api. Read by a future engineer onboarding to the service and by an LLM running /research over the prior-art graph. Pairs with AGENTS.md (the terse module/env/command map) and docs/learnings.md (retrospective WHY across issues/PRs); references back to specific learnings entries inline as #N.

System overview + alice integration boundary

imperfect-api is a FastAPI service that owns the user-facing API contract (auth, onboarding, dashboards, notifications) and delegates work to sibling services:

alice — owns every health-provider OAuth concern (credentials, PKCE, token storage, refresh, revocation, vendor webhooks, backfill) and publishes ingestion events on Redis pub/sub.
cheshire — owns natural-language reasoning over each user's fitness data: turns "how did I sleep this week?" or "design a weekly plan" into a text answer or chart by generating SQL/Python, executing against the Postgres mirror, and returning plain text (see Cheshire integration boundary below).
hare — owns the dossier source-event log and downstream wakeup queue. imperfect-api emits source events for user-authored messages across its conversation surfaces (onboarding intake, race-rec, Home check-ins, board replies) and replays historical user-authored messages via a CLI backfill.

The shape with alice:

┌─────────────────────────────────────────────────────────────┐
│                      imperfect-api                          │
│                                                             │
│  - Orchestrate OAuth flow (connect + callback endpoints)    │
│  - User ↔ vendor mapping (HealthProviderConnection)         │
│  - Delegate OAuth + PKCE to alice                           │
│  - Delegate disconnect to alice                             │
│  - Read health data via alice's HTTP API                    │
│  - React to alice events via Redis pub/sub                  │
└─────────────────────────────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────┐
│                         alice                               │
│                                                             │
│  - Holds provider OAuth credentials                         │
│  - PKCE generation (state, code_verifier, code_challenge)   │
│  - Authorization URL construction                           │
│  - Code exchange + token storage (Valkey/Redis)             │
│  - Token refresh                                            │
│  - Revocation with providers (Garmin deregistration)        │
│  - Webhooks & backfill                                      │
│  - Publishes `alice:events:*` on Redis pub/sub              │
└─────────────────────────────────────────────────────────────┘

Why alice owns the credentials. Vendor OAuth credentials (client IDs, client secrets, webhook signing keys) are scarce, per-environment, and audited by the vendor. Keeping them in one service means there is one place to rotate them, one place to point a vendor's allowlist at, and one process boundary that an attacker has to cross to exfiltrate them. imperfect-api stays credential-free by design — the only secret it holds for a vendor is the user's provider_user_id, which is useless without the tokens alice keeps. The rule generalizes: any new vendor with OAuth tokens delegates to alice (#105/#121).

Why imperfect-api still owns the user↔vendor mapping. Tokens live in alice, but who the tokens belong to is a product concern, not a vendor concern. HealthProviderConnection (app/models/health_provider_connection.py) maps our user_id to the vendor's provider_user_id and carries product state — connection status, permissions, last sync, error messages, the brand discriminator for multi-brand vendors like Terra. That state belongs next to User, not next to the token, because:

Onboarding completion gates on the HPC row's existence, not on whether tokens exist in alice (#290 — see the Onboarding shape section below).
Webhooks coming back from alice key on provider_user_id; imperfect-api needs the reverse index to route the event to a user. HealthProviderConnection.get_by_provider_user_id() is the seam.
Reconnects must be idempotent on (user_id, provider, brand) (#427); modelling that on imperfect-api's side keeps the rule in one place, instead of duplicating "what does it mean for this user to already be connected" across two services.

Why we delegate disconnect, not just connect. alice owns the revocation call to the upstream vendor (e.g. Garmin's deregistration endpoint). If imperfect-api revoked locally and alice held a now-zombie token, the next refresh would 401 silently and the user's connection status would drift. The contract is: DELETE /health-providers/{provider} first asks alice to revoke + drop the token, and only then deletes the local mapping. If alice fails, we don't half-disconnect.
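The ordering can be sketched as follows. This is a minimal illustration, not the service's handler: `disconnect_provider`, `revoke_with_alice`, `delete_local_mapping`, and `AliceRevokeError` are hypothetical names standing in for the real alice client call and the model delete.

```python
from typing import Callable


class AliceRevokeError(Exception):
    """Hypothetical error raised when alice cannot revoke the token."""


def disconnect_provider(
    revoke_with_alice: Callable[[], None],
    delete_local_mapping: Callable[[], None],
) -> None:
    # Step 1: alice revokes with the vendor and drops the token.
    # If this raises, we stop here and the local mapping stays untouched.
    revoke_with_alice()
    # Step 2: only after alice succeeded do we delete the local mapping.
    delete_local_mapping()
```

Because the local delete is the last step, a failure in alice leaves the connection fully intact rather than half-disconnected.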

Vendor-string translation lives at the boundary. alice's wire vendor for Apple is apple; imperfect-api's enum is apple_health. The translation happens once, in the DataStoredEvent.normalize_vendor validator at the subscriber edge (#316/#317). Internal code never sees alice's naming. The same rule covers any other vendor-string drift: translate at the boundary, never leak the sibling service's vocabulary into our models.
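A minimal sketch of translating once at the edge. The real seam is the DataStoredEvent.normalize_vendor validator; the plain function and the `ALICE_TO_INTERNAL_VENDOR` table below are illustrative assumptions, and only the apple to apple_health pair comes from this document.

```python
# Hypothetical lookup table from alice's wire vendor strings to our enum values.
# Only the "apple" -> "apple_health" entry is stated in this document.
ALICE_TO_INTERNAL_VENDOR = {"apple": "apple_health"}


def normalize_vendor(wire_vendor: str) -> str:
    """Translate alice's vendor string once, at the subscriber edge."""
    # Vendors without a known drift pass through unchanged.
    return ALICE_TO_INTERNAL_VENDOR.get(wire_vendor, wire_vendor)
```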

Cheshire integration boundary

Cheshire is imperfect-api's second sibling service. Where alice owns ingestion + storage of health data, cheshire owns reasoning over it: it takes a natural-language prompt + a user identity, generates SQL/Python on the fly, executes against the same Postgres mirror alice populates, and returns plain-text answers or NDJSON-streamed chart payloads. The HTTP boundary lives in app/lib/cheshire/client.py.

┌─────────────────────────────────────────────────────────────┐
│                      imperfect-api                          │
│                                                             │
│  - Build the prompt (profile, situations, prior plan)       │
│  - Pick the data path (cross-provider /ask, per-provider    │
│    /visualize, cross-provider /visualize, conversations)    │
│  - Stream NDJSON back to the mobile client                  │
│  - Run a local Sonnet formatter when shape is required      │
│  - Fall back to a local agent when cheshire is unavailable  │
└─────────────────────────────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────┐
│                        cheshire                             │
│                                                             │
│  - Holds the SQL/Python codegen model + prompts             │
│  - Reads the Postgres health mirror (RLS-scoped)            │
│  - Returns plain-text answers or NDJSON chart frames        │
│  - Heartbeats every 10s on streaming endpoints              │
└─────────────────────────────────────────────────────────────┘

Why delegate reasoning instead of running it in-process. The codegen path needs the full Postgres mirror, vendor-specific prompt templates, and a model that's been tuned against this dataset. Co-locating those with the database service (cheshire) means SQL never crosses a network hop except when imperfect-api needs an answer, and the cheshire team can iterate on prompts/models without redeploying imperfect-api. The corollary in #492: when delegating reasoning to a sibling service, keep the local agent as a gated fallback, not a deletion. The weekly plan calls cheshire's POST /users/{user_id}/ask first, structures the prose via a local Sonnet formatter into WeeklyPlanSchema, and only falls back to the in-process planning_agent if cheshire fails, the user has no connected provider, or the formatter rejects the output. Deletion would have made cheshire outages user-visible; gated fallback keeps the product functional with a degraded experience.
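The gating order can be sketched like this. It is an illustration under assumptions: `generate_weekly_plan` and its callable parameters are hypothetical, and a formatter rejection is modelled here as a `None` return, which may not match how the real formatter signals failure.

```python
from typing import Callable, Optional


def generate_weekly_plan(
    has_connected_provider: bool,
    ask_cheshire: Callable[[], str],
    format_plan: Callable[[str], Optional[dict]],
    local_agent: Callable[[], dict],
) -> dict:
    # No connected provider: cheshire has nothing to reason over.
    if not has_connected_provider:
        return local_agent()
    try:
        prose = ask_cheshire()
    except Exception:
        # Cheshire failed at runtime: degrade to the in-process agent.
        return local_agent()
    plan = format_plan(prose)
    if plan is None:
        # Formatter rejected the prose: degrade rather than ship a bad shape.
        return local_agent()
    return plan
```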

Migration shape: one PR, no dual-mode. Provider notifications were migrated to cheshire in a single PR — the old formatting system was killed in the same change, not parked behind a feature flag (#349). The design rule: migrations that replace a path entirely ship as a single PR; dual-mode shipping invites drift between the two implementations, and the old one rots while no one is looking. The gated fallback in #492 is not the same shape — it's a degradation path for runtime failure, not a parallel implementation kept alive for opt-in.

Vendor → cheshire data_type translation is server-side, never client-trusted. Apple's cheshire data_type is workouts, not activities (#466). The per-vendor viz_data_type_for mapping is applied in both the notification path and the public /cheshire/visualize proxy — the proxy was previously forwarding the request body verbatim, which let a malformed client bypass the translation entirely. The general rule: service-to-service vocabulary translation belongs on imperfect-api, not on the mobile client. The client speaks our enum (apple_health, activities); imperfect-api translates to cheshire's vocabulary (apple, workouts) at the proxy. The translation only applies to the single-vendor /visualize path; cross-provider /visualize (issue #785) drops data_type entirely because the route spans every linked vendor.
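A minimal sketch of the server-side translation at the proxy. The signature of `viz_data_type_for`, the two lookup tables, and the `to_cheshire_terms` helper are assumptions for illustration; the only pairs taken from this document are apple_health to apple and Apple activities to workouts.

```python
from typing import Dict, Tuple

# Hypothetical tables; entries beyond the Apple ones are not stated in this document.
CHESHIRE_VENDOR: Dict[str, str] = {"apple_health": "apple"}
DATA_TYPE_OVERRIDES: Dict[Tuple[str, str], str] = {
    ("apple_health", "activities"): "workouts",
}


def viz_data_type_for(vendor: str, data_type: str) -> str:
    """Map our (vendor, data_type) enum pair to cheshire's data_type."""
    return DATA_TYPE_OVERRIDES.get((vendor, data_type), data_type)


def to_cheshire_terms(vendor: str, data_type: str) -> Tuple[str, str]:
    """Translate what the client sent into cheshire's vocabulary."""
    return CHESHIRE_VENDOR.get(vendor, vendor), viz_data_type_for(vendor, data_type)
```

Applying the same helper in both the notification path and the public proxy is what keeps a client-supplied body from bypassing the mapping.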

Cross-provider is the default; primary-vendor scope is only for charts with row context. Every /ask call from imperfect-api hits cheshire's cross-provider route POST /users/{user_id}/ask with the full providers: [...] list in the body (#785). The old single-vendor /users/{user_id}/{vendor}/{provider_user_id}/ask froze on a primary vendor and RLS-scoped every query to it — for users who reconnected to a different provider (e.g., Garmin → Terra/WHOOP + Apple), cheshire correctly reported "blackout" while the notification pipeline still had real data and shipped a confident push paired with a contradictory sync-blackout context card. Same routing applies to /visualize: multi-provider users hit cross-provider POST /users/{user_id}/visualize so the chart spans every linked vendor instead of being scoped to a stale primary. The only survivor of the single-vendor visualize path is the row-scoped Garmin/Terra GPS map (/users/{user_id}/{vendor}/{provider_user_id}/{data_type}/{row_id}/visualize), which cross-provider doesn't support and which only makes sense for one vendor at a time anyway — so single-provider users keep that path and multi-provider users get an LLM-rendered chart instead. The rule: the cross-provider route is the default; reach for single-vendor only when the feature is intrinsically per-vendor (GPS routes today).
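The routing rule for /visualize can be sketched as a small path chooser. This is a simplification under stated assumptions: `visualize_path` is a hypothetical helper, providers are passed as (vendor, provider_user_id) pairs already in cheshire's vocabulary, and the Garmin/Terra restriction on the GPS map is left out.

```python
from typing import List, Optional, Tuple


def visualize_path(
    user_id: str,
    providers: List[Tuple[str, str]],
    data_type: Optional[str] = None,
    row_id: Optional[str] = None,
) -> str:
    # Single-vendor scope survives only for row-scoped charts (GPS maps today),
    # and only for users with exactly one linked provider.
    if len(providers) == 1 and data_type is not None and row_id is not None:
        vendor, provider_user_id = providers[0]
        return (
            f"/users/{user_id}/{vendor}/{provider_user_id}"
            f"/{data_type}/{row_id}/visualize"
        )
    # Default: the cross-provider route, which spans every linked vendor.
    return f"/users/{user_id}/visualize"
```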

Public route progress is intent-first, localized second. The browser route planner (imperfect.co/routes, parent #989) receives canonical progress intents, not Cheshire internals. viewer_locale localizes browser chrome and replayed progress; generation_locale is persisted with the route run and forwarded to Cheshire for route prose, units, metadata, and rendered artifacts. Raw thinking, SQL/code, tool arguments, signed URLs, provider ids, exact private coordinates, and debug links never cross the browser boundary. The testable contract lives in docs/public-route-contract.md and app/lib/public_routes/contract.py, including CJK and Arabic-script font stacks for Cheshire renders and Looking Glass route pages.

Public route planning streams through imperfect-api, not Cheshire. Browsers call api.imperfect.co/routes; imperfect-api calls Cheshire's private browser-route stream contract from app/lib/cheshire/public_routes.py. The request carries explicit browser-shared location, locale/units, prompt text, and optional read-only route ancestry. Cheshire returns cursor-bearing technical events (source_sequence + cursor) with terminal outcomes for clarification, route options, ready route, cannot-route, or typed error. imperfect-api persists a BrowserRouteRun with browser-safe events, terminal/candidate summaries, rounded location, prompt preview/hash, locale/units/timezone, and outcome analytics; private Cheshire route-session ids, cursors, provider ids, signed artifact URLs, tool details, and trace links stay server-side for follow-ups. Full live replay/catch-up belongs to #993, but #994 establishes the persisted replay-safe run shape and create/continue/list/detail/analytics endpoints.

Don't layer a heartbeat over a service that already streams them. When imperfect-api proxies a cheshire NDJSON stream (e.g. /cheshire/visualize/stream), the proxy used to wrap queue.get() in asyncio.wait_for(..., timeout=...) to emit its own heartbeat. That broke for two reasons: cheshire already heartbeats every 10 seconds on the same stream, so the duplication confused the mobile parser; and asyncio.wait_for against an unbounded queue silently loses items because of cpython #86296. The fix dropped the imperfect-api heartbeat entirely (#463). The rule: trust the downstream service's streaming contract; layering shapes on top of it is how silent-drop bugs land. The corollary in #482: when the downstream is async and the user only sees the result when it arrives, drop arbitrary asyncio.wait_for deadlines that exist only to bound latency — a degraded-but-fast push trades worse than no push.

Visualize path choice is glanceability, not richness. Cheshire exposes two visualization shapes: the /conversations/{id}/visualize reuse path produces richer charts (more series, denser annotations) but reads worse at 1000×1000 in a push-notification image; the long-template /visualize path is the survivor (#481, won't-fix; the A/B record is preserved at .docs/issue-481/DECISION.md). The rule: notification-image charts optimize for one-second glanceability, not for analytical richness. Visualizations consumed in-app, with a viewer the user can interact with, are a different optimization and may want the reuse path later — but the notification image pipeline is the binding constraint today.

Where cheshire and alice converge. Both siblings share the same Postgres mirror — alice owns writes (via its ingestion + mirror workers), cheshire owns reads (via row-level-security-scoped SQL the codegen produces). imperfect-api never queries that database directly; the rule from #105/#121 — imperfect-api stays credential-free — extends to data: imperfect-api stays direct-DB-free for the health mirror. Health-data reads go through cheshire (for reasoning) or alice's HTTP API (for raw records via fetch_data / fetch_data_range); they never go through a connection string on this service.

Home artifact freshness boundary

home-artifact-v1 is the canonical Home artifact key spec for the #1195 responsiveness work. It deliberately covers more than the current production event-plan-v1 check: local date/timezone, locale/units, active provider state, Alice-owned health freshness tokens, active event id/version, plan identity/version/content, schema/app-version bucket, prompt/context hashes, and app-visible client/location settings. The full field contract, comparison reason codes, and warmer client-context tradeoff live in docs/home-artifact-cache-key.md. Until #1200/#1203 wire it into serving, UserBoard._prepare_generation still uses the existing event/plan staleness behavior; #1199 only defines the pure key and the inert local/shadow comparison harness in app/lib/board/cache_key_shadow.py and imperfect-cli home-cache-key-shadow.

Hare dossier boundary

Hare owns durable dossier source events. imperfect-api is a producer because its conversation surfaces are where the user gives us stable facts directly: goals, constraints, injuries, preferences, schedule, training style, and naming. The surface-agnostic boundary lives in app/lib/hare/dossier.py — DossierSource + emit_dossier_event + fire_dossier_event — and writes POST /users/{user_id}/dossier/events. Onboarding was the first live producer; #769 added the live board-message surfaces on the same core.

Producer surfaces. Each producer sets a surface tag on its events (so telemetry reports emitted/skipped/failed per origin) and keys a stable source_ref:

Surface	`surface` tag	Source ref	Call site
Onboarding intake	`onboarding_intake`	`imperfect-api:onboarding_session:{session_id}:user_message:{turn_index}`	`extract_memory_async` (`/onboarding/intake`)
Race-rec	`race_recommendation`	same onboarding-session scheme	`extract_memory_async` (`/onboarding/race-recommendations`)
Home check-in	`home_checkin`	`imperfect-api:board_message:{board_message_id}`	`UserBoard._prepare_generation`
Board reply	`checkin_reply`	`imperfect-api:board_message:{board_message_id}`	`respond_to_notification`
Attachment extraction	`attachment`	`imperfect-api:attachment:{user_id}:sha256:{sha256}`	`fire_attachment_dossier_event`

Attachment identity. Images/PDFs are canonicalized as (user_id, SHA-256 bytes). A byte-identical re-upload by the same user reuses the same Hare source ref, while each upload occurrence remains visible in attachment census metadata (turn/message id, channel, filename, MIME, byte length, timestamp, status, and source). The same filename with different bytes is a candidate version for operator review, not a silent duplicate.
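A minimal sketch of the canonical identity, using the source-ref format from the table above. The function name is hypothetical, and rendering the digest as lowercase hex is an assumption.

```python
import hashlib


def attachment_source_ref(user_id: str, content: bytes) -> str:
    """Build the Hare source ref from (user_id, SHA-256 of the bytes)."""
    digest = hashlib.sha256(content).hexdigest()  # hex encoding is assumed
    return f"imperfect-api:attachment:{user_id}:sha256:{digest}"
```

The filename never enters the key, so a renamed copy of the same bytes maps to the same ref, while the same filename with different bytes produces a different one.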

Live app/channel uploads also write a local AttachmentProcessingReceipt keyed by the same (user_id, SHA-256) identity. It is product-state only: it lets the current reply acknowledge exact duplicates, retry-after-failure, and same-filename new-version candidates while Hare still owns durable source idempotency and dossier processing.

imperfect-cli attachments backfill is the manual replay path for historical attachments whose bytes already live on ChannelTurn (or a local operator file). It uses the same attachment extractor and Hare source builder as live ingest, defaults to dry-run, and writes only with --no-dry-run. The command emits once per canonical user+bytes attachment, with all exact-duplicate upload turns attached as provenance. Dry-run cannot query Hare for source existence, so it reports only whether the stored canonical attachment is present; write mode reports Hare's emitted vs skipped_duplicate response.

Historical replay uses the same source builders when it can recover the live id. imperfect-cli dossier-backfill <OIS_id> replays only that onboarding session. --user-id and --all also page through user-role BoardMessage rows (surface="board_message") and free-text NotificationResponse replies (surface="checkin_reply") up to --limit rows per family. Notification replies use imperfect-api:board_message:{board_message_id} when the decorated board-history row is recoverable; otherwise they fall back to imperfect-api:notification_response:{notification_response_id} so the replay remains stable.
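The fallback for notification replies can be sketched as below. The function name and parameters are hypothetical; the two ref formats are the ones given above.

```python
from typing import Optional


def notification_reply_source_ref(
    board_message_id: Optional[str],
    notification_response_id: str,
) -> str:
    # Prefer the live id so replay and live emission share one source ref.
    if board_message_id is not None:
        return f"imperfect-api:board_message:{board_message_id}"
    # Board-history row not recoverable: fall back to a ref that is still
    # stable across repeated backfill runs.
    return f"imperfect-api:notification_response:{notification_response_id}"
```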

User-authored text is primary truth. Memory/profile/situation rows are useful derived evidence, but they are not the canonical source for a dossier seed. Each surface emits only the user-authored text, never an agent's interpretation: the board reply emits the raw req.text, not the persisted feedback blob that prepends the notification context; the Home check-in emits the raw user_feedback, not the [image]/[pdf] attachment placeholder. Backfill skips assistant board rows, generated notification wrappers, button/non-reply notification actions, blank replies, and attachment-only messages; trailing [image] / [pdf:...] markers are stripped when adjacent user text remains. Assistant text becomes truth only when explicitly confirmed by the user or backed by trusted structured data.

Emit only after durable persistence. Onboarding emission waits for both OnboardingSession.append_message and finalize_turn; the board-message surfaces wait for BoardMessage.add_message to return a non-None (non-duplicate) row. If persistence fails — or dedupes to a no-op — Hare never receives an orphaned source event.

Replay and live emission share the idempotency key when possible. The onboarding turn index counts user-authored turns, so assistant replies do not perturb replay identity; the board surfaces dedupe on the BoardMessage id. imperfect-cli dossier-backfill prints each source_ref before sending, reports totals for sessions / board messages / notification responses plus a family-by-surface outcome table, and defaults to dry-run. --no-dry-run sends the same source refs Hare sees from live emission when a live id exists.
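A minimal sketch of why counting only user-authored turns keeps replay identity stable. The helper name, the message shape (dicts with a `role` key), and zero-based indexing are assumptions for illustration.

```python
from typing import Dict, List


def onboarding_source_refs(session_id: str, messages: List[Dict[str, str]]) -> List[str]:
    """Return one source ref per user-authored message, in order."""
    refs = []
    turn_index = 0  # counts user turns only; zero-based is an assumption
    for message in messages:
        if message["role"] != "user":
            continue  # assistant replies never advance the index
        refs.append(
            f"imperfect-api:onboarding_session:{session_id}:user_message:{turn_index}"
        )
        turn_index += 1
    return refs
```

Adding or removing assistant messages between the same user messages yields the same refs, which is what lets backfill and live emission collide on Hare's idempotency key.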

Dossier writes are non-blocking. Every surface schedules a background task that no-ops when Hare isn't configured (hare_base_url / hare_signal_token unset); enqueue failures are logged and metered but do not break the user-visible response. "Is Hare configured?" is the single gate — there is no per-service enable flag. Backfill is the recovery path for any missed live onboarding emission.
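The gates described in the last few paragraphs can be sketched together. This is a simplification: the function name, the outcome strings, and the inline call are assumptions, and the real path schedules a background task rather than calling `send` in the request.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("dossier_sketch")


def emit_if_persisted(
    persisted_row: Optional[object],
    hare_base_url: Optional[str],
    hare_signal_token: Optional[str],
    send: Callable[[object], None],
) -> str:
    # Gate 1: persistence failed or deduped to a no-op, so there is nothing to emit.
    if persisted_row is None:
        return "skipped_not_persisted"
    # Gate 2: "Is Hare configured?" is the only enable switch.
    if not hare_base_url or not hare_signal_token:
        return "skipped_unconfigured"
    try:
        send(persisted_row)
    except Exception:
        # Failures are logged, never raised into the user-visible response.
        logger.warning("dossier emit failed", exc_info=True)
        return "failed"
    return "emitted"
```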

Channel bridge inbound + durable worker boundary

Bridge webhooks (WhatsApp via footman, Heylo via footman/heylo) are at-least-once delivery surfaces. imperfect-api's ACK boundary is the ChannelTurn row: a webhook route returns 200 only after it has either inserted (channel, message_id) into app/models/channel_turn.py or identified that tuple as an existing duplicate. The route does no model work and sends no outbound bridge call.

Bridge webhook
  → /whatsapp_webhooks or /heylo_webhooks parses + validates native payload
  → process_inbound dedupes `(channel, message_id)` into ChannelTurn
  → footman-outbox-worker claims the turn by Mongo lease
  → generation writes ChannelOutboundDelivery rows
  → footman-outbox-worker admits each payload to the bridge

Why ACK after persistence, not after BackgroundTasks. FastAPI background tasks are process-local. Before #923, a deploy or worker crash after the HTTP 200 could drop the coach reply or link invite permanently. ChannelTurn makes the inbound durable before ACK; the Footman outbox worker (python -m app.workers.channels) owns slow work and can resume after deploys by reclaiming expired leases.
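A minimal sketch of the ACK boundary, with an in-memory set standing in for the ChannelTurn collection. `InMemoryTurnStore` and `handle_webhook` are hypothetical; how the real model enforces uniqueness on (channel, message_id) is not shown here.

```python
from typing import Set, Tuple


class InMemoryTurnStore:
    """Stand-in for ChannelTurn: remembers which (channel, message_id) exist."""

    def __init__(self) -> None:
        self._seen: Set[Tuple[str, str]] = set()

    def insert_if_new(self, channel: str, message_id: str) -> bool:
        key = (channel, message_id)
        if key in self._seen:
            return False  # redelivery of a message we already hold
        self._seen.add(key)
        return True


def handle_webhook(store: InMemoryTurnStore, channel: str, message_id: str) -> Tuple[int, str]:
    # Persist (or detect the duplicate) first; only then acknowledge.
    created = store.insert_if_new(channel, message_id)
    # No model work and no outbound bridge call happens in the route.
    return 200, "created" if created else "duplicate"
```

Both outcomes return 200, so an at-least-once bridge stops redelivering without the route ever doing slow work.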

Two durable phases, two retry policies. ChannelTurn tracks product work: associated senders become linked_reply, unassociated senders become link_invite, and a confirmed link enqueues a synthetic link-confirm:{token} linked reply for the parked question. Once generation succeeds, the turn is reply_ready and outbound payloads live in ChannelOutboundDelivery. The delivery phase retries bridge pre-admission failures (5xx or no response) because the bridge did not accept the payload; 4xx validation failures dead-letter the delivery, and a dead-lettered required delivery dead-letters the owning turn. Once a bridge accepts the payload, imperfect-api marks the delivery accepted and does not blindly retry, because bridge-side exactly-once idempotency is a separate transport concern.
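The delivery-side policy can be sketched as a classifier over the bridge response. The enum and function names are hypothetical, and treating any status outside 2xx and 4xx as retryable is an assumption beyond the 5xx and no-response cases named above.

```python
from enum import Enum
from typing import Optional


class DeliveryOutcome(Enum):
    ACCEPTED = "accepted"
    RETRY = "retry"
    DEAD_LETTER = "dead_letter"


def classify_bridge_response(status: Optional[int]) -> DeliveryOutcome:
    """Map a bridge HTTP status (None means no response) to a delivery outcome."""
    if status is not None and 200 <= status < 300:
        # The bridge accepted the payload: never blindly resend.
        return DeliveryOutcome.ACCEPTED
    if status is not None and 400 <= status < 500:
        # Validation failure: retrying the same payload cannot succeed.
        return DeliveryOutcome.DEAD_LETTER
    # 5xx or no response: the bridge did not admit the payload, so retry.
    return DeliveryOutcome.RETRY
```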

Footman stays transport-only. Channel-specific routes own parsing and upload caps, BridgeClient owns the per-channel send verb, and footman owns WhatsApp/Heylo transport queues. Product decisions — linked reply vs. link invite, invite localization, post-link replay, retry/dead-letter status — stay in imperfect-api so WhatsApp and Heylo share one durable core.

Event freshness contract

When alice ingests health data — a Garmin webhook, an Apple Health upload, a Terra event — it publishes a DataStoredEvent on the alice:events:data_stored Redis pub/sub channel. imperfect-api subscribes via app/lib/alice/subscriber.py and the handler in app/tasks/data_stored.py regenerates the user's board and may fire push notifications.

At-most-once is a deliberate choice. Redis pub/sub is fire-and-forget: a subscriber that isn't listening when alice publishes loses the message; a subscriber that crashes mid-handler drops it. There is no DLQ, no retry queue, no replay. The subscriber's module docstring is explicit about this — see app/lib/alice/subscriber.py. The bet is that any downstream state that must be derived from a missed event can be reconciled by a periodic sweep or by the imperative read paths (a refresh hitting alice directly). When a use case can't tolerate a dropped event, the right answer is a streams-based shape (Redis Streams / a real queue), not bolting retries onto pub/sub.

Per-type debounce, no cross-type gating. Garmin sends bursts of events — activity_details, dailies, respiration, sleeps — within seconds of each other, and we want to regenerate the board once per burst rather than four times. The handler keeps a Valkey-backed debounce window (_REGEN_TYPES_PREFIX, DEBOUNCE_SECONDS = 8) per user. The earlier shape gated board regen on the burst containing a sleeps event — which silently skipped UTC-negative-timezone users whose bursts arrived with only activities (#387). The surviving rule: per-type debounce is per-type; cross-type gating drops legitimate updates. The hash field is composite (vendor:data_type) so a multi-vendor user firing the same data_type from two providers in one burst (e.g. Garmin sleeps + Apple sleeps) doesn't have one vendor's metadata silently overwrite the other.
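A minimal sketch of the composite hash field, with a plain dict standing in for the Valkey hash. The helper names are hypothetical; only the vendor:data_type field shape comes from the text above.

```python
from typing import Any, Dict


def debounce_field(vendor: str, data_type: str) -> str:
    """Composite field so two vendors never share a slot for one data_type."""
    return f"{vendor}:{data_type}"


def record_burst_event(
    pending: Dict[str, Any], vendor: str, data_type: str, metadata: Any
) -> None:
    # A repeat from the same vendor and type overwrites its own slot only.
    pending[debounce_field(vendor, data_type)] = metadata
```

With a bare data_type field, the second vendor's sleeps event would replace the first; the composite field keeps both until the window closes.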

Idempotency keys cover the data's natural cadence. Sleep notifications dedupe per-user-per-day via an 18-hour Valkey TTL key, not via the 10-minute debounce window (#358). Providers re-sync sleep 1–2 hours later, so a window that's tight enough to catch a webhook burst can't catch the follow-up. The general rule: idempotency keys cover the data's natural cadence, not the wire-event burst window. Activities dedupe per-vendor on activity_id with a 24-hour window (ACTIVITY_DEDUP_WINDOW); cross-vendor relevance filtering prevents the same workout from firing twice when Apple proxies Garmin (#457).
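A minimal sketch of a per-user-per-day key with an 18-hour TTL, using an in-memory store that mimics a set-if-absent-with-expiry operation. The class, the key format, and the constant name are illustrative assumptions, not the service's code.

```python
from typing import Dict

SLEEP_DEDUP_TTL_SECONDS = 18 * 60 * 60  # 18 hours; constant name is hypothetical


class TtlKeyStore:
    """In-memory stand-in for a key-value store with per-key expiry."""

    def __init__(self) -> None:
        self._expires_at: Dict[str, float] = {}

    def set_if_absent(self, key: str, ttl_seconds: float, now: float) -> bool:
        expires_at = self._expires_at.get(key)
        if expires_at is not None and expires_at > now:
            return False  # key is still live: this is a duplicate
        self._expires_at[key] = now + ttl_seconds
        return True


def should_notify_sleep(store: TtlKeyStore, user_id: str, local_date: str, now: float) -> bool:
    # Key format is hypothetical; the point is one key per user per day.
    key = f"sleep_notified:{user_id}:{local_date}"
    return store.set_if_absent(key, SLEEP_DEDUP_TTL_SECONDS, now)
```

A provider re-sync two hours after the first notification finds the key still live and is suppressed, which a burst-sized window could not do.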

Canonical record per user-night. Garmin can publish multiple sleep records for the same night. The context builder reads via get_sleep(date=...) and the notification pipeline reads via get_sleep_history(days=1) — historically those returned different rows, and divergent reads drove contradictory coaching messages (#295). The design rule: a single user-night must resolve to one canonical sleep record across all pipelines. Divergent query shapes silently break this.

Vendor → cheshire data_type translation lives on imperfect-api. When a data_stored event fans out into a cheshire-backed visualization, the per-vendor viz_data_type_for mapping translates our enum to cheshire's vocabulary before the proxy call (#466). See Cheshire integration boundary above for the rule.

OAuth flow + PKCE state lifecycle

imperfect-api orchestrates OAuth2 with PKCE for vendor connections but does not generate the PKCE pair and does not see the tokens. alice produces state, code_verifier, and code_challenge; imperfect-api stores them in OAuthState for the duration of the flow and hands the verifier back during the callback so alice can finish the exchange.

Sequence:

1. Client: POST /health-providers/{provider}/connect
2. imperfect-api → alice: POST /{provider}/auth/authorize with the redirect_uri. alice returns authorization_url, state, code_verifier, code_challenge.
3. imperfect-api: persists an OAuthState document keyed by state with a 10-minute TTL. Returns the authorization_url to the client.
4. Client: opens the URL; the user authenticates with the vendor.
5. Vendor: redirects to imperfect://oauth/callback/{provider}?code=...&state=....
6. Client: POST /health-providers/{provider}/callback with code + state.
7. imperfect-api: looks up OAuthState by state, validates it exists + isn't expired + (when a caller is authenticated) belongs to the caller.
8. imperfect-api → alice: POST /{provider}/auth/exchange with code, redirect_uri, code_verifier. alice exchanges with the vendor, resolves the wellness-API user id, stores tokens in its own Valkey, returns wellness_api_user_id + permissions.
9. imperfect-api: HealthProviderConnection.create_or_update() writes the user_id ↔ provider_user_id mapping. OAuthState is deleted so the state can't be replayed.
10. Response goes back to the client with the connection details.

Why we don't store tokens. Two reasons, both load-bearing. First, credential containment (see System overview above): the only secret imperfect-api needs is the user-vendor mapping; tokens are alice's job. Second, refresh ownership: alice already implements per-vendor refresh logic, including refresh-token rotation and revocation. Holding a copy here would mean either re-implementing that logic (drift risk) or holding stale tokens that 401 on the next call (silent failure). The rule from #105/#121 — imperfect-api stays credential-free — is the load-bearing one; no-token-storage is its corollary.

Why the 10-minute TTL on OAuthState. The state document exists only to carry the code_verifier from connect → callback. Ten minutes is a comfortable upper bound on how long a human takes to complete an OAuth handshake (open the link, sign in, approve scopes, get redirected back). The MongoDB TTL index auto-expires the document; OAuthState.is_expired is a belt-and-suspenders check because the TTL sweeper has a ~60-second cleanup window during which an expired document may still be readable. After a successful callback the row is deleted explicitly to prevent replay — TTL is the fallback for abandoned flows, not the primary cleanup.
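A minimal sketch of the explicit expiry check that backs up the TTL index. `OAuthStateSketch` is a hypothetical stand-in: the real OAuthState is a MongoDB document, and its is_expired may be a property rather than a method taking `now`.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

OAUTH_STATE_TTL = timedelta(minutes=10)


@dataclass
class OAuthStateSketch:
    state: str
    code_verifier: str
    created_at: datetime

    def is_expired(self, now: datetime) -> bool:
        # Checked on every callback, because the TTL sweeper may not have
        # removed an expired document yet.
        return now >= self.created_at + OAUTH_STATE_TTL
```

A callback arriving 30 seconds after expiry may still find the document in the database; the explicit check rejects it regardless of where the sweeper is in its cycle.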

Why OAuthState.user_id is nullable. Anonymous onboarding flows start the OAuth handshake before a User document exists; identity is resolved at the callback step. The state document still carries the provider and the PKCE pair, just without a known user. The callback's _link_provider_identity is the resolver — it handles the four branches: connected (first-time, auto-create or reject based on login_only), already_connected (HPC matches caller), recovered (HPC exists under a different user, caller has no User yet → mint a custom token so the client can adopt the existing UID), and 409 (HPC exists under a different user and the caller is itself authenticated → refuse to steal the identity).

login_only mode. The mobile login flow can pass login_only=true to flip the connected branch from "auto-create a User" to "return 404". Without this, a "Welcome Back" Garmin login by someone who never signed up would silently create a ghost account (#461). The general rule: provider callbacks distinguish sign-in from sign-up; the caller declares which one it wants.
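The four callback branches plus the login_only switch can be sketched as a pure decision function. The function name, parameters, and outcome strings are hypothetical, and the real `_link_provider_identity` also performs the writes and token minting that this sketch omits.

```python
from typing import Optional


def resolve_link_outcome(
    hpc_owner_id: Optional[str],
    caller_id: Optional[str],
    login_only: bool = False,
) -> str:
    """hpc_owner_id: user owning an existing HPC row, if any.
    caller_id: the authenticated caller's user id, or None if anonymous."""
    if hpc_owner_id is None:
        # First-time connection: sign-up creates, sign-in refuses.
        return "not_found_404" if login_only else "connected"
    if caller_id == hpc_owner_id:
        return "already_connected"
    if caller_id is None:
        # Existing identity, anonymous caller: let the client adopt it.
        return "recovered"
    # Existing identity under someone else, caller authenticated: refuse.
    return "conflict_409"
```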

Idempotent connection rows. HealthProviderConnection is keyed on (user_id, provider, brand) and create_or_update reactivates an existing row instead of erroring on already_connected / recovered cases (#427). The row's lifecycle is independent of any single OAuth flow — reconnecting after a token expiry doesn't produce a duplicate. Status transitions (active ↔ expired ↔ revoked ↔ error) live on the same row, so the HPC is the durable "we know about this user-vendor relationship" record across the row's entire history. Index-signature changes here are not free: mongoengine doesn't drop old unique indexes when meta["indexes"] shifts, so any change to the unique tuple requires a manual prod drop + an orphan-index audit (#411).

Public Routes browser session boundary

The browser-first route planner at https://imperfect.co/routes uses /routes/* APIs on imperfect-api, but those APIs are intentionally not a developer API. A browser starts with GET /routes/session; imperfect-api creates or refreshes a BrowserRouteSession (app/models/browser_route_session.py) and returns a CSRF token while setting an HttpOnly, Secure, SameSite cookie scoped to /routes.

Anonymous route use is not a User. BrowserRouteSession.user_id is nullable on purpose: an anonymous browser can create, list, resume, and later publish route work without forcing a premature User row. Later auth/Garmin can attach the browser session to a real user, but the session identity is its own server-side row until that happens.

Cookie tokens and CSRF tokens are never stored raw. The browser receives an opaque high-entropy cookie token and an in-memory CSRF token; Mongo stores only SHA-256 hashes. Unsafe /routes methods require all three browser signals: an exact configured Origin, a valid browser-session cookie, and the X-CSRF-Token header. The exact-origin allowlist is PUBLIC_ROUTE_BROWSER_ORIGINS / settings.public_route_browser_origins; do not reflect arbitrary origins or use wildcards for this surface.
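A minimal sketch of hash-only storage and the three-signal check for unsafe methods. The function names and parameters are hypothetical; the real check reads the allowlist from settings.public_route_browser_origins and the hashes from the BrowserRouteSession row.

```python
import hashlib
import hmac
import secrets
from typing import FrozenSet, Optional


def new_token() -> str:
    """Opaque high-entropy token handed to the browser; never stored raw."""
    return secrets.token_urlsafe(32)


def hash_token(token: str) -> str:
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


def is_unsafe_request_allowed(
    origin: Optional[str],
    cookie_token: Optional[str],
    csrf_header: Optional[str],
    allowed_origins: FrozenSet[str],
    stored_cookie_hash: str,
    stored_csrf_hash: str,
) -> bool:
    # Signal 1: exact-match origin. No wildcards, no reflection.
    if origin not in allowed_origins:
        return False
    # Signals 2 and 3 must both be present before any comparison.
    if not cookie_token or not csrf_header:
        return False
    # Compare hashes in constant time; raw tokens never touch storage.
    cookie_ok = hmac.compare_digest(hash_token(cookie_token), stored_cookie_hash)
    csrf_ok = hmac.compare_digest(hash_token(csrf_header), stored_csrf_hash)
    return cookie_ok and csrf_ok
```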

Abuse keys are hashed request material, not payloads. BrowserRouteSession records hashed first/last IP and user-agent material plus a hashed request fingerprint suitable for rate-limit keys. Logs and rate-limit keys may include the session id, fingerprint hash, or derived key, but never raw cookies, CSRF secrets, exact private coordinates, full prompts, or raw IP/user-agent strings.

Route ancestry is provenance, not authorization. The nullable source_route_hash, source_route_slug, source_public_route_id, and source_session_id fields exist so a later browser session can fork a published/shared route for prompt context ("make this longer") without owning the source session. Treat those fields as read-only ancestry and analytics metadata, never as an authorization grant.

Browser route runs are owned by the browser session. BrowserRouteRun (app/models/browser_route_run.py) is the durable row behind POST /routes, GET /routes, GET /routes/{run_id}, POST /routes/{run_id}/continue, POST /routes/{run_id}/publish, and POST /routes/{run_id}/analytics. Ownership checks use the BrowserRouteSession.id, not route ancestry. A child run stores parent_run_id / root_run_id for conversation shape and carries the private Cheshire route-session cursor server-side so "make it longer" can continue the prior stream without exposing that cursor to the browser. Each generated option can become its own idempotent /r/{share_id}/{slug} page record for Looking Glass, later attach, and future fork context; recording one page never changes another option from the same run. The BFF validates page paths and any inline manifest payload with the shared hard-cutover contract, and it never exposes signed artifacts, route/content hashes, share tokens, or source-session ownership. The hard-cutover bundle contract lives in docs/public-route-contract.md → Route-share hard cutover.

Web route-send state + shared Garmin course idempotency

The public Send to Garmin Connect™ route bridge starts from the hard-cutover /r/{share_id}/{slug} contract in docs/public-route-contract.md, not from the pre-reset route-hash URL shape. It uses a static Garmin callback path, https://imperfect.co/garmin/route-send/callback. The browser may keep sessionStorage for progress continuity, but the source of truth is RouteSendSession (app/models/route_send.py). OAuth-backed starts create one short-lived row that binds the public share_id, presentation slug, Alice callback state, PKCE verifier, redirect URI, optional firebase_uid, and status. Returning-link starts that pass the server-side auth checks create a short-lived resolved-link:* session with the resolved route/user fields, returning_authorizer_link_id, and sentinel PKCE material so the course-finalization path can stay session-shaped without exposing reusable credentials.

Callback state is single-use, not deleted-on-success. RouteSendSession.claim_callback() atomically moves a pending row to callback_received and stamps callback_consumed_at. A second callback for the same state returns replayed with the existing session instead of exchanging the OAuth code again. An unconsumed row past expires_at is marked expired; a row that cannot be found (invalid state or TTL-purged) yields missing. Endpoint code should map those typed outcomes to deterministic UX: continue progress or success for replays, terminal error + GPX fallback for expired/missing, and never trust callback path params for route context.
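
The typed outcomes can be sketched with an in-memory stand-in. This is not the real RouteSendSession code: SessionRow, the dict store, and the "claimed" outcome name are assumptions for illustration, and a plain dict does not give the cross-process atomicity that the real claim needs from the database.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Dict, Optional, Tuple


class ClaimOutcome(str, Enum):
    CLAIMED = "claimed"    # hypothetical name for the first, successful claim
    REPLAYED = "replayed"
    EXPIRED = "expired"
    MISSING = "missing"


@dataclass
class SessionRow:
    state: str
    expires_at: datetime
    status: str = "pending"
    callback_consumed_at: Optional[datetime] = None


def claim_callback(
    rows: Dict[str, SessionRow], state: str, now: datetime
) -> Tuple[ClaimOutcome, Optional[SessionRow]]:
    row = rows.get(state)
    if row is None:
        # Invalid state or a row already purged by TTL.
        return ClaimOutcome.MISSING, None
    if row.callback_consumed_at is not None:
        # Already consumed: hand back the existing session, no second exchange.
        return ClaimOutcome.REPLAYED, row
    if now >= row.expires_at:
        row.status = "expired"
        return ClaimOutcome.EXPIRED, row
    row.status = "callback_received"
    row.callback_consumed_at = now
    return ClaimOutcome.CLAIMED, row
```

Only the CLAIMED branch would go on to exchange the OAuth code; every other branch maps to a fixed UX state.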

Route-send users may have firebase_uid=None. A route recipient can authorize Garmin from the web before installing the app or signing into Firebase. If the Garmin identity has no existing HealthProviderConnection, route-send may create a guest User with no firebase_uid and attach the Garmin HPC to that placeholder. Later claim/merge is Garmin-identity driven: when the same person installs/signs in and the provider callback resolves to the same provider_user_id, the registered Firebase user should adopt or merge the placeholder user's route-send-only state. If the Garmin identity already belongs to a registered Imperfect user, route-send reuses that owner instead of creating a placeholder. If multiple HPC rows already claim the same (garmin, provider_user_id), the identity is ambiguous; route-send must stop with a collision/fallback rather than picking one arbitrarily.

Consent receipts are best-effort delivery metadata. RouteSendConsentReceipt records the web CTA's ToS/privacy acceptance with legal versions or content hashes, timestamp, locale, route identity, optional firebase_uid, and later the resolved user_id. Receipt writes and user-claim updates log failures but never block OAuth, route/course creation, or final redirects. WhatsApp standing consent for automatic route uploads lives on the active ChannelAssociation as route_auto_upload_enabled; the coach flips it conversationally, and auto-upload callers read the single association predicate instead of inferring consent from prior route sends.

Garmin identity ownership has a route-send guard. HealthProviderConnection intentionally remains keyed by (user_id, provider, brand) today, and its (provider, provider_user_id) lookup index is non-unique. RouteSendProviderIdentity adds a route-send-specific unique claim on (provider, provider_user_id) so parallel public callbacks for the same Garmin account converge on the first owner. The guard does not replace HPC; it protects the route-send resolver from creating multiple placeholder owners while the broader HPC index remains unchanged.
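
The first-owner-wins behavior can be illustrated with a dictionary standing in for the unique index. claim_identity and the dict are hypothetical; in the real service the convergence would come from the unique (provider, provider_user_id) claim in the database, not from in-process state.

```python
from typing import Dict, Tuple

IdentityKey = Tuple[str, str]  # (provider, provider_user_id)


def claim_identity(
    claims: Dict[IdentityKey, str],
    provider: str,
    provider_user_id: str,
    user_id: str,
) -> str:
    """Return the owning user_id; the first claimant for a key wins."""
    # setdefault only writes when the key is absent, mimicking a unique insert
    # that falls back to reading the existing owner on conflict.
    return claims.setdefault((provider, provider_user_id), user_id)
```

Two parallel callbacks for the same Garmin account therefore resolve to the same owner, and the losing placeholder is never attached.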

Returning browser links are durable but not authoritative. RouteSendGarminLink persists a hashed opaque pointer from a prior successful route-send OAuth exchange to the resolved Imperfect user and Garmin identity. Looking Glass may keep the clear pointer in its own session cookie, but imperfect-api treats it only as a lookup key. returning_authorizer_token is optional on POST /route-send/garmin/start, required on /route-send/garmin/links/resolve, and returned as a fresh opaque pointer only after callback OAuth succeeds and the Garmin identity has a resolved Imperfect owner. Both endpoints re-check the active HealthProviderConnection, Alice auth status, and COURSE_IMPORT before considering the browser linked. Missing tokens or link rows, missing or mismatched HPC rows, Alice 401 / 403 / 404 auth status, disconnected Alice status, or missing COURSE_IMPORT mean reauthorization: resolve returns authorization_required, while start falls through to a fresh Garmin authorization URL. Alice rate limits or transient status failures return rate_limited / retry instead of discarding the pointer. No Alice/Garmin tokens or provider credentials are returned to the browser.

Course idempotency is separate from OAuth state and shared across channels. OAuth state proves one authorization attempt; it is not the resend key. RouteSendCourse stores the public share_id/slug for UX, but dedupes by an internal source key derived from the Garmin course source content plus Garmin identity: (source, source_key, provider, provider_user_id). app/lib/route_course_service.py owns the route-bundle resolution, missing-source-field gate, RouteSendCourse claim/reuse, Alice Garmin Courses call, and ready/failed row transitions without depending on RouteSendSession. The web _finalize_course path is only an adapter from RouteCourseResult back onto the browser session, so web, WhatsApp, and future route channels reuse the same dedupe key. Before calling Alice's course-create endpoint, imperfect-api claims that row as creating; concurrent send retries or sibling sessions for the same route/Garmin account return poll_needed while the winner is in flight. Once Alice creates the Garmin course, imperfect-api marks the row ready with the garmin_course_id and canonical connect.garmin.com/modern/course/{id} URL. Retries, duplicate callbacks, or later resends verify that the stored Garmin URL still resolves before returning already_sent. A provider 404 / 410 marks the row failed with provider_course_missing and reclaims the same row for replacement creation; auth-gated, private, transient, or unavailable checks keep the existing URL so ambiguous provider responses do not create duplicates. A changed share_id or slug for the same source content reuses the same course row when reuse is intended; the public route identity is not treated as the Garmin resend key.
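
A rough sketch of the dedupe key and the claim outcomes follows. The names course_key, CourseRow, and claim_course are hypothetical, hashing the source content with SHA-256 is an assumption (the document only says the key is derived from the course source content), and the stored-URL verification before already_sent is left out.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

CourseKey = Tuple[str, str, str, str]  # (source, source_key, provider, provider_user_id)


def course_key(
    source: str, source_content: bytes, provider: str, provider_user_id: str
) -> CourseKey:
    # Assumption: derive source_key by hashing the course source content.
    source_key = hashlib.sha256(source_content).hexdigest()
    return (source, source_key, provider, provider_user_id)


@dataclass
class CourseRow:
    status: str  # "creating" | "ready" | "failed"
    garmin_course_id: Optional[str] = None


def claim_course(rows: Dict[CourseKey, CourseRow], key: CourseKey) -> str:
    """Return "create" for the winner, else "poll_needed" or "already_sent"."""
    row = rows.get(key)
    if row is None:
        rows[key] = CourseRow(status="creating")
        return "create"
    if row.status == "failed":
        # Reclaim the same row for replacement creation.
        row.status = "creating"
        row.garmin_course_id = None
        return "create"
    if row.status == "creating":
        return "poll_needed"  # another attempt is in flight
    return "already_sent"
```

Because share_id and slug are not part of the key, republishing the same route content under a new public URL maps to the same row.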

Existing Garmin owners with WhatsApp enabled get a durable course-ready DM. When the web callback resolves to an existing_hpc_owner with an active WhatsApp ChannelAssociation whose notifications_enabled flag is true, imperfect-api enqueues a synthetic ChannelTurn keyed by the RouteSendCourse. The channel worker creates or reuses a route_course_ready Notification, localizes the copy, sends the WhatsApp message with the Garmin course URL as the link preview, and falls back through the normal notification send path. RouteSendCourse.notification_sent_at is the one-send claim, so duplicate callbacks or worker retries reuse the same notification instead of creating a second message.

Public route-send endpoints return UX states, not credentials. The web bridge uses POST /route-send/garmin/start, POST /route-send/garmin/callback, POST /route-send/garmin/links/resolve, GET /route-send/garmin/sessions/{session_id}, and POST /route-send/garmin/sessions/{session_id}/send. These endpoints are intentionally separate from Firebase-authenticated /health-providers/*; the start/callback/session/send paths return RouteSendStatus values such as authorization_required, course_ready, already_sent, cancelled, expired, route_not_found, route_not_ready, missing_course_import, provider_identity_collision, alice_error, rate_limited, retry, poll_needed, and session_not_found, while link resolution returns RouteSendLinkResolveStatus. Browser-visible responses may include the Garmin OAuth URL, the public route/GPX fallback, retry hints, an opaque returning-authorizer pointer after successful OAuth, and the final Garmin course URL, but never Alice credentials, PKCE verifiers, Cheshire credentials, or provider identity IDs. course_ready and already_sent responses also mark response_surface=garmin_connect_course and set success_redirect_url to the Garmin course URL so the web callback can redirect directly to Garmin Connect; every non-success state remains response_surface=imperfect_owned for Imperfect-hosted error, retry, permission, polling, or GPX fallback pages.

Terra `reference_id` lifecycle: fresh per session + canonicalize (#1094)

POST /session mints a fresh, never-reused reference_id; it is not the Firebase UID. Reinstalling the app resets the anonymous Firebase UID, so when we used the UID as the Terra reference_id, Terra minted a new terra_user_id and we orphaned/duplicated the account (#670, #979). The fix (#1094) decouples the two: app/resources/terra.py generates a throwaway per-session id, and the device identity (the Firebase UID) plus timezone ride along on the TerraSession row instead. Terra never sees the same reference twice across reinstalls, so user_reauth never fires.

The auth webhook recovers identity from the row, then canonicalizes. app/tasks/terra_lifecycle.py::_handle_auth looks the TerraSession row up by widget_session_id, resolves/creates the User by the row's firebase_uid (legacy rows that stored the UID as the reference fall back to it), and then calls _canonicalize_terra_reference: terra.user.modifyuser(reference_id=user.id) re-keys Terra to the imperfect-api user id, and alice's POST /terra/admin/rewrite-reference-id (exposed by sibling #1095) re-keys any rows already stored under the throwaway id. Canonicalize runs before trigger_terra_backfill so the historical data Terra republishes lands keyed on user.id from the start — matching the contract the TerraHealthDataService and AliceClient.trigger_terra_backfill docstrings already assumed. Both legs are best-effort: a transient Terra/alice failure is logged but never aborts User/HPC creation. Because the live connection's reference is now the user id, later deauth/connection_error events resolve via id first (_resolve_user_by_reference), with the firebase_uid fallback for legacy connections.

This stops new orphans on its own, independent of the matching/merge engine below — a genuinely-new user keeps their canonicalized id; a returning user whose data later fingerprint-matches an existing account gets re-pointed to the older account by the merge orchestrator (deferred: parent #1093, waves 4–5).

Terra account matching (data-local + scheduler)

A returning user's session republishes the user's historical data to alice (under the canonicalized reference). Account matching decides whether that underlying provider account belongs to an existing imperfect user — i.e. whether to recover the prior HealthProviderConnection or to treat it as a brand-new sign-up. The match itself runs in alice (#1127), where the Terra rows live; imperfect-api schedules the check and owns the decision (#1128).

The candidate marker is additive, not a deferral (#1096). _handle_auth still creates the account immediately (#1094, parent principle "everyone onboards normally and immediately") and also writes a PendingTerraConnection marker (user_id, terra_user_id, brand, session reference_id) so the scheduler (#1128) knows which freshly-onboarded connections to evaluate against older same-brand accounts. It is deliberately a separate model rather than a HealthProviderConnection status so the live connection stays active (onboarding gates on an active HPC, #290). The write is best-effort and never blocks signup.

Why matching moved to alice (#1127, supersedes #1097). The first cut (#1097) streamed each backfill data_stored event into imperfect-api, buffered it, and ran a per-candidate cross-user compare here — fetching each candidate's records back out of alice over HTTP. That shipped the same data over the wire repeatedly to do a same-brand cross-user join, which is cheapest where the data already sits. Worse, it gated on a data_stored field that wasn't reliably populated, so it sat inert in prod. #1128 deletes that machinery (subscription, Valkey debounce, accumulator, the #765/#766 fingerprint Python) and replaces it with a scheduled poll over the durable candidate marker.

Scheduled, bounded polling (#1128). A Render cron (app/tasks/terra_match_sweeper.py, every ~5 min) sweeps the active PendingTerraConnection markers old enough for Terra's backfill to have landed (due_for_sweep, after MATCH_CHECK_DELAY ≈ 5 min) and resolves each via app/services/identity/terra_match.py:

1. Ask alice POST /terra/admin/match-candidate (#1127) with the candidate's canonical user_id + brand. alice scores same-brand reference_id overlap where the rows live and returns matches + the candidate's own record_counts; it is blind to account state.
2. Apply imperfect-api's domain rules alice can't know: confirm at cumulative score >= 3.0 (alice's summary_id tier scores 3.0 per shared activity/sleep, so a single shared record confirms, the same bar #1093 locked in); filter matched reference_ids to eligible accounts (exist, not disabled/merged, not the candidate itself); apply the oldest-account rule (survivor = oldest, throwaway = the fresh candidate).
3. Confident match → call merge_terra_accounts(...) (#1098) and retire the marker merged. No confident match → leave the marker active and re-check next tick (Terra backfill arrives in waves, and this is the bounded retry), until the candidate ages past MATCH_GIVE_UP_AFTER (≈ 20 min) and is retired standalone.

   PendingTerraConnection marker (durable, Mongo)
                       │
   terra-match-sweeper cron (every ~5 min)
                       ▼
   due_for_sweep(older_than = now − MATCH_CHECK_DELAY)  ──▶ skip too-fresh / retired
                       ▼  per active candidate
   alice POST /terra/admin/match-candidate(user_id, brand)   [#1127, data-local]
                       ▼
   matches with score ≥ 3.0 → map to eligible accounts → oldest-account rule
        │ survivor found                          │ none
        ▼                                          ▼
   merge_terra_accounts (#1098)            final (age ≥ give-up)? ─yes─▶ retire standalone
   retire marker "merged"                          └no─▶ pending (retry next tick)
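
The domain rules applied on top of alice's scores can be sketched as below. This is an assumption-laden illustration: Account, pick_survivor, and the {reference_id: cumulative_score} shape of the alice response are hypothetical, not the actual terra_match.py code.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional

CONFIRM_SCORE = 3.0  # one shared activity/sleep record is enough to confirm


@dataclass
class Account:
    user_id: str
    created_at: datetime
    disabled: bool = False
    merged: bool = False


def pick_survivor(
    candidate_id: str,
    matches: Dict[str, float],      # assumed shape: reference_id -> cumulative score
    accounts: Dict[str, Account],   # known accounts keyed by user id
) -> Optional[Account]:
    eligible: List[Account] = []
    for reference_id, score in matches.items():
        account = accounts.get(reference_id)
        if score < CONFIRM_SCORE or account is None:
            continue  # below the bar, or the account no longer exists
        if account.disabled or account.merged or account.user_id == candidate_id:
            continue  # ineligible, or the candidate matching itself
        eligible.append(account)
    # Oldest-account rule: the survivor is the earliest-created eligible account.
    return min(eligible, key=lambda a: a.created_at, default=None)
```

A None result means "no confident match on this tick"; the candidate itself is always the throwaway when a survivor is found.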

Bounded polling replaces the perpetual subscription. Because the marker is durable in Mongo, the cron survives restarts (unlike #1097's in-process timers) and needs no event subscription. Retirement (merged / standalone) is terminal — a resolved candidate is never re-evaluated — and CANDIDATE_MAX_AGE (24h) is the outer backstop for a marker the sweeper never reaches (e.g. alice persistently down), lazily retired standalone rather than scanned forever. A genuinely-new or sparse returning user simply stays its own fresh account (the "sparse-data returning users may not auto-recover — accepted" limitation of parent #1093).

Fail-clean, never finalize on an outage. The coordinator never raises: a transient alice error or an aborted merge resolves pending (retry later), so an alice outage can neither merge wrongly nor prematurely finalize a candidate standalone. An alice 400 (unsupported brand — no scoring recipe) is the one error that resolves standalone immediately, since retrying can't help. The threshold + eligibility + oldest-account rule live in imperfect-api; alice owns only the data-overlap scoring (weights, summary_id tiers, the body/weight_kg exclusion — documented in alice's vendors/terra/matching.py).
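
The fail-clean resolution rules can be condensed into one pure function. This is a sketch under assumptions: resolve and its parameters are hypothetical, alice's reply is reduced to an HTTP status (None meaning a transport failure), the 20-minute constant is the approximate value quoted above, and the 24h CANDIDATE_MAX_AGE backstop is not modeled.

```python
from datetime import timedelta
from typing import Optional

MATCH_GIVE_UP_AFTER = timedelta(minutes=20)  # approximate, per the text above


def resolve(
    age: timedelta,
    survivor_found: bool,
    merge_ok: bool = True,
    alice_status: Optional[int] = 200,
) -> str:
    """Return "merged", "standalone", or "pending" for one sweep of a candidate."""
    if alice_status == 400:
        # Unsupported brand: retrying cannot help, so finalize immediately.
        return "standalone"
    if alice_status != 200:
        # Outage or transient error: never finalize, retry on a later tick.
        return "pending"
    if survivor_found:
        # An aborted merge also stays pending rather than going standalone.
        return "merged" if merge_ok else "pending"
    return "standalone" if age >= MATCH_GIVE_UP_AFTER else "pending"
```

Note that age alone never finalizes a candidate while alice is failing; only a clean "no match" answer past the give-up window does.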

The merge executor (#1098)

Matching moved to alice (#1127); a scheduler in imperfect-api (#1128) picks the survivor and calls merge_terra_accounts(survivor_user_id, throwaway_user_id, terra_user_id, brand) (app/services/identity/merge.py). #1098 is the merge mechanics only — a standalone callable that executes an already-made decision. It does no matching, subscribes to no events, and contains no survivor-selection logic; #1128 owns the trigger and passes survivor/throwaway in. The merge is a redirect, not a copy (parent #1093) and runs in a fixed order:

1. Redirect the live connection: modifyuser re-keys the live terra_user_id's future Terra webhooks to the survivor; alice's rewrite-reference-id (#1095) re-keys the rows already stored under the throwaway (dropping duplicates). This is the only step that can strand data, so it runs first and aborts the whole merge on failure. Leaving the throwaway as its own account is the accepted degradation; a half-merge that rebinds identity while data sits under the wrong key is not.
2. Deauth the survivor's stale connection: its pre-reinstall terra_user_id is dead. Best-effort via alice's deauth (#1095).
3. Repoint the HPC: create_or_update gives the survivor one active (terra, brand) HPC at the live terra_user_id (the unique index forbids a second row), so data_stored, which routes board regen by provider_user_id, sends the live connection's events to the survivor; the throwaway's HPC is revoked.
4. Rebind identity + device: move the new device's firebase_uid from the throwaway to the survivor (clearing the throwaway's first to free the unique-sparse key) and update the imperfect_user_id claim. The new device then resolves to the survivor via _find_user with no re-login (validated in #1093); the Apple-integrated token force-refresh is #1099. The reinstalled device's NotificationTokens are re-pointed onto the survivor too (#1100): they were registered under the throwaway and FCM rotates the token on reinstall, so the survivor's pre-reinstall token is dead, and without the move, post-merge pushes (the recovery notice first) can't reach the live device. Best-effort: a failed move self-heals on the client's next token re-registration.
5. Soft-disable + audit: User.disabled_at marks the throwaway disabled and a reversible AccountMerge row records the overwritten before-state. The throwaway's app data is discarded, not migrated.
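
The ordering guarantees of the steps above (replay guard, abort on a failed redirect, best-effort deauth) can be sketched as follows. merge_accounts and its callable parameters are hypothetical; the real merge_terra_accounts takes user ids, and the HPC repoint, identity rebind, and soft-disable steps are collapsed here into a single finish callable.

```python
import logging
from typing import Callable

logger = logging.getLogger("merge_sketch")


def merge_accounts(
    throwaway_disabled: bool,
    redirect: Callable[[], None],      # redirect the live connection
    deauth_stale: Callable[[], None],  # deauth the survivor's stale connection
    finish: Callable[[], None],        # repoint HPC, rebind identity, soft-disable
) -> str:
    if throwaway_disabled:
        # disabled_at guard: a replay of an already-executed merge is a no-op.
        return "noop"
    try:
        redirect()
    except Exception:
        # The only step that can strand data: abort before touching identity.
        logger.exception("redirect failed; throwaway stays its own account")
        return "aborted"
    try:
        deauth_stale()
    except Exception:
        # Best-effort: a failed deauth is logged and the merge continues.
        logger.warning("stale deauth failed; continuing", exc_info=True)
    finish()
    return "merged"
```

Running the riskiest step first means every failure mode leaves either two intact accounts or one fully merged account, never a mix.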

Why deauthing the old connection can't cascade-revoke the live one. Deauthing the survivor's stale terra_user_id makes Terra emit a deauth webhook carrying the survivor's reference_id, which would otherwise hit _handle_disconnect(survivor.id, brand) and revoke the survivor's repointed HPC. So _handle_disconnect/_handle_connection_error now match on provider_user_id when the webhook carries a user_id (#1098): the stale-connection deauth targets the old terra_user_id, which no longer matches the survivor's HPC, so the redirect survives. Payloads without a user_id fall back to the prior brand-only match.
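
The matching rule can be shown with a small filter. The Hpc dataclass and connections_to_revoke are hypothetical stand-ins for the logic inside _handle_disconnect, assuming the webhook's user_id field carries the terra_user_id.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hpc:
    user_id: str
    brand: str
    provider_user_id: str  # the terra_user_id this connection is bound to


def connections_to_revoke(
    hpcs: List[Hpc],
    reference_user_id: str,
    brand: str,
    webhook_terra_user_id: Optional[str],
) -> List[Hpc]:
    selected = []
    for hpc in hpcs:
        if hpc.user_id != reference_user_id or hpc.brand != brand:
            continue
        # When the payload names a terra user, only that exact connection matches.
        if webhook_terra_user_id is not None and hpc.provider_user_id != webhook_terra_user_id:
            continue
        selected.append(hpc)
    return selected
```

With the survivor's HPC already repointed at the live terra_user_id, a deauth webhook for the stale id selects nothing, so the redirect survives.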

Idempotent, fail-clean. #1128 calls the executor once per decision; the disabled_at guard makes a replay a no-op, and a hard failure aborts before touching identity (step 1) so the throwaway stays its own account rather than ending up half-merged.

Account recovered: notification + telemetry (#1100)

The user-facing tail of recovery lives in app/services/identity/recovery.py — neither matching nor mechanics, so it hangs off the coordinator's decision rather than the executor. On a fresh merge (not an idempotent replay), the coordinator (terra_match._merge) does two things the executor doesn't:

Notify the survivor. notify_account_recovered sends a NotificationType.account_recovered push + in-app notice ("We recovered your account") to the survivor — the account the user now logs into. It carries one button, "This wasn't me": a tap is the only first-party false-positive signal we have (a full reversal is deferred — parent #1093), so disputing feeds telemetry, it is not a one-tap rollback. The push reaches the live device only because the executor migrated its tokens onto the survivor (step 4 above).
Record telemetry. Parent #1093 is designed around Terra's unreliable backfill latency, so recovery is instrumented with structured Logfire events (the codebase aggregates rates with Logfire queries, not in-process counters):
terra_recovery_resolved — one per terminal resolution (merged / standalone), carrying outcome, brand, time_to_resolution_seconds (candidate-creation → now; for a merge that is time-to-merge), match_score, and the standalone reason. → match rate + time-to-merge.
terra_recovery_disputed — emitted when the user taps "This wasn't me" (the response lands at POST /notifications/{id}/response, which routes an account_recovered notice through handle_recovery_response and skips the board-feedback path — a recovery notice has no board). → false-positive rate (disputed ÷ merged).

Both surfaces are best-effort and never raise into the coordinator: a failed push or telemetry hiccup must not flip an otherwise-final merge back to pending and re-merge on the next sweep.

Onboarding shape

A user is "onboarded" when imperfect-api can serve them a dashboard. The product question — what does a successful first connection look like? — and the data question — does alice have any payloads for this person yet? — are deliberately decoupled.

Data flow ≠ onboarding completion. Apple Health users were getting stuck mid-onboarding because their data flowed through alice (the iOS app uploads directly) but POST /health-providers/apple-health/register was never called from the client. Without an HPC row, GET /health-providers returned empty, and the mobile client re-entered onboarding even though alice was holding real data for them (#290). The design rule that survived: onboarding gates on the HPC row's existence, not on whether data has arrived. The HPC row is the product fact ("this user agreed to connect their Apple Health"); data presence is an operational fact. They can be true independently and one doesn't imply the other.

The firebase_uid backfill seam. User.firebase_uid was added late (#424) — earlier users were keyed purely on email. The lookup in app/resources/base.py::_find_user carries the backfill: try firebase_uid first; on miss, fall back to email and persist the current firebase_uid to the row for next time. New users get firebase_uid set at creation; the email fallback only ever runs for legacy rows. The pattern extends to user creation — POST /users is an idempotent upsert keyed on firebase_uid that promotes account_type=guest → registered when an anonymous user links a Google/Apple credential (#477/#478). The mobile contract: after linkWithCredential, force-refresh the ID token and call POST /users again. The apple-health register endpoint follows the same robustness pattern — it's idempotent on a firebase_uid race; the duplicate-key path resolves to the existing User rather than 500-ing (#599).
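
The lookup-with-backfill pattern can be sketched in a few lines. UserRow and find_user are illustrative stand-ins for app/resources/base.py::_find_user, and restricting the email fallback to rows with no firebase_uid is an interpretation of "only ever runs for legacy rows", not confirmed behavior.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class UserRow:
    email: str
    firebase_uid: Optional[str] = None  # None marks a legacy, email-keyed row


def find_user(users: List[UserRow], firebase_uid: str, email: str) -> Optional[UserRow]:
    # Fast path: rows that already carry the uid.
    for user in users:
        if user.firebase_uid == firebase_uid:
            return user
    # Legacy path: match by email, then persist the uid so the next
    # lookup for this user takes the fast path.
    for user in users:
        if user.firebase_uid is None and user.email == email:
            user.firebase_uid = firebase_uid
            return user
    return None
```

After one successful fallback the row is indistinguishable from a new-style row, so the legacy branch shrinks over time without a bulk migration.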

login_only belongs to onboarding. The OAuth login_only flag (#461, also covered in OAuth flow above) is the onboarding side of the same rule: provider callbacks must distinguish "this is a sign-in for an existing account" from "this is a sign-up for a new one". The client knows which mode it's in; the server enforces it.

The HPC is the seam between onboarding and the rest of the product. Once the row exists, every downstream consumer — dashboard generation, notification routing, the context builder — keys on it. Onboarding's job is to produce a valid HPC row for the caller's vendor, with a known provider_user_id so alice's webhooks can route back. Everything else — does the data look good, is the body battery available, has the first sleep arrived — is post-onboarding state and resolves on the dashboard side, not on the onboarding side.