Learnings
Hand-curated index of design lessons from imperfect-api's GH issues and PRs. Read by an LLM running /research (first pass over the prior-art graph, short-circuiting gh search) and by humans asking "what did we already try?"
Not a changelog. Only learnings earn entries — disproven hypotheses, reversed decisions, won't-fixes with rationale, audit-driven cleanups, validated-cheaply spike wins.
Entry shape
- **#N** (state) — one-line load-bearing takeaway. Parent: #X. Winner: #Y.
Superseded by: #Z. See: <pointer>. [tags: subsystem, subsystem]
- `#N` and `(state)` are required. The takeaway is one sentence, present-tense, focused on the rule that survived, not the diff.
- `state` ∈ {`merged`, `not-planned`, `superseded`, `won't-fix`, `partial`} for issue entries, or `(PR closed)` / `(PR closed, not merged)` when the load-bearing artifact is the closed PR itself (no separate close-comment on the parent issue, or no parent issue at all). Compound with a structural modifier when relevant — e.g. `(merged, umbrella)` for umbrella entries.
- Optional tails: `Parent:` (umbrella), `Winner:` (which sibling shipped), `Superseded by:` (what replaced it), `See:` (deep-dive pointer — a file path, an architecture-doc anchor, or a comment URL), `[tags: ...]` (subsystem tags so `grep "tags:" docs/learnings.md` recovers cross-cutting clusters).
- Umbrella issues open with a leading bullet stating the decision rule, followed by nested bullets for each child. The umbrella entry is the load-bearing one; children are searchable pointers.
- The leading `#N` is wrapped in `**...**` so the Markdown parser doesn't read it as a block-level heading — keep that wrapping when adding new entries.
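The entry shape above is regular enough to check mechanically. The following is a minimal sketch, assuming the simplest single-issue form of an entry; `parse_entry` and both regexes are hypothetical helpers written for illustration, not part of the project's tooling, and compound heads such as `#954 / PR #957` are deliberately out of scope.

```python
import re

# Hypothetical parser for the simplest entry form: "- **#N** (state) — takeaway".
# The bold wrapping is treated as optional so unwrapped entries also match.
ENTRY_RE = re.compile(
    r"^- (?:\*\*)?(?P<number>#\d+)(?:\*\*)? \((?P<state>[^)]+)\) — (?P<body>.*)$"
)
TAGS_RE = re.compile(r"\[tags: (?P<tags>[^\]]+)\]")


def parse_entry(line: str) -> dict:
    """Split one entry line into issue number, state, and tags."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a single-issue learnings entry: {line!r}")
    tags_match = TAGS_RE.search(match["body"])
    tags = (
        [tag.strip() for tag in tags_match["tags"].split(",")]
        if tags_match
        else []
    )
    return {"number": match["number"], "state": match["state"], "tags": tags}


entry = parse_entry(
    "- **#570** (not-planned) — validate the dependency-graph assumption "
    "first. [tags: deps, lockfile]"
)
```

A check like this could run in CI to catch entries that drop the required `#N` or `(state)` fields.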
Admission rule
An entry earns its place if its takeaway would change a decision in a future /research run. Concretely, prefer entries that capture:
- A disproven hypothesis (the issue closed `not-planned` after measuring the prediction came out wrong).
- A reversed decision (we shipped X, learned, shipped Y; the survivor is Y).
- A won't-fix with rationale (the next person reaching for the same shape needs to see why we didn't take it).
- A design rule (an invariant that wasn't obvious from the code but the close-comment teaches).
- An audit summary (a zoom-out review that yielded one or two transferable rules; record the rule, not the 16 findings).
OAuth flow & alice-client integration
- #105 + #121 (merged) — health provider OAuth credentials, PKCE generation, token storage, refresh, and revocation all live in alice; imperfect-api only orchestrates the flow and stores `user_id ↔ provider_user_id` mapping. Design rule: any new vendor with OAuth tokens delegates to alice — imperfect-api stays credential-free. See: `app/lib/alice/client.py`, `docs/architecture.md` → System overview + alice integration boundary. [tags: oauth, alice, vendors]
- #187 (merged) — `POST /users` writes the imperfect `user_id` as a Firebase custom claim so alice can extract the identity from the JWT without round-tripping to imperfect-api. See: `app/resources/users.py::set_user_id_claim`. [tags: auth, alice]
- #461 (merged) — health-provider OAuth callback supports a `login_only` mode that returns 404 (no auto-create) when no User exists for the Firebase UID; prevents ghost accounts from a "Welcome Back" login flow on Garmin. Design rule: provider callbacks distinguish "sign in" from "sign up". [tags: oauth, auth, onboarding]
- #411 (merged) — mongoengine does not drop old unique indexes when `meta["indexes"]` changes. The stale `(user_id, provider)` unique index kept rejecting Terra brand inserts after the model moved to `(user_id, provider, brand)`. Maintenance hook: index-signature changes require a manual drop in prod + a follow-up audit of orphan indexes. See: `app/models/health_provider_connection.py`. [tags: mongodb, index, terra]
- #427 (merged) — `_link_provider_identity` reactivates an existing `HealthProviderConnection` instead of erroring on `already_connected` / `recovered` cases. Design rule: connection rows are idempotent across reconnects, not unique-per-attempt. [tags: oauth, health-providers]
- #599 (merged) — `POST /health-providers/apple-health/register` is idempotent on a `firebase_uid` race; the duplicate-key path resolves to the existing User rather than 500ing. [tags: apple, race]
- #954 / PR #957 (merged) — public route-send OAuth keeps short-lived callback state, Garmin identity ownership, and durable route+provider course rows separate so duplicate callbacks, guest users, and resends converge deterministically. Parent: #952. See: `docs/architecture.md` → Web route-send state + shared Garmin course idempotency. [tags: route-send, oauth, garmin, idempotency]
- #985 / PR #1007 (merged) — Route delivery and consent persistence stay decoupled: public route-send ToS/privacy acceptance is recorded as a best-effort receipt that can later claim a user, while standing WhatsApp route auto-upload consent is an explicit `ChannelAssociation` preference, never inferred from prior sends or allowed to block OAuth/course delivery. See: `docs/architecture.md` → Web route-send state + shared Garmin course idempotency. [tags: route-send, consent, whatsapp, oauth]
- #1022 (merged) — Returning route-send browser links are durable lookup pointers, not authority: imperfect-api re-checks the active `HealthProviderConnection`, Alice auth status, and `COURSE_IMPORT` before skipping Garmin OAuth, and stale links reauthorize rather than exposing credentials. See: `docs/architecture.md` → Web route-send state + shared Garmin course idempotency. [tags: route-send, oauth, garmin]
- #1128 (merged, PR #1149) — Terra reinstall-recovery account matching runs in alice (data-local, #1127), not by streaming `data_stored` records into imperfect-api. The #1097 streaming matcher shipped the same rows over the wire repeatedly and went inert in prod (it gated on a `data_stored` field that wasn't reliably populated); imperfect-api owns scheduling and the decision (alice only scores): a Render cron (`terra_match_sweeper`) polls the durable `PendingTerraConnection` marker with bounded retries, then applies the threshold + account eligibility + oldest-account rule (score ≥ 3.0 → merge #1098, else standalone). Design rule: a same-brand cross-user data join belongs where the rows live (alice), not the consumer; the survivor is always the strictly-older account so a marker race can't fold older→newer. Parent: #1093. Superseded: #1097. See: `app/services/identity/terra_match.py`, `docs/architecture.md` → Terra account matching. [tags: terra, alice, identity, matching]
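The threshold and oldest-account rule from the #1128 entry can be sketched as a pure decision function. This is an illustration under stated assumptions, not the code in `app/services/identity/terra_match.py`: the `Account` type, the `decide` function, and the tie-handling are hypothetical, and the account-eligibility check the entry mentions is omitted.

```python
from dataclasses import dataclass
from datetime import datetime

MERGE_THRESHOLD = 3.0  # from the #1128 entry: score >= 3.0 merges


@dataclass(frozen=True)
class Account:
    user_id: str
    created_at: datetime


def decide(new: Account, candidate: Account, score: float) -> tuple[str, str]:
    """Return (action, surviving_user_id) for one scored candidate match."""
    if score < MERGE_THRESHOLD:
        return ("standalone", new.user_id)
    # The survivor is the strictly older account, whichever side it is on,
    # so the result does not depend on which marker was processed first.
    if candidate.created_at < new.created_at:
        return ("merge", candidate.user_id)
    if new.created_at < candidate.created_at:
        return ("merge", new.user_id)
    # Assumption: an exact tie has no strictly older account, so no merge.
    return ("standalone", new.user_id)


older = Account("u_old", datetime(2023, 1, 1))
newer = Account("u_new", datetime(2024, 1, 1))
```

Because the function is symmetric in its two accounts, swapping the arguments yields the same survivor, which is the property the design rule relies on.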
Health provider routing (Garmin / Whoop / Coros / Apple)
- #152 (merged, umbrella) — Terra moves to alice-backed architecture: alice owns webhooks + token storage + Postgres data, imperfect-api consumes via `GET /terra/{data_type}` like Garmin. Children: #154 (alice-backed service) / #155 (lifecycle via pub/sub) / #156 (factory) / #157 (ops webhook redirect) / #158 (drop old webhook + Mongo models) / #160 (migrate `terra_user_id` to HPC). Pattern reused for any future vendor that goes through alice. [tags: terra, alice, vendors]
- #442 (merged) — `/onboarding/garmin/insights` generalized to `/onboarding/insights` with Garmin-first priority, Terra fallback. Design rule: onboarding endpoints degrade gracefully (200 + empty insights) instead of 422 when a vendor is absent — the teaser is opt-in across providers. [tags: onboarding, vendors]
- #466 (merged) — Apple's Cheshire `data_type` is `workouts`, not `activities`. Per-vendor `viz_data_type_for` mapping must be remapped both in the notification path and the public `/cheshire/visualize` proxy — the proxy was forwarding the body verbatim. Design rule: vendor → Cheshire data-type translation is server-side, never client-trusted. [tags: apple, cheshire, vendors]
- #290 (merged) — Apple Health users were stuck in onboarding because data flowed through alice but `POST /health-providers/apple-health/register` was never called; without an `HPC` row, `GET /health-providers` returns empty and the client re-enters onboarding. Design rule: data flow ≠ onboarding completion — onboarding gates on the HPC row's existence. [tags: apple, onboarding]
- #448 (merged) — backwards-compat floor on `POST /users` and `POST /health-providers/apple-health/register`: `/users` looks up by `email` first then `firebase_uid` so provider switches don't 500; the apple-health register accepts a missing body and returns the legacy `HealthProviderConnectionResponse`. Design rule: do not break older mobile builds while a min-version is still in the wild. [tags: apple, mobile, compat]
- #587 (merged) — once the supported mobile floor moves past the old shapes, delete the bodyless-fallback branches; cross-reference `imperfect-mobile` at the supported version tag + live Logfire UA traffic before removing dead paths. [tags: cleanup, mobile]
- #316/#317 (merged) — `DataStoredEvent` from alice uses vendor `apple`; imperfect-api normalizes to `apple_health` at the event boundary. Design rule: alice's vendor strings can differ from imperfect-api's enum — translate at the subscriber, never leak alice naming into our models. [tags: events, apple]
- #1196 / PR #1220 (merged) — Apple `data_stored.user_id` is the Imperfect user id, while `HealthProviderConnection.provider_user_id` is the mobile device UUID; the subscriber routes Apple by `user_id` and only uses `provider_user_id` for vendors that publish vendor identities. Parent: #1195. See: `app/tasks/data_stored.py`, `app/tasks/CLAUDE.md`. [tags: apple, alice, events, board]
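The two translation rules above (#316/#317 and #466) both amount to a lookup applied at the service boundary. A minimal sketch, assuming plain dictionaries; the table and function names are hypothetical, and only the `apple` → `apple_health` and `activities` → `workouts` mappings come from the entries.

```python
# Hypothetical lookup tables; only the two mappings below are from the entries.
ALICE_TO_LOCAL_VENDOR = {"apple": "apple_health"}
CHESHIRE_DATA_TYPE = {("apple_health", "activities"): "workouts"}


def normalize_vendor(alice_vendor: str) -> str:
    """Translate alice's vendor string to the local enum value at the subscriber."""
    return ALICE_TO_LOCAL_VENDOR.get(alice_vendor, alice_vendor)


def cheshire_data_type(vendor: str, requested: str) -> str:
    """Remap the data type server-side instead of trusting the client body."""
    return CHESHIRE_DATA_TYPE.get((vendor, requested), requested)


vendor = normalize_vendor("apple")
data_type = cheshire_data_type(vendor, "activities")
```

Keeping both tables next to the subscriber and the proxy means a new vendor needs one row per table rather than a change in every caller.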
Agents, prompts & evals
- #481 (won't-fix) — Cheshire `/conversations/{id}/visualize` reuse path produces visually richer charts but worse glanceability for push-notification images at 1000×1000. Kept the long-template `/visualize` path. Closed PR #490 + `.docs/issue-481/DECISION.md` preserve the A/B for the next agent considering it. Design rule: notification-image charts optimize for 1-second glanceability, not for richness. [tags: cheshire, notifications, charts]
- #534 / PR #535 (merged) — onboarding memory agent silently dropped writes in prod while the eval suite passed 100%. Two contracts: evals counted `ToolCallPart` (emitted), prod counted tool returns marked success. Fix: switch from default tool-mode `final_result` to `PromptedOutput` so structured reply doesn't compete with write tools; add `WritesActuallyExecuted` evaluator that reuses prod's `summarize_turn_writes`. Design rule: evaluators must read the same signal as the prod success metric. [tags: agents, evals, prompts]
- #522 (merged) — onboarding memory agent eval suite uses real-prod failure transcripts (Charlie + dog, hip-labrum recovery, hybrid-athlete essay, Spanish chronic-injury) as floor cases; the warning at `onboarding_intake.py:151` ("text ≥10 chars yields zero writes") is the prod signal that produces new eval cases. [tags: agents, evals]
- #492 / PR #493 (merged) — weekly plan delegates reasoning to Cheshire's `POST /users/{user_id}/ask` and uses a local Sonnet formatter to structure the prose into `WeeklyPlanSchema`. Local Opus `planning_agent` stays as the fallback for Cheshire failure / no connected provider / formatter validation failure. Design rule: when delegating reasoning to a sibling service, keep the local agent as a gated fallback, not a deletion. [tags: cheshire, agents]
- #401 / PR #403 (merged) — agents inject user locale via `agent.instructions(locale_instruction)` (regenerated every run, top-level `system`), not `@system_prompt` (which persists as `SystemPromptPart` in message history and gets clobbered by `agent.override(model=...)`). Cross-project pattern from cheshire. Add `locale` + `user_id` as span attributes for queryable drift. [tags: agents, locale, observability]
- #799 (partial) — `dossier-backfill` expanded from onboarding-only replay to historical user-authored replay across onboarding sessions, board messages, and notification replies. Design rule: backfill should prefer the same source ref live emission used (`BoardMessage` id for board-backed surfaces), but can use a stable replay-only fallback when no live id is recoverable; never replay generated notification wrappers, button taps, assistant text, blank replies, or attachment-only placeholders as user truth. See: `docs/architecture.md` → Hare dossier boundary. [tags: hare, backfill, events]
- #506/#507 / PR #509, #511 (merged) — context-builder `<recent_activities>` block must (a) dedup Apple workouts across `source_bundle_id` proxies before summing durations, and (b) use absolute "(N days ago)" labels instead of "(N days)" so the agent can't infer streaks from row order. Design rule: LLM-readable context strings encode calendar adjacency explicitly — agents will infer from layout if the data lets them. [tags: agents, prompts, context]
- #295 (merged) — `ContextBuilder._fetch_raw_data` (`get_sleep(date=...)`) and the notification pipeline (`get_sleep_history(days=1)`) read different Garmin sleep records for the same night when Garmin sends multiple entries. Design rule: a single user-night must resolve to one canonical sleep record across all pipelines — divergent query shapes silently drive contradictory coaching. [tags: agents, garmin, data-freshness]
- #791 / PR #791 (merged) — imperfect-api emits onboarding-transcript user messages as Hare dossier source events. Design rule: emit to a sibling event log only after durable persistence (both `append_message` + `finalize_turn` succeed) so a failed save never orphans a source event; live emission and the `dossier-backfill` replay share one idempotency key (`source_ref = imperfect-api:onboarding_session:{id}:user_message:{turn}`) so replay can't double-write; emission is non-blocking (originally flag-gated by `HARE_DOSSIER_EVENTS_ENABLED`; flag removed in #1143 — the gate is now just "is Hare configured?"). The transcript (user-authored turns), not agent interpretation, is the canonical dossier seed. See: `app/lib/hare/dossier.py`, `docs/architecture.md` → Hare dossier boundary. [tags: hare, onboarding, events]
- #1143 / PR (merged) — removed `HARE_DOSSIER_EVENTS_ENABLED`; dossier emission is now unconditional and gated solely on Hare being configured (`hare_base_url` / `hare_signal_token`). Design rule: a per-service boolean that must be set identically on every emitting service is drift-prone — `footman-outbox-worker` silently dropped every coach dossier emission (#1040's `_fire_coach_dossier_event`, never reaching Hare since the channel path landed in #796) because it carried the Hare creds but not the enable flag, which defaults `False`. The presence of the credential is the right single gate; `fire_dossier_event` already no-ops (`skipped_unconfigured`) where Hare is unwired, and unsetting `hare_base_url` is the kill-switch if ever needed. See: `app/lib/hare/dossier.py`. [tags: hare, config, cleanup, events]
- #1184 / PR #1206 (merged) — attachment identity is `(user_id, SHA-256 bytes)`, not board/message ids or filenames: exact same-user reuploads collapse for Hare while occurrence metadata/census preserves every upload, same-filename different-hash files stay candidate versions, and old board-scoped Hare keys are intentionally stranded at the migration boundary. See: `app/lib/attachments.py`, `docs/architecture.md` → Hare dossier boundary. [tags: attachments, hare, idempotency]
- #961 / PR #961 (open) — split the memory work by responsibility: this side owns "does the agent use injected memory?", hare owns "is the memory good?". The user_state_agent eval gains 10 curated real (anonymized) dossier-consumption cases (`tests/evals/dossier_cases.py` + `fixtures/memory/`): each pairs a production board context with the real Hare dossier markdown that was injected and asserts the board honors it (`QualityJudge` gained a free-text `custom_rubric` / `custom_red_flags` path so per-case behaviors don't need the fixed `should_*` vocabulary). Design rule: a memory consumer's eval must inject the memory system's real rendered output, not an idealized `DossierContext(rendered=...)` — the hand-written block tests the board given a perfect dossier and never catches upstream loss. The recall question (does the dossier carry the user's facts — extraction/consolidation fidelity) belongs to hare's full-flow gate cases, not here. [tags: evals, dossier, memory]
- #961 measurement that drove the hare split (scored once during the spike, not kept as a standing eval here): the dossier beat the legacy `user_situations` + `user_preferences` blocks overall (68% vs 34% fact recall) but dropped standing preferences (65%, below legacy's 69%) because hare's consolidation gate classified `preferences` / `constraints` / `training_style` while the consolidator had no anchor to store them under — unanchored prose was silently discarded. An unmapped category between a gate taxonomy and a storage taxonomy is silent data loss. Drove hare#106/#107/#110; open follow-up hare#113 (preferences still lossy + stale contradictions rendered as opposite-of-truth). Operational note for regenerating cases: hare/alice are Render private services (no public URL; SSH refuses TCP forwarding), so prod dossier reads run a snippet inside the prod imperfect-api container over SSH. [tags: hare, evals, dossier, pii]
- #924 (merged) — Route tools choose by user intent, not recency: standalone asks start a new search, follow-ups continue only via a resolved semantic route handle (`quoted_route` / `latest_route`), and raw Cheshire session IDs stay framework-side. See: `app/agents/tools/cheshire.py`, `app/models/route_artifact.py`. [tags: cheshire, routes, agents, evals]
- #991 / PR #1001 (merged) — public route planning streams terminate at imperfect-api: browsers call `api.imperfect.co/routes`, while Cheshire stays backend-facing and returns cursor-bearing technical events plus typed terminal outcomes; raw route-session ids, provider ids, signed artifacts, tool details, and trace links stay server-side. Parent: #989. See: `app/lib/cheshire/public_routes.py`, `docs/architecture.md` → Cheshire integration boundary. [tags: cheshire, routes, sse, public-routes]
- #1123 / PR #1166 (merged) — Cheshire ask streams retry upstream transport drops only until imperfect-api emits the first client-visible SSE event; turn-1 retries are safe without an idempotency key because Cheshire persists after stream completion, but follow-up turns need idempotency before retrying. See: `app/lib/cheshire/client.py`, `app/resources/cheshire_chat.py`. [tags: cheshire, sse, retry]
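The #534 mismatch (evals counting emitted tool calls, prod counting successful tool returns) is easier to see with both counters side by side. This is a toy sketch: the event dictionaries are a hypothetical stand-in for the framework's real message parts, and neither function is the project's `summarize_turn_writes`.

```python
# Hypothetical turn transcript: the agent emitted one write call,
# but the tool return for it was not marked successful.
turn_events = [
    {"kind": "tool_call", "tool": "save_memory"},
    {"kind": "tool_return", "tool": "save_memory", "ok": False},
]


def emitted_calls(events: list[dict]) -> int:
    """What the old evaluator counted: tool calls the model emitted."""
    return sum(1 for event in events if event["kind"] == "tool_call")


def executed_writes(events: list[dict]) -> int:
    """What prod counted: tool returns that were marked successful."""
    return sum(
        1 for event in events if event["kind"] == "tool_return" and event["ok"]
    )


eval_signal = emitted_calls(turn_events)
prod_signal = executed_writes(turn_events)
```

On this transcript the first counter reports a write and the second reports none, which is the gap that let the eval suite pass while prod dropped writes.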
MongoDB models & schema
- #592 / PR #602 (merged) — `EventDetails` (one-to-one with `Event` via `event_id` PK) embeds optional enrichment blocks in Mongo while dual-writing to a Postgres `event_details` table with JSONB columns per block. Design rule: enrichment lives in a sibling document, not as nested fields on `Event`, so each enrichment phase can ship independently and PG schema evolution is `JSONB`-cheap. [tags: events, schema, postgres]
- #582 / PR #586 (merged) — `events` / `user_events` PG tables moved to a dedicated `core` schema (not `public`). Design rule: cross-service Postgres tables get their own schema so RLS roles and migration ownership stay scoped. [tags: postgres, schema]
- #579 (merged) — Phase 1 of events Postgres migration is dual-write: Mongo stays source of truth, PG failures are logged + swallowed via `safe_dual_write`. Pattern: migrate hot collections via dual-write before swapping reads. [tags: postgres, migration]
- #580 / PR #581 (merged) — guest users can't satisfy `email` required on `AccountDeletion`; mongoengine `ValidationError` 500s the deletion endpoint. Design rule: account-deletion records make `email` optional because guest accounts exist by construction. [tags: mongo, account]
- #424 (merged) — `User.firebase_uid` added + `User.email` made nullable to support guest accounts created from an anonymous Firebase session (no email yet). Pure schema + backfill — behavior changes ride later PRs. [tags: user, schema]
- #477/#478 / PR #479 (merged) — `POST /users` is an idempotent upsert keyed by `firebase_uid`: anonymous → Google/Apple link path promotes `account_type=guest→registered` and persists email/name/provider. Mobile contract: after `linkWithCredential`, force-refresh the ID token and call `POST /users` again. [tags: user, auth, mobile]
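The dual-write phase described in #579 can be sketched as a small wrapper: the source-of-truth write may fail loudly, while the shadow write is logged and swallowed. The function below is a hypothetical illustration and not the project's `safe_dual_write`; its name, signature, and return value are assumptions.

```python
import logging
from typing import Callable

logger = logging.getLogger("dual_write_sketch")


def dual_write_sketch(
    primary: Callable[[], None], shadow: Callable[[], None]
) -> bool:
    """Run the source-of-truth write, then attempt the shadow write.

    Returns True only if the shadow write also succeeded.
    """
    primary()  # source of truth: any failure propagates to the caller
    try:
        shadow()
    except Exception:
        # Shadow store is not authoritative yet, so log and carry on.
        logger.exception("shadow write failed; primary remains authoritative")
        return False
    return True


mongo_rows: list[str] = []


def failing_pg_write() -> None:
    raise RuntimeError("postgres unavailable")


shadow_ok = dual_write_sketch(lambda: mongo_rows.append("event-1"), failing_pg_write)
```

Returning a boolean rather than raising keeps request handling unaffected while still giving callers something to count when measuring shadow-write drift.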
Notifications & push pipeline
- #484 / PR #485 (merged) — infographic push payload exceeded FCM's 4KB limit because we stuffed the full prompt into `body` + duplicated URLs in `data`. Mobile only reads `data['image_url']` (auto-injected from `image_urls_by_platform`) and an optional caption. Design rule: notification payloads carry URLs + minimal caption, never embedded content. [tags: notifications, fcm]
- #978 / PR #980 (merged) — WhatsApp chart profiles set logical `width`, omit `height`, and use `resolution.fit` for the delivery pixel budget; do not pin both dimensions for channel images unless the content itself is square. See: `app/lib/whatsapp/charts.py`. [tags: notifications, cheshire, charts, whatsapp]
- #1090 / PR #1101 (merged) — WhatsApp chart surfaces keep logical height omitted but bound Cheshire's derived canvas with flat `min_aspect_ratio` / `max_aspect_ratio`; display crop is a render-contract problem, not something `resolution.fit` or transport metadata can fix. See: `app/lib/whatsapp/charts.py`, `app/lib/cheshire/client.py`. [tags: notifications, cheshire, charts, whatsapp]
- #482 / PR #483 (merged) — dropped the 120s `asyncio.wait_for` deadline on `_render_notification_images`; the cron is async and the user sees the push when it arrives, so the artificial cap only caused us to ship pushes with `image_urls_by_platform={}`. Design rule: in async pipelines, drop timeouts that exist only to bound latency — they trade a degraded-but-fast push for no push. [tags: notifications, cheshire]
- #463 / PR #464 (merged) — imperfect-api's own `/visualize` heartbeat is dropped: Cheshire already heartbeats every 10s on the same NDJSON stream, and `asyncio.wait_for(queue.get(), timeout=...)` against an unbounded queue silently loses items via cpython issue #86296. Design rule: don't layer a heartbeat over a service that already streams them — duplicating shapes confuses mobile parsers. [tags: sse, cheshire, async]
- #358 / PR #367 (merged) — sleep notifications dedupe per-user-per-day via a Valkey key with 18h TTL, not just via the 10-minute debounce; providers re-sync sleep 1–2h later and a debounce window can't catch it. Design rule: idempotency keys cover the data's natural cadence, not the wire-event burst window. [tags: notifications, debounce]
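The per-user-per-day dedupe from #358 can be illustrated with an in-memory stand-in for a set-if-absent key with expiry. This is a sketch under assumptions: `TtlKeyStore`, the key format, and the injected `now` timestamp are hypothetical, and a real deployment would use the key-value store's own atomic set-if-absent operation instead.

```python
from datetime import date

TTL_SECONDS = 18 * 60 * 60  # 18h, from the #358 entry


class TtlKeyStore:
    """In-memory stand-in for a set-if-absent-with-expiry key store."""

    def __init__(self) -> None:
        self._expires_at: dict[str, float] = {}

    def set_if_absent(self, key: str, ttl: float, now: float) -> bool:
        expiry = self._expires_at.get(key)
        if expiry is not None and expiry > now:
            return False  # key is still live, so this is a duplicate
        self._expires_at[key] = now + ttl
        return True


def should_send_sleep_push(
    store: TtlKeyStore, user_id: str, day: date, now: float
) -> bool:
    """True only for the first sleep push per user per day within the TTL."""
    key = f"sleep-notified:{user_id}:{day.isoformat()}"  # hypothetical format
    return store.set_if_absent(key, TTL_SECONDS, now)


store = TtlKeyStore()
night = date(2024, 5, 1)
first = should_send_sleep_push(store, "u1", night, now=0.0)
resync = should_send_sleep_push(store, "u1", night, now=2 * 60 * 60)
```

A provider re-sync two hours later falls well outside a 10-minute debounce window but inside the 18h key lifetime, so it is suppressed.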
- #387 (merged) — `data_stored` subscriber must not gate board regen on the burst containing a `sleeps` event; UTC-negative-timezone users get bursts with only activity events and were silently skipped. Design rule: per-type debounce is per-type; cross-type gating drops legitimate updates. [tags: events, debounce]
- #366 / PR (merged) — clean up dead FCM tokens on any `FirebaseError` (including `ThirdPartyAuthError`), not just `UnregisteredError`; per-user token rows accumulate across reinstalls otherwise. [tags: fcm, tokens]
- #379 (merged) — when the SSE client disconnects mid-stream the board generation continues in a background task and persists; the user's feedback was being lost on disconnect because the save lived inside the generator. Design rule: side effects in SSE generators live outside the `yield` loop, so disconnect doesn't drop them. [tags: sse, board]
- #1199 / PR #1224 (merged) — Home artifact freshness is one canonical `home-artifact-v1` key over date/timezone, locale/units, provider state, Alice freshness tokens, event/plan versions, schema, prompt/context, and app-visible client settings; stale-first reads and warmers compare against that same key instead of inventing narrower cache boundaries. Parent: #1195. See: `docs/home-artifact-cache-key.md`. [tags: board, cache, home, perf]
- #457 (merged) — push notifications for activities dedupe per-vendor on `activity_id`; cross-vendor relevance filter prevents the same workout from firing twice when Apple proxies Garmin. [tags: notifications, dedup]
- #349 / PR #352 (merged) — all provider notifications migrated to Cheshire; the old formatting system was killed in one PR. Design rule: migrations that replace a path entirely (no dual-mode) ship as a single PR — dual-mode invites drift. [tags: notifications, cheshire, migration]
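The #379 rule (keep side effects out of the `yield` loop) can be demonstrated with plain `asyncio`. In this sketch the generation and the save run in a background task that feeds a queue, and the streaming generator only relays items, so closing the generator early does not cancel the save. All names are hypothetical and the list `saved` stands in for real persistence.

```python
import asyncio

saved: list[str] = []  # stand-in for durable storage


async def generate_and_persist(queue: asyncio.Queue) -> None:
    """Produce chunks and persist the result; runs independently of the stream."""
    chunks = []
    for index in range(3):
        await asyncio.sleep(0)  # simulate awaiting upstream work
        chunk = f"chunk-{index}"
        chunks.append(chunk)
        queue.put_nowait(chunk)
    saved.append(",".join(chunks))  # the side effect lives in the task
    queue.put_nowait(None)  # sentinel: stream is complete


async def stream(queue: asyncio.Queue):
    """Relay chunks to the client; holds no side effects of its own."""
    while (item := await queue.get()) is not None:
        yield item


async def main() -> str:
    queue: asyncio.Queue = asyncio.Queue()
    task = asyncio.create_task(generate_and_persist(queue))
    relay = stream(queue)
    first = await relay.__anext__()
    await relay.aclose()  # simulate the client disconnecting mid-stream
    await task  # the background task still runs to completion
    return first


first_chunk = asyncio.run(main())
```

Had the `saved.append(...)` call been placed after the loop inside `stream`, closing the generator would have skipped it, which is the failure the entry describes.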
- #785 / PR #786 (merged) — every `/cheshire/ask` call from imperfect-api now hits cross-provider `POST /users/{user_id}/ask` with a `providers: [...]` body (cheshire #168). The old single-vendor URL froze on a primary vendor and RLS-scoped every query to it — for users who reconnected (e.g. Garmin → Terra/WHOOP + Apple) cheshire correctly reported "blackout" while the notification pipeline still shipped a confident push paired with a contradictory sync-blackout card. Multi-provider users now also hit cross-provider `/users/{user_id}/visualize` so the chart spans every linked vendor; single-provider users keep the per-vendor `/visualize` (only path that supports row-scoped GPS routes). Design rules: the cross-provider route is the default; reach for single-vendor only when the feature is intrinsically per-vendor (GPS routes today). Cache `/ask` responses on every routing-relevant input — `(user_id, question, effort, timezone, locale, providers_tuple)` — so simultaneous sleep + activity asks don't collide and a reconnect (providers list change) bypasses the cache, re-seeding cheshire's `user_providers` table on the next live call (a cache keyed on `user_id` alone caused both bugs in review). Never log raw provider refs — vendor user_ids are PII; emit `provider_vendors=[...]` + `provider_count=N` instead. [tags: cheshire, notifications, cross-provider]
- #866 (merged; retired by #933) — user-facing route artifacts used deterministic `dl.imperfect.co` Dub redirects whose expiry tracked the signed GPX artifact; channel surfaces never exposed raw signed cloud-storage URLs, and the first pass favored graceful expiry fallback over a re-signing API until owned route pages replaced the path. [tags: cheshire, route-shares, dub]
- #919 (merged; retired by #933) — route-share Dub metadata used a crawler-friendly social preview rendition, not the full-resolution signed map URL; the owned-page canary preserves the metadata assertion while rejecting signed route-share metadata. See: `app/lib/route_share_canary.py`. [tags: cheshire, route-shares, dub, social-preview]
- #931 (merged; retired by #1043) — route shares used the owned extensionless `/r/{hash}/{slug}` page as the canonical channel URL; route assets stayed under same-origin `/r/...` paths and the canary verified crawler metadata, content types, and deterministic missing/slug redirects before Dub/signed fallbacks could be retired. The route-share reset deletes this hash-based public contract instead of migrating it. Parent: #930. See: `docs/route-share-canary.md`. [tags: cheshire, route-shares, owned-pages, dub]
- #933 (retired by #1043) — pre-reset route shares had exactly one delivery path: the owned `/r/{hash}/{slug}` page link preview. If Cheshire omitted or malformed that owned-page bundle, imperfect-api attached no Dub, GPX, signed map, or media fallback. The reset keeps the page-first principle but replaces the public identity with opaque `/r/{share_id}/{slug}` and deletes the old route objects. Remaining Dub/go-link usage is unrelated to route shares (event short links and channel invites). [tags: cheshire, route-shares, owned-pages, dub]
- #1131 (merged, umbrella) — Route-share presentation has one owner across Cheshire and imperfect-api: Cheshire emits localized presentation packets for preview metadata, delivery mode, and reply context, while imperfect-api channels apply those fields verbatim and treat legacy inference as a logged fallback. Children: #1132 / #1134 / #1133 / cheshire#1055 / #1136 / #1135. Winner: #1150. See: `app/agents/tools/cheshire.py`, `app/types/chat.py`, `docs/public-route-contract.md`. [tags: cheshire, route-shares, channels, social-preview]
- #923 (merged) — channel webhooks ACK only after the inbound turn is durable; long-running reply generation runs in a worker, and accepted bridge deliveries are not retried until transport-side idempotency exists. See: `docs/architecture.md` → Channel bridge inbound + durable worker boundary. [tags: channels, durability, workers]
- #1126 / PR #1129 (merged) — A slow Heylo turn's "I'm on it" hold note (#1073) is edited into the final reply in place (one message; prefixed with the asker's `@`-mention, quote dropped) instead of posting a second quoted message — but only for text-only replies. A reply carrying media (uploaded image or Cheshire chart) keeps the new-message send path, because Heylo's edit reuses the generic `api-update` Cloud Function which silently drops any field other than `content` + `mentions` — verified against live Heylo: a `media` map on an edit does not persist (the call still 200s and stamps `edited:true`). So an edit can only preserve media already on a message, never add/change it; there is no media-on-edit path. See: `app/workers/channels/runner.py::_coach_reply_spec`, `app/lib/channels/CLAUDE.md`; footman #153/#154/#156. [tags: channels, heylo, message-edit, charts]
- #1131 (merged, umbrella) — Route presentation is a versioned packet produced by Cheshire and only relayed by imperfect-api: Cheshire emits a `RouteSharePresentation` (per-option link-preview title/description/image, explicit `delivery.mode` ∈ {`bare_url`, `url_with_text`}, page prose, reply context, slate summary with `option_label_policy`), and imperfect-api + channel bridges consume those named surfaces verbatim instead of inferring social-preview copy or delivery mode from overloaded `description` / `message_text` fields. Design rule: a user-visible route card has exactly one producer; `None` never carries product meaning (delivery mode is explicit), and stale lettered-option picker copy is stripped from model-facing prompts + final summaries unless the user's own message asked for letters. Children: #1134 (the contract + EN/es-MX fixtures + CI drift-guard), cheshire#1055 (Cheshire emits packets, reusing `route_share_preview_description` so API ≡ manifest), #1135 (imperfect-api consumes packets verbatim across both OutboundLink builders, with an explicit logged compat fallback), #1132 (hotfix: link previews use `preview_description`), #1133 (kill the stale lettered-option handoff via a shared `route_share_text` guard + prompt edit), #1136 (e2e channel-transcript gate + delivery-mode-gated bare-URL harness). See: `app/lib/public_routes/share_contract.py`, `app/agents/tools/cheshire.py`, `app/lib/channels/coach.py`. [tags: cheshire, routes, route-presentation, channels, contract]
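The cache-key rule from #785 lists six routing-relevant inputs. A minimal sketch of building such a key follows; the function name is hypothetical, and sorting the provider list so that ordering does not change the key is an assumption the entry does not state.

```python
def ask_cache_key(
    user_id: str,
    question: str,
    effort: str,
    timezone: str,
    locale: str,
    providers: list[str],
) -> tuple:
    """Build a cache key from every routing-relevant input named in #785."""
    # Assumption: provider order is irrelevant, so sort before freezing.
    providers_tuple = tuple(sorted(providers))
    return (user_id, question, effort, timezone, locale, providers_tuple)


sleep_key = ask_cache_key("u1", "how did I sleep?", "low", "UTC", "en", ["garmin"])
activity_key = ask_cache_key("u1", "how was my run?", "low", "UTC", "en", ["garmin"])
reconnect_key = ask_cache_key(
    "u1", "how did I sleep?", "low", "UTC", "en", ["terra", "apple_health"]
)
```

With a key built this way, two different questions for the same user get separate entries, and a change in the linked providers produces a new key so the next call goes to the live service.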
Dev harness (imperfect-cli)
- #465 / PR #467 (merged) — `imperfect-cli serve` boots qdrant + uvicorn (+ opt-in ngrok) from `dev.yaml`; the CLI is the local-prod harness, never a parallel system. Rule (also in `AGENTS.md`): CLI commands always import from `app/` — the moment a CLI grows its own logic it stops being a harness and starts lying about prod behavior. [tags: cli, dev]
- #538 / PR #515 (merged) — `imperfect-cli replay-notification <NOT_id>` re-sends a stored Notification doc verbatim via FCM (reuses `Notification._data_for_platform` + `send_push_notification`); defaults to `--dry-run` so it can't accidentally hit prod devices. Pattern: replay tools default-safe and reuse the prod send path. [tags: cli, notifications]
- #573 / PR #576 (merged) — `imperfect-cli psquery` mirrors alice's `db psql`: two modes only (local + `--prod`), no `--preview` or `--stage` because previews are ephemeral and there is no long-lived staging Postgres. IP-allowlist patch is on-demand (probe first, PATCH only on timeout). [tags: cli, postgres]
- #555 / PR #565 (merged) — `imperfect-cli visualize` mirrors the mobile onboarding wow-screen call (same system prompt, `theme=light`, `effort=max`) — useful for sharable campaign assets with a `--persona` flag to swap user names with deceased painters/writers so PII never leaks. [tags: cli, visualize]
- #536 (merged) — `imperfect-cli onboard-replay` + `/onboard-backfill` skill recover users hit by `memory_agent_no_writes`. Pattern: the harness is the recovery tool for skills triggered by Logfire warnings. [tags: cli, recovery]
- #291 (merged) — `imperfect-cli auth <email>` mints a Firebase custom token so the harness can sign in as any real user; foundation for every other CLI command that operates on prod user state. [tags: cli, auth]
- #1221 (merged) — Apple onboarding Layer A "live driver" (`app/lib/apple_onboarding_live_driver.py`, CLI `apple-onboarding-live-driver`): the no-device backend e2e loop mirrors `route_share_live_driver` exactly — process-global snapshot/restore in a `finally`, in-process worker draining + polled readback with a full-state timeout dump, and an injected `PersonaSeeder` (default shells out to `alice apple emulate --persona`, alice#818) — and reuses #1209's devstack plan + `apple_onboarding_contract` asserts instead of rebuilding. Design rule: Apple is NOT served by `GET /onboarding/{provider}/insights` (garmin/terra only), so the driver computes insight cards via `AppleHealthDataService` + `extract_insights` directly. Parent: #1208. See: `app/lib/apple_onboarding_live_driver.py`. [tags: cli, dev, apple, onboarding]
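The default-safe replay pattern from #538 can be sketched with `argparse`. This is an illustration only: the program name, the `--no-dry-run` opt-out, and the `replay` function are hypothetical and do not describe `imperfect-cli`'s actual flags; the injected `send` callable stands in for the prod send path the entry says should be reused.

```python
import argparse
from typing import Callable


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="replay-sketch")
    parser.add_argument("notification_id")
    # Dry-run is on unless the caller explicitly passes --no-dry-run.
    parser.add_argument(
        "--dry-run", action=argparse.BooleanOptionalAction, default=True
    )
    return parser


def replay(argv: list[str], send: Callable[[str], None]) -> str:
    """Resend a notification only when dry-run was explicitly disabled."""
    args = build_parser().parse_args(argv)
    if args.dry_run:
        return f"dry-run: would resend {args.notification_id}"
    send(args.notification_id)  # injected so the real send path is reused
    return f"sent {args.notification_id}"


sent: list[str] = []
dry_result = replay(["NOT_1"], sent.append)
```

Making the destructive mode the one that needs an extra flag means a mistyped or copy-pasted command cannot reach real devices.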
Observability & dependency health
- #1213 (merged) — OpenAI retention policy is enforced at request construction: Whisper accepts `store=false` but has no retrievable response id to audit after the fact, so every OpenAI call must route through the audited storage-disabled helpers, backed by static guard tests. See: `app/lib/openai/policy.py`. [tags: openai, privacy, tests]
- #910 (merged) — Tests that persist TTL-indexed Mongo documents must derive fixture dates from a controlled clock, not calendar literals; mongomock enforces TTL expiry at insert time, so hard-coded “future” rows rot into branch-wide CI failures once wall time crosses `expires_at`. See: #911. [tags: tests, mongodb, ttl]
- #885 (merged) — local pytest defaults stay hermetic: keep coverage, real Postgres, and telemetry export as explicit CI or opt-in targets, because global pytest addopts plus env-template service URLs can turn a unit suite into a 10-minute external-side-effect hunt. See: issue #883. [tags: tests, perf, postgres, logfire]
- #571 / PR #572, PR #575, PR #578 (merged) — mem0's PostHog telemetry spawns a daemon Consumer thread per `capture_event` call; `MEM0_TELEMETRY=False` doesn't prevent it (the env var only sets `disabled=True` after the client + thread are constructed). Fix: bump `mem0ai>=1.0.11` (lazy singleton, mem0ai/mem0#4535) + set `MEM0_TELEMETRY=0` in `render.yaml` as belt-and-suspenders. Design rule: a vendored library leaking threads at ~72/hour is observable as `futex_wait_queue` count growth — instrument before guessing. [tags: deps, threads, observability]
- #570 (not-planned) — `pydantic-ai[mistral] → mistralai (404 on PyPI)` was assumed to block every `uv lock` refresh, but `pyproject.toml` uses `pydantic-ai-slim[anthropic,google,openai]` — no `[all]`, no `[mistral]`. Verified that a `--no-cache` run resolves clean. Lesson: validate the dependency-graph assumption before declaring a lockfile-wide blocker. [tags: deps, lockfile]
- #556 / PR #560 (merged) — Logfire system-metrics doesn't cover Python-specific runtime gauges (thread count by name, `gc.get_count()`, per-thread frame summaries). Add them in `app/main.py` lifespan on a 30s cadence; this is what identified #571's leak source without ptrace on Render. Design rule: when Render blocks `py-spy`, frame-snapshot in-process. [tags: observability, logfire]
- #553 (merged) — SSE error paths emit a typed `event: error` (matching `TranscriptionError`'s shape) instead of closing the stream silently; the mobile client surfaces specific messages instead of "stream ended unexpectedly". Design rule: streaming endpoints have a typed error event, never silent close. [tags: sse, errors]
- #561 / PR #563 (merged) — pool the Valkey connection via `lru_cache` instead of `redis.from_url(...)` per event in `_get_valkey`; the per-event constructor caused a measurable CPU spike during the morning sync window. [tags: perf, valkey]
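The in-process gauges described in the #556 entry above (thread count by name, `gc.get_count()`, per-thread frame summaries) can be collected with the standard library alone. The sketch below is an assumption-laden illustration, not the code in `app/main.py`: `runtime_snapshot` and its key names are hypothetical, the name-bucketing regex is one possible choice, and the 30s scheduling and Logfire export are left out.

```python
import gc
import re
import sys
import threading
from collections import Counter


def runtime_snapshot() -> dict:
    """Collect Python-level gauges that host-level system metrics miss."""
    # Strip a trailing numeric suffix so "Consumer-12" and "Consumer-13"
    # land in the same bucket; a leaking thread family then shows up as
    # one steadily growing counter.
    by_name = Counter(
        re.sub(r"[-_ ]?\d+$", "", t.name) for t in threading.enumerate()
    )
    # sys._current_frames() maps thread ident -> that thread's current frame,
    # which gives a cheap "where is each thread parked" summary without ptrace.
    top_frames = {
        ident: f"{frame.f_code.co_filename}:{frame.f_lineno}"
        for ident, frame in sys._current_frames().items()
    }
    return {
        "thread_total": threading.active_count(),
        "threads_by_name": dict(by_name),
        "gc_counts": gc.get_count(),
        "top_frames": top_frames,
    }
```

Emitting this dict on a fixed cadence is enough to see which thread-name bucket grows over time, which is the kind of signal the #571 entry relied on.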
Deployment & preview environments
- #234 (merged) — Render preview environments via `render.yaml` in `imperfect-api` (not a dedicated infra repo): Render only triggers previews when the file lives in the repo with code changes. `projects` in `render.yaml` is organizational only and doesn't control previews. [tags: deploy, render, previews]
- #1010 / PR #1011 (merged) — Production Render services only honor `render.yaml` settings when the service is `Blueprint managed`; verify the service header or Blueprint Resources membership before relying on YAML-owned start commands, env groups, or build filters, because plain Git-backed services deploy code but ignore those config fields. See: `render.yaml`, `app/lib/render_uvicorn.py`. [tags: deploy, render, blueprints]
- #990 / PR #1000 (merged) — Render public URLs traverse Cloudflare before Render's private proxy, so public Uvicorn services centralize forwarded-header trust, reject wildcard `*`, refresh Cloudflare ranges at process start with a vendored fallback, and consume the normalized ASGI client address instead of raw `X-Forwarded-For`. Parent: #989. See: `app/lib/render_uvicorn.py`, `app/lib/ip_geo.py`. [tags: deploy, render, cloudflare, public-routes]
- #264 + #267 (merged) — preview cleanup: closing an imperfect-api PR auto-closes the alice PR + destroys both preview environments; the Render API returns the wrong port for private services — patched in `.github/scripts/preview_manager.py`. [tags: previews, alice]
- #438 / PR #439 (merged) — `.github/scripts/gh.py::run_gh` raises on non-zero `gh` exit instead of returning an empty string + logging. The previous swallow meant a stale `CROSS_REPO_PAT` produced a green "manage-previews" run with zero work performed. Design rule: cross-repo automation must fail loud — silent-empty defaults hide auth/network failures. [tags: ci, previews]
- #432 / PR #434 (merged) — Terra webhook destination auto-configures on alice preview lifecycle; preview envs reach feature-parity with prod for vendors that own external webhook endpoints. [tags: previews, terra]
- #539 / PR #548 (merged) — `browser-use` (Playwright + Chromium, ~600 MB – 1.5 GB per agent run) moves to a dedicated `discovery-worker` service. With no concurrency cap, the web path was OOMing the standard-tier 4 GB container at 3–4 concurrent invocations. Side benefit: the web `Dockerfile` no longer needs the 2 GB Playwright base. Design rule: heavy non-HTTP workloads get their own Render service when their per-request memory crosses ~500 MB. [tags: deploy, render, workers]
- #564 (merged) — `discovery-worker` bumped to Render's pro plan as the immediate sequel to #539; the worker pattern from alice (#270's mirror/purge) is canonical for any new vendor heavy-task. [tags: deploy, workers]
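The fail-loud rule from the #438 entry above can be sketched with a small subprocess wrapper. This is a hedged illustration, not the real `run_gh` in `.github/scripts/gh.py`, whose signature is not shown in this index: `run_cli` is a hypothetical name, and raising `RuntimeError` with stderr attached is one possible choice.

```python
import subprocess


def run_cli(args: list[str]) -> str:
    """Run a command and return its stdout; raise on any non-zero exit.

    Returning "" on failure would let callers mistake an auth or network
    error for "nothing to do", so the error is surfaced instead.
    """
    result = subprocess.run(args, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"{args[0]} exited {result.returncode}: {result.stderr.strip()}"
        )
    return result.stdout
```

With this shape, an expired token turns the CI job red at the first call rather than producing a green run that did no work.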
Process & docs
- #591 (open, umbrella) — supersedes the 4-phase event enrichment plan (#592 scaffolding + #593 weather + #594 elevation + #595 finish-times) after the #596 POC validated a single Browser Use Cloud + Opus 4.7 popularity classifier task. #593/#594/#595 (not-planned) — closed in favor of one integrated workstream. Selective enrichment via classifier cuts catalog backfill from ~$380 to ~$100. Design rule: when a cheap POC surfaces a simpler architecture, retract the phased issues and re-issue as one — don't keep the phased plan alive for sunk-cost reasons. [tags: events, process, browser-use]
- #569 (merged) — `app/` reorganization as 6 thin slices: flat `app/<domain>/` siblings of `app/agents/`/`app/workers/`/`app/cli/`/`app/resources/`. `lib/` rule: a file stays in `lib/` only if ≥2 unrelated domains import it for the same generic capability; everything domain-laced gets absorbed during its slice. Each slice is a pure file-move + import-rewrite + lockstep `render.yaml`/`Dockerfile.worker`/`AGENTS.md` update. [tags: refactor, process]
- #491 (merged) — long-form decision records (the `.docs/issue-481/` pattern) live on `main` even when the implementation lives on a closed PR, so the reasoning survives branch cleanup. GitHub's URL parser can't disambiguate slash-bearing branch names against `.docs` paths — raw URLs 404. [tags: docs, process]
- #598 (merged) — ported alice#576 / cheshire#375's doc-infra pattern: added this index, capped `AGENTS.md` at 500 lines via `make lint`, audited a first batch of module docstrings for substantive WHY. Doc relocation rule (where new content goes when the cap fires): `AGENTS.md` = descriptive top-level (modules, env vars, commands, conventions); `docs/architecture.md` (future) = cross-cutting descriptive WHY (system overview, data flow, design decisions); module docstrings = module-local retrospective WHY (why this module's invariant exists, the specific bug that drove the seam); `docs/learnings.md` = retrospective WHY across issues/PRs (the file you're reading). When `make docs-budget` fails, classify the addition by category and relocate accordingly. [tags: docs, process]
- #611 (merged) — completed the post-port content trim: `AGENTS.md` 482 → 275 lines, cap ratcheted 500 → 400. Architecture diagram + OAuth sequence + cross-service rationale moved to `docs/architecture.md`; CLI section condensed to a command table + guardrails + a pointer to source. Design rule (carried forward from #598): the cap only does its job when it's actually ratcheted to the steady-state baseline — a loose cap records the high-water mark instead of preventing regrowth. [tags: docs, process]
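The line cap from the #598 and #611 entries above amounts to a simple count check. The sketch below assumes a shape for that check; it is not the actual `make lint` or `make docs-budget` implementation, which this index does not show, and `lines_over_budget` is a hypothetical helper name.

```python
from pathlib import Path


def lines_over_budget(path: Path, cap: int) -> int:
    """Return how many lines `path` exceeds `cap` by (0 when within budget)."""
    line_count = len(path.read_text(encoding="utf-8").splitlines())
    return max(0, line_count - cap)
```

A lint target could exit non-zero whenever the result is positive. Ratcheting then means lowering `cap` toward the current line count after each trim, so the check guards the new baseline and not the old high-water mark.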