Hare paired-turn eval (#969)
Date: 2026-06-09
Runner: one-off Hare extraction comparison using the sampled contextual-reply
fixtures from imperfect-co/hare#131. The run used the sibling Hare repo with
MODEL_PROVIDER=bedrock.
Purpose: compare the baseline (user-only source text) against the treatment (the same source text plus the paired assistant context from the dossier event payload). Assistant text is context only; the user reply remains the authoritative evidence.
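
The two arms can be sketched as follows. This is a minimal illustration, not the Hare runner's actual code: `build_source_text`, the `User:` / `Assistant (context only):` labels, and the sample strings are all hypothetical and are not taken from the fixtures.

```python
from typing import Optional


def build_source_text(user_reply: str, assistant_context: Optional[str] = None) -> str:
    """Return the text handed to extraction for one case (hypothetical helper)."""
    if assistant_context is None:
        # Baseline arm: the user reply alone.
        return f"User: {user_reply}"
    # Treatment arm: the paired assistant turn is prepended as context only.
    # The user reply stays last and remains the authoritative evidence.
    return f"Assistant (context only): {assistant_context}\nUser: {user_reply}"


# Made-up example strings, for illustration only.
baseline = build_source_text("No")
treatment = build_source_text("No", assistant_context="Is your knee still injured?")
```

With only the baseline text, a bare "No" has no target to negate; the treatment text supplies the question it answers while leaving the user reply unchanged.
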
| Case | Baseline | Treatment | Result |
|---|---|---|---|
| still_injured_no | skip/not_durable | update/injury_history | Paired context resolves the negation target. |
| still_hurts_override | skip/not_durable | update/injury_history | Paired context resolves the affirmative injury target. |
| unrelated_no_control | skip/not_durable | skip/not_durable | Control still abstains. |
| multi_question_no_control | skip/not_durable | skip/not_durable | Ambiguous multi-question control still abstains. |
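
As a quick check on the table, the following sketch tallies which cases changed between arms. The outcomes are copied from the rows above; the `results` dictionary and variable names are illustrative and are not part of the Hare repo.

```python
# Case name -> (baseline outcome, treatment outcome), as reported in the table.
results = {
    "still_injured_no": ("skip/not_durable", "update/injury_history"),
    "still_hurts_override": ("skip/not_durable", "update/injury_history"),
    "unrelated_no_control": ("skip/not_durable", "skip/not_durable"),
    "multi_question_no_control": ("skip/not_durable", "skip/not_durable"),
}

# Cases where paired context changed the extraction decision.
changed = [case for case, (base, treat) in results.items() if base != treat]
# Cases where both arms agree (here, the two controls).
unchanged = [case for case, (base, treat) in results.items() if base == treat]

print("changed:", changed)
print("unchanged:", unchanged)
```

Both injury cases move from skip to update, and both controls stay at skip in either arm.
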
No production replay was executed.