Hare paired-turn eval (#969)

Date: 2026-06-09

Runner: one-off Hare extraction comparison using the sampled contextual-reply fixtures from imperfect-co/hare#131. The run used the sibling Hare repo with MODEL_PROVIDER=bedrock.

Purpose: compare the baseline (user-only source text) against the treatment (user source text plus the paired assistant context from the dossier event payload). Assistant text is context only; the user reply remains the authoritative evidence.
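As a rough illustration of the two arms, the sketch below shows one way the baseline and treatment inputs could be assembled. It is not Hare's actual code: the `Turn` type, `build_source_text`, the bracketed labels, and the sample question are all hypothetical, chosen only to show that the assistant text is attached as labeled context while the user reply stays the evidence.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Turn:
    """Hypothetical container for one sampled fixture turn."""
    user_reply: str
    assistant_context: Optional[str] = None


def build_source_text(turn: Turn, paired: bool) -> str:
    """Return the text handed to extraction for one arm of the comparison.

    paired=False -> baseline: the user reply alone.
    paired=True  -> treatment: the user reply plus the assistant turn,
                    labeled as context only (assumed label format).
    """
    if not paired or not turn.assistant_context:
        # Without paired context, both arms see the same user-only text.
        return turn.user_reply
    return (
        "[context only, not evidence] assistant: " + turn.assistant_context + "\n"
        "[authoritative] user: " + turn.user_reply
    )


# Hypothetical fixture content, not taken from the real fixtures.
turn = Turn(user_reply="no", assistant_context="Is your knee still injured?")
baseline_text = build_source_text(turn, paired=False)
treatment_text = build_source_text(turn, paired=True)
```

In this sketch the baseline arm sees only "no", which carries no target on its own, while the treatment arm sees the question the "no" answers.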

| Case | Baseline | Treatment | Result |
| --- | --- | --- | --- |
| still_injured_no | skip/not_durable | update/injury_history | Paired context resolves the negation target. |
| still_hurts_override | skip/not_durable | update/injury_history | Paired context resolves the affirmative injury target. |
| unrelated_no_control | skip/not_durable | skip/not_durable | Control still abstains. |
| multi_question_no_control | skip/not_durable | skip/not_durable | Ambiguous multi-question control still abstains. |
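The expectation behind these rows can be written as a small check: non-control cases should change between arms, and control cases should keep abstaining. The sketch below encodes the four outcomes reported above; the `RESULTS` mapping and helper names are hypothetical, and treating the `_control` suffix as the control marker is an assumption based on the case names.

```python
# Outcomes copied from the results above: case -> (baseline, treatment).
RESULTS = {
    "still_injured_no": ("skip/not_durable", "update/injury_history"),
    "still_hurts_override": ("skip/not_durable", "update/injury_history"),
    "unrelated_no_control": ("skip/not_durable", "skip/not_durable"),
    "multi_question_no_control": ("skip/not_durable", "skip/not_durable"),
}

ABSTAIN = "skip/not_durable"


def changed_cases(results):
    """Cases whose outcome differs between baseline and treatment."""
    return sorted(case for case, (base, treat) in results.items() if base != treat)


def controls_abstain(results):
    """True if every case named *_control abstains in both arms (assumed convention)."""
    return all(
        base == ABSTAIN and treat == ABSTAIN
        for case, (base, treat) in results.items()
        if case.endswith("_control")
    )


changed = changed_cases(RESULTS)
controls_ok = controls_abstain(RESULTS)
```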

No production replay was executed.