Agree, Disagree, Explain: Decomposing Human Label Variation in NLI through the Lens of Explanations
3 Pith papers cite this work.
abstract
Natural Language Inference (NLI) datasets often exhibit human label variation. To better understand these variations, explanation-based approaches analyze the underlying reasoning behind annotators' decisions. One such approach is the LiTEx taxonomy, which categorizes free-text explanations in English into reasoning categories. However, previous work applying LiTEx has focused on within-label variation: cases where annotators agree on the NLI label but provide different explanations. This paper broadens the scope by examining how annotators may diverge not only in the reasoning category but also in the labeling. We use explanations as a lens to analyze variation in NLI annotations and to examine individual differences in reasoning. We apply LiTEx to two NLI datasets and align annotation variation from multiple aspects: NLI label agreement, explanation similarity, and taxonomy agreement, with an additional compounding factor of annotators' selection bias. We observe instances where annotators disagree on the label but provide similar explanations, suggesting that surface-level disagreement may mask underlying agreement in interpretation. Moreover, our analysis reveals individual preferences in explanation strategies and label choices. These findings highlight that agreement in reasoning categories better reflects the semantic similarity of explanations than label agreement alone. Our findings underscore the richness of reasoning-based explanations and the need for caution in treating labels as ground truth.
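The alignment the abstract describes (label agreement, taxonomy agreement, and explanation similarity compared per annotator pair) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy annotations are invented, `cat_A` and `cat_B` are placeholders rather than real LiTEx categories, and token-overlap (Jaccard) similarity stands in for whatever similarity measure the paper actually uses.

```python
from itertools import combinations

# Toy annotations for a single premise-hypothesis pair (invented, not from the paper).
# "category" values are placeholders, not actual LiTEx category names.
annotations = [
    {"annotator": "a1", "label": "entailment", "category": "cat_A",
     "explanation": "the man is outdoors so he is outside"},
    {"annotator": "a2", "label": "neutral", "category": "cat_A",
     "explanation": "the man is outdoors but maybe not outside the house"},
    {"annotator": "a3", "label": "entailment", "category": "cat_B",
     "explanation": "walking implies movement"},
]


def jaccard(a, b):
    """Token-overlap similarity; a crude stand-in for a real similarity metric."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def pairwise_rows(items):
    """Build one row per annotator pair with the three aspects side by side."""
    rows = []
    for x, y in combinations(items, 2):
        rows.append({
            "pair": (x["annotator"], y["annotator"]),
            "label_agree": x["label"] == y["label"],
            "category_agree": x["category"] == y["category"],
            "similarity": jaccard(x["explanation"], y["explanation"]),
        })
    return rows


def mean_similarity(rows, key, value):
    """Average explanation similarity over pairs where rows[key] equals value."""
    sims = [r["similarity"] for r in rows if r[key] == value]
    return sum(sims) / len(sims) if sims else None


rows = pairwise_rows(annotations)
for row in rows:
    print(row)
print("mean similarity, same category:", mean_similarity(rows, "category_agree", True))
print("mean similarity, same label:", mean_similarity(rows, "label_agree", True))
```

In this toy data, annotators a1 and a2 disagree on the label yet share a category and overlapping wording, which is the kind of case the abstract highlights. The numbers themselves carry no empirical weight.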
years
2026: 3
verdicts
UNVERDICTED: 3
representative citing papers
CIE-Scorer detects unfaithful CoT by tracing compact sentence-level circuits, building internal-external reasoning graphs, and scoring their discrepancy with Fused Gromov-Wasserstein distance, reporting SOTA results on FaithCoT-Bench with reduced circuit cost.
Annotation disagreement on toxic language can be moderately predicted from textual features, with high-opposition items proving harder for models to estimate accurately.
citing papers explorer
- Human Label Variation as Stable Signal: Learning Annotator-Specific Explanation Behavior via Cross-Annotator Preference Optimization
  LLMs can learn annotator-specific label-explanation behavior from human label variation via cross-annotator preference optimization, outperforming prompting and standard fine-tuning on two sentence-pair tasks.
- Detecting Unfaithful Chain-of-Thought via Circuit-Guided Internal-External Discrepancy
  CIE-Scorer detects unfaithful CoT by tracing compact sentence-level circuits, building internal-external reasoning graphs, and scoring their discrepancy with Fused Gromov-Wasserstein distance, reporting SOTA results on FaithCoT-Bench with reduced circuit cost.
- Quantifying and Predicting Disagreement in Graded Human Ratings
  Annotation disagreement on toxic language can be moderately predicted from textual features, with high-opposition items proving harder for models to estimate accurately.