Interpretable Discriminative Text Representations via Agreement and Label Disentanglement

Leo Yang Yang; Tong Wang; Yiqing Xu

arXiv: 2605.20693 · v1 · pith:XG22FBUKnew · submitted 2026-05-20 · cs.CL · cs.AI · stat.ML

Pith reviewed 2026-05-21 05:40 UTC · model grok-4.3

keywords: interpretable text representations, feature discovery, label disentanglement, conceptual clarity, LLM-assisted feature selection, text classification, inter-annotator agreement

The pith

Screening text features for human agreement and label independence produces interpretable coordinates that match strong baselines in accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors aim to make discriminative text representations interpretable by requiring each feature to be conceptually clear to independent annotators and disentangled from the target label. They develop LLM-assisted Feature Discovery to generate candidate features from contrasting text pairs, filter them using agreement scores between language models, and retain those that add predictive power beyond the label itself. On ten classification tasks, this approach performs as well as a text bottleneck baseline but yields features with higher agreement among humans and less leakage of the label information. This provides a practical way to create auditable features for text classification.

Core claim

LLM-assisted Feature Discovery (LFD) generates lexical and semantic features from contrastive pairs of texts with opposed outcomes, screens them via cross-LLM Cohen's kappa to ensure agreement, and selects those with residual predictive gain on held-out data. This produces representations that achieve comparable accuracy to baselines while showing substantially higher human-human and human-LLM agreement and lower label leakage in audits with 232 raters across seven corpora.

What carries the argument

The LLM-assisted Feature Discovery (LFD) process, which proposes features from contrastive outcome-opposed text pairs, applies a cross-LLM Cohen's kappa screen for conceptual clarity, and uses residual held-out predictive gain to ensure label disentanglement.
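The agreement screen described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it computes Cohen's kappa between two annotators (for example, two LLMs applying the same feature definition to the same texts) and keeps a feature only if kappa clears a cutoff. The function names and the 0.6 cutoff are assumptions made for illustration; the page does not state the threshold the paper uses.

```python
from collections import Counter


def cohens_kappa(a, b):
    """Chance-adjusted agreement between two annotators' labels for the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("annotation lists must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of items where both annotators gave the same label.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement under independence, from each annotator's label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()) / (n * n)
    if p_e == 1.0:
        # Both annotators used one identical label throughout; kappa is undefined.
        # Returning 0.0 is a convention chosen here (no evidence beyond chance).
        return 0.0
    return (p_o - p_e) / (1 - p_e)


def passes_screen(a, b, threshold=0.6):
    """Hypothetical screen: keep a feature only if kappa reaches the cutoff."""
    return cohens_kappa(a, b) >= threshold


if __name__ == "__main__":
    llm_1 = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
    llm_2 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
    print(cohens_kappa(llm_1, llm_2), passes_screen(llm_1, llm_2))
```

Kappa rather than raw agreement matters here because a feature that is almost always absent would show high raw agreement even if the annotators never agree on the rare positive cases.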

If this is right

Features can be applied consistently by independent auditors without access to the original model.
Predictive performance remains on par with strong text bottleneck methods.
Human raters judge the features as less likely to leak the target label.
Agreement between human annotators and between humans and LLMs is higher than for baseline concepts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

These agreement-tested features could serve as building blocks for more transparent hybrid human-AI decision systems.
The screening approach highlights how formal reliability checks can bound annotation noise in feature definitions.

Load-bearing premise

That features passing the cross-LLM kappa screen and residual predictive gain test will maintain conceptual clarity and label disentanglement when applied by independent human auditors outside the original development process.

What would settle it

A replication where new human raters, unaware of the development process, apply the LFD features to held-out texts and show no improvement in agreement rates or judge them as equally or more label-entangled compared to the baseline concepts.

Figures

Figures reproduced from arXiv: 2605.20693 by Leo Yang Yang, Tong Wang, Yiqing Xu.

**Figure 1.** Two disentanglement measures of LFD features vs. TBM concepts, pooled across the 10 tasks (lower is better). [Caption truncated in source: "… |ρ(f, y)| > 0.60 on the test set) on 4 of the 10 tasks, the †-flagged cells in …"]

**Figure 2.** Human disentanglement rubric: mean rating per task.
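The Figure 1 caption mentions a test-set condition of the form |ρ(f, y)| > 0.60. A rough sketch of such a check follows, assuming ρ is the Pearson correlation between a feature's values and the label; the page does not confirm which correlation the paper uses, and the function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    if s_xx == 0 or s_yy == 0:
        # A constant column has no defined correlation; treat it as 0 here.
        return 0.0
    return s_xy / math.sqrt(s_xx * s_yy)


def flag_label_entangled(features, labels, threshold=0.60):
    """Map each feature name to True if |rho(feature, label)| exceeds the threshold."""
    return {
        name: abs(pearson(values, labels)) > threshold
        for name, values in features.items()
    }


if __name__ == "__main__":
    y = [1, 1, 1, 0, 0, 0, 1, 0]
    feats = {
        "paraphrases_label": [1, 1, 1, 0, 0, 0, 1, 0],  # identical to the label
        "weak_signal": [1, 0, 1, 0, 1, 0, 1, 0],
    }
    print(flag_label_entangled(feats, y))
```

A feature that simply restates the target correlates almost perfectly with it and is flagged, while a feature with only partial overlap is not.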

Original abstract

Interpretable text representations should expose coordinates that are not only predictive, but also meaningful enough for independent auditors to apply. Existing discriminative representations often use anonymous embedding directions, while concept-bottleneck and LLM-assisted methods attach natural-language names to features without ensuring that those definitions are reproducible or distinct from the target label. We propose an operational criterion for interpretable discriminative text representations: each coordinate should satisfy conceptual clarity, measured by chance-adjusted agreement between independent annotators applying the feature definition, and label disentanglement, meaning the feature should not merely paraphrase the prediction target. We instantiate this criterion in LLM-assisted Feature Discovery (LFD), an iterative method that proposes lexical and semantic features from contrastive outcome-opposed text pairs, screens candidates using cross-LLM Cohen's κ, and selects features by residual held-out predictive gain. A stylized analysis connects the κ screen to a per-feature annotation-noise bound, formalizing agreement as a reliability check. Across ten text-classification tasks spanning seven corpora, LFD matches the predictive performance of a strong text bottleneck baseline while producing substantially clearer and less label-entangled features. Human audits with 232 raters show that LFD features achieve higher human-human and human-LLM agreement than baseline concepts, and raters consistently judge them as less label-leaking. These results suggest that agreement-tested, label-disentangled coordinates provide a practical auditability standard for interpretable text classification.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

They operationalize interpretable text features with agreement and disentanglement, and it mostly works but leakage is still possible.

The punchline is that they have turned interpretability into something you can actually measure with agreement scores and a check against label leakage, and the method holds its own on real tasks. They define conceptual clarity via chance-adjusted annotator agreement and label disentanglement as not just echoing the target. Then they build LFD around proposing features from contrastive pairs, screening with cross-LLM kappa, and selecting on residual held-out gain. That combination and the iterative loop feel new compared to standard concept bottleneck work.

On the positive side, it matches baseline accuracy on ten tasks from seven corpora while getting stronger human agreement from 232 raters and lower judgments of label leaking. The noise-bound analysis for the kappa screen adds some formality to the reliability argument.

The soft spot is the one the stress test flags. Even after kappa and residual filtering, there could be leftover semantic ties to the label that only show up to auditors outside the development loop. The human audits back the disentanglement claim, but without a bound on residual mutual information the guarantee is more empirical than theoretical.

This paper is for people in applied NLP who need auditable features for things like moderation or clinical notes. Anyone working on making discriminative models more transparent will find the criterion and the human evaluation results useful. It shows honest engagement with the interpretability literature and has enough substance to go through peer review. I would send it to referees rather than desk reject.

Referee Report

2 major / 2 minor

Summary. The paper proposes an operational criterion for interpretable discriminative text representations requiring conceptual clarity (via cross-annotator Cohen's kappa) and label disentanglement (features should not paraphrase the target label). It instantiates this via LLM-assisted Feature Discovery (LFD), which generates candidate features from contrastive outcome-opposed pairs, screens them with cross-LLM kappa, and retains those with positive residual held-out predictive gain. Across ten text-classification tasks on seven corpora, LFD is claimed to match a strong text-bottleneck baseline in predictive performance while yielding clearer, less label-entangled features; human audits with 232 raters confirm higher human-human and human-LLM agreement and lower perceived label leakage.

Significance. If the empirical claims and selection procedure hold under independent scrutiny, the work supplies a concrete, auditability-focused standard for feature discovery in text classification that bridges LLM-assisted concept generation with reproducibility requirements. The combination of kappa-based reliability screening and residual-gain selection, together with the stylized noise-bound analysis, offers a practical template that could be adopted beyond the reported tasks.

major comments (2)

[Method and analysis sections] The stylized noise-bound analysis (mentioned in the abstract) formalizes per-feature annotation reliability from the kappa screen but does not derive a bound on residual mutual information between the selected coordinate and the target label after the full kappa-plus-residual-gain procedure; this leaves the central label-disentanglement claim vulnerable to subtle lexical or semantic correlations that survive the internal filters yet become detectable by independent human raters.
[Experimental evaluation] The abstract states that LFD matches baseline predictive performance while producing less label-entangled features, yet provides no details on data splits, exact definitions of residual held-out predictive gain, or statistical tests for the human-audit comparisons; without these, it is impossible to assess whether post-hoc feature exclusions or fitting choices inflate the reported human-agreement advantages.

minor comments (2)

[Method] Notation for the residual predictive gain term should be introduced with an explicit equation rather than described only in prose.
[Human evaluation] The human-audit protocol would benefit from a table listing the exact rating scales and instructions given to the 232 raters.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the positive assessment of the work's potential impact. We address each major comment below, providing clarifications and indicating revisions where they strengthen the manuscript without altering its core claims.

Referee: [Method and analysis sections] The stylized noise-bound analysis (mentioned in the abstract) formalizes per-feature annotation reliability from the kappa screen but does not derive a bound on residual mutual information between the selected coordinate and the target label after the full kappa-plus-residual-gain procedure; this leaves the central label-disentanglement claim vulnerable to subtle lexical or semantic correlations that survive the internal filters yet become detectable by independent human raters.

Authors: The stylized analysis connects the cross-LLM kappa screen to a per-feature annotation-noise bound, formalizing reliability as a prerequisite for interpretability. Label disentanglement is enforced operationally by the residual held-out predictive gain, which retains only those features that improve held-out accuracy beyond a model using the target label alone. This selection step is intended to exclude features whose predictive value derives primarily from paraphrasing the label. While a closed-form bound on residual mutual information after the combined kappa-plus-gain procedure is not derived, the human-audit results (higher agreement and lower perceived leakage across 232 raters) provide direct empirical evidence that surviving correlations are limited in practice. We will add a short discussion of this theoretical gap and its empirical mitigation in the revised analysis section.

Revision: partial

Referee: [Experimental evaluation] The abstract states that LFD matches baseline predictive performance while producing less label-entangled features, yet provides no details on data splits, exact definitions of residual held-out predictive gain, or statistical tests for the human-audit comparisons; without these, it is impossible to assess whether post-hoc feature exclusions or fitting choices inflate the reported human-agreement advantages.

Authors: We agree that explicit reporting of these elements is necessary for reproducibility. The revised manuscript will add a dedicated experimental-details subsection (or appendix) that specifies: (i) the train/validation/test splits for each of the seven corpora, (ii) the exact definition of residual held-out predictive gain as the accuracy increment on held-out data when the candidate feature is added to a label-only baseline, and (iii) the statistical procedures used for the human-audit comparisons (including the tests applied to agreement and leakage ratings). We confirm that feature selection followed only the pre-specified kappa and residual-gain criteria with no additional post-hoc exclusions; this will be stated explicitly to rule out inflation concerns.

Revision: yes
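The rebuttal describes residual held-out predictive gain as an accuracy increment on held-out data when a candidate feature is added to a baseline. One possible reading is sketched below, under loudly stated assumptions: the "label-only baseline" is taken to mean a predictor that knows only the training label distribution (it always predicts the majority class), features are discrete, and the classifier is a simple lookup table. None of these choices are confirmed by the page, and every name in the code is hypothetical.

```python
from collections import Counter, defaultdict


def fit_table(rows, labels):
    """Learn the majority label for each feature pattern seen in training."""
    by_pattern = defaultdict(Counter)
    for row, y in zip(rows, labels):
        by_pattern[row][y] += 1
    fallback = Counter(labels).most_common(1)[0][0]  # global majority class
    table = {pattern: c.most_common(1)[0][0] for pattern, c in by_pattern.items()}
    return table, fallback


def accuracy(model, rows, labels):
    """Fraction of rows whose looked-up prediction matches the label."""
    table, fallback = model
    hits = sum(table.get(row, fallback) == y for row, y in zip(rows, labels))
    return hits / len(labels)


def _rows(columns, n):
    """Turn a list of feature columns into n row tuples (empty tuples if no columns)."""
    return [tuple(col[i] for col in columns) for i in range(n)]


def residual_gain(selected, candidate, y_train, y_test):
    """Held-out accuracy with the candidate minus held-out accuracy without it.

    Each feature is a (train_column, test_column) pair. With no selected
    features, the baseline reduces to predicting the training majority class.
    """
    def held_out_acc(cols):
        train_rows = _rows([tr for tr, _ in cols], len(y_train))
        test_rows = _rows([te for _, te in cols], len(y_test))
        return accuracy(fit_table(train_rows, y_train), test_rows, y_test)

    return held_out_acc(selected + [candidate]) - held_out_acc(selected)


if __name__ == "__main__":
    y_train, y_test = [1, 1, 1, 1, 0, 0, 0], [1, 0, 1, 0]
    informative = ([1, 1, 1, 0, 0, 0, 0], [1, 0, 1, 0])
    constant = ([1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1])
    print(residual_gain([], informative, y_train, y_test))
    print(residual_gain([informative], constant, y_train, y_test))
```

A selection loop would keep a candidate only when this gain is positive, so a feature that adds nothing beyond what is already selected is dropped.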

Circularity Check

0 steps flagged

No significant circularity detected

The paper proposes LFD by generating candidate features from contrastive pairs, screening via cross-LLM Cohen's kappa for conceptual clarity, and retaining those with positive residual held-out predictive gain for label disentanglement. Predictive performance is compared against an external text-bottleneck baseline on held-out test sets across ten tasks, while clarity and disentanglement are measured by separate human audits with 232 independent raters applying the definitions without access to the original generation process. The stylized noise-bound analysis formalizes the kappa screen as a reliability check but does not equate the final human-audited outcomes or performance parity to the internal selection criteria by construction. All load-bearing empirical claims rest on external data, annotators, and baselines rather than self-referential reduction.
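To make the link between a kappa screen and annotation noise concrete, here is a short worked derivation under a textbook noise model. It is an illustration only and is not claimed to be the paper's Proposition 1. Assumptions: a binary feature with true prevalence \pi; two annotators who each flip the true value independently with the same probability \varepsilon < 1/2; and \kappa taken as the population value rather than a sample estimate.

```latex
% Illustrative symmetric-noise model; symbols \pi, \varepsilon, q, \kappa_0 are assumptions.
\begin{align*}
p_o &= (1-\varepsilon)^2 + \varepsilon^2 = 1 - 2\varepsilon(1-\varepsilon), \\
q   &= \pi(1-\varepsilon) + (1-\pi)\varepsilon, \qquad
p_e = q^2 + (1-q)^2 = 1 - 2q(1-q), \\
\kappa &= \frac{p_o - p_e}{1 - p_e}
        = 1 - \frac{\varepsilon(1-\varepsilon)}{q(1-q)}
        \le 1 - 4\varepsilon(1-\varepsilon) = (1 - 2\varepsilon)^2
        \quad \text{since } q(1-q) \le \tfrac{1}{4}, \\
\kappa \ge \kappa_0 \ge 0 \;&\Longrightarrow\; \varepsilon \le \frac{1 - \sqrt{\kappa_0}}{2}.
\end{align*}
```

For example, under these assumptions a screen at \kappa_0 = 0.64 would imply a per-annotator flip rate of at most 0.1.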

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The method relies on standard LLM prompting, Cohen's kappa, and held-out evaluation, all drawn from prior literature.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We propose an operational criterion... conceptual clarity, measured by chance-adjusted agreement... and label disentanglement"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "Proposition 1... cross-rater κ bounds per-feature annotation noise"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

Intrinsic dimen- sionality explains the effectiveness of language model fine-tuning

Armen Aghajanyan, Luke Zettlemoyer, and Sonal Gupta. Intrinsic dimensionality explains the effectiveness of language model fine-tuning.arXiv preprint arXiv:2012.13255, 2020

work page arXiv 2012

[2] [2]

Probing classifiers: Promises, shortcomings, and advances.Computational Linguistics, 48(1):207–219, 2022

Yonatan Belinkov. Probing classifiers: Promises, shortcomings, and advances.Computational Linguistics, 48(1):207–219, 2022

work page 2022

[3] [3]

Latent Dirichlet allocation.Journal of Machine Learning Research, 3:993–1022, 2003

David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet allocation.Journal of Machine Learning Research, 3:993–1022, 2003

work page 2003

[4] [4]

Peirce’s theory of abduction.Philosophy of science, 13(4):301–306, 1946

Arthur W Burks. Peirce’s theory of abduction.Philosophy of science, 13(4):301–306, 1946

work page 1946

[5] [5]

A coefficient of agreement for nominal scales.Educational and Psychological Measurement, 20(1):37–46, 1960

Jacob Cohen. A coefficient of agreement for nominal scales.Educational and Psychological Measurement, 20(1):37–46, 1960

work page 1960

[6] [6]

Construct validity in psychological tests.Psychological bulletin, 52(4): 281, 1955

Lee J Cronbach and Paul E Meehl. Construct validity in psychological tests.Psychological bulletin, 52(4): 281, 1955

work page 1955

[7] [7]

Analyzing redundancy in pretrained transformer models

Fahim Dalvi, Hassan Sajjad, Nadir Durrani, and Yonatan Belinkov. Analyzing redundancy in pretrained transformer models. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4908–4926, 2020

work page 2020

[8] [8]

BERT: Pre-training of deep bidirectional transformers for language understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. InProceedings of NAACL-HLT, pages 4171–4186, 2019

work page 2019

[9] [9]

ChatGPT outperforms crowd workers for text- annotation tasks.Proceedings of the National Academy of Sciences, 120(30):e2305016120, 2023

Fabrizio Gilardi, Meysam Alizadeh, and Maël Kubli. ChatGPT outperforms crowd workers for text- annotation tasks.Proceedings of the National Academy of Sciences, 120(30):e2305016120, 2023

work page 2023

[10] [10]

BERTopic: Neural topic modeling with a class-based TF-IDF procedure

Maarten Grootendorst. BERTopic: Neural topic modeling with a class-based TF-IDF procedure.arXiv preprint arXiv:2203.05794, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[11] [11]

Editing Models with Task Arithmetic

Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic. InInternational Conference on Learning Representations (ICLR), 2023. arXiv:2212.04089

work page internal anchor Pith review Pith/arXiv arXiv 2023

[12] [12]

LightGBM: A highly efficient gradient boosting decision tree

Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. LightGBM: A highly efficient gradient boosting decision tree. InAdvances in Neural Information Processing Systems, volume 30, pages 3146–3154, 2017

work page 2017

[13] [13]

Concept bottleneck models

Pang Wei Koh, Thao Nguyen, Yew Siang Tang, Stephen Mussmann, Emma Pierson, Been Kim, and Percy Liang. Concept bottleneck models. InInternational Conference on Machine Learning, pages 5338–5348. PMLR, 2020

work page 2020

[14] [14]

Sage Publications, 4 edition, 2018

Klaus Krippendorff.Content Analysis: An Introduction to Its Methodology. Sage Publications, 4 edition, 2018

work page 2018

[15] [15]

The measurement of observer agreement for categorical data

J Richard Landis and Gary G Koch. The measurement of observer agreement for categorical data. Biometrics, 33(1):159–174, 1977

work page 1977

[16] [16]

When is a liability not a liability? Textual analysis, dictionaries, and 10-Ks.The Journal of Finance, 66(1):35–65, 2011

Tim Loughran and Bill McDonald. When is a liability not a liability? Textual analysis, dictionaries, and 10-Ks.The Journal of Finance, 66(1):35–65, 2011

work page 2011

[17] [17]

Interpretable-by-design text understanding with iteratively generated concept bottleneck.arXiv preprint arXiv:2310.19660, 2024

Josh Magnus Ludan, Qing Lyu, Yue Yang, Liam Dugan, Mark Yatskar, and Chris Callison-Burch. Interpretable-by-design text understanding with iteratively generated concept bottleneck.arXiv preprint arXiv:2310.19660, 2024

work page arXiv 2024

[18] [18]

Promises and pitfalls of black-box concept learning models

Anita Mahinpei, Justin Clark, Isaac Lage, Finale Doshi-Velez, and Weiwei Pan. Promises and pitfalls of black-box concept learning models.arXiv preprint arXiv:2106.13314, 2021. 10

work page arXiv 2021

[19] Andrei Margeloiu, Matthew Ashman, Umang Bhatt, Yanzhi Chen, Mateja Jamnik, and Adrian Weller. Do concept bottleneck models learn as intended? In ICLR 2021 Workshop on Responsible AI, 2021.

[20] John Stuart Mill. A System of Logic, Ratiocinative and Inductive. John W. Parker, London, 1843.

[21] Jiaqi Mu and Pramod Viswanath. All-but-the-top: Simple and effective postprocessing for word representations. In International Conference on Learning Representations (ICLR), 2018.

[22] Soichiro Murakami, Peinan Zhang, Hidetaka Kamigaito, Hiroya Takamura, and Manabu Okumura. AdParaphrase: Paraphrase dataset for analyzing linguistic features toward generating attractive ad texts. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 1426–1439, 2025.

[23] Nagarajan Natarajan, Inderjit S. Dhillon, Pradeep K. Ravikumar, and Ambuj Tewari. Learning with noisy labels. In Advances in Neural Information Processing Systems, 2013.

[24] Tuomas Oikarinen, Subhro Das, Lam M Nguyen, and Tsui-Wei Weng. Label-free concept bottleneck models. In International Conference on Learning Representations, 2023.

[25] Silviu Oprea and Walid Magdy. iSarcasm: A dataset of intended sarcasm. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1279–1289, 2020.

[26] Myle Ott, Yejin Choi, Claire Cardie, and Jeffrey T Hancock. Finding deceptive opinion spam by any stretch of the imagination. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, pages 309–319, 2011.

[27] Kiho Park, Yo Joong Choe, and Victor Veitch. The linear representation hypothesis and the geometry of large language models. arXiv preprint arXiv:2311.03658, 2023.

[28] Giorgio Patrini, Alessandro Rozza, Aditya Krishna Menon, Richard Nock, and Lizhen Qu. Making deep neural networks robust to label noise: A loss correction approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1944–1952, 2017.

[29] James W Pennebaker, Martha E Francis, Roger J Booth, et al. Linguistic inquiry and word count: LIWC 2001. Mahwah: Lawrence Erlbaum Associates, 71(2001):2001, 2001.

[30] Vikram V. Ramaswamy, Sunnie S. Y. Kim, Ruth Fong, and Olga Russakovsky. Overlooked factors in concept-based explanations: Dataset choice, concept learnability, and human capability. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.

[31] Margaret E Roberts, Brandon M Stewart, Dustin Tingley, Christopher Lucas, Jetson Leder-Luis, Shana Kushner Gadarian, Bethany Albertson, and David G Rand. Structural topic models for open-ended survey responses. American Journal of Political Science, 58(4):1064–1082, 2014.

[32] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.

[33] Petter Törnberg. ChatGPT-4 outperforms experts and crowd workers for annotating political Twitter messages with zero-shot learning. arXiv preprint arXiv:2304.06588, 2023.

[34] Yue Yang, Artemis Panagopoulou, Shenghao Zhou, Daniel Jin, Chris Callison-Burch, and Mark Yatskar. Language in a bottle: Language model guided concept bottlenecks for interpretable image classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19187–19197, 2023.

[35] Mert Yuksekgonul, Maggie Wang, and James Zou. Post-hoc concept bottleneck models. In International Conference on Learning Representations, 2023.

[36] Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, volume 28, pages 649–657, 2015.