What's In My Human Feedback? Learning Interpretable Descriptions of Preference Data

Emma Pierson; Rajiv Movva; Sewon Min; Smitha Milli

arxiv: 2510.26202 · v2 · submitted 2025-10-30 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-18 03:36 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords: human feedback, sparse autoencoders, preference data, interpretability, AI alignment, data curation, personalization

The pith

Sparse autoencoders extract a small number of interpretable features that capture most of the preference signal in human feedback data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors introduce WIMHF, a method that automatically extracts interpretable descriptions from human preference data without pre-specifying what to look for. Their central claim is that sparse autoencoders can find a small number of features that account for most of the predictive power black-box models achieve on this data. If the claim holds, it gives practitioners a clear view of what feedback actually teaches models, which helps avoid unwanted behaviors and enables targeted fixes such as re-labeling harmful examples. Results across seven datasets support the claim and show that the features capture real differences in what different groups of annotators prefer.

Core claim

WIMHF is a method to explain feedback data using sparse autoencoders. It characterizes both the preferences a dataset is capable of measuring and the preferences that the annotators actually express. Across 7 datasets, WIMHF identifies a small number of human-interpretable features that account for the majority of the preference prediction signal achieved by black-box models.

What carries the argument

Sparse autoencoders trained on preference data to recover a small set of sparse, human-interpretable features that represent what the data encodes.
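
As a rough illustration of this machinery, the sketch below runs one forward pass of a sparse autoencoder in plain Python. It is a minimal sketch under stated assumptions, not the authors' implementation: the input `x` stands in for some embedding of a preference pair, sparsity is enforced by keeping the top `k` activations rather than by an L1 penalty, and the names `topk_sae_forward`, `W_enc`, `b_enc`, and `W_dec` are hypothetical.

```python
import random

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def topk_sae_forward(x, W_enc, b_enc, W_dec, k):
    """Encode x into at most k active features, then reconstruct it.

    W_enc: m rows of length d (encoder), b_enc: length m,
    W_dec: m rows of length d (one decoder direction per feature).
    All shapes and names are assumptions made for this sketch.
    """
    # ReLU pre-activations, one per dictionary feature.
    pre = [max(0.0, p + b) for p, b in zip(matvec(W_enc, x), b_enc)]
    # Keep only the k largest activations; zero out the rest.
    keep = set(sorted(range(len(pre)), key=lambda j: pre[j], reverse=True)[:k])
    z = [pre[j] if j in keep else 0.0 for j in range(len(pre))]
    # The reconstruction is a sparse combination of decoder directions.
    x_hat = [sum(z[j] * W_dec[j][i] for j in keep) for i in range(len(x))]
    return z, x_hat

# Toy usage with random weights (illustrative only; nothing is trained here).
random.seed(0)
d, m, k = 8, 16, 4
W_enc = [[random.gauss(0, 1) for _ in range(d)] for _ in range(m)]
W_dec = [[random.gauss(0, 1) for _ in range(d)] for _ in range(m)]
b_enc = [0.0] * m
x = [random.gauss(0, 1) for _ in range(d)]
z, x_hat = topk_sae_forward(x, W_enc, b_enc, W_dec, k)
```

In a trained model, each nonzero entry of `z` would correspond to one candidate feature to be described in natural language.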

If this is right

Features reveal wide diversity in preferences, such as Reddit users favoring informality and jokes while HH-RLHF and PRISM annotators disfavor them.
The method surfaces potentially unsafe preferences, including LMArena users voting against refusals, often in favor of toxic content.
Re-labeling the harmful Arena examples identified by the features yields a large safety gain (+37%) with no cost to general performance.
Learning annotator-specific weights over the subjective features improves preference prediction on datasets like Community Alignment.
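
The re-labeling step in the list above can be sketched as follows. This is a minimal sketch assuming each preference pair comes with a precomputed activation for the feature of interest; the function name `flip_top_activating`, the dictionary keys, and the `threshold` parameter are hypothetical and not taken from the paper.

```python
def flip_top_activating(pairs, activations, max_flips, threshold=0.0):
    """Swap chosen/rejected for the pairs that most strongly activate a feature.

    pairs: list of dicts with "chosen" and "rejected" keys.
    activations: one feature activation per pair (hypothetical input).
    Returns a new list; the input list and its dicts are not modified.
    """
    # Rank candidate pairs by activation, strongest first.
    candidates = [i for i, a in enumerate(activations) if a > threshold]
    candidates.sort(key=lambda i: activations[i], reverse=True)
    to_flip = set(candidates[:max_flips])

    relabeled = []
    for i, pair in enumerate(pairs):
        new_pair = dict(pair)
        if i in to_flip:
            new_pair["chosen"] = pair["rejected"]
            new_pair["rejected"] = pair["chosen"]
        relabeled.append(new_pair)
    return relabeled

# Toy data (made up for illustration).
pairs = [
    {"chosen": "complies with harmful request", "rejected": "refuses"},
    {"chosen": "answer A", "rejected": "answer B"},
    {"chosen": "complies again", "rejected": "refuses again"},
]
activations = [0.9, 0.0, 0.5]
relabeled = flip_top_activating(pairs, activations, max_flips=1)
```

Capping the number of flips with `max_flips` mirrors the idea of re-labeling only a bounded set of examples, so the effect on the rest of the dataset stays small.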

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same decomposition could be applied to ongoing data collection to monitor whether preferences shift as models improve or new annotator pools are added.
Reward models built from these features might enable more precise control over alignment objectives than current scalar reward approaches.
Testing the method on preference data from non-text domains or multimodal models would check whether the low-dimensional structure generalizes.

Load-bearing premise

The sparse autoencoder features learned from the preference data are faithful representations of the underlying human preferences rather than artifacts of the autoencoder training process or the specific model used to embed the data.

What would settle it

Retraining the sparse autoencoders on the same datasets with different random seeds or embedding models and observing that the extracted features no longer predict held-out preferences or yield safety gains when used for re-labeling would falsify the central claim.
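
One piece of such a check, comparing the feature dictionaries from two training runs, could look like the sketch below. It is an illustration only: it assumes each run yields a list of decoder directions, matches features by cosine similarity, and uses a hypothetical cutoff `min_similarity`. It does not cover the held-out prediction or re-labeling parts of the test.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def best_match_similarity(dict_a, dict_b):
    """For each direction in dict_a, the highest cosine similarity to any direction in dict_b."""
    return [max(cosine(u, v) for v in dict_b) for u in dict_a]

def fraction_stable(dict_a, dict_b, min_similarity=0.9):
    """Share of features in dict_a that have a close counterpart in dict_b."""
    sims = best_match_similarity(dict_a, dict_b)
    return sum(s >= min_similarity for s in sims) / len(sims)

# Toy dictionaries from two hypothetical runs: one feature recurs, one does not.
run_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
run_b = [[0.0, 2.0, 0.0], [0.0, 0.0, 1.0]]
stable = fraction_stable(run_a, run_b)
```

A low value of `stable` across seeds or embedding models would suggest that the features depend on training details rather than on the preference data itself.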

Figures

Figures reproduced from arXiv: 2510.26202 by Emma Pierson, Rajiv Movva, Sewon Min, Smitha Milli.

**Figure 2.** While some preferences are consistent across datasets, many vary significantly, even …

**Figure 3.** Two applications of WIMHF. (a) Data Curation: On Arena, WIMHF finds that annotators prefer when models fulfill harmful (illegal, sexual, etc.) requests instead of refusing; flipping the chosen and rejected responses for up to 1000 examples that activate this feature increases RewardBench2 safety (green) and preserves overall performance (blue). (b) Personalization: On CA, we show that learning annotator-sp…

**Figure 4.** Human preferences are relatively well-explained by a small number of interpretable …

**Figure 5.** Despite not being used in any step of WIMHF, we find that the SAE’s learned features often match annotator-written explanations on the CA dataset. Specifically, 59.9% of annotator explanations match at least one of the four most-active SAE features (vs. 33.7% random; 𝑁 = 5,000). Matches are judged by gpt-5-low, with the prompt given in …

**Figure 6.** Elo, as computed using Chatbot Arena preferences, changes after re-labeling unsafe …

**Figure 7.** Computing 𝜏𝑗 subjectivity values using different methods yields highly-correlated results. We compute 𝜏𝑗 using both restricted maximum likelihood (REML) and the Paule-Mandel (PM) estimates across all 31 statistically significant features in CA, using all annotators with at least 200 annotations. We also randomly split the set of eligible annotators into two halves A and B, and recompute 𝜏𝑗 using only half …

**Figure 8.** Prompt for comparing annotator explanations to top-activating SAE features using the …

**Figure 9.** Prompt for describing an SAE feature using a set of example preference pairs that have a …
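
The Figure 7 caption mentions the Paule-Mandel estimate of 𝜏𝑗. For readers unfamiliar with it, the sketch below shows the standard Paule-Mandel estimator of between-group variance from random-effects meta-analysis. It is a generic illustration, not the paper's code: it assumes each annotator contributes one effect estimate and one sampling variance for a given feature, which the caption implies but does not spell out, and the function name and bisection solver are choices made here.

```python
def paule_mandel_tau2(effects, variances, tol=1e-10, max_iter=200):
    """Paule-Mandel estimate of between-group variance tau^2.

    Finds tau^2 >= 0 such that the generalized Q statistic equals k - 1,
    where k is the number of groups (here, hypothetically, annotators).
    """
    k = len(effects)

    def q(tau2):
        # Inverse-variance weights that include the between-group variance.
        w = [1.0 / (v + tau2) for v in variances]
        mu = sum(wi * y for wi, y in zip(w, effects)) / sum(w)
        return sum(wi * (y - mu) ** 2 for wi, y in zip(w, effects))

    # No excess heterogeneity: the estimate is truncated at zero.
    if q(0.0) <= k - 1:
        return 0.0

    # q is decreasing in tau2, so bracket the root and bisect.
    lo, hi = 0.0, 1.0
    while q(hi) > k - 1:
        hi *= 2.0
    for _ in range(max_iter):
        mid = (lo + hi) / 2.0
        if q(mid) > k - 1:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return (lo + hi) / 2.0

# Toy inputs (made up): three groups with equal sampling variance.
tau2 = paule_mandel_tau2([0.0, 1.0, 2.0], [0.01, 0.01, 0.01])
tau = tau2 ** 0.5  # standard deviation of the per-group effects
```

With equal sampling variances the estimate reduces to the sample variance of the effects minus the sampling variance, which makes the toy case easy to check by hand.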

Original abstract

Human feedback can alter language models in unpredictable and undesirable ways, as practitioners lack a clear understanding of what feedback data encodes. While prior work studies preferences over certain attributes (e.g., length or sycophancy), automatically extracting relevant features without pre-specifying hypotheses remains challenging. We introduce What's In My Human Feedback? (WIMHF), a method to explain feedback data using sparse autoencoders. WIMHF characterizes both (1) the preferences a dataset is capable of measuring and (2) the preferences that the annotators actually express. Across 7 datasets, WIMHF identifies a small number of human-interpretable features that account for the majority of the preference prediction signal achieved by black-box models. These features reveal a wide diversity in what humans prefer, and the role of dataset-level context: for example, users on Reddit prefer informality and jokes, while annotators in HH-RLHF and PRISM disprefer them. WIMHF also surfaces potentially unsafe preferences, such as that LMArena users tend to vote against refusals, often in favor of toxic content. The learned features enable effective data curation: re-labeling the harmful examples in Arena yields large safety gains (+37%) with no cost to general performance. They also allow fine-grained personalization: on the Community Alignment dataset, we learn annotator-specific weights over subjective features that improve preference prediction. WIMHF provides a human-centered analysis method for practitioners to better understand and use preference data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

WIMHF shows sparse autoencoders can extract a small set of readable features from preference data that recover most black-box prediction performance across seven datasets, plus some clear downstream uses.

The main takeaway is that sparse autoencoders trained on preference pairs surface a handful of human-interpretable directions that explain the bulk of what a black-box model learns from the same data. The authors run this on seven datasets and show the features differ in sensible ways: Reddit annotators like informality and jokes while HH-RLHF and PRISM annotators do not, and Arena users sometimes favor toxic content over refusals. They also give two concrete applications: re-labeling harmful Arena examples for a 37% safety lift with no general-performance drop, and learning per-annotator weights on the Community Alignment set that improve prediction. That combination of cross-dataset consistency and usable outputs is the part worth paying attention to.

The method itself is a direct transfer of the SAE technique from model internals to preference data, which is new enough in this setting to be useful even if the underlying idea is not original. The results look reproducible on the surface because they report held-out checks and the same pattern across datasets. The soft spot is exactly the one the stress-test flagged: we still need to see how sensitive the recovered features and their explanatory power are to the SAE dictionary size, the L1 coefficient, and the choice of embedding model. The abstract claims the features account for the majority of the signal, but without those ablations it is hard to know whether a different hyperparameter choice would surface a different small set that explains just as much. The faithfulness argument is plausible because they evaluate on held-out labels rather than training data alone, yet a direct comparison of full SAE reconstruction versus the top-k interpretable subset versus the original black-box would make the central claim tighter.

This is the kind of paper that RLHF practitioners and dataset curators will actually read and try. It is not a foundational theoretical advance, but it gives a concrete inspection tool that was missing. I would send it to peer review; the core experiment is straightforward and the applications are concrete enough that referees can give targeted feedback on the missing ablations without rejecting the whole thing.

Referee Report

2 major / 3 minor

Summary. The paper introduces WIMHF, a method applying sparse autoencoders to human preference datasets to extract small sets of human-interpretable features. Across seven datasets it claims these features capture the majority of the predictive signal achieved by black-box preference models, surface diverse and sometimes unsafe preferences (e.g., Reddit favoring informality while HH-RLHF disfavors it; Arena users voting against refusals), and enable downstream uses such as safety-oriented data curation (+37% safety gains) and annotator-specific personalization on the Community Alignment dataset.

Significance. If the empirical claims are robust, WIMHF supplies a practical post-hoc interpretability technique for preference data that could help practitioners diagnose hidden biases, improve curation, and support personalized alignment. The approach extends SAE-based dictionary learning to the preference-modeling setting and the reported curation and personalization results indicate concrete utility beyond pure analysis.

major comments (2)

§4 (Experiments) and associated tables: the central claim that a small number of interpretable SAE features recover the majority of black-box preference prediction performance requires a direct head-to-head comparison (black-box model vs. full SAE vs. top-k interpretable features) on held-out preference labels with quantitative metrics such as accuracy or log-likelihood. No such comparison is reported, leaving open whether the selected features truly account for most of the signal or whether performance drops substantially once only the human-interpretable subset is retained.
§3.2 (Method) and §4.1: the manuscript does not present ablations on SAE hyperparameters (dictionary size, L1 sparsity coefficient) or on the choice of embedding model. Because the strongest claim depends on the SAE decomposition being both faithful and sufficiently sparse, sensitivity to these choices is load-bearing; without them the robustness of the “small number of features” result across the seven datasets cannot be assessed.
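
The head-to-head comparison requested in the first major comment could be scored with helpers like the ones below. This is a sketch under stated assumptions: each model outputs a probability that response A is preferred, labels are 1 when A was chosen, and "fraction of signal" is defined here as accuracy above chance relative to the black-box model, which is one plausible definition rather than necessarily the paper's. The numbers in the usage lines are made up.

```python
import math

def accuracy(probs, labels):
    """Share of pairs where the predicted side (prob >= 0.5) matches the label."""
    return sum((p >= 0.5) == bool(y) for p, y in zip(probs, labels)) / len(labels)

def mean_log_likelihood(probs, labels, eps=1e-12):
    """Average Bernoulli log-likelihood of the labels under the predicted probabilities."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
        total += math.log(p) if y else math.log(1.0 - p)
    return total / len(labels)

def fraction_of_signal(acc_features, acc_black_box, chance=0.5):
    """Above-chance accuracy of the feature model relative to the black-box model."""
    return (acc_features - chance) / (acc_black_box - chance)

# Illustrative held-out predictions (invented values).
probs = [0.9, 0.2, 0.6]
labels = [1, 0, 0]
acc = accuracy(probs, labels)
ll = mean_log_likelihood(probs, labels)
recovered = fraction_of_signal(acc_features=0.62, acc_black_box=0.65)
```

Reporting all three quantities for the black-box model, the full SAE, and the top-k interpretable subset on the same held-out pairs would make the "majority of the signal" claim directly checkable.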

minor comments (3)

Notation for the autoencoder loss (reconstruction + sparsity) should be introduced with explicit symbols in §3 rather than left implicit.
Figure captions for feature activation visualizations would benefit from example preference pairs that activate each feature, to aid reader interpretation.
A short related-work paragraph contrasting WIMHF with prior attribute-specific analyses (length, sycophancy, etc.) would help situate the contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We agree that additional quantitative comparisons and robustness checks will strengthen the empirical support for our central claims. We address each major comment below and will incorporate the suggested revisions in the next version of the manuscript.

Referee: §4 (Experiments) and associated tables: the central claim that a small number of interpretable SAE features recover the majority of black-box preference prediction performance requires a direct head-to-head comparison (black-box model vs. full SAE vs. top-k interpretable features) on held-out preference labels with quantitative metrics such as accuracy or log-likelihood. No such comparison is reported, leaving open whether the selected features truly account for most of the signal or whether performance drops substantially once only the human-interpretable subset is retained.

Authors: We agree that a direct head-to-head comparison on held-out data is necessary to rigorously support the claim. In the revised manuscript we will add results comparing the black-box preference model, the full SAE reconstruction, and a linear model using only the top-k human-interpretable features, evaluated with accuracy and log-likelihood on held-out preference pairs. This will quantify the fraction of predictive signal retained by the interpretable subset across the seven datasets. revision: yes

Referee: §3.2 (Method) and §4.1: the manuscript does not present ablations on SAE hyperparameters (dictionary size, L1 sparsity coefficient) or on the choice of embedding model. Because the strongest claim depends on the SAE decomposition being both faithful and sufficiently sparse, sensitivity to these choices is load-bearing; without them the robustness of the “small number of features” result across the seven datasets cannot be assessed.

Authors: We acknowledge that sensitivity analyses are important for establishing robustness. We will add ablations in the revised manuscript that vary dictionary size and the L1 coefficient, and we will report results using at least one alternative embedding model. These experiments will confirm that the finding of a small number of interpretable features capturing most of the signal is stable across reasonable hyperparameter regimes. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on held-out evaluations and independent experiments

The paper trains sparse autoencoders on preference data to extract features and then evaluates how well a small number of interpretable features recover the predictive performance of black-box models. This evaluation is described as occurring across 7 datasets with applications to curation and personalization, implying standard held-out testing rather than direct equivalence to fitted parameters. No equations or steps reduce the central claim to a self-definition, a renamed fit, or a load-bearing self-citation chain. The derivation remains self-contained against external benchmarks such as black-box performance and downstream task gains.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that sparse autoencoders trained on preference embeddings will recover human-interpretable directions that generalize beyond the training distribution and that these directions are not dominated by dataset-specific artifacts.

free parameters (1)

sparsity level / L1 coefficient
Controls how many features are active; chosen to balance reconstruction and interpretability but not derived from first principles.
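
For reference, a common textbook form of an L1-regularized sparse autoencoder objective is given below. This is a generic formulation and not notation taken from the paper: the symbols $W_{\text{enc}}$, $b_{\text{enc}}$, $W_{\text{dec}}$, $b_{\text{dec}}$, and $\lambda$ are assumptions, with $\lambda$ playing the role of the L1 coefficient listed above.

```latex
z = \mathrm{ReLU}\!\left(W_{\text{enc}}\, x + b_{\text{enc}}\right), \qquad
\hat{x} = W_{\text{dec}}\, z + b_{\text{dec}}, \qquad
\mathcal{L}(x) = \lVert x - \hat{x} \rVert_2^2 + \lambda \lVert z \rVert_1
```

Larger values of $\lambda$ push more entries of $z$ to zero, trading reconstruction quality for sparsity.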

axioms (1)

domain assumption: Sparse autoencoders can decompose preference signals into human-interpretable features without supervision on those features.
Invoked when claiming the learned features are both predictive and interpretable.


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Understanding Annotator Safety Policy with Interpretability
cs.AI 2026-05 unverdicted novelty 6.0

Annotator Policy Models learn safety policies from labeling behavior alone, accurately predicting responses and revealing sources of disagreement like policy ambiguity and value pluralism.

Compared to What? Baselines and Metrics for Counterfactual Prompting
cs.CL 2026-05 conditional novelty 6.0

Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistica...

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · cited by 2 Pith papers

[1]

doi: 10.1002/jrsm.1255. B. Bussmann, P. Leask, and N. Nanda. BatchTopK Sparse Autoencoders, Dec. 2024. B. Bussmann, N. Nabeshima, A. Karvonen, and N. Nanda. Learning Multi-Level Features with Matryoshka Sparse Autoencoders, Mar. 2025. W. Chi, V. Chen, A. N. Angelopoulos, W.-L. Chiang, A. Mittal, N. Jain, T. Zhang, I. Stoica, C. Donahue, and A. Talwalkar. ...

work page doi:10.1002/jrsm.1255 2024
[2]

Group preference optimization: Few-shot alignment of large language models. arXiv preprint arXiv:2310.11523, 2023

URL https://arxiv.org/abs/2310.11523. Y. Zhao, K. Zhang, T. Hu, S. Wu, R. L. Bras, T. Anderson, J. Bragg, J. C. Chang, J. Dodge, M. Latzke, Y. Liu, C. McGrady, X. Tang, Z. Wang, C. Zhao, H. Hajishirzi, D. Downey, and A. Cohan. SciArena: An Open Evaluation Platform for Foundation Models in Scientific Literature Tasks, July 2025. K. Zhou, J. D. Hwang, X...

work page arXiv 2025
[3]

Remove rows with empty prompts or responses

work page
[4]

Remove non-English prompts as annotated by the original dataset creators, or otherwise with fastText language ID

work page
[5]

Remove very long conversations with over 2048 tokens (this is <1% of all data)

work page 2048
[6]

Randomly swap response A and response B to avoid any position bias

work page
[7]

addresses the user’s request with creativity and clarity

Remove any rows where both response A and response B are marked as subjective by gpt-4.1-mini, using the same prompt as in Huang et al. [2025]. 6. Reddit: To preprocess the Stanford Human Preferences data, in addition to the above steps, we include only the pairs of comments where the preferred comment has at least 10 upvotes and at least twice as many up...

work page 2025
[8]

Helpful: Does this concept help you understand what humans prefer? If you were studying this dataset, and your goal was to understand what humans prefer, is this a concept you would explore further? Rate 1 if yes, 0 if no or only a little

work page
[9]

Interpretable,

Interpretable: When you read the concept, is it clear what it means? If you saw a prompt and a response, could you easily decide whether that response contains that concept? Rate 1 if yes, 0 if no / would often be subjective. 6The ICAI default is 𝑃= 5, and we show that trends hold with this in App. C. We use 𝑃= 10 to more clearly establish differences bet...

work page 2017
[10]

does not discuss AI

The task is to assess whether the automated response attributes are closely related to the attributes mentioned by the annotator. For example, if the feature is “does not discuss AI”, and the annotator says “I didn’t like the discussion about AI”, then the feature IS present. On the other hand, if the annotator says “I liked the suggestion of a surprise p...

work page
[11]

does not discuss AI

Directionality does not matter, only relevance. For example, if the feature is “does not discuss AI”, and the annotator says “I really liked the discussion of AI”, then the feature IS considered present, since AI is mentioned as relevant to the annotator’s choice

work page
[12]

does not discuss the environment

Prioritize precision. If a feature is imprecise, and it only loosely matches the annotator’s explanation, then it should NOT be counted. Annotator Explanation:{annotator_explanation} Features Predicted by Automated Explanation:{automated_explanation} If No, output an empty list: [] If Yes, output a list of indices of the features that are present in the a...

work page
