pith. machine review for the scientific record.

arxiv: 2605.12177 · v1 · submitted 2026-05-12 · 💻 cs.CL

Recognition: 2 theorem links

· Lean Theorem

Correcting Selection Bias in Sparse User Feedback for Large Language Model Quality Estimation: A Multi-Agent Hierarchical Bayesian Approach

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 05:49 UTC · model grok-4.3

classification 💻 cs.CL
keywords selection bias correction · hierarchical Bayesian modeling · LLM evaluation · user feedback · topic clustering · Beta-Binomial · quality estimation

The pith

Mild priors on feedback selection recover true LLM quality within 4-13 percentage points despite severe bias.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that user feedback on LLMs is heavily biased toward extreme satisfaction levels, causing naive averages to miss true quality by 40-50 percentage points. It introduces a multi-agent system that clusters interactions by topic, models selection bias and quality hierarchically with Beta-Binomial distributions, and reweights the per-cluster estimates to the overall topic distribution. With a mild prior drawn from typical feedback rates, the corrected estimate stays close to the true quality across wide ranges of bias. This matters because production systems can use dashboard-visible statistics to calibrate without needing labeled data, enabling reliable monitoring and recalibration.

Core claim

The Hierarchical-Informed Bayesian model, equipped with a mild prior on the feedback channel's positive rate and negative-to-positive ratio, produces aggregate quality estimates within 4-13 percentage points of the true Q* = 0.6249 on UltraFeedback data, even as simulated bias ratios vary from 1:1 to 30:1, and its 95% credible intervals cover the true value in all 50 random-seed replicates tested at bias strength κ_max = 10.

What carries the argument

The Bias Modeling Agent's two-stage hierarchical Beta-Binomial model under NUTS sampling, which infers per-topic selection rates s_c and qualities q_c with partial pooling, combined with reweighting by empirical topic prevalences in the Synthesis Agent.
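The two-stage story — biased selection into the feedback channel, then correction back to the interaction population — can be sketched in a few lines of NumPy. This is an editorial illustration, not the authors' code: the cluster count, Beta priors, and channel rates (s_pos, s_neg) are invented, and it uses a plug-in correction with known channel rates where the paper infers s_c and q_c hierarchically under a mild prior.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (invented numbers; the paper uses C=18 clusters on UltraFeedback).
C = 6
n_c = rng.integers(500, 3000, size=C)   # interactions per topic cluster
q_c = rng.beta(5, 3, size=C)            # latent per-topic quality

# Sentiment-dependent selection into the feedback channel: unhappy users
# are five times likelier to leave feedback (the bias channel).
s_pos, s_neg = 0.05, 0.25

pos_fb = rng.binomial(n_c, q_c * s_pos)        # observed thumbs-up per cluster
neg_fb = rng.binomial(n_c, (1 - q_c) * s_neg)  # observed thumbs-down per cluster

true_Q = np.sum(n_c * q_c) / np.sum(n_c)          # prevalence-weighted truth
naive_Q = pos_fb.sum() / (pos_fb + neg_fb).sum()  # naive average over feedback

# Plug-in correction with known channel rates (the paper instead infers
# the rates, which is why it needs the dashboard-derived prior):
good_hat = pos_fb / s_pos          # implied satisfied interactions per cluster
bad_hat = neg_fb / s_neg           # implied unsatisfied interactions
q_hat = good_hat / (good_hat + bad_hat)
corrected_Q = np.sum(n_c * q_hat) / np.sum(n_c)
```

With a 5:1 negative-to-positive skew the naive average lands far below the truth, while the channel-rate correction lands close; the hierarchical model replaces the oracle rates with partially pooled posteriors.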

If this is right

  • The approach outperforms naive averaging and inverse propensity weighting baselines under simulated biases.
  • Credible intervals provide reliable uncertainty quantification for the corrected quality.
  • Drift signals from the model support online recalibration as topic prevalences or biases change.
  • Partial pooling allows borrowing strength across topics even with sparse feedback per cluster.
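For contrast, the IPW baseline named above is presumably a Horvitz-Thompson-style estimator; a hedged sketch with oracle propensities (the 5:1 skew, sample sizes, and all names are invented for the demo) shows why it recovers the mean when selection probabilities are known exactly, which is exactly what production systems lack.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented population: 62% of interactions are good, and unhappy users are
# five times likelier to leave feedback (oracle propensities for the demo).
n = 100_000
good = rng.binomial(1, 0.62, size=n)
p_select = np.where(good == 1, 0.02, 0.10)
left_feedback = rng.binomial(1, p_select).astype(bool)

y = good[left_feedback]        # outcomes among users who left feedback
p = p_select[left_feedback]    # their (known) selection probabilities

naive = y.mean()                          # biased: unhappy users over-sampled
ipw = np.sum(y / p) / np.sum(1.0 / p)     # Hajek-normalized Horvitz-Thompson
```

The sketch also hints at why IPW degrades in the paper's sweeps: the propensities here are handed to the estimator, whereas under unknown selection they must themselves be estimated.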

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar hierarchical priors on selection mechanisms could address bias in other user-generated data streams like reviews or surveys.
  • If the bias channel prior is dashboard-readable, the method could be deployed with minimal additional data collection in live LLM systems.
  • Extending the clustering to dynamic online updates might improve adaptability to shifting user bases.

Load-bearing premise

The per-cluster data can be fit equally well by a range of quality and selection combinations, so an external prior on the bias channel is needed to pick out the correct one.
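A two-number illustration of this premise (toy values, not from the paper): if all we observe is the count y of positive feedbacks out of n interactions, and positive feedback arrives with per-interaction probability s·q, the binomial likelihood sees s and q only through their product, so distinct (s, q) pairs on the same ridge fit the data identically.

```python
from math import comb, log

def loglik(y, n, s, q):
    # A user leaves positive feedback with per-interaction probability s * q,
    # so the binomial likelihood depends on s and q only via their product.
    p = s * q
    return log(comb(n, y)) + y * log(p) + (n - y) * log(1 - p)

y, n = 30, 1000
a = loglik(y, n, s=0.05, q=0.60)   # moderate selection, decent quality
b = loglik(y, n, s=0.10, q=0.30)   # heavy selection, poor quality
# Both products equal 0.03, so the data cannot tell the two worlds apart;
# only external information about s can.
```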

What would settle it

Running the model on the UltraFeedback dataset with the described simulated biases and checking whether the 95% credible intervals cover the known Q* in fewer than 45 of 50 replicates, or whether the point-estimate error exceeds 13 percentage points at a 30:1 bias ratio.

Figures

Figures reproduced from arXiv: 2605.12177 by Andrea Morandi, Mahesh Viswanathan.

Figure 1. Posterior densities of the aggregate quality estimate.
Figure 2. Per-cluster recovery of the latent quality.
Figure 3. Absolute error |Q̂ − Q⋆| against sentiment-bias strength κ_max, on a log-κ_max horizontal axis. Across the full range, Hierarchical-Informed (purple, bold) holds inside a narrow band of 0.04 to 0.13. Naive, IPW, Basic, Hierarchical-Sentiment, and Corrected-Global all degrade roughly linearly with log κ_max. Enhanced (orange) traces a non-monotonic pattern: over-correction at low κ, accidental improvement by κ…
Figure 4. Hierarchical-Informed posterior mean (purple dot) and 95% credible interval.
Original abstract

[Abridged] Production LLM deployments receive feedback from a non-random fraction of users: thumbs sit mostly in the tails of the satisfaction distribution, and a naive average over them can land 40-50 percentage points away from true system quality. We treat this as a topic- and sentiment-stratified selection-bias problem and propose a three-agent hierarchical Bayesian pipeline that does not require ground-truth labels on individual interactions. A Topic Clustering Agent partitions the stream via UMAP + HDBSCAN over text embeddings; a Bias Modeling Agent fits a two-stage hierarchical Beta-Binomial under NUTS, inferring per-topic selection rates $s_c$ and quality $q_c$ with partial pooling; a Synthesis Agent reweights $q_c$ by true topic prevalence $\hat\pi_c = n_c/N$ to report a bias-corrected aggregate posterior $\bar Q = \sum_c \hat\pi_c q_c$ with credible interval, plus drift signals for online recalibration. Validation uses UltraFeedback (N=10,232 retained interactions, $C=18$ clusters, $Q^\star=0.6249$) with simulated topic- and sentiment-dependent selection biases. We compare five Bayesian variants against Naive and IPW baselines. A mild prior on the feedback channel (typical positive-feedback rate and negative-to-positive ratio, both readable from any production dashboard without labels) keeps Hierarchical-Informed within 4-13 pp of $Q^\star$ as the bias ratio sweeps from 1:1 to 30:1, with 95% credible intervals covering $Q^\star$ in 50/50 random-seed replicates at $\kappa_{\max}=10$. Without channel-side priors, every weak-prior variant misses $Q^\star$ by 22-33 pp: the per-cluster sufficient statistics admit a one-parameter family of equally good fits, and the prior on the bias channel (not on latent quality) is what breaks the degeneracy.
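The Synthesis Agent's reweighting step, as described in the abstract, amounts to a prevalence-weighted average over posterior draws. A minimal NumPy sketch, with stand-in Beta draws where the paper's NUTS posterior would go (cluster sizes and draw counts are invented):

```python
import numpy as np

rng = np.random.default_rng(2)

C, S = 18, 4000                 # clusters (as in the paper), draws (invented)
n_c = rng.integers(100, 1500, size=C)
pi_hat = n_c / n_c.sum()        # empirical topic prevalence, pi_c = n_c / N

# Stand-in posterior draws for each cluster quality q_c; in the paper these
# come from the NUTS fit of the hierarchical Beta-Binomial model.
q_draws = rng.beta(60, 40, size=(S, C))

Q_bar_draws = q_draws @ pi_hat  # one aggregate per draw: sum_c pi_c * q_c
Q_mean = Q_bar_draws.mean()
ci_lo, ci_hi = np.percentile(Q_bar_draws, [2.5, 97.5])
```

Propagating whole draws (rather than averaging posterior means) is what makes the reported 95% credible interval on the aggregate honest about per-cluster uncertainty.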

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims to address selection bias in user feedback for LLM quality estimation using a three-agent hierarchical Bayesian pipeline. A Topic Clustering Agent partitions interactions via UMAP + HDBSCAN on embeddings; a Bias Modeling Agent fits a hierarchical Beta-Binomial model under NUTS to infer per-cluster selection rates s_c and qualities q_c; a Synthesis Agent reweights by topic prevalence to obtain bias-corrected aggregate quality Q-bar with credible intervals. On UltraFeedback (N=10,232, C=18 clusters, true Q^*=0.6249) with simulated biases, the informed variant stays within 4-13 pp of Q* for bias ratios up to 30:1, with good CI coverage, while weak priors fail by 22-33 pp.

Significance. Should the mild dashboard-derived priors prove robust in real deployments with unknown selection mechanisms, the approach would provide a practical, label-free method for debiasing LLM feedback and monitoring drift. The explicit acknowledgment of under-identification without the channel prior and the use of partial pooling are notable strengths. The simulation-based validation demonstrates the necessity of the prior but leaves open questions about generalization.

major comments (3)
  1. [Abstract] The central result that Hierarchical-Informed stays within 4-13 pp of Q* depends on the mild prior on the feedback channel (positive-feedback rate and negative-to-positive ratio). Since the validation simulates the bias whose parameters inform the prior, the manuscript should include a sensitivity analysis to prior misspecification to support claims of applicability in real settings where the true bias is unknown.
  2. [Validation on UltraFeedback] The per-cluster degeneracy is noted (per-cluster sufficient statistics admit a one-parameter family of equally good fits), but no explicit equation or derivation shows this family for the observed feedback counts. Providing the likelihood function or a simple example with two clusters would clarify how the bias-channel prior resolves the identification issue.
  3. [Bias Modeling Agent] Details on convergence diagnostics for the NUTS sampler (e.g., R-hat, effective sample size) and sensitivity to clustering hyperparameters (HDBSCAN min_cluster_size, etc.) are missing, which are load-bearing for trusting the credible intervals and cluster stability.
minor comments (2)
  1. The notation π̂_c = n_c/N for topic prevalence could be introduced earlier with a symbol table for clarity.
  2. [Abstract] The value of κ_max = 10 should be defined in the main text, as it appears only in the abstract.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which help clarify the presentation and strengthen the claims regarding robustness. We address each major point below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract] The central result that Hierarchical-Informed stays within 4-13 pp of Q* depends on the mild prior on the feedback channel (positive-feedback rate and negative-to-positive ratio). Since the validation simulates the bias whose parameters inform the prior, the manuscript should include a sensitivity analysis to prior misspecification to support claims of applicability in real settings where the true bias is unknown.

    Authors: We agree that the current validation ties the prior to the simulated bias mechanism, which limits claims about robustness under misspecification. In the revised manuscript we will add a dedicated sensitivity analysis section that perturbs the dashboard-derived prior hyperparameters (means and variances of the positive-feedback rate and negative-to-positive ratio) over plausible ranges and reports the resulting deviation of the posterior mean and credible-interval coverage from Q*. This will be presented both for the UltraFeedback simulation and, where feasible, on a second held-out dataset. revision: yes

  2. Referee: [Validation on UltraFeedback] The per-cluster degeneracy is noted (per-cluster sufficient statistics admit a one-parameter family of equally good fits), but no explicit equation or derivation shows this family for the observed feedback counts. Providing the likelihood function or a simple example with two clusters would clarify how the bias-channel prior resolves the identification issue.

    Authors: We accept that the current text only states the existence of the degeneracy without derivation. The revised version will include the explicit per-cluster likelihood (Beta-Binomial with selection probability s_c and quality q_c, where the observed positive count y_c depends on the product s_c * q_c under the assumed feedback channel) and a short two-cluster toy example that shows the one-parameter family of (s_c, q_c) pairs yielding identical likelihoods. We will then demonstrate how the mild prior on the channel parameters (positive rate and ratio) breaks the degeneracy and yields unique posterior modes. revision: yes

  3. Referee: [Bias Modeling Agent] Details on convergence diagnostics for the NUTS sampler (e.g., R-hat, effective sample size) and sensitivity to clustering hyperparameters (HDBSCAN min_cluster_size, etc.) are missing, which are load-bearing for trusting the credible intervals and cluster stability.

    Authors: We agree that these diagnostics are essential for reproducibility and trust in the credible intervals. In the revision we will report standard NUTS diagnostics (R-hat < 1.01, bulk and tail ESS > 1000, no divergent transitions) for the main UltraFeedback run and will add a supplementary sensitivity table varying HDBSCAN min_cluster_size and min_samples over a small grid, showing the resulting variation in the number of clusters, aggregate Q-bar, and CI coverage. If any hyperparameter choice materially affects conclusions we will note it explicitly. revision: yes
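The R-hat the rebuttal commits to reporting can be computed directly from raw draws. Below is a simplified split-chain Gelman-Rubin statistic in NumPy (the classic potential-scale-reduction form, not Stan's rank-normalized variant), with synthetic chains standing in for NUTS output; all values are invented for the demo.

```python
import numpy as np

def split_rhat(chains):
    """Potential scale reduction R-hat (Gelman-Rubin, split-chain form).

    chains: array of shape (m, n), m chains of n draws each.
    """
    m, n = chains.shape
    half = n // 2
    # Split each chain in half so within-chain drift also inflates R-hat.
    parts = np.concatenate([chains[:, :half], chains[:, half:2 * half]], axis=0)
    k, n2 = parts.shape
    chain_means = parts.mean(axis=1)
    B = n2 * chain_means.var(ddof=1)       # between-chain variance
    W = parts.var(axis=1, ddof=1).mean()   # within-chain variance
    var_plus = (n2 - 1) / n2 * W + B / n2  # pooled posterior-variance estimate
    return float(np.sqrt(var_plus / W))

rng = np.random.default_rng(3)
good = rng.normal(size=(4, 2000))          # 4 well-mixed chains
bad = good + np.arange(4)[:, None]         # chains stuck at different levels
```

Well-mixed chains give R-hat near 1; chains exploring different modes push it well above the conventional 1.01 threshold.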

Circularity Check

0 steps flagged

No significant circularity; hierarchical model uses external priors and independent simulation validation

full rationale

The paper's derivation applies standard hierarchical Beta-Binomial inference (via NUTS) to observed feedback counts after UMAP+HDBSCAN clustering, with a mild external prior on channel parameters drawn from production dashboards. The aggregate Q-bar is computed as a weighted sum of posterior q_c under known topic prevalences. Validation against known Q* on simulated biases, plus comparisons to Naive and IPW baselines, provides independent grounding. No equation reduces the reported accuracy or credible-interval coverage to a fitted input by construction, no self-citation is load-bearing for the central claim, and the acknowledged degeneracy without the prior is resolved by an external source rather than tautologically. The pipeline is self-contained against external benchmarks.
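The rationale's last point — that the prior on the bias channel, not on latent quality, resolves the degeneracy — can be seen in a grid-posterior toy (a single-cluster sketch with invented numbers, not the paper's model): with a flat prior on the selection rate s, the posterior over quality q smears along the s·q ridge; a mild Beta prior on s concentrates it near the generating value.

```python
import numpy as np

# Toy data generated at s = 0.05, q = 0.6: expect about n * s * q = 30 positives.
y, n = 30, 1000
s_grid = np.linspace(0.01, 0.20, 200)
q_grid = np.linspace(0.05, 0.95, 200)
S, Q = np.meshgrid(s_grid, q_grid, indexing="ij")

p = S * Q
log_lik = y * np.log(p) + (n - y) * np.log1p(-p)   # binomial kernel in s * q

def posterior_mean_q(log_prior_s):
    log_post = log_lik + log_prior_s[:, None]
    w = np.exp(log_post - log_post.max())          # stabilized weights
    return float((w * Q).sum() / w.sum())

# Flat prior on s: posterior mass smears along the s * q = 0.03 ridge.
flat = posterior_mean_q(np.zeros_like(s_grid))

# Mild Beta(2, 38) prior on s (mean 0.05), standing in for the
# dashboard-derived feedback-rate prior.
a, b = 2.0, 38.0
informed = posterior_mean_q((a - 1) * np.log(s_grid) + (b - 1) * np.log1p(-s_grid))
```

The informed posterior mean sits near the generating q = 0.6, while the flat-prior mean is dragged along the ridge, mirroring the 22-33 pp misses reported for the weak-prior variants.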

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 3 invented entities

The approach rests on standard hierarchical Bayesian modeling and clustering techniques plus the assumption that mild priors on feedback rates are available externally; three new agents are introduced as part of the pipeline.

free parameters (2)
  • per-topic selection rates s_c
    Inferred parameters in the two-stage hierarchical Beta-Binomial model
  • per-cluster quality q_c
    Latent quality parameters estimated with partial pooling
axioms (2)
  • standard math Beta-Binomial likelihood for binary feedback under selection bias
    Core distributional assumption in the Bias Modeling Agent
  • domain assumption UMAP + HDBSCAN clustering on embeddings captures topic- and sentiment-dependent selection strata
    Invoked by the Topic Clustering Agent to define the C=18 clusters
invented entities (3)
  • Topic Clustering Agent no independent evidence
    purpose: Partitions interactions via embeddings
    New component introduced to stratify the data
  • Bias Modeling Agent no independent evidence
    purpose: Fits hierarchical model to infer s_c and q_c
    New component for Bayesian inference
  • Synthesis Agent no independent evidence
    purpose: Reweights by topic prevalence to produce aggregate Q
    New component for final bias-corrected output

pith-pipeline@v0.9.0 · 5666 in / 1620 out tokens · 68044 ms · 2026-05-13T05:49:06.477542+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages

  1. [1]

    Bias-corrected multi-agent user feedback analysis system for LLM performance optimization,

    S. Gupta and A. Morandi, “Bias-corrected multi-agent user feedback analysis system for LLM performance optimization,” Technical Disclosure Commons, Defensive Publication Series #8803, 2025

  2. [2]

    Post-stratification: A modeler’s perspective,

    R. J. A. Little, “Post-stratification: A modeler’s perspective,” Journal of the American Statistical Association, vol. 88, no. 423, pp. 1001–1012, 1993

  3. [3]

    Poststratification and weighting adjustments,

    A. Gelman and J. B. Carlin, “Poststratification and weighting adjustments,” in Survey Nonresponse, R. M. Groves, D. A. Dillman, J. L. Eltinge, and R. J. A. Little, Eds. Wiley, 2002

  4. [4]

    Sample selection bias as a specification error,

    J. J. Heckman, “Sample selection bias as a specification error,” Econometrica, vol. 47, no. 1, pp. 153–161, 1979

  5. [5]

    A generalization of sampling without replacement from a finite universe,

    D. G. Horvitz and D. J. Thompson, “A generalization of sampling without replacement from a finite universe,” Journal of the American Statistical Association, vol. 47, no. 260, pp. 663–685, 1952

  6. [6]

    Post stratification,

    D. Holt and T. M. F. Smith, “Post stratification,” Journal of the Royal Statistical Society. Series A, vol. 142, no. 1, pp. 33–46, 1979

  7. [7]

    Poststratification into many categories using hierarchical logistic regression,

    A. Gelman and T. C. Little, “Poststratification into many categories using hierarchical logistic regression,” Survey Methodology, vol. 23, pp. 127–135, 1997

  8. [8]

    Bayesian multilevel estimation with poststratification: State-level estimates from national polls,

    D. K. Park, A. Gelman, and J. Bafumi, “Bayesian multilevel estimation with poststratification: State-level estimates from national polls,” Political Analysis, vol. 12, no. 4, pp. 375–385, 2004

  9. [9]

    UMAP: Uniform manifold approximation and projection for dimension reduction,

    L. McInnes, J. Healy, and J. Melville, “UMAP: Uniform manifold approximation and projection for dimension reduction,” 2018

  10. [10]

    Density-based clustering based on hierarchical density estimates,

    R. J. G. B. Campello, D. Moulavi, and J. Sander, “Density-based clustering based on hierarchical density estimates,” in Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD). Springer, 2013, pp. 160–172

  11. [11]

    hdbscan: Hierarchical density based clustering,

    L. McInnes, J. Healy, and S. Astels, “hdbscan: Hierarchical density based clustering,” Journal of Open Source Software, vol. 2, no. 11, p. 205, 2017

  12. [12]

    Top2Vec: Distributed representations of topics,

    D. Angelov, “Top2Vec: Distributed representations of topics,” 2020

  13. [13]

    BERTopic: Neural topic modeling with a class-based TF-IDF procedure,

    M. Grootendorst, “BERTopic: Neural topic modeling with a class-based TF-IDF procedure,” 2022

  14. [14]

    Holistic evaluation of language models,

    P. Liang, R. Bommasani, T. Lee, D. Tsipras, D. Soylu, M. Yasunaga, Y. Zhang, D. Narayanan, Y. Wu, A. Kumar et al., “Holistic evaluation of language models,” Transactions on Machine Learning Research, 2023

  15. [15]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena,

    L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica, “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena,” in Advances in Neural Information Processing Systems (NeurIPS), 2023

  16. [16]

    Recommendations as treatments: Debiasing learning and evaluation,

    T. Schnabel, A. Swaminathan, A. Singh, N. Chandak, and T. Joachims, “Recommendations as treatments: Debiasing learning and evaluation,” in International Conference on Machine Learning (ICML), 2016

  17. [17]

    Collaborative prediction and ranking with non-random missing data,

    B. M. Marlin and R. S. Zemel, “Collaborative prediction and ranking with non-random missing data,” in ACM Conference on Recommender Systems (RecSys), 2009

  18. [18]

    Causal inference for recommendation,

    D. Liang, L. Charlin, and D. M. Blei, “Causal inference for recommendation,” in UAI Workshop on Causation: Foundation to Application, 2016

  19. [19]

    Bayesian Data Analysis,

    A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin, Bayesian Data Analysis, 3rd ed. Chapman and Hall/CRC, 2013

  20. [20]

    Silhouettes: A graphical aid to the interpretation and validation of cluster analysis,

    P. J. Rousseeuw, “Silhouettes: A graphical aid to the interpretation and validation of cluster analysis,” Journal of Computational and Applied Mathematics, vol. 20, pp. 53–65, 1987

  21. [21]

    A cluster separation measure,

    D. L. Davies and D. W. Bouldin, “A cluster separation measure,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-1, no. 2, pp. 224–227, 1979

  22. [22]

    The No-U-Turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo,

    M. D. Hoffman and A. Gelman, “The No-U-Turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo,” Journal of Machine Learning Research, vol. 15, no. 47, pp. 1593–1623, 2014

  23. [23]

    Inference from iterative simulation using multiple sequences,

    A. Gelman and D. B. Rubin, “Inference from iterative simulation using multiple sequences,” Statistical Science, vol. 7, no. 4, pp. 457–472, 1992

  24. [24]

    Posterior predictive assessment of model fitness via realized discrepancies,

    A. Gelman, X.-L. Meng, and H. Stern, “Posterior predictive assessment of model fitness via realized discrepancies,” Statistica Sinica, vol. 6, no. 4, pp. 733–807, 1996

  25. [25]

    UltraFeedback: Boosting language models with scaled AI feedback,

    G. Cui, L. Yuan, N. Ding, G. Yao, B. He, W. Zhu, Y. Ni, G. Xie, R. Xie, Y. Lin, Z. Liu, and M. Sun, “UltraFeedback: Boosting language models with scaled AI feedback,” 2023

  26. [26]

    Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC,

    A. Vehtari, A. Gelman, and J. Gabry, “Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC,” Statistics and Computing, vol. 27, no. 5, pp. 1413–1432, 2017