pith. machine review for the scientific record.

arxiv: 2604.11224 · v1 · submitted 2026-04-13 · 💻 cs.SI

Recognition: unknown

Quality-Sensitive Matrix Factorization for Community Notes: Towards Sample Efficiency and Manipulation Resistance

Ashish Goel, Mohak Goyal, Nishka Arora

Pith reviewed 2026-05-10 16:02 UTC · model grok-4.3

classification 💻 cs.SI
keywords community notes · matrix factorization · quality sensitivity · crowdsourced fact-checking · manipulation resistance · peer prediction · rater weighting · sample efficiency

The pith

Adding a single quality-sensitivity parameter per rater to Community Notes' matrix factorization lets the model reach accurate note quality scores with fewer ratings and hold up better against coordinated attacks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Quality-Sensitive Matrix Factorization to improve the existing model used in Community Notes. After separating ideology from note quality, the new approach estimates one extra scalar per rater that gives greater influence to those whose ratings line up more consistently with the quality scores learned from the full set of data. This estimation happens jointly from the ratings alone and draws on peer-prediction logic without needing any external ground truth about which notes are correct. On a large real-world dataset of 45 million ratings, the method matches the baseline accuracy for popular notes while using 26 to 40 percent fewer ratings. It also limits how much strategic groups can shift the quality estimates of targeted notes.

Core claim

By introducing and jointly estimating a scalar quality-sensitivity parameter for each rater, QSMF connects the standard ideology-adjusted matrix factorization to peer-prediction principles, assigning higher weight to raters whose ratings are more consistent with the aggregate note-quality estimates learned from all data, without requiring any external ground truth on note accuracy.

What carries the argument

The per-rater quality-sensitivity parameter ρ̂_i, estimated jointly with all other model parameters and used to modulate each rater's influence on note quality based on consistency with the learned estimates.
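
As a rough guide to where ρ̂_i sits, the rating model can be read off from the residual in the appendix excerpt reproduced at the end of this page. The roles assigned to the symbols here (μ as a global intercept, α_i as a rater intercept, β_j as note quality, δ_j as note ideology, γ_i as rater ideology) are inferred from context and are assumptions, not the paper's stated definitions.

```latex
\[
  r_{ij} \;\approx\; \mu + \alpha_i + \rho_i\,\beta_j + \gamma_i\,\delta_j
\]
```

Under this reading, fixing \(\rho_i = 1\) for every rater gives all raters equal influence on \(\beta_j\), which matches how the abstract describes the baseline.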

If this is right

  • QSMF reaches the baseline level of accuracy on high-traffic notes with 26-40 percent fewer ratings.
  • In semi-synthetic tests of coordinated attacks, QSMF produces substantially smaller shifts in the quality estimates of targeted notes.
  • On synthetic data with known ground truth, the estimated ρ̂_i separates good from bad raters with AUC above 0.94 and recovers true note quality with lower error than the baseline.
  • All improvements are obtained from one additional scalar per rater and require no external labels or manual moderation.
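
The AUC figure in the synthetic test can be made concrete with a small sketch. It computes AUC as the probability that a randomly chosen good rater has a higher fitted sensitivity than a randomly chosen bad rater. The function name `auc` and the example values are hypothetical and are not taken from the paper.

```python
def auc(scores, labels):
    """Fraction of (good, bad) rater pairs ranked correctly; ties count as half."""
    good = [s for s, y in zip(scores, labels) if y == 1]
    bad = [s for s, y in zip(scores, labels) if y == 0]
    if not good or not bad:
        raise ValueError("need at least one rater of each kind")
    wins = 0.0
    for g in good:
        for b in bad:
            if g > b:
                wins += 1.0
            elif g == b:
                wins += 0.5
    return wins / (len(good) * len(bad))

# Made-up fitted sensitivities and ground-truth labels (1 = good rater).
rho_hat = [1.2, 0.9, 0.1, 1.0, -0.2, 0.95]
is_good = [1, 1, 0, 1, 0, 0]
print(auc(rho_hat, is_good))
```

The pairwise loop is quadratic in the number of raters, which is fine for illustration; a rank-based formula would be the usual choice at scale.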

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The approach could allow accurate notes to be surfaced earlier in their lifecycle on platforms where rating volume grows slowly.
  • Downweighting inconsistent raters may gradually reduce the payoff for strategic behavior in the rating pool.
  • Similar per-rater sensitivity terms could be added to other crowdsourced rating systems that already use matrix factorization to separate quality from user traits.
  • Over longer periods, the recovered ρ̂_i values might serve as a lightweight signal for identifying raters who systematically contribute reliable signals.

Load-bearing premise

That a rater's quality-sensitivity is a stable individual trait that can be recovered accurately from their ratings pattern alone, without external verification of note quality and without creating new instabilities in the estimates.

What would settle it

A large-scale split-half test on fresh Community Notes data in which the ρ̂_i values estimated from the first half fail to predict which raters remain consistent in the second half, or in which QSMF quality scores diverge from the baseline after the same number of ratings accumulate.
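
A split-half check of this kind could be sketched as follows, assuming per-rater estimates have already been fitted separately on two disjoint halves of the data. The dictionaries, the function names, and the choice of Spearman correlation as the agreement measure are illustrative assumptions, not the paper's procedure.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    """Plain Pearson correlation of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def split_half_r(est_first, est_second):
    """Spearman correlation of estimates over raters present in both halves."""
    common = sorted(set(est_first) & set(est_second))
    x = [est_first[r] for r in common]
    y = [est_second[r] for r in common]
    return pearson(average_ranks(x), average_ranks(y))

# Made-up per-rater estimates from two halves; rater "e" appears in only one half.
first_half = {"a": 0.2, "b": 0.9, "c": 1.4, "d": 0.5}
second_half = {"a": 0.3, "b": 1.1, "c": 0.8, "d": 0.6, "e": 2.0}
print(split_half_r(first_half, second_half))
```

A value near zero on fresh data would be the failure signal described above; a value near one would support treating the estimate as a stable rater trait.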

Figures

Figures reproduced from arXiv: 2604.11224 by Ashish Goel, Mohak Goyal, Nishka Arora.

Figure 1: Inverse-variance weighting rewards predictability rather than quality tracking. The plotted …
Figure 2: Latent-parameter distributions from the full-data …
Figure 3: Fitted quality sensitivity ρ̂_i against ideology intensity |γ̂_i|. The weak Spearman correlation (r = −0.075) shows that ρ̂_i is not simply an inverse measure of ideological intensity. The shaded region is the 95% confidence interval. The within-bucket variation is much higher than the variation across ideology buckets.
    Activity bin    n raters    Split-half r    IQR(s_i)
    20–50           137,551     0.327           0.676
    50–100          76,434      0.499           …
Figure 4: Rolling-window sample efficiency across October 2024 for notes first observed in each two …
Figure 5: Coordinated-attack robustness on attacked notes. Left: attacked-note displacement under …
Figure 6: Synthetic recovery with known ground truth. Left: …
Figure 7: Rolling-window sample efficiency with two ideology dimensions. The design matches Fig …
Original abstract

Community Notes is X's crowdsourced fact-checking program: contributors write short notes that add context to potentially misleading posts, and other contributors rate whether those notes are helpful. Its algorithm uses a matrix factorization model to separate ideology from note quality, so notes are surfaced only when they receive support across ideological lines. After ideology is accounted for, however, the model gives all raters equal influence on quality estimates. This slows consensus formation and leaves the quality estimate vulnerable to noisy or strategic raters. We propose Quality-Sensitive Matrix Factorization (QSMF), which uses a per-rater quality-sensitivity parameter ρ̂_i estimated jointly with all other parameters. This connects QSMF to peer prediction: without external ground truth, it gives more influence to raters whose ideology-adjusted ratings are more consistent with the note-quality estimates learned from all the ratings. We evaluate QSMF on 45M ratings over 365K notes from the six months before the 2024 U.S. presidential election. Split-half tests confirm that quality sensitivity is a stable, empirically recoverable rater trait. In evaluation on high-traffic notes, QSMF requires 26–40% fewer ratings to match the baseline's accuracy. In semi-synthetic coordinated attacks on notes of opposing ideology, QSMF substantially reduces displacement on the estimated quality estimates of targeted notes relative to the baseline. In synthetic data with known ground truth, ρ̂_i separates good from bad raters with an AUC above 0.94, and achieves much lower error in recovering the true note quality estimates in the presence of bad raters. These gains come from a single additional scalar parameter per rater, with no external ground truth and no manual moderation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Quality-Sensitive Matrix Factorization (QSMF) for Community Notes on X. It extends the existing matrix factorization approach by introducing a single per-rater quality-sensitivity parameter ρ̂_i that is estimated jointly with note quality and ideology parameters. Raters whose ideology-adjusted ratings align more closely with the learned note-quality vector receive higher weight. The paper reports evaluations on 45M real ratings, showing 26–40% fewer ratings needed to match baseline accuracy on high-traffic notes, reduced displacement under semi-synthetic coordinated attacks, and strong recovery performance (AUC > 0.94 for rater separation, lower note-quality error) in synthetic data with known ground truth.

Significance. If the empirical claims hold under the joint estimation procedure, the work offers a practical, low-parameter improvement to crowdsourced fact-checking that increases sample efficiency and attack resistance without external ground truth or manual intervention. The combination of large-scale real-data results, attack simulations, and synthetic recovery tests provides concrete evidence of potential impact on deployed systems like Community Notes.

major comments (2)
  1. [Model definition and optimization procedure] The central modeling choice defines ρ̂_i as the consistency between a rater’s ideology-adjusted ratings and the note-quality vector that is itself learned under weights derived from those same ρ̂_i values. This creates an implicit fixed-point problem. While split-half stability and synthetic AUC > 0.94 are reported, the manuscript does not provide a formal analysis of uniqueness, convergence rate, or sensitivity to initialization/noise for the joint optimum. Because the headline efficiency (26–40%) and attack-resistance claims rest directly on the stability of this fixed point, a dedicated section or appendix deriving or empirically characterizing the optimization landscape is required.
  2. [Evaluation on real data] The real-data efficiency result is obtained on high-traffic notes only. It is unclear whether the same relative gain holds for the long tail of low-traffic notes that dominate the 365K-note corpus; if the method’s advantage is concentrated on already well-rated items, the practical sample-efficiency benefit for the overall system may be smaller than stated.
minor comments (2)
  1. [Abstract] The abstract states numerical gains (26–40%, AUC > 0.94) without accompanying standard errors, confidence intervals, or number of independent runs; adding these would allow readers to gauge variability.
  2. [Model section] Notation for the baseline matrix factorization and the precise form of the QSMF objective (including how ρ̂_i enters the loss) should be written out explicitly in the main text rather than left to supplementary material.
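
The initialization-sensitivity diagnostic requested in major comment 1 could start from something like the sketch below. It is a deliberately simplified stand-in, not the paper's optimizer: it fits a rank-1 model `ratings[i][j] ~ rho[i] * beta[j]` with no ideology or intercept terms, on fully observed synthetic data, using alternating ridge updates. All names, sizes, and constants are assumptions chosen for illustration.

```python
import random

def pearson(x, y):
    """Plain Pearson correlation of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def fit_rank1(ratings, seed, n_iter=50, lam=0.1):
    """Alternating ridge updates for ratings[i][j] ~ rho[i] * beta[j]."""
    rng = random.Random(seed)
    n_raters, n_notes = len(ratings), len(ratings[0])
    # Random positive start; with mostly positive true sensitivities this
    # keeps the sign of the solution the same across starts.
    rho = [rng.uniform(0.5, 1.5) for _ in range(n_raters)]
    beta = [0.0] * n_notes
    for _ in range(n_iter):
        denom = sum(p * p for p in rho) + lam
        beta = [sum(rho[i] * ratings[i][j] for i in range(n_raters)) / denom
                for j in range(n_notes)]
        denom = sum(b * b for b in beta) + lam
        rho = [sum(beta[j] * ratings[i][j] for j in range(n_notes)) / denom
               for i in range(n_raters)]
    return rho, beta

# Synthetic, fully observed ratings (all values are made up for illustration).
data_rng = random.Random(42)
rho_true = [data_rng.uniform(0.2, 1.5) for _ in range(30)]
beta_true = [data_rng.gauss(0.0, 1.0) for _ in range(40)]
ratings = [[r * b + data_rng.gauss(0.0, 0.5) for b in beta_true] for r in rho_true]

# Refit from five different starts and compare the recovered rho vectors.
fits = [fit_rank1(ratings, seed=s)[0] for s in range(5)]
agreement = min(pearson(fits[0], other) for other in fits[1:])
print(f"minimum correlation of rho across starts: {agreement:.4f}")
```

Agreement is measured by correlation rather than by raw distance because the overall scale of `rho` trades off against the scale of `beta` in a product model. Whether the full model with ideology terms and sparse observations behaves as benignly is exactly the open question the referee raises.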

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The comments highlight important aspects of the model and evaluation that we address point by point below. We will incorporate revisions to strengthen the manuscript.

  1. Referee: [Model definition and optimization procedure] The central modeling choice defines ρ̂_i as the consistency between a rater’s ideology-adjusted ratings and the note-quality vector that is itself learned under weights derived from those same ρ̂_i values. This creates an implicit fixed-point problem. While split-half stability and synthetic AUC > 0.94 are reported, the manuscript does not provide a formal analysis of uniqueness, convergence rate, or sensitivity to initialization/noise for the joint optimum. Because the headline efficiency (26–40%) and attack-resistance claims rest directly on the stability of this fixed point, a dedicated section or appendix deriving or empirically characterizing the optimization landscape is required.

    Authors: We acknowledge that the joint estimation of ρ̂_i with the note-quality and ideology parameters forms an implicit fixed-point problem, as the weights depend on the parameters being estimated. The current manuscript relies on empirical evidence (split-half stability of ρ̂_i and strong synthetic recovery, AUC > 0.94) to support reliability. We agree that a more explicit characterization would better substantiate the efficiency and robustness claims. In the revised manuscript we will add an appendix that empirically characterizes the optimization landscape, including convergence behavior from varied initializations, sensitivity to initialization and additive noise, and stability of recovered parameters across random data splits. While a complete theoretical analysis of uniqueness for this non-convex problem lies beyond the scope of the present work, the added empirical diagnostics will directly address the referee’s concern.
    Revision: yes

  2. Referee: [Evaluation on real data] The real-data efficiency result is obtained on high-traffic notes only. It is unclear whether the same relative gain holds for the long tail of low-traffic notes that dominate the 365K-note corpus; if the method’s advantage is concentrated on already well-rated items, the practical sample-efficiency benefit for the overall system may be smaller than stated.

    Authors: The referee is correct that the reported 26–40% sample-efficiency gains are measured on high-traffic notes. We focused on this subset because these are the notes that accumulate sufficient ratings to reach the display threshold in the deployed Community Notes system; sample efficiency is therefore most consequential precisely where consensus must form. For the long tail of low-traffic notes, absolute rating counts are small and many never reach visibility, so the relative benefit is less operationally relevant. In the revision we will add a dedicated paragraph clarifying this scope and, using the existing 45M-rating corpus, include supplementary results that stratify efficiency gains by rating volume to show how the advantage behaves for notes with fewer ratings.
    Revision: partial

Circularity Check

0 steps flagged

No significant circularity in QSMF joint estimation or empirical claims

The QSMF model defines per-rater ρ̂_i jointly with note quality via consistency of ideology-adjusted ratings with the learned quality vector, creating an explicit fixed-point interdependence by design. However, this is not a reduction of any claimed prediction or first-principles result to its own inputs; it is the model specification itself. All headline performance claims are evaluated against independent benchmarks: synthetic data supplies known ground-truth rater and note qualities for AUC and error metrics; split-half tests on the 45M-rating corpus assess stability of ρ̂_i without using the same optimization artifacts for validation; and efficiency/attack-resistance numbers compare QSMF to baseline on high-traffic notes and semi-synthetic attacks. No self-citation, uniqueness theorem, or ansatz is invoked to force the result, and the derivation chain remains self-contained against these external checks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The model adds one fitted scalar per rater on top of the standard matrix-factorization assumptions already used by Community Notes; no new physical entities are postulated.

free parameters (1)
  • per-rater quality-sensitivity parameter ρ̂_i
    Single additional scalar per rater, estimated jointly from the rating matrix.
axioms (1)
  • domain assumption: Matrix factorization can separate note quality from rater ideology
    Inherited from the existing Community Notes baseline model.

pith-pipeline@v0.9.0 · 5626 in / 1322 out tokens · 40700 ms · 2026-05-10T16:02:17.026778+00:00 · methodology

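    X Corp. Community notes data download. https://x.com/i/communitynotes/download-data, 2026. Accessed: 2026-03-03.

A Proofs

A.1 Exact Conditional Note-Block Update

Fix a note \(j\), and hold all parameters except \((\beta_j, \delta_j)\) fixed. Define the partial residual \(\hat y_{ij} := r_{ij} - \hat\mu - \hat\alpha_i\) for \(i \in I_j\). Then the note-\(j\) subproblem is

\[
  \min_{\beta,\delta}\; \frac{1}{2}\sum_{i \in I_j}\bigl(\hat y_{ij} - \hat\rho_i\,\beta - \hat\gamma_i\,\delta\bigr)^2 + \frac{\lambda_\beta}{2}\,\beta^2 + \lambda_\delta \ldots
\]

\[
  \begin{bmatrix}\hat\beta_j \\ \hat\delta_j\end{bmatrix}
  =
  \begin{bmatrix}\hat S_j + \lambda_\beta & \hat C_j \\ \hat C_j & \hat V_j + \lambda_\delta\end{bmatrix}^{-1} \ldots
\]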
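
The note-block subproblem above is a two-variable ridge regression, so its minimizer can be sketched by solving the 2×2 normal equations directly. Two things in this sketch are assumptions, because the source is truncated at that point: that the penalty on δ is \((\lambda_\delta/2)\,\delta^2\) by symmetry with the β term, and that Ŝ_j, Ĉ_j, V̂_j denote Σρ̂_i², Σρ̂_iγ̂_i, and Σγ̂_i² over the raters of note j. The function name `note_block_update` is hypothetical.

```python
def note_block_update(y, rho, gamma, lam_beta, lam_delta):
    """Solve the 2x2 ridge normal equations for one note's (beta, delta).

    y, rho, gamma are equal-length lists over the raters of this note:
    partial residuals, quality sensitivities, and rater ideology factors.
    """
    S = sum(p * p for p in rho)                 # assumed meaning of S_j
    C = sum(p * g for p, g in zip(rho, gamma))  # assumed meaning of C_j
    V = sum(g * g for g in gamma)               # assumed meaning of V_j
    b1 = sum(p * r for p, r in zip(rho, y))     # right-hand side for beta
    b2 = sum(g * r for g, r in zip(gamma, y))   # right-hand side for delta
    a11, a12, a22 = S + lam_beta, C, V + lam_delta
    # Strictly positive when both penalties are positive (Cauchy-Schwarz).
    det = a11 * a22 - a12 * a12
    beta = (a22 * b1 - a12 * b2) / det
    delta = (a11 * b2 - a12 * b1) / det
    return beta, delta

# Made-up example: four raters of a single note.
rho = [1.0, 0.5, 1.5, 0.2]
gamma = [-1.0, 1.0, 0.5, -0.3]
y = [0.7 * p - 0.4 * g for p, g in zip(rho, gamma)]
print(note_block_update(y, rho, gamma, lam_beta=0.1, lam_delta=0.1))
```

With both penalties set to zero and non-collinear `rho` and `gamma`, the update recovers the generating pair exactly; positive penalties shrink both coordinates toward zero.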