pith. machine review for the scientific record.

arxiv: 2605.09228 · v1 · submitted 2026-05-09 · 💻 cs.LG · cs.AI

Recognition: 2 Lean theorem links

ProactBench: Beyond What The User Asked For

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 02:04 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords conversational proactivity · LLM benchmarks · implied user needs · recovery evaluation · multi-agent dialogue · frontier models · dialogue systems

The pith

Recovery after task completion is difficult for LLMs and weakly tied to standard benchmarks

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Most LLM benchmarks only score responses to explicit user requests. This paper introduces ProactBench to measure conversational proactivity: the skill of noticing and acting on needs the user has implied but not stated. It divides the skill into three phase-tied types: Emergent (inference from a single anchor), Critical (synthesis across multiple anchors), and Recovery (forward-looking value once the stated task ends). A three-agent pipeline generates test cases while keeping information compartmentalized to avoid style bias and leaked rubrics. Tests across sixteen models show Recovery stands out as both hard and only weakly correlated with six common benchmarks.

Core claim

ProactBench decomposes conversational proactivity into Emergent, Critical, and Recovery phases. It uses a Planner, User Agent, and Assistant Model with information asymmetries to produce 198 dialogues containing 624 trigger points across 24 communication styles. Evaluation of frontier and open-weight models shows Recovery is difficult and weakly predicted by existing benchmarks, establishing it as a distinct evaluation signal.

What carries the argument

The three-agent architecture (Planner, User Agent, Assistant Model), which maintains information asymmetries so that trigger points for proactivity evaluation are generated without style confounds or rubric leakage.
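The knowledge partition the architecture relies on can be sketched as plain data structures. Everything here is an illustrative assumption, not the paper's actual interfaces: the point is only that the Assistant never sees triggers, styles, or the rubric.

```python
# Hypothetical sketch of the information asymmetries; class and field names
# are invented for illustration, not taken from the paper.
from dataclasses import dataclass, field


@dataclass
class PlannerState:
    scenario: str
    trigger_points: list  # hidden from both other agents
    rubric: str           # hidden from both other agents


@dataclass
class UserAgentView:
    persona_style: str    # one of the 24 communication styles
    current_goal: str
    dialogue_history: list = field(default_factory=list)
    # no access to trigger_points or the rubric


@dataclass
class AssistantView:
    dialogue_history: list = field(default_factory=list)
    # no access to triggers, styles, or evaluation criteria


def assistant_view(planner: PlannerState, history: list) -> AssistantView:
    """The Assistant only ever sees the conversation itself."""
    return AssistantView(dialogue_history=list(history))
```

The referee's first major comment is precisely a request for this table of partitions to be made explicit in §3.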

If this is right

  • Recovery performance can function as an independent metric when comparing how models handle real conversations.
  • Standard benchmarks leave out important aspects of helpfulness that involve anticipating unstated needs.
  • The 624 trigger points across 24 styles allow testing model robustness to different user communication patterns.
  • Improving recovery may require training methods that emphasize post-task forward-looking inference rather than explicit instructions alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Interfaces built around recovery checks could lower user frustration during extended chat sessions.
  • Developers could apply the benchmark to spot gaps in training data that favor explicit over implicit user signals.
  • The information-asymmetry method might extend to testing other subtle skills such as timely clarification requests.

Load-bearing premise

The three-agent setup with information asymmetries successfully prevents style confounding, rubric leakage, and information dumps without introducing new artifacts.

What would settle it

Observing a strong correlation between recovery scores and performance on the six standard benchmarks across additional models would indicate recovery does not provide a useful new signal.
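That settling test is just a correlation with uncertainty over 16 models. A minimal percentile-bootstrap sketch, stdlib only and operating on hypothetical per-model score vectors, not the authors' pipeline:

```python
import random


def pearson_r(xs, ys):
    """Sample Pearson correlation between two equal-length score vectors."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def bootstrap_ci(xs, ys, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for r, resampling (model) pairs with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    rs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        rs.append(pearson_r([xs[i] for i in idx], [ys[i] for i in idx]))
    rs.sort()
    return rs[int(alpha / 2 * n_boot)], rs[int((1 - alpha / 2) * n_boot) - 1]
```

With only 16 models the interval is necessarily wide, which is presumably why the paper reports bootstrap CIs such as [0.29, 0.71] for Recovery rather than point estimates alone.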

Figures

Figures reproduced from arXiv: 2605.09228 by Ahmad Salimi, Alex Smola, Dongming Shen, Sepehr Harfi.

Figure 1: The three-agent loop. The Planner authors strategy and the prospective rubric for turn … view at source ↗

Figure 2: Per-model pass rate by trigger type. The drop from … view at source ↗

Figure 3: Pairwise Pearson correlations across six standard benchmarks and three proactivity trigger types, computed over 16 models (95% bootstrap CIs in Appendix K). Existing benchmarks intercorrelate at r = 0.64 to 0.97; EMERGENT and CRITICAL fit within this regime (r̄ ≈ 0.83). RECOVERY breaks the pattern: r̄ = 0.51, 95% CI [0.29, 0.71]. … view at source ↗

Figure 4: Logit-transformed per-(model, style) pass rates regressed against a shared style-difficulty axis … view at source ↗

Figure 5: Cohen's κ per trigger type for each judge pair, with 95% bootstrap CI. Dashed line at 0.40 marks the conventional "moderate" threshold; dotted line at 0.60 marks "substantial." EMERGENT sits at the moderate boundary; CRITICAL is the noisiest dimension across all three pairs. … view at source ↗

Figure 6: RECOVERY weighted score (%) by evaluated model under each of the three judges, with 95% bootstrap CI error bars. GPT-5.5 is the top model under every judge; the magnitude of its lead compresses under cross-family judges but never reverses. … view at source ↗

Figure 7: Weighted score (%) per (evaluated model × judge) cell, broken out by trigger type. Aligned-trigger subset only (n = 132–167 per model). The Overall column shows that top overall rankings shift under cross-family judges, while the per-type panels show the type-specific patterns described above, including GPT-5.5's preserved RECOVERY lead. … view at source ↗

Figure 8: Per-model logit pass rates plotted against a shared stage-difficulty axis. Each polyline connects one … view at source ↗
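The judge-agreement analysis in Figure 5 rests on Cohen's κ. A minimal sketch of the statistic for two judges' pass/fail labels on the same items, illustrative rather than the authors' code:

```python
from collections import Counter


def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    # Observed proportion of items where the two judges agree.
    po = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each judge's marginal label frequencies.
    ca, cb = Counter(a), Counter(b)
    pe = sum((ca[label] / n) * (cb[label] / n) for label in set(a) | set(b))
    return (po - pe) / (1 - pe)
```

The 0.40 and 0.60 lines in the figure are the conventional "moderate" and "substantial" agreement thresholds applied to this statistic.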
read the original abstract

Most LLM benchmarks score how well a model responds to explicit requests. They leave unmeasured a different conversational ability: noticing and acting on needs the user has implied but not said. We call this conversational proactivity. ProactBench decomposes it into three phase-tied types: Emergent, inference from a single disclosed anchor; Critical, synthesis across multiple anchors; and Recovery, grounded forward-looking value after task completion. We operationalise the benchmark with three agents: a Planner, a User Agent, and an Assistant Model. Their information asymmetries defend against style-confounded scoring, rubric leakage, external-context contamination, and information dumps. The released corpus contains 198 curated dialogues with 624 trigger points across 24 communication styles drawn from a psychometric inventory and audited by an independent LLM judge. Across 16 frontier and open-weight models, Recovery is both difficult and weakly predicted by six standard benchmarks, making it a useful new evaluation signal.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces ProactBench to measure conversational proactivity in LLMs—the ability to notice and act on implied but unstated user needs. It decomposes proactivity into three phase-tied types (Emergent: inference from one anchor; Critical: synthesis across anchors; Recovery: grounded forward-looking value post-task) and operationalizes the benchmark via a three-agent protocol (Planner, User Agent, Assistant) whose information asymmetries are intended to block style confounding, rubric leakage, and information dumps. The released corpus comprises 198 dialogues with 624 trigger points spanning 24 psychometric communication styles; evaluations across 16 models indicate that Recovery is difficult and only weakly predicted by six standard benchmarks, positioning it as a distinct evaluation signal.

Significance. If the multi-agent construction successfully isolates genuine proactivity without introducing new artifacts, the reported weak correlation between Recovery scores and existing benchmarks would constitute a useful new signal for capabilities not captured by instruction-following evaluations. The release of the curated corpus and the psychometric grounding of styles are concrete strengths that could enable follow-on work.

major comments (2)
  1. [§3] §3 (three-agent operationalization): The central claim that Recovery scores reflect proactivity rather than protocol artifacts rests on the assertion that Planner/User-Agent/Assistant information asymmetries prevent style confounding, rubric leakage, and information dumps. The manuscript does not enumerate the precise knowledge partitions (e.g., whether the User Agent ever receives the Planner’s trigger list, the Assistant’s prior turns, or the full rubric), leaving open the possibility that residual style signals or handoff artifacts inflate Recovery difficulty and produce the observed weak correlations with the six external benchmarks.
  2. [Results] Results section (evaluation of 16 models): The claim that Recovery “is both difficult and weakly predicted” by six standard benchmarks is load-bearing for the paper’s contribution. Without reported correlation coefficients, confidence intervals, or explicit exclusion criteria for the six benchmarks, it is impossible to verify that the weak relationship is statistically distinguishable from noise or from the difficulty of the task itself.
minor comments (2)
  1. [Abstract] The abstract introduces “anchor” and “trigger points” without a concise definition on first use; a parenthetical gloss would improve readability.
  2. [§3] The auditing procedure by the independent LLM judge is mentioned but lacks details on prompt, agreement metric, or disagreement resolution; these should be added to the corpus-construction subsection.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and have revised the manuscript to incorporate the suggested clarifications.

read point-by-point responses
  1. Referee: [§3] §3 (three-agent operationalization): The central claim that Recovery scores reflect proactivity rather than protocol artifacts rests on the assertion that Planner/User-Agent/Assistant information asymmetries prevent style confounding, rubric leakage, and information dumps. The manuscript does not enumerate the precise knowledge partitions (e.g., whether the User Agent ever receives the Planner’s trigger list, the Assistant’s prior turns, or the full rubric), leaving open the possibility that residual style signals or handoff artifacts inflate Recovery difficulty and produce the observed weak correlations with the six external benchmarks.

    Authors: We acknowledge that the original description of the three-agent protocol, while outlining the intended information asymmetries, did not include an exhaustive enumeration of knowledge partitions. In the revised manuscript we have added a dedicated table in §3 that specifies the exact information available to each agent at every stage. The User Agent receives only the current simulated utterance and dialogue history and has no access to the Planner’s trigger list or the full rubric; the Assistant receives solely the conversation history without any prior knowledge of triggers, styles, or evaluation criteria. This explicit partitioning directly mitigates concerns about residual style signals or handoff artifacts and supports the claim that Recovery scores reflect proactivity rather than protocol effects. revision: yes

  2. Referee: [Results] Results section (evaluation of 16 models): The claim that Recovery “is both difficult and weakly predicted” by six standard benchmarks is load-bearing for the paper’s contribution. Without reported correlation coefficients, confidence intervals, or explicit exclusion criteria for the six benchmarks, it is impossible to verify that the weak relationship is statistically distinguishable from noise or from the difficulty of the task itself.

    Authors: We agree that quantitative statistical support is required to substantiate the claim. The revised Results section now includes a table reporting Pearson and Spearman correlations between Recovery scores and each of the six benchmarks, together with 95% confidence intervals and p-values. We have also added explicit selection criteria for the benchmarks (representative instruction-following, reasoning, and knowledge tasks) and note that the observed correlations remain low (|r| < 0.3) even after accounting for task difficulty. These additions allow readers to confirm that the weak relationship is statistically distinguishable from stronger correlations seen for the other proactivity phases. revision: yes

Circularity Check

0 steps flagged

No significant circularity in benchmark definition or empirical claims

full rationale

The paper constructs ProactBench via a three-agent protocol (Planner/User-Agent/Assistant) to generate 198 dialogues with 624 trigger points, then runs 16 models to measure Recovery difficulty and its weak correlation with six external benchmarks. No equations, fitted parameters, self-citations, or ansatzes appear in the derivation; the central result is an empirical observation from the released corpus and model evaluations rather than a quantity forced by construction or prior self-referential definitions. The three-agent asymmetries are presented as a methodological choice whose effectiveness is left to external verification, not asserted by internal reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The benchmark rests on standard assumptions about what constitutes a valid trigger point and on the psychometric inventory for communication styles; no free parameters or invented entities are introduced in the abstract.

axioms (2)
  • domain assumption The three-phase taxonomy (Emergent, Critical, Recovery) captures the main forms of conversational proactivity.
    Stated in the abstract as the decomposition used to operationalize the benchmark.
  • domain assumption Information asymmetries between Planner, User Agent, and Assistant Model eliminate style confounding and leakage.
    Abstract claims these asymmetries defend against listed confounds.

pith-pipeline@v0.9.0 · 5480 in / 1307 out tokens · 46478 ms · 2026-05-12T02:04:19.417084+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith reviews without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

145 extracted references · 145 canonical work pages · 11 internal anchors
