arxiv: 2211.03540 · v2 · pith:MQKP6N75new · submitted 2022-11-04 · 💻 cs.HC · cs.AI · cs.CL

Measuring Progress on Scalable Oversight for Large Language Models

Samuel R. Bowman, Jeeyoon Hyun, Ethan Perez, Edwin Chen, Craig Pettit, Scott Heiner, Kamilė Lukošiūtė, Amanda Askell

Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Christopher Olah, Daniela Amodei, Dario Amodei, Dawn Drain, Dustin Li, Eli Tran-Johnson, Jackson Kernion, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Liane Lovitt, Nelson Elhage, Nicholas Schiefer, Nicholas Joseph, Noemí Mercado, Nova DasSarma, Robin Larson, Sam McCandlish, Sandipan Kundu, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Timothy Telleen-Lawton, Tom Brown, Tom Henighan, Tristan Hume, Yuntao Bai, Zac Hatfield-Dodds, Ben Mann, Jared Kaplan

Pith reviewed 2026-05-17 14:57 UTC · model grok-4.3

classification 💻 cs.HC · cs.AI · cs.CL

keywords scalable oversight · large language models · human-AI interaction · question answering · MMLU · QuALITY · AI safety

The pith

Humans chatting with an unreliable LLM outperform both the model and unaided humans on specialist tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses scalable oversight, the task of supervising AI systems that may exceed human abilities on most relevant skills. It introduces an experimental setup using tasks where experts succeed but current general AI and regular people do not, then tests a simple baseline: people chatting with a flawed large language model. On MMLU and time-limited QuALITY, the human-AI pairs score higher than either the model or the human working alone. This result indicates that basic interactive assistance already improves performance on hard questions, allowing researchers to study oversight empirically with today's models.

Core claim

Human participants who interact with an unreliable large-language-model dialog assistant through chat substantially outperform both the model alone and their own unaided performance on tasks like MMLU and time-limited QuALITY, where specialists succeed but unaided humans and current general AI systems fail.

What carries the argument

An experimental design built around proxy tasks that specialists can solve but unaided humans and current models cannot, paired with a baseline oversight method of unstructured chat with an unreliable LLM.

If this is right

Scalable oversight can be studied empirically using present-day models instead of waiting for future superhuman systems.
Even unreliable AI models can raise human performance on difficult question-answering tasks through simple dialog.
Progress on oversight methods can be measured by comparing human-AI team accuracy against standalone model and human baselines.
More advanced oversight techniques can be tested within the same experimental framework once the baseline is established.
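The comparison described in these points can be sketched in a few lines. This is a minimal illustration, not the paper's evaluation code: the function names `accuracy` and `team_margin` are hypothetical, and the outcome lists are invented toy data rather than results from the study. It assumes each condition is scored as a list of per-question correct/incorrect outcomes.

```python
from statistics import mean


def accuracy(correct):
    """Fraction of questions answered correctly (iterable of booleans)."""
    return mean(1.0 if c else 0.0 for c in correct)


def team_margin(model, human, team):
    """Assisted-team accuracy minus the stronger standalone baseline.

    A positive value means the human-plus-chat condition beats both the
    model alone and the unaided human.
    """
    return accuracy(team) - max(accuracy(model), accuracy(human))


# Toy outcomes, invented for illustration only (NOT data from the paper).
model_alone = [True, False, True, False, False]
human_alone = [False, False, True, False, True]
human_plus_chat = [True, True, True, False, True]

margin = team_margin(model_alone, human_alone, human_plus_chat)
print(f"margin over best baseline: {margin:.2f}")
```

Taking the maximum of the two baselines mirrors the claim that the team must outperform both the model and the unaided human, not just one of them.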

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the pattern holds, oversight research may focus more on interactive assistance protocols than on pure verification or auditing.
The findings suggest hybrid human-AI systems could serve as an intermediate step before full autonomous oversight.
This setup invites tests of whether performance gains persist when humans are given more time or different interaction formats.

Load-bearing premise

Tasks like MMLU and time-limited QuALITY, where specialists succeed but unaided humans and current AI fail, serve as valid proxies for the challenges of supervising future AI systems that broadly outperform humans.

What would settle it

An experiment on harder tasks or with stronger models in which human-AI chat teams no longer outperform both the model alone and unaided humans.

Original abstract

Developing safe and useful general-purpose AI systems will require us to make progress on scalable oversight: the problem of supervising systems that potentially outperform us on most skills relevant to the task at hand. Empirical work on this problem is not straightforward, since we do not yet have systems that broadly exceed our abilities. This paper discusses one of the major ways we think about this problem, with a focus on ways it can be studied empirically. We first present an experimental design centered on tasks for which human specialists succeed but unaided humans and current general AI systems fail. We then present a proof-of-concept experiment meant to demonstrate a key feature of this experimental design and show its viability with two question-answering tasks: MMLU and time-limited QuALITY. On these tasks, we find that human participants who interact with an unreliable large-language-model dialog assistant through chat -- a trivial baseline strategy for scalable oversight -- substantially outperform both the model alone and their own unaided performance. These results are an encouraging sign that scalable oversight will be tractable to study with present models and bolster recent findings that large language models can productively assist humans with difficult tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

This paper gives a workable experimental design for testing scalable oversight with current models on specialist-success tasks, with a simple chat baseline showing gains, but the tasks may not capture the real verification problems for future superhuman systems.

The key takeaway is that this paper introduces an experimental design for studying scalable oversight now, by focusing on tasks where specialists succeed but current models and average humans fail, and that its initial test with human-LLM chat shows clear performance improvements on MMLU and time-limited QuALITY. The authors do a good job laying out why empirical work is tricky without superhuman models and then propose this specialist-success approach as a way forward. Running the proof-of-concept with three conditions (model alone, human alone, and human plus chat with the model) gives a simple but direct comparison. The finding that the assisted humans beat both baselines is encouraging and provides a trivial but functional starting strategy for oversight experiments. It also aligns with other work showing LLMs can help on hard tasks.

The main limitation is whether these tasks really stand in for the oversight challenges ahead. MMLU and QuALITY have objective answers and limited scope, so humans can often catch model errors by knowing the right response or double-checking facts. Future models that are broadly superhuman might produce errors that are harder to detect, especially in open-ended generation or areas without clear ground truth. The paper acknowledges this is a proof-of-concept, but the results might not generalize to those harder cases.

This work is for researchers in AI alignment and scalable oversight. Readers who want practical ways to test oversight methods empirically will get value from the design and the baseline results. It deserves serious peer review because it tackles an important problem with a new setup that can be iterated on, even if the current evidence is preliminary. I would send it for review. The experimental framework is worth developing, and feedback from referees could strengthen the connection to long-term oversight goals.

Referee Report

1 major / 2 minor

Summary. The paper proposes an experimental design for empirically studying scalable oversight of AI systems that may outperform humans, centered on tasks where specialists succeed but unaided humans and current general AI fail. It reports a proof-of-concept experiment on MMLU and time-limited QuALITY in which humans chatting with an unreliable LLM dialog assistant substantially outperform both the model alone and their unaided performance, positioning this as a viable baseline strategy and encouraging further empirical work with present models.

Significance. If the results hold, the work supplies a concrete, low-overhead baseline for human-AI collaboration on difficult question-answering tasks and demonstrates that such oversight experiments are tractable today rather than deferred until superhuman systems exist. The direct empirical comparison of three conditions on fixed benchmarks with no free parameters or circular derivations is a clear strength.

major comments (1)

[Experimental Design] The experimental design (centered on MMLU and time-limited QuALITY) treats these tasks as proxies for the core scalable-oversight difficulty of verifying models whose errors are subtle or whose capabilities exceed the overseer on most dimensions. Because both tasks supply objective ground truth and narrow domains in which specialists already know the answers, the observed chat-assisted gains may not generalize to open-ended generation or ambiguous-goal settings where verification itself is the central unsolved problem. This assumption is load-bearing for the claim that the reported interaction protocol constitutes progress toward scalable oversight.

minor comments (2)

[Abstract] The abstract states that participants 'substantially outperform' both baselines; the results section should report exact accuracy deltas, participant counts, statistical tests, and error bars so readers can judge effect size and reliability.
[Methods] Clarify the precise chat interface, time limits, and instructions given to participants so the protocol can be reproduced or extended by other groups.
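One way to produce the effect sizes and error bars requested in the first minor comment is a percentile bootstrap over questions. The sketch below is an assumption-laden illustration, not the authors' analysis: `bootstrap_diff_ci` is a hypothetical helper, and it assumes both conditions were scored as 0/1 outcomes on the same set of questions so that resampling question indices preserves the pairing. If the conditions used different questions or participants, a different resampling scheme would be needed.

```python
import random


def bootstrap_diff_ci(assisted, baseline, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy(assisted) - accuracy(baseline).

    Inputs are equal-length lists of 0/1 outcomes on the same questions.
    Returns (point_estimate, lower_bound, upper_bound).
    """
    if len(assisted) != len(baseline) or not assisted:
        raise ValueError("need two non-empty, equal-length outcome lists")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(assisted)
    diffs = []
    for _ in range(n_resamples):
        # Resample question indices with replacement, keeping pairs together.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(assisted[i] - baseline[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    point = (sum(assisted) - sum(baseline)) / n
    return point, lower, upper


# Toy outcomes, invented for illustration only (NOT data from the paper).
point, lower, upper = bootstrap_diff_ci(
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 1, 0],
)
print(f"accuracy delta {point:.3f}, 95% CI [{lower:.3f}, {upper:.3f}]")
```

An interval that excludes zero would support the "substantially outperform" wording. With only a handful of questions, as in this toy input, the interval will be wide.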

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of the work's significance and for the detailed comment on the experimental design. We address this point directly below.

Referee: The experimental design (centered on MMLU and time-limited QuALITY) treats these tasks as proxies for the core scalable-oversight difficulty of verifying models whose errors are subtle or whose capabilities exceed the overseer on most dimensions. Because both tasks supply objective ground truth and narrow domains in which specialists already know the answers, the observed chat-assisted gains may not generalize to open-ended generation or ambiguous-goal settings where verification itself is the central unsolved problem. This assumption is load-bearing for the claim that the reported interaction protocol constitutes progress toward scalable oversight.

Authors: We agree that MMLU and time-limited QuALITY function as proxies rather than direct instantiations of the hardest cases of scalable oversight, where verification of subtle errors or superhuman outputs in open-ended or ambiguous-goal domains is the central difficulty. The manuscript frames the experimental design explicitly as a means to enable empirical study of human-AI collaboration today, using tasks where objective ground truth permits clear measurement of performance differences between the three conditions (model alone, human alone, and human with chat assistant). The reported protocol is described as a trivial baseline strategy, and the results are presented as evidence that such interaction-based oversight can be studied productively with present models. We do not claim the protocol solves verification in general superhuman regimes. To address the referee's concern, we will revise the discussion and limitations sections to more explicitly delineate the scope of these tasks as proxies, note the load-bearing nature of the assumption, and outline directions for extending the approach to settings without objective ground truth.

revision: yes

Circularity Check

0 steps flagged

Direct empirical comparison on fixed benchmarks shows no circularity

The paper describes an experimental design using tasks where specialists succeed but unaided humans and current models fail, then reports results from a proof-of-concept study comparing three conditions (model alone, human alone, human+LLM chat) on MMLU and time-limited QuALITY. These are straightforward empirical measurements on objective benchmarks with no derivations, equations, fitted parameters, predictions, or first-principles claims that reduce to the inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central result is a direct performance comparison, which is self-contained and externally falsifiable via replication on the same tasks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the domain assumption that the selected tasks capture key difficulties of future scalable oversight; no free parameters or invented entities are introduced.

axioms (1)

domain assumption: Tasks where human specialists succeed but unaided humans and current general AI systems fail are appropriate proxies for studying scalable oversight of future superhuman AI.
Invoked to justify the choice of MMLU and time-limited QuALITY as the experimental tasks.

pith-pipeline@v0.9.0 · 5689 in / 1246 out tokens · 87618 ms · 2026-05-17T14:57:07.849638+00:00 · methodology

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning
cs.AI · 2026-05 · conditional · novelty 8.0

LLMs exhibit myopic planning in four-in-a-row: move choices are best explained by shallow nodes in reasoning traces, not the deep lookahead they generate, unlike humans where depth drives performance.

LLM Wardens: Mitigating Adversarial Persuasion with Third-Party Conversational Oversight
cs.LG · 2026-05 · unverdicted · novelty 7.0

A secondary warden LLM halves the success rate of hidden-goal adversarial LLMs in steering user decisions while causing only minor interference with genuine interactions.

Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning
cs.AI · 2026-05 · unverdicted · novelty 7.0

LLM move selection in four-in-a-row is best explained by myopic models that ignore deep nodes in their own reasoning traces, while performance correlates with search breadth rather than depth.

Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning
cs.AI · 2026-05 · unverdicted · novelty 7.0

LLMs exhibit myopic planning in games, with move choices driven by shallow nodes despite deep reasoning traces, in contrast to human deep-search reliance.

Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning
cs.AI · 2026-05 · unverdicted · novelty 7.0

LLMs display myopic planning in games: move selection is driven by shallow nodes in reasoning traces despite generating deep lookahead, with performance tied to search breadth rather than depth.

Agentic-imodels: Evolving agentic interpretability tools via autoresearch
cs.AI · 2026-05 · unverdicted · novelty 7.0

Agentic-imodels evolves scikit-learn regressors via an autoresearch loop to jointly boost predictive performance and LLM-simulatability, improving downstream agentic data science tasks by up to 73% on the BLADE benchmark.

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
cs.AI · 2024-06 · conditional · novelty 7.0

LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.

Measuring Faithfulness in Chain-of-Thought Reasoning
cs.AI · 2023-07 · conditional · novelty 7.0

Chain-of-Thought reasoning in LLMs is often unfaithful, with models relying on it variably by task and less so as models scale larger.

How to Interpret Agent Behavior
cs.AI · 2026-05 · conditional · novelty 6.0

ACT*ONOMY is a Grounded-Theory-derived hierarchical taxonomy and open repository that enables systematic comparison and characterization of autonomous agent behavior across trajectories.

Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy
cs.LG · 2026-05 · unverdicted · novelty 6.0

Pretrained base models exhibit higher yield to peer disagreement than RLHF instruct variants, with the effect localized to mid-layer attention and mitigated by structured dissent rather than prompt defenses.

Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces
cs.LG · 2026-05 · unverdicted · novelty 6.0

A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.

Behavior Cue Reasoning: Monitorable Reasoning Improves Efficiency and Safety through Oversight
cs.AI · 2026-05 · unverdicted · novelty 6.0

Behavior Cue Reasoning trains LLMs to emit special tokens before behaviors, enabling monitors to prune up to 50% of wasted tokens and recover safe actions from 80% of unsafe traces, more than doubling success rates wi...

Structural Enforcement of Goal Integrity in AI Agents via Separation-of-Powers Architecture
cs.AI · 2026-04 · unverdicted · novelty 6.0

A separation-of-powers system architecture for AI agents uses independent layers, cryptographic capability tokens, and a formal verification framework to maintain goal integrity even under model compromise.

Building a Precise Video Language with Human-AI Oversight
cs.CV · 2026-04 · unverdicted · novelty 6.0

CHAI framework pairs AI pre-captions with expert human critiques to produce precise video descriptions, enabling open models to outperform closed ones like Gemini-3.1-Pro and improve fine-grained control in video gene...

Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models
cs.AI · 2024-08 · conditional · novelty 6.0

Empirical analysis shows scaling inference compute via strategies like tree search can be more efficient than scaling model parameters, with 7B models plus novel search outperforming 34B models.

A Roadmap to Pluralistic Alignment
cs.AI · 2024-02 · unverdicted · novelty 6.0

The paper formalizes three types of pluralistic AI models and three benchmark classes, arguing that current alignment techniques may reduce rather than increase distributional pluralism.

Towards Understanding Sycophancy in Language Models
cs.CL · 2023-10 · conditional · novelty 6.0

Sycophancy is prevalent in state-of-the-art AI assistants and is likely driven in part by human preferences that favor agreement over truthfulness.

Simple synthetic data reduces sycophancy in large language models
cs.CL · 2023-08 · unverdicted · novelty 6.0

Scaling and instruction tuning increase sycophancy in LLMs on opinion and fact tasks, but a synthetic data fine-tuning intervention reduces it on held-out prompts.

Auditing and Controlling AI Agent Actions in Spreadsheets
cs.HC · 2026-04 · unverdicted · novelty 5.0

Pista decomposes AI agent actions in spreadsheets into auditable steps, enabling real-time user intervention that improves task outcomes, user comprehension, agent perception, and sense of co-ownership over baseline agents.

Extrapolating Volition with Recursive Information Markets
cs.GT · 2026-04 · unverdicted · novelty 5.0

Recursive information markets with forgetful LLM buyers can align information prices with true value and extend to scalable oversight in AI alignment.

TrustLLM: Trustworthiness in Large Language Models
cs.CL · 2024-01 · unverdicted · novelty 5.0

TrustLLM defines eight trustworthiness principles, creates a six-dimension benchmark, and evaluates 16 LLMs showing proprietary models generally lead but some open-source ones are close while over-calibration can hurt...

A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions
cs.CL · 2023-11 · unverdicted · novelty 5.0

The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.
