QuALITY: Question Answering with Long Input Texts, Yes!
2 Pith papers cite this work. Polarity classification is still indexing.
Pith papers citing it: 2
Fields: cs.HC (2)
Verdicts: UNVERDICTED (2)
Representative citing papers
Human-AI hybrids achieve only +0.4pp over AI alone on diverse tasks because confidence routing fails to identify the small set of cases where humans can correct AI errors.
Citing papers explorer
- Measuring Progress on Scalable Oversight for Large Language Models
  Humans chatting with an unreliable LLM assistant outperform both the model alone and unaided humans on MMLU and time-limited QuALITY tasks.
- Toward Human-AI Complementarity Across Diverse Tasks
  Human-AI hybrids achieve only +0.4pp over AI alone on diverse tasks because confidence routing fails to identify the small set of cases where humans can correct AI errors.