QuestBench: Can LLMs ask the right question to acquire information in reasoning tasks?
5 Pith papers cite this work.
Citing papers
-
Don't Start What You Can't Finish: A Counterfactual Audit of Support-State Triage in LLM Agents
LLM agents overcommit on non-complete tasks 41.7% of the time unless given explicit support-state categories, which raise typed deferral accuracy to 91.7%.
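The triage idea in this summary can be sketched minimally: map each task to a support state, then to an action, instead of always executing. The category names below are hypothetical illustrations, not the paper's actual taxonomy.

```python
from enum import Enum

class SupportState(Enum):
    # Hypothetical support-state categories (illustrative only)
    COMPLETE = "complete"            # fully specified: safe to proceed
    UNDERSPECIFIED = "underspecified"  # missing information: ask first
    BLOCKED = "blocked"              # external dependency: defer

def triage(state: SupportState) -> str:
    """Typed deferral: choose an action per support state rather than
    unconditionally committing to execution."""
    return {
        SupportState.COMPLETE: "execute",
        SupportState.UNDERSPECIFIED: "ask",
        SupportState.BLOCKED: "defer",
    }[state]
```

The point of the typed scheme is that deferral becomes a first-class output the agent can be scored on, rather than a failure mode.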
-
HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?
HiL-Bench shows frontier AI agents fail to ask for help on incomplete tasks, recovering only a fraction of full-information performance, but RL training on Ask-F1 reward improves judgment and transfers across domains.
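An "Ask-F1" reward plausibly scores the agent's ask decisions against labels marking which tasks truly need help; a minimal sketch of that metric, with an assumed function name, is:

```python
def ask_f1(asked: list[bool], needs_help: list[bool]) -> float:
    """F1 of ask decisions vs. ground-truth 'help needed' labels.
    Illustrative reconstruction; not the benchmark's reference code."""
    tp = sum(a and n for a, n in zip(asked, needs_help))        # asked when needed
    fp = sum(a and not n for a, n in zip(asked, needs_help))    # asked needlessly
    fn = sum((not a) and n for a, n in zip(asked, needs_help))  # failed to ask
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

F1 penalizes both over-asking (low precision) and under-asking (low recall), which is why it is a natural shaping signal for help-seeking judgment.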
-
AutoOR: Scalably Post-training LLMs to Autoformalize Operations Research Problems
AutoOR uses synthetic data generation and RL post-training with solver feedback to enable 8B LLMs to autoformalize linear, mixed-integer, and non-linear OR problems, matching larger models on benchmarks.
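The solver-feedback idea can be illustrated on a toy problem: a candidate formalization is checked by actually solving it. The brute-force "solver" below stands in for a real MIP solver and is not AutoOR's pipeline.

```python
def solve_toy_ip(max_sum: int = 4) -> tuple[int, int, int]:
    """Brute-force a toy integer program:
    maximize 3x + 2y subject to x + y <= max_sum, x, y >= 0 integers.
    Returns (objective, x, y); a real pipeline would call a MIP solver
    and feed the result back as a training signal."""
    return max((3 * x + 2 * y, x, y)
               for x in range(max_sum + 1)
               for y in range(max_sum + 1 - x))
```

Because the solver's output is checkable, it can serve as an automatic reward for RL post-training without human-labeled formalizations.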
-
When to Ask a Question: Understanding Communication Strategies in Generative AI Tools
A tradeoff model shows generative AI can reduce bias against diverse preferences by strategically eliciting information instead of always inferring from majority patterns.
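The elicitation tradeoff reduces to a simple expected-cost comparison; the decision rule below is a hedged sketch with assumed parameter names, not the paper's model.

```python
def should_ask(p_minority: float, cost_ask: float, cost_mismatch: float) -> bool:
    """Ask rather than infer when the expected cost of guessing the
    majority preference (wrong with probability p_minority, at cost
    cost_mismatch) exceeds the fixed cost of asking."""
    return p_minority * cost_mismatch > cost_ask
```

Under this rule, always inferring from majority patterns is only optimal when minority preferences are rare or cheap to get wrong, which is the bias the summary describes.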
-
Context Collapse: Barriers to Adoption for Generative AI in Workplace Settings
Expert interviews demonstrate that context in generative AI workplace use collapses or rots over time, limiting tool effectiveness and revealing pitfalls in computational context approaches.