Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies.Transactions of the Association for Computational Linguistics2021,9, 346–361

Geva, M · 2021 · DOI 10.1162/tacl_a_00370

11 Pith papers cite this work. Polarity classification is still indexing.

11 Pith papers citing it

open at publisher browse 11 citing papers

representative citing papers

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

cs.CL · 2022-01-28 · accept · novelty 9.0

Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.

The Harder Text Embedding Benchmark (HTEB): Beyond One-dimensional Static Robustness

cs.CL · 2026-05-27 · unverdicted · novelty 7.0

HTEB introduces dynamic, multi-axis evaluation of text embedding robustness using LLM transformations, finding decoupled profiles across models and that scaling does not close all robustness gaps.

CommonWhy: A Dataset for Evaluating Entity-Based Causal Commonsense Reasoning in Large Language Models

cs.CL · 2026-05-13 · unverdicted · novelty 7.0

CommonWhy is a new dataset of 15,000 why-questions for evaluating LLMs on entity-based causal commonsense reasoning grounded in Wikidata.

Profile-Then-Reason: Bounded Semantic Complexity for Tool-Augmented Language Agents

cs.AI · 2026-04-05 · unverdicted · novelty 7.0

PTR framework profiles a workflow upfront then executes it deterministically with bounded verification and repair, limiting LM calls to 2-3 while outperforming ReAct in 16 of 24 tested configurations.

GAIA: a benchmark for General AI Assistants

cs.CL · 2023-11-21 · unverdicted · novelty 7.0

GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.

CRAFT: Cost-aware Refinement And Front-aware Tuning of Prompts

cs.CL · 2026-06-03 · unverdicted · novelty 6.0

CRAFT is a Pareto-front prompt optimizer that allocates scarce LLM validation calls to candidates near the current front using accuracy- and cost-oriented generators plus NSGA-II retention.

Automatic Chain of Thought Prompting in Large Language Models

cs.CL · 2022-10-07 · conditional · novelty 6.0

Auto-CoT automatically builds chain-of-thought demonstrations by sampling diverse questions and letting the LLM generate reasoning chains, matching manual CoT performance on ten reasoning tasks with GPT-3.

PaLM: Scaling Language Modeling with Pathways

cs.CL · 2022-04-05 · accept · novelty 6.0

PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.

From Signals to Transfer: A Factorised Study of Probe-Based Uncertainty Estimation in Large Language Models

cs.CL · 2026-06-26 · conditional · novelty 5.0

A factorized study finds raw hidden states and attention features hard to beat in-domain for LLM uncertainty probes, but structured compressed features are more robust under distribution shift, with pretrained probes transferring to open-ended generation.

Measuring Reasoning Quality in LLMs: A Multi-Dimensional Behavioral Framework

cs.AI · 2026-05-23 · unverdicted · novelty 5.0

A multi-dimensional framework with six dimensions (Correctness, Consistency, Robustness, Logical Coherence, Efficiency, Stability) is applied to seven LLMs on 975 items, revealing orthogonality between logical coherence and correctness plus ranking inversions invisible to accuracy metrics.

PaLM 2 Technical Report

cs.CL · 2023-05-17 · unverdicted · novelty 5.0

PaLM 2 reports state-of-the-art results on language, reasoning, and multilingual tasks with improved efficiency over PaLM.

citing papers explorer

Showing 2 of 2 citing papers after filters.

Automatic Chain of Thought Prompting in Large Language Models cs.CL · 2022-10-07 · conditional · none · ref 11
Auto-CoT automatically builds chain-of-thought demonstrations by sampling diverse questions and letting the LLM generate reasoning chains, matching manual CoT performance on ten reasoning tasks with GPT-3.
From Signals to Transfer: A Factorised Study of Probe-Based Uncertainty Estimation in Large Language Models cs.CL · 2026-06-26 · conditional · none · ref 34
A factorized study finds raw hidden states and attention features hard to beat in-domain for LLM uncertainty probes, but structured compressed features are more robust under distribution shift, with pretrained probes transferring to open-ended generation.

Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies.Transactions of the Association for Computational Linguistics2021,9, 346–361

fields

years

verdicts

representative citing papers

citing papers explorer