ProofWriter: Generating Implications, Proofs, and Abductive Statements over Natural Language

Oyvind Tafjord et al. · 2021 · DOI 10.18653/v1/2021.findings-acl.317

4 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

HOLMES: Evaluating Higher-Order Logical Reasoning in LLMs

cs.AI · 2026-06-22 · unverdicted · novelty 7.0

HOLMES is the first real-world benchmark for higher-order symbolic reasoning in LLMs, where models average 50.64% accuracy and the best reaches 59.54%.

Fixing FOLIO and MALLS: Verified Annotations and an LLM-assisted Framework to Focus Human Relabeling

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

Audit finds 36-39% incorrect FOL labels in FOLIO and MALLS; corrections raise LLM accuracy 9-22 points and an LLM-guided review framework achieves 90% dataset quality after checking fewer than 24% of examples.

Wiring the 'Why': A Unified Taxonomy and Survey of Abductive Reasoning in LLMs

cs.AI · 2026-04-09 · accept · novelty 7.0

The paper delivers the first survey of abductive reasoning in LLMs, a unified two-stage taxonomy, a compact benchmark, and an analysis of gaps relative to deductive and inductive reasoning.

QMFOL: Benchmarking Large Language Model Reasoning via Quantifiable Monadic First-Order Logic Test Case Generation

cs.AI · 2026-06-18 · unverdicted · novelty 6.0

QMFOL generates monadic first-order logic tasks with controllable complexity via pattern-based structures and round-trip prover verification, then evaluates six LRMs showing performance drops as logical depth and width increase.

citing papers explorer

HOLMES: Evaluating Higher-Order Logical Reasoning in LLMs · cs.AI · 2026-06-22 · unverdicted · none · ref 1
Fixing FOLIO and MALLS: Verified Annotations and an LLM-assisted Framework to Focus Human Relabeling · cs.CL · 2026-06-01 · unverdicted · none · ref 65
Wiring the 'Why': A Unified Taxonomy and Survey of Abductive Reasoning in LLMs · cs.AI · 2026-04-09 · accept · none · ref 93
QMFOL: Benchmarking Large Language Model Reasoning via Quantifiable Monadic First-Order Logic Test Case Generation · cs.AI · 2026-06-18 · unverdicted · none · ref 29
