LogicAsker: Evaluating and Improving the Logical Reasoning Ability of Large Language Models

Wan, Y · 2024 · DOI 10.18653/v1/2024.emnlp-main.128

4 Pith papers cite this work. Polarity classification is still indexing.


representative citing papers

LGMT: Logic-Grounded Metamorphic Testing for Evaluating the Reasoning Reliability of LLMs

cs.AI · 2026-05-12 · unverdicted · novelty 7.0

LGMT applies metamorphic testing derived from first-order logic equivalences to detect reasoning inconsistencies in LLMs that static benchmarks miss.

Stress-Testing the Reasoning Competence of LLMs With Proofs Under Minimal Formalism

cs.LO · 2026-04-07 · unverdicted · novelty 7.0

ProofGrid is a new benchmark for LLM reasoning that uses machine-checkable proofs in minimal formal notation, revealing progress on basic tasks but major gaps in complex combinatorial and synthesis reasoning.

Beyond Memorization: Extending Reasoning Depth with Recurrence, Memory and Test-Time Compute Scaling

cs.LG · 2025-08-22 · unverdicted · novelty 6.0

In a cellular automata rule-inference task designed to block memorization, neural models achieve high next-step accuracy but accuracy falls sharply with longer reasoning chains; depth, recurrence, memory, and test-time compute extend the reachable depth but do not remove the bound.

Measuring Reasoning Quality in LLMs: A Multi-Dimensional Behavioral Framework

cs.AI · 2026-05-23 · 2 refs

citing papers explorer

Showing 4 of 4 citing papers.

LGMT: Logic-Grounded Metamorphic Testing for Evaluating the Reasoning Reliability of LLMs

cs.AI · 2026-05-12 · unverdicted · none · ref 49

Stress-Testing the Reasoning Competence of LLMs With Proofs Under Minimal Formalism

cs.LO · 2026-04-07 · unverdicted · none · ref 49

Beyond Memorization: Extending Reasoning Depth with Recurrence, Memory and Test-Time Compute Scaling

cs.LG · 2025-08-22 · unverdicted · none · ref 68

Measuring Reasoning Quality in LLMs: A Multi-Dimensional Behavioral Framework

cs.AI · 2026-05-23 · unreviewed · ref 24 · 2 links
