Diagnosing the First-Order Logical Reasoning Ability Through LogicNLI
6 Pith papers cite this work.
Years: 2026 (6 citing papers)
citing papers explorer
- HOLMES: Evaluating Higher-Order Logical Reasoning in LLMs
  HOLMES is the first real-world benchmark for higher-order symbolic reasoning in LLMs, where models average 50.64% accuracy and the best reaches 59.54%.
- Fixing FOLIO and MALLS: Verified Annotations and an LLM-assisted Framework to Focus Human Relabeling
  Audit finds 36-39% incorrect FOL labels in FOLIO and MALLS; corrections raise LLM accuracy 9-22 points and an LLM-guided review framework achieves 90% dataset quality after checking fewer than 24% of examples.
- QMFOL: Benchmarking Large Language Model Reasoning via Quantifiable Monadic First-Order Logic Test Case Generation
  QMFOL generates monadic first-order logic tasks with controllable complexity via pattern-based structures and round-trip prover verification, then evaluates six LRMs showing performance drops as logical depth and width increase.
- ChLogic: Evaluating Robustness of Logical Reasoning in Chinese Expressions
  ChLogic benchmark shows persistent English-Chinese gaps in LLM logical reasoning performance, with back-translation effects varying by model and difficulty.
- Scaling with Confidence: Calibrating Confidence of LLMs for Adaptive Test Time Scaling
  C3RL is a new RL algorithm combining correctness, calibration, and reference accuracy rewards to improve LLM confidence calibration, enabling CAS to outperform majority voting with up to 12.33x lower inference cost.
- LGMT: Logic-Grounded Metamorphic Testing for Evaluating the Reasoning Reliability of LLMs