arxiv: 2606.02837 · v1 · pith:2MSWBSYKnew · submitted 2026-06-01 · 💻 cs.CL · cs.AI

Fixing FOLIO and MALLS: Verified Annotations and an LLM-assisted Framework to Focus Human Relabeling

Pith reviewed 2026-06-28 14:27 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords NL-to-FOL translation · dataset auditing · annotation errors · LLM-assisted review · First-Order Logic · Natural Language Inference · benchmark quality

The pith

Incorrect FOL formalizations affect 39% of FOLIO and 36% of MALLS entries, and evaluating against the corrected labels raises measured LLM accuracy by 9 to 22 percentage points.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper performs a systematic human audit of the FOLIO validation split and a subset of MALLS test instances, revealing that roughly 39% and 36% of entries, respectively, contain incorrect FOL ground-truth labels, along with notable rates of ambiguous sentences and some incorrect NLI labels. It releases corrected annotations and shows that three state-of-the-art LLMs obtain accuracy gains of 9 to 22 percentage points when evaluated on the fixed labels instead of the originals. To scale future audits, the work introduces an LLM-assisted framework that directs human reviewers to the instances most likely to contain errors, reaching 90% dataset accuracy after inspecting fewer than 24% of entries, compared with over 70% under unguided review.

Core claim

Systematic human inspection shows that approximately 39% of FOLIO entries and 36% of sampled MALLS entries have incorrect FOL formalizations as ground truth. FOLIO and MALLS also contain 16.4% and 48% ambiguous natural-language sentences, respectively, and FOLIO additionally has 8.4% incorrect NLI labels. Evaluating against the corrected ground truths raises accuracy for Gemma 4 31B-it, Qwen3-30B-A3B, and GPT-4o-mini by 9 to 22 percentage points, while an LLM-based framework prioritizes error-prone instances so that reviewers reach 90% dataset accuracy after examining under 24% of the data.

What carries the argument

An LLM-based framework that scores instances for likely annotation errors and directs human reviewers to the highest-risk subset first.
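
The effect of risk-ordered review can be illustrated with a small simulation. This is a minimal sketch, not the paper's framework: the function names, the toy error flags, and the risk scores are all hypothetical, and it assumes that every reviewed instance ends up correctly labeled. The dataset-order baseline is only a stand-in for unguided review.

```python
def accuracy_after_review(has_error, order, n_reviewed):
    """Dataset accuracy once the first n_reviewed instances in `order` are fixed.

    Assumption (illustrative): a reviewed instance is always corrected.
    """
    reviewed = set(order[:n_reviewed])
    remaining = sum(1 for i, bad in enumerate(has_error) if bad and i not in reviewed)
    total = len(has_error)
    return (total - remaining) / total


def effort_to_reach(has_error, order, target=0.90):
    """Smallest fraction of instances to review, following `order`, to hit `target`."""
    total = len(has_error)
    for n_reviewed in range(total + 1):
        if accuracy_after_review(has_error, order, n_reviewed) >= target:
            return n_reviewed / total
    return 1.0


def prioritized_order(risk_scores):
    """Instance indices sorted from highest to lowest risk score."""
    return sorted(range(len(risk_scores)), key=lambda i: risk_scores[i], reverse=True)


# Toy data (hypothetical): 4 of 10 labels are wrong; scores mimic an imperfect judge.
has_error = [True, False, True, False, False, True, False, False, True, False]
risk = [0.9, 0.2, 0.8, 0.1, 0.3, 0.4, 0.7, 0.1, 0.6, 0.2]

guided = effort_to_reach(has_error, prioritized_order(risk))
unguided = effort_to_reach(has_error, list(range(len(has_error))))
print(guided, unguided)  # prints 0.4 0.6
```

In this toy setting the risk-ordered reviewer reaches 90% accuracy after 40% of the data, against 60% when reviewing in dataset order. The gap depends entirely on how well the scores separate erroneous from correct instances.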

If this is right

  • All prior model comparisons and leaderboard rankings on FOLIO and MALLS must be recomputed with the corrected labels.
  • Neurosymbolic systems trained or evaluated on these datasets inherit the original label noise and require re-testing.
  • The targeted-review approach cuts the human labor required to produce high-accuracy NL-to-FOL data by roughly two-thirds (from over 70% of instances reviewed to under 24%).
  • Any new NL-to-FOL benchmark should incorporate the same inspection step before release.
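
The labor saving in the third bullet follows from the two review fractions quoted above. As a quick check, the sketch below computes the relative reduction at the boundary values of 24% and 70%. The helper name is hypothetical, and because both figures are bounds ("fewer than" and "over"), the result is a lower bound on the saving.

```python
def relative_reduction(guided_fraction, unguided_fraction):
    """Share of review effort saved by guided review relative to unguided review."""
    return 1.0 - guided_fraction / unguided_fraction


saving = relative_reduction(0.24, 0.70)
print(f"{saving:.1%}")  # prints 65.7%
```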

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Un-audited NL-to-FOL or NLI datasets in other domains are likely to contain comparable fractions of label errors.
  • The prioritization logic could be transferred to improve efficiency in other annotation-heavy tasks such as semantic parsing or program synthesis.
  • Public release of the verified annotations creates a reusable reference that future work can treat as a cleaner baseline.

Load-bearing premise

The human inspection process correctly and consistently identifies incorrect FOL formalizations and ambiguities without systematic bias or new errors introduced during correction.

What would settle it

An independent team re-inspecting a random sample of the released corrections and reporting disagreement rates above 10% on the FOL labels would indicate that the reported error rates and accuracy gains rest on unreliable fixes.
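
One way such a re-inspection could be scored is with a confidence interval on the observed disagreement rate, so that a small sample is not over-read. The sketch below uses the Wilson score interval. The sample sizes are hypothetical, and requiring the lower bound to clear the 10% threshold is an assumption of this example, not a criterion stated in the paper.

```python
import math


def wilson_interval(disagreements, sample_size, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% coverage)."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    p = disagreements / sample_size
    denom = 1 + z**2 / sample_size
    centre = (p + z**2 / (2 * sample_size)) / denom
    half = z * math.sqrt(p * (1 - p) / sample_size + z**2 / (4 * sample_size**2)) / denom
    return centre - half, centre + half


def exceeds_threshold(disagreements, sample_size, threshold=0.10):
    """True if even the lower bound of the interval lies above the threshold."""
    low, _ = wilson_interval(disagreements, sample_size)
    return low > threshold


# Hypothetical re-inspection outcomes on a random sample of 150 corrections.
print(exceeds_threshold(30, 150))  # 20% observed disagreement
print(exceeds_threshold(12, 150))  # 8% observed disagreement
```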

Figures

Figures reproduced from arXiv: 2606.02837 by Andrea Brunello, Angelo Montanari, Cristian Curaba, Luca Geatti, Michele Mignani, Nicola Saccomanno.

Figure 1: The two pipelines. Each starts from the Initial Dataset containing triplets (p, Ω, φ), and produces an output with the Formalization Proposal (p, Ω, ψ̂) and the verdict v. Pipeline 1 judges the original formula φ directly. Pipeline 2 first re-generates a candidate φ̂ from p and Ω alone, then judges it. Pipeline 2 (Re-generation and V&R): the original formula φ is discarded. The LLM first translates p unde…
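
The control flow described in the Figure 1 caption can be sketched as follows. This is an illustrative outline only: `Instance`, `Proposal`, the `judge` and `translate` callables, and the toy stand-ins are hypothetical, the role of Ω is not spelled out in this excerpt (it is treated here as an opaque string), and any refinement step inside the judge is omitted.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Instance:
    p: str      # natural-language sentence
    omega: str  # Ω from the caption; its exact content is an assumption here
    phi: str    # original FOL formula from the dataset


@dataclass
class Proposal:
    p: str
    omega: str
    formula: str  # the formula that was judged
    verdict: str


Judge = Callable[[str, str, str], str]
Translator = Callable[[str, str], str]


def pipeline_1(inst: Instance, judge: Judge) -> Proposal:
    """Judge the dataset's original formula directly."""
    return Proposal(inst.p, inst.omega, inst.phi, judge(inst.p, inst.omega, inst.phi))


def pipeline_2(inst: Instance, translate: Translator, judge: Judge) -> Proposal:
    """Discard the original formula, re-generate a candidate, then judge it."""
    candidate = translate(inst.p, inst.omega)
    return Proposal(inst.p, inst.omega, candidate, judge(inst.p, inst.omega, candidate))


# Toy stand-ins for the LLM calls (hypothetical, for demonstration only).
def toy_translate(p: str, omega: str) -> str:
    return "forall x (Dog(x) -> Animal(x))"


def toy_judge(p: str, omega: str, formula: str) -> str:
    return "correct" if "->" in formula else "incorrect"


inst = Instance("All dogs are animals.", "Dog/1, Animal/1", "forall x (Dog(x) & Animal(x))")
print(pipeline_1(inst, toy_judge).verdict)
print(pipeline_2(inst, toy_translate, toy_judge).verdict)
```

Passing the LLM calls in as plain callables keeps the two pipelines comparable: they differ only in which formula reaches the judge.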
Figure 2: Pipelines comparison across models (horizontally) and datasets (vertically) according to the …
Figure 3: Accuracy-human effort curve. Each plot shows Gemma’s performance on FOLIO_validation (left) and MALLS_test (right). The blue band represents Pipeline 1 (min/max and average across the six prompting combinations {B1, B2, B3}×{pv1, pv3}); the red curve shows the best-AUC configuration of Pipeline 2; the black and green lines represent the Black and Green Baselines, respectively. … above 97%. This opens the possi…
Figure 4: Accuracy conditioned on the verdict assigned by the judge, for Pipeline 1 (blue) and Pipeline 2 …
Figure 5: Pipeline comparison across models and datasets under the …
Figure 6: Pipeline comparison across models and datasets under the …
Figure 7: Pipeline comparison across models and datasets under the AAG metric.
Figure 8: Model comparison across all three datasets and all four metrics (AUC, AAG, …
Figure 9: Prompting strategies and variants comparison for Pipeline 1. Blue-framed cells mark the …
Figure 10: Prompting strategies and variants comparison for Pipeline 2
Figure 11: Extension of Figure …
Original abstract

Accurate translation from Natural Language to First-Order Logic (NL-to-FOL) underpins neurosymbolic AI systems and Natural Language Inference (NLI), making the quality of NL-to-FOL benchmarks essential, yet these datasets have never been rigorously audited. Our first contribution is to present a systematic human inspection of the validation split of FOLIO and a subset of MALLS test instances, finding that approximately 39% and 36% of entries, respectively, contain incorrect FOL formalizations (i.e., ground truth labels), with additional rates of ambiguous NL sentences (16.4% and 48%) and incorrect NLI labels in FOLIO (8.4%). Our second contribution is to develop and release corrected ground truths for such datasets, showing that annotation errors distort model evaluation on a reference benchmark task: testing three state-of-the-art LLMs (Gemma 4 31B-it, Qwen3-30B-A3B, and GPT-4o-mini) with the corrected ground truths yields accuracy gains from +9 to +22 percentage points. Motivated by these findings, we propose an LLM-based framework to support humans in manually reviewing NL-to-FOL datasets. By directing reviewers toward the most error-prone instances, we empirically show that it is possible to achieve 90% dataset accuracy after reviewing fewer than 24% of instances, compared to over 70% required by unguided review. We release all human-verified annotations and the code for our framework.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper audits the validation split of FOLIO and a subset of MALLS for NL-to-FOL translation quality via systematic human inspection, reporting ~39% and ~36% incorrect FOL formalizations (ground truth labels), plus 16.4%/48% ambiguous NL sentences and 8.4% incorrect NLI labels in FOLIO. It releases corrected annotations, shows that re-evaluating three LLMs (Gemma 4 31B-it, Qwen3-30B-A3B, GPT-4o-mini) on the corrected labels yields +9 to +22 pp accuracy gains, and proposes an LLM-assisted framework that directs human review to error-prone instances, achieving 90% dataset accuracy after reviewing <24% of instances versus >70% for unguided review.

Significance. If the human-verified corrections hold, the work demonstrates that annotation errors in prominent NL-to-FOL benchmarks materially distort model evaluations and supplies both corrected data and a practical prioritization framework that reduces human effort. Releasing the verified annotations and framework code strengthens reproducibility and enables follow-on auditing in neurosymbolic AI.

major comments (3)
  1. [Abstract and §3] Abstract and §3 (Human Inspection): The central numerical claims (39%/36% incorrect FOL, 16.4%/48% ambiguous, 8.4% wrong NLI) rest entirely on the authors' human inspection, yet the manuscript supplies no information on inspection protocol, number of annotators, inter-annotator agreement statistics, adjudication procedure for disagreements, or selection criteria for the MALLS subset. This directly undermines the load-bearing error-rate statistics and the downstream accuracy-gain results.
  2. [§4] §4 (Model Evaluation): The reported +9 to +22 pp accuracy gains are computed by comparing LLM performance on the original versus the authors' corrected labels. Without an independent validation of the corrections (e.g., blind re-annotation or external expert review), it is impossible to distinguish genuine error fixes from systematic shifts introduced by the inspection process itself.
  3. [§5] §5 (LLM-assisted Framework): The claim that the framework reaches 90% accuracy after reviewing <24% of instances depends on the same unvalidated human judgments used to define the 'error-prone' instances; any bias in the initial inspection propagates into the prioritization model and the reported efficiency gains.
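
The inter-annotator agreement statistic requested in the first major comment is commonly reported as Cohen's kappa for two annotators. The sketch below is a plain-Python illustration with hypothetical labels. Nothing in this excerpt says which agreement measure, if any, the authors used.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical verdicts on four FOL formalizations.
annotator_1 = ["ok", "ok", "err", "err"]
annotator_2 = ["ok", "err", "err", "err"]
print(cohens_kappa(annotator_1, annotator_2))  # prints 0.5
```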
minor comments (1)
  1. [Abstract] The abstract states results from human inspection but does not reference any supplementary material or appendix that might contain the missing protocol details; if such material exists, it should be explicitly cited in the main text.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The concerns about transparency in the human inspection process and validation of corrections are well-taken. We address each major comment below and will incorporate revisions to strengthen the manuscript.

  1. Referee: [Abstract and §3] Abstract and §3 (Human Inspection): The central numerical claims (39%/36% incorrect FOL, 16.4%/48% ambiguous, 8.4% wrong NLI) rest entirely on the authors' human inspection, yet the manuscript supplies no information on inspection protocol, number of annotators, inter-annotator agreement statistics, adjudication procedure for disagreements, or selection criteria for the MALLS subset. This directly undermines the load-bearing error-rate statistics and the downstream accuracy-gain results.

    Authors: We agree that the current description of the inspection process is insufficient. In the revised manuscript we will expand §3 with a dedicated subsection that specifies the annotation protocol, number of annotators and their qualifications, the guidelines provided to them, inter-annotator agreement statistics, the procedure used to resolve disagreements, and the exact selection criteria applied to the MALLS subset. These additions will make the reported error rates fully reproducible.
    Revision: yes

  2. Referee: [§4] §4 (Model Evaluation): The reported +9 to +22 pp accuracy gains are computed by comparing LLM performance on the original versus the authors' corrected labels. Without an independent validation of the corrections (e.g., blind re-annotation or external expert review), it is impossible to distinguish genuine error fixes from systematic shifts introduced by the inspection process itself.

    Authors: We acknowledge that the manuscript does not include an independent blind re-annotation by external experts. The corrections were produced through systematic logical comparison of each FOL formula against its NL premise by the authors. In revision we will add an explicit limitations paragraph in §4 that discusses the possibility of systematic bias, reports any internal consistency checks performed, and stresses that the full set of corrected annotations is released publicly so that the community can perform independent verification. We maintain that the observed accuracy gains are driven by the removal of clear logical mismatches, but we will present this as an acknowledged limitation rather than a fully externally validated result.
    Revision: partial

  3. Referee: [§5] §5 (LLM-assisted Framework): The claim that the framework reaches 90% accuracy after reviewing <24% of instances depends on the same unvalidated human judgments used to define the 'error-prone' instances; any bias in the initial inspection propagates into the prioritization model and the reported efficiency gains.

    Authors: We agree that the framework evaluation inherits the same human judgments used to label errors. In the revision we will clarify in §5 how the prioritization model was trained (on features derived from the inspected data), provide additional ablation results that isolate the contribution of the LLM component, and add a discussion of how inspection bias could affect the reported efficiency numbers. We will also release the framework code and the full set of model predictions so that others can re-evaluate the prioritization under alternative label sets.
    Revision: yes

Circularity Check

0 steps flagged

Empirical audit reports direct observations with no self-referential reductions

The paper reports error rates (39%/36% incorrect FOL, etc.) and LLM accuracy gains (+9 to +22 pp) obtained via systematic human inspection of existing dataset instances followed by direct re-evaluation of models on the resulting corrected labels. These quantities are produced by external annotation and testing steps rather than any equation, fitted parameter, or self-citation chain that reduces the outputs to the inputs by construction. No self-definitional, fitted-input-called-prediction, or ansatz-smuggling patterns appear in the abstract or described contributions. The LLM-assisted framework is a separate proposal and does not alter the reported statistics.
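
The "direct re-evaluation" step amounts to scoring the same model outputs against two label sets and taking the difference. A minimal sketch, assuming each output has already been marked correct or incorrect against each label set by some equivalence check not described here; the function name and the flags are hypothetical.

```python
def gain_in_points(correct_vs_original, correct_vs_corrected):
    """Accuracy difference, in percentage points, between two label sets.

    Each argument holds one boolean per instance: whether the model output
    counts as correct against that label set.
    """
    if len(correct_vs_original) != len(correct_vs_corrected) or not correct_vs_original:
        raise ValueError("flag lists must be non-empty and of equal length")
    n = len(correct_vs_original)
    return 100 * (sum(correct_vs_corrected) - sum(correct_vs_original)) / n


# Hypothetical outcomes for five model outputs.
vs_original = [True, False, False, True, False]   # 40% accuracy
vs_corrected = [True, True, False, True, True]    # 80% accuracy
print(gain_in_points(vs_original, vs_corrected))  # prints 40.0
```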

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that human judgment constitutes reliable ground truth for FOL correctness; no free parameters or invented entities are introduced.

axioms (1)
  • Domain assumption: human annotators can reliably determine whether a given FOL formula is a correct formalization of a natural language sentence.
    All reported error rates and accuracy gains derive from this human judgment step.

