Fixing FOLIO and MALLS: Verified Annotations and an LLM-assisted Framework to Focus Human Relabeling

Andrea Brunello; Angelo Montanari; Cristian Curaba; Luca Geatti; Michele Mignani; Nicola Saccomanno

arXiv: 2606.02837 · v1 · pith:2MSWBSYK · submitted 2026-06-01 · 💻 cs.CL · cs.AI



Pith reviewed 2026-06-28 14:27 UTC · model grok-4.3


keywords: NL-to-FOL translation · dataset auditing · annotation errors · LLM-assisted review · First-Order Logic · Natural Language Inference · benchmark quality


The pith

Incorrect FOL formalizations affect 39% of FOLIO validation entries and 36% of sampled MALLS test entries, and corrections raise LLM accuracy by 9 to 22 percentage points.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper performs a systematic human audit of the FOLIO validation split and a subset of MALLS test instances, revealing that roughly 39% and 36% of entries contain incorrect FOL ground-truth labels along with notable rates of ambiguous sentences and some incorrect NLI labels. It releases corrected annotations and shows that three state-of-the-art LLMs obtain accuracy gains of 9 to 22 percentage points when evaluated on the fixed labels instead of the originals. To scale future audits, the work introduces an LLM-assisted framework that directs human reviewers to the instances most likely to contain errors, enabling 90% dataset accuracy after inspecting fewer than 24% of entries, compared with over 70% under unguided review.

Core claim

Systematic human inspection shows that approximately 39% of FOLIO validation entries and 36% of sampled MALLS entries have incorrect FOL formalizations as ground truth, accompanied by 16.4% and 48% ambiguous natural-language sentences plus 8.4% incorrect NLI labels in FOLIO; the corrected ground truths raise accuracy for Gemma 4 31B-it, Qwen3-30B-A3B, and GPT-4o-mini by 9 to 22 percentage points, while an LLM-based framework prioritizes error-prone instances so that reviewers reach 90% dataset accuracy after examining under 24% of the data.

What carries the argument

An LLM-based framework that scores instances for likely annotation errors and directs human reviewers to the highest-risk subset first.
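To make the accuracy-versus-effort trade-off concrete, here is a minimal Python sketch of how such a prioritization could be evaluated. It is not the authors' code and rests on assumptions:

- each instance carries a binary human-audit flag `is_error`;
- the framework emits one risk score per instance;
- every reviewed instance gets corrected.

The function names are hypothetical. The normalized trapezoidal AUC is one common choice; the paper's exact AUC definition is not given in this summary.

```python
from typing import List, Sequence


def accuracy_effort_curve(is_error: Sequence[bool], scores: Sequence[float]) -> List[float]:
    """Dataset accuracy after reviewing the k highest-scoring instances, for k = 0..n.

    Assumption: a reviewed instance is always fixed, so it no longer counts as an error.
    """
    n = len(is_error)
    # Highest risk score first: the order in which a guided reviewer works.
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    remaining = sum(is_error)
    curve = [(n - remaining) / n]
    for i in order:
        if is_error[i]:
            remaining -= 1
        curve.append((n - remaining) / n)
    return curve


def effort_to_reach(curve: Sequence[float], target: float) -> float:
    """Smallest fraction of the dataset that must be reviewed to reach `target` accuracy."""
    n = len(curve) - 1
    for k, acc in enumerate(curve):
        if acc >= target:
            return k / n
    return 1.0  # fallback; unreachable here because the last point of the curve is 1.0


def normalized_auc(curve: Sequence[float]) -> float:
    """Trapezoidal area under the curve, with the effort axis scaled to [0, 1]."""
    n = len(curve) - 1
    return sum((curve[k] + curve[k + 1]) / 2 for k in range(n)) / n


# Toy data: 10 instances, 4 of them flagged as wrong by a hypothetical audit.
is_error = [True, False, False, True, False, True, False, False, True, False]
scores = [0.9, 0.1, 0.2, 0.8, 0.3, 0.7, 0.05, 0.4, 0.35, 0.15]
curve = accuracy_effort_curve(is_error, scores)
print(curve[0], effort_to_reach(curve, 0.9), round(normalized_auc(curve), 2))  # 0.6 0.3 0.91
```

In these terms, the headline claim reads as `effort_to_reach(curve, 0.90) < 0.24` on the real data. Passing random scores instead of the framework's scores approximates the unguided baseline.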

If this is right

All prior model comparisons and leaderboard rankings on FOLIO and MALLS must be recomputed with the corrected labels.
Neurosymbolic systems trained or evaluated on these datasets inherit the original label noise and require re-testing.
The targeted-review approach cuts the human labor required to produce high-accuracy NL-to-FOL data by roughly two-thirds (fewer than 24% versus over 70% of instances reviewed to reach 90% accuracy).
Any new NL-to-FOL benchmark should incorporate the same inspection step before release.
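The size of the saving in reviewer effort follows from the reported bounds alone. Guided review needs fewer than 24% of instances, and unguided review needs more than 70%. The relative reduction is therefore bounded below by 1 - 24/70, which is just under two-thirds. A minimal sketch of that arithmetic, with a hypothetical helper name:

```python
def min_relative_reduction(guided_upper: float, unguided_lower: float) -> float:
    """Lower bound on the relative cut in reviewed instances.

    guided_upper:   guided review needs *fewer than* this fraction of the dataset.
    unguided_lower: unguided review needs *more than* this fraction of the dataset.
    """
    return 1.0 - guided_upper / unguided_lower


reduction = min_relative_reduction(0.24, 0.70)
print(f"at least {reduction:.1%}")  # at least 65.7%
```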

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Un-audited NL-to-FOL or NLI datasets in other domains are likely to contain comparable fractions of label errors.
The prioritization logic could be transferred to improve efficiency in other annotation-heavy tasks such as semantic parsing or program synthesis.
Public release of the verified annotations creates a reusable reference that future work can treat as a cleaner baseline.

Load-bearing premise

The human inspection process correctly and consistently identifies incorrect FOL formalizations and ambiguities without systematic bias or new errors introduced during correction.

What would settle it

An independent team re-inspecting a random sample of the released corrections and reporting disagreement rates above 10% on the FOL labels would indicate that the reported error rates and accuracy gains rest on unreliable fixes.

Figures

Figures reproduced from arXiv: 2606.02837 by Andrea Brunello, Angelo Montanari, Cristian Curaba, Luca Geatti, Michele Mignani, Nicola Saccomanno.

**Figure 1.** The two pipelines. Each starts from the Initial Dataset containing triplets (p, Ω, φ) and produces an output with the Formalization Proposal (p, Ω, ψ̂) and the verdict v. Pipeline 1 judges the original formula φ directly. Pipeline 2 first re-generates a candidate φ̂ from p and Ω alone, then judges it. Pipeline 2: Re-generation and V&R. The original formula φ is discarded. The LLM first translates p unde…

**Figure 2.** Pipelines comparison across models (horizontally) and datasets (vertically) according to the …

**Figure 3.** Accuracy-human effort curve. Each plot shows Gemma's performance on FOLIO_validation (left) and MALLS_test (right). The blue band represents Pipeline 1 (min/max and average across the six prompting combinations {B1, B2, B3}×{pv1, pv3}); the red curve shows the best-AUC configuration of Pipeline 2; the black and green lines represent the Black and Green Baselines, respectively.

**Figure 4.** Accuracy conditioned on the verdict assigned by the judge, for Pipeline 1 (blue) and Pipeline 2 …

**Figure 5.** Pipeline comparison across models and datasets under the …

**Figure 6.** Pipeline comparison across models and datasets under the …

**Figure 7.** Pipeline comparison across models and datasets under the AAG metric.

**Figure 8.** Model comparison across all three datasets and all four metrics (AUC, AAG, …)

**Figure 9.** Prompting strategies and variants comparison for Pipeline 1. Blue-framed cells mark the …

**Figure 10.** Prompting strategies and variants comparison for Pipeline 2.

**Figure 11.** Extension of Figure …
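The difference between the two pipelines in Figure 1 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code:

- `translate` and `judge` are hypothetical stand-ins for the LLM calls;
- Ω is treated as an opaque string accompanying the premise (here, a made-up predicate list);
- the verdict strings are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical LLM-backed components (stand-ins, not a real API).
Translate = Callable[[str, str], str]               # (p, omega) -> candidate formula
Judge = Callable[[str, str, str], Tuple[str, str]]  # (p, omega, formula) -> (proposal, verdict)


@dataclass
class Result:
    premise: str
    omega: str
    proposal: str  # the Formalization Proposal (psi-hat)
    verdict: str   # the judge's verdict v


def pipeline_1(p: str, omega: str, phi: str, judge: Judge) -> Result:
    """Judge the original dataset formula phi directly."""
    proposal, verdict = judge(p, omega, phi)
    return Result(p, omega, proposal, verdict)


def pipeline_2(p: str, omega: str, phi: str, translate: Translate, judge: Judge) -> Result:
    """Discard phi, re-generate a candidate phi-hat from p and omega, then judge it."""
    del phi  # the original formula is deliberately not used
    phi_hat = translate(p, omega)
    proposal, verdict = judge(p, omega, phi_hat)
    return Result(p, omega, proposal, verdict)


# Toy stubs so the sketch runs without any model.
def toy_translate(p: str, omega: str) -> str:
    return "all x (Dog(x) -> Mammal(x))"


def toy_judge(p: str, omega: str, formula: str) -> Tuple[str, str]:
    if "->" in formula:
        return formula, "correct"
    return formula.replace("&", "->"), "incorrect"


premise, omega = "All dogs are mammals.", "Dog/1, Mammal/1"
phi = "all x (Dog(x) & Mammal(x))"  # a typical mis-formalization of a universal
print(pipeline_1(premise, omega, phi, toy_judge).verdict)                 # incorrect
print(pipeline_2(premise, omega, phi, toy_translate, toy_judge).verdict)  # correct
```

In a triage setting, the verdicts are the natural key for ordering a review queue. Figure 4 reports how accuracy varies with the verdict the judge assigns.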

Original abstract

Accurate translation from Natural Language to First-Order Logic (NL-to-FOL) underpins neurosymbolic AI systems and Natural Language Inference (NLI), making the quality of NL-to-FOL benchmarks essential; yet these datasets have never been rigorously audited. Our first contribution is to present a systematic human inspection of the validation split of FOLIO and a subset of MALLS test instances, finding that approximately 39% and 36% of entries, respectively, contain incorrect FOL formalizations (i.e., ground-truth labels), with additional rates of ambiguous NL sentences (16.4% and 48%) and incorrect NLI labels in FOLIO (8.4%). Our second contribution is to develop and release corrected ground truths for these datasets, showing that annotation errors distort model evaluation on a reference benchmark task: testing three state-of-the-art LLMs (Gemma 4 31B-it, Qwen3-30B-A3B, and GPT-4o-mini) with the corrected ground truths yields accuracy gains from +9 to +22 percentage points. Motivated by these findings, we propose an LLM-based framework to support humans in manually reviewing NL-to-FOL datasets. By directing reviewers toward the most error-prone instances, we empirically show that it is possible to achieve 90% dataset accuracy after reviewing fewer than 24% of instances, compared to over 70% required by unguided review. We release all human-verified annotations and the code for our framework.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper finds high error rates in two NL-to-FOL benchmarks, releases fixes that shift model scores, and offers a prioritization framework, but the human audit lacks basic reproducibility details.


The core news is that FOLIO and a slice of MALLS contain roughly 39% and 36% incorrect FOL translations plus other issues, and swapping in the authors' corrections raises three LLMs' accuracy by 9-22 points. They also show an LLM-assisted triage method that reaches 90% dataset quality after checking under a quarter of the data.

The useful parts are the released corrected annotations and the concrete demonstration that bad labels were moving the numbers. Releasing the fixes lets other groups rerun their experiments without starting from scratch. The triage framework is a straightforward application of model disagreement or uncertainty to cut down manual work, and the before-after comparison gives a sense of its payoff.

The weak point is the human inspection step. The abstract reports the error counts but gives no protocol, no inter-annotator numbers, and no description of how disagreements were settled or how the MALLS subset was picked. That leaves the headline percentages resting on whatever single-pass judgment the authors applied. If the definition of "incorrect FOL" turns out to be stricter or looser than what other experts would use, both the error rates and the reported gains move with it.

Anyone who builds or evaluates NL-to-FOL systems, or who treats these datasets as gold standards for NLI, will want the corrected versions. The work is narrow but directly relevant to an active evaluation niche. It is worth sending to referees so the annotation process can be examined and, if needed, tightened before the numbers circulate further.

Referee Report

3 major / 1 minor

Summary. The paper audits the validation split of FOLIO and a subset of MALLS for NL-to-FOL translation quality via systematic human inspection, reporting ~39% and ~36% incorrect FOL formalizations (ground truth labels), plus 16.4%/48% ambiguous NL sentences and 8.4% incorrect NLI labels in FOLIO. It releases corrected annotations, shows that re-evaluating three LLMs (Gemma-4 31B-it, Qwen3-30B-A3B, GPT-4o-mini) on the corrected labels yields +9 to +22 pp accuracy gains, and proposes an LLM-assisted framework that directs human review to error-prone instances, achieving 90% dataset accuracy after reviewing <24% of instances versus >70% for unguided review.

Significance. If the human-verified corrections hold, the work demonstrates that annotation errors in prominent NL-to-FOL benchmarks materially distort model evaluations and supplies both corrected data and a practical prioritization framework that reduces human effort. Releasing the verified annotations and framework code strengthens reproducibility and enables follow-on auditing in neurosymbolic AI.

major comments (3)

[Abstract and §3, Human Inspection] The central numerical claims (39%/36% incorrect FOL, 16.4%/48% ambiguous, 8.4% wrong NLI) rest entirely on the authors' human inspection, yet the manuscript supplies no information on inspection protocol, number of annotators, inter-annotator agreement statistics, adjudication procedure for disagreements, or selection criteria for the MALLS subset. This directly undermines the load-bearing error-rate statistics and the downstream accuracy-gain results.
[§4, Model Evaluation] The reported +9 to +22 pp accuracy gains are computed by comparing LLM performance on the original versus the authors' corrected labels. Without an independent validation of the corrections (e.g., blind re-annotation or external expert review), it is impossible to distinguish genuine error fixes from systematic shifts introduced by the inspection process itself.
[§5, LLM-assisted Framework] The claim that the framework reaches 90% accuracy after reviewing <24% of instances depends on the same unvalidated human judgments used to define the 'error-prone' instances; any bias in the initial inspection propagates into the prioritization model and the reported efficiency gains.
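The referee's point about the §4 gains is easiest to see with a sketch that assumes the same model predictions are scored twice: once against the original labels and once against the corrected ones. The paper's exact scoring protocol is not described here, and the FOLIO-style `True`/`False`/`Uncertain` labels below are toy data. With predictions held fixed, the whole gain comes from the instances whose label the audit changed.

```python
from typing import List, Sequence


def accuracy(preds: Sequence[str], gold: Sequence[str]) -> float:
    if len(preds) != len(gold) or not gold:
        raise ValueError("need non-empty sequences of equal length")
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)


def gain_pp(preds: Sequence[str], original: Sequence[str], corrected: Sequence[str]) -> float:
    """Accuracy gain, in percentage points, from swapping original labels for corrected ones."""
    return 100.0 * (accuracy(preds, corrected) - accuracy(preds, original))


def changed_items(original: Sequence[str], corrected: Sequence[str]) -> List[int]:
    """Indices whose gold label the audit modified; only these can move the score."""
    return [i for i, (o, c) in enumerate(zip(original, corrected)) if o != c]


preds = ["True", "False", "Uncertain", "True", "False"]
original = ["True", "True", "Uncertain", "False", "False"]
corrected = ["True", "False", "Uncertain", "False", "False"]
print(round(gain_pp(preds, original, corrected), 1), changed_items(original, corrected))  # 20.0 [1]
```

An independent check of exactly the `changed_items` subset would therefore be enough to test whether the gain reflects genuine fixes.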

minor comments (1)

[Abstract] The abstract states results from human inspection but does not reference any supplementary material or appendix that might contain the missing protocol details; if such material exists, it should be explicitly cited in the main text.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The concerns about transparency in the human inspection process and validation of corrections are well-taken. We address each major comment below and will incorporate revisions to strengthen the manuscript.


Referee: [Abstract and §3, Human Inspection] The central numerical claims (39%/36% incorrect FOL, 16.4%/48% ambiguous, 8.4% wrong NLI) rest entirely on the authors' human inspection, yet the manuscript supplies no information on inspection protocol, number of annotators, inter-annotator agreement statistics, adjudication procedure for disagreements, or selection criteria for the MALLS subset. This directly undermines the load-bearing error-rate statistics and the downstream accuracy-gain results.

Authors: We agree that the current description of the inspection process is insufficient. In the revised manuscript we will expand §3 with a dedicated subsection that specifies the annotation protocol, number of annotators and their qualifications, the guidelines provided to them, inter-annotator agreement statistics, the procedure used to resolve disagreements, and the exact selection criteria applied to the MALLS subset. These additions will make the reported error rates fully reproducible. revision: yes
Referee: [§4, Model Evaluation] The reported +9 to +22 pp accuracy gains are computed by comparing LLM performance on the original versus the authors' corrected labels. Without an independent validation of the corrections (e.g., blind re-annotation or external expert review), it is impossible to distinguish genuine error fixes from systematic shifts introduced by the inspection process itself.

Authors: We acknowledge that the manuscript does not include an independent blind re-annotation by external experts. The corrections were produced through systematic logical comparison of each FOL formula against its NL premise by the authors. In revision we will add an explicit limitations paragraph in §4 that discusses the possibility of systematic bias, reports any internal consistency checks performed, and stresses that the full set of corrected annotations is released publicly so that the community can perform independent verification. We maintain that the observed accuracy gains are driven by the removal of clear logical mismatches, but we will present this as an acknowledged limitation rather than a fully externally validated result. revision: partial
Referee: [§5, LLM-assisted Framework] The claim that the framework reaches 90% accuracy after reviewing <24% of instances depends on the same unvalidated human judgments used to define the 'error-prone' instances; any bias in the initial inspection propagates into the prioritization model and the reported efficiency gains.

Authors: We agree that the framework evaluation inherits the same human judgments used to label errors. In the revision we will clarify in §5 how the prioritization model was trained (on features derived from the inspected data), provide additional ablation results that isolate the contribution of the LLM component, and add a discussion of how inspection bias could affect the reported efficiency numbers. We will also release the framework code and the full set of model predictions so that others can re-evaluate the prioritization under alternative label sets. revision: yes

Circularity Check

0 steps flagged

Empirical audit reports direct observations with no self-referential reductions


The paper reports error rates (39%/36% incorrect FOL, etc.) and LLM accuracy gains (+9 to +22 pp) obtained via systematic human inspection of existing dataset instances followed by direct re-evaluation of models on the resulting corrected labels. These quantities are produced by external annotation and testing steps rather than any equation, fitted parameter, or self-citation chain that reduces the outputs to the inputs by construction. No self-definitional, fitted-input-called-prediction, or ansatz-smuggling patterns appear in the abstract or described contributions. The LLM-assisted framework is a separate proposal and does not alter the reported statistics.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on the assumption that human judgment constitutes reliable ground truth for FOL correctness; no free parameters or invented entities are introduced.

axioms (1)

Domain assumption: human annotators can reliably determine whether a given FOL formula is a correct formalization of a natural language sentence.
All reported error rates and accuracy gains derive from this human judgment step.

pith-pipeline@v0.9.1-grok · 5838 in / 1249 out tokens · 28437 ms · 2026-06-28T14:27:34.335795+00:00 · methodology


Reference graph

Works this paper leans on

105 extracted references · 61 canonical work pages · 1 internal anchor

[1]

13th International Conference on Intelligent Computer Mathematics (CICM) , series =

Christian Szegedy , title =. 13th International Conference on Intelligent Computer Mathematics (CICM) , series =. 2020 , doi =

2020
[2]

2024 , url =

Long Hei Matthew Lam and others , title =. 2024 , url =

2024
[4]

Cox and Robert Dale , title =

Dave Barker-Plummer and Richard J. Cox and Robert Dale , title =. 2011 , isbn =

2011
[5]

CoRR , volume =

Dalrymple, David "davidad" and Skalse, Joar and Bengio, Yoshua and Russel, Stuart and Tegmark, Max and Seshia, Sanjit and Omohundro, Steve and Szegedy, Christian and Goldhaber, Ben and Ammann, Nora and Abate, Alessandro and Halpern, Joe and Barrett, Clark and Zhao, Ding and Zhi-Xuan, Tan and Wing, Jeannette and Tenenbaum, Joshua , title =. CoRR , volume =...

2024
[6]

and Dale, Robert , booktitle=

Barker-Plummer, Dave and Cox, Richard J. and Dale, Robert , booktitle=. Student translations of natural language into logic:
[7]

Cox and Robert Dale , year=

Dave Barker-Plummer and Richard J. Cox and Robert Dale , year=. Tarski’s
[8]

Deshmukh, Jyotirmoy and Kantaros, Yiannis , title =

Wang, Jun and Sundarsingh, David Smith and V. Deshmukh, Jyotirmoy and Kantaros, Yiannis , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2504.21022 , eprinttype =. 2504.21022 , timestamp =

work page doi:10.48550/arxiv.2504.21022 2025
[9]

2021 , url =

Apurwa Yadav and Aarshil Patel and Manan Shah , title =. 2021 , url =. doi:10.1016/J.AIOPEN.2021.05.001 , timestamp =

work page doi:10.1016/j.aiopen.2021.05.001 2021
[10]

CoRR , volume =

Lei Xu and others , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2510.06774 , eprinttype =. 2510.06774 , timestamp =

work page doi:10.48550/arxiv.2510.06774 2025
[11]

Advancing Natural Language Formalization to First Order Logic with Fine-tuned LLMs

Vossel, Felix and Mossakowski, Till and Gehrke, Björn , biburl =. Advancing Natural Language Formalization to First Order Logic with Fine-tuned LLMs. , url =. CoRR , keywords =
[12]

Soviet physics

Binary codes capable of correcting deletions, insertions, and reversals , author=. Soviet physics. Doklady , year=
[13]

arXiv preprint arXiv:2405.02318 , year=

Autoformalizing Natural Language to First-Order Logic: A Case Study in Logical Fallacy Detection , author=. arXiv preprint arXiv:2405.02318 , year=

arXiv
[14]

QA - N at V er: Question Answering for Natural Logic-based Fact Verification

Aly, Rami and others. QA - N at V er: Question Answering for Natural Logic-based Fact Verification. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.521

work page doi:10.18653/v1/2023.emnlp-main.521 2023
[15]

Logical Fallacy Detection

Jin, Zhijing and others. Logical Fallacy Detection. Findings of the Association for Computational Linguistics: EMNLP 2022. 2022. doi:10.18653/v1/2022.findings-emnlp.532

work page doi:10.18653/v1/2022.findings-emnlp.532 2022
[16]

CoRR , volume =

Yujun Zhou and others , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2506.04810 , eprinttype =. 2506.04810 , timestamp =

work page doi:10.48550/arxiv.2506.04810 2025
[17]

2024 , url =

Andrea Brunello and others , title =. 2024 , url =

2024
[18]

Lee, Hyemin S

Ryu, Hyun and Kim, Gyeongman and S. Lee, Hyemin S. and Yang, Eunho , title =. 2025 , url =

2025
[19]

Complexity Parameters for First-Order Classes , booktitle =

Marta Arias and Roni Khardon , editor =. Complexity Parameters for First-Order Classes , booktitle =. 2003 , url =. doi:10.1007/978-3-540-39917-9\_4 , timestamp =

work page doi:10.1007/978-3-540-39917-9 2003
[20]

Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence,

Fengxiang Cheng and others , title =. Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence,. 2025 , url =. doi:10.24963/IJCAI.2025/1155 , timestamp =

work page doi:10.24963/ijcai.2025/1155 2025
[21]

2025 , url =

Lovish Madaan and others , title =. 2025 , url =. doi:10.18653/V1/2025.NAACL-LONG.466 , timestamp =

work page doi:10.18653/v1/2025.naacl-long.466 2025
[22]

ICLR 2024 Workshop on Secure and Trustworthy Large Language Models , year=

Enhancing and Evaluating Logical Reasoning Abilities of Large Language Models , author=. ICLR 2024 Workshop on Secure and Trustworthy Large Language Models , year=

2024
[23]

NeurIPS 2022, November 28 - December 9, 2022 , year =

Yuhuai Wu and others , title =. NeurIPS 2022, November 28 - December 9, 2022 , year =

2022
[24]

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers),

Jundong Xu and others , title =. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers),. 2024 , url =. doi:10.18653/V1/2024.ACL-LONG.720 , timestamp =

work page doi:10.18653/v1/2024.acl-long.720 2024
[25]

CoRR , volume =

Benjamin Callewaert and others , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2501.14540 , eprinttype =. 2501.14540 , timestamp =

work page doi:10.48550/arxiv.2501.14540 2025
[26]

Few-Shot Natural Language to First-Order Logic Translation via Code Generation , booktitle =

Junnan Liu , editor =. Few-Shot Natural Language to First-Order Logic Translation via Code Generation , booktitle =. 2025 , url =. doi:10.18653/V1/2025.NAACL-LONG.547 , timestamp =

work page doi:10.18653/v1/2025.naacl-long.547 2025
[28]

2024 , url =

Xin Quan and others , title =. 2024 , url =. doi:10.18653/V1/2024.EMNLP-MAIN.172 , timestamp =

work page doi:10.18653/v1/2024.emnlp-main.172 2024
[29]

1990 , url=

Events in the Semantics of English: A Study in Subatomic Semantics , author=. 1990 , url=

1990
[30]

CoRR , volume =

Christopher Hahn and others , title =. CoRR , volume =. 2022 , url =. doi:10.48550/ARXIV.2206.01962 , eprinttype =. 2206.01962 , timestamp =

work page doi:10.48550/arxiv.2206.01962 2022
[31]

Parsing the WSJ Using CCG and Log-Linear Models

Clark, Stephen and Curran, James R. Parsing the WSJ Using CCG and Log-Linear Models. ACL -04. 2004. doi:10.3115/1218955.1218969

work page doi:10.3115/1218955.1218969 2004
[32]

2015 , url =

Johan Bos , title =. 2015 , url =

2015
[33]

Yu Pei and others , title =. Trans. Assoc. Comput. Linguistics , volume =. 2025 , url =. doi:10.1162/TACL.A.41 , timestamp =

work page doi:10.1162/tacl.a.41 2025
[34]

GCAT 2023 , year=

Data and Knowledge Engineering for Legal Precedents Using First-Order Predicate Logic , author=. GCAT 2023 , year=

2023
[35]

Towards Advanced Mathematical Reasoning for LLM s via First-Order Logic Theorem Proving

Cao, Chuxue and others. Towards Advanced Mathematical Reasoning for LLM s via First-Order Logic Theorem Proving. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/2025.emnlp-main.628

work page doi:10.18653/v1/2025.emnlp-main.628 2025
[36]

Grammar-Constrained Decoding Makes Large Language Models Better Logical Parsers

Raspanti, Federico and others. Grammar-Constrained Decoding Makes Large Language Models Better Logical Parsers. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track). 2025. doi:10.18653/v1/2025.acl-industry.34

work page doi:10.18653/v1/2025.acl-industry.34 2025
[37]

Let Me Speak Freely? A Study On The Impact Of Format Restrictions On Large Language Model Performance

Tam, Zhi Rui and others. Let Me Speak Freely? A Study On The Impact Of Format Restrictions On Large Language Model Performance. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track. 2024. doi:10.18653/v1/2024.emnlp-industry.91

work page doi:10.18653/v1/2024.emnlp-industry.91 2024
[39]

SEMANTiCS 2025, Vienna, Austria, September 3-5, 2025 , series =

Alexander Beiser and others , title =. SEMANTiCS 2025, Vienna, Austria, September 3-5, 2025 , series =. 2025 , url =

2025
[40]

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers),

Mihir Parmar and others , title =. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers),. 2024 , url =. doi:10.18653/V1/2024.ACL-LONG.739 , timestamp =

work page doi:10.18653/v1/2024.acl-long.739 2024
[41]

Into The Limits of Logic: Alignment Methods for Formal Logical Reasoning

Lopez-Ponce, FernandoFrancisco and Bel-Enguix, Gemma. Into The Limits of Logic: Alignment Methods for Formal Logical Reasoning. MathNLP 2025. 2025. doi:10.18653/v1/2025.mathnlp-main.8

work page doi:10.18653/v1/2025.mathnlp-main.8 2025
[42]

Diagnosing the First-Order Logical Reasoning Ability Through L ogic NLI

Tian, Jidong and others. Diagnosing the First-Order Logical Reasoning Ability Through L ogic NLI. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021. doi:10.18653/v1/2021.emnlp-main.303

work page doi:10.18653/v1/2021.emnlp-main.303 2021
[43]

CoRR , volume =

Thatikonda, Ramya Keerthy and Han, Jiuzhou and Buntine, Wray and Shareghi, Ehsan , title =. CoRR , volume =. 2024 , url =. doi:10.48550/ARXIV.2409.16461 , eprinttype =. 2409.16461 , timestamp =

work page doi:10.48550/arxiv.2409.16461 2024
[44]

2025 , url =

Chengwen Qi and others , title =. 2025 , url =

2025
[45]

NeurIPS 2023, December 10 - 16, 2023 , year =

Ye, Xi and Chen, Qiaochu and Dillig, Isil and Durrett, Greg , title =. NeurIPS 2023, December 10 - 16, 2023 , year =

2023
[46]

Generating Predicate Logic Expressions from Natural Language , year=

Levkovskyi, Oleksii and Li, Wei , booktitle=. Generating Predicate Logic Expressions from Natural Language , year=
[47]

Educational Data Mining , year=

Dimensions of Difficulty in Translating Natural Language into First-Order Logic , author=. Educational Data Mining , year=
[48]

CoRR , volume =

Singh, Hrituraj and Aggarwal, Milan and Krishnamurthy, Balaji , title =. CoRR , volume =. 2020 , url =. 2002.06544 , timestamp =

arXiv 2020
[49]

Parsing Natural Language into Propositional and First-Order Logic with Dual Reinforcement Learning

Lu, Xuantao and others. Parsing Natural Language into Propositional and First-Order Logic with Dual Reinforcement Learning. Proceedings of the 29th International Conference on Computational Linguistics. 2022

2022
[50]

Findings of the Association for Computational Linguistics:

Akshay Chaturvedi and Nicholas Asher , title =. Findings of the Association for Computational Linguistics:. 2024 , url =. doi:10.18653/V1/2024.FINDINGS-EMNLP.390 , timestamp =

work page doi:10.18653/v1/2024.findings-emnlp.390 2024
[51]

Faithful Chain-of-Thought Reasoning

Lyu, Qing and others. Faithful Chain-of-Thought Reasoning. Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers). 2023. doi:10.18653/v1/2023.ijcnlp-main.20

work page doi:10.18653/v1/2023.ijcnlp-main.20 2023
[52]

CoRR , volume =

Qingchuan Li and others , title =. CoRR , volume =. 2024 , url =. doi:10.48550/ARXIV.2410.21779 , eprinttype =. 2410.21779 , timestamp =

work page doi:10.48550/arxiv.2410.21779 2024
[53]

2023 , url =

Olausson, Theo and Gu, Alex and Lipkin, Ben and Zhang, Cedegao and Solar-Lezama, Armando and Tenenbaum, Joshua and Levy, Roger , title =. 2023 , url =. doi:10.18653/V1/2023.EMNLP-MAIN.313 , timestamp =

work page doi:10.18653/v1/2023.emnlp-main.313 2023
[54]

CoRR , volume =

Peizhang Shao and others , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2507.07748 , eprinttype =. 2507.07748 , timestamp =

work page doi:10.48550/arxiv.2507.07748 2025
[55]

2025 , url =

Bowen Jiang and others , title =. 2025 , url =. doi:10.18653/V1/2025.NAACL-LONG.186 , timestamp =

work page doi:10.18653/v1/2025.naacl-long.186 2025
[56]

Frontiers Comput

Laura Orynbay and others , title =. Frontiers Comput. Sci. , volume =. 2025 , url =. doi:10.3389/FCOMP.2024.1486581 , timestamp =

work page doi:10.3389/fcomp.2024.1486581 2025
[57]

Findings of the Association for Computational Linguistics:

Pan, Liangming and Albalak, Alon and Wang, Xinyi and Yang Wang, William , title =. Findings of the Association for Computational Linguistics:. 2023 , url =. doi:10.18653/V1/2023.FINDINGS-EMNLP.248 , timestamp =

work page doi:10.18653/v1/2023.findings-emnlp.248 2023
[58]

CoRR , volume =

Shashank Kirtania and others , title =. CoRR , volume =. 2024 , url =. doi:10.48550/ARXIV.2407.02514 , eprinttype =. 2407.02514 , timestamp =

work page doi:10.48550/arxiv.2407.02514 2024
[59]

Logic-Thinker: Teaching Large Language Models to Think more Logically

Wen, Chengyao and others. Logic-Thinker: Teaching Large Language Models to Think more Logically. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. doi:10.18653/v1/2025.findings-emnlp.696

work page doi:10.18653/v1/2025.findings-emnlp.696 2025
[60]

CoRR , volume =

Koushik Viswanadha and others , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2506.18383 , eprinttype =. 2506.18383 , timestamp =

work page doi:10.48550/arxiv.2506.18383 2025
[61]

2024 , url =

Fangzhi Xu and others , title =. 2024 , url =. doi:10.18653/V1/2024.ACL-LONG.707 , timestamp =

work page doi:10.18653/v1/2024.acl-long.707 2024
[62]

2025 , url =

Ruikang Hu and others , title =. 2025 , url =

2025
[63]

CoRR , volume =

Hannah Bansal and others , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2509.17377 , eprinttype =. 2509.17377 , timestamp =

work page doi:10.48550/arxiv.2509.17377 2025
[64]

Reasoning or

Zhaofeng Wu and others , title =. 2024 , url =. doi:10.18653/V1/2024.NAACL-LONG.102 , timestamp =

work page doi:10.18653/v1/2024.naacl-long.102 2024
[65]

Findings of the Association for Computational Linguistics:

Oyvind Tafjord and others , title =. Findings of the Association for Computational Linguistics:. 2021 , url =. doi:10.18653/V1/2021.FINDINGS-ACL.317 , timestamp =

work page doi:10.18653/v1/2021.findings-acl.317 2021
[66]

Transformers as Soft Reasoners over Language , booktitle =

Peter Clark and others , editor =. Transformers as Soft Reasoners over Language , booktitle =. 2020 , url =. doi:10.24963/IJCAI.2020/537 , timestamp =

work page doi:10.24963/ijcai.2020/537 2020
[67] Debargha Ganguly and others. CoRR. 2024. doi:10.48550/arXiv.2409.17270

[68] Simeng Han and others. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. doi:10.18653/v1/2024.findings-emnlp.966

[69] Qianxi He and others. CoRR. 2025. doi:10.48550/arXiv.2502.19907

[70] Shokhrukh Ibragimov and others. CoRR. 2025. doi:10.48550/arXiv.2502.14180

[71] Navapat Nananukul and others. CoRR. 2025. doi:10.48550/arXiv.2510.01530

[72] From Hypothesis to Premises: LLM-based Backward Logical Reasoning with Selective Symbolic Translation. 2025.

[73] Yue Zhang and others. CoRR. 2025. doi:10.48550/arXiv.2505.21281

[74] Pauline Armary and others. Ontology learning towards expressiveness: A survey. 2025. doi:10.1016/j.cosrev.2024.100693

[75] Rick Du and others. CoRR. 2024. doi:10.48550/arXiv.2404.14991

[76] Zhengkun Di and others. Knowledge-Based Systems. 2025. doi:10.1016/j.knosys.2025.114140

[77] From Ambiguity to Verdict: A Semiotic-Grounded Multi-Perspective Agent for LLM Logical Reasoning. arXiv preprint arXiv:2509.24765. 2025.

[78] ∀uto∃∨∧L: Autonomous Assessment of LLMs in Formal Synthesis and Interpretation Tasks. 2024.

[79] Sun, Yang and others. Learning First-Order Logic Rules for Argumentation Mining. ACL 2025. 2025. doi:10.18653/v1/2025.acl-long.691

[80] Deveci, İbrahim Ethem. Transformer models for translating natural language sentences into formal logical expressions.

[81] Samuele Germiniani and others. 2025. doi:10.1109/access.2025.3551607

[82] Ali Mohammadjafari and others. CoRR. 2024. doi:10.48550/arXiv.2410.01066

[83] Ke Weng and others. CoRR. 2025. doi:10.48550/arXiv.2505.23486
