arxiv: 2605.23965 · v2 · pith:CQDAFVWKnew · submitted 2026-05-12 · 💻 cs.AI · cs.LG · cs.SE

LGMT: Logic-Grounded Metamorphic Testing for Evaluating the Reasoning Reliability of LLMs

Pith reviewed 2026-06-30 22:12 UTC · model grok-4.3

classification 💻 cs.AI · cs.LG · cs.SE
keywords metamorphic testing · LLM reasoning evaluation · first-order logic · reasoning reliability · logical invariance · robustness testing

The pith

LGMT uses first-order logic equivalences to build test cases that check whether LLMs give consistent answers across logically equivalent questions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LGMT, a framework that turns formal logical equivalences into metamorphic relations to generate multiple versions of the same reasoning problem. Traditional benchmarks compare a single answer against a reference and can miss cases where a model succeeds by luck or surface pattern rather than stable reasoning. LGMT instead checks whether the model produces consistent outputs on all versions, revealing defects when it does not. Experiments across six LLMs show these defects are common and often missed by reference-based tests, especially under changes to symbols or conclusions. The work concludes that evaluation should test robustness to logical invariance instead of isolated correctness.

Core claim

LGMT derives metamorphic relations directly from first-order logic equivalences, constructs sets of semantically invariant test cases, and identifies reasoning defects by checking output consistency across each set. On six state-of-the-art LLMs the method uncovers substantial defects that reference-based evaluations overlook, with particular sensitivity to symbol-level and conclusion-level variations, and shows that Few-shot Chain-of-Thought prompting reduces but does not eliminate the inconsistencies.
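
The cross-case consistency check described here can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: `check_consistency` and `toy_model` are hypothetical names, `toy_model` is a deliberately brittle stand-in for an LLM call, and the two prompts reuse the Lawton Park example quoted from the paper's appendix further down this page.

```python
from typing import Callable, Dict, List

# Label set taken from the prompt fragments quoted later on this page.
ALLOWED_LABELS = {"True", "False", "Unknown"}


def check_consistency(variants: List[str], ask: Callable[[str], str]) -> Dict[str, object]:
    """Query every logically equivalent variant and report whether the answers agree."""
    answers = [ask(v) for v in variants]
    unexpected = [a for a in answers if a not in ALLOWED_LABELS]
    if unexpected:
        raise ValueError(f"unexpected labels: {unexpected}")
    # No reference label is needed: disagreement alone signals a defect.
    return {"answers": answers, "consistent": len(set(answers)) == 1}


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM; it keys on surface wording on purpose."""
    return "Unknown" if "All citizens" in prompt else "True"


SHARED = (
    "Lawton Park is a neighborhood in Seattle. "
    "Tom is a citizen of Lawton Park. "
    "Daniel uses the zip code 98199. "
    "Conclusion: Tom is a citizen of Washington."
)
variants = [
    "All citizens of Lawton Park use the zip code 98199. " + SHARED,
    "For every person, either they are not a citizen of Lawton Park, "
    "or they use the zip code 98199. " + SHARED,
]

report = check_consistency(variants, toy_model)
print(report["answers"], report["consistent"])
```

Because the two prompts differ only by a logical equivalence, any disagreement in `report["answers"]` counts as a violation regardless of which answer is correct.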

What carries the argument

Metamorphic relations derived from first-order logic equivalences, used to generate invariant test cases and perform cross-case consistency checking.
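
To make the idea of an equivalence-derived relation concrete, the sketch below checks by brute force that a universally quantified implication and two rewritten forms agree on every interpretation over a small finite domain. It is an illustration under stated assumptions: the function names are hypothetical, the specific rewrites (implication to disjunction, contraposition) are standard equivalences and not necessarily the ones the paper uses, and a finite-domain check is a sanity test, not a proof.

```python
from itertools import product

# Explicit truth table for material implication (a -> b).
IMPLIES = {
    (True, True): True,
    (True, False): False,
    (False, True): True,
    (False, False): True,
}


def source_form(P, Q):
    """forall x. (P(x) -> Q(x))"""
    return all(IMPLIES[(p, q)] for p, q in zip(P, Q))


def disjunction_form(P, Q):
    """forall x. (not P(x) or Q(x))"""
    return all((not p) or q for p, q in zip(P, Q))


def contrapositive_form(P, Q):
    """forall x. (not Q(x) -> not P(x))"""
    return all(IMPLIES[(not q, not p)] for p, q in zip(P, Q))


def converse_form(P, Q):
    """forall x. (Q(x) -> P(x)); NOT equivalent, used as a negative control."""
    return all(IMPLIES[(q, p)] for p, q in zip(P, Q))


def equivalent_on_domain(f, g, size):
    """Compare f and g on every interpretation of P and Q over `size` elements."""
    for bits in product([False, True], repeat=2 * size):
        P, Q = bits[:size], bits[size:]
        if f(P, Q) != g(P, Q):
            return False
    return True


print(equivalent_on_domain(source_form, disjunction_form, 3))
print(equivalent_on_domain(source_form, contrapositive_form, 3))
print(equivalent_on_domain(source_form, converse_form, 3))
```

The negative control matters: a transformation that fails this check (such as the converse) changes the meaning of the premise, so an answer change after applying it would not indicate a reasoning defect.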

If this is right

  • LLMs remain sensitive to symbol-level and conclusion-level changes even when the underlying logic is preserved.
  • Few-shot Chain-of-Thought prompting reduces but does not remove the detected inconsistencies.
  • Evaluation of reasoning should shift from single-question correctness to checks of invariance under logical transformations.
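
The contrast between single-question correctness and invariance checking can be illustrated with a small sketch. The `Case` record and the three verdict functions are hypothetical, and the notion of a "hidden defect" used here (the source answer matches the reference, yet some equivalent follow-up disagrees) is an assumption made for illustration; the paper's own metric definitions are not reproduced on this page.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Case:
    gold: str                    # reference label for the source question
    source_answer: str           # model answer on the source question
    followup_answers: List[str]  # model answers on logically equivalent variants


def reference_verdict(c: Case) -> bool:
    """Reference-based test: only the source answer is compared with the label."""
    return c.source_answer == c.gold


def metamorphic_verdict(c: Case) -> bool:
    """Metamorphic test: every follow-up must agree with the source answer."""
    return all(a == c.source_answer for a in c.followup_answers)


def is_hidden_defect(c: Case) -> bool:
    """Assumed notion: passes the reference test but fails the invariance test."""
    return reference_verdict(c) and not metamorphic_verdict(c)


cases = [
    Case("Unknown", "Unknown", ["Unknown", "Unknown"]),  # passes both tests
    Case("Unknown", "Unknown", ["True", "Unknown"]),     # correct once, unstable
    Case("True", "False", ["False", "False"]),           # wrong but consistent
]
hidden = [is_hidden_defect(c) for c in cases]
print(hidden)
```

The third case shows the complementary blind spot: a model that is consistently wrong passes the invariance check, so the two kinds of test are best read together.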

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same consistency-checking approach could be applied to other formal systems such as temporal or modal logic.
  • Training objectives that explicitly reward cross-equivalence consistency might reduce the defects LGMT detects.
  • The framework could be adapted to test robustness in non-strictly logical domains such as causal or probabilistic reasoning.

Load-bearing premise

That equivalences taken from first-order logic produce test cases whose meaning remains fixed enough to diagnose genuine reasoning defects rather than mere differences in phrasing or output style.

What would settle it

A collection of logical equivalences on which every tested LLM produces fully consistent and correct answers, yet the same models still commit clear reasoning errors on equivalent problems outside the chosen relations.

Figures

Figures reproduced from arXiv: 2605.23965 by Man Li, Weibin Lin, Xiaoke Fang, Xinyi Zhou, Zenghui Zhou, Zheng Zheng.

Figure 2. Demonstrates how lexical MRs induce semantic drift. First, we establish a baseline where the model correctly deduces the conclusion (“I own a car.”) from a specific premise (“My car has four wheels.”), returning an accurate judgment of “Yes”. Then, a standard lexical MR modifies the premise by replacing the target noun “car” with its broader hypernym, “vehicle,” while keeping the conclusion unchanged [18] …
Figure 1. In the first query, the model is presented with a set of premises and a conclusion. The model answers “Unknown”, correctly aligning with the ground-truth label to indicate it cannot derive the conclusion from the given premises. In the second query, we modify only a single premise by applying an equivalence transformation. Specifically, the rule stating that “all social media applications have chat featur…
Figure 3. Overview of the LGMT Framework. … represented in both natural language (premises and conclusion) and its corresponding FOL form, which serves as the basis for subsequent transformations. (2) Logic-Grounded MR Design. We define a set of MRs grounded in first-order logic, including formula-level (MR-E), symbol-level (MR-S), premise-level (MR-P), and conclusion-level (MR-C) transformations. These MRs are forma…
Figure 4. Average MVR across MR categories. MR-C and MR-S induce the highest inconsistency, while MR-P shows substantially lower sensitivity.
Figure 5. Hidden Defect Rate (HDR) across models and prompting strategies.
Figure 6. False Unreported Rate (FUR) across models and prompting strategies. Finding 4.4: Few-shot CoT generally achieves the lowest FUR, but non-trivial blind spots remain. Implication: As shown in …
Original abstract

Large Language Models (LLMs) achieve strong performance on logical reasoning benchmarks, yet their reliability remains uncertain. Existing evaluations rely on static benchmarks, which fail to assess robustness under logically equivalent transformations and often overestimate reasoning capability. We propose LGMT (Logic-Grounded Metamorphic Testing), an oracle-free framework that leverages first-order logic (FOL) to evaluate LLM reasoning. By deriving metamorphic relations from formal logical equivalences, LGMT constructs semantically invariant test cases and detects reasoning defects through cross-case consistency checking. Experiments on six state-of-the-art LLMs show that LGMT exposes substantial hidden defects missed by traditional reference-based evaluations. We further find that models are particularly sensitive to symbol-level and conclusion-level variations, and that advanced prompting such as Few-shot CoT only partially mitigates these issues. These results suggest that LLM evaluation should move beyond isolated correctness toward robustness under logical invariance. LGMT provides a principled and scalable approach for diagnosing reasoning failures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes LGMT, an oracle-free metamorphic testing framework that derives relations from first-order logic equivalences to generate semantically invariant test cases, then detects LLM reasoning defects via cross-case consistency. Experiments on six state-of-the-art LLMs are reported to expose substantial hidden defects missed by static reference-based benchmarks; models are especially sensitive to symbol-level and conclusion-level variations, and Few-shot CoT only partially mitigates the issues. The work concludes that evaluation should shift from isolated correctness to robustness under logical invariance.

Significance. If the metamorphic relations are shown to isolate genuine reasoning failures rather than surface sensitivity, LGMT would supply a scalable, logic-grounded alternative to static benchmarks and could materially improve diagnosis of LLM reasoning reliability.

major comments (2)
  1. [Abstract (central claim and experiments paragraph)] The central claim that output inconsistency diagnoses reasoning defects (rather than token/attention-level surface effects) rests on the unverified assumption that FOL equivalences produce test cases whose semantic invariance is tight enough for LLMs; the abstract provides no quantitative validation such as human equivalence ratings or surface-feature ablations to separate these explanations.
  2. [Abstract (experiments paragraph)] Without details on the exact metamorphic relations, the consistency metric, dataset construction, or statistical controls, it is impossible to determine whether the reported defects support the claim that LGMT outperforms traditional evaluations.
minor comments (1)
  1. [Abstract] The abstract refers to 'six state-of-the-art LLMs' and 'advanced prompting such as Few-shot CoT' without naming the models or specifying the prompting variants and baselines used.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the detailed feedback on the abstract. We address each major comment below with references to the full manuscript, which provides the supporting methodology and experiments.

  1. Referee: [Abstract (central claim and experiments paragraph)] The central claim that output inconsistency diagnoses reasoning defects (rather than token/attention-level surface effects) rests on the unverified assumption that FOL equivalences produce test cases whose semantic invariance is tight enough for LLMs; the abstract provides no quantitative validation such as human equivalence ratings or surface-feature ablations to separate these explanations.

    Authors: The metamorphic relations are derived directly from established first-order logic equivalences (detailed in Section 3.1 and Table 1), which are mathematically guaranteed to preserve semantic meaning. This formal grounding, rather than empirical human ratings, underpins the claim that inconsistencies indicate reasoning defects. Section 5 includes controls that isolate logical variations (e.g., symbol and conclusion changes) while holding surface features constant where possible, and reports statistically significant differences across six LLMs. We maintain that the logical foundation suffices without additional human studies for the core argument. revision: no

  2. Referee: [Abstract (experiments paragraph)] Without details on the exact metamorphic relations, the consistency metric, dataset construction, or statistical controls, it is impossible to determine whether the reported defects support the claim that LGMT outperforms traditional evaluations.

    Authors: The abstract summarizes the approach at a high level due to length constraints. Full details appear in the manuscript: metamorphic relations are enumerated in Section 3 and Table 1; the consistency metric (cross-case agreement rate) is defined in Section 4.2; dataset construction from logical templates is described in Section 4.1; and statistical controls (multiple runs, significance testing) are reported in Section 5. These elements support the finding that LGMT detects defects missed by static benchmarks, with sensitivity analyses for symbol-level and conclusion-level variations. revision: no
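
The simulated rebuttal names a "cross-case agreement rate" without reproducing its definition. One plausible reading, shown here purely as an assumption, is the fraction of answer pairs within a set of equivalent test cases that agree; the function name `agreement_rate` is hypothetical and the manuscript's Section 4.2 may define the metric differently.

```python
from itertools import combinations
from typing import Sequence


def agreement_rate(answers: Sequence[str]) -> float:
    """Fraction of unordered answer pairs that agree (assumed definition).

    A set with fewer than two answers has no pairs and is treated as
    fully consistent.
    """
    pairs = list(combinations(answers, 2))
    if not pairs:
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)


stable = agreement_rate(["Unknown", "Unknown", "Unknown", "Unknown"])
unstable = agreement_rate(["True", "True", "Unknown"])
print(stable, unstable)
```

A pairwise rate degrades gradually as more variants disagree, whereas a strict all-or-nothing check would treat one stray answer and total disagreement identically.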

standing simulated objections not resolved
  • Additional quantitative validation such as new human equivalence ratings or expanded surface-feature ablations would require experiments outside the current manuscript scope.

Circularity Check

0 steps flagged

No circularity detected in LGMT derivation

The LGMT framework derives metamorphic relations directly from standard first-order logic equivalences, which are external mathematical facts independent of the paper. No equations, fitted parameters, self-citations, or ansatzes are presented that reduce any claimed prediction or result to the method's own inputs by construction. The approach is self-contained against external benchmarks in logic, and the empirical evaluation on LLMs does not involve renaming known results or load-bearing self-references. This is the normal honest outcome for a method grounded in established formal logic.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Relies on standard properties of first-order logic; no free parameters, invented entities, or ad-hoc axioms visible in abstract.

axioms (1)
  • [standard math] First-order logic equivalences define semantically invariant transformations
    Invoked to generate metamorphic relations from formal logical equivalences.
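
For orientation, a few textbook first-order equivalences of the kind this axiom refers to are listed below. They are standard results; which of them the paper actually turns into metamorphic relations is not stated on this page, so the selection is illustrative only.

```latex
\begin{align*}
\varphi \rightarrow \psi &\equiv \neg\varphi \lor \psi
  && \text{(material implication)} \\
\varphi \rightarrow \psi &\equiv \neg\psi \rightarrow \neg\varphi
  && \text{(contraposition)} \\
\neg(\varphi \land \psi) &\equiv \neg\varphi \lor \neg\psi
  && \text{(De Morgan)} \\
\neg\forall x\,\varphi(x) &\equiv \exists x\,\neg\varphi(x)
  && \text{(quantifier duality)}
\end{align*}
```

The first line is the rewrite behind the appendix example quoted further down, where "All citizens of Lawton Park use the zip code 98199" becomes "For every person, either they are not a citizen of Lawton Park, or they use the zip code 98199".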

pith-pipeline@v0.9.1-grok · 5709 in / 1098 out tokens · 21928 ms · 2026-06-30T22:12:03.434681+00:00 · methodology


Reference graph

Works this paper leans on

86 extracted references · 35 canonical work pages · 6 internal anchors

  1. [1]

    Cao, C., Li, M., Dai, J., Yang, J., Zhao, Z., Zhang, S., et al., 2025. Towards advanced mathematical reasoning for llms via first-order logic theorem proving, in: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics. pp. 12429–12449. doi:10.18653/v1/2025.emnlp-main.628

  2. [2]

    Thinking like a developer? comparing the attention of humans with neural models of code,

    Chen, S., Jin, S., Xie, X., 2021. Testing your question answering software via asking recursively, in: Proceedings of the International Conference on Automated Software Engineering, pp. 104–116. doi:10.1109/ASE51524.2021.9678670

  3. [3]

    Metamorphic testing: A new approach for generating next test cases

    Chen, T.Y., Cheung, S.C., Yiu, S.M., 1998. Metamorphic testing: A new approach for generating next test cases. Technical Report HKUST-CS98-01. Hong Kong University of Science and Technology

  4. [4]

    Chen, T.Y., Kuo, F.C., Liu, H., Poon, P.L., Towey, D., Tse, T.H., et al.,

  5. [5]

    Metamorphic testing: a review of challenges and opportunities

    Metamorphic testing: a review of challenges and opportunities. ACM Computing Surveys 51, 1–27. doi:10.1145/3143561

  6. [6]

    Metamorphic testing of large language models for natural language processing

    Cho, S., Ruberto, S., Terragni, V., 2025. Metamorphic testing of large language models for natural language processing. doi:10.48550/arXiv.2511.02108

  7. [7]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., et al., 2018. Think you have solved question answering? try arc, the ai2 reasoning challenge. doi:10.48550/ARXIV.1803.05457

  8. [8]

    Transformers as soft reasoners over language, in: Proceedings of the International Joint Conference on Artificial Intelligence, pp

    Clark, P., Tafjord, O., Richardson, K., 2021. Transformers as soft reasoners over language, in: Proceedings of the International Joint Conference on Artificial Intelligence, pp. 3882–3890

  9. [9]

    Errors of measurement in statistics

    Cochran, W.G., 1968. Errors of measurement in statistics. Technometrics 10, 637–666. doi:10.1080/00401706.1968.10490621

  10. [10]

    Explaining answers with entailment trees, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics. pp

    Dalvi, B., Jansen, P., Tafjord, O., Xie, Z., Smith, H., Pipatanangkura, L., et al., 2021. Explaining answers with entailment trees, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics. pp. 7358–7370. doi:10.18653/v1/2021.emnlp-main.585

  11. [11]

    DeepSeek-V3 Technical Report

    DeepSeek, Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., et al., 2025. Deepseek-v3 technical report. doi:10.48550/arXiv.2412.19437

  12. [12]

    DeepSeek-AI, Zhu, Q., Guo, D., Shao, Z., Yang, D., Wang, P., et al.,

  13. [13]

    DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence

    Deepseek-coder-v2: Breaking the barrier of closed-source models in code intelligence. doi:10.48550/arXiv.2406.11931

  14. [14]

    Dziri, N., Lu, X., Sclar, M., Li, X.L., Jiang, L., Lin, B.Y., et al.,

  15. [15]

    Faith and fate: limits of transformers on compositionality

    Faith and fate: limits of transformers on compositionality, in: Proceedings of the International Conference on Neural Information Processing Systems, pp. 70293–70332

  16. [16]

    Logical consistency of large language models in fact-checking, in: The Thirteenth International Conference on Learning Representations

    Ghosh, B., Hasan, S., Arafat, N.A., Khan, A., 2025. Logical consistency of large language models in fact-checking, in: The Thirteenth International Conference on Learning Representations. URL: https://openreview.net/forum?id=SimlDuN0YT

  17. [17]

    Intelligent virtual assistants with llm-based process automation. URL: https://arxiv.org/abs/2312.06677, arXiv:2312.06677

    Guan, Y., Wang, D., Chu, Z., Wang, S., Ni, F., Song, R., et al., 2023. Intelligent virtual assistants with llm-based process automation. URL: https://arxiv.org/abs/2312.06677, arXiv:2312.06677

  18. [18]

    Folio: natural language reasoning with first-order logic

    Han, S., Schoelkopf, H., Zhao, Y., Qi, Z., Riddell, M., Zhou, W., et al., 2024. Folio: natural language reasoning with first-order logic, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 22017–22031. doi:10.18653/v1/2024.emnlp-main.1229

  19. [19]

    Conditional and modal reasoning in large language models, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp

    Holliday, W.H., Mandelkern, M., Zhang, C.E., 2024. Conditional and modal reasoning in large language models, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 3800–3821. doi:10.18653/v1/2024.emnlp-main.222

  20. [20]

    Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., et al.,

  21. [21]

    A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions

    A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems 43, 1–55. doi:10.1145/3703155

  22. [22]

    Metal: metamorphic testing framework for analyzing large-language model qualities, in: Proceedings of the IEEE Conference on Software Testing, Verification and Validation, IEEE

    Hyun, S., Guo, M., Babar, M.A., 2024. Metal: metamorphic testing framework for analyzing large-language model qualities, in: Proceedings of the IEEE Conference on Software Testing, Verification and Validation, IEEE. pp. 117–128. doi:10.1109/ICST60714.2024.00019

  23. [23]

    Do large language models excel in complex logical reasoning with formal language?

    Jiang, J., Wang, J., Yan, Y., Liu, Y., Zhu, J., Zhang, M., et al., 2025. Do large language models excel in complex logical reasoning with formal language?, in: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 16889–16914. doi:10.18653/v1/2025.emnlp-main.855

  24. [24]

    Jimenez, C.E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., et al.,

  25. [25]

    Swe-bench: Can language models resolve real-world github issues? doi:10.48550/arXiv.2310.06770

  26. [26]

    Retrieval-augmented generation for knowledge-intensive nlp tasks, in: Advances in Neural Information Processing Systems, Curran Associates, Inc

    Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., et al., 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks, in: Advances in Neural Information Processing Systems, Curran Associates, Inc. pp. 9459–9474

  27. [27]

    Drowzee: metamorphic testing for fact-conflicting hallucination detection in large language models

    Li, N., Li, Y., Liu, Y., Shi, L., Wang, K., Wang, H., 2024. Drowzee: metamorphic testing for fact-conflicting hallucination detection in large language models. Proceedings of the ACM on Programming Languages 8, 1843–1872. doi:10.1145/3689776

  28. [28]

    Evaluating the logical reasoning abilities of large reasoning models

    Liu, H., Ding, Y., Fu, Z., Zhang, C., Liu, X., Zhang, Y., 2025. Evaluating the logical reasoning abilities of large reasoning models. doi:10.48550/arXiv.2505.11854

  29. [29]

    Logiqa: a challenge dataset for machine reading comprehension with logical reasoning, in: Proceedings of the International Joint Conference on Artificial Intelligence, pp. 3622–3628

    Liu, J., Cui, L., Liu, H., Huang, D., Wang, Y., Zhang, Y., 2021. Logiqa: a challenge dataset for machine reading comprehension with logical reasoning, in: Proceedings of the International Joint Conference on Artificial Intelligence, pp. 3622–3628. URL: https://dl.acm.org/doi/10.5555/3491440.3491941

  30. [30]

    Towards logiglue: A brief survey and a benchmark for analyzing logical reasoning capabilities of language models

    Luo, M., Kumbhar, S., Shen, M., Parmar, M., Varshney, N., Banerjee, P., et al., 2023. Towards logiglue: A brief survey and a benchmark for analyzing logical reasoning capabilities of language models. doi:10.48550/ARXIV.2310.00836

  31. [31]

    Beyond accuracy: Evaluating the reasoning behavior of large language models - a survey, in: First Conference on Language Modeling

    Mondorf, P., Plank, B., 2024. Beyond accuracy: Evaluating the reasoning behavior of large language models - a survey, in: First Conference on Language Modeling. URL: https://openreview.net/forum?id=Lmjgl2n11u

  32. [32]

    Murphy, C., Kaiser, G.E., Hu, L., Wu, L., 2008. Properties of machine learning applications for use in metamorphic testing, in: Proceedings of the International Conference on Software Engineering & Knowledge Engineering, Knowledge Systems Institute Graduate School. pp. 867–872

  33. [33]

    Olausson, T., Gu, A., Lipkin, B., Zhang, C., Solar-Lezama, A., Tenenbaum, J., et al., 2023. Linc: a neurosymbolic approach for logical reasoning by combining language models with first-order logic provers, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 5153–5176. doi:10.18653/v1/2023.emnlp-main.313

  34. [34]

    Logic-lm: Empowering large language models with symbolic solvers for faithful logical reasoning, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing

    Pan, L., Albalak, A., Wang, X., Wang, W.Y., 2023. Logic-lm: Empowering large language models with symbolic solvers for faithful logical reasoning, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing. URL: https://openreview.net/forum?id=nWXMv949ZH&noteId=qt0t8SsVvT

  35. [35]

    Thinking assistants: Llm-based conversational assistants that help users think by asking rather than answering

    Park, S., Subramonyam, H., Kulkarni, C., 2024. Thinking assistants: Llm-based conversational assistants that help users think by asking rather than answering. URL:https://arxiv.org/abs/2312.06024, arXiv:2312.06024

  36. [36]

    Parmar, M., Patel, N., Varshney, N., Nakamura, M., Luo, M., Mashetty, S., et al., 2024. Logicbench: Towards systematic evaluation of logical reasoning ability of large language models, in: Proceedings of the Annual Meeting of the A... (Zenghui Zhou et al.: Preprint submitted to Elsevier.)

  37. [37]

    Patel, N., Kulkarni, M., Parmar, M., Budhiraja, A., Nakamura, M., Varshney, N., et al., 2024. Multi-logieval: towards evaluating multi-step logical reasoning ability of large language models, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 20856–20879. doi:10.18653/v1/2024.emnlp-main.1160

  38. [38]

    Large language models meet symbolic provers for logical reasoning evaluation, in: Proceedings of the International Conference on Learning Representations

    Qi, C., Ma, R., Li, B., Du, H., Hui, B., Wu, J., et al., 2024. Large language models meet symbolic provers for logical reasoning evaluation, in: Proceedings of the International Conference on Learning Representations. URL: https://openreview.net/forum?id=C25SgeXWjE

  39. [39]

    Code llama: Open foundation models for code

    Rozière, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., et al., 2024. Code llama: Open foundation models for code. doi:10.48550/arXiv.2308.12950

  40. [40]

    Language models are greedy reasoners: a systematic formal analysis of chain-of-thought, in: The International Conference on Learning Representations

    Saparov, A., He, H., 2022. Language models are greedy reasoners: a systematic formal analysis of chain-of-thought, in: The International Conference on Learning Representations. URL: https://openreview.net/forum?id=qFVVBzXxR2V

  41. [41]

    A survey on metamorphic testing

    Segura, S., Fraser, G., Sanchez, A.B., Ruiz-Cortés, A., 2016. A survey on metamorphic testing. IEEE Transactions on Software Engineering 42, 805–824. doi:10.1109/TSE.2016.2532875

  42. [42]

    Are large language models good at fuzzy reasoning?, in: Proceedings of the International Conference on Computational Intelligence and Intelligent Systems, pp

    Singh, S., 2024. Are large language models good at fuzzy reasoning?, in: Proceedings of the International Conference on Computational Intelligence and Intelligent Systems, pp. 1–6. doi:10.1145/3708778.3708779

  43. [43]

    Sinha, K., Sodhani, S., Dong, J., Pineau, J., Hamilton, W.L., 2019. Clutrr: a diagnostic benchmark for inductive reasoning from text, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing and the International Joint Conference on Natural Language Processing, pp. 4505–4514. doi:10.18653/v1/D19-1458

  44. [44]

    Metarag: Metamorphic testing for hallucination detection in rag systems

    Sok, C., Luz, D., Haddam, Y., 2025. Metarag: Metamorphic testing for hallucination detection in rag systems. doi:10.48550/arXiv.2509.09360

  45. [45]

    Challenging big-bench tasks and whether chain-of-thought can solve them, in: Findings of the Association for Computational Linguistics: ACL 2023, pp

    Suzgun, M., Scales, N., Schärli, N., Gehrmann, S., Tay, Y., Chung, H.W., et al., 2023. Challenging big-bench tasks and whether chain-of-thought can solve them, in: Findings of the Association for Computational Linguistics: ACL 2023, pp. 13003–13051. doi:10.18653/v1/2023.findings-acl.824

  46. [46]

    Proofwriter: generating implications, proofs, and abductive statements over natural language, in: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp

    Tafjord, O., Dalvi, B., Clark, P., 2021. Proofwriter: generating implications, proofs, and abductive statements over natural language, in: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 3621–3634. doi:10.18653/v1/2021.findings-acl.317

  47. [47]

    Diagnosing the first-order logical reasoning ability through logicnli, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp

    Tian, J., Li, Y., Chen, W., Xiao, L., He, H., Jin, Y., 2021. Diagnosing the first-order logical reasoning ability through logicnli, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 3738–3747. doi:10.18653/v1/2021.emnlp-main.303

  48. [48]

    Wan, Y., Wang, W., Yang, Y., Yuan, Y., Huang, J.t., He, P., et al.,

  49. [49]

    LogicAsker: Evaluating and Improving the Logical Reasoning Ability of Large Language Models

    Logicasker: Evaluating and improving the logical reasoning ability of large language models, in: Proceedings of the International Conference on Empirical Methods in Natural Language Processing, pp. 2124–2155. doi:10.18653/v1/2024.emnlp-main.128

  50. [50]

    Self-consistency improves chain of thought reasoning in language models, in: The Eleventh International Conference on Learning Representations

    Wang, X., Wei, J., Schuurmans, D., Le, Q.V., Chi, E.H., Narang, S., et al., 2022. Self-consistency improves chain of thought reasoning in language models, in: The Eleventh International Conference on Learning Representations

  51. [51]

    Chain-of-thought prompting elicits reasoning in large language models, in: Proceedings of the International Conference on Neural Information Processing Systems, pp

    Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., et al., 2022. Chain-of-thought prompting elicits reasoning in large language models, in: Proceedings of the International Conference on Neural Information Processing Systems, pp. 24824–24837

  52. [52]

    A systematic literature review of hallucinations in large language models

    Woesle, C., Fischer-Brandies, L., Buettner, R., 2025. A systematic literature review of hallucinations in large language models. IEEE Access 13, 148231–148253. doi:10.1109/ACCESS.2025.3601206

  53. [53]

    Detecting and reducing the factual hallucinations of large language models with metamorphic testing

    Wu, W., Cao, Y., Yi, N., Ou, R., Zheng, Z., 2025. Detecting and reducing the factual hallucinations of large language models with metamorphic testing. Proceedings of the ACM on Software Engineering 2, 1432–1453. doi:10.1145/3715784

  54. [54]

    Testing and validating machine learning classifiers by metamorphic testing

    Xie, X., Ho, J.W.K., Murphy, C., Kaiser, G., Xu, B., Chen, T.Y., 2011. Testing and validating machine learning classifiers by metamorphic testing. Journal of Systems and Software 84, 544–558. doi:10.1016/j.jss.2010.11.920

  55. [55]

    Are large language models really good logical reasoners? a comprehensive evaluation and beyond

    Xu, F., Lin, Q., Han, J., Zhao, T., Liu, J., Cambria, E., 2025. Are large language models really good logical reasoners? a comprehensive evaluation and beyond. IEEE Transactions on Knowledge and Data Engineering 37, 1620–1634. doi:10.1109/TKDE.2025.3536008

  56. [56]

    Benchmarking benchmark leakage in large language models

    Xu, R., Wang, Z., Fan, R.Z., Liu, P., 2024. Benchmarking benchmark leakage in large language models. doi:10.48550/arXiv.2404.18824

  57. [57]

    Hallucination detection in large language models with metamorphic relations

    Yang, B., Al Mamun, M.A., Zhang, J.M., Uddin, G., 2025a. Hallucination detection in large language models with metamorphic relations. Proceedings of the ACM on Software Engineering 2, 425–

  58. [58]

    Hallucination detection for llm-based text-to-sql generation via two-stage metamorphic testing

    Yang, B., Xia, Y., Sun, W., Liu, Y., 2025b. Hallucination detection for llm-based text-to-sql generation via two-stage metamorphic testing. doi:10.48550/arXiv.2512.22250

  59. [59]

    Reclor: a reading comprehension dataset requiring logical reasoning, in: Proceedings of the International Conference on Learning Representations

    Yu, W., Jiang, Z., Dong, Y., Feng, J., 2020. Reclor: a reading comprehension dataset requiring logical reasoning, in: Proceedings of the International Conference on Learning Representations. doi:10.48550/arXiv.2002.04326

  60. [60]

    A survey of large language model agents for question answering

    Yue, M., 2025. A survey of large language model agents for question answering. doi:10.48550/arXiv.2503.19213

  61. [61]

    Zhang, D., Li, Z.Z., Zhang, M.L., Zhang, J., Liu, Z., Yao, Y., et al.,

  62. [62]

    From system 1 to system 2: A survey of reasoning large language models

    From system 1 to system 2: A survey of reasoning large language models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1–20. doi:10.1109/TPAMI.2025.3637037

  63. [63]

    Zhou, D., Schärli, N., Hou, L., Wei, J., Scales, N., Wang, X., et al.,

  64. [64]

    Least-to-Most Prompting Enables Complex Reasoning in Large Language Models

    Least-to-most prompting enables complex reasoning in large language models. doi:10.48550/arXiv.2205.10625

  65. [65]

    Toolqa: A dataset for llm question answering with external tools

    Zhuang, Y., Yu, Y., Wang, K., Sun, H., Zhang, C., 2023. Toolqa: A dataset for llm question answering with external tools. Advances in Neural Information Processing Systems 36, 50117–50143. A. Completeness of M...

  66. [67]

    All citizens of Lawton Park use the zip code 98199

  67. [69]

    Conclusion Tom is a citizen of Washington

    Daniel uses the zip code 98199. Conclusion Tom is a citizen of Washington. The ground-truth label of this instance is Unknown, since no premise connects Lawton Park or Seattle to Washington. (2) FOL Representation. The corresponding symbolic representation is shown below. Premises 1. NeighbourhoodIn(lawtonPark, seattle) 2. forall x. (ResidentOf(x, lawtonPar...

  68. [70]

    Lawton Park is a neighborhood in Seattle

  69. [71]

    For every person, either they are not a citizen of Lawton Park, or they use the zip code 98199

  70. [72]

    Tom is a citizen of Lawton Park

  71. [73]

    Conclusion Tom is a citizen of Washington

    Daniel uses the zip code 98199. Conclusion Tom is a citizen of Washington. (5) Model Outputs and Oracle Decision. Let the model outputs for the source and follow-up test cases be denoted as $y_s$ and $y_f$, respectively. Under LGMT, a metamorphic oracle violation occurs if $y_s \neq y_f$. Since the transformation preserves logical equivalence, the correct reasoning outcome should remain unchang...
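    The oracle decision described in step (5) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `oracle_violations` and the follow-up case names are hypothetical, and the only rule taken from the text is that a violation occurs when the source output and a follow-up output differ.

```python
from typing import Dict, List

# The three labels the prompts in this appendix allow.
VALID_LABELS = {"True", "False", "Unknown"}


def oracle_violations(source_label: str, followup_labels: Dict[str, str]) -> List[str]:
    """Return the names of follow-up cases whose label differs from the source label.

    No ground-truth label is consulted: the check is purely about consistency
    between the source output (y_s) and each follow-up output (y_f).
    """
    if source_label not in VALID_LABELS:
        raise ValueError(f"unexpected source label: {source_label!r}")

    violations = []
    for case_name, label in followup_labels.items():
        if label not in VALID_LABELS:
            raise ValueError(f"unexpected label for {case_name!r}: {label!r}")
        if label != source_label:  # y_s != y_f, so the oracle is violated
            violations.append(case_name)
    return violations


# Hypothetical outputs for the Lawton Park instance and two follow-up cases.
flagged = oracle_violations(
    "Unknown",
    {"implication_as_disjunction": "True", "double_negation": "Unknown"},
)
print(flagged)
```

    Because the check compares outputs only with each other, it can flag an inconsistency even when the source answer happens to match the reference label.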

  72. [75]

    label". The value for

    Zero Explanation: Do not generate any reasoning, thought processes, or introductory text. Provide only the final judgment.
    # Output Format: You must output a single, strictly formatted JSON object. The JSON must contain exactly one key: "label". The value for "label" must be exactly one of the following strings: "True", "False", or "Unknown". Output...
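    A strict output format like this one is easy to validate mechanically. The sketch below shows one way to do so, assuming the model reply arrives as a raw string; `parse_label` is a hypothetical helper, not part of the paper, and it enforces only the constraints visible in the prompt excerpt (a JSON object with exactly one key, "label", whose value is "True", "False", or "Unknown").

```python
import json

ALLOWED_LABELS = ("True", "False", "Unknown")


def parse_label(raw_reply: str) -> str:
    """Parse a model reply and return its label, or raise ValueError if malformed."""
    # json.JSONDecodeError is a subclass of ValueError, so callers can catch one type.
    parsed = json.loads(raw_reply)

    if not isinstance(parsed, dict) or set(parsed.keys()) != {"label"}:
        raise ValueError('reply must be a JSON object with exactly one key: "label"')

    label = parsed["label"]
    if label not in ALLOWED_LABELS:
        raise ValueError(f"label must be one of {ALLOWED_LABELS}, got {label!r}")
    return label


print(parse_label('{"label": "Unknown"}'))
```

    Rejecting malformed replies outright, rather than guessing a label from free text, keeps formatting failures separate from reasoning inconsistencies.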

  73. [77]

    reasoning

    Step-by-Step Deduction: You must perform a rigorous, step-by-step logical deduction. Act like a formal proof system. Clearly state how the premises interact to evaluate the conclusion. Do not skip logical steps.
    # Output Format: You must output a single, strictly formatted JSON object. The JSON must contain exactly two keys: "reasoning" and "label...

  74. [79]

    label". The value for

    Zero Explanation: Do not generate any reasoning, thought processes, or introductory text. Provide only the final judgment.
    # Output Format: You must output a single, strictly formatted JSON object. The JSON must contain exactly one key: "label". The value for "label" must be exactly one of the following strings: "True", "False", or "Unknown". Output...

  75. [80]

    Your evaluation must rely strictly on formal logical structure

    Pure Formal Logic: Treat all provided premises as absolute truth, regardless of real-world facts. Your evaluation must rely strictly on formal logical structure.

  76. [81]

    reasoning

    Step-by-Step Deduction: You must perform a rigorous, step-by-step logical deduction. Act like a formal proof system. Clearly state how the premises interact to evaluate the conclusion. Do not skip logical steps.
    # Output Format: You must output a single, strictly formatted JSON object. The JSON must contain exactly two keys: "reasoning" and "label...

  77. [82]

    both A and B

    **Logical Connectives (Scope by Structure)**
    - **AND (&)**: Use "both A and B". If A is a complex sub-formula, use a comma: "both A, and B".
    - **OR (|)**: Use "either A or B".
    - **Biconditional (<->)**: Use "A if and only if B". (Use a comma before 'if' if A is complex).
    - **Negation (-)**: Always use the prefix "it is not the case that".

  78. [83]

    Jadiel is Bitter

    **Conditional Symbol Handling**
    - **Standard Word** (e.g., `Bitter(x)`, `Jadiel`): Use natural phrasing. Example: `Bitter(Jadiel)` -> "Jadiel is Bitter"; `-Bitter(Jadiel)` -> "it is not the case that Jadiel is Bitter".
    - **Abstract/Placeholder** (e.g., `Pre1(x)`, `Con1`): Use formal phrasing. Example: `Pre1(x)` -> "x has property Pre1"; `-Pre1(x)` -> "it is ...

  79. [84]

    For all x,

    **Quantifiers & Variables**
    - Keep the order strictly left-to-right.
    - `all x.` -> "For all x, "
    - `exists x.` -> "There exists at least one x, such that "
    - **NO Pronouns**: Always repeat the variable (x, y) or entity name. Never use "it", "he", or "they".

  80. [85]

    it is not the case that it is not the case that A

    **No Simplification**
    - **Double Negation (--A)**: Translate as "it is not the case that it is not the case that A".
    - **Redundancy (A | A)**: Translate as "either A is true or A is true".
    - **Constants**: `& 1` -> "...and it is logically true"; `| 0` -> "...or it is logically false".
    # Examples for Reference
    - FOL: --Orange(Stanley) -> {"translation": "it...
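    The translation rules in these prompt excerpts are written for an LLM translator, but their core can be illustrated as a small recursive function. The sketch below is an assumption-laden illustration, not the paper's pipeline: the tuple-based formula encoding and the name `to_english` are hypothetical, the "name ends in a digit" test for abstract placeholders is a guess based on the examples `Pre1` and `Con1`, and the comma rule for complex sub-formulas, the "is true" phrasing for redundancy, and the constant rules are left out.

```python
import re


def to_english(formula) -> str:
    """Translate a tiny tuple-encoded FOL formula without simplifying it.

    Hypothetical encoding: ("atom", predicate, argument), ("not", f),
    ("and", f, g), ("or", f, g), ("iff", f, g),
    ("all", variable, f), ("exists", variable, f).
    """
    kind = formula[0]
    if kind == "atom":
        _, predicate, argument = formula
        # Assumed heuristic: placeholder predicates such as Pre1 end in a digit.
        if re.search(r"\d$", predicate):
            return f"{argument} has property {predicate}"
        return f"{argument} is {predicate}"
    if kind == "not":
        # Negation always uses the fixed prefix; nested negations are kept as-is.
        return "it is not the case that " + to_english(formula[1])
    if kind == "and":
        return f"both {to_english(formula[1])} and {to_english(formula[2])}"
    if kind == "or":
        return f"either {to_english(formula[1])} or {to_english(formula[2])}"
    if kind == "iff":
        return f"{to_english(formula[1])} if and only if {to_english(formula[2])}"
    if kind == "all":
        # The variable name is repeated in the body; no pronouns are introduced.
        return f"For all {formula[1]}, " + to_english(formula[2])
    if kind == "exists":
        return f"There exists at least one {formula[1]}, such that " + to_english(formula[2])
    raise ValueError(f"unknown formula kind: {kind!r}")


print(to_english(("not", ("atom", "Bitter", "Jadiel"))))
print(to_english(("all", "x", ("or", ("atom", "Bitter", "x"), ("atom", "Pre1", "x")))))
```

    Keeping double negations and other redundant structure in the output matters here, since a translator that silently simplified the formula would erase the very variation the follow-up test case is meant to introduce.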
