pith. machine review for the scientific record.

arxiv: 2605.10808 · v1 · submitted 2026-05-11 · 💻 cs.CR · cs.AI

Recognition: 2 Lean theorem links

Threat Modelling using Domain-Adapted Language Models: Empirical Evaluation and Insights


Pith reviewed 2026-05-12 03:46 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords threat modelling · language models · domain adaptation · STRIDE · 5G security · cybersecurity · empirical evaluation · LLM limitations

The pith

Domain-adapted language models do not consistently outperform general-purpose models on structured threat modeling tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether language models adapted to the cybersecurity and telecommunications domains improve performance on structured threat modelling compared with their general-purpose counterparts. The evaluation covers eight models across fifty-two configurations for STRIDE threat classification in 5G scenarios, varying domain adaptation, model scale, decoding strategy (greedy versus stochastic sampling), and prompting technique. Results indicate that domain-adapted models show no reliable advantage, that decoding strategy strongly affects output validity and classification behavior, and that larger models deliver higher but still inconsistent performance. These patterns suggest that current language models face fundamental limitations on security tasks requiring precise structured reasoning, so progress cannot come from data adaptation or scaling alone.

Core claim

The central claim is that domain-adapted LLMs and SLMs trained on telecommunications and cybersecurity data do not consistently surpass their base counterparts when performing STRIDE threat classification on 5G use cases. Across the tested configurations, model scale correlates with better results, but the gains remain neither uniform nor sufficient for dependable application. Decoding strategies exert a pronounced influence on both the validity of generated outputs and the accuracy of threat categorization, while prompting adjustments offer limited mitigation. The work therefore concludes that fundamental limitations in current language models prevent reliable structured threat modelling, and that progress will require task-specific reasoning and stronger grounding in security concepts rather than more data or larger models.

What carries the argument

Systematic empirical comparison of eight general and domain-adapted language models under fifty-two configurations for STRIDE threat classification on 5G scenarios, isolating effects of adaptation, scale, decoding, and prompting.
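
To make the grid concrete, here is a minimal sketch of how such a configuration sweep could be enumerated; the model names and factor levels are assumptions for illustration. A plain product of these factors gives 8 × 2 × 2 = 32 runs, so the paper's 52 configurations must include per-model variants this sketch does not capture.

```python
# Illustrative configuration sweep, not the authors' harness. The model
# identifiers are hypothetical stand-ins for the 8 evaluated models.
from itertools import product

MODELS = [
    "general-llm-8b", "general-llm-3b", "general-slm-1b", "general-moe",
    "telecom-adapted-llm", "telecom-adapted-slm",
    "security-adapted-llm", "security-adapted-slm",
]
DECODING = ["greedy", "stochastic-sampling"]
PROMPTING = ["zero-shot", "few-shot"]

# Full factorial over these assumed factors yields 32 runs; the paper's
# 52 configurations imply extra per-model settings not modeled here.
configurations = list(product(MODELS, DECODING, PROMPTING))

for model, decoding, prompting in configurations:
    # evaluate_stride(model, decoding, prompting)  # hypothetical runner
    pass
```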

If this is right

  • Domain adaptation alone does not deliver consistent gains for STRIDE-based threat classification.
  • Choice of decoding strategy affects output validity and model behavior more than model type or adaptation status (see the decoding sketch after this list).
  • Larger models tend to perform better yet still fall short of the consistency required for practical threat modeling.
  • Prompting techniques can be refined for STRIDE tasks but do not overcome the observed limitations.
  • Reliable LLM use in structured security tasks will need more than data or scale, such as explicit reasoning mechanisms.
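
A minimal sketch of the two decoding regimes contrasted above, assuming the Hugging Face transformers API; the checkpoint, temperature, and top-p values are illustrative assumptions rather than the paper's settings.

```python
# Greedy vs. stochastic decoding for one STRIDE prompt; a sketch, not the
# authors' evaluation code. Checkpoint and sampling values are assumed.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # one plausible evaluated model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

prompt = "Classify the following 5G threat into STRIDE categories: ..."
inputs = tok(prompt, return_tensors="pt").to(model.device)

# Greedy decoding: deterministic, picks the argmax token at every step.
greedy = model.generate(**inputs, max_new_tokens=64, do_sample=False)

# Stochastic sampling: output varies run to run; values here are assumptions.
sampled = model.generate(**inputs, max_new_tokens=64, do_sample=True,
                         temperature=0.7, top_p=0.9)

print(tok.decode(greedy[0], skip_special_tokens=True))
print(tok.decode(sampled[0], skip_special_tokens=True))
```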

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar output validity and consistency problems may surface in other structured security tasks such as attack surface mapping or control selection.
  • Security tools built on LLMs would benefit from mandatory output validation layers and hybrid rule-based components (a validator sketch follows this list).
  • The results encourage development of models that embed security ontologies or step-by-step threat reasoning during training.
  • In operational settings, human review will likely remain necessary for threat modeling outputs generated by current language models.
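
As one possible shape for the validation layer suggested above, the following hypothetical check accepts a response only when every emitted label is a genuine STRIDE category; the comma/semicolon parsing rule is an assumption about output format, not something the paper prescribes.

```python
# Hypothetical STRIDE output validator, sketching a rule-based layer;
# the paper does not specify this implementation.
import re

STRIDE_CATEGORIES = {
    "Spoofing", "Tampering", "Repudiation",
    "Information Disclosure", "Denial of Service", "Elevation of Privilege",
}

def validate_stride_output(text: str) -> tuple[set[str], bool]:
    """Return (recognized labels, is_valid).

    The output counts as valid only when it is non-empty and every label
    the model emitted maps exactly to a STRIDE category; malformed or
    hallucinated labels invalidate the whole response.
    """
    labels = {part.strip() for part in re.split(r"[,;\n]", text) if part.strip()}
    recognized = {label for label in labels if label in STRIDE_CATEGORIES}
    return recognized, bool(labels) and recognized == labels

# Example: a malformed label invalidates the response.
print(validate_stride_output("Spoofing, Tampering"))      # (..., True)
print(validate_stride_output("Spoofing, Denial-of-Srv"))  # (..., False)
```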

Load-bearing premise

The chosen eight models, fifty-two configurations, and STRIDE classification task on 5G scenarios provide a representative test of whether domain-adapted LLMs can perform reliable structured threat modelling in real deployments.

What would settle it

A controlled study in an actual 5G network showing domain-adapted models produce reliably higher accuracy and fewer invalid outputs than general models across repeated independent threat modeling exercises would disprove the main finding.

Figures

Figures reproduced from arXiv: 2605.10808 by AbdulAziz AbdulGhaffar, Ashraf Matrawy, Saba Pourhanifeh.

Figure 1: Zero-shot STRIDE classification prompt used across all evaluated models.
Figure 2: Few-shot STRIDE classification prompt with in-context demonstrations used across all evaluated models.
Figure 3: STRIDE Classification of 5G Threats under Greedy Decoding.
Figure 4: Penalized F1-Score under Greedy Decoding.
Figure 5: Few-shot Performance Gain under Greedy Decoding.
Figure 6: Percentage Distribution and STRIDE Classification of 5G Threats Under Stochastic Sampling.
Figure 7: Penalized F1-Score under Stochastic Sampling, Sorted by Model.
Figure 8: Few-shot Performance Gain under Stochastic Sampling.
Figure 9: Models' Invalid Output Rate in Greedy Decoding.
Figure 11: Example of a model providing a reference we could not find.
Figure 12: Example of an invalid output where the model provides malformed …
Figure 13: Example of an invalid output where the model incorrectly identifies …
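
Figures 4, 7, and 9 report penalized F1-scores and invalid-output rates. The paper's exact penalization scheme is not reproduced on this page; one common construction, sketched below as an assumption, maps every invalid output to a sentinel class that can never match a gold label, so invalid generations count fully against the score.

```python
# One plausible penalized-F1 construction; an assumption, not the paper's
# exact metric definition.
from sklearn.metrics import f1_score

STRIDE = ["Spoofing", "Tampering", "Repudiation", "Information Disclosure",
          "Denial of Service", "Elevation of Privilege"]

def penalized_macro_f1(y_true, raw_predictions):
    # Predictions outside the STRIDE label set become a sentinel class
    # that never matches a gold label, so they score as pure errors.
    y_pred = [p if p in STRIDE else "INVALID" for p in raw_predictions]
    return f1_score(y_true, y_pred, labels=STRIDE, average="macro",
                    zero_division=0)

print(penalized_macro_f1(
    ["Spoofing", "Tampering", "Repudiation"],
    ["Spoofing", "not-a-category", "Repudiation"],  # one invalid output
))
```
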
Original abstract

Large Language Models(LLMs) are increasingly explored for cybersecurity applications such as vulnerability detection. In the domain of threat modelling, prior work has primarily evaluated a number of general-purpose Large Language Models under limited prompting settings. In this study, we extend the research area of structured threat modelling by systematically evaluating domain-adapted language models of different sizes to their general counterparts. We use both LLMs and Small Language Models(SLMs) that were domain adapted to telecommunications and cybersecuirty. For the structured threat modelling, we selected the widely used STRIDE approach and the application area is 5G security. We present a comprehensive empirical evaluation using 52 different configurations (on 8 different language models) to analyze the impact of 1) domain adaptation, 2) model scale, 3) decoding strategies (greedy vs. stochastic sampling), and 4) prompting technique on STRIDE threat classification. Our results show that domain-adapted models do not consistently outperform their general-purpose counterparts, and decoding strategies significantly affect model behavior and output validity. They also show that while larger models generally achieve higher performance, these gains are neither consistent nor sufficient for reliable threat modelling. These findings highlight fundamental limitations of current LLMs for structured threat modelling tasks and suggest that improvements require more than additional training data or model scaling, motivating the need for incorporating more task-specific reasoning and stronger grounding in security concepts. We present insights on invalid outputs encountered and present suggestions for prompting tailored specifically for STRIDE threat modelling.
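
For orientation, here is a rough, hypothetical rendering of the two prompting regimes the abstract contrasts; the study's actual prompt wording appears in Figures 1 and 2 and is not reproduced here.

```python
# Hypothetical zero-shot and few-shot prompt templates; illustrative only,
# not the prompts shown in Figures 1-2.
ZERO_SHOT = (
    "You are a security analyst. Classify the following 5G threat into one or "
    "more STRIDE categories (Spoofing, Tampering, Repudiation, Information "
    "Disclosure, Denial of Service, Elevation of Privilege).\n"
    "Threat: {threat}\n"
    "Categories:"
)

FEW_SHOT = (
    "Classify each 5G threat into STRIDE categories.\n\n"
    # The demonstration below is invented, not one of the paper's examples.
    "Threat: A rogue base station impersonates a legitimate gNB.\n"
    "Categories: Spoofing\n\n"
    "Threat: {threat}\n"
    "Categories:"
)

print(ZERO_SHOT.format(threat="An attacker floods the AMF with registration requests."))
```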

Editorial analysis

A structured set of objections, weighed in public.

A referee report, simulated authors' rebuttal, circularity audit, and axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper reports a systematic empirical evaluation of 8 language models (mix of LLMs and SLMs, domain-adapted to telecom/cybersecurity vs. general-purpose) on STRIDE-based threat classification for 5G security scenarios. Using 52 configurations, it tests the effects of domain adaptation, model scale, decoding strategies (greedy vs. sampling), and prompting on classification accuracy and output validity. Central claims are that domain-adapted models do not consistently outperform general counterparts, decoding strategies significantly impact behavior and validity, larger models yield inconsistent gains insufficient for reliability, and current LLMs have fundamental limitations for structured threat modelling that require more than data or scaling.

Significance. If the empirical patterns hold after methodological clarification, the work offers a useful stress-test of LLMs for cybersecurity applications, showing that domain adaptation and scale alone do not guarantee reliable structured outputs on STRIDE tasks. The multi-factor design (explicitly varying adaptation, size, decoding, and prompting) is a strength that allows isolation of effects and could inform more targeted future work on task-specific reasoning or security grounding. The insights on invalid outputs and prompting suggestions add practical value.

major comments (3)
  1. [Methodology section] Methodology (likely §3 or §4): No description is provided of how ground-truth STRIDE labels were assigned to the 5G scenarios, including whether expert consensus, single annotator, or automated mapping was used, nor any inter-rater agreement metric (e.g., Cohen's kappa). This directly undermines assessment of the reported performance deltas and the claim that domain-adapted models 'do not consistently outperform' general ones.
  2. [Results section] Results (likely §5, Tables 1-3): The statement that 'decoding strategies significantly affect model behavior and output validity' is presented without statistical tests (e.g., significance levels, confidence intervals, or p-values on accuracy/validity differences across greedy vs. sampling). Observed inconsistencies could be artifacts of the 52 configurations rather than general effects.
  3. [Discussion section] Discussion/Conclusion (likely §6): The generalization to 'fundamental limitations of current LLMs for structured threat modelling tasks' and the assertion that 'improvements require more than additional training data or model scaling' rests on the narrow 5G STRIDE classification task with 8 models; the paper does not address how representative these scenarios are or test broader threat-modelling contexts.
minor comments (2)
  1. [Abstract] Abstract: 'Large Language Models(LLMs)' and 'cybersecuirty' contain typographical issues (missing space and misspelling).
  2. [Abstract] The repeated phrasing 'We present insights on invalid outputs encountered and present suggestions' could be streamlined for conciseness.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing honest clarifications based on the manuscript and indicating planned revisions where the concerns are valid.

Point-by-point responses
  1. Referee: [Methodology section] Methodology (likely §3 or §4): No description is provided of how ground-truth STRIDE labels were assigned to the 5G scenarios, including whether expert consensus, single annotator, or automated mapping was used, nor any inter-rater agreement metric (e.g., Cohen's kappa). This directly undermines assessment of the reported performance deltas and the claim that domain-adapted models 'do not consistently outperform' general ones.

    Authors: We acknowledge that the methodology section lacks an explicit description of the ground-truth labeling process. The 5G scenarios were labeled through expert analysis by the authors (who have domain expertise in 5G security and threat modelling), using a systematic review of each scenario description against STRIDE categories; this was performed primarily by one annotator with cross-checks by co-authors, but without formal inter-rater agreement metrics such as Cohen's kappa. We will revise the methodology section to include a clear description of this process and note the absence of quantitative agreement metrics as a study limitation. This will allow better evaluation of the performance claims without misrepresenting the original work. revision: yes · a minimal Cohen's kappa sketch follows these responses

  2. Referee: [Results section] Results (likely §5, Tables 1-3): The statement that 'decoding strategies significantly affect model behavior and output validity' is presented without statistical tests (e.g., significance levels, confidence intervals, or p-values on accuracy/validity differences across greedy vs. sampling). Observed inconsistencies could be artifacts of the 52 configurations rather than general effects.

    Authors: We agree that the absence of statistical tests weakens the support for the claim. While differences in accuracy and validity between greedy and sampling were observed consistently across the 52 configurations and 8 models, no formal tests were included in the original submission. In the revised manuscript, we will add appropriate statistical analyses (e.g., McNemar's test or Wilcoxon signed-rank tests for paired comparisons, with p-values, confidence intervals, and effect sizes) to quantify the significance of decoding strategy effects and address potential artifacts from the experimental design. revision: yes · a sketch of both paired tests follows these responses

  3. Referee: [Discussion section] Discussion/Conclusion (likely §6): The generalization to 'fundamental limitations of current LLMs for structured threat modelling tasks' and the assertion that 'improvements require more than additional training data or model scaling' rests on the narrow 5G STRIDE classification task with 8 models; the paper does not address how representative these scenarios are or test broader threat-modelling contexts.

    Authors: We accept that the conclusions are based on a focused evaluation of 5G STRIDE scenarios with the selected models and that broader generalization requires caution. The 5G scenarios were selected to cover diverse real-world threats in a critical domain, but we did not explicitly discuss their representativeness or test other threat-modelling contexts. We will revise the discussion and conclusion to more explicitly qualify the scope, describe why 5G STRIDE provides relevant insights for structured tasks, and frame the 'fundamental limitations' claim as specific to this setting while calling for future work on additional domains. This addresses the concern without expanding the experimental scope. revision: partial
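
Two minimal sketches of the analyses promised above, both on invented data rather than anything from the paper. First, the inter-rater agreement check from response 1: Cohen's kappa over two hypothetical annotators' STRIDE labels.

```python
# Cohen's kappa on invented labels from two hypothetical annotators;
# not data from the paper.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["Spoofing", "Tampering", "Information Disclosure",
               "Denial of Service", "Spoofing", "Elevation of Privilege"]
annotator_b = ["Spoofing", "Tampering", "Repudiation",
               "Denial of Service", "Spoofing", "Elevation of Privilege"]

print(f"kappa = {cohen_kappa_score(annotator_a, annotator_b):.2f}")
```

Second, the paired comparisons from response 2: McNemar's test on per-item correctness under the two decoders, and a Wilcoxon signed-rank test on paired per-configuration scores.

```python
# Paired significance tests on invented outcomes, not the paper's results.
import numpy as np
from scipy.stats import wilcoxon
from statsmodels.stats.contingency_tables import mcnemar

# 1 = correct classification of the same item under each decoding strategy.
greedy  = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])
sampled = np.array([1, 0, 0, 1, 0, 0, 1, 0, 1, 0])

# 2x2 contingency table of paired (dis)agreements for McNemar's test.
table = [
    [int(((greedy == 1) & (sampled == 1)).sum()), int(((greedy == 1) & (sampled == 0)).sum())],
    [int(((greedy == 0) & (sampled == 1)).sum()), int(((greedy == 0) & (sampled == 0)).sum())],
]
print("McNemar p =", mcnemar(table, exact=True).pvalue)

# Wilcoxon signed-rank test on paired per-configuration F1 scores.
f1_greedy  = [0.62, 0.58, 0.71, 0.49, 0.66]
f1_sampled = [0.55, 0.51, 0.69, 0.40, 0.60]
print("Wilcoxon p =", wilcoxon(f1_greedy, f1_sampled).pvalue)
```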

Circularity Check

0 steps flagged

No significant circularity in empirical evaluation of LLM performance

full rationale

The paper reports direct empirical measurements of model outputs on STRIDE classification tasks across 52 configurations and 8 models, comparing domain-adapted vs. general-purpose LLMs on 5G scenarios. No mathematical derivations, equations, fitted parameters renamed as predictions, or self-referential definitions appear in the provided text. Central claims rest on observed performance deltas and invalid output patterns rather than any reduction to inputs by construction. Prior work is referenced only for context, not as load-bearing justification for uniqueness or ansatz. This is a standard empirical study with no circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a pure empirical evaluation study with no mathematical derivations, new postulates, or invented entities; it relies on the pre-existing STRIDE framework and standard practices for LLM prompting and output classification.

pith-pipeline@v0.9.0 · 5580 in / 1123 out tokens · 40426 ms · 2026-05-12T03:46:21.848935+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages

  [1] H. Zhou, C. Hu, Y. Yuan, Y. Cui, Y. Jin, C. Chen, H. Wu, D. Yuan, L. Jiang, D. Wu et al., "Large language model (LLM) for telecommunications: A comprehensive survey on principles, key techniques, and opportunities," IEEE Communications Surveys & Tutorials, vol. 27, no. 3, pp. 1955–2005, 2024.

  [2] J. Zhang, H. Bu, H. Wen, Y. Liu, H. Fei, R. Xi, L. Li, Y. Yang, H. Zhu, and D. Meng, "When LLMs meet cybersecurity: A systematic literature review," Cybersecurity, vol. 8, no. 1, p. 55, 2025.

  [3] H. Xu, S. Wang, N. Li, K. Wang, Y. Zhao, K. Chen, T. Yu, Y. Liu, and H. Wang, "Large language models for cyber security: A systematic literature review," ACM Transactions on Software Engineering and Methodology, 2024.

  [4] S. Soman and R. HG, "Observations on LLMs for telecom domain: Capabilities and limitations," in Proceedings of the Third International Conference on AI-ML Systems, 2023, pp. 1–5.

  [5] P. R. Houssel, P. Singh, S. Layeghy, and M. Portmann, "Towards explainable network intrusion detection using large language models," in 2024 IEEE/ACM International Conference on Big Data Computing, Applications and Technologies (BDCAT), IEEE, 2024, pp. 67–72.

  [6] Microsoft, "The STRIDE threat model." [Online]. Available: http://msdn.microsoft.com/en-us/library/ee823878(v=cs.20).aspx

  [7] J. Lin, D. Mohaisen et al., "From large to mammoth: A comparative evaluation of large language models in vulnerability detection," in NDSS, 2025.

  [8] A. AbdulGhaffar and A. Matrawy, "LLMs' suitability for network security: A case study of STRIDE threat modeling," 2025. [Online]. Available: https://arxiv.org/abs/2505.04101

  [9] A. Maatouk, K. C. Ampudia, R. Ying, and L. Tassiulas, "Tele-LLMs: A series of specialized large language models for telecommunications." [Online]. Available: https://arxiv.org/abs/2409.05314

  [11] H. Zou, Q. Zhao, Y. Tian, L. Bariah, F. Bader, T. Lestable, and M. Debbah, "TelecomGPT: A framework to build telecom-specific large language models," IEEE Transactions on Machine Learning in Communications and Networking, 2025.

  [12] P. Kassianik, B. Saglam, A. Chen, B. Nelson, A. Vellore, M. Aufiero, F. Burch, D. Kedia, A. Zohary, S. Weerawardhena, A. Priyanshu, A. Swanda, A. Chang, H. Anderson, K. Oshiba, O. Santos, Y. Singer, and A. Karbasi, "Llama-3.1-FoundationAI-SecurityLLM-Base-8B technical report," 2025. [Online]. Available: https://arxiv.org/abs/2504.21039

  [13] E. M. Rudd, C. Andrews, and P. Tully, "A practical guide for evaluating LLMs and LLM-reliant systems," 2025. [Online]. Available: https://arxiv.org/abs/2506.13023

  [14] C. Van Nguyen, X. Shen, R. Aponte, Y. Xia, S. Basu, Z. Hu, J. Chen, M. Parmar, S. Kunapuli, J. Barrow et al., "A survey on small language models," in Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing: Natural Language Processing in the Generative AI Era, 2025, pp. 807–821.

  [15] M. Mahyoub et al., "Security analysis of critical 5G interfaces," IEEE Communications Surveys & Tutorials, 2024.

  [16] J. Parmar, S. Satheesh, M. Patwary, M. Shoeybi, and B. Catanzaro, "Reuse, don't retrain: A recipe for continued pretraining of language models," 2024. [Online]. Available: https://arxiv.org/abs/2407.07263

  [17] Ç. Yıldız, N. K. Ravichandran, N. Sharma, M. Bethge, and B. Ermis, "Investigating continual pretraining in large language models: Insights and implications," 2025. [Online]. Available: https://arxiv.org/abs/2402.17400

  [18] Mistral AI, "Mixtral-8x7B-Instruct-v0.1," 2023. [Online]. Available: https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1

  [19] A. Maatouk, F. Ayed, N. Piovesan, A. De Domenico, M. Debbah, and Z.-Q. Luo, "TeleQnA: A benchmark dataset to assess large language models telecommunications knowledge," IEEE Network, 2025.

  [20] C. Zheng, H. Zhou, F. Meng, J. Zhou, and M. Huang, "Large language models are not robust multiple choice selectors," 2024. [Online]. Available: https://arxiv.org/abs/2309.03882

  [21] AI@Meta, "Llama 3 model card," 2024. [Online]. Available: https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md

  [22] S. Weerawardhena, P. Kassianik, B. Nelson, B. Saglam, A. Vellore, A. Priyanshu, S. Vijay, M. Aufiero, A. Goldblatt, F. Burch, E. Li, J. He, D. Kedia, K. Oshiba, Z. Yang, Y. Singer, and A. Karbasi, "Llama-3.1-FoundationAI-SecurityLLM-8B-Instruct technical report," 2025. [Online]. Available: https://arxiv.org/abs/2508.01059

  [23] AI@Meta, "Llama 3.1 8B model," 2024. [Online]. Available: https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct

  [24] M. Ma, G. Chochlakis, N. M. Pandiyan, J. Thomason, and S. Narayanan, "Large language models do multi-label classification differently," in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Suzhou, China: Association for Computational Linguistics, Nov. 2025, pp. 2472–2495. [Online]. Available: https://aclanthology.org/2...

  [25] 3rd Generation Partnership Project (3GPP), "Security architecture and procedures for 5G system," Tech. Rep. TS 33.501, Release 19, version 19.2.0, 2025.

  [26] 3GPP, "Security Assurance Specification (SCAS) threats and critical assets in 3GPP network product classes," Tech. Rep. TS 33.926, Release 19, version 19.3.0, 2025.

  [27] M. Bartock et al., "5G Cybersecurity," National Institute of Standards and Technology, NIST Special Publication 800-33B, Apr. 2022. [Online]. Available: https://www.nccoe.nist.gov/sites/default/files/2022-04/nist-5G-sp1800-33b-preliminary-draft.pdf

  [28] 3GPP, "Study on Security for New Radio (NR) Integrated Access and Backhaul (IAB) (Release 17)," Tech. Rep. TS 33.824, version 17.0.0, 2022.

  [29] AI@Meta, "Llama 3.2 3B model," 2024. [Online]. Available: https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct

  [30] AI@Meta, "Llama 3.2 1B model," 2024. [Online]. Available: https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct

  [31] M. Renze, "The effect of sampling temperature on problem solving in large language models," in Findings of the Association for Computational Linguistics: EMNLP 2024, Nov. 2024, pp. 7346–7356. [Online]. Available: https://aclanthology.org/2024.findings-emnlp.432/

  [32] L. Li, L. Sleem, G. Nichil, R. State et al., "Exploring the impact of temperature on large language models: Hot or cold?" Procedia Computer Science, vol. 264, pp. 242–251, 2025.

  [33] A. Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi, "The curious case of neural text degeneration," in International Conference on Learning Representations, 2020. [Online]. Available: https://openreview.net/forum?id=rygGQyrFvH

  [34] "Hugging Face – The AI community building the future." [Online]. Available: https://huggingface.co/

  [35] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., "Language models are few-shot learners," Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901, 2020.

  [36] M. Turpin, J. Michael, E. Perez, and S. Bowman, "Language models don't always say what they think: Unfaithful explanations in chain-of-thought prompting," Advances in Neural Information Processing Systems, vol. 36, pp. 74952–74965, 2023.

  [37] Z. Yin, Q. Sun, Q. Guo, J. Wu, X. Qiu, and X. Huang, "Do large language models know what they don't know?" in Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada: Association for Computational Linguistics, Jul. 2023, pp. 8653–8665. [Online]. Available: https://aclanthology.org...

  [38] A. Kong, S. Zhao, H. Chen, Q. Li, Y. Qin, R. Sun, X. Zhou, E. Wang, and X. Dong, "Better zero-shot reasoning with role-play prompting," in Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2024, pp. 4099–4113.