pith. machine review for the scientific record.

arxiv: 2604.22432 · v1 · submitted 2026-04-24 · 💻 cs.SE

Recognition: unknown

R2Code: A Self-Reflective LLM Framework for Requirements-to-Code Traceability

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 11:11 UTC · model grok-4.3

classification 💻 cs.SE
keywords requirements-to-code traceability · LLM framework · self-reflective verification · bidirectional alignment · dynamic context retrieval · software maintenance · trace link accuracy

The pith

R2Code improves requirement-to-code traceability accuracy by 7.4% on average while cutting token use by up to 41.7% through self-reflective LLM processing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that a self-reflective LLM framework can generate more reliable links between software requirements and code than lexical or embedding-based methods. It does this by decomposing requirements into semantic layers, aligning them bidirectionally with code structures, verifying consistency through generated explanations, and adjusting context retrieval dynamically. A sympathetic reader would care because better traceability supports ongoing software maintenance by making it easier to locate related code and understand system changes. The reported experiments show gains across five datasets in multiple domains and two languages, with lower inference costs from the adaptive controls.
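The four-step reading above (decompose, align bidirectionally, verify, retrieve adaptively) can be sketched as a toy pipeline. Every function below is an illustrative stand-in, with word overlap in place of LLM semantics; none of it is the paper's actual implementation.

```python
import re

# Toy sketch of the decompose -> align -> verify flow described above.
# Word-overlap scoring stands in for LLM-based semantic matching.

def tokens(text: str) -> set[str]:
    return {t.lower() for t in re.findall(r"[A-Za-z]+", text)}

def decompose(requirement: str) -> list[str]:
    # The paper decomposes into four semantic layers; we fake it with labels.
    layers = ["functional", "non-functional", "constraint", "contextual"]
    return [f"{layer}: {requirement}" for layer in layers]

def align(layers: list[str], code_units: list[str]):
    # Bidirectional-alignment stand-in: score every (layer, unit) pair.
    return [(layer, unit, len(tokens(layer) & tokens(unit)))
            for layer in layers for unit in code_units]

def verify(links, min_overlap: int = 2):
    # Consistency-check stand-in: keep only links clearing a threshold.
    return [link for link in links if link[2] >= min_overlap]

requirement = "The system shall log every failed login attempt"
code_units = ["def log_failed_login(attempt): ...",
              "def render_homepage(): ..."]
links = verify(align(decompose(requirement), code_units))
```

Run on the toy inputs, only the `log_failed_login` unit survives verification; the homepage function shares no vocabulary with the requirement and is dropped.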

Core claim

R2Code integrates a decomposition-enhanced Bidirectional Alignment Network to align four-layer requirement semantics with corresponding code structures for cross-level matching, a Self-Reflective Consistency Verification module that uses explanation-guided checks to calibrate link reliability, and a Dynamic Context-Adaptive Retrieval mechanism that adjusts granularity and filters contexts via semantic-overlap weighting. This combination is shown to outperform baselines on public datasets while reducing token consumption through efficient context utilization.

What carries the argument

The three integrated components: Bidirectional Alignment Network for semantic layer alignment, Self-Reflective Consistency Verification for explanation-based reliability calibration, and Dynamic Context-Adaptive Retrieval for overlap-weighted context filtering.
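The third component's overlap-weighted context filtering can be illustrated with a small sketch. Jaccard overlap over word sets is an assumed stand-in here; the paper's actual semantic-overlap weighting is not specified in this review.

```python
import re

# Illustrative DCAR-style filter: weight each candidate context by its
# Jaccard word overlap with the query, keep those above a threshold,
# and measure the resulting context (token/character) savings.

def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z]+", text.lower()))

def filter_contexts(query: str, contexts: list[str], min_overlap: float = 0.1):
    q = tokens(query)
    kept = []
    for ctx in contexts:
        c = tokens(ctx)
        weight = len(q & c) / len(q | c) if q | c else 0.0
        if weight >= min_overlap:
            kept.append((weight, ctx))
    return sorted(kept, reverse=True)

query = "validate user password on login"
contexts = [
    "def check_password(user, password): return hash(password) == user.pw",
    "def draw_chart(data): ...",
    "def login(user): validate(user)",
]
kept = filter_contexts(query, contexts)
saved = 1 - sum(len(c) for _, c in kept) / sum(len(c) for c in contexts)
```

The unrelated charting function is filtered out, so the prompt context shrinks; this is the mechanism by which overlap weighting trades context length for relevance.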

If this is right

  • Trace links become more complete and consistent across projects in different domains and languages.
  • Inference costs drop substantially as context is filtered by semantic overlap rather than fixed long windows.
  • Each link receives a calibrated reliability score from the verification step.
  • The approach scales to both Java and Python codebases with reported gains.
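The calibrated reliability score in the third bullet can be mimicked with a toy calibrator: score a link by the fraction of independent checks it passes. The checks below are deterministic heuristics standing in for the paper's explanation-guided LLM verification.

```python
# Toy reliability calibration: the score is the fraction of consistency
# checks a link passes. Real SRCV uses LLM-generated explanations; these
# heuristics are purely illustrative.

def checks(requirement: str, code: str):
    req_words = set(requirement.lower().split())
    code_words = set(code.lower().replace("_", " ").replace("(", " ").split())
    yield bool(req_words & code_words)   # lexical agreement
    yield len(code) > 20                 # code unit is non-trivial
    yield len(req_words) > 3             # requirement is specific

def reliability(requirement: str, code: str) -> float:
    results = list(checks(requirement, code))
    return sum(results) / len(results)
```

A well-matched pair passes all checks and scores 1.0; an empty or unrelated pair scores 0.0, giving downstream tooling a graded signal rather than a binary link.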

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The self-reflective verification step could transfer to other software tasks that need LLM outputs checked for internal consistency.
  • If the cost savings hold, teams might run traceability checks more frequently during development cycles.
  • The layered alignment idea might extend to tracing requirements to test cases or design documents.

Load-bearing premise

Current LLMs can reliably perform cross-level semantic matching and explanation-guided consistency verification without systematic hallucinations or domain biases that invalidate the links.

What would settle it

An experiment on a new dataset in which expert reviewers manually audit all produced links and find no F1 improvement, or an outright accuracy drop, would falsify the performance claims.
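Operationally, that audit reduces to recomputing F1 over predicted links versus an expert-approved gold set. A minimal sketch, with link pairs invented for illustration:

```python
# Minimal F1 audit: compare predicted trace links against an
# expert-audited gold set. Link pairs below are invented examples.

def f1_score(predicted: set, gold: set) -> float:
    tp = len(predicted & gold)       # links both predicted and confirmed
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = {("REQ-1", "Auth.java"), ("REQ-2", "Logger.java"), ("REQ-3", "Cart.java")}
pred = {("REQ-1", "Auth.java"), ("REQ-2", "Logger.java"), ("REQ-2", "Cart.java")}
```

With two of three links confirmed, precision and recall are both 2/3 and F1 is 2/3; the falsification test asks whether this number fails to beat the baselines once the gold set is expert-audited.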

Figures

Figures reproduced from arXiv: 2604.22432 by Jacky Keung, Kehui Chen, Xiaoxue Ma, Yifei Wang, Yishu Li, Zhenyu Mao.

Figure 1
Figure 1: R2Code Traceability Workflow. C. HSD+BAN: Decomposition and Alignment. R2Code begins by converting both requirements and code entities into structured semantic representations, so that matching is performed over aligned semantic dimensions rather than raw text alone. Specifically, the framework applies Hierarchical Semantic Decomposition (HSD) to obtain a four-layer requirement representation and a four-la… view at source ↗
Figure 2
Figure 2: Layer-wise alignment scores of BAN on iTrust. view at source ↗
Figure 3
Figure 3: Discriminative consistency validation of SRCV. view at source ↗
read the original abstract

Accurate requirement-to-code traceability is crucial for software maintenance. However, existing IR- and embedding-based methods are heavily dependent on lexical similarity, often yielding incomplete or inconsistent links across projects and languages and incurring high cost from long-context retrieval and prompting. This paper presents R2Code, an LLM-based semantic traceability framework designed to improve trace link accuracy while reducing inference cost. R2Code integrates three components: 1) a decomposition-enhanced Bidirectional Alignment Network (BAN) that aligns four-layer requirement semantics with corresponding code structures to support cross-level semantic matching; 2) a Self-Reflective Consistency Verification (SRCV) module that conducts explanation-guided consistency checking to calibrate link reliability; and 3) a Dynamic Context-Adaptive Retrieval (DCAR) mechanism that adjusts retrieval granularity and filters contexts using semantic-overlap weighting for efficient context utilization. Experiments on five public datasets spanning multiple domains and two programming languages demonstrate that R2Code consistently outperforms the strongest baselines, achieving an average F1 gain of 7.4%, while reducing token consumption by up to 41.7% through adaptive context control.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper introduces R2Code, an LLM-based framework for requirements-to-code traceability. It integrates a decomposition-enhanced Bidirectional Alignment Network (BAN) for four-layer semantic alignment between requirements and code, a Self-Reflective Consistency Verification (SRCV) module for explanation-guided link calibration, and a Dynamic Context-Adaptive Retrieval (DCAR) mechanism for adjusting retrieval granularity and filtering via semantic-overlap weighting. Experiments across five public datasets in multiple domains and two languages claim consistent outperformance over baselines with an average 7.4% F1 gain and up to 41.7% token reduction.

Significance. If the empirical results hold under rigorous validation, the work would offer a practical advance in software traceability by addressing lexical limitations of prior IR/embedding methods through semantic LLM components while mitigating high inference costs. The modular design targeting alignment, verification, and efficiency could inform future LLM applications in maintenance tasks, provided the gains are shown to stem from the framework rather than unexamined LLM behaviors.

major comments (3)
  1. [Section 4] Section 4 (Experiments): The abstract and results claim consistent 7.4% average F1 improvement and 41.7% token savings across five datasets, but provide no details on baseline selection criteria, whether they are the strongest current methods, application of statistical significance tests (e.g., paired t-tests or Wilcoxon), or variance/standard deviation across multiple LLM runs to account for stochasticity. This information is load-bearing for the central claim of outperformance.
  2. [Section 3.2] Section 3.2 (SRCV): The Self-Reflective Consistency Verification module is presented as performing explanation-guided consistency checking to calibrate link reliability. However, no ablation studies, hallucination rate measurements, or error analysis on spurious link endorsement are reported, leaving open whether the F1 gains can be attributed to the module or to LLM artifacts.
  3. [Section 3.1] Section 3.1 (BAN): The decomposition-enhanced Bidirectional Alignment Network is claimed to support cross-level semantic matching via four-layer alignment. The manuscript does not specify how the layers are defined or decomposed, nor does it include controls for whether alignment quality varies by domain or language, which directly affects the reported cross-dataset consistency.
minor comments (3)
  1. [Abstract] Abstract: The phrase 'four-layer requirement semantics' is used without a concise definition or pointer to the decomposition method, which would aid reader comprehension.
  2. [Section 4] The manuscript would benefit from a summary table of the five datasets (size, domains, languages, number of true links) to contextualize the results.
  3. [Section 3] Notation for the three components (BAN, SRCV, DCAR) should be introduced with explicit definitions in the first use in the main body rather than relying solely on the abstract.
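The significance testing requested in major comment 1 amounts to a paired test over per-dataset F1 differences. A stdlib-only sketch with invented per-dataset gains (only their 7.4-point mean matches the paper's reported average):

```python
import math
import statistics

# Paired t statistic over per-dataset F1 differences between R2Code and
# the best baseline. The five gains are hypothetical; the paper reports
# only the 7.4% average.

def paired_t(diffs: list[float]) -> float:
    n = len(diffs)
    mean = statistics.mean(diffs)
    sd = statistics.stdev(diffs)        # sample std dev of differences
    return mean / (sd / math.sqrt(n))   # t with n-1 degrees of freedom

gains = [9.1, 5.2, 8.0, 6.3, 8.4]       # invented F1 gains in points
t = paired_t(gains)
```

With five datasets the critical value at the 5% level (two-sided, df = 4) is about 2.776, so a t statistic well above that would support the claimed consistency; reporting the per-run variance, as the referee asks, is what makes this computation possible.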

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight areas where the manuscript can be strengthened. We address each major comment below and will incorporate the suggested clarifications and analyses in the revised version.

read point-by-point responses
  1. Referee: [Section 4] Section 4 (Experiments): The abstract and results claim consistent 7.4% average F1 improvement and 41.7% token savings across five datasets, but provide no details on baseline selection criteria, whether they are the strongest current methods, application of statistical significance tests (e.g., paired t-tests or Wilcoxon), or variance/standard deviation across multiple LLM runs to account for stochasticity. This information is load-bearing for the central claim of outperformance.

    Authors: We agree that explicit documentation of these elements is necessary to support the central claims. In the revision, we will expand Section 4 with: (i) a clear statement of baseline selection criteria, confirming that the chosen methods represent the strongest published approaches for requirements-to-code traceability at the time of submission; (ii) results of statistical significance testing (paired t-tests and Wilcoxon signed-rank tests) on the F1 scores across the five datasets; and (iii) mean F1 scores with standard deviations computed over multiple independent runs (five runs per configuration with varied random seeds) to quantify stochasticity. These additions will be presented in new tables and text to substantiate the reported 7.4% average gain and token reductions. revision: yes

  2. Referee: [Section 3.2] Section 3.2 (SRCV): The Self-Reflective Consistency Verification module is presented as performing explanation-guided consistency checking to calibrate link reliability. However, no ablation studies, hallucination rate measurements, or error analysis on spurious link endorsement are reported, leaving open whether the F1 gains can be attributed to the module or to LLM artifacts.

    Authors: We acknowledge the need for targeted analysis to isolate SRCV's contribution. The revised manuscript will include: (i) ablation experiments that disable the SRCV module while keeping BAN and DCAR fixed, reporting the resulting F1 drops on all datasets; (ii) a quantitative assessment of hallucination rates via manual review of a stratified sample of generated explanations; and (iii) an error analysis categorizing spurious link endorsements and their impact on overall performance. These results will be added to Section 3.2 and the experimental section to demonstrate that the observed gains are attributable to the module rather than generic LLM behavior. revision: yes

  3. Referee: [Section 3.1] Section 3.1 (BAN): The decomposition-enhanced Bidirectional Alignment Network is claimed to support cross-level semantic matching via four-layer alignment. The manuscript does not specify how the layers are defined or decomposed, nor does it include controls for whether alignment quality varies by domain or language, which directly affects the reported cross-dataset consistency.

    Authors: We will revise Section 3.1 to provide an explicit definition of the four-layer alignment structure, including the precise decomposition rules applied to requirement semantics (e.g., functional, non-functional, constraint, and contextual layers) and corresponding code elements (e.g., method signatures, bodies, comments, and dependencies). In addition, we will add a new subsection with per-dataset and per-language breakdowns of alignment quality metrics, along with controls that measure consistency across domains and languages. This will directly address concerns about cross-dataset validity and strengthen the justification for the reported performance. revision: yes
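The layer pairing proposed in this response can be written down as a small schema. The layer names follow the rebuttal's examples; the pairing and the mean aggregation are hypothetical illustrations, not the paper's design.

```python
from dataclasses import dataclass

# Hypothetical schema for four-layer requirement-to-code alignment.
# Layer names are taken from the rebuttal's examples; the per-pair
# score matrix and its mean aggregation are illustrative choices.

REQ_LAYERS = ("functional", "non-functional", "constraint", "contextual")
CODE_LAYERS = ("signature", "body", "comments", "dependencies")

@dataclass
class LayeredLink:
    requirement_id: str
    code_unit: str
    scores: dict  # (requirement layer, code layer) -> alignment score

    def aggregate(self) -> float:
        # Plain mean over the 4x4 score matrix, for illustration only.
        return sum(self.scores.values()) / len(self.scores)

link = LayeredLink(
    "REQ-7", "PaymentService.charge",
    {(r, c): 0.5 for r in REQ_LAYERS for c in CODE_LAYERS},
)
```

Per-dataset and per-language breakdowns of the 16 cell scores, rather than only the aggregate, would directly address the referee's concern about cross-domain alignment quality.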

Circularity Check

0 steps flagged

No circularity: empirical framework validated on external datasets

full rationale

The paper proposes R2Code as an LLM framework with three explicitly designed components (BAN for four-layer alignment, SRCV for explanation-guided verification, DCAR for adaptive retrieval). Performance metrics (7.4% average F1 gain, up to 41.7% token reduction) are reported as measured experimental outcomes across five public datasets in multiple domains and languages, not as predictions derived from the framework definitions themselves. No equations, self-referential fits, or load-bearing self-citations appear in the abstract or context that would reduce claims to inputs by construction. The derivation chain consists of design choices followed by independent empirical evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the unproven capability of LLMs to perform the described semantic and reflective tasks; no free parameters or invented entities are explicitly named in the abstract.

axioms (1)
  • domain assumption: LLMs can reliably perform decomposition-enhanced semantic alignment and explanation-guided consistency verification for traceability links.
    Invoked by the design of the BAN and SRCV components; no derivation or external validation supplied in the abstract.

pith-pipeline@v0.9.0 · 5511 in / 1247 out tokens · 55959 ms · 2026-05-08T11:11:14.029814+00:00 · methodology

