pith. machine review for the scientific record.

arxiv: 2604.22207 · v1 · submitted 2026-04-24 · 💻 cs.SE · cs.AI · cs.CL

Recognition: unknown

Evaluating LLM-Based Goal Extraction in Requirements Engineering: Prompting Strategies and Their Limitations


Pith reviewed 2026-05-08 11:28 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.CL
keywords large language models · requirements engineering · goal extraction · prompt engineering · feedback loop · goal-oriented requirements engineering · in-context learning · software documentation

The pith

An LLM pipeline with a feedback loop extracts low-level goals from documentation at 61 percent accuracy, and is best used to speed up human requirements work rather than replace it.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests a sequence of large language models that first identify actors, then pull high-level and low-level functional goals out of software documentation using carefully engineered prompts. Different prompting styles are compared, including few-shot examples and a two-model feedback loop in which one model proposes goals and a second critiques them. The zero-shot version of this loop improves results over plain few-shot prompting, yet the final accuracy of 61 percent in the hardest extraction step leads the authors to conclude that the system accelerates manual goal collection more reliably than it replaces it. They note that adding few-shot examples to the feedback loop brings no further gain and point to the prompting used by the critic model as the current bottleneck.

Core claim

A chain of LLMs processes documentation through actor identification, high-level goal extraction, and low-level goal extraction, with a generation-critic feedback loop that lets one model critique and refine the output of another. This loop combined with zero-shot prompting outperforms standalone few-shot prompting, while the same loop paired with few-shot examples yields no extra benefit, suggesting that the critic model's prompting strategy sets the performance ceiling. The pipeline reaches 61 percent accuracy on low-level goal identification, a result the authors interpret as evidence that the method is most useful for accelerating rather than fully automating manual goal extraction in GORE.
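The three-stage chain amounts to a sequence of prompt calls, each stage consuming the previous stage's output. A minimal sketch, assuming a generic `llm` callable; the prompt wording below is illustrative and not the paper's engineered prompts:

```python
def extract_goals(document: str, llm) -> dict:
    """Run the three-stage chain: actors -> high-level goals -> low-level goals.

    `llm` is any callable mapping a prompt string to a response string.
    The prompt templates are placeholders, not the paper's actual prompts.
    """
    actors = llm(
        f"List the actors mentioned in this documentation:\n{document}"
    )
    high_level = llm(
        "Given these actors, extract the high-level functional goals.\n"
        f"Actors: {actors}\nDocument: {document}"
    )
    low_level = llm(
        "Refine each high-level goal into concrete low-level goals.\n"
        f"High-level goals: {high_level}\nDocument: {document}"
    )
    return {"actors": actors, "high_level": high_level, "low_level": low_level}
```

Each stage's output is injected into the next stage's prompt, which is what makes errors in actor identification propagate downstream to the 61 percent low-level stage.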

What carries the argument

The generation-critic feedback loop, in which one LLM generates candidate goals and a second LLM evaluates and refines them before the next step.
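A minimal version of such a loop alternates a generator call with a critic call until the critic accepts or a round limit is hit. The `generate` and `critique` callables and the "ACCEPT" convention here are illustrative assumptions, not the paper's implementation:

```python
def feedback_loop(document, generate, critique, max_rounds=3):
    """Generation-critic loop sketch.

    `generate(document, feedback)` proposes candidate goals, optionally
    conditioned on the critic's last feedback; `critique(document, candidate)`
    returns "ACCEPT" or revision feedback. Both are caller-supplied LLM
    wrappers; the ACCEPT protocol is an assumed convention.
    """
    feedback = None
    candidate = None
    for _ in range(max_rounds):
        candidate = generate(document, feedback)
        feedback = critique(document, candidate)
        if feedback == "ACCEPT":
            break
    return candidate
```

Under this shape, the paper's observation that few-shot examples add nothing on top of the loop would localize the bottleneck in the `critique` prompt rather than in `generate`.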

If this is right

  • The feedback loop raises accuracy when used with zero-shot prompting for low-level goal extraction.
  • Combining the feedback loop with few-shot examples produces no additional accuracy gain.
  • Overall performance remains at 61 percent accuracy for the final extraction stage.
  • The method is positioned as a way to accelerate manual goal extraction rather than replace it entirely.
  • Refining the number and quality of examples plus adding retrieval-augmented generation or chain-of-thought prompting could raise accuracy further.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same actor-to-goal pipeline could be applied to other textual requirements artifacts such as use-case descriptions.
  • If current similarity metrics overlook goals that are useful but worded differently from the ground truth, then human usefulness ratings on new documents would be a stricter test.
  • Testing the pipeline on documentation from domains outside the original study could show whether the 61 percent figure generalizes.
  • The fact that zero-shot feedback beats few-shot alone hints that the critic step may reduce the need for many hand-crafted examples in similar extraction tasks.

Load-bearing premise

That the similarity metrics and human-annotated ground truth used in the evaluation correctly capture whether an extracted goal is accurate and useful for later requirements engineering work.
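To make the premise concrete: under a similarity-based evaluation, accuracy reduces to thresholded best-match similarity against the annotated goals. The bag-of-words embedding and the 0.7 cutoff below are placeholder assumptions standing in for sentence embeddings and whatever threshold the paper actually used:

```python
import math

def embed(text):
    # Placeholder: bag-of-words counts stand in for sentence embeddings.
    vec = {}
    for tok in text.lower().split():
        vec[tok] = vec.get(tok, 0) + 1
    return vec

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def accuracy(extracted, ground_truth, threshold=0.7):
    # A ground-truth goal counts as recovered if some extracted goal
    # clears the similarity threshold; accuracy is the recovered fraction.
    hits = sum(
        1 for gt in ground_truth
        if any(cosine(embed(gt), embed(ex)) >= threshold for ex in extracted)
    )
    return hits / len(ground_truth)
```

The failure mode the premise hides is visible here: a goal worded with disjoint vocabulary scores near zero even when it is semantically correct, which is exactly why independent human ratings would be a stricter test.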

What would settle it

A new set of documentation processed by the pipeline, followed by independent expert ratings of each extracted goal for correctness and downstream utility, scored without reference to the original similarity metrics.

Figures

Figures reproduced from arXiv: 2604.22207 by Andrea Bioddo, Angelo Bongiorno, Anna Arnaudo, Flavio Giobergia, Luca Dadone, Maurizio Morisio, Riccardo Coppola.

Figure 1. Schema of the proposed architecture, consisting of an LLM chain.
original abstract

Due to the textual and repetitive nature of many Requirements Engineering (RE) artefacts, Large Language Models (LLMs) have proven useful to automate their generation and processing. In this paper, we discuss a possible approach for automating the Goal-Oriented Requirements Engineering (GORE) process by extracting functional goals from software documentation through three phases: actor identification, high and low-level goal extraction. To implement these functionalities, we propose a chain of LLMs fed with engineered prompts. We experimented with different variants of in-context learning and measured the similarities between input data and in-context examples to better investigate their impact. Another key element is the generation-critic mechanism, implemented as a feedback loop involving two LLMs. Although the pipeline achieved 61% accuracy in low-level goal identification, the final stage, these results indicate the approach is best suited as a tool to accelerate manual extraction rather than as a full replacement. The feedback-loop mechanism with Zero-shot outperformed stand-alone Few-shot, with an ablation study suggesting that performance slightly degrades without the feedback cycle. However, we reported that the combination of the feedback mechanism with Few-shot does not deliver any advantage, possibly suggesting that the primary performance ceiling is the prompting strategy applied to the 'critic' LLM. Together with the refinement of both the quantity and quality of the Shot examples, future research will integrate Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT) prompting to improve accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes a chained LLM pipeline for automating Goal-Oriented Requirements Engineering (GORE) by extracting functional goals from software documentation across three phases: actor identification, high-level goal extraction, and low-level goal extraction. It evaluates variants of in-context learning (zero-shot and few-shot) combined with a generation-critic feedback loop, reports 61% accuracy on the final low-level goal stage, finds that zero-shot with feedback outperforms standalone few-shot while few-shot with feedback adds no benefit, and concludes via ablation that the approach is best positioned as an accelerator for manual extraction rather than a replacement, with future work suggested on RAG and CoT.

Significance. If the results hold, the work supplies a concrete empirical comparison of prompting strategies for LLM-based goal extraction in requirements engineering, including an ablation isolating the feedback loop's contribution and explicit acknowledgment of performance ceilings. This could guide prompt design choices in RE automation tasks and highlight practical boundaries of current LLMs, providing a useful data point for the community even if the absolute accuracy remains moderate.

major comments (3)
  1. [Abstract and experimental evaluation] The central performance claim of 61% accuracy on low-level goal identification (Abstract) is presented without any definition of the similarity metric, how accuracy is derived from it, the size of the evaluation dataset, or inter-annotator agreement on the human ground truth. These omissions make it impossible to assess the reliability or reproducibility of the reported comparisons between prompting strategies and the feedback-loop ablation.
  2. [Discussion and conclusions] The claim that the pipeline is 'best suited as a tool to accelerate manual extraction rather than as a full replacement' (Abstract) rests on unvalidated similarity to human annotations; no error analysis, failure-case breakdown, or downstream validation (e.g., impact on traceability or verification tasks) is provided to confirm that high-similarity outputs are practically useful.
  3. [Ablation study] The ablation study finding that performance degrades without the feedback cycle and that few-shot plus feedback yields no advantage (Abstract) lacks supporting details on how the critic LLM was prompted or any quantitative breakdown of where the feedback helps or fails, limiting insight into the suggested performance ceiling.
minor comments (1)
  1. The abstract refers to 'measured the similarities between input data and in-context examples' but provides no further elaboration or results from this analysis; consider expanding this into a dedicated subsection or table.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the referee's constructive and detailed comments. We appreciate the focus on improving clarity, reproducibility, and depth of analysis in our work on LLM-based goal extraction for requirements engineering. Below we provide point-by-point responses to the major comments, indicating planned revisions to the manuscript.

point-by-point responses
  1. Referee: [Abstract and experimental evaluation] The central performance claim of 61% accuracy on low-level goal identification (Abstract) is presented without any definition of the similarity metric, how accuracy is derived from it, the size of the evaluation dataset, or inter-annotator agreement on the human ground truth. These omissions make it impossible to assess the reliability or reproducibility of the reported comparisons between prompting strategies and the feedback-loop ablation.

    Authors: We agree that the abstract should be more self-contained to support immediate assessment of the key result. In the revised version we will expand the abstract to briefly define the similarity metric (cosine similarity over sentence embeddings), state how accuracy is computed from it, report the evaluation dataset size (number of documents and extracted goals), and reference the inter-annotator agreement achieved during ground-truth creation. These elements are already described in the experimental setup and evaluation sections; elevating the essential facts to the abstract will directly address the reproducibility concern while respecting abstract length limits. revision: yes

  2. Referee: [Discussion and conclusions] The claim that the pipeline is 'best suited as a tool to accelerate manual extraction rather than as a full replacement' (Abstract) rests on unvalidated similarity to human annotations; no error analysis, failure-case breakdown, or downstream validation (e.g., impact on traceability or verification tasks) is provided to confirm that high-similarity outputs are practically useful.

    Authors: We acknowledge that the positioning of the approach as an accelerator is currently supported primarily by the moderate absolute accuracy. To strengthen the claim we will add a new subsection on error analysis that categorizes failure modes (e.g., missed actors, overly generic goals, or incorrect refinements) with quantitative counts and representative examples. We will also include a short discussion of potential downstream effects on traceability and verification tasks, drawing on the observed error patterns even though we did not run new end-to-end experiments. These additions will provide concrete evidence for the practical utility assessment. revision: yes

  3. Referee: [Ablation study] The ablation study finding that performance degrades without the feedback cycle and that few-shot plus feedback yields no advantage (Abstract) lacks supporting details on how the critic LLM was prompted or any quantitative breakdown of where the feedback helps or fails, limiting insight into the suggested performance ceiling.

    Authors: We agree that additional transparency on the critic component is needed. In the revised ablation section we will include the exact prompt template used for the critic LLM and provide a quantitative breakdown (e.g., percentage of generations where the critic proposed changes, acceptance rate of those changes, and per-category improvement statistics). This will clarify the contribution of the feedback loop and better substantiate the observation about the prompting-strategy ceiling. revision: yes
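The breakdown the authors promise could be computed from a simple log of loop rounds. The record shape below (`critic_changed`, `change_accepted`) is an assumed illustration of such a log, not data from the paper:

```python
def critic_stats(rounds):
    """Summarize critic behaviour from loop logs.

    `rounds` is an assumed list of dicts with boolean keys:
    'critic_changed'  - the critic proposed a revision this round;
    'change_accepted' - that revision survived into the final output.
    """
    n = len(rounds)
    changed = [r for r in rounds if r["critic_changed"]]
    accepted = [r for r in changed if r["change_accepted"]]
    return {
        "intervention_rate": len(changed) / n if n else 0.0,
        "acceptance_rate": len(accepted) / len(changed) if changed else 0.0,
    }
```

A low intervention rate with a high acceptance rate would support the authors' suspicion that the critic prompt, not the generator, is the ceiling.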

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation of prompting variants against external annotations

full rationale

The paper reports an experimental pipeline for LLM-based goal extraction from requirements documents, with accuracy (61% on low-level goals) computed via similarity metrics to human-annotated ground truth, plus ablations on feedback loops, zero-shot vs. few-shot, and in-context example similarity. No mathematical derivations, equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the described methodology or results. All reported outcomes are direct measurements from the experiments rather than reductions to the paper's own inputs by construction, so the evaluation chain is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The evaluation rests on standard assumptions about LLM promptability and the validity of similarity-based accuracy metrics in requirements engineering; no free parameters or new entities are introduced.

axioms (2)
  • domain assumption LLMs can be instructed via prompts to identify actors and extract goals from requirements text at usable accuracy
    Invoked throughout the pipeline design and evaluation.
  • domain assumption Human-annotated goals and similarity metrics constitute appropriate ground truth for measuring extraction quality
    Underlies the 61% accuracy claim and ablation conclusions.

pith-pipeline@v0.9.0 · 5583 in / 1409 out tokens · 34366 ms · 2026-05-08T11:28:20.234914+00:00 · methodology

