Splitting User Stories Into Tasks with AI -- A Foe or an Ally?
Pith reviewed 2026-05-11 01:03 UTC · model grok-4.3
The pith
AI can help create more granular task lists from user stories but requires human oversight to filter out irrelevant suggestions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The controlled experiment showed that AI-assisted approaches generated more granular task lists from user stories and helped ensure no important tasks were overlooked, yet participants noted that AI occasionally produced irrelevant tasks. As a result, the preferred method was a hybrid one that combines AI generation with conventional human review to achieve higher accuracy in planning.
What carries the argument
A controlled experiment that directly compares traditional task-splitting methods with AI-assisted task splitting using a generative AI tool on the same user stories.
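For readers who want a concrete picture of the AI-assisted condition: GitLab Duo's interface and the study's exact prompts are not reproduced in this summary, so the sketch below is an illustration only, written against a generic OpenAI-style chat API. The function name `split_user_story`, the model name, and all prompt wording are assumptions, not the experimental protocol.

```python
# Illustrative sketch only: the study used GitLab Duo, whose prompts and
# interface are not reported here. This shows the general shape of
# AI-assisted task splitting with an OpenAI-style chat API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def split_user_story(story: str, model: str = "gpt-4o-mini") -> list[str]:
    """Ask a generative model to break one user story into candidate tasks."""
    response = client.chat.completions.create(
        model=model,  # hypothetical choice, not the tool from the paper
        messages=[
            {
                "role": "system",
                "content": (
                    "You are an agile planning assistant. Split the user "
                    "story into small, actionable development tasks, one "
                    "per line, with no extra commentary."
                ),
            },
            {"role": "user", "content": story},
        ],
    )
    lines = response.choices[0].message.content.splitlines()
    # Strip bullet markers the model may add, drop empty lines.
    return [line.strip("-* ").strip() for line in lines if line.strip()]


story = (
    "As a customer, I want to reset my password via email "
    "so that I can regain access to my account."
)
for task in split_user_story(story):
    print(task)  # per the paper's finding, a human still filters this list
```

The hybrid workflow participants preferred amounts to treating this output as a draft: a developer reviews the generated list and deletes irrelevant tasks before planning proceeds.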
If this is right
- AI assistance leads to more detailed and complete task breakdowns in agile planning.
- Human oversight remains necessary to remove irrelevant tasks generated by AI.
- A hybrid AI-plus-human method is favored for maintaining planning accuracy.
- Integrating AI tools can improve the efficiency of breaking down user stories without fully replacing developers.
Where Pith is reading between the lines
- Teams might experiment with different AI tools to see if the hybrid benefits persist across platforms.
- AI could potentially be trained on specific project contexts to reduce irrelevant outputs over time.
- Similar assistance patterns might appear in other agile activities like estimating effort or prioritizing tasks.
Load-bearing premise
That results observed in this controlled lab setting with one particular AI tool will generalize to live, ongoing projects, different tools, and varied team sizes.
What would settle it
A follow-up study in live agile projects using multiple AI tools where fully automated task splitting matches or exceeds hybrid accuracy without extra human review.
Original abstract
In agile software development, breaking down user stories into actionable tasks is a critical yet time-consuming process. This paper investigates the potential of Generative AI tools to assist in task splitting, aiming to enhance planning efficiency. We conducted a controlled experiment comparing traditional task-splitting methods with AI-assisted approaches using GitLab Duo. Our findings indicate that while current AI tools are not yet mature enough to replace developers, they can aid in generating more granular task lists and ensuring no important tasks are overlooked. Participants favored a hybrid approach, combining AI tools with conventional methods to maintain high accuracy in planning. This study highlights the potential benefits and limitations of integrating Generative AI into agile development processes, suggesting that AI tools can serve as valuable aids in task splitting, provided there is human oversight to filter out irrelevant tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reports a controlled experiment comparing traditional manual task-splitting of user stories with AI-assisted splitting using GitLab Duo in an agile development context. It claims that current generative AI tools are not mature enough to replace developers but can generate more granular task lists and reduce the risk of overlooking important tasks, with participants preferring a hybrid human-AI approach for maintaining planning accuracy.
Significance. If the empirical results hold after addressing methodological gaps, the work provides timely evidence on the practical utility and limitations of generative AI coding assistants in supporting agile planning activities. It contributes to the HCI and software engineering literature by documenting participant perceptions of granularity, completeness, and the value of oversight, which can inform tool design and team workflows for integrating AI without compromising quality.
major comments (3)
- [Methods] The experimental design description omits the participant sample size, recruitment criteria, experience levels in agile practices, the number and selection criteria for the user stories, and the precise protocol for the AI-assisted condition (including the prompting strategy). These omissions make it impossible to assess selection bias, statistical power, or whether observed differences in granularity and completeness are attributable to the tool rather than the lab setting or story choice.
- [Results] Claims that AI produces 'more granular task lists' and ensures 'no important tasks are overlooked' are supported only by participant preferences and qualitative feedback; no quantitative metrics (e.g., task count per story, completeness scores), inter-rater reliability, or statistical tests comparing conditions are reported. This weakens the load-bearing assertion that AI provides measurable assistance beyond the hybrid preference.
- [Discussion] The generalizability claim is undercut by reliance on a single tool (GitLab Duo) and a controlled lab environment; the paper does not discuss how results might change with different models, larger teams, ongoing projects, or varying domain expertise, nor does it provide evidence that the hybrid advantage would persist outside the experimental constraints.
minor comments (2)
- [Abstract] The summary of findings could include a brief mention of the evaluation criteria used for granularity and completeness to give readers an immediate sense of the measurement approach.
- [Appendix] The manuscript would benefit from an appendix containing the exact user stories, AI prompts, and task lists produced in each condition to support reproducibility and allow readers to judge the granularity differences directly.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point-by-point below, with clear indications of how we will revise the paper.
Point-by-point responses
Referee: [Methods] The experimental design description omits the participant sample size, recruitment criteria, experience levels in agile practices, the number and selection criteria for the user stories, and the precise protocol for the AI-assisted condition (including the prompting strategy). These omissions make it impossible to assess selection bias, statistical power, or whether observed differences in granularity and completeness are attributable to the tool rather than the lab setting or story choice.
Authors: We agree that the Methods section requires greater detail for reproducibility and to allow evaluation of validity. We will revise it to report the participant sample size, recruitment criteria and process, participants' levels of experience with agile practices, the number and selection criteria for the user stories, and the full protocol for the AI-assisted condition including the prompting strategy employed with GitLab Duo. These additions will directly address concerns about selection bias and attribution of effects. revision: yes
Referee: [Results] Claims that AI produces 'more granular task lists' and ensures 'no important tasks are overlooked' are supported only by participant preferences and qualitative feedback; no quantitative metrics (e.g., task count per story, completeness scores), inter-rater reliability, or statistical tests comparing conditions are reported. This weakens the load-bearing assertion that AI provides measurable assistance beyond the hybrid preference.
Authors: The study was intentionally qualitative, centered on developers' perceptions and workflow preferences rather than quantitative benchmarking. We will revise the Results section to include descriptive quantitative details where available (such as task counts per story across conditions) and to explicitly state that claims about granularity and completeness rest on thematic analysis of participant feedback. We will also add a limitations paragraph noting the absence of statistical tests and inter-rater reliability as a consequence of the chosen study design. This preserves the integrity of the exploratory findings while improving transparency. revision: partial
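To make the measurements discussed in this exchange concrete, here is a minimal sketch of the two quantities named above, task counts per story and inter-rater reliability, the latter via scikit-learn's Cohen's kappa. Every number below is invented for illustration; none of them come from the study.

```python
# Hypothetical illustration of the metrics discussed above. All data are
# invented for the example; none of these numbers come from the paper.
from sklearn.metrics import cohen_kappa_score

# Tasks produced per user story in each condition (invented).
manual_counts = [4, 5, 3, 6, 4]
ai_counts = [7, 8, 6, 9, 7]


def mean(xs: list[int]) -> float:
    return sum(xs) / len(xs)


print(f"mean tasks/story, manual:      {mean(manual_counts):.1f}")
print(f"mean tasks/story, AI-assisted: {mean(ai_counts):.1f}")

# Two raters independently label each AI-generated task as relevant (1)
# or irrelevant (0); kappa measures their agreement beyond chance.
rater_a = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
rater_b = [1, 1, 0, 1, 0, 0, 1, 1, 1, 1]
print(f"Cohen's kappa on task relevance: {cohen_kappa_score(rater_a, rater_b):.2f}")
```

Reporting even this level of descriptive detail alongside the thematic analysis would let readers judge the granularity claim directly.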
Referee: [Discussion] The generalizability claim is undercut by reliance on a single tool (GitLab Duo) and a controlled lab environment; the paper does not discuss how results might change with different models, larger teams, ongoing projects, or varying domain expertise, nor does it provide evidence that the hybrid advantage would persist outside the experimental constraints.
Authors: We accept that the current Discussion understates limitations to generalizability. We will expand this section to discuss how findings might differ with alternative AI models, in larger or distributed teams, within ongoing real-world projects, and across varying domain expertise. We will also frame the hybrid preference as a hypothesis requiring further validation outside lab constraints and outline targeted future work to test persistence of the observed advantages. revision: yes
Circularity Check
No circularity: empirical user study with no derivations or self-referential predictions
Full rationale
The paper reports a controlled experiment comparing traditional task-splitting by participants against AI-assisted splitting using GitLab Duo. Outcomes rest on observed participant behavior, granularity counts, completeness, and preference ratings. No equations, fitted parameters, first-principles derivations, or predictions that reduce to inputs by construction appear in the abstract or described methods. Self-citations, if present, are not load-bearing for any central claim. The study is grounded in direct observation rather than self-referential benchmarks, yielding a circularity score of 0.
Reference graph
Works this paper leans on
- [1] Arora, D., Sonwane, A., Wadhwa, N., Mehrotra, A., Utpala, S., Bairi, R., Kanade, A., Natarajan, N.: MASAI: Modular architecture for software-engineering AI agents. arXiv preprint arXiv:2406.11638 (2024), https://api.semanticscholar.org/CorpusID:270558999
- [2] Cohn, M.: Agile Estimating and Planning. Addison-Wesley Professional (2005)
- [3] Correa, C.G., Ho, M.K., Callaway, F., Daw, N.D., Griffiths, T.L.: Humans decompose tasks by trading off utility and computational cost. PLOS Computational Biology 19(6), 1–31 (2023). https://doi.org/10.1371/journal.pcbi.1011087
- [4] Digital.AI: 15th State of Agile Report. https://digital.ai/resource-center/analyst-reports/state-of-agile-report, accessed 2025-02-27
- [5] Future Architecture: The art of splitting long tasks in construction. https://futurearchi.blog/en/construction-tasks, accessed 2025-01-27
- [6] Itemis: SPIDR – five simple techniques for a perfectly split user story. https://blogs.itemis.com/en/spidr-five-simple-techniques-for-a-perfectly-split-user-story, accessed 2025-01-27
- [7] Khanfor, A.: Tasks decomposition approaches in crowdsourcing software development (2023). https://doi.org/10.48550/arXiv.2302.05099
- [8] Ko, A.J., LaToza, T.D., Burnett, M.M.: A practical guide to controlled experiments of software engineering tools with human participants. Empirical Software Engineering 20(1), 110–141 (2015). https://doi.org/10.1007/s10664-013-9279-3
- [9] Kumar, B., Tiwari, U., Dobhal, D.: Machine learning based approach for user story clustering in agile engineering. SN Computer Science 4 (2023). https://doi.org/10.1007/s42979-023-02212-2
- [10] Kumar, B., Tiwari, U., Dobhal, D.C.: User story splitting in agile software development using machine learning approach. In: 2022 Seventh International Conference on Parallel, Distributed and Grid Computing (PDGC), pp. 167–171 (2022). https://doi.org/10.1109/PDGC56933.2022.10053226
- [11] Bhat, M., Haight, C., B.B.: How Platform Engineering Teams Can Augment DevOps With AI. Gartner Research (2024)
- [12] Pavlič, L., Saklamaeva, V., Beranič, T.: Can large-language models replace humans in agile effort estimation? Lessons from a controlled experiment. Applied Sciences 14(24) (2024). https://doi.org/10.3390/app142412006
- [13] PMI (ed.): A Guide to the Project Management Body of Knowledge (PMBOK Guide), 5th edn. Project Management Institute, Newtown Square, PA (2013)
- [14] Rahman, T., Zhu, Y., Maha, L., Roy, C., Roy, B., Schneider, K.: Take loads off your developers: Automated user story generation using large language model. In: 2024 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 791–801. IEEE (2024)
- [15] Schwaber, K., Sutherland, J.: The Definitive Guide to Scrum: The Rules of the Game. Scrum.org (2017), https://books.google.si/books?id=8ONgzgEACAAJ
- [16] Wohlin, C., Runeson, P., Höst, M., Ohlsson, M.C., Regnell, B., Wesslén, A., et al.: Experimentation in Software Engineering, vol. 236. Springer (2012)
- [17] Zhang, Z., Rayhan, M., Herda, T., Goisauf, M., Abrahamsson, P.: LLM-based agents for automating the enhancement of user story quality: An early report. In: Šmite, D., Guerra, E., Wang, X., Marchesi, M., Gregory, P. (eds.) Agile Processes in Software Engineering and Extreme Programming, pp. 117–126. Springer Nature Switzerland, Cham (2024)