pith. machine review for the scientific record.

arxiv: 2603.06276 · v2 · submitted 2026-03-06 · 💻 cs.SE

Recognition: 1 theorem link

· Lean Theorem

Story Point Estimation Using Large Language Models

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 15:24 UTC · model grok-4.3

classification 💻 cs.SE
keywords story point estimation · large language models · zero-shot prompting · software effort estimation · agile development · few-shot learning · comparative judgments

The pith

Large language models predict story points more accurately than supervised deep learning models even with zero training data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models can estimate the relative effort for software tasks, expressed as story points, using only the task title and description. Traditional supervised models need substantial labeled data from the same project to reach usable accuracy. The authors demonstrate that zero-shot prompting of LLMs already exceeds the performance of deep learning models trained on 80 percent of available data across 16 projects. Few-shot prompting with a handful of examples improves results further. The study also checks whether pairwise comparisons of effort are easier for LLMs to judge than direct point values and whether those comparisons help when supplied as examples.

Core claim

Without any training data, large language models using zero-shot prompting predict story points for backlog items better than deep neural networks trained on 80 percent of the data from the same project. Adding a small number of examples through few-shot prompting raises accuracy still more. Comparative judgments between pairs of items are not easier for the models to predict than direct story-point values, yet they remain useful as few-shot examples for improving story-point predictions.

What carries the argument

Zero-shot and few-shot prompting of large language models applied directly to item titles and descriptions to output story point estimates.
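The mechanics are simple enough to sketch. The following is a minimal illustration of the zero-shot and few-shot setup, not the paper's actual templates: the prompt wording, example format, and number-extraction rule are all assumptions.

```python
import re

def build_prompt(title, description, examples=()):
    """Build a story-point estimation prompt from an item's title and
    description. `examples` holds (title, description, points) tuples
    for the few-shot variant; an empty tuple means zero-shot."""
    lines = [
        "Estimate the story points for the following backlog item.",
        "Answer with a single number.",
    ]
    for ex_title, ex_desc, ex_points in examples:
        lines.append(
            f"\nTitle: {ex_title}\nDescription: {ex_desc}\nStory points: {ex_points}"
        )
    lines.append(f"\nTitle: {title}\nDescription: {description}\nStory points:")
    return "\n".join(lines)

def parse_story_points(llm_output):
    """Extract the first number from free-form model output; None if absent."""
    match = re.search(r"\d+(?:\.\d+)?", llm_output)
    return float(match.group()) if match else None
```

Any real run would send the prompt to one of the four LLMs and feed the raw completion through `parse_story_points`; the point is that the only inputs are the item's title and description.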

If this is right

  • Software teams can apply LLMs to estimate effort on new projects without first collecting large amounts of labeled historical data.
  • A few human-annotated examples or pairwise comparisons can be added to prompting to raise prediction accuracy.
  • Comparative judgments between items serve as effective few-shot examples even if they are not easier to predict than direct values.
  • LLMs reduce dependence on project-specific training datasets for agile effort estimation tasks.
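To make the third bullet concrete, pairwise judgments can be serialized into prompt text and prepended as few-shot context. A hypothetical encoding, since the paper's actual format is not reproduced here:

```python
def comparison_examples(pairs):
    """Render pairwise effort judgments as few-shot prompt text.
    `pairs` holds (title_a, title_b, harder) tuples, where `harder`
    is "A" or "B" per the human annotator's comparative judgment."""
    blocks = []
    for title_a, title_b, harder in pairs:
        blocks.append(
            f"Item A: {title_a}\n"
            f"Item B: {title_b}\n"
            f"Which requires more effort? {harder}"
        )
    return "\n\n".join(blocks)
```

The appeal of this encoding is that comparative judgments are cheaper to collect than point annotations, yet can still condition the model before it is asked for direct story-point values.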

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Teams without historical data could adopt LLM-based estimation as a starting point and refine it with minimal examples.
  • The same prompting strategy might extend to related subjective judgments such as priority ranking or risk assessment.
  • Combining zero-shot LLM outputs with actual time logs from completed tasks could create hybrid estimators for future projects.

Load-bearing premise

The 16 projects and four language models tested are representative enough that the zero-shot advantage will hold for other projects and models.

What would settle it

A fresh software project where a deep learning model trained on 80 percent of its own data produces more accurate story point predictions than zero-shot LLM prompting would falsify the central claim.

Figures

Figures reproduced from arXiv: 2603.06276 by Adarsh Balakrishnan, Mengqiao Xu, Pranam Prakash Shetty, Xiaoyin Xi, Zhe Yu.

Figure 1. Examples of items with ground truth story points and the generated ground truth comparative judgments.
Original abstract

This study investigates the use of large language models (LLMs) for story point estimation. Story points are unitless, project-specific effort estimates that help developers on the scrum team forecast which product backlog items they plan to complete in a sprint. To facilitate this process, machine learning models, especially deep neural networks, have been applied to predict the story points based on the title and description of each item. However, such machine learning models require sufficient amounts of training data (with ground truth story points annotated by human developers) from the same software project to achieve decent prediction performance. This motivated us to explore whether LLMs are capable of (RQ1) predicting story points without training data or (RQ2) with only a few training data points. Our empirical results with four LLMs on 16 software projects show that, without any training data (zero-shot prompting), LLMs can predict story points better than supervised deep learning models trained on 80% of the data. The prediction performance of LLMs can be further improved with a few training examples (few-shot prompting). In addition, a recent study explored the use of comparative judgments (between a given pair of items which one requires more effort to implement) instead of directly annotating the story points to reduce the cognitive burden on developers. Therefore, this study also explores (RQ3) whether comparative judgments are easier to predict than story points for LLMs and (RQ4) whether comparative judgments can serve as few-shot examples for LLMs to improve their predictions of story points. Empirical results suggest that it is not easier for LLMs to predict comparative judgments than to directly estimate the story points, but comparative judgments can serve as few-shot examples to improve the LLMs' prediction performance as well as the human-annotated story points.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper evaluates large language models for story point estimation in agile software projects. It claims that zero-shot prompting allows four LLMs to outperform supervised deep learning models trained on 80% of the data across 16 public projects (RQ1), that few-shot prompting further improves performance (RQ2), that comparative judgments are not easier for LLMs to predict than direct story points (RQ3), and that comparative judgments can serve as effective few-shot examples (RQ4).

Significance. If the zero-shot superiority result holds after contamination checks and full methodological disclosure, the work would be significant for software engineering practice: it would demonstrate that LLMs can deliver usable effort estimates without project-specific labeled data, reducing the data-collection barrier that currently limits supervised approaches and potentially enabling broader adoption of automated estimation in small or new projects.

major comments (3)
  1. [Abstract, §4] Abstract and §4 (empirical results): the central claim that zero-shot LLMs outperform DL models trained on 80% of the data is load-bearing yet rests on an unverified assumption that the 16 public projects contain no pretraining overlap with the LLMs. No membership-inference test, temporal cutoff analysis, or decontamination step is described; without it the performance edge may reflect memorization rather than generalization.
  2. [§3] §3 (methodology): the prompting strategies, exact model versions, temperature settings, and output parsing rules are not specified in sufficient detail to allow reproduction or to rule out prompt-engineering artifacts. The evaluation metrics (MAE, accuracy, or rank correlation?) and any statistical significance tests comparing zero-shot vs. supervised baselines are also omitted.
  3. [§4, Table 2] §4 and Table 2 (project selection): the 16 projects are drawn from public issue trackers, but no cross-project validation scheme, project-size stratification, or control for domain variability is reported. This weakens the generalizability assertion for both the zero-shot and few-shot results.
minor comments (2)
  1. [§2] Notation for story-point scales and comparative-judgment encoding should be defined once in §2 and used consistently; currently the mapping from LLM output tokens to numeric story points is described only informally.
  2. [Figure 3] Figure 3 (few-shot curves) lacks error bars or confidence intervals, making it difficult to judge whether the reported gains over zero-shot are statistically reliable.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point-by-point below, indicating the revisions we will make to improve methodological transparency and address potential threats to validity.

Point-by-point responses
  1. Referee: [Abstract, §4] Abstract and §4 (empirical results): the central claim that zero-shot LLMs outperform DL models trained on 80% of the data is load-bearing yet rests on an unverified assumption that the 16 public projects contain no pretraining overlap with the LLMs. No membership-inference test, temporal cutoff analysis, or decontamination step is described; without it the performance edge may reflect memorization rather than generalization.

    Authors: We agree that potential contamination from public issue trackers is a valid concern for LLM-based claims. The original manuscript did not include explicit decontamination or membership-inference tests. In the revision we will add a dedicated subsection in §4 that (1) lists the known training cutoffs for each of the four LLMs, (2) performs a temporal analysis using issue creation dates to identify post-cutoff projects, and (3) reports zero-shot results restricted to those post-cutoff issues. We will also note the absence of full membership-inference testing as a limitation and discuss why story-point labels are unlikely to have been directly memorized even if issue text was seen. revision: yes
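The proposed temporal analysis amounts to a date filter. A sketch, with placeholder model names and cutoff dates; the real cutoffs depend on the four LLMs actually used:

```python
from datetime import date

# Hypothetical training cutoffs; actual dates must come from model cards.
CUTOFFS = {"model-a": date(2023, 9, 1), "model-b": date(2024, 4, 1)}

def post_cutoff_issues(issues, model):
    """Keep only issues created after the model's known training cutoff,
    so zero-shot results on them cannot stem from memorized issue text.
    `issues` holds (issue_id, created) pairs, `created` a datetime.date."""
    cutoff = CUTOFFS[model]
    return [(issue_id, created) for issue_id, created in issues if created > cutoff]
```

Reporting zero-shot MAE on only the post-cutoff subset, alongside the full-set numbers, is the cheapest way to bound the memorization risk the referee raises.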

  2. Referee: [§3] §3 (methodology): the prompting strategies, exact model versions, temperature settings, and output parsing rules are not specified in sufficient detail to allow reproduction or to rule out prompt-engineering artifacts. The evaluation metrics (MAE, accuracy, or rank correlation?) and any statistical significance tests comparing zero-shot vs. supervised baselines are also omitted.

    Authors: We acknowledge that the current §3 lacks the level of detail needed for reproducibility. In the revised manuscript we will expand §3 with: exact model identifiers and versions (e.g., gpt-4-0613, Llama-2-70b-chat), temperature=0 for all runs, the complete zero-shot and few-shot prompt templates, and the deterministic parsing rules used to extract numeric story-point values from free-form LLM output. We will also state that Mean Absolute Error (MAE) is the primary metric, supplemented by thresholded accuracy, and will add Wilcoxon signed-rank tests with p-values for all zero-shot versus supervised comparisons. revision: yes
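The two metrics named in this response are straightforward to pin down. A sketch of MAE and a thresholded accuracy; the tolerance of one story point is an illustrative choice, not the paper's definition:

```python
def mae(preds, truth):
    """Mean absolute error over paired predictions and ground truth."""
    return sum(abs(p - t) for p, t in zip(preds, truth)) / len(truth)

def thresholded_accuracy(preds, truth, tol=1.0):
    """Fraction of items whose prediction falls within `tol` story
    points of the ground truth."""
    return sum(abs(p - t) <= tol for p, t in zip(preds, truth)) / len(truth)
```

The Wilcoxon signed-rank test the authors promise would then be run on the paired per-item absolute errors of the zero-shot and supervised conditions.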

  3. Referee: [§4, Table 2] §4 and Table 2 (project selection): the 16 projects are drawn from public issue trackers, but no cross-project validation scheme, project-size stratification, or control for domain variability is reported. This weakens the generalizability assertion for both the zero-shot and few-shot results.

    Authors: The 16 projects were selected to span different domains and sizes (as summarized in Table 2), but we did not explicitly describe stratification or cross-project protocols. In the revision we will add a paragraph detailing the selection criteria, report project sizes (number of issues) and primary domains, and include a supplementary analysis that stratifies MAE results by project size quartiles. We will also clarify that the supervised baselines use within-project 80/20 splits and will discuss cross-project generalization as an explicit limitation and direction for future work. revision: partial
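The promised size-quartile stratification can be sketched directly; the quartile assignment below is one plausible scheme, not necessarily the authors':

```python
def stratify_by_size(project_stats):
    """Group per-project MAE by project-size quartile and return the
    mean MAE per quartile. `project_stats` holds (n_issues, mae) pairs,
    one per project."""
    ordered = sorted(project_stats)          # ascending by issue count
    k = len(ordered)
    quartiles = {}
    for i, (n_issues, proj_mae) in enumerate(ordered):
        q = min(4 * i // k + 1, 4)           # quartile index 1..4
        quartiles.setdefault(q, []).append(proj_mae)
    return {q: sum(v) / len(v) for q, v in quartiles.items()}
```

With only 16 projects each quartile holds four data points, so any per-quartile claim should be read as descriptive rather than inferential.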

Circularity Check

0 steps flagged

No circularity: empirical LLM comparisons rest on external benchmarks and direct measurements

full rationale

The paper conducts an empirical evaluation of zero-shot and few-shot LLM prompting for story point estimation, directly comparing performance metrics against supervised deep learning models trained on 80% of the same 16 project datasets. All claims derive from observable prediction accuracy on held-out items rather than any self-referential equations, fitted parameters renamed as predictions, or load-bearing self-citations that reduce the result to its own inputs. The methodology is self-contained and externally replicable without invoking uniqueness theorems or ansatzes from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that LLMs possess general software engineering knowledge sufficient for effort estimation without project-specific fine-tuning. No free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption LLMs can interpret task titles and descriptions to estimate relative effort without domain-specific training data from the target project.
    Invoked implicitly in the zero-shot and few-shot prompting setup described in the abstract.

pith-pipeline@v0.9.0 · 5632 in / 1247 out tokens · 26500 ms · 2026-05-15T15:24:18.343603+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · 7 internal anchors

  1. [1]

    The scrum guide: The definitive guide to scrum: The rules of the game,

    K. Schwaber and J. Sutherland, “The scrum guide: The definitive guide to scrum: The rules of the game,” https://scrumguides.org/scrum-guide.html, 2020, accessed: 2026-02-10

  2. [2]

    Agile estimating and planning,

    M. Cohn, Agile estimating and planning. Pearson Education, 2005

  3. [3]

    A deep learning model for estimating story points,

    M. Choetkiertikul, H. K. Dam, T. Tran, T. Pham, A. Ghose, and T. Menzies, “A deep learning model for estimating story points,” IEEE Transactions on Software Engineering, vol. 45, no. 7, pp. 637–656, 2018

  4. [4]

    Gpt2sp: A transformer-based agile story point estimation approach,

    M. Fu and C. Tantithamthavorn, “Gpt2sp: A transformer-based agile story point estimation approach,” IEEE Transactions on Software Engineering, vol. 49, no. 2, pp. 611–625, 2022

  5. [5]

    A systematic review of software effort estimation using machine learning,

    M. Shepperd, S. Counsell, R. C. Sharp, and B. Bowes, “A systematic review of software effort estimation using machine learning,” Information and Software Technology, vol. 54, no. 1, pp. 41–54, 2012

  6. [6]

    Language models are few-shot learners,

    T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan et al., “Language models are few-shot learners,” Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901, 2020

  7. [7]

    Chain-of-thought prompting elicits reasoning in large language models,

    J. Wei, X. Wang, D. Schuurmans, M. Bosma et al., “Chain-of-thought prompting elicits reasoning in large language models,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 35, 2022

  8. [9]

    Towards an understanding of large language models in software engineering tasks,

    Z. Zheng, K. Ning, Q. Zhong, J. Chen et al., “Towards an understanding of large language models in software engineering tasks,” arXiv preprint, 2023, arXiv:2308.11396. [Online]. Available: https://arxiv.org/abs/2308.11396

  9. [10]

    Efficient story point estimation with comparative learning,

    M. M. Khan, X. Xi, A. Meneely, and Z. Yu, “Efficient story point estimation with comparative learning,” 2025. [Online]. Available: https://arxiv.org/abs/2507.14642

  10. [11]

    A law of comparative judgment

    L. L. Thurstone, “A law of comparative judgment,” Psychological Review, vol. 34, pp. 273–286, 1927

  11. [12]

    Rank analysis of incomplete block designs: I. the method of paired comparisons,

    R. A. Bradley and M. E. Terry, “Rank analysis of incomplete block designs: I. The method of paired comparisons,” Biometrika, vol. 39, no. 3/4, pp. 324–345, 1952

  12. [13]

    B. W. Boehm, C. Abts, A. W. Brown, S. Chulani, B. K. Clark, E. Horowitz, R. Madachy, D. J. Reifer, and B. Steece, Software Cost Estimation with COCOMO II. Prentice Hall, 2000

  13. [14]

    A review of studies on expert estimation of software development effort,

    M. Jørgensen, “A review of studies on expert estimation of software development effort,” Journal of Systems and Software, vol. 70, no. 1–2, pp. 37–60, 2004

  14. [15]

    Measuring application development productivity,

    A. J. Albrecht, “Measuring application development productivity,” Proceedings of the Joint SHARE/GUIDE/IBM Application Development Symposium, pp. 83–92, 1979

  15. [16]

    Resource estimation for objectory projects,

    G. Karner, “Resource estimation for objectory projects,” in Objective Systems SF AB Working Paper, 1993

  16. [17]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language models,” in Advances in Neural Information Processing Systems, vol. 35, 2022. [Online]. Available: https://arxiv.org/abs/2201.11903

  17. [18]

    Large language models for software engineering: A systematic literature review,

    X. Hou, Y. Zhao, Y. Liu, Z. Yang, K. Wang, L. Li, X. Luo, D. Lo, J. Grundy, and H. Wang, “Large language models for software engineering: A systematic literature review,” ACM Transactions on Software Engineering and Methodology, vol. 33, no. 8, 2024. [Online]. Available: https://arxiv.org/abs/2308.10620

  18. [19]

    A survey on large language models for software engineering,

    Q. Zhang, C. Fang, Y. Xie, Y. Zhang, Y. Yang, W. Sun, S. Yu, and Z. Chen, “A survey on large language models for software engineering,” arXiv preprint arXiv:2312.15223, 2023. [Online]. Available: https://arxiv.org/abs/2312.15223

  19. [20]

    CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation

    S. Lu, D. Guo, S. Ren, J. Huang, A. Svyatkovskiy, A. Blanco, C. Clement, D. Drain, D. Jiang, D. Tang et al., “CodeXGLUE: A machine learning benchmark dataset for code understanding and generation,” arXiv preprint arXiv:2102.04664, 2021

  20. [21]

    Evaluating Large Language Models Trained on Code

    M. Chen, J. Tworek, H. Jun, Q. Yuan, H. Pinto, J. Kaplan, H. Edwards et al., “Evaluating large language models trained on code,” arXiv preprint arXiv:2107.03374, 2021. [Online]. Available: https://arxiv.org/abs/2107.03374

  21. [23]

    RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems

    [Online]. Available: https://arxiv.org/abs/2306.03091

  22. [24]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan, “SWE-bench: Can language models resolve real-world GitHub issues?” arXiv preprint arXiv:2310.06770, 2023. [Online]. Available: https://arxiv.org/abs/2310.06770

  23. [25]

    De-hallucinator: Mitigating llm hallucinations in code generation tasks via iterative grounding,

    A. Eghbali and M. Pradel, “De-hallucinator: Mitigating LLM hallucinations in code generation tasks via iterative grounding,” arXiv preprint arXiv:2401.01701, 2024. [Online]. Available: https://arxiv.org/abs/2401.01701

  24. [26]

    DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

    DeepSeek-AI et al., “DeepSeek-V3.2: Pushing the frontier of open large language models,” arXiv preprint arXiv:2512.02556, 2025. [Online]. Available: https://arxiv.org/abs/2512.02556

  25. [27]

    Gemini 2.5 Flash-Lite,

    Google, “Gemini 2.5 Flash-Lite,” https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/2-5-flash-lite, 2025, accessed: 2026-03-05

  26. [28]

    GPT-5 nano model,

    OpenAI, “GPT-5 nano model,” https://developers.openai.com/api/docs/models/gpt-5-nano, 2025, accessed: 2026-03-05

  27. [29]

    Kimi K2: Open Agentic Intelligence

    Moonshot AI, “Kimi K2: Open agentic intelligence,” arXiv preprint arXiv:2507.20534, 2025. [Online]. Available: https://arxiv.org/abs/2507.20534