pith · machine review for the scientific record

arxiv: 2604.27006 · v1 · submitted 2026-04-29 · 💻 cs.SE · cs.AI

Recognition: unknown

Beyond Accuracy: LLM Variability in Evidence Screening for Software Engineering SLRs

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 13:06 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords LLM variability · systematic literature review · evidence screening · software engineering · non-determinism · abstract metadata · classical classifiers · performance comparison

The pith

LLMs vary widely and remain non-deterministic when screening papers for software engineering systematic reviews.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests twelve large language models from four providers and four classical classifiers on the screening phase of two real systematic literature reviews that together cover 518 papers. It shows that different LLMs produce markedly different inclusion decisions and retain unpredictable behavior even when temperature is fixed at zero. Abstract text proves essential for acceptable performance, but adding title or keyword fields produces no reliable further gains. Direct head-to-head comparisons reveal no consistent advantage for LLMs over traditional models, implying that adoption decisions should rest on practical constraints rather than blanket claims of superiority.

Core claim

LLMs exhibited substantial heterogeneity and residual non-determinism even at temperature zero. Abstract availability was decisive: removing it consistently degraded performance, while adding title and/or keywords to the abstract yielded no robust gains. Compared to classical models, performance differences were not consistent enough to support generalizable LLM superiority.

What carries the argument

The shared experimental protocol that applies 12 LLMs and 4 classical classifiers to two software engineering SLRs while varying input metadata (abstract alone versus abstract plus title or keywords).
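
A minimal sketch of that shared protocol, assuming a generic chat-completion client. The prompt wording is paraphrased from the excerpt shown in Figure 3; the mapping of variants B–D to specific field combinations, the model identifiers, and the screen_paper helper are illustrative assumptions, not the authors' code.

from openai import OpenAI  # any provider SDK with a compatible chat API would do

client = OpenAI()

# Input feature compositions; A = abstract only, E = abstract withheld.
# Which of B/C/D maps to which combination is assumed here.
VARIANTS = {
    "A": ["abstract"],
    "B": ["abstract", "title"],
    "C": ["abstract", "keywords"],
    "D": ["abstract", "title", "keywords"],
    "E": ["title", "keywords"],
}

PROMPT = (
    "Assume you are a software engineering researcher conducting a "
    "systematic literature review (SLR). Considering the following "
    "metadata of a primary study:\n{metadata}\n"
    "Using a Likert scale from 1 to 7 (1 = Strongly Disagree, 7 = Strongly "
    "Agree), rate whether the study should be included in the review."
)

def screen_paper(paper: dict, variant: str, model: str) -> str:
    """One screening call; repeating it n times per paper exposes run-to-run variability."""
    metadata = "\n".join(
        f"{field}: {paper[field]}" for field in VARIANTS[variant] if paper.get(field)
    )
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # nominally deterministic, yet decisions still vary across runs
        messages=[{"role": "user", "content": PROMPT.format(metadata=metadata)}],
    )
    return response.choices[0].message.content.strip()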

If this is right

  • Abstract text must be collected as the primary input; titles and keywords cannot be relied upon to compensate for its absence.
  • Any LLM-based screening workflow requires explicit reporting of run-to-run variability and pilot validation on the target corpus (a minimal reporting sketch follows this list).
  • Model choice should be driven by reproducibility, cost, and metadata constraints rather than expected accuracy gains.
  • Classical classifiers remain competitive and should be included as baselines in future LLM screening studies.
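
Picking up the variability-reporting point above, a minimal sketch of what run-to-run reporting could look like. The two statistics shown (fraction of unstable papers, mean majority agreement) are illustrative choices, not necessarily those used in the paper.

from collections import Counter

def run_variability(decisions_per_paper: dict[str, list[str]]) -> dict[str, float]:
    """decisions_per_paper maps each paper id to its include/exclude decisions
    across repeated runs of the same model at temperature zero."""
    n_papers = len(decisions_per_paper)
    unstable = sum(1 for runs in decisions_per_paper.values() if len(set(runs)) > 1)
    mean_majority = sum(
        Counter(runs).most_common(1)[0][1] / len(runs)
        for runs in decisions_per_paper.values()
    ) / n_papers
    return {
        "fraction_unstable_papers": unstable / n_papers,  # any flip across runs counts
        "mean_majority_agreement": mean_majority,          # 1.0 means perfectly repeatable
    }

# Example: five runs each for two papers.
print(run_variability({
    "p1": ["include", "include", "include", "include", "include"],
    "p2": ["include", "exclude", "include", "include", "exclude"],
}))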

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Hybrid screening pipelines that run several LLMs in parallel and reconcile their outputs could reduce the impact of single-model variability (a reconciliation sketch follows this list).
  • The same protocol could be applied to systematic reviews outside software engineering to test whether abstract dominance and LLM heterogeneity are domain-general.
  • Persistent non-determinism at temperature zero points to deeper model architecture or training factors that prompting alone may not control.
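
A minimal sketch of the hybrid reconciliation idea from the first bullet, assuming a screen_paper helper like the one sketched earlier that returns a normalized "include"/"exclude" decision. The unanimity rule and the escalation to a human screener are hypothetical design choices, not a result of the paper.

from collections import Counter

def reconcile(paper: dict, models: list[str], screen_paper) -> str:
    """Query several LLMs and keep only unanimous decisions; anything contested
    is escalated to a human screener."""
    # screen_paper is assumed to return "include" or "exclude" (e.g. after
    # thresholding a Likert-scale answer).
    votes = [screen_paper(paper, variant="A", model=m) for m in models]
    counts = Counter(votes)
    if counts["exclude"] == len(models):
        return "exclude"        # unanimous exclusion: safe to drop automatically
    if counts["include"] == len(models):
        return "include"        # unanimous inclusion
    return "human_review"       # disagreement is where single-model variability bites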

Load-bearing premise

Findings from these two specific SLRs with 518 papers and the chosen prompting protocol will generalize to other reviews, domains, or larger collections.

What would settle it

A third SLR experiment in which all LLMs show near-zero variability at temperature zero and clearly outperform classical models on the same metrics would falsify the reported heterogeneity and lack of generalizable superiority.

Figures

Figures reproduced from arXiv: 2604.27006 by Danilo Monteiro Ribeiro, Erika Yahata, Gilberto Sussumu Hida.

Figure 1: Methodological structure of the study by Felizardo et al. [2024].
Figure 2: Experimental pipeline: data preparation; comparative analysis of LLM performance; identification of the …
Figure 3: Phase 1 pipeline for model comparison. Prompt excerpt: "Assume you are a software engineering researcher conducting a systematic literature review (SLR). Considering the title, abstract, and keywords of a primary study: Using a Likert scale from 1 to 7 (1 – Strongly Disagree …"
Figure 4: Prompt template for LLM-based classification.
Figure 5: Variability of responses in each iteration (n = 5) per model for both SLR1 and SLR2.
Figure 6: Agreement between the responses for each model and SLR across iterations.
Figure 7: Phase 2 results for SLR1.
Figure 8: Phase 2 results for SLR2. Accompanying text: the presence of an abstract was essential for robust LLM performance; abstracts alone (Variant A) or combined with the title and/or keywords (Variants B, C, D) produced equivalent results, while the composition based exclusively on title and keywords (Variant E) exhibited systematically inferior performance, suggesting that LLMs depend on the contextual informa…
Figure 9: Meta-analysis of aggregated effect sizes for feature composition variants across both SLRs. Points represent …
Figure 10: Comparison of accuracy and F1-score for LLMs and classical machine learning methods on SLR1 with 95% confidence intervals.
Figure 11: Comparison of accuracy and F1-score for LLMs and classical machine learning methods on SLR2 with 95% confidence intervals.
Figure 12: Flowchart describing the proposed use of LLMs in screening; each publication should go through the process …
Original abstract

Context: Study screening in systematic literature reviews is costly, inconsistency-prone, and risk-asymmetric, since false negatives can compromise validity. Despite rapid uptake of Large Language Models (LLMs), there is limited evidence on how such models behave during the study screening phase, particularly regarding the choice of specific LLMs and their comparison with classical models. Objective: To assess LLM performance and variability in screening, quantify the impact of input metadata (abstract, title, keywords), and compare LLMs with classical classifiers under a shared protocol. Methods: We analyzed 12 LLMs from 4 providers (OpenAI, Google Gemini, Anthropic, Llama) and 4 classical models (Logistic Regression, Support Vector Classification, Random Forest, and Naive Bayes) on 2 real Systematic Literature Reviews (SLRs), totaling 518 papers. The experimental design investigated 3 critical dimensions: (i) LLMs performance variability, (ii) the impact of input feature composition (abstract, title, and keywords) on LLM performance, and (iii) the real gain of using LLMs instead of more traditional classification models. Results: LLMs exhibited substantial heterogeneity and residual non-determinism even at temperature zero. Abstract availability was decisive: removing it consistently degraded performance, while adding title and/or keywords to the abstract yielded no robust gains. Compared to classical models, performance differences were not consistent enough to support generalizable LLM superiority. Discussion: LLM adoption should be justified by operational and governance constraints (reproducibility, cost, metadata availability), supported by pilot validation and explicit reporting of variability and input configuration.
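
The four classical baselines named in the abstract can be reproduced with standard scikit-learn components. A minimal sketch, assuming TF-IDF features over abstracts and default hyper-parameters; the featurization and settings are assumptions, not the authors' reported setup.

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

def classical_baselines(abstracts: list[str], labels: list[int]) -> dict[str, float]:
    """Mean cross-validated F1 for each classical model over TF-IDF abstract features."""
    models = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "svc": SVC(),
        "random_forest": RandomForestClassifier(),
        "naive_bayes": MultinomialNB(),
    }
    return {
        name: cross_val_score(make_pipeline(TfidfVectorizer(), clf),
                              abstracts, labels, scoring="f1", cv=5).mean()
        for name, clf in models.items()
    }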

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper reports a controlled experiment screening 518 papers from two real software engineering SLRs using 12 LLMs (OpenAI, Google, Anthropic, Llama) and four classical classifiers (LR, SVC, RF, NB). It finds substantial LLM heterogeneity and residual non-determinism at temperature zero, decisive performance degradation when abstracts are removed, no robust gains from adding title or keywords, and inconsistent LLM-classical performance gaps that do not support generalizable superiority. The discussion advises justifying LLM adoption via operational constraints, pilot validation, and explicit variability reporting.

Significance. If the reported patterns of variability, abstract primacy, and inconsistent classical-model gaps hold, the work supplies concrete empirical guidance for SE researchers choosing screening tools, emphasizing reproducibility risks and the limited upside of richer metadata. The use of real labeled SLR data and a shared protocol across model families strengthens its practical relevance.

major comments (3)
  1. [Methods and Discussion] The central claims about LLM heterogeneity, abstract dominance, and lack of consistent superiority over classical models rest on experiments with only two SLRs. The manuscript provides no selection criteria, topic diversity metrics, or abstract-quality distribution for these SLRs, leaving open the possibility that the observed patterns are sample-specific rather than general properties of SE evidence screening.
  2. [Results] The statement that 'adding title and/or keywords to the abstract yielded no robust gains' is load-bearing for the input-feature conclusions, yet the Results section does not report the exact statistical tests, effect-size thresholds, or per-SLR variance used to establish 'robustness'; with only two SLRs the power to detect small gains is limited.
  3. [Discussion] The Discussion recommendation that 'LLM adoption should be justified by operational and governance constraints... supported by pilot validation' extrapolates from the two-SLR sample to general practice; without a sensitivity analysis or explicit scope limitations, this guidance risks over-generalization.
minor comments (2)
  1. [Abstract and Methods] The abstract and Methods would benefit from a concise table listing the exact 12 LLMs, their providers, and the four classical models with their hyper-parameter settings.
  2. [Methods] Clarify whether the ground-truth labels from the original SLRs were treated as fixed or whether any re-screening or inter-rater reliability checks were performed.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which help us improve the clarity and appropriate scoping of our work. We agree that the small number of SLRs is a substantive limitation and will revise the manuscript to provide more transparent reporting on sample selection, statistical methods, and the bounded nature of our recommendations. Below we respond point by point.

Point-by-point responses
  1. Referee: [Methods and Discussion] The central claims about LLM heterogeneity, abstract dominance, and lack of consistent superiority over classical models rest on experiments with only two SLRs. The manuscript provides no selection criteria, topic diversity metrics, or abstract-quality distribution for these SLRs, leaving open the possibility that the observed patterns are sample-specific rather than general properties of SE evidence screening.

    Authors: We agree that the restriction to two SLRs constrains generalizability and that the original manuscript did not sufficiently document selection criteria. The two SLRs were selected because they are recent, independently conducted software engineering reviews that had already produced publicly available gold-standard labels, allowing us to avoid new annotation costs while using real screening decisions. In the revision we will add an explicit subsection in Methods describing this rationale, the topics covered by each SLR, and the source papers. We will also insert a dedicated Limitations paragraph in the Discussion that states the absence of topic-diversity or abstract-quality metrics and cautions that the reported patterns should be treated as specific to the studied SLRs rather than universal properties of SE evidence screening. This directly addresses the referee’s concern without overstating the scope of the findings. revision: yes

  2. Referee: [Results] The statement that 'adding title and/or keywords to the abstract yielded no robust gains' is load-bearing for the input-feature conclusions, yet the Results section does not report the exact statistical tests, effect-size thresholds, or per-SLR variance used to establish 'robustness'; with only two SLRs the power to detect small gains is limited.

    Authors: We accept that the Results section requires greater statistical transparency. Although we performed paired comparisons (Wilcoxon signed-rank tests) and examined effect sizes between input configurations, these details were not fully reported. In the revised manuscript we will expand the Results section to include the precise tests used, the effect-size thresholds applied, per-SLR performance variances, and confidence intervals around the observed differences. We will also add an explicit statement acknowledging the low statistical power associated with only two SLRs and will qualify the phrase “no robust gains” to reflect that small improvements cannot be ruled out. These additions will make the input-feature conclusions more defensible (a minimal sketch of such a paired comparison follows these responses). revision: yes

  3. Referee: [Discussion] The Discussion recommendation that 'LLM adoption should be justified by operational and governance constraints... supported by pilot validation' extrapolates from the two-SLR sample to general practice; without a sensitivity analysis or explicit scope limitations, this guidance risks over-generalization.

    Authors: We acknowledge the risk of over-generalization. A full sensitivity analysis across many additional SLRs is not feasible within the current study. In the revision we will therefore add explicit scope limitations to the Discussion, stating that the practical guidance is derived from the empirical patterns observed in the two examined SE SLRs and should be validated via pilot studies in any new context. We will also include a forward-looking call for replication on further SLRs. The recommendation will be rephrased to emphasize conditional, context-specific justification rather than a universal prescription. revision: partial
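
For concreteness, a minimal sketch of the paired comparison described in response 2 above: a Wilcoxon signed-rank test between two input-feature variants plus a standardized effect size. The per-model score vectors and the d_z effect size are illustrative, not the paper's reported analysis.

import numpy as np
from scipy.stats import wilcoxon

def compare_variants(scores_a: list[float], scores_b: list[float]) -> dict[str, float]:
    """scores_a / scores_b: one F1 (or accuracy) value per model under two input variants."""
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    stat, p_value = wilcoxon(a, b)            # paired, non-parametric comparison
    diff = b - a
    d_z = diff.mean() / diff.std(ddof=1)      # standardized mean difference of paired scores
    return {
        "wilcoxon_statistic": float(stat),
        "p_value": float(p_value),
        "effect_size_dz": float(d_z),
    }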

Circularity Check

0 steps flagged

No significant circularity in empirical evaluation

Full rationale

The paper reports direct experimental measurements of LLM and classical model performance, variability at temperature zero, and feature impacts (abstract/title/keywords) on two external real-world SLR datasets (518 papers total). No mathematical derivations, first-principles predictions, fitted parameters renamed as outputs, or self-referential definitions appear in the chain. Claims rest on observed metrics from off-the-shelf models applied to independent ground-truth screenings rather than reducing to inputs by construction. Self-citations, if present, are not load-bearing for any core result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the accuracy of the original SLR ground-truth labels and the assumption that the two chosen reviews are representative of typical software-engineering screening tasks.

axioms (1)
  • domain assumption Ground-truth inclusion/exclusion decisions from the original SLRs are accurate and unbiased.
    All performance metrics are computed against these labels as the reference standard (a minimal scoring sketch follows).
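
A minimal sketch of scoring against these reference labels, with a bootstrap 95% interval in the spirit of the confidence intervals shown in Figures 10–11; the bootstrap procedure is an assumption about how such intervals could be produced, not a claim about the authors' method.

import numpy as np
from sklearn.metrics import f1_score

def f1_with_ci(gold: list[int], predicted: list[int], n_boot: int = 2000, seed: int = 0):
    """Point F1 against the SLR reference labels plus a bootstrap 95% interval.
    gold / predicted use 1 = include, 0 = exclude, aligned per paper."""
    gold, predicted = np.asarray(gold), np.asarray(predicted)
    rng = np.random.default_rng(seed)
    point = f1_score(gold, predicted)
    boots = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(gold), size=len(gold))  # resample papers with replacement
        boots.append(f1_score(gold[idx], predicted[idx]))
    low, high = np.percentile(boots, [2.5, 97.5])
    return point, (float(low), float(high))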

pith-pipeline@v0.9.0 · 5591 in / 1229 out tokens · 80982 ms · 2026-05-07T13:06:06.316861+00:00 · methodology


Reference graph

Works this paper leans on

18 extracted references · 15 canonical work pages · 1 internal anchor

  1. [1] Katia Romero Felizardo, Márcia Sampaio Lima, Anderson Deizepe, Tayana Uchôa Conte, and Igor Steinmacher. ChatGPT application in Systematic Literature Reviews in Software Engineering: An evaluation of its accuracy to support the selection activity. In Proceedings of the 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM). arXiv:2502.08050.
  2. [2] Eugene Syriani, Istvan David, and Gauransh Kumar. Screening articles for systematic reviews with ChatGPT. Journal of Computer Languages, 80:101287.
  3. [3] Lukas Thode, Umar Iftikhar, and Daniel Mendez. Exploring the use of LLMs for the selection phase in systematic literature studies. Information and Software Technology, 184:107757, 2025. doi:10.1016/j.infsof.2025.107757.
  4. [4] Matteo Esposito, Andrea Janes, Davide Taibi, and Valentina Lenarduzzi. Generative AI in Evidence-Based Software Engineering: A White Paper.
  5. [5] Xufei Luo, Fengxian Chen, Di Zhu, Ling Wang, Zijun Wang, Hui Liu, Meng Lyu, Ye Wang, Qi Wang, and Yaolong Chen. Potential roles of large language models in the production of systematic reviews and meta-analyses. Journal of Medical Internet Research, 26:e56780.
  6. [6] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language Models are Few-Shot Learners. arXiv:2005.14165.
  7. [7] Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions, 2025. arXiv:2311.05232.
  8. [8] Ziwei Xu, Sanjay Jain, and Mohan Kankanhalli. Hallucination is Inevitable: An Innate Limitation of Large Language Models. arXiv:2401.11817.
  9. [9] Vipula Rawte, Amit Sheth, and Amitava Das. A Survey of Hallucination in Large Foundation Models. arXiv:2309.05922.
  10. [10] Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 610–623. doi:10.1145/3442188.3445922.
      Also extracted here: Lingjiao Chen, Matei Zaharia, and James Zou. How is ChatGPT's behavior changing over time? Harvard Data Science Review, 6(2):1–47.
  11. [11] Barbara A. Kitchenham, Tore Dybå, and Magne Jørgensen. Evidence-based software engineering. In Proceedings of the 26th International Conference on Software Engineering, pages 273–281, Los Alamitos, CA, USA.
  12. [12] Google DeepMind. Gemini API documentation. https://ai.google.dev/docs/models. Accessed on: 2025-11-03.
  13. [13] Anthropic. Claude API documentation. https://docs.anthropic.com/en/resources/overview, 2025. Accessed on: 2025-11-03.
      Also extracted here: Together AI. Together AI platform – Llama models. https://www.together.ai/. Accessed on: 2025-11-03.
  14. [14] Berk Atil, Sarp Aykent, Alexa Chittams, Lisheng Fu, Rebecca J. Passonneau, Evan Radcliffe, Guru Rajan Rajagopal, Adam Sloan, Tomasz Tudrej, Ferhan Ture, Zhe Wu, Lixinyu Xu, and Breck Baldwin. Non-Determinism of "Deterministic" LLM Settings. arXiv:2408.04667.
  15. [15] Matthew Renze. The effect of sampling temperature on problem solving in large language models. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 7346–7356, Miami, Florida, USA.
  16. [16] Kilem L. Gwet. Handbook of Inter-Rater Reliability: The Definitive Guide to Measuring the Extent of Agreement Among Raters. Advanced Analytics, LLC, Gaithersburg, MD, USA.
  17. [17] Farid Anvari and Daniël Lakens. Using anchor-based methods to determine the smallest effect size of interest. Journal of Experimental Social Psychology, 96:104159.
  18. [18] Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python. O'Reilly Media, Sebastopol, CA.