pith. machine review for the scientific record.

arxiv: 2604.11287 · v1 · submitted 2026-04-13 · 💻 cs.AI · q-bio.OT

Recognition: 2 theorem links

· Lean Theorem

Consistency of AI-Generated Exercise Prescriptions: A Repeated Generation Study Using a Large Language Model


Pith reviewed 2026-05-10 15:57 UTC · model grok-4.3

classification 💻 cs.AI q-bio.OT
keywords LLM consistency, exercise prescription, semantic similarity, FITT principle, intra-model variability, AI safety checks, clinical scenarios, Gemini model

The pith

Exercise prescriptions generated repeatedly by the same AI model are semantically alike but differ in key details like workout intensity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This study tests whether a large language model gives consistent exercise advice when the same question is asked multiple times. It created 120 prescriptions across six patient scenarios and compared them for meaning, structure according to the FITT principle, and safety notes. Outputs matched closely in overall meaning but varied in how hard exercises should be and how many safety details were added. A reader would care because unreliable numbers in fitness plans could affect safety or results if AI advice reaches real users. The findings indicate that prompt design affects stability and that extra rules plus expert review are needed before clinical use.

Core claim

When the same large language model generated exercise prescriptions twenty times for each of six fixed clinical scenarios, semantic similarity measured by sentence embeddings stayed high, with per-scenario mean cosine similarity between 0.879 and 0.939 and stronger agreement in the more constrained cases. Quantitative elements such as exercise intensity varied clearly across repeats, and although safety content appeared in every output, the number of safety sentences differed significantly by scenario type.

What carries the argument

Repeated generation of outputs under identical prompts, evaluated through sentence embedding similarity for semantics and an AI judge for FITT structure adherence.
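The semantic half of this machinery can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's code: it assumes each of the 20 outputs for a scenario has already been turned into an embedding vector (the paper reports using SBERT, all-MiniLM-L6-v2), and it uses made-up three-dimensional vectors as stand-ins. The names `cosine`, `pairwise_similarities`, and `toy_embeddings` are hypothetical. It shows why 20 repeated outputs yield 190 pairwise similarity values per scenario.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_similarities(embeddings):
    """Cosine similarity for every unordered pair of embeddings."""
    return [cosine(u, v) for u, v in combinations(embeddings, 2)]


# Hypothetical stand-ins for the 20 sentence embeddings of one scenario.
# Real SBERT vectors would have hundreds of dimensions.
toy_embeddings = [[1.0, 0.1 * i, 0.05 * (i % 3)] for i in range(20)]

sims = pairwise_similarities(toy_embeddings)
n_pairs = len(sims)            # 20 choose 2 = 190 pairs
mean_sim = sum(sims) / n_pairs  # per-scenario mean similarity
```

A per-scenario mean such as `mean_sim` is the kind of quantity summarized by the 0.879 to 0.939 range; the toy value here has no relation to the paper's numbers.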

Load-bearing premise

That embedding similarity scores and AI judgments of structure accurately reflect whether differences between prescriptions would matter for actual patients without human expert review.

What would settle it

If human exercise physiologists review pairs of outputs from the same scenario and rate them as having clinically meaningful differences in intensity or safety for the described patient, the high-consistency interpretation would not hold.

Figures

Figures reproduced from arXiv: 2604.11287 by Kihyuk Lee.

Figure 1
Distribution of SBERT-based cosine similarity across scenarios (all-MiniLM-L6-v2). Each box represents 190 pairwise similarity values from 20 repeated outputs. Dark gray = clinical cases (S1–S4); light gray = healthy adult cases (S5–S6). Dashed line indicates 0.90. Kruskal-Wallis H = 328.37, p < 0.001. T2DM = type 2 diabetes mellitus; OA = osteoarthritis; CA = cancer; HTN = hypertension.
Figure 2
Pairwise p-values from Dunn's post hoc test with Bonferroni correction. Black cells indicate significant differences (p < 0.05), gray cells intermediate values, and white cells non-significant comparisons (p ≥ 0.05). T2DM = type 2 diabetes mellitus; OA = osteoarthritis; CA = cancer; HTN = hypertension.
Figure 3
Distribution of FITT intensity classifications across scenarios as assessed by Claude Sonnet 4.6 (AI-as-a-Judge). Left panel shows aerobic intensity; right panel shows resistance intensity. Classifications are based on initial prescription phase (Weeks 1–4). The lightest shade indicates unclassifiable outputs.
Original abstract

Background: Large language models (LLMs) have been explored as tools for generating personalized exercise prescriptions, yet the consistency of outputs under identical conditions remains insufficiently examined. Objective: This study evaluated the intra-model consistency of LLM-generated exercise prescriptions using a repeated generation design. Methods: Six clinical scenarios were used to generate exercise prescriptions using Gemini 2.5 Flash (20 outputs per scenario; total n = 120). Consistency was assessed across three dimensions: (1) semantic consistency using SBERT-based cosine similarity, (2) structural consistency based on the FITT principle using an AI-as-a-judge approach, and (3) safety expression consistency, including inclusion rates and sentence-level quantification. Results: Semantic similarity was high across scenarios (mean cosine similarity: 0.879-0.939), with greater consistency in clinically constrained cases. Frequency showed consistent patterns, whereas variability was observed in quantitative components, particularly exercise intensity. Unclassifiable intensity expressions were observed in 10-25% of resistance training outputs. Safety-related expressions were included in 100% of outputs; however, safety sentence counts varied significantly across scenarios (H = 86.18, p < 0.001), with clinical cases generating more safety expressions than healthy adult cases. Conclusions: LLM-generated exercise prescriptions demonstrated high semantic consistency but showed variability in key quantitative components. Reliability depends substantially on prompt structure, and additional structural constraints and expert validation are needed before clinical deployment.
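For readers unfamiliar with the H statistic quoted in the abstract, the following is a minimal pure-Python sketch of the Kruskal-Wallis test statistic with the standard tie correction. It is an illustration of the textbook formula, not the paper's analysis code; the function names and the `toy_counts` data are hypothetical, and the toy safety-sentence counts are invented for the example.

```python
from collections import Counter


def average_ranks(values):
    """Rank values from 1..N, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Tie-corrected Kruskal-Wallis H for a list of samples."""
    pooled = [x for g in groups for x in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total = 0.0
    start = 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
    ties = sum(t ** 3 - t for t in Counter(pooled).values())
    correction = 1 - ties / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all observations are identical; H is undefined")
    return h / correction


# Invented safety-sentence counts for two hypothetical scenario groups.
toy_counts = [[9, 11, 10, 12], [5, 6, 4, 7]]
h_toy = kruskal_wallis_h(toy_counts)
```

In practice a library routine such as `scipy.stats.kruskal` would normally be used, since it also returns the p-value; the hand-rolled version is only meant to make the rank-sum structure visible.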

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper reports a repeated-generation experiment (n=120) in which Gemini 2.5 Flash produced exercise prescriptions for six clinical scenarios (20 outputs each). Consistency was quantified via SBERT cosine similarity (per-scenario means 0.879–0.939), an AI-as-a-judge assessment of FITT-principle structure, and counts of safety-related sentences. Results indicate high semantic similarity, consistent frequency but variable intensity (unclassifiable in 10–25% of resistance training outputs), and 100% inclusion of safety expressions whose sentence counts differed significantly across scenarios (Kruskal–Wallis H=86.18, p<0.001). The authors conclude that prompt structure affects reliability and that expert validation plus structural constraints are required before clinical use.

Significance. If the automated metrics prove to be valid proxies for clinical meaning, the work supplies useful empirical data on intra-model variability in a safety-critical domain and reinforces the need for human oversight when LLMs are applied to exercise prescription. The repeated-generation design and reporting of variability are appropriate for the research question.

major comments (3)
  1. [Methods] Methods section (AI-as-a-judge paragraph): No validation of the LLM judge against human clinicians is described, nor is inter-rater reliability (e.g., Cohen’s kappa) or prompt details for the judge provided. Because the central claims about structural consistency and safety-expression variability rest entirely on this unvalidated classifier, the reported differences in intensity classification and safety-sentence counts cannot yet be interpreted as clinically meaningful.
  2. [Results] Results section (semantic-consistency paragraph): SBERT cosine similarities are presented as evidence of “high semantic consistency,” yet no analysis correlates these scores with expert judgments of clinical equivalence. In safety-critical prescriptions, high embedding similarity can mask consequential differences (e.g., contraindicated intensity for a given comorbidity); without such anchoring, the claim that semantic consistency is high does not directly support the paper’s conclusions about reliability.
  3. [Results] Results section (safety-expression analysis): The Kruskal–Wallis test on safety-sentence counts is reported, but neither post-hoc pairwise comparisons nor effect-size measures are given. This limits the ability to determine which specific scenarios drive the significant difference and weakens the claim that clinical cases generate more safety expressions than healthy adult cases.
minor comments (2)
  1. [Abstract] Abstract: The cosine-similarity range (0.879–0.939) is given without indicating whether these are per-scenario means or an overall range; adding a table or explicit per-scenario values would improve clarity.
  2. [Methods] The exact prompts used for both generation and the AI judge are referenced but not reproduced; placing them in supplementary material would enhance reproducibility.
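Major comment 1 asks for inter-rater reliability between the AI judge and human clinicians, naming Cohen's kappa as an example. A minimal sketch of that statistic follows, assuming two raters assign one categorical intensity label to each output. The function name and both label lists are hypothetical and invented for illustration; no such human ratings are reported in the paper.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Invented labels: an AI judge versus a hypothetical clinician on 8 outputs.
judge = ["moderate", "moderate", "vigorous", "light",
         "moderate", "unclassifiable", "vigorous", "light"]
clinician = ["moderate", "vigorous", "vigorous", "light",
             "moderate", "moderate", "vigorous", "light"]

kappa = cohens_kappa(judge, clinician)
```

With these toy labels the raters agree on 6 of 8 items, and kappa discounts that raw agreement by the agreement expected from the label frequencies alone. A library function such as scikit-learn's `cohen_kappa_score` could be used instead of the hand-written version.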

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important limitations in our interpretation of the automated metrics. We address each major comment below and indicate where revisions will be made to the manuscript.

Point-by-point responses
  1. Referee: [Methods] Methods section (AI-as-a-judge paragraph): No validation of the LLM judge against human clinicians is described, nor is inter-rater reliability (e.g., Cohen’s kappa) or prompt details for the judge provided. Because the central claims about structural consistency and safety-expression variability rest entirely on this unvalidated classifier, the reported differences in intensity classification and safety-sentence counts cannot yet be interpreted as clinically meaningful.

    Authors: We agree that the absence of human validation for the AI-as-a-judge limits the clinical interpretability of the structural and safety findings. The AI judge was used as a scalable, reproducible proxy for FITT-principle adherence and safety-sentence detection across the 120 outputs. We will revise the Methods section to include the complete judge prompt and add an explicit limitations paragraph noting the lack of clinician validation or inter-rater reliability metrics. We will also qualify the Results statements on intensity classification and safety counts to indicate they are preliminary and require expert confirmation before clinical claims can be made.
    revision: partial

  2. Referee: [Results] Results section (semantic-consistency paragraph): SBERT cosine similarities are presented as evidence of “high semantic consistency,” yet no analysis correlates these scores with expert judgments of clinical equivalence. In safety-critical prescriptions, high embedding similarity can mask consequential differences (e.g., contraindicated intensity for a given comorbidity); without such anchoring, the claim that semantic consistency is high does not directly support the paper’s conclusions about reliability.

    Authors: We acknowledge that SBERT cosine similarity measures lexical and semantic overlap but does not guarantee clinical equivalence. The metric was chosen because it provides an objective, quantitative indicator of output stability under repeated prompting. We will add a dedicated paragraph in the Discussion section that explicitly states this limitation, gives examples of how high similarity could still permit clinically important discrepancies, and clarifies that the semantic-consistency results support only the narrower claim of intra-model output stability rather than direct clinical reliability. No post-hoc correlation analysis with clinicians is possible without new annotations.
    revision: partial

  3. Referee: [Results] Results section (safety-expression analysis): The Kruskal–Wallis test on safety-sentence counts is reported, but neither post-hoc pairwise comparisons nor effect-size measures are given. This limits the ability to determine which specific scenarios drive the significant difference and weakens the claim that clinical cases generate more safety expressions than healthy adult cases.

    Authors: The referee is correct that post-hoc tests and effect sizes were omitted. We will revise the Results section to report Dunn’s post-hoc pairwise comparisons (with Bonferroni adjustment) and an effect-size measure (eta-squared) for the Kruskal–Wallis test. These additions will identify which scenario pairs differ significantly and quantify the magnitude of the observed differences in safety-sentence counts, thereby strengthening the interpretation that certain clinical cases elicit more safety expressions.
    revision: yes
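The two additions promised in response 3 can be sketched as follows. This is an illustration under stated assumptions, not the authors' analysis: it uses one commonly cited rank-based effect size, eta-squared computed from H as (H - k + 1) / (n - k), and the plain Bonferroni rule of multiplying each p-value by the number of comparisons. The rebuttal does not say which eta-squared variant will be used, the function names are hypothetical, and the value obtained by plugging in the reported H = 86.18 with k = 6 scenarios and n = 120 outputs is illustrative arithmetic only, not a result reported by the paper.

```python
from math import comb


def eta_squared_h(h, k, n):
    """Eta-squared from a Kruskal-Wallis H: (H - k + 1) / (n - k).

    h: H statistic, k: number of groups, n: total observations.
    This is one common variant; other effect sizes exist.
    """
    return (h - k + 1) / (n - k)


def bonferroni(p_values):
    """Bonferroni-adjusted p-values, capped at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


n_scenarios = 6    # six clinical scenarios (from the abstract)
n_outputs = 120    # 20 outputs per scenario
h_safety = 86.18   # reported H for safety-sentence counts

# Number of pairwise scenario comparisons a post-hoc test would cover.
n_comparisons = comb(n_scenarios, 2)

# Illustrative effect size under the assumed formula above.
effect = eta_squared_h(h_safety, n_scenarios, n_outputs)

# Invented raw p-values, only to show how the adjustment behaves.
adjusted = bonferroni([0.001, 0.02, 0.5])
```

Six scenarios give 15 pairwise comparisons, so under Bonferroni a raw p-value must fall below 0.05 / 15 to remain significant at the 0.05 level.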

Circularity Check

0 steps flagged

No circularity: purely empirical measurement study with no derivations or self-referential steps.

full rationale

The paper performs a repeated-generation experiment on an external LLM (Gemini 2.5 Flash) across fixed clinical scenarios, then applies off-the-shelf external tools (SBERT embeddings for cosine similarity, an AI-as-a-judge prompt for FITT classification, and standard non-parametric statistical tests) to quantify observed variability. No equations, fitted parameters, predictions derived from prior outputs, or self-citations are used to justify any central claim. All reported quantities (cosine ranges, inclusion rates, Kruskal-Wallis H values) are direct empirical observations rather than quantities that reduce to the study inputs by construction. The study is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

This is an empirical evaluation study that relies on established NLP embedding models and standard statistical tests without introducing new free parameters, ad-hoc axioms, or invented entities.

axioms (2)
  • domain assumption: SBERT embeddings provide a valid measure of semantic similarity for exercise prescription text
    Used as the basis for cosine similarity calculations in the methods
  • domain assumption: The AI-as-a-judge approach can reliably classify adherence to the FITT principle
    Applied to assess structural consistency without reported human validation of the judge

pith-pipeline@v0.9.0 · 5563 in / 1344 out tokens · 56376 ms · 2026-05-10T15:57:01.788692+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Cross-Model Consistency of AI-Generated Exercise Prescriptions: A Repeated Generation Study Across Three Large Language Models

    cs.CL 2026-04 conditional novelty 5.0

    Three LLMs exhibit distinct consistency profiles in repeated exercise prescription generation, with GPT-4.1 producing unique but semantically stable outputs while Gemini 2.5 Flash achieves high similarity through text...

Reference graph

Works this paper leans on

29 extracted references · 5 canonical work pages · cited by 1 Pith paper · 2 internal anchors

  1. [1]

    Introduction The advancement of large language models (LLMs) has opened new possibilities in the field of exercise prescription. LLMs are capable of generating structured exercise recommendations that account for individual health status, disease characteristics, and contraindications, and their potential as decision-support tools in settings where acc...

  2. [2]

    Materials and Methods 2.1. Study Design and Overview This study was conducted to evaluate the intra-model consistency of exercise prescriptions generated by a large language model (LLM) under controlled conditions. A repeated generation design was applied, in which identical clinical scenarios and prompts were repeatedly submitted to produce multiple ou...

  3. [3]

    Results 3.1 Semantic Consistency Analysis SBERT (all-MiniLM-L6-v2)-based cosine similarity was computed for 190 pairwise comparisons per scenario, with mean similarity values ranging from 0.879 to 0.939 across all scenarios, indicating generally high consistency (Table 3, Figure 1). The highest consistency was observed in S3 (Mean = 0.939, SD = 0.021) an...

  4. [4]

    Discussion This study systematically evaluated the repeated generation consistency of LLM-based exercise prescriptions produced by Gemini 2.5 Flash across three dimensions: semantic consistency, FITT structural classification, and safety expression consistency. The results confirmed that overall semantic consistency was high, while variability was observ...

  5. [5]

    on LLM performance decline in treatment planning tasks, and suggest that both prompt structure and inherent model behavior contribute to output variability, underscoring the need for structured prompt design alongside continued evaluation of model-level characteristics. Future studies comparing multiple LLMs under identical prompt conditions would help cl...

  6. [6]

    High semantic consistency was observed across all scenarios, with greater consistency in cases with more clearly defined clinical constraints

    Conclusion This study evaluated the repeated generation consistency of LLM-generated exercise prescriptions across three dimensions: semantic similarity, FITT structural classification, and safety expression consistency. High semantic consistency was observed across all scenarios, with greater consistency in cases with more clearly defined clinical constr...

  7. [7]

    The potential of AI to create personalized exercise plans

    Enichen EJ, Young CC, Frates EP. The potential of AI to create personalized exercise plans. Health Promot Pract. 2025; online ahead of print. doi:10.1177/15248399251394695

  8. [8]

    Artificial intelligence in sport: exploring the potential of using ChatGPT in resistance training prescription

    Washif J, Pagaduan J, James C, Dergaa I, Beaven C. Artificial intelligence in sport: exploring the potential of using ChatGPT in resistance training prescription. Biol Sport. 2024;41:209-220

  9. [9]

    ChatGPT generated training plans for runners are not rated optimal by coaching experts, but increase in quality with additional input information

    Düking P, Sperlich B, Voigt L, Van Hooren B, Zanini M, Zinner C. ChatGPT generated training plans for runners are not rated optimal by coaching experts, but increase in quality with additional input information. J Sports Sci Med. 2024;23:56

  10. [10]

    ChatGPT-4o-generated exercise plans for patients with type 2 diabetes mellitus — assessment of their safety and other quality criteria by coaching experts

    Akrimi S, Schwensfeier L, Düking P, Kreutz T, Brinkmann C. ChatGPT-4o-generated exercise plans for patients with type 2 diabetes mellitus — assessment of their safety and other quality criteria by coaching experts. Sports. 2025;13:92

  11. [11]

    Using large language models to enhance exercise recommendations and physical activity in clinical and healthy populations: scoping review

    Lai X, Chen J, Lai Y, Huang S, Cai Y, Sun Z, Wang X, Pan K, Gao Q, Huang C. Using large language models to enhance exercise recommendations and physical activity in clinical and healthy populations: scoping review. JMIR Med Inform. 2025;13:e59309

  12. [12]

    Structured clinical reasoning for exercise prescription in patients with comorbidity

    van der Leeden M, Stuiver MM, Huijsmans R, Geleijn E, de Rooij M, Dekker J. Structured clinical reasoning for exercise prescription in patients with comorbidity. Disabil Rehabil. 2020;42:1474-1479

  13. [13]

    AI-generated exercise prescriptions for at-risk populations: safety and feasibility of a large language model assessed by expert evaluation

    Choi M, Park J, Lee M, Beom J, Jung SY, Lee K. AI-generated exercise prescriptions for at-risk populations: safety and feasibility of a large language model assessed by expert evaluation. J Clin Med. 2026;15(6):2457

  14. [14]

    A statistical framework for evaluating the repeatability and reproducibility of large language models

    Shyr C, Ren B, Hsu CY, Yan C, Tinker RJ, Cassini TA, et al. A statistical framework for evaluating the repeatability and reproducibility of large language models. medRxiv. 2025. https://doi.org/10.1101/2025.08.06.25333170

  15. [15]

    Judging LLM-as-a-judge with MT-bench and chatbot arena

    Zheng L, Chiang WL, Sheng Y, Zhuang S, Wu Z, Zhuang Y, et al. Judging LLM-as-a-judge with MT-bench and chatbot arena. Adv Neural Inf Process Syst. 2023;36

  16. [16]

    Evaluating clinical AI summaries with large language models as judges

    Croxford E, Gao Y, First E, Pellegrino N, Schnier M, Caskey J, et al. Evaluating clinical AI summaries with large language models as judges. NPJ Digit Med. 2025;8:640

  17. [17]

    Self-preference bias in LLM-as-a-judge

    Wataoka K, Takahashi T, Ri R. Self-preference bias in LLM-as-a-judge. arXiv preprint arXiv:2410.21819. 2024. Available from: https://arxiv.org/abs/2410.21819

  18. [18]

    Sentence-BERT: sentence embeddings using Siamese BERT-networks

    Reimers N, Gurevych I. Sentence-BERT: sentence embeddings using Siamese BERT-networks. Proceedings of EMNLP-IJCNLP 2019. 2019:3982-3992

  19. [19]

    LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

    Li H, Dong Q, Chen J, Su H, Zhou Y, Ai Q, et al. LLMs-as-judges: a comprehensive survey on LLM-based evaluation methods. arXiv preprint arXiv:2412.05579. 2024. Available from: https://arxiv.org/abs/2412.05579

  20. [20]

    ACSM position stand: quantity and quality of exercise for developing and maintaining cardiorespiratory, musculoskeletal, and neuromotor fitness in apparently healthy adults

    Garber CE, Blissmer B, Deschenes MR, Franklin BA, Lamonte MJ, Lee IM, et al. ACSM position stand: quantity and quality of exercise for developing and maintaining cardiorespiratory, musculoskeletal, and neuromotor fitness in apparently healthy adults. Med Sci Sports Exerc. 2011;43(7):1334-1359

  21. [21]

    Physical activity and exercise intensity terminology: a joint ACSM expert statement and ESSA consensus statement

    Bishop DJ, Beck B, Biddle SJH, Denay KL, Ferri A, Gibala MJ, et al. Physical activity and exercise intensity terminology: a joint ACSM expert statement and ESSA consensus statement. Med Sci Sports Exerc. 2025;57(11):2599-2613

  22. [22]

    Loading recommendations for muscle strength, hypertrophy, and local endurance: a re-examination of the repetition continuum

    Schoenfeld BJ, Grgic J, Van Every DW, Plotkin DL. Loading recommendations for muscle strength, hypertrophy, and local endurance: a re-examination of the repetition continuum. Sports. 2021;9(2):32

  23. [23]

    Currier BS, D'Souza AC, Fiatarone Singh MA, Lowisz CV, Rawson ES, Schoenfeld BJ, et al. American College of Sports Medicine position stand: resistance training prescription for muscle function, hypertrophy, and physical performance in healthy adults: an overview of reviews. Med Sci Sports Exerc. 2026;58(4):851-872

  24. [24]

    ACSM's guidelines for exercise testing and prescription

    American College of Sports Medicine. ACSM's guidelines for exercise testing and prescription. 12th ed. Philadelphia: Wolters Kluwer; 2024

  25. [25]

    Are LLMs reliable? An exploration of the reliability of large language models in clinical note generation

    Carandang KAM, Arana JM, Casin ER, Monterola C, Tan DS, Valenzuela JFB, et al. Are LLMs reliable? An exploration of the reliability of large language models in clinical note generation. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track). 2025:1413-1422

  26. [26]

    Assessing the accuracy and reliability of large language models in psychiatry using standardized multiple-choice questions: cross-sectional study

    Hanss K, Sarma KV, Glowinski AL, Krystal A, Saunders R, Halls A, et al. Assessing the accuracy and reliability of large language models in psychiatry using standardized multiple-choice questions: cross-sectional study. J Med Internet Res. 2025;27:e69910

  27. [27]

    Quantifying the reasoning abilities of LLMs on clinical cases

    Qiu P, Wu C, Liu S, Fan Y, Zhao W, Chen Z, et al. Quantifying the reasoning abilities of LLMs on clinical cases. Nat Commun. 2025;16(1):9799

  28. [28]

    Large language models in medicine

    Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW. Large language models in medicine. Nat Med. 2023;29(8):1930-1940

  29. [29]

    A Survey on LLM-as-a-Judge

    Gu J, Jiang X, Shi Z, Tan H, Zhai X, Xu C, et al. A survey on LLM-as-a-judge. arXiv preprint arXiv:2411.15594. 2024. Available from: https://arxiv.org/abs/2411.15594