Recognition: 2 theorem links
· Lean Theorem
Consistency of AI-Generated Exercise Prescriptions: A Repeated Generation Study Using a Large Language Model
Pith reviewed 2026-05-10 15:57 UTC · model grok-4.3
The pith
Exercise prescriptions generated repeatedly by the same AI model are semantically alike but differ in key details like workout intensity.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When the same large language model generated exercise prescriptions twenty times for each of six fixed clinical scenarios, semantic similarity measured by sentence embeddings stayed high (mean cosine 0.879 to 0.939), with stronger agreement in constrained cases. Quantitative elements such as exercise intensity, however, varied clearly, and safety sentence counts differed significantly by scenario type even though safety content appeared in every output.
What carries the argument
Repeated generation of outputs under identical prompts, evaluated through sentence embedding similarity for semantics and an AI judge for FITT structure adherence.
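The semantic-consistency machinery can be sketched in a few lines, assuming per-output embeddings are already computed (the paper used all-MiniLM-L6-v2, a 384-dimensional model; the random vectors below are placeholders, not real prescription embeddings):

```python
import numpy as np
from math import comb

def mean_pairwise_cosine(embeddings: np.ndarray) -> float:
    """Mean cosine similarity over all unordered pairs of row vectors."""
    # Normalize each embedding to unit length; pairwise dot products of
    # unit vectors are exactly the cosine similarities.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit.T
    # Strict upper triangle: one entry per unordered pair.
    iu = np.triu_indices(len(embeddings), k=1)
    return float(sims[iu].mean())

# 20 outputs per scenario -> C(20, 2) = 190 pairwise comparisons,
# matching the per-scenario pair count reported in the paper.
assert comb(20, 2) == 190

rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(20, 384))  # placeholder embeddings
score = mean_pairwise_cosine(fake_embeddings)
print(round(score, 3))
```

Random vectors yield a mean cosine near zero; the paper's 0.879-0.939 range is what distinguishes highly similar generations from unrelated text.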
Load-bearing premise
That embedding similarity scores and AI judgments of structure accurately reflect whether differences between prescriptions would matter for actual patients without human expert review.
What would settle it
If human exercise physiologists review pairs of outputs from the same scenario and rate them as having clinically meaningful differences in intensity or safety for the described patient, the high-consistency interpretation would not hold.
Original abstract
Background: Large language models (LLMs) have been explored as tools for generating personalized exercise prescriptions, yet the consistency of outputs under identical conditions remains insufficiently examined. Objective: This study evaluated the intra-model consistency of LLM-generated exercise prescriptions using a repeated generation design. Methods: Six clinical scenarios were used to generate exercise prescriptions using Gemini 2.5 Flash (20 outputs per scenario; total n = 120). Consistency was assessed across three dimensions: (1) semantic consistency using SBERT-based cosine similarity, (2) structural consistency based on the FITT principle using an AI-as-a-judge approach, and (3) safety expression consistency, including inclusion rates and sentence-level quantification. Results: Semantic similarity was high across scenarios (mean cosine similarity: 0.879-0.939), with greater consistency in clinically constrained cases. Frequency showed consistent patterns, whereas variability was observed in quantitative components, particularly exercise intensity. Unclassifiable intensity expressions were observed in 10-25% of resistance training outputs. Safety-related expressions were included in 100% of outputs; however, safety sentence counts varied significantly across scenarios (H=86.18, p < 0.001), with clinical cases generating more safety expressions than healthy adult cases. Conclusions: LLM-generated exercise prescriptions demonstrated high semantic consistency but showed variability in key quantitative components. Reliability depends substantially on prompt structure, and additional structural constraints and expert validation are needed before clinical deployment.
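The abstract's scenario comparison rests on a Kruskal-Wallis test over six groups of safety-sentence counts. A minimal sketch with scipy, using made-up counts (the grouping into "clinical" and "healthy" and the Poisson rates are illustrative assumptions; the paper reports H = 86.18 on its own data):

```python
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(1)
# Hypothetical safety-sentence counts per output: three clinical
# scenarios with more safety content, three healthy-adult scenarios
# with less, 20 outputs each (total n = 120, as in the study design).
clinical = [rng.poisson(8, 20) for _ in range(3)]
healthy = [rng.poisson(3, 20) for _ in range(3)]

# Kruskal-Wallis: a rank-based test for whether the six groups share
# the same distribution of counts.
H, p = kruskal(*clinical, *healthy)
print(f"H = {H:.2f}, p = {p:.3g}")
```

With this synthetic separation the test comes out significant, mirroring the direction (though not the numbers) of the paper's result.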
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reports a repeated-generation experiment (n=120) in which Gemini 2.5 Flash produced exercise prescriptions for six clinical scenarios (20 outputs each). Consistency was quantified via SBERT cosine similarity (range 0.879–0.939), an AI-as-a-judge assessment of FITT-principle structure, and counts of safety-related sentences. Results indicate high semantic similarity, consistent frequency but variable intensity (10–25% unclassifiable), and 100% inclusion of safety expressions whose sentence counts differed significantly across scenarios (Kruskal–Wallis H=86.18, p<0.001). The authors conclude that prompt structure affects reliability and that expert validation plus structural constraints are required before clinical use.
Significance. If the automated metrics prove to be valid proxies for clinical meaning, the work supplies useful empirical data on intra-model variability in a safety-critical domain and reinforces the need for human oversight when LLMs are applied to exercise prescription. The repeated-generation design and reporting of variability are appropriate for the research question.
major comments (3)
- [Methods] Methods section (AI-as-a-judge paragraph): No validation of the LLM judge against human clinicians is described, nor is inter-rater reliability (e.g., Cohen’s kappa) or prompt details for the judge provided. Because the central claims about structural consistency and safety-expression variability rest entirely on this unvalidated classifier, the reported differences in intensity classification and safety-sentence counts cannot yet be interpreted as clinically meaningful.
- [Results] Results section (semantic-consistency paragraph): SBERT cosine similarities are presented as evidence of “high semantic consistency,” yet no analysis correlates these scores with expert judgments of clinical equivalence. In safety-critical prescriptions, high embedding similarity can mask consequential differences (e.g., contraindicated intensity for a given comorbidity); without such anchoring, the claim that semantic consistency is high does not directly support the paper’s conclusions about reliability.
- [Results] Results section (safety-expression analysis): The Kruskal–Wallis test on safety-sentence counts is reported, but neither post-hoc pairwise comparisons nor effect-size measures are given. This limits the ability to determine which specific scenarios drive the significant difference and weakens the interpretation that “clinical cases generating more safety expressions” is a robust finding.
minor comments (2)
- [Abstract] Abstract: The cosine-similarity range (0.879–0.939) is given without indicating whether these are per-scenario means or an overall range; adding a table or explicit per-scenario values would improve clarity.
- [Methods] The exact prompts used for both generation and the AI judge are referenced but not reproduced; placing them in supplementary material would enhance reproducibility.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which highlight important limitations in our interpretation of the automated metrics. We address each major comment below and indicate where revisions will be made to the manuscript.
Point-by-point responses
-
Referee: [Methods] Methods section (AI-as-a-judge paragraph): No validation of the LLM judge against human clinicians is described, nor is inter-rater reliability (e.g., Cohen’s kappa) or prompt details for the judge provided. Because the central claims about structural consistency and safety-expression variability rest entirely on this unvalidated classifier, the reported differences in intensity classification and safety-sentence counts cannot yet be interpreted as clinically meaningful.
Authors: We agree that the absence of human validation for the AI-as-a-judge limits the clinical interpretability of the structural and safety findings. The AI judge was used as a scalable, reproducible proxy for FITT-principle adherence and safety-sentence detection across the 120 outputs. We will revise the Methods section to include the complete judge prompt and add an explicit limitations paragraph noting the lack of clinician validation or inter-rater reliability metrics. We will also qualify the Results statements on intensity classification and safety counts to indicate they are preliminary and require expert confirmation before clinical claims can be made. revision: partial
-
Referee: [Results] Results section (semantic-consistency paragraph): SBERT cosine similarities are presented as evidence of “high semantic consistency,” yet no analysis correlates these scores with expert judgments of clinical equivalence. In safety-critical prescriptions, high embedding similarity can mask consequential differences (e.g., contraindicated intensity for a given comorbidity); without such anchoring, the claim that semantic consistency is high does not directly support the paper’s conclusions about reliability.
Authors: We acknowledge that SBERT cosine similarity measures lexical and semantic overlap but does not guarantee clinical equivalence. The metric was chosen because it provides an objective, quantitative indicator of output stability under repeated prompting. We will add a dedicated paragraph in the Discussion section that explicitly states this limitation, gives examples of how high similarity could still permit clinically important discrepancies, and clarifies that the semantic-consistency results support only the narrower claim of intra-model output stability rather than direct clinical reliability. No post-hoc correlation analysis with clinicians is possible without new annotations. revision: partial
-
Referee: [Results] Results section (safety-expression analysis): The Kruskal–Wallis test on safety-sentence counts is reported, but neither post-hoc pairwise comparisons nor effect-size measures are given. This limits the ability to determine which specific scenarios drive the significant difference and weakens the interpretation that “clinical cases generating more safety expressions” is a robust finding.
Authors: The referee is correct that post-hoc tests and effect sizes were omitted. We will revise the Results section to report Dunn’s post-hoc pairwise comparisons (with Bonferroni adjustment) and an effect-size measure (eta-squared) for the Kruskal–Wallis test. These additions will identify which scenario pairs differ significantly and quantify the magnitude of the observed differences in safety-sentence counts, thereby strengthening the interpretation that certain clinical cases elicit more safety expressions. revision: yes
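The effect size the authors promise can be computed directly from the statistics already reported. A minimal sketch, assuming the eta-squared formulation for Kruskal-Wallis attributed to Tomczak and Tomczak (the "large effect" reading is a conventional benchmark, not the paper's claim):

```python
def kruskal_eta_squared(H: float, k: int, n_total: int) -> float:
    """Eta-squared effect size for a Kruskal-Wallis H statistic.

    eta^2_H = (H - k + 1) / (n - k), where k is the number of groups
    and n the total sample size.
    """
    return (H - k + 1) / (n_total - k)

# Using the paper's reported values: H = 86.18, six scenarios, n = 120.
eta2 = kruskal_eta_squared(86.18, k=6, n_total=120)
print(round(eta2, 3))  # -> 0.712
```

An eta-squared around 0.71 would indicate that scenario identity explains most of the rank variance in safety-sentence counts, which is consistent with the rebuttal's expectation that the additions will strengthen rather than weaken the finding.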
Circularity Check
No circularity: purely empirical measurement study with no derivations or self-referential steps.
full rationale
The paper performs a repeated-generation experiment on an external LLM (Gemini 2.5 Flash) across fixed clinical scenarios, then applies off-the-shelf external tools (SBERT embeddings for cosine similarity, an AI-as-a-judge prompt for FITT classification, and standard non-parametric statistical tests) to quantify observed variability. No equations, fitted parameters, predictions derived from prior outputs, or self-citations are used to justify any central claim. All reported quantities (cosine ranges, inclusion rates, Kruskal-Wallis H values) are direct empirical observations rather than quantities that reduce to the study inputs by construction. The study is therefore self-contained against external benchmarks and receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: SBERT embeddings provide a valid measure of semantic similarity for exercise prescription text.
- domain assumption: The AI-as-a-judge approach can reliably classify adherence to the FITT principle.
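The second assumption, and the abstract's 10-25% "unclassifiable" intensity rate, can be pictured with a toy rule-based classifier. This is purely illustrative: the paper used Claude Sonnet 4.6 as an LLM judge, and the categories and patterns below are hypothetical, not the study's rubric:

```python
import re

# Toy keyword classifier for resistance-training intensity phrases,
# showing how "unclassifiable" expressions arise when an output names
# no recognized prescription scheme. NOT the paper's method.
PATTERNS = {
    "percent_1rm": re.compile(r"\d+\s*(?:-\s*\d+\s*)?%\s*(?:of\s*)?1\s*-?\s*RM", re.I),
    "rpe": re.compile(r"\bRPE\s*\d+", re.I),
    "qualitative": re.compile(r"\b(light|moderate|vigorous|heavy)\b", re.I),
}

def classify_intensity(text: str) -> str:
    for label, pattern in PATTERNS.items():
        if pattern.search(text):
            return label
    return "unclassifiable"

outputs = [
    "Perform 3 sets at 65-75% of 1RM.",
    "Work at RPE 7 for each set.",
    "Use a moderate load you can lift 10 times.",
    "Choose a challenging weight.",  # no recognized intensity scheme
]
labels = [classify_intensity(t) for t in outputs]
print(labels)  # -> ['percent_1rm', 'rpe', 'qualitative', 'unclassifiable']
```

One unclassifiable output in four matches the upper end of the paper's reported range; an LLM judge generalizes beyond fixed patterns but, as the referee notes, trades away this kind of auditability.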
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
Semantic consistency was assessed using a pretrained sentence embedding model (all-MiniLM-L6-v2). Pairwise cosine similarity... Structural consistency was evaluated based on four FITT components... using Claude Sonnet 4.6 as an independent LLM evaluator
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
Six clinical scenarios... repeated generation design... total n = 120
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
Cross-Model Consistency of AI-Generated Exercise Prescriptions: A Repeated Generation Study Across Three Large Language Models
Three LLMs exhibit distinct consistency profiles in repeated exercise prescription generation, with GPT-4.1 producing unique but semantically stable outputs while Gemini 2.5 Flash achieves high similarity through text...
Reference graph
Works this paper leans on
-
[1]
Introduction The advancement of large language models (LLMs) has opened new possibilities in the field of exercise prescription. LLMs are capable of generating structured exercise recommendations that account for individual health status, disease characteristics, and contraindications, and their potential as decision-support tools in settings where acc...
-
[2]
estimated
Materials and Methods 2.1. Study Design and Overview This study was conducted to evaluate the intra-model consistency of exercise prescriptions generated by a large language model (LLM) under controlled conditions. A repeated generation design was applied, in which identical clinical scenarios and prompts were repeatedly submitted to produce multiple ou...
-
[3]
Results 3.1 Semantic Consistency Analysis SBERT (all-MiniLM-L6-v2)-based cosine similarity was computed for 190 pairwise comparisons per scenario, with mean similarity values ranging from 0.879 to 0.939 across all scenarios, indicating generally high consistency (Table 3, Figure 1). The highest consistency was observed in S3 (Mean = 0.939, SD = 0.021) an...
-
[4]
Discussion This study systematically evaluated the repeated generation consistency of LLM -based exercise prescriptions produced by Gemini 2.5 Flash across three dimensions: semantic consistency, FITT structural classification, and safety expression consistency. The results confirmed that overall semantic consistency was high, while variability was observ...
-
[5]
on LLM performance decline in treatment planning tasks, and suggest that both prompt structure and inherent model behavior contribute to output variability, underscoring the need for structured prompt design alongside continued evaluation of model-level characteristics. Future studies comparing multiple LLMs under identical prompt conditions would help cl...
-
[6]
High semantic consistency was observed across all scenarios, with greater consistency in cases with more clearly defined clinical constraints
Conclusion This study evaluated the repeated generation consistency of LLM-generated exercise prescriptions across three dimensions: semantic similarity, FITT structural classification, and safety expression consistency. High semantic consistency was observed across all scenarios, with greater consistency in cases with more clearly defined clinical constr...
-
[7]
The potential of AI to create personalized exercise plans
Enichen EJ, Young CC, Frates EP. The potential of AI to create personalized exercise plans. Health Promot Pract. 2025; online ahead of print. doi:10.1177/15248399251394695
-
[8]
Artificial intelligence in sport: exploring the potential of using ChatGPT in resistance training prescription
Washif J, Pagaduan J, James C, Dergaa I, Beaven C. Artificial intelligence in sport: exploring the potential of using ChatGPT in resistance training prescription. Biol Sport. 2024;41:209-220
2024
-
[9]
ChatGPT generated training plans for runners are not rated optimal by coaching experts, but increase in quality with additional input information
Düking P, Sperlich B, Voigt L, Van Hooren B, Zanini M, Zinner C. ChatGPT generated training plans for runners are not rated optimal by coaching experts, but increase in quality with additional input information. J Sports Sci Med. 2024;23:56
2024
-
[10]
ChatGPT-4o-generated exercise plans for patients with type 2 diabetes mellitus — assessment of their safety and other quality criteria by coaching experts
Akrimi S, Schwensfeier L, Düking P, Kreutz T, Brinkmann C. ChatGPT-4o-generated exercise plans for patients with type 2 diabetes mellitus — assessment of their safety and other quality criteria by coaching experts. Sports. 2025;13:92
2025
-
[11]
Using large language models to enhance exercise recommendations and physical activity in clinical and healthy populations: scoping review
Lai X, Chen J, Lai Y, Huang S, Cai Y, Sun Z, Wang X, Pan K, Gao Q, Huang C. Using large language models to enhance exercise recommendations and physical activity in clinical and healthy populations: scoping review. JMIR Med Inform. 2025;13:e59309
2025
-
[12]
Structured clinical reasoning for exercise prescription in patients with comorbidity
van der Leeden M, Stuiver MM, Huijsmans R, Geleijn E, de Rooij M, Dekker J. Structured clinical reasoning for exercise prescription in patients with comorbidity. Disabil Rehabil. 2020;42:1474-1479
2020
-
[13]
AI-generated exercise prescriptions for at-risk populations: safety and feasibility of a large language model assessed by expert evaluation
Choi M, Park J, Lee M, Beom J, Jung SY, Lee K. AI-generated exercise prescriptions for at-risk populations: safety and feasibility of a large language model assessed by expert evaluation. J Clin Med. 2026;15(6):2457
2026
-
[14]
Shyr C, Ren B, Hsu CY, Yan C, Tinker RJ, Cassini TA, et al. A statistical framework for evaluating the repeatability and reproducibility of large language models. medRxiv. 2025. https://doi.org/10.1101/2025.08.06.25333170
-
[15]
Judging LLM-as-a-judge with MT-bench and chatbot arena
Zheng L, Chiang WL, Sheng Y, Zhuang S, Wu Z, Zhuang Y, et al. Judging LLM-as-a-judge with MT-bench and chatbot arena. Adv Neural Inf Process Syst. 2023;36
2023
-
[16]
Evaluating clinical AI summaries with large language models as judges
Croxford E, Gao Y , First E, Pellegrino N, Schnier M, Caskey J, et al. Evaluating clinical AI summaries with large language models as judges. NPJ Digit Med. 2025;8:640
2025
-
[17]
Self-preference bias in LLM-as-a-judge
Wataoka K, Takahashi T, Ri R. Self-preference bias in LLM-as-a-judge. arXiv preprint arXiv:2410.21819. 2024. Available from: https://arxiv.org/abs/2410.21819
-
[18]
Sentence-BERT: sentence embeddings using Siamese BERT-networks
Reimers N, Gurevych I. Sentence-BERT: sentence embeddings using Siamese BERT-networks. Proceedings of EMNLP-IJCNLP 2019. 2019:3982-3992
2019
-
[19]
LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods
Li H, Dong Q, Chen J, Su H, Zhou Y, Ai Q, et al. LLMs-as-judges: a comprehensive survey on LLM-based evaluation methods. arXiv preprint arXiv:2412.05579. 2024. Available from: https://arxiv.org/abs/2412.05579
arXiv 2024
-
[20]
ACSM position stand: quantity and quality of exercise for developing and maintaining cardiorespiratory, musculoskeletal, and neuromotor fitness in apparently healthy adults
Garber CE, Blissmer B, Deschenes MR, Franklin BA, Lamonte MJ, Lee IM, et al. ACSM position stand: quantity and quality of exercise for developing and maintaining cardiorespiratory, musculoskeletal, and neuromotor fitness in apparently healthy adults. Med Sci Sports Exerc. 2011;43(7):1334-1359
2011
-
[21]
Physical activity and exercise intensity terminology: a joint ACSM expert statement and ESSA consensus statement
Bishop DJ, Beck B, Biddle SJH, Denay KL, Ferri A, Gibala MJ, et al. Physical activity and exercise intensity terminology: a joint ACSM expert statement and ESSA consensus statement. Med Sci Sports Exerc. 2025;57(11):2599-2613
2025
-
[22]
Loading recommendations for muscle strength, hypertrophy, and local endurance: a re-examination of the repetition continuum
Schoenfeld BJ, Grgic J, Van Every DW, Plotkin DL. Loading recommendations for muscle strength, hypertrophy, and local endurance: a re-examination of the repetition continuum. Sports. 2021;9(2):32
2021
-
[23]
Currier BS, D'Souza AC, Fiatarone Singh MA, Lowisz CV, Rawson ES, Schoenfeld BJ, et al. American College of Sports Medicine position stand: resistance training prescription for muscle function, hypertrophy, and physical performance in healthy adults: an overview of reviews. Med Sci Sports Exerc. 2026;58(4):851-872
2026
-
[24]
ACSM's guidelines for exercise testing and prescription
American College of Sports Medicine. ACSM's guidelines for exercise testing and prescription. 12th ed. Philadelphia: Wolters Kluwer; 2024
2024
-
[25]
Are LLMs reliable? An exploration of the reliability of large language models in clinical note generation
Carandang KAM, Arana JM, Casin ER, Monterola C, Tan DS, Valenzuela JFB, et al. Are LLMs reliable? An exploration of the reliability of large language models in clinical note generation. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track). 2025:1413-1422
2025
-
[26]
Assessing the accuracy and reliability of large language models in psychiatry using standardized multiple-choice questions: cross-sectional study
Hanss K, Sarma KV, Glowinski AL, Krystal A, Saunders R, Halls A, et al. Assessing the accuracy and reliability of large language models in psychiatry using standardized multiple-choice questions: cross-sectional study. J Med Internet Res. 2025;27:e69910
2025
-
[27]
Quantifying the reasoning abilities of LLMs on clinical cases
Qiu P, Wu C, Liu S, Fan Y, Zhao W, Chen Z, et al. Quantifying the reasoning abilities of LLMs on clinical cases. Nat Commun. 2025;16(1):9799
2025
-
[28]
Large language models in medicine
Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW. Large language models in medicine. Nat Med. 2023;29(8):1930-1940
2023
-
[29]
Gu J, Jiang X, Shi Z, Tan H, Zhai X, Xu C, et al. A survey on LLM-as-a-judge. arXiv preprint arXiv:2411.15594. 2024. Available from: https://arxiv.org/abs/2411.15594
arXiv 2024