Cross-Model Consistency of AI-Generated Exercise Prescriptions: A Repeated Generation Study Across Three Large Language Models
Pith reviewed 2026-05-10 03:17 UTC · model grok-4.3
The pith
Three large language models generate exercise prescriptions with fundamentally different consistency patterns even under identical temperature-zero settings.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under temperature=0, GPT-4.1 produced 100% unique outputs with mean semantic similarity of 0.955, Gemini 2.5 Flash produced only 27.5% unique outputs while reaching 0.950 similarity through repetition, and Claude Sonnet 4.6 scored 0.903 similarity; safety expressions reached ceiling levels across all three models, and these patterns demonstrate that identical decoding settings produce distinct consistency profiles undetectable by single-output evaluation.
What carries the argument
Repeated generation protocol under fixed temperature=0, tracking semantic similarity, output uniqueness rate, FITT (frequency, intensity, time, type) classification stability, and safety expression across 360 total outputs (3 models × 6 scenarios × 20 repetitions).
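As a sketch of how the two headline metrics interact, the snippet below computes mean pairwise similarity and the uniqueness rate for one scenario-model condition. The bag-of-words `embed` is a dependency-free stand-in for the SBERT sentence embeddings the paper cites; all names here are illustrative, not the authors' code:

```python
from collections import Counter
from itertools import combinations
from math import sqrt

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; the paper cites SBERT sentence
    # embeddings, swapped out here to keep the sketch self-contained.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def consistency_profile(outputs: list[str]) -> dict:
    """Mean pairwise similarity and uniqueness rate for one
    scenario-model condition (20 repeated generations in the paper)."""
    sims = [cosine(embed(x), embed(y)) for x, y in combinations(outputs, 2)]
    return {
        "mean_similarity": sum(sims) / len(sims),
        "unique_rate": len(set(outputs)) / len(outputs),
    }

# Duplicated outputs push unique_rate down while keeping mean similarity
# high, mimicking the Gemini-style profile the review describes.
profile = consistency_profile(
    ["walk 30 min daily"] * 3 + ["walk 30 minutes each day"]
)
```

The point of the pairing is that the two numbers dissociate: verbatim repetition inflates similarity without any independent re-derivation of the prescription, which is exactly what a single-output evaluation cannot see.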
If this is right
- Model selection for LLM-based exercise prescription tools must account for repeated-generation behavior rather than single-output quality alone.
- Single-prompt evaluations cannot detect whether high semantic similarity arises from stable reasoning or from text duplication.
- Safety expression metrics reach ceiling levels and provide no differentiation between models.
- Output consistency under repetition should be treated as a core requirement for reliable clinical deployment of such systems.
- Model choice in this domain functions as a clinical decision with direct implications for patient-facing advice.
Where Pith is reading between the lines
- Evaluation pipelines for medical LLMs should routinely include repeated sampling to surface hidden repetition or drift.
- The distinct profiles may reflect differences in training or alignment that could be diagnosed through targeted ablation studies on other clinical text tasks.
- Parallel use of multiple models and cross-checking their outputs could serve as a practical safeguard when deploying any one model for exercise planning.
- The same repeated-generation test could be applied to other medical generation domains such as dietary plans or rehabilitation protocols to check generalizability.
Load-bearing premise
That the chosen semantic similarity metric and uniqueness rate accurately reflect clinical reliability of the generated exercise prescriptions rather than merely stylistic differences.
What would settle it
If expert clinicians reviewing the full set of repeated outputs judge them clinically equivalent in safety and appropriateness across all three models, the claim that these consistency profiles affect deployment reliability would be undermined.
Original abstract
This study compared repeated generation consistency of exercise prescription outputs across three large language models (LLMs), specifically GPT-4.1, Claude Sonnet 4.6, and Gemini 2.5 Flash, under temperature=0 conditions. Each model generated prescriptions for six clinical scenarios 20 times, yielding 360 total outputs analyzed across four dimensions: semantic similarity, output reproducibility, FITT classification, and safety expression. Mean semantic similarity was highest for GPT-4.1 (0.955), followed by Gemini 2.5 Flash (0.950) and Claude Sonnet 4.6 (0.903), with significant inter-model differences confirmed (H = 458.41, p < .001). Critically, these scores reflected fundamentally different generative behaviors: GPT-4.1 produced entirely unique outputs (100%) with stable semantic content, while Gemini 2.5 Flash showed pronounced output repetition (27.5% unique outputs), indicating that its high similarity score derived from text duplication rather than consistent reasoning. Identical decoding settings thus yielded fundamentally different consistency profiles, a distinction that single-output evaluations cannot capture. Safety expression reached ceiling levels across all models, confirming its limited utility as a differentiating metric. These results indicate that model selection constitutes a clinical rather than merely technical decision, and that output behavior under repeated generation conditions should be treated as a core criterion for reliable deployment of LLM-based exercise prescription systems.
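The inter-model comparison (H = 458.41, p < .001) is a Kruskal–Wallis test, presumably over per-output similarity scores. A dependency-free sketch of the H statistic (tie correction omitted for brevity; a library implementation such as `scipy.stats.kruskal` would normally be used, and the exact input data is not given in the excerpt):

```python
def rank(values):
    # Average ranks (1-based), with tied values sharing their mean rank.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def kruskal_h(*groups):
    """Kruskal-Wallis H statistic over k independent groups
    (tie correction omitted to keep the sketch short)."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    r = rank(pooled)
    idx, total = 0, 0.0
    for g in groups:
        rank_sum = sum(r[idx:idx + len(g)])
        idx += len(g)
        total += rank_sum * rank_sum / len(g)
    return 12.0 / (n * (n + 1)) * total - 3 * (n + 1)

# Toy per-output similarity scores for two hypothetical models.
h = kruskal_h([0.95, 0.96, 0.97], [0.89, 0.90, 0.91])
```

With 360 outputs across three models the statistic has far more data behind it than this toy call, which is why the reported H value is so large.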
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper empirically compares the consistency of exercise prescription outputs generated by three LLMs (GPT-4.1, Claude Sonnet 4.6, Gemini 2.5 Flash) under identical temperature=0 settings. For six clinical scenarios each generated 20 times (360 outputs total), it reports mean semantic similarity scores (GPT-4.1: 0.955; Gemini: 0.950; Claude: 0.903) with significant inter-model differences (H=458.41, p<.001), alongside uniqueness rates (GPT-4.1: 100% unique; Gemini: 27.5% unique) and ceiling-level safety expression across models. The central claim is that these reveal fundamentally different generative behaviors not detectable in single-output evaluations, implying that repeated-generation consistency should be a core criterion for clinical deployment and that model selection is a clinical decision.
Significance. If the quantitative distinctions hold under full methodological disclosure, the work provides a useful empirical demonstration that fixed decoding parameters can produce divergent consistency profiles across LLMs, with value for AI safety and reliability research in healthcare applications. It correctly distinguishes semantic stability from output duplication and reports clear statistical separation, strengthening the case for repeated-sampling protocols over single-shot assessments.
major comments (2)
- [Abstract] The conclusion that 'model selection constitutes a clinical rather than merely technical decision' and that repeated-generation consistency 'should be treated as a core criterion for reliable deployment' is not supported by the presented evidence, as the study reports no expert clinical review, outcome validation, or assessment of whether the observed differences in uniqueness or semantic content affect prescription safety, efficacy, or patient suitability.
- [Abstract, Results] The distinction between GPT-4.1 (high similarity with 100% unique outputs) and Gemini (high similarity from 27.5% unique outputs due to repetition) is load-bearing for the claim that single-output evaluations miss key behaviors, yet the manuscript provides no details on the embedding model, similarity threshold, or exact uniqueness detection method (e.g., string match vs. semantic), preventing assessment of whether these metrics capture clinically relevant consistency rather than stylistic templating.
minor comments (2)
- The model names (GPT-4.1, Claude Sonnet 4.6, Gemini 2.5 Flash) should be verified against current official designations and version dates for reproducibility.
- [Abstract] The abstract mentions analysis across four dimensions (semantic similarity, output reproducibility, FITT classification, and safety expression) but reports quantitative results primarily on similarity and uniqueness; a brief summary of FITT findings would improve completeness.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments. We address each major comment below and have revised the manuscript to improve methodological transparency and moderate interpretive claims where the evidence is limited.
Point-by-point responses
- Referee: [Abstract] The conclusion that 'model selection constitutes a clinical rather than merely technical decision' and that repeated-generation consistency 'should be treated as a core criterion for reliable deployment' is not supported by the presented evidence, as the study reports no expert clinical review, outcome validation, or assessment of whether the observed differences in uniqueness or semantic content affect prescription safety, efficacy, or patient suitability.
Authors: We agree that the original abstract language overstated the direct clinical implications. The study demonstrates measurable differences in generative consistency under repeated sampling but does not include expert review, outcome data, or validation of clinical impact. We have revised the abstract to state that the observed profiles 'suggest that model selection may carry clinical implications' and that repeated-generation consistency 'merits consideration in deployment decisions,' rather than asserting it as a core criterion. We have also added an explicit limitations paragraph noting the absence of clinical validation. revision: yes
- Referee: [Abstract, Results] The distinction between GPT-4.1 (high similarity with 100% unique outputs) and Gemini (high similarity from 27.5% unique outputs due to repetition) is load-bearing for the claim that single-output evaluations miss key behaviors, yet the manuscript provides no details on the embedding model, similarity threshold, or exact uniqueness detection method (e.g., string match vs. semantic), preventing assessment of whether these metrics capture clinically relevant consistency rather than stylistic templating.
Authors: The referee is correct that the original manuscript omitted key methodological details required to evaluate the metrics. We have added a new subsection in the Methods section describing the embedding model, the cosine similarity threshold and its rationale, and the precise uniqueness detection procedure (combining normalized string matching with semantic checks). These additions clarify how semantic stability was distinguished from textual duplication and allow readers to assess the clinical relevance of the reported differences. revision: yes
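The rebuttal's uniqueness procedure (normalized string matching plus semantic checks) is not spelled out in the excerpt; the string-matching half might look like the following sketch, where the specific normalization rules (lowercasing, punctuation stripping, whitespace collapsing) are assumptions:

```python
import re

def normalize(text: str) -> str:
    # Collapse whitespace, lowercase, and strip punctuation so that
    # purely cosmetic differences do not count as distinct outputs.
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()

def uniqueness_rate(outputs: list[str]) -> float:
    """Share of distinct outputs after normalization; the review quotes
    100% for GPT-4.1 vs. 27.5% for Gemini 2.5 Flash."""
    return len({normalize(o) for o in outputs}) / len(outputs)

rate = uniqueness_rate([
    "Walk 30 min, 5x/week.",
    "walk 30 min  5x/week",   # cosmetic variant of the first output
    "Cycle 20 min, 3x/week.",
])
```

The choice of normalization matters: too little and formatting noise masquerades as diversity; too much and genuinely different prescriptions collapse together, which is presumably why the authors pair it with a semantic check.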
Circularity Check
No circularity: purely empirical repeated-generation comparison
Full rationale
The paper reports an empirical study that generates 360 exercise-prescription outputs (20 repetitions per scenario across 6 scenarios and 3 models) under temperature=0, then measures semantic similarity, uniqueness rate, FITT classification, and safety expression directly from those outputs. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text or abstract. All central claims (e.g., GPT-4.1 100% unique vs. Gemini 27.5% unique despite similar mean similarity) rest on observed data rather than any reduction to prior inputs by construction. This is the expected non-finding for a straightforward empirical comparison study.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: semantic similarity via embedding-based metrics accurately captures equivalence of clinical exercise prescriptions.
- Modeling assumption: the temperature=0 setting produces deterministic outputs without stochastic variation (hosted LLM APIs can violate this in practice; cf. Atil et al., 2025, in the reference graph below).
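The determinism assumption is directly testable by re-sampling. A minimal sketch, assuming a hypothetical `generate` wrapper (prompt → text) called at fixed decoding settings; the `Stub` model is invented here to simulate the residual non-determinism Atil et al. (2025) report for hosted deployments:

```python
def determinism_check(generate, prompt: str, n: int = 20) -> float:
    """Fraction of calls that return the modal (most frequent) output
    across n repeated generations; 1.0 means fully deterministic."""
    outputs = [generate(prompt) for _ in range(n)]
    modal_count = max(outputs.count(o) for o in set(outputs))
    return modal_count / n

class Stub:
    # Hypothetical model wrapper that occasionally switches phrasing,
    # illustrating non-determinism under nominally fixed settings.
    def __init__(self):
        self.i = 0
    def __call__(self, prompt: str) -> str:
        self.i += 1
        return "plan A" if self.i % 4 else "plan B"

rate = determinism_check(Stub(), "case 1", n=20)
```

A rate below 1.0 even at temperature=0 would mean the axiom fails for that deployment, and observed output variation mixes model behavior with serving-stack noise.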
Reference graph
Works this paper leans on
- [1] Introduction: "The rapid advancement of large language models (LLMs) has opened new possibilities in healthcare and health management (Raza et al., 2024; Meng et al., 2024). LLMs can generate contextually appropriate text based on user prompts, and their potential applications have been discussed across a range of domains including clinical consultation su..." (2024)
- [2] Materials and Methods: "2.1. Study Design This study employed an experimental observational design to compare repeated generation consistency of exercise prescription outputs across three LLMs. Identical clinical scenarios and prompts were submitted to each model under controlled conditions, and intra-model consistency and inter-model differences were quant..." (2026, estimated)
- [3] "Output reproducibility was evaluated by exact match comparison of preprocessed texts within each scenario-model condition, with unique output counts and proportions (%) calculated by scenario and model. 2.7. Statistical Analysis Descriptive statistics are presented as mean ± standard deviation. Non-parametric tests were applied throughout given the smal..."
- [4] Results: "3.1 Intra-model Semantic Consistency (RQ1) Overall mean semantic similarity was highest for GPT-4.1 (Mean = 0.955, SD = 0.028), followed by Gemini-2.5-Flash (Mean = 0.950, SD = 0.070) and Claude-Sonnet-4.6 (Mean = 0.903, SD = 0.071). All three models showed significant variation in consistency across scenarios (all p < 0.001), with detailed pairwise comparisons presented in Table 1 and Figure 2."
- [5] Discussion: "Previous studies on LLM-based exercise prescription have largely focused on single-model evaluations, and cross-model differences in output characteristics have not been adequately examined. Although a previous study confirmed the intra-model consistency of Gemini under repeated generation conditions (Lee et al., 2026), whether similar patte..." (2026)
- [6] Conclusion: "The present study confirmed that LLM-generated exercise prescription outputs differ markedly across models in terms of semantic consistency and output reproducibility, even under identical conditions. While GPT-4.1 achieved both textual diversity and semantic consistency, the output repetition observed in Gemini-2.5-Flash and the semantic v..."
- [7] Akrimi, S., Schwensfeier, L., Düking, P., Kreutz, T., & Brinkmann, C. (2025). ChatGPT-4o-generated exercise plans for patients with type 2 diabetes mellitus: Assessment of their safety and other quality criteria by coaching experts. Sports, 13(4), 92. https://doi.org/10.3390/sports13040092
- [8] American College of Sports Medicine. (2024). ACSM's guidelines for exercise testing and prescription (12th ed.). Wolters Kluwer.
- [9] Atil, B., Aykent, S., Chittams, A., Fu, L., Passonneau, R. J., Radcliffe, E., Rajagopal, G. R., Sloan, A., Tudrej, T., Ture, F., Wu, Z., Xu, L., & Baldwin, B. (2025). Non-determinism of “deterministic” LLM system settings in hosted environments. In Proceedings of the 5th Workshop on Evaluation and Comparison of NLP Systems (pp. 135–148). Association for ...
- [10] Aydin, S., Karabacak, M., Vlachos, V., & Margetis, K. (2024). Large language models in patient education: A scoping review of applications in medicine. Frontiers in Medicine, 11, 1477898. https://doi.org/10.3389/fmed.2024.1477898
- [11] Bishop, D. J., Beck, B., Biddle, S. J. H., Denay, K. L., Ferri, A., Gibala, M. J., Headley, S., Jones, A. M., Jung, M., Lee, M. J.-C., Moholdt, T., Newton, R. U., Nimphius, S., Pescatello, L. S., Saner, N. J., & Tzarimas, C. (2025). Physical activity and exercise intensity terminology: A joint ACSM expert statement and ESSA consensus statement. Medicine &...
- [12] Choi, M., Park, J., Lee, M., Beom, J., Jung, S. Y., & Lee, K. (2026). AI-generated exercise prescriptions for at-risk populations: Safety and feasibility of a large language model assessed by expert evaluation. Journal of Clinical Medicine, 15(6). https://doi.org/10.3390/jcm15062457
- [14] Croxford, E., Gao, Y., First, E., Pellegrino, N., Schnier, M., Caskey, J., Oguss, M., Wills, G., Chen, G., Dligach, D., Churpek, M. M., Mayampurath, A., Liao, F., Goswami, C., Wong, K. K., Patterson, B. W., & Afshar, M. (2025). Evaluating clinical AI summaries with large language models as judges. npj Digital Medicine, 8. https://doi.org/10.1038/s41746-025-01648-5
- [16] Currier, B. S., D’Souza, A. C., Fiatarone Singh, M. A., Lowisz, C. V., Rawson, E. S., Schoenfeld, B. J., Smith-Ryan, A. E., Steen, J. P., Thomas, G. A., Triplett, N. T., Washington, T. A., Werner, T. J., & Phillips, S. M. (2026). American College of Sports Medicine position stand: Resistance training prescription for muscle function, hypertrophy, and phy...
- [17] Dergaa, I., Ben Saad, H., El Omri, A., Glenn, J. M., Clark, C. C. T., Washif, J. A., Guelmami, N., Hammouda, O., Al-Horani, R. A., Reynoso-Sánchez, L. F., Romdhani, M., Paineiras-Domingos, L. L., Vancini, R. L., Taheri, M., Mataruna-Dos-Santos, L. J., Trabelsi, K., Chtourou, H., Zghibi, M., Eken, Ö., Swed, S., Ben Aissa, M., Shawki, H. H., El-Seedi, H. R...
- [18] Düking, P., Sperlich, B., Voigt, L., Van Hooren, B., Zanini, M., & Zinner, C. (2024). ChatGPT generated training plans for runners are not rated optimal by coaching experts, but increase in quality with additional input information. Journal of Sports Science and Medicine, 23, 56–65.
- [19] Enichen, E. J., Young, C. C., & Frates, E. P. (2025). The potential of AI to create personalized exercise plans. Health Promotion Practice. Advance online publication. https://doi.org/10.1177/15248399251394695
- [20] Festa, R. R., Jofré-Saldía, E., Candia, A. A., Monsalves-Álvarez, M., Flores-Opazo, M., Peñailillo, L., Marzuca-Nassr, G. N., Aguilar-Farias, N., Fritz-Silva, N., & Cancino-Lopez, J. (2023). Next steps to advance general physical activity recommendations towards physical exercise prescription: A narrative review. BMJ Open Sport & Exercise Medicine, 9, e00...
- [21] Garber, C. E., Blissmer, B., Deschenes, M. R., Franklin, B. A., Lamonte, M. J., Lee, I.-M., Nieman, D. C., & Swain, D. P. (2011). American College of Sports Medicine position stand: Quantity and quality of exercise for developing and maintaining cardiorespiratory, musculoskeletal, and neuromotor fitness in apparently healthy adults. Medicine & Science in ...
- [22] He, Z., Wang, J., Zhang, B., & Li, Y. (2026). Knowledge-grounded large language model for personalized sports training plan generation. Scientific Reports, 16, 6793. https://doi.org/10.1038/s41598-026-37075-z
- [23] Kim, B., Kang, J., Jung, Y. J., & Ahn, J. (2026). Generative and large-scale artificial intelligence in exercise and sports medicine: A narrative review. The Asian Journal of Kinesiology, 28(1), 58–72. https://doi.org/10.15758/ajk.2026.28.1.58
- [24] Kim, J. H. (2026). Automated prescription of therapeutic exercise for shoulder impingement syndrome using literature-driven rule generation architecture. Musculoskeletal Science and Practice, 76, 103520. https://doi.org/10.1016/j.msksp.2026.103520
- [25] Lai, X., Chen, J., Lai, Y., Huang, S., Cai, Y., Sun, Z., Wang, X., Pan, K., Gao, Q., & Huang, C. (2025a). Using large language models to enhance exercise recommendations and physical activity in clinical and healthy populations: Scoping review. JMIR Medical Informatics, 13, e59309. https://doi.org/10.2196/59309
- [26] Lai, X., Lai, Y., Chen, J., Huang, S., Gao, Q., & Huang, C. (2025b). Evaluation strategies for large language model-based models in exercise and health coaching: Scoping review. Journal of Medical Internet Research, 27, e79217. https://doi.org/10.2196/79217
- [27] Lai, X., Lai, Y., Chen, J., Huang, S., Gao, Q., & Huang, C. (2026). An AI-assisted adaptive boolean rubric for exercise prescription evaluation: A pilot validation study. International Journal of Medical Informatics, 207, 106202. https://doi.org/10.1016/j.ijmedinf.2025.106202
- [28] Lee, K. (2026). Consistency of AI-generated exercise prescriptions: A repeated generation study using a large language model. arXiv preprint arXiv:2604.11287. https://arxiv.org/abs/2604.11287
- [29] Li, G., Li, H., Su, Y., Li, Y., Jiang, S., & Zhang, G. (2025). GPT-4 as a virtual fitness coach: A case study assessing its effectiveness in providing weight loss and fitness guidance. BMC Public Health, 25, 2466. https://doi.org/10.1186/s12889-025-23666-6
- [30] Li, H., Dong, Q., Chen, J., Su, H., Zhou, Y., Ai, Q., Ye, Z., & Liu, Y. (2024). LLMs-as-judges: A comprehensive survey on LLM-based evaluation methods. arXiv preprint arXiv:2412.05579. https://arxiv.org/abs/2412.05579
- [31] Meng, X., Yan, X., Zhang, K., Liu, D., Cui, X., Yang, Y., Zhang, M., Cao, C., Wang, J., Wang, X., Gao, J., Wang, Y.-G.-S., Ji, J.-M., Qiu, Z., Li, M., Qian, C., Guo, T., Ma, S., Wang, Z., . . . Tang, Y.-D. (2024). The application of large language models in medicine: A scoping review. iScience, 27, 109713. https://doi.org/10.1016/j.isci.2024.109713
- [32] Nduka, T. C., Ndakotsu, A., Nriagu, V. C., Karikalan, S., Abdulkareem, L., Omede, F. O., & Bob-Manuel, T. (2025). AI-generated diet and exercise recommendations for cardiovascular health compared to established cardiology society guidelines. Cureus, 17(8), e90968. https://doi.org/10.7759/cureus.90968
- [33] Negra, Y., Sammoud, S., Bouguezzi, R., Markov, A., Capranica, L., Müller, P., & Chaabene, H. (2026). Effects of a ChatGPT-generated eccentric training programme on speed, change of direction, agility, and jumping performance in U14 tennis players: A non-randomised controlled study. Journal of Sports Sciences. Advance online publication. https://doi.org/1...
- [34] Philuek, P., Kusump, S., Sathianpoonsook, T., Jansupom, C., Sawanyawisuth, P., Sawanyawisuth, K., & Chainarong, A. (2025). The effects of chat GPT generated exercise program in healthy overweight young adults: A pilot study. Journal of Human Sport and Exercise, 20, 169–179. https://doi.org/10.14198/jhse.2025.201.15
- [35] Puce, L., Bragazzi, N. L., Currà, A., & Trompetto, C. (2025). Harnessing generative artificial intelligence for exercise and training prescription: Applications and implications in sports and physical activity—A systematic literature review. Applied Sciences, 15(7), 3497. https://doi.org/10.3390/app15073497
- [36] Raza, M. M., Venkatesh, K. P., & Kvedar, J. C. (2024). Generative AI and large language models in health care: Pathways to implementation. npj Digital Medicine, 7, 62. https://doi.org/10.1038/s41746-023-00988-4
- [37] Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (pp. 3982–3992). https://doi.org/10.18653/v1/D19-1410
- [38] Schoenfeld, B. J., Grgic, J., Van Every, D. W., & Plotkin, D. L. (2021). Loading recommendations for muscle strength, hypertrophy, and local endurance: A re-examination of the repetition continuum. Sports, 9(2), 32. https://doi.org/10.3390/sports9020032
- [39] Schütze, K., Shehatha, R., Beer, K., Needham, M., Smith, T., Bagg, M., Doverty, A., & Cooper, I. (2026). Evaluating ChatGPT’s advice and recommendations regarding exercise for people with inclusion body myositis. Neuromuscular Disorders, 62, 106418. https://doi.org/10.1016/j.nmd.2026.106418
- [40] Shin, D., Hsieh, G., & Kim, Y. H. (2025). PlanFitting: Personalized exercise planning with large language model-driven conversational agent. In Proceedings of the 7th ACM Conference on Conversational User Interfaces (CUI ’25). https://doi.org/10.1145/3719160.3736607
- [41] Shyr, C., Ren, B., Hsu, C.-Y., Yan, C., Tinker, R. J., Cassini, T. A., Hamid, R., Wright, A., Bastarache, L., Peterson, J. F., Malin, B. A., & Xu, H. (2025). A statistical framework for evaluating the repeatability and reproducibility of large language models. medRxiv. https://doi.org/10.1101/2025.08.06.25333170
- [43] Washif, J., Pagaduan, J., James, C., Dergaa, I., & Beaven, C. (2024). Artificial intelligence in sport: Exploring the potential of using ChatGPT in resistance training prescription. Biology of Sport, 41(2), 209–220. https://doi.org/10.5114/biolsport.2024.132987
- [45] Zaleski, A. L., Berkowsky, R., Craig, K. J. T., & Pescatello, L. S. (2024). Comprehensiveness, accuracy, and readability of exercise recommendations provided by an AI-based chatbot: Mixed methods study. JMIR Medical Education, 10, e51308. https://doi.org/10.2196/51308
- [46] Zhang, Y.-F., & Liu, X.-Q. (2024). Using ChatGPT to promote college students’ participation in physical activities and its effect on mental health. World Journal of Psychiatry, 14, 330–341. https://doi.org/10.5498/wjp.v14.i2.330
- [47] Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E. P., Zhang, H., Gonzalez, J. E., & Stoica, I. (2023). Judging LLM-as-a-judge with MT-bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36. https://arxiv.org/abs/2306.05685
Supplementary Material 1. Exercise Prescription Genera...
- [48] Prompt Template: "The following prompt template was used to generate exercise prescriptions for all six clinical scenarios. The placeholder [CLINICAL CASE] was replaced with the corresponding clinical case description for each scenario. INSTRUCTION_TEMPLATE = """Based on [CLINICAL CASE], please develop a 12-week exercise program. Ensure that the plan adhere..."
- [49] Clinical Case Descriptions Used as Input: "The following clinical case descriptions were substituted into the [CLINICAL CASE] placeholder of the prompt template above. All cases are hypothetical and were constructed without the use of real patient information. Case 1. Type 2 Diabetes Mellitus + Obesity Participant Profile Male, 55 years old, 7-year history ..."
- [50] FITT Structural Classification: "1.1 Prompt Used for FITT Classification (Claude Sonnet 4.6) The following prompt was submitted to Claude Sonnet 4.6 (Anthropic) for each of the 120 preprocessed outputs. The placeholder [EXERCISE PRESCRIPTION TEXT] was replaced with the corresponding output text. FITT_PROMPT = """Classify the FITT components based on the ini..."
- [51] Safety Expression Consistency Evaluation: "2.1 Prompt Used for Safety Evaluation (Claude Sonnet 4.6) The following prompt was submitted to Claude Sonnet 4.6 (Anthropic) for binary inclusion assessment of safety-related expressions in each output. SAFETY_PROMPT = """Evaluate the presence or absence of safety-related expressions in the following exercise pres..."
- [52] Preprocessing Prompt: "3.1 Prompt Used for Output Preprocessing (Claude Sonnet 4.6) Prior to SBERT-based semantic similarity analysis, a standardized preprocessing prompt was applied to all 120 raw outputs using Claude Sonnet 4.6 to extract only the exercise prescription body, excluding formatting elements such as greetings, closing remarks, tables, and bul..."