Cross-individual generalizability of machine learning models for ball speed prediction in baseball pitching
Pith reviewed 2026-05-08 15:42 UTC · model grok-4.3
The pith
Machine learning models for predicting baseball pitch speed lose most of their accuracy when tested on pitchers they have not seen before.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Machine learning models for ball speed prediction achieve high accuracy only within the same pitcher: under cross-individual evaluation, R-squared falls from 0.91 to 0.38. The trunk and pivot leg retain relatively strong generalization, with the pivot leg staying above R-squared 0.25 even during the weight-shift initiation phase. Models also produce larger positive errors for intermediate pitchers than for experts, revealing a systematic bias tied to expertise level.
What carries the argument
Leave-one-subject-out cross-validation of regression models that map full-body kinematic time series to ball speed, with separate analysis of body-segment contributions and expertise-group bias.
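The evaluation protocol can be sketched in a few lines. This is a minimal illustration, not the paper's pipeline: the data are synthetic stand-ins, and the feature shapes and Ridge regressor are assumptions (the paper compares architectures such as GNN-GRU and Transformer).

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Hypothetical data: 50 pitchers x 20 pitches each, with each pitch's
# kinematic time series flattened to a fixed-length feature vector.
n_pitchers, n_pitches, n_features = 50, 20, 120
X = rng.normal(size=(n_pitchers * n_pitches, n_features))
groups = np.repeat(np.arange(n_pitchers), n_pitches)  # pitcher ID per pitch
y = X[:, :10].sum(axis=1) + rng.normal(scale=0.5, size=len(groups))  # "ball speed"

# Leave-one-subject-out: each fold trains on 49 pitchers and tests on the
# single held-out pitcher, so test subjects are always unseen.
logo = LeaveOneGroupOut()
y_true, y_pred = [], []
for train_idx, test_idx in logo.split(X, y, groups):
    model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
    y_true.append(y[test_idx])
    y_pred.append(model.predict(X[test_idx]))

# Pooled cross-individual R^2 across all held-out pitchers.
r2 = r2_score(np.concatenate(y_true), np.concatenate(y_pred))
print(f"cross-individual R^2 = {r2:.2f}")
```

Because the synthetic target here is shared across subjects, the sketch recovers a high R²; the paper's point is that real pitchers carry individual structure that drives this number down to 0.38.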
If this is right
- Practical deployment of such models for coaching or scouting will require per-athlete calibration data rather than a single general model.
- Kinematic features from the trunk and pivot leg should be prioritized when building models intended for use across multiple pitchers.
- Training sets must be stratified by expertise level to reduce systematic overestimation of less experienced pitchers.
- Early-phase data from the pivot leg alone can still support limited but usable cross-individual predictions.
Where Pith is reading between the lines
- The large drop in accuracy points to substantial unshared structure in individual pitching mechanics that standard marker-based kinematics do not capture.
- Coaching tools or wearable sensors based on these models may need on-site data collection from each new user instead of relying on population-level training.
- Transfer-learning or domain-adaptation methods could be tested to close the generalization gap without collecting full new datasets for every athlete.
- The expertise bias suggests that models may be picking up on correlated but non-causal differences in technique rather than pure speed-production mechanics.
Load-bearing premise
That the 50-pitcher sample, and the way individual pitchers were held out for testing, fully represent the mechanical differences between pitchers without distortion from fatigue, equipment, or unmeasured competitive-level factors.
What would settle it
Retraining the same models on a new sample of several hundred pitchers that includes wider ranges of body size, fatigue states, and equipment and then measuring whether cross-individual R-squared remains near 0.38 or rises substantially.
read the original abstract
Although machine learning (ML)-based performance outcome prediction is an important topic in contemporary sports science, one important issue is the limited understanding of the cross-individual generalizability of ML models in sports contexts. To address this issue, this study aimed to evaluate the cross-individual generalizability of ML models for predicting ball speed in baseball pitching. A dataset comprising 50 pitchers from various competitive levels was analyzed. Cross-individual generalizability was assessed using leave-one-subject-out cross-validation. Specifically, the effects of expertise level and restrictions on spatiotemporal motion information were examined to identify factors influencing model generalizability. The results revealed that, under cross-individual evaluation, (1) predictive performance was markedly lower than under within-individual evaluation, with R-squared value decreasing from 0.91 to 0.38; (2) the model tended to overestimate the performance of Intermediate pitchers relative to Expert pitchers, with a significant group difference in signed prediction error (p < .05); and (3) the trunk and pivot leg demonstrated relatively high generalization performance, with the pivot leg showing notable generalizability even during the weight-shift initiation phase (R-squared value > 0.25). These findings underscore the importance of cross-individual evaluation in enhancing the practical applicability of ML in sports settings and contribute to a deeper understanding of the biomechanical factors underlying the target movement.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates the cross-individual generalizability of machine learning models for predicting ball speed in baseball pitching. Using a dataset of 50 pitchers from various competitive levels, it applies leave-one-subject-out cross-validation to compare within-individual (R²=0.91) versus cross-individual (R²=0.38) performance. It reports that the model overestimates performance for intermediate relative to expert pitchers (significant signed-error group difference, p<0.05) and that trunk and pivot-leg features generalize better, with the pivot leg retaining R²>0.25 even during the weight-shift initiation phase.
Significance. If the empirical results hold after addressing potential confounds, the work supplies concrete evidence that within-subject ML models for sports outcomes degrade substantially when applied across individuals, underscoring the practical necessity of cross-individual evaluation. The segment-specific findings and statistical tests for expertise effects are useful for guiding feature selection in future biomechanical ML studies. The leave-one-subject-out design and reporting of signed errors with p-values constitute clear strengths in the empirical approach.
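The expertise-bias claim rests on comparing signed prediction errors between groups. A minimal sketch of such a comparison, using hypothetical per-pitcher errors and a rank test (the paper reports p < .05 but does not specify the test used here; group sizes and error magnitudes below are invented):

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)

# Hypothetical per-pitcher mean signed errors (predicted minus actual ball
# speed): positive values mean the model overestimates that pitcher.
err_intermediate = rng.normal(loc=1.0, scale=0.8, size=25)   # overestimated
err_expert = rng.normal(loc=-0.5, scale=0.8, size=25)        # underestimated

# Two-sided rank test for a group difference in signed error.
stat, p = mannwhitneyu(err_intermediate, err_expert, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p:.4g}")
```

Signed (rather than absolute) errors matter here: they distinguish systematic over- or underestimation of a group from mere noise, which is what makes the reported bias interpretable.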
major comments (2)
- [Methods (leave-one-subject-out cross-validation)] The leave-one-subject-out cross-validation (described in the Methods and used for the headline R² drop from 0.91 to 0.38) does not report any balancing or covariate adjustment for session-level factors such as fatigue across repeated pitches, equipment variations, or unrecorded competitive-level covariates. Without such controls, the observed performance collapse cannot be unambiguously attributed to irreducible inter-individual biomechanical differences rather than unmodeled within-session confounds.
- [Results (segment-specific analysis)] The segment-specific generalization results (trunk and pivot leg, R² > 0.25 for the pivot leg in the weight-shift phase) rest on restrictions of spatiotemporal motion information; the exact implementation, the feature-extraction pipeline, and whether feature selection occurred inside each CV fold are not described in enough detail to rule out information leakage or inconsistent preprocessing across folds.
minor comments (2)
- [Abstract] The abstract omits the ML model architecture, hyperparameter tuning protocol, and precise data-exclusion rules; these should be summarized even at the abstract level for immediate reproducibility assessment.
- [Methods] Clarify the exact sample sizes and pitching counts per expertise group (expert vs. intermediate) and any exclusion criteria applied before the LOSO procedure.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the interpretation of our cross-individual evaluation results. We address each major comment below and indicate planned revisions.
read point-by-point responses
- Referee: [Methods (leave-one-subject-out cross-validation)] The leave-one-subject-out cross-validation (described in the Methods and used for the headline R² drop from 0.91 to 0.38) does not report any balancing or covariate adjustment for session-level factors such as fatigue across repeated pitches, equipment variations, or unrecorded competitive-level covariates. Without such controls, the observed performance collapse cannot be unambiguously attributed to irreducible inter-individual biomechanical differences rather than unmodeled within-session confounds.
Authors: We agree that the leave-one-subject-out design does not include explicit balancing or covariate adjustment for session-level factors such as fatigue, equipment variations, or unrecorded competitive-level details. The data collection occurred in standardized laboratory sessions with consistent equipment, but fatigue was not systematically recorded or modeled, and competitive level was used only for post-hoc group comparisons rather than as a covariate in the CV. This is a genuine limitation of the current analysis. We will revise the Methods and Discussion sections to explicitly acknowledge these unmodeled factors and note that the observed R² drop may partly reflect such confounds in addition to inter-individual biomechanical differences. We will also add a recommendation for future studies to incorporate session-level covariates where possible. revision: partial
- Referee: [Results (segment-specific analysis)] The segment-specific generalization results (trunk and pivot leg, R² > 0.25 for the pivot leg in the weight-shift phase) rest on restrictions of spatiotemporal motion information; the exact implementation, the feature-extraction pipeline, and whether feature selection occurred inside each CV fold are not described in enough detail to rule out information leakage or inconsistent preprocessing across folds.
Authors: We acknowledge that the current Methods section lacks sufficient detail on the segment-specific restrictions. The analysis restricted features to kinematic variables from the trunk and pivot leg during defined phases (e.g., weight-shift initiation), with feature extraction performed via standard inverse kinematics pipelines on the motion-capture data. Feature selection (if any) was conducted independently within each training fold of the leave-one-subject-out CV to prevent leakage, and all preprocessing steps were applied uniformly. We will expand the Methods section with a dedicated subsection describing the exact feature restriction criteria, phase definitions, extraction pipeline, and confirmation that selection occurred inside each CV fold. If space permits, we will include a supplementary table listing the retained features per segment. revision: yes
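The fold-internal selection the authors describe has a standard implementation: wrapping selection and regression in a single pipeline so the selector is refit on each fold's training subjects only. A sketch under assumed data shapes (the selector, regressor, and feature counts below are illustrative, not the paper's):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneGroupOut, cross_val_predict
from sklearn.metrics import r2_score

rng = np.random.default_rng(2)
n_pitchers, n_pitches, n_features = 20, 10, 60
X = rng.normal(size=(n_pitchers * n_pitches, n_features))
groups = np.repeat(np.arange(n_pitchers), n_pitches)
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.3, size=len(groups))

# Putting SelectKBest inside the Pipeline means it is refit on each fold's
# training subjects only, so the held-out pitcher never influences which
# features are retained -- the leakage the referee is asking about.
pipe = Pipeline([
    ("select", SelectKBest(f_regression, k=10)),
    ("model", Ridge(alpha=1.0)),
])
preds = cross_val_predict(pipe, X, y, groups=groups, cv=LeaveOneGroupOut())
print(f"leakage-free cross-individual R^2 = {r2_score(y, preds):.2f}")
```

Fitting the selector on the full dataset before cross-validation would instead let every held-out pitcher shape the feature set, inflating the reported generalization.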
Circularity Check
Empirical LOSO cross-validation yields no circularity
full rationale
The paper's central claims consist of reported R² values and signed-error differences obtained by training ML models on 49 subjects and testing on the held-out subject (leave-one-subject-out). These are direct empirical performance statistics computed on unseen data; they do not reduce by any equation or self-citation to quantities defined in terms of the fitted parameters themselves. No self-definitional steps, fitted-input predictions, uniqueness theorems, or ansatzes appear in the derivation chain. The evaluation protocol is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- ML model hyperparameters
axioms (1)
- Domain assumption: pitching kinematics from the 50-pitcher sample are representative enough to support cross-individual claims.