pith. machine review for the scientific record.

arxiv: 2605.05487 · v1 · submitted 2026-05-06 · 💻 cs.HC

Recognition: unknown

Cross-individual generalizability of machine learning models for ball speed prediction in baseball pitching

Chiharu Suzuki, Hiroki Nakamoto, Ryota Takamido

Pith reviewed 2026-05-08 15:42 UTC · model grok-4.3

classification 💻 cs.HC
keywords machine learning · baseball pitching · ball speed prediction · cross-individual generalizability · kinematic analysis · sports performance · leave-one-subject-out validation

The pith

Machine learning models for predicting baseball pitch speed lose most accuracy when tested on new pitchers they have not seen before.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates how well machine learning models trained to predict ball speed from pitching motion data perform when applied to pitchers outside the training set. Using data from 50 pitchers of mixed competitive levels, it compares within-individual testing to cross-individual testing via leave-one-subject-out validation. Performance collapses from an R² of 0.91 within the same pitcher to 0.38 across different pitchers. The trunk and pivot leg still yield usable predictions even early in the throw, while the model systematically overestimates speeds for intermediate-level pitchers relative to experts. The work shows that standard motion-capture features do not transfer readily across individuals and that practical use in sports requires explicit attention to this gap.

Core claim

Machine learning models for ball speed prediction achieve high accuracy only within the same pitcher; under cross-individual evaluation, R² falls from 0.91 to 0.38. The trunk and pivot leg retain relatively strong generalization, with the pivot leg still above R² = 0.25 during the weight-shift initiation phase. Models also produce larger positive errors for intermediate pitchers than for experts, revealing a systematic bias tied to expertise level.
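The signed-error bias in this claim can be sketched numerically. The arrays below are synthetic stand-ins, not the paper's data; the Welch two-sample t-test mirrors the kind of group comparison on signed prediction error that the abstract reports (p < .05).

```python
# Hedged illustration: compare mean signed error (predicted - actual ball speed)
# between Intermediate and Expert pitchers. Positive error = overestimation.
# All numbers here are invented for the sketch.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
err_intermediate = rng.normal(loc=2.0, scale=1.5, size=25)   # overestimated
err_expert = rng.normal(loc=-0.5, scale=1.5, size=25)        # slightly underestimated

# Welch's t-test (unequal variances) on the two groups of signed errors.
t, p = stats.ttest_ind(err_intermediate, err_expert, equal_var=False)
print(f"mean signed error: intermediate={err_intermediate.mean():.2f}, "
      f"expert={err_expert.mean():.2f}, p={p:.4f}")
```

A significant positive gap of this shape is what the paper reads as expertise-linked bias rather than random scatter.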

What carries the argument

Leave-one-subject-out cross-validation of regression models that map full-body kinematic time series to ball speed, with separate analysis of body-segment contributions and expertise-group bias.
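The protocol can be sketched concretely. This is a minimal stand-in, not the paper's setup: synthetic kinematic features replace the motion-capture time series, a generic ridge regressor replaces the paper's models, and a per-pitcher offset is injected to show why cross-individual prediction is the hard case.

```python
# Sketch of leave-one-subject-out (LOSO) evaluation with scikit-learn.
# Each fold trains on all pitchers but one and tests on the held-out pitcher.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n_pitchers, pitches_per, n_features = 10, 20, 6

# Simulated features and ball speeds with a pitcher-specific offset --
# exactly the kind of individual structure that does not transfer.
groups = np.repeat(np.arange(n_pitchers), pitches_per)
X = rng.normal(size=(n_pitchers * pitches_per, n_features))
pitcher_offset = rng.normal(scale=3.0, size=n_pitchers)
y = (X @ rng.normal(size=n_features)
     + pitcher_offset[groups]
     + rng.normal(scale=0.5, size=len(groups)))

logo = LeaveOneGroupOut()
preds = np.empty_like(y)
for train_idx, test_idx in logo.split(X, y, groups):
    model = Ridge().fit(X[train_idx], y[train_idx])   # trained on 9 pitchers
    preds[test_idx] = model.predict(X[test_idx])      # tested on the held-out one

print("cross-individual R^2:", round(r2_score(y, preds), 3))
```

The shared feature-to-speed mapping is recovered, but the per-pitcher offsets are not, so cross-individual R² sits well below the within-individual ceiling.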

If this is right

  • Practical deployment of such models for coaching or scouting will require per-athlete calibration data rather than a single general model.
  • Kinematic features from the trunk and pivot leg should be prioritized when building models intended for use across multiple pitchers.
  • Training sets must be stratified by expertise level to reduce systematic overestimation of less experienced pitchers.
  • Early-phase data from the pivot leg alone can still support limited but usable cross-individual predictions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The large drop in accuracy points to substantial unshared structure in individual pitching mechanics that standard marker-based kinematics do not capture.
  • Coaching tools or wearable sensors based on these models may need on-site data collection from each new user instead of relying on population-level training.
  • Transfer-learning or domain-adaptation methods could be tested to close the generalization gap without collecting full new datasets for every athlete.
  • The expertise bias suggests that models may be picking up on correlated but non-causal differences in technique rather than pure speed-production mechanics.
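One cheap version of the per-user calibration floated above can be sketched as a constant-offset correction: fit a population model, then shift its predictions by the mean residual on a few calibration pitches from the new athlete. Everything below is synthetic and illustrative, not the paper's method.

```python
# Hypothetical per-athlete calibration sketch: a population model plus a
# bias correction estimated from 5 calibration pitches.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(3)
w = rng.normal(size=5)

# Population data (many pitchers) and a new athlete with a personal offset.
X_pop = rng.normal(size=(200, 5))
y_pop = X_pop @ w + rng.normal(scale=0.5, size=200)
X_new = rng.normal(size=(30, 5))
y_new = X_new @ w + 4.0 + rng.normal(scale=0.5, size=30)  # +4 units of offset

model = Ridge().fit(X_pop, y_pop)

# Calibrate on the athlete's first 5 pitches; evaluate on the rest.
bias = np.mean(y_new[:5] - model.predict(X_new[:5]))
raw_err = np.abs(y_new[5:] - model.predict(X_new[5:])).mean()
cal_err = np.abs(y_new[5:] - (model.predict(X_new[5:]) + bias)).mean()
print(f"mean abs error raw={raw_err:.2f}, calibrated={cal_err:.2f}")
```

If the individual structure were only an additive offset, a handful of pitches would close most of the gap; the paper's results suggest the real structure is richer than that, which is where transfer learning would come in.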

Load-bearing premise

That the 50-pitcher sample, and the leave-one-subject-out protocol used to hold individuals out, fully represent the mechanical differences between pitchers without being distorted by fatigue, equipment, or unmeasured competitive-level factors.

What would settle it

Retraining the same models on a new sample of several hundred pitchers spanning wider ranges of body size, fatigue state, and equipment, then measuring whether cross-individual R² remains near 0.38 or rises substantially.

Original abstract

Although machine learning (ML)-based performance outcome prediction is an important topic in contemporary sports science, one important issue is the limited understanding of the cross-individual generalizability of ML models in sports contexts. To address this issue, this study aimed to evaluate the cross-individual generalizability of ML models for predicting ball speed in baseball pitching. A dataset comprising 50 pitchers from various competitive levels was analyzed. Cross-individual generalizability was assessed using leave-one-subject-out cross-validation. Specifically, the effects of expertise level and restrictions on spatiotemporal motion information were examined to identify factors influencing model generalizability. The results revealed that, under cross-individual evaluation, (1) predictive performance was markedly lower than under within-individual evaluation, with R-squared value decreasing from 0.91 to 0.38; (2) the model tended to overestimate the performance of Intermediate pitchers relative to Expert pitchers, with a significant group difference in signed prediction error (p < .05); and (3) the trunk and pivot leg demonstrated relatively high generalization performance, with the pivot leg showing notable generalizability even during the weight-shift initiation phase (R-squared value > 0.25). These findings underscore the importance of cross-individual evaluation in enhancing the practical applicability of ML in sports settings and contribute to a deeper understanding of the biomechanical factors underlying the target movement.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, simulated authors' rebuttal, and circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper evaluates the cross-individual generalizability of machine learning models for predicting ball speed in baseball pitching. Using a dataset of 50 pitchers from various competitive levels, it applies leave-one-subject-out cross-validation to compare within-individual (R²=0.91) versus cross-individual (R²=0.38) performance. It reports that the model overestimates performance for intermediate relative to expert pitchers (significant signed-error group difference, p<0.05) and that trunk and pivot-leg features generalize better, with the pivot leg retaining R²>0.25 even during the weight-shift initiation phase.

Significance. If the empirical results hold after addressing potential confounds, the work supplies concrete evidence that within-subject ML models for sports outcomes degrade substantially when applied across individuals, underscoring the practical necessity of cross-individual evaluation. The segment-specific findings and statistical tests for expertise effects are useful for guiding feature selection in future biomechanical ML studies. The leave-one-subject-out design and reporting of signed errors with p-values constitute clear strengths in the empirical approach.

major comments (2)
  1. [Methods (leave-one-subject-out cross-validation)] The leave-one-subject-out cross-validation (described in the Methods and used for the headline R² drop from 0.91 to 0.38) does not report any balancing or covariate adjustment for session-level factors such as fatigue across repeated pitches, equipment variations, or unrecorded competitive-level covariates. Without such controls, the observed performance collapse cannot be unambiguously attributed to irreducible inter-individual biomechanical differences rather than unmodeled within-session confounds.
  2. [Results (segment-specific analysis)] The segment-specific generalization results (trunk and pivot leg, R²>0.25 for pivot leg in weight-shift phase) rest on restrictions of spatiotemporal motion information whose exact implementation, feature-extraction pipeline, and whether selection occurred inside each CV fold are not detailed enough to rule out information leakage or inconsistent preprocessing across folds.
minor comments (2)
  1. [Abstract] The abstract omits the ML model architecture, hyperparameter tuning protocol, and precise data-exclusion rules; these should be summarized even at the abstract level for immediate reproducibility assessment.
  2. [Methods] Clarify the exact sample sizes and pitching counts per expertise group (expert vs. intermediate) and any exclusion criteria applied before the LOSO procedure.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the interpretation of our cross-individual evaluation results. We address each major comment below and indicate planned revisions.

Point-by-point responses
  1. Referee: [Methods (leave-one-subject-out cross-validation)] The leave-one-subject-out cross-validation (described in the Methods and used for the headline R² drop from 0.91 to 0.38) does not report any balancing or covariate adjustment for session-level factors such as fatigue across repeated pitches, equipment variations, or unrecorded competitive-level covariates. Without such controls, the observed performance collapse cannot be unambiguously attributed to irreducible inter-individual biomechanical differences rather than unmodeled within-session confounds.

    Authors: We agree that the leave-one-subject-out design does not include explicit balancing or covariate adjustment for session-level factors such as fatigue, equipment variations, or unrecorded competitive-level details. The data collection occurred in standardized laboratory sessions with consistent equipment, but fatigue was not systematically recorded or modeled, and competitive level was used only for post-hoc group comparisons rather than as a covariate in the CV. This is a genuine limitation of the current analysis. We will revise the Methods and Discussion sections to explicitly acknowledge these unmodeled factors and note that the observed R² drop may partly reflect such confounds in addition to inter-individual biomechanical differences. We will also add a recommendation for future studies to incorporate session-level covariates where possible. revision: partial

  2. Referee: [Results (segment-specific analysis)] The segment-specific generalization results (trunk and pivot leg, R²>0.25 for pivot leg in weight-shift phase) rest on restrictions of spatiotemporal motion information whose exact implementation, feature-extraction pipeline, and whether selection occurred inside each CV fold are not detailed enough to rule out information leakage or inconsistent preprocessing across folds.

    Authors: We acknowledge that the current Methods section lacks sufficient detail on the segment-specific restrictions. The analysis restricted features to kinematic variables from the trunk and pivot leg during defined phases (e.g., weight-shift initiation), with feature extraction performed via standard inverse kinematics pipelines on the motion-capture data. Feature selection (if any) was conducted independently within each training fold of the leave-one-subject-out CV to prevent leakage, and all preprocessing steps were applied uniformly. We will expand the Methods section with a dedicated subsection describing the exact feature restriction criteria, phase definitions, extraction pipeline, and confirmation that selection occurred inside each CV fold. If space permits, we will include a supplementary table listing the retained features per segment. revision: yes
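The leakage safeguard the authors describe, selection refit inside each training fold, is what scikit-learn gives for free when the selector sits inside a Pipeline passed to the cross-validator. A hedged sketch with invented dimensions (the paper's feature sets and selector are not specified here):

```python
# Fold-internal feature selection under LOSO: SelectKBest is refit on each
# fold's training pitchers only, so the held-out pitcher never influences
# which features are kept. Data and k are illustrative.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(1)
groups = np.repeat(np.arange(8), 15)          # 8 pitchers, 15 pitches each
X = rng.normal(size=(len(groups), 12))        # 12 candidate kinematic features
y = X[:, :4] @ np.ones(4) + rng.normal(scale=0.5, size=len(groups))

pipe = Pipeline([
    ("select", SelectKBest(f_regression, k=4)),  # runs inside each fold
    ("reg", Ridge()),
])
scores = cross_val_score(pipe, X, y, groups=groups,
                         cv=LeaveOneGroupOut(), scoring="r2")
print("per-pitcher R^2:", np.round(scores, 2))
```

Selecting features on the full dataset before splitting, by contrast, would let the test pitcher leak into the selection step and inflate the reported generalization.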

Circularity Check

0 steps flagged

Empirical LOSO cross-validation yields no circularity

Full rationale

The paper's central claims consist of reported R² values and signed-error differences obtained by training ML models on 49 subjects and testing on the held-out subject (leave-one-subject-out). These are direct empirical performance statistics computed on unseen data; they do not reduce by any equation or self-citation to quantities defined in terms of the fitted parameters themselves. No self-definitional steps, fitted-input predictions, uniqueness theorems, or ansatzes appear in the derivation chain. The evaluation protocol is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The paper relies on standard supervised ML evaluation practices and domain assumptions about motion capture data without introducing new free parameters, axioms, or postulated entities beyond typical model training.

free parameters (1)
  • ML model hyperparameters
    Typical in any ML training pipeline; not specified in abstract but would be fitted during model development.
axioms (1)
  • domain assumption: Pitching kinematics from the 50-pitcher sample are representative enough for cross-individual claims
    Invoked when interpreting leave-one-subject-out results as evidence of generalizability.

pith-pipeline@v0.9.0 · 5541 in / 1324 out tokens · 56792 ms · 2026-05-08T15:42:52.787455+00:00 · methodology

discussion (0)

