Multi-Camera Self-Calibration in Sports Motion Capture: Leveraging Human and Stick Poses
Pith reviewed 2026-05-10 05:54 UTC · model grok-4.3
The pith
A tool-free method calibrates multi-camera setups in stick sports by jointly using human body poses and the known length of implements like bats or clubs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Extrinsic calibration of multi-camera systems can be achieved accurately without dedicated tools. A three-stage optimization pipeline jointly exploits two cues from synchronized videos, human body keypoints with unknown metric scale and a rigid stick-like implement of known length, to refine camera extrinsics, reconstruct human and stick trajectories, and resolve global scale via the stick-length constraint.
What carries the argument
Three-stage optimization pipeline that refines camera extrinsics, reconstructs human and stick trajectories, and resolves global scale via the stick-length constraint.
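The scale-resolution step the pipeline hinges on can be sketched concretely. Assuming the stick's two 3D endpoints have already been triangulated up to an unknown global scale, the known physical length pins down the metric factor. The array shapes and the median aggregation below are our assumptions for illustration, not details from the paper:

```python
import numpy as np

def resolve_global_scale(stick_endpoints, known_length_m):
    """Estimate the global metric scale from a scale-ambiguous
    reconstruction of a rigid stick of known physical length.

    stick_endpoints: (T, 2, 3) array of the stick's two 3D endpoints
        over T frames, reconstructed up to an unknown global scale.
    known_length_m: the stick's true length in meters.
    """
    # Per-frame reconstructed stick length (in arbitrary units).
    lengths = np.linalg.norm(
        stick_endpoints[:, 0] - stick_endpoints[:, 1], axis=-1
    )
    # A robust average guards against frames with poor triangulation.
    return known_length_m / np.median(lengths)

# Toy example: a 1.0 m stick reconstructed at half its true size,
# so the recovered scale factor is 2.0.
pts = np.array([[[0.0, 0.0, 0.0], [0.5, 0.0, 0.0]]] * 5)
scale = resolve_global_scale(pts, known_length_m=1.0)
```

Multiplying all reconstructed 3D points and camera translations by `scale` then restores metric units for the whole scene.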
If this is right
- Accurate extrinsic calibration is obtained without any dedicated calibration tools or patterns.
- The first benchmark dataset for this task supplies synthetic sequences across four sports categories with 3 to 10 cameras.
- Low rotation and translation errors are achieved on the introduced dataset, outperforming prior approaches.
- The pipeline supports varying numbers of cameras and multiple stick-based sports without retraining.
Where Pith is reading between the lines
- The same stick-length constraint could serve as a lightweight scale reference in other multi-view setups such as robotics or surveillance where a known rigid object is present.
- If pose detection noise is the main remaining error source, replacing the current keypoint estimator with a more robust network would directly lower calibration residuals.
- Extending the optimization to handle mild asynchrony between cameras would broaden the method to consumer-grade recording setups.
- The synthetic dataset could be used to train end-to-end networks that predict extrinsics directly from raw video clips.
Load-bearing premise
The method assumes access to synchronized multi-camera videos containing both detectable human body keypoints and a rigid stick-like implement of known length.
What would settle it
Apply the method to real multi-camera footage of golf swings where camera positions are independently measured with a traditional checkerboard procedure, and check whether the reported rotation and translation errors remain below a few degrees and centimeters.
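The rotation and translation errors such a check would report are standard pose metrics. A minimal sketch follows; the geodesic-angle formula and Euclidean translation distance are conventional choices, assumed here rather than taken from the paper:

```python
import numpy as np

def rotation_error_deg(R_est, R_gt):
    """Geodesic angle (degrees) between two rotation matrices."""
    cos = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def translation_error_m(t_est, t_gt):
    """Euclidean distance between camera translations (meters)."""
    return float(np.linalg.norm(np.asarray(t_est) - np.asarray(t_gt)))

# Example: an estimate off by a 5-degree yaw around the z-axis.
theta = np.radians(5.0)
R_gt = np.eye(3)
R_est = np.array([
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta),  np.cos(theta), 0.0],
    [0.0,            0.0,           1.0],
])
rot_err = rotation_error_deg(R_est, R_gt)   # ≈ 5.0 degrees
```

Comparing these numbers against a checkerboard-calibrated ground truth is exactly the "few degrees and centimeters" test proposed above.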
Original abstract
Multi-camera systems are widely employed in sports to capture the 3D motion of athletes and equipment, yet calibrating their extrinsic parameters remains costly and labor-intensive. We introduce an efficient, tool-free method for multi-camera extrinsic calibration tailored to sports involving stick-like implements (e.g., golf clubs, bats, hockey sticks). Our approach jointly exploits two complementary cues from synchronized multi-camera videos: (i) human body keypoints with unknown metric scale and (ii) a rigid stick-like implement of known length. We formulate a three-stage optimization pipeline that refines camera extrinsics, reconstructs human and stick trajectories, and resolves global scale via the stick-length constraint. Our method achieves accurate extrinsic calibration without dedicated calibration tools. To benchmark this task, we present the first dataset for multi-camera self-calibration in stick-based sports, consisting of synthetic sequences across four sports categories with 3 to 10 cameras. Comprehensive experiments demonstrate that our method delivers SOTA performance, achieving low rotation and translation errors. Our project page: https://fandulu.github.io/sport_stick_multi_cam_calib/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a three-stage optimization pipeline for multi-camera extrinsic self-calibration in stick-based sports (e.g., golf, baseball). It jointly exploits scale-ambiguous human body keypoints and rigid sticks of known length from synchronized videos to refine camera extrinsics, reconstruct human/stick trajectories, and resolve global scale via the stick-length constraint. A new synthetic dataset with 3–10 cameras across four sports is presented, and experiments claim SOTA performance with low rotation and translation errors.
Significance. If the accuracy claims hold under realistic conditions, the method offers a practical, tool-free alternative to traditional calibration for sports motion capture, potentially reducing setup costs. The introduction of the first benchmark dataset for this specific task is a clear positive contribution that could enable future comparisons.
major comments (2)
- [§4 (Experiments)] All quantitative results (rotation/translation errors, SOTA comparisons) are reported exclusively on synthetic sequences with idealized, noise-free 2D detections. No real-world multi-camera sports footage, no ablation on keypoint detector noise (e.g., OpenPose/HRNet errors), and no occlusion/motion-blur tests are included. This is load-bearing for the central claim of 'accurate extrinsic calibration' because the joint optimization of extrinsics, human trajectories, and stick trajectories is sensitive to 2D errors that propagate into scale drift or local minima.
- [§3.2–3.3 (Optimization Pipeline)] The scale-resolution step applies the known stick length only after trajectory reconstruction; the manuscript provides no analysis of how modest 2D keypoint errors affect convergence or final extrinsic accuracy, nor any initialization sensitivity study. This directly affects whether the three-stage pipeline supports the stated low-error claims outside idealized conditions.
minor comments (2)
- [Abstract and §4] The abstract and results section use vague phrasing ('low rotation and translation errors') without immediately citing the exact numerical values or table rows for the best-performing configurations.
- [Figures in §4] Figure captions and axis labels in the qualitative results could more clearly distinguish camera counts and sports categories to improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the major comments point by point below, clarifying the scope of our synthetic benchmark while committing to targeted revisions that strengthen the analysis of robustness.
Point-by-point responses
- Referee: [§4 (Experiments)] All quantitative results (rotation/translation errors, SOTA comparisons) are reported exclusively on synthetic sequences with idealized, noise-free 2D detections. No real-world multi-camera sports footage, no ablation on keypoint detector noise (e.g., OpenPose/HRNet errors), and no occlusion/motion-blur tests are included. This is load-bearing for the central claim of 'accurate extrinsic calibration' because the joint optimization of extrinsics, human trajectories, and stick trajectories is sensitive to 2D errors that propagate into scale drift or local minima.
Authors: We agree that robustness to realistic 2D detection noise is essential for practical claims. The synthetic dataset was deliberately constructed with noise-free detections to isolate the calibration pipeline's behavior and to establish the first controlled benchmark for this task, enabling precise SOTA comparisons. In the revised manuscript we will add ablations that inject Gaussian noise calibrated to typical OpenPose/HRNet error distributions, as well as simulated occlusion and motion-blur patterns, and report the resulting extrinsic and scale errors. Real-world multi-view sports sequences with accurate ground-truth extrinsics remain difficult to acquire at scale; we will explicitly discuss this limitation and the synthetic results as an upper-bound reference. revision: partial
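The noise ablation promised above could be set up along these lines; the sigma and dropout values are illustrative assumptions, not figures from the paper or from any detector benchmark:

```python
import numpy as np

rng = np.random.default_rng(0)

def inject_keypoint_noise(keypoints_px, sigma_px=3.0, dropout_prob=0.1):
    """Perturb ideal 2D keypoints to mimic detector error.

    keypoints_px: (N, 2) array of pixel coordinates.
    sigma_px: isotropic Gaussian noise; a few pixels is a plausible
        ballpark for modern detectors (an assumption, to be calibrated
        against measured OpenPose/HRNet error distributions).
    dropout_prob: fraction of keypoints marked invisible, mimicking
        occlusion; returned as a boolean visibility mask.
    """
    noisy = keypoints_px + rng.normal(0.0, sigma_px, keypoints_px.shape)
    visible = rng.random(len(keypoints_px)) >= dropout_prob
    return noisy, visible

# Corrupt 1000 ideal detections, then feed the noisy set through the
# calibration pipeline and record extrinsic/scale error vs. sigma_px.
kps = np.zeros((1000, 2))
noisy, visible = inject_keypoint_noise(kps, sigma_px=3.0, dropout_prob=0.1)
```

Sweeping `sigma_px` and `dropout_prob` over a grid would yield the robustness curves the referee asks for.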
- Referee: [§3.2–3.3 (Optimization Pipeline)] The scale-resolution step applies the known stick length only after trajectory reconstruction; the manuscript provides no analysis of how modest 2D keypoint errors affect convergence or final extrinsic accuracy, nor any initialization sensitivity study. This directly affects whether the three-stage pipeline supports the stated low-error claims outside idealized conditions.
Authors: We will add a dedicated sensitivity subsection (likely in §4) that quantifies the effect of increasing 2D keypoint noise on convergence rate, final rotation/translation errors, and scale accuracy. The study will also include an initialization sensitivity analysis by applying controlled perturbations to the initial camera poses and reporting success rates and accuracy statistics across multiple random seeds. These additions will directly address how the three-stage pipeline behaves beyond the noise-free setting. revision: yes
- Not included in the planned revision: quantitative evaluation on real-world multi-camera sports footage with precise ground-truth extrinsics, owing to the substantial practical difficulties in collecting and annotating such data.
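The initialization sensitivity study promised in the rebuttal amounts to perturbing each camera's starting pose by a controlled rotation and translation before re-running the optimization. A sketch using Rodrigues' formula; the perturbation magnitudes are illustrative assumptions:

```python
import numpy as np

def perturb_pose(R, t, rot_deg=5.0, trans_m=0.05, rng=None):
    """Apply a random rotation of magnitude rot_deg (degrees, about a
    random axis) and a random translation offset of magnitude trans_m
    (meters) to a camera pose, for initialization-sensitivity trials."""
    rng = rng or np.random.default_rng()
    axis = rng.normal(size=3)
    axis /= np.linalg.norm(axis)
    angle = np.radians(rot_deg)
    # Rodrigues' formula: rotation by `angle` about unit `axis`.
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    dR = np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)
    dt = rng.normal(size=3)
    dt *= trans_m / np.linalg.norm(dt)
    return dR @ R, t + dt

# Perturb an identity pose; repeat over many seeds and report the
# optimizer's success rate and final accuracy statistics.
R1, t1 = perturb_pose(np.eye(3), np.zeros(3), rot_deg=5.0,
                      trans_m=0.05, rng=np.random.default_rng(1))
```

Running the pipeline from many such perturbed starts, and counting how often it converges back to the ground-truth extrinsics, gives the success-rate statistic the referee requests.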
Circularity Check
No circularity; scale constraint is an external independent input.
Full rationale
The claimed three-stage pipeline refines extrinsics and trajectories while using the known physical length of the stick as an external constraint to resolve the metric scale that is otherwise ambiguous from human keypoints alone. This length is supplied as a fixed, independent measurement rather than being fitted or derived from the same data being calibrated. No equations or steps in the provided description reduce the output calibration to a tautology or to a self-citation chain; the optimization is a standard bundle-adjustment-style procedure with an added rigid-length prior. Synthetic-data experiments do not alter the logical independence of the derivation.
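The "bundle-adjustment-style procedure with an added rigid-length prior" can be made concrete as two residual types in one least-squares objective. This is a generic sketch of such an objective, not the paper's exact formulation; all names and values are illustrative:

```python
import numpy as np

def stick_length_residual(p1, p2, known_length_m, weight=1.0):
    """Rigid-length prior: zero when the reconstructed stick endpoints
    p1, p2 are exactly known_length_m apart. Because known_length_m is
    an external measurement, this term injects metric scale without
    circularity."""
    return weight * (np.linalg.norm(p1 - p2) - known_length_m)

def reprojection_residual(K, R, t, X, uv):
    """Standard pinhole reprojection residual: 3D point X, camera
    intrinsics K, extrinsics (R, t), observed pixel uv."""
    x_cam = R @ X + t
    proj = (K @ x_cam)[:2] / x_cam[2]
    return proj - uv

# A point on the optical axis at depth 2 m projects to the principal
# point, so its residual against that observation is zero.
K = np.array([[1000.0, 0.0, 500.0],
              [0.0, 1000.0, 500.0],
              [0.0, 0.0, 1.0]])
r = reprojection_residual(K, np.eye(3), np.zeros(3),
                          np.array([0.0, 0.0, 2.0]),
                          np.array([500.0, 500.0]))
```

Stacking both residual types over all cameras, frames, and keypoints and minimizing with a nonlinear least-squares solver yields the joint refinement described above.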
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: Human body keypoints can be reliably detected from synchronized multi-view video with unknown metric scale.
- Domain assumption: The stick-like implement is rigid and has a known physical length.