arxiv: 2605.29953 · v1 · pith:WW23EAHSnew · submitted 2026-05-28 · 💻 cs.CV

Mesh-Aware Epipolar Matching for Multi-View Multi-Person 3D Pose Estimation in Basketball

Pith reviewed 2026-06-29 08:27 UTC · model grok-4.3

classification 💻 cs.CV
keywords multi-view 3D pose estimation · multi-person · epipolar matching · mesh recovery · training-free · cross-view association · basketball

The pith

Mesh geometry from monocular recovery enables training-free cross-view association for multi-person 3D pose in basketball.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a training-free framework that takes monocular 3D human mesh outputs and applies a two-stage epipolar matching process to link the same player across camera views. This replaces reliance on 2D keypoints alone with denser geometric cues for clustering and triangulation, addressing the occlusions and uniform appearances common in team sports. If the approach works, multi-view 3D reconstruction becomes feasible with existing mesh models, without collecting new annotated multi-view data or retraining for each court. The method reports lower errors than prior training-free association baselines on two public basketball datasets while remaining competitive among RGB-only methods.
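
The epipolar cue behind this kind of matching can be illustrated with a minimal sketch. It assumes calibrated cameras with known 3x4 projection matrices and uses the symmetric point-to-epipolar-line distance from standard two-view geometry. The function names, the toy cameras, and the choice of this particular distance are assumptions for illustration; this summary does not state the paper's exact matching cost.

```python
import numpy as np

def fundamental_from_projections(P1, P2):
    """Fundamental matrix F with x2^T F x1 = 0, from two 3x4 projection matrices."""
    # The camera centre of view 1 is the right null vector of P1.
    _, _, vt = np.linalg.svd(P1)
    c1 = vt[-1]
    e2 = P2 @ c1  # epipole: image of camera centre 1 in view 2
    e2_cross = np.array([[0.0, -e2[2], e2[1]],
                         [e2[2], 0.0, -e2[0]],
                         [-e2[1], e2[0], 0.0]])
    return e2_cross @ P2 @ np.linalg.pinv(P1)

def symmetric_epipolar_distance(F, pts1, pts2):
    """Mean symmetric point-to-epipolar-line distance (pixels) over N correspondences."""
    x1 = np.hstack([pts1, np.ones((len(pts1), 1))])
    x2 = np.hstack([pts2, np.ones((len(pts2), 1))])
    l2 = x1 @ F.T  # epipolar lines in view 2, one per row
    l1 = x2 @ F    # epipolar lines in view 1, one per row
    d2 = np.abs(np.sum(x2 * l2, axis=1)) / np.hypot(l2[:, 0], l2[:, 1])
    d1 = np.abs(np.sum(x1 * l1, axis=1)) / np.hypot(l1[:, 0], l1[:, 1])
    return float(np.mean(0.5 * (d1 + d2)))

# Toy setup (hypothetical): two identical cameras offset along the x axis.
K = np.array([[1000.0, 0, 640], [0, 1000.0, 360], [0, 0, 1]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
F = fundamental_from_projections(P1, P2)
```

With many projected mesh vertices per person instead of a handful of joints, the averaged distance has more points to draw on, which is presumably the sense in which the mesh provides a "denser" cue.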

Core claim

A monocular 3D mesh recovery frontend supplies dense surface geometry that supports two-stage epipolar matching; the first stage uses disjoint-set-union clustering on mesh-derived epipolar distances to group candidate views per person, and the second stage performs per-joint triangulation on the resulting consistent sets to produce final 3D poses.
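
The clustering stage can be sketched with a plain union-find structure. This is a hedged illustration, not the authors' implementation: the `cost` callback stands in for a mesh-derived epipolar distance, the threshold is a free choice, and the rule that a cluster may hold at most one detection per camera is an added assumption that this summary does not confirm.

```python
class DisjointSet:
    """Union-find with path halving and union by size."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]  # path halving
            i = self.parent[i]
        return i

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return ra
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return ra


def cluster_detections(views, cost, threshold):
    """Group detections into per-person clusters.

    views[i] is the camera index of detection i; cost(i, j) returns a
    pairwise distance (hypothetical stand-in for an epipolar cost).
    """
    n = len(views)
    dsu = DisjointSet(n)
    cams = {i: {views[i]} for i in range(n)}  # cameras present per cluster root
    pairs = sorted((cost(i, j), i, j)
                   for i in range(n) for j in range(i + 1, n)
                   if views[i] != views[j])
    for c, i, j in pairs:  # cheapest links first
        if c > threshold:
            break
        ri, rj = dsu.find(i), dsu.find(j)
        # Assumption: never merge clusters that already share a camera.
        if ri == rj or cams[ri] & cams[rj]:
            continue
        root = dsu.union(ri, rj)
        cams[root] = cams[ri] | cams[rj]
    clusters = {}
    for i in range(n):
        clusters.setdefault(dsu.find(i), []).append(i)
    return sorted(clusters.values())
```

Each returned cluster lists the detection indices assigned to one person; clusters containing at least two views can then be passed to triangulation.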

What carries the argument

Two-stage mesh-aware epipolar matching that combines disjoint-set-union clustering of mesh points with per-joint triangulation to link identities across views.
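
Per-joint triangulation is commonly done with the linear (DLT) method from multiple-view geometry. The sketch below assumes that approach, known projection matrices, and a per-view visibility mask; the function names and the rule of returning NaN for joints seen in fewer than two views are illustrative choices, not details taken from the paper.

```python
import numpy as np

def triangulate_joint(projections, points_2d):
    """Linear (DLT) triangulation of one joint from two or more views."""
    rows = []
    for P, (u, v) in zip(projections, points_2d):
        rows.append(u * P[2] - P[0])  # each view adds two linear constraints
        rows.append(v * P[2] - P[1])
    _, _, vt = np.linalg.svd(np.stack(rows))
    X = vt[-1]                         # homogeneous least-squares solution
    return X[:3] / X[3]

def triangulate_pose(projections, keypoints, visible):
    """keypoints: (V, J, 2) pixels; visible: (V, J) bool.

    Returns a (J, 3) array; joints seen in fewer than two views stay NaN.
    """
    num_views, num_joints = visible.shape
    pose = np.full((num_joints, 3), np.nan)
    for j in range(num_joints):
        views = [v for v in range(num_views) if visible[v, j]]
        if len(views) >= 2:
            pose[j] = triangulate_joint([projections[v] for v in views],
                                        [keypoints[v, j] for v in views])
    return pose
```

Triangulating each joint separately lets an occluded joint in one view be dropped without discarding that view for the remaining joints.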

If this is right

  • Cross-view association no longer depends on 2D keypoint detections alone, which reduces failures under occlusion.
  • Team-uniform similarity no longer limits identity matching because mesh shape supplies an additional cue.
  • No target-domain fine-tuning or multi-view labels are required, so the pipeline applies directly to new courts.
  • Indoor and outdoor basketball scenarios both show gains over earlier training-free association methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same mesh-to-epipolar pipeline could extend to other team sports that share the same occlusion and team-uniform problems.
  • Monocular mesh models might serve as drop-in replacements for 2D detectors in any multi-view geometry task.
  • If mesh consistency across views can be further enforced, reconstruction accuracy would rise without extra supervision.

Load-bearing premise

The monocular mesh recovery model must produce mesh outputs that remain sufficiently accurate and consistent across different camera views.

What would settle it

A dataset in which the same monocular mesh model yields visibly inconsistent 3D surfaces for the same player across synchronized views, causing the epipolar distances to group wrong players or inflate triangulation error.

Original abstract

Multi-view multi-person 3D pose estimation in team sports scenarios remains challenging due to player occlusions, appearance similarity caused by team uniforms, and the scarcity of annotated multi-view data, all of which limit the effectiveness and generalization capability of learning-based methods. In contrast, the performance of training-free approaches is inherently constrained by the accuracy of 2D keypoint detection and the robustness of cross-view association. To address these challenges, we propose Mesh-Aware Epipolar Matching (MAEM), a training-free framework for multi-view multi-person 3D pose estimation. Our method employs a monocular 3D human mesh recovery model as the frontend and introduces a two-stage epipolar matching strategy based on the recovered mesh outputs. Specifically, the proposed framework combines disjoint-set-union-based clustering with per-joint triangulation to achieve robust cross-view association and accurate 3D pose reconstruction. Experiments on two public multi-view basketball datasets demonstrate that MAEM consistently outperforms existing training-free association baselines while achieving competitive RGB-only performance in both indoor and outdoor basketball scenarios. MAEM achieves MPJPE/PA-MPJPE scores of 59.8/40.7 mm on SportCenter EPFL and 74.0/51.8 mm on Human-M3 Basketball, highlighting the effectiveness of dense mesh geometry for cross-view association without requiring target-domain training or fine-tuning.
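
For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how MPJPE and PA-MPJPE are usually computed for a single pose. It assumes the common definitions (mean per-joint Euclidean error, and the same error after a similarity Procrustes alignment); the paper's exact evaluation protocol, joint set, and averaging scheme are not given in this summary.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error; pred and gt are (J, 3) in the same units."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def procrustes_align(pred, gt):
    """Similarity-align pred to gt (rotation, uniform scale, translation)."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    U, S, Vt = np.linalg.svd(p.T @ g)            # cross-covariance of centred poses
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # guard against reflections
    D = np.array([1.0, 1.0, d])
    R = Vt.T @ np.diag(D) @ U.T
    scale = (S * D).sum() / (p ** 2).sum()
    return scale * p @ R.T + mu_g

def pa_mpjpe(pred, gt):
    """MPJPE after Procrustes alignment of the prediction to the ground truth."""
    return mpjpe(procrustes_align(pred, gt), gt)
```

Because PA-MPJPE removes global rotation, scale, and translation, it is never larger than a well-aligned MPJPE and isolates errors in body articulation; the gap between the two reported numbers reflects global placement error.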

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Mesh-Aware Epipolar Matching (MAEM), a training-free framework for multi-view multi-person 3D pose estimation in basketball. It uses a monocular 3D human mesh recovery model as its frontend, followed by a two-stage epipolar matching strategy that combines disjoint-set-union clustering with per-joint triangulation for cross-view association and 3D reconstruction. Experiments on SportCenter EPFL and Human-M3 Basketball report MPJPE/PA-MPJPE of 59.8/40.7 mm and 74.0/51.8 mm respectively, and the paper claims consistent outperformance over training-free baselines and competitive RGB-only performance.

Significance. If the results hold, this would be a meaningful contribution by demonstrating that dense mesh geometry can improve cross-view association in team sports without target-domain training, addressing challenges of occlusions and uniform appearances. The training-free nature and reliance on standard epipolar geometry plus an external mesh model are strengths for reproducibility and generalization.

major comments (2)
  1. [Method] Method section: The outperformance claim over 2D-keypoint baselines rests on the precondition that the monocular mesh recovery frontend produces sufficiently accurate and view-consistent 3D meshes across views. No quantitative validation, ablation, or consistency analysis of the mesh outputs is provided on the basketball datasets with occlusions and identical uniforms, making this assumption load-bearing for the central claim.
  2. [Experiments] Experiments section: The reported MPJPE/PA-MPJPE scores (59.8/40.7 mm and 74.0/51.8 mm) are presented without error bars, details on baseline re-implementations, data splits, or failure-case analysis, which undermines verification of the 'consistently outperforms' claim.
minor comments (1)
  1. [Abstract] Abstract: Dataset names (SportCenter EPFL, Human-M3 Basketball) appear only in the results sentence; moving them earlier would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below and will incorporate revisions to strengthen the manuscript.

  1. Referee: [Method] Method section: The outperformance claim over 2D-keypoint baselines rests on the precondition that the monocular mesh recovery frontend produces sufficiently accurate and view-consistent 3D meshes across views. No quantitative validation, ablation, or consistency analysis of the mesh outputs is provided on the basketball datasets with occlusions and identical uniforms, making this assumption load-bearing for the central claim.

    Authors: We agree this is a valid concern and that the mesh frontend's accuracy is central to the claims. The framework relies on an off-the-shelf monocular mesh model without target-domain fine-tuning. To address the gap, the revised manuscript will include quantitative validation such as 2D reprojection errors of the recovered meshes against detected keypoints, cross-view mesh consistency metrics, and qualitative examples on the basketball datasets.

    Revision: yes

  2. Referee: [Experiments] Experiments section: The reported MPJPE/PA-MPJPE scores (59.8/40.7 mm and 74.0/51.8 mm) are presented without error bars, details on baseline re-implementations, data splits, or failure-case analysis, which undermines verification of the 'consistently outperforms' claim.

    Authors: We acknowledge that additional experimental details would improve verifiability. The revised version will add error bars (where multiple evaluations are feasible), explicit descriptions of baseline re-implementations including hyperparameters and code references, data split details, and a brief failure-case analysis to better support the performance comparisons.

    Revision: yes
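
The reprojection check mentioned in the first response could look roughly like the sketch below. It assumes mesh joints already expressed in world coordinates, a known 3x4 projection matrix, and detector confidences used to mask unreliable keypoints; the function names and the `min_conf` default are hypothetical and not taken from the paper.

```python
import numpy as np

def project(P, points_3d):
    """Project (N, 3) world points with a 3x4 projection matrix to (N, 2) pixels."""
    homog = np.hstack([points_3d, np.ones((len(points_3d), 1))])
    x = homog @ P.T
    return x[:, :2] / x[:, 2:3]

def reprojection_error(P, joints_3d, keypoints_2d, conf=None, min_conf=0.3):
    """Mean pixel distance between projected mesh joints and detected keypoints.

    conf is an optional (N,) array of detector confidences; keypoints below
    min_conf (an assumed threshold) are ignored. Returns NaN if none remain.
    """
    err = np.linalg.norm(project(P, joints_3d) - keypoints_2d, axis=1)
    if conf is not None:
        mask = conf >= min_conf
        if not mask.any():
            return float("nan")
        err = err[mask]
    return float(err.mean())
```

Reporting this value per view and per dataset would indicate how well the off-the-shelf mesh model fits the basketball imagery before any cross-view matching takes place.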

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper describes a training-free pipeline that invokes an external monocular mesh recovery model as frontend, then applies standard epipolar geometry plus DSU clustering for association. No equations are shown that define any output quantity in terms of itself, no fitted parameters are relabeled as predictions, and no load-bearing uniqueness claims rest on self-citations. Reported MPJPE numbers are empirical results on held-out public datasets rather than quantities forced by the method's own definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach depends on the domain assumption that an off-the-shelf monocular mesh recovery model supplies usable 3D geometry; no free parameters or new entities are introduced in the abstract.

axioms (1)
  • domain assumption: A monocular 3D human mesh recovery model supplies sufficiently accurate 3D geometry from single views to support cross-view matching.
    The entire pipeline treats the mesh recovery output as a reliable frontend.

pith-pipeline@v0.9.1-grok · 5798 in / 1341 out tokens · 32915 ms · 2026-06-29T08:27:48.130907+00:00 · methodology


Fig. S1: Detailed flowchart of the MAEM pipeline, where each processing step is described textually. Given multi-view images, Stage 1 recovers per-person 3D mesh...