SoccerMaster: A Vision Foundation Model for Soccer Understanding

Haolin Yang; Haoning Wu; Jiayuan Rao; Weidi Xie

arXiv: 2512.11016 · v2 · submitted 2025-12-11 · cs.CV · cs.AI

Pith reviewed 2026-05-16 23:05 UTC · model grok-4.3

keywords: soccer understanding, vision foundation model, multi-task pretraining, sports video analysis, athlete detection, event classification, automated data curation, SoccerFactory

The pith

A single soccer-specific vision foundation model unifies detection, identification, and event reasoning while outperforming separate expert models on each task.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SoccerMaster as the first vision foundation model built specifically for soccer, trained through supervised multi-task pretraining on a mix of existing video datasets. It introduces SoccerFactory, an automated pipeline that generates large-scale spatial annotations without manual labeling, to support this unified training. The model is shown to handle both low-level perception tasks such as athlete detection and high-level reasoning tasks such as event classification within one network. Evaluations indicate that this single model surpasses task-specific expert models across the tested downstream tasks. The result suggests that domain-focused multi-task pretraining can reduce the need for separate specialized systems in sports video analysis.

Core claim

SoccerMaster is a unified vision foundation model that performs supervised multi-task pretraining on soccer data curated by the SoccerFactory pipeline and integrated public datasets; this single model then outperforms dedicated task-specific expert models on a range of downstream soccer visual understanding tasks that span fine-grained perception such as athlete detection and identification to high-level semantic reasoning such as event classification.

What carries the argument

SoccerMaster, a vision foundation model trained via supervised multi-task pretraining on soccer-specific spatial annotations generated by the automated SoccerFactory pipeline.

If this is right

One pretrained network can replace multiple separate models for detection, identification, and event classification in soccer footage.
Automated spatial annotation pipelines can supply the volume of labeled data needed for multi-task pretraining without proportional manual effort.
Integrating several existing soccer datasets yields a richer pretraining resource than any single dataset alone.
The same architecture demonstrates measurable gains on both low-level perception and high-level reasoning tasks after the shared pretraining stage.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same data-curation approach could be adapted to create foundation models for other team sports that share similar visual structure and event semantics.
A deployed SoccerMaster could reduce engineering overhead in broadcast analysis systems by serving multiple query types from one model checkpoint.
Extending the pretraining to include temporal video clips rather than single frames might further improve performance on action and event tasks that depend on motion.

Load-bearing premise

The SoccerFactory automated annotation pipeline produces labels of high enough quality and without systematic biases that would prevent effective multi-task pretraining or downstream generalization.

What would settle it

A controlled evaluation on a held-out soccer video dataset where SoccerMaster achieves lower accuracy than the best task-specific expert model on at least two of the reported downstream tasks.

Figures

Figures reproduced from arXiv: 2512.11016 by Haolin Yang, Haoning Wu, Jiayuan Rao, Weidi Xie.

**Figure 1.** SoccerMaster is a unified soccer-specific vision foundation model that leverages diverse soccer content, including images and videos, to support a wide range of soccer understanding tasks, such as commentary generation, detection, tracking, classification, etc.

**Figure 2.** Automated Data Curation Pipeline. Our pipeline processes input videos through three stages: (i) field registration establishes geometric correspondences between image and canonical pitch coordinates via keypoint detection; (ii) tracking and identification transforms frames into athlete trajectories through detection, role and team classification, and ReID-based tracking; and (iii) post-processing refineme…

**Figure 3.** SoccerMaster Architecture. (a) The architecture of SoccerMaster, which encodes both soccer videos and images through spatial and temporal attention modules to generate semantically rich representations. (b) The pretraining tasks and downstream adaptations of SoccerMaster across both spatial perception and semantic understanding tasks.

**Figure 4.** Qualitative Results of our Automatic Curation Pipeline. Comparison between our predictions (left) and ground truth annotations (right) on the SoccerNet-GSR test set. Our pipeline demonstrates robust performance across diverse scenarios.

**Figure 5.** Top-view Pitch Visualization of Pipeline Results. Athlete positions are mapped to standardized pitch coordinates via estimated camera parameters. Each row is organized as: input image (left), our predictions (middle), and ground truth annotations (right). Athletes are color-coded by role: referees (orange, labeled “RE”), left team (red), and right team (blue). Non-referee athletes are labeled with arbitrar…

**Figure 6.** Qualitative Results of SoccerMaster. SoccerMaster can simultaneously execute multiple soccer understanding tasks on a video clip, including athlete detection, pitch registration, multiple object tracking, event classification, and commentary generation. Frames are arranged in temporal order from left to right and top to bottom.
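
The captions for Figures 2 and 5 describe mapping athlete positions from the image to canonical pitch coordinates. A minimal sketch of that mapping step follows, assuming the camera estimate has already been reduced to a 3x3 plane homography and that the bottom-centre of a bounding box approximates where the athlete touches the ground. Both assumptions, along with the function names `image_to_pitch` and `foot_point`, are illustrative and are not taken from the paper.

```python
def image_to_pitch(h, x, y):
    """Map an image pixel (x, y) to pitch-plane coordinates with a 3x3 homography.

    `h` is a nested list [[h00, h01, h02], [h10, h11, h12], [h20, h21, h22]].
    The homography is assumed to be given; estimating it is out of scope here.
    """
    u = h[0][0] * x + h[0][1] * y + h[0][2]
    v = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    if abs(w) < 1e-12:
        raise ValueError("point maps to infinity under this homography")
    # Divide by the homogeneous coordinate to get planar pitch coordinates.
    return u / w, v / w


def foot_point(box):
    """Bottom-centre of a box (x1, y1, x2, y2), used as a ground-contact proxy."""
    x1, y1, x2, y2 = box
    return (x1 + x2) / 2.0, y2
```

Only points on the pitch plane map correctly under a homography, which is why the sketch projects the foot point rather than the box centre.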

Original abstract

Soccer understanding has recently garnered growing research interest due to its domain-specific complexity and unique challenges. Unlike prior works that typically rely on isolated, task-specific expert models, this work aims to propose a unified model to handle diverse soccer visual understanding tasks, ranging from fine-grained perception (e.g., athlete detection and identification) to high-level semantic reasoning (e.g., event classification). Concretely, our contributions are threefold: (i) we present SoccerMaster, the first soccer-specific vision foundation model that unifies diverse tasks within a single framework via supervised multi-task pretraining; (ii) we develop an automated data curation pipeline, SoccerFactory, to generate scalable spatial annotations, and integrate multiple existing soccer video datasets as a comprehensive pretraining data resource for multi-task pretraining; and (iii) we conduct extensive evaluations demonstrating that SoccerMaster consistently outperforms task-specific expert models across diverse downstream tasks, highlighting its breadth and superiority. The data, code, and model will be publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SoccerMaster applies standard multi-task pretraining to a merged soccer dataset but leaves the quality of its automated labels unverified.

The core of this paper is SoccerMaster, a single vision model pretrained with supervised multi-task learning on soccer video. The authors introduce SoccerFactory, an automated pipeline that generates spatial annotations, and combine its output with existing soccer video datasets to train one backbone that reportedly beats separate expert models on tasks ranging from player detection and identification to event classification. The promised release of data, code, and weights is a clear positive for anyone who wants to build on it. The breadth claim is straightforward: one model covering both low-level perception and higher-level reasoning in the same domain is a practical direction if the numbers check out.

The main gap is that the abstract gives no quantitative checks on SoccerFactory label quality. No human agreement scores, precision on held-out frames, or breakdown of errors such as occlusion or motion blur appear in the summary. Because every reported gain rests on those labels being accurate and unbiased, the absence of that validation makes it hard to know whether the outperformance is real or an artifact of how the data was generated. The training recipe itself follows the usual multi-task supervised pattern with no new equations or formal guarantees.

This work is aimed at researchers in sports computer vision and people experimenting with domain-specific foundation models. A reader who needs a starting point for unified soccer analysis would get some value from the dataset integration idea. I would send it for peer review so that referees can examine the full evaluation tables, ablations, and any label validation that exists in the manuscript.

Referee Report

3 major / 2 minor

Summary. The paper introduces SoccerMaster, presented as the first soccer-specific vision foundation model, which unifies diverse tasks (athlete detection/identification, event classification, etc.) via supervised multi-task pretraining. It contributes SoccerFactory, an automated pipeline that generates scalable spatial annotations, integrates multiple existing soccer video datasets into a pretraining resource, and reports extensive evaluations showing consistent outperformance over task-specific expert models on downstream tasks. The data, code, and model are promised to be released publicly.

Significance. If the outperformance claims hold after rigorous validation of annotation quality and ablations, the work would provide a useful unified baseline for soccer video understanding, demonstrating the value of multi-task pretraining in a specialized domain and potentially reducing reliance on separate expert models for perception and reasoning tasks.

major comments (3)

§3 (SoccerFactory pipeline): the description of generating 'scalable spatial annotations' by integrating existing datasets provides no quantitative validation such as precision/recall on held-out frames, human agreement scores, or error typology for detection, jersey identification, or event labels. This is load-bearing for the central claim because any systematic biases (e.g., under-detection of occluded players) could artifactually drive the reported downstream gains.
Experiments section and results tables: the manuscript asserts 'consistent outperformance' and 'breadth and superiority' but the summary provides no detailed quantitative tables, ablation studies on task weighting or data scale, or error analysis. Without these, attribution of gains specifically to the unified multi-task approach versus data volume or task selection remains unverifiable.
§4 (Evaluation protocol): potential data leakage or inconsistent train/test splits between the SoccerFactory pretraining corpus and downstream benchmarks is not addressed, nor are details on whether task-specific baselines were trained with equivalent data volume and augmentation. This directly affects the fairness of the superiority claims.
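
The precision/recall check requested in the first major comment can be sketched as below for a single held-out frame. This is an illustrative sketch, not the paper's evaluation code: the greedy one-to-one matching, the 0.5 IoU threshold, and the names `iou` and `precision_recall` are all assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(pred_boxes, gt_boxes, iou_thresh=0.5):
    """Greedily match each prediction to at most one unmatched ground-truth box."""
    unmatched_gt = list(range(len(gt_boxes)))
    true_positives = 0
    for pred in pred_boxes:
        best_j, best_iou = None, iou_thresh
        for j in unmatched_gt:
            score = iou(pred, gt_boxes[j])
            if score >= best_iou:
                best_j, best_iou = j, score
        if best_j is not None:
            # A ground-truth box can only be claimed once.
            unmatched_gt.remove(best_j)
            true_positives += 1
    precision = true_positives / len(pred_boxes) if pred_boxes else 0.0
    recall = true_positives / len(gt_boxes) if gt_boxes else 0.0
    return precision, recall
```

Reporting these numbers separately for occluded and unoccluded athletes would speak directly to the bias concern raised above.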

minor comments (2)

Abstract: specify the exact number of tasks, datasets, and downstream benchmarks to allow readers to assess the scope of the 'extensive evaluations' claim.
Notation and methods: provide the explicit multi-task loss formulation and how per-task weights are set or learned, as this is needed for reproducibility.
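
For orientation, the kind of formulation the last comment asks for is often a weighted sum of per-task losses. The generic form below is an assumption offered for illustration, not the paper's loss; the symbols $T$ (number of tasks), $\lambda_t$ (per-task weight), $\theta_{\text{shared}}$ (backbone parameters), and $\theta_t$ (task-head parameters) are hypothetical.

```latex
\mathcal{L}_{\text{total}}
  = \sum_{t=1}^{T} \lambda_t \, \mathcal{L}_t\!\left(\theta_{\text{shared}}, \theta_t\right),
\qquad \lambda_t \ge 0
```

Whether the $\lambda_t$ are fixed by hand, tuned on validation data, or learned is exactly the detail the referee asks the authors to state.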

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to incorporate the suggested clarifications and additions, which will strengthen the presentation of SoccerFactory, the experimental results, and the evaluation protocol.

Referee: §3 (SoccerFactory pipeline): the description of generating 'scalable spatial annotations' by integrating existing datasets provides no quantitative validation such as precision/recall on held-out frames, human agreement scores, or error typology for detection, jersey identification, or event labels. This is load-bearing for the central claim because any systematic biases (e.g., under-detection of occluded players) could artifactually drive the reported downstream gains.

Authors: We agree that quantitative validation of the automated annotations is necessary to substantiate the pipeline's reliability. In the revised manuscript we will add a dedicated validation subsection reporting precision/recall on held-out frames, inter-annotator agreement scores on a sampled subset, and a categorized error analysis covering detection misses, jersey mis-identifications, and event label inaccuracies. These metrics will be computed against manual ground truth and will explicitly discuss potential biases such as occlusion handling. revision: yes
Referee: Experiments section and results tables: the manuscript asserts 'consistent outperformance' and 'breadth and superiority' but the summary provides no detailed quantitative tables, ablation studies on task weighting or data scale, or error analysis. Without these, attribution of gains specifically to the unified multi-task approach versus data volume or task selection remains unverifiable.

Authors: We will expand the Experiments section with full per-task quantitative tables (including all baselines), new ablation studies varying task weights and pretraining data scale, and a systematic error analysis. These additions will allow readers to isolate the contribution of multi-task pretraining from data volume effects and will be placed in the main text or a comprehensive appendix. revision: yes
Referee: §4 (Evaluation protocol): potential data leakage or inconsistent train/test splits between the SoccerFactory pretraining corpus and downstream benchmarks is not addressed, nor are details on whether task-specific baselines were trained with equivalent data volume and augmentation. This directly affects the fairness of the superiority claims.

Authors: We will add an explicit evaluation-protocol subsection that documents the exact train/test splits, confirms temporal and video-level separation between the SoccerFactory pretraining corpus and all downstream benchmarks to preclude leakage, and states that every task-specific baseline was re-trained using the same data volume and augmentation pipeline as SoccerMaster. These details will be summarized in a table for transparency. revision: yes
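
The video-level separation promised in the last response can be checked mechanically. The sketch below assumes frames are stored under paths of the form `<video_id>/<frame>.jpg`; that layout and the names `video_id` and `leaked_videos` are hypothetical, chosen only to illustrate the check.

```python
def video_id(frame_path):
    """Extract the video identifier from a path shaped like '<video_id>/<frame>.jpg'.

    The path layout is an assumption made for this example.
    """
    return frame_path.rsplit("/", 1)[0]


def leaked_videos(pretrain_frames, test_frames):
    """Return the sorted video IDs that appear in both pretraining and test data."""
    pretrain_videos = {video_id(p) for p in pretrain_frames}
    test_videos = {video_id(p) for p in test_frames}
    return sorted(pretrain_videos & test_videos)
```

An empty result rules out overlap at the video level only. It does not catch the same match appearing under a different identifier, for example in two source datasets.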

Circularity Check

0 steps flagged

No circularity: standard empirical multi-task pretraining with independent evaluation

The paper's derivation consists of curating data via SoccerFactory, performing supervised multi-task pretraining, and reporting empirical outperformance on downstream tasks. No equations, fitted parameters, or self-referential definitions appear in the provided text. The central claim reduces to measured accuracy on held-out evaluations rather than any quantity defined by construction from the inputs. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps. This is a conventional empirical ML contribution whose validity hinges on data quality and experimental controls, not on internal circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the effectiveness of supervised multi-task pretraining for cross-task generalization and on the assumption that SoccerFactory produces reliable spatial annotations at scale; both are standard domain assumptions in computer vision rather than new axioms or invented entities.

axioms (1)

domain assumption: Supervised multi-task pretraining on combined soccer datasets improves performance on individual downstream tasks compared with single-task training.
Invoked in the description of the pretraining stage and the claim of outperformance.

pith-pipeline@v0.9.0 · 5471 in / 1234 out tokens · 30481 ms · 2026-05-16T23:05:32.783006+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: SoccerMaster... unifies diverse tasks within a single framework via supervised multi-task pretraining... automated data curation pipeline, SoccerFactory

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: hierarchical vision transformer... spatial attention in the first L_s layers and... spatiotemporal attention only in the final L_st layers

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SoccerLens: Grounded Soccer Video Understanding Beyond Accuracy
cs.CV · 2026-05 · unverdicted · novelty 7.0
SoccerLens benchmark shows state-of-the-art soccer VLMs achieve strong classification accuracy yet fail to exceed 50% grounding performance on annotated visual cues and underutilize temporal information.

SoccerLens: Grounded Soccer Video Understanding Beyond Accuracy
cs.CV · 2026-05 · unverdicted · novelty 7.0
SoccerLens benchmark shows state-of-the-art soccer VLMs achieve high classification accuracy yet fail to exceed 50% visual grounding performance and underutilize temporal information.

Towards Temporal Compositional Reasoning in Long-Form Sports Videos
cs.CV · 2026-04 · unverdicted · novelty 7.0
SportsTime benchmark and CoTR method improve multimodal AI's temporal compositional reasoning and evidence grounding in long-form sports videos.

Reference graph

Works this paper leans on

90 extracted references · 90 canonical work pages · cited by 2 Pith papers · 3 internal anchors

[1]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 3

work page Pith review Pith/arXiv arXiv 2025
[2]

Jersey number recognition using keyframe identification from low-resolution broadcast videos

Bavesh Balaji, Jerrin Bright, Harish Prakash, Yuhao Chen, David A Clausi, and John Zelek. Jersey number recognition using keyframe identification from low-resolution broadcast videos. InProceedings of the 6th International Workshop on Multimedia Content Analysis in Sports, 2023. 1, 2

work page 2023
[3]

Meteor: An automatic metric for mt evaluation with improved correlation with hu- man judgments

Satanjeev Banerjee and Alon Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with hu- man judgments. InProceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005. 9

work page 2005
[4]

Evaluating mul- tiple object tracking performance: the clear mot metrics

Keni Bernardin and Rainer Stiefelhagen. Evaluating mul- tiple object tracking performance: the clear mot metrics. EURASIP Journal on Image and Video Processing, 2008. 8

work page 2008
[5]

Is space-time attention all you need for video understanding? InProceedings of the International Conference on Machine Learning, 2021

Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video understanding? InProceedings of the International Conference on Machine Learning, 2021. 5

work page 2021
[6]

Observation-centric sort: Rethinking sort for robust multi-object tracking

Jinkun Cao, Jiangmiao Pang, Xinshuo Weng, Rawal Khirod- kar, and Kris Kitani. Observation-centric sort: Rethinking sort for robust multi-object tracking. InProceedings of the IEEE Conference on Computer Vision and Pattern Recogni- tion, 2023. 8

work page 2023
[7]

Camera calibration and player local- ization in soccernet-v2 and investigation of their representa- tions for action spotting

Anthony Cioppa, Adrien Deliege, Floriane Magera, Sil- vio Giancola, Olivier Barnich, Bernard Ghanem, and Marc Van Droogenbroeck. Camera calibration and player local- ization in soccernet-v2 and investigation of their representa- tions for action spotting. InProceedings of the IEEE Con- ference on Computer Vision and Pattern Recognition Work- shops, 2021. 1

work page 2021
[8]

Scaling up soccer- net with multi-view spatial localization and re-identification

Anthony Cioppa, Adrien Deliege, Silvio Giancola, Bernard Ghanem, and Marc Van Droogenbroeck. Scaling up soccer- net with multi-view spatial localization and re-identification. Scientific Data, 2022. 1, 15

work page 2022
[9]

Soccernet-tracking: Multiple object tracking dataset and benchmark in soccer videos

Anthony Cioppa, Silvio Giancola, Adrien Deliege, Le Kang, Xin Zhou, Zhiyu Cheng, Bernard Ghanem, and Marc Van Droogenbroeck. Soccernet-tracking: Multiple object tracking dataset and benchmark in soccer videos. InPro- ceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2022. 1, 2, 8

work page 2022
[10]

ArXiv abs/2409.10587(2024), https://api.semanticscholar.org/CorpusID:272693834

Anthony Cioppa, Silvio Giancola, Vladimir Somers, Vic- tor Joos, Floriane Magera, Jan Held, Seyed Abol- fazl Ghasemzadeh, Xin Zhou, Karolina Seweryn, Mateusz Kowalczyk, et al. Soccernet 2024 challenges results.arXiv preprint arXiv:2409.10587, 2024. 2

work page arXiv 2024
[11]

Sportsmot: A large multi- object tracking dataset in multiple sports scenes

Yutao Cui, Chenkai Zeng, Xiaoyu Zhao, Yichun Yang, Gangshan Wu, and Limin Wang. Sportsmot: A large multi- object tracking dataset in multiple sports scenes. InProceed- ings of the International Conference on Computer Vision,

work page
[12]

Soccernet-v2: A dataset and benchmarks for holis- tic understanding of broadcast soccer videos

Adrien Deliege, Anthony Cioppa, Silvio Giancola, Meisam J Seikavandi, Jacob V Dueholm, Kamal Nasrollahi, Bernard Ghanem, Thomas B Moeslund, and Marc Van Droogen- broeck. Soccernet-v2: A dataset and benchmarks for holis- tic understanding of broadcast soccer videos. InProceed- ings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021. 1,...

work page 2021
[13]

An image is worth 16x16 words: Transformers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. InProceedings of the International Conference on Learning Representati...

work page 2021
[14]

Strongsort: Make deep- sort great again.IEEE Transactions on Multimedia, 2023

Yunhao Du, Zhicheng Zhao, Yang Song, Yanyun Zhao, Fei Su, Tao Gong, and Hongying Meng. Strongsort: Make deep- sort great again.IEEE Transactions on Multimedia, 2023. 3, 8

work page 2023
[15]

Enhancing soccer camera calibration through keypoint exploitation

Nikolay S Falaleev and Ruilong Chen. Enhancing soccer camera calibration through keypoint exploitation. InACM Multimedia Workshops, 2024. 2, 3

work page 2024
[16]

Foundation models in robotics: Applications, challenges, and the future.The International Journal of Robotics Research, 2025

Roya Firoozi, Johnathan Tucker, Stephen Tian, Anirudha Majumdar, Jiankai Sun, Weiyu Liu, Yuke Zhu, Shuran Song, Ashish Kapoor, Karol Hausman, et al. Foundation models in robotics: Applications, challenges, and the future.The International Journal of Robotics Research, 2025. 2

work page 2025
[17]

A survey for founda- tion models in autonomous driving

Haoxiang Gao, Zhongruo Wang, Yaqian Li, Kaiwen Long, Ming Yang, and Yiqing Shen. A survey for foundation mod- els in autonomous driving.arXiv preprint arXiv:2402.01105,

work page arXiv
[18]

Multiple object track- ing as id prediction

Ruopeng Gao, Ji Qi, and Limin Wang. Multiple object track- ing as id prediction. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2025. 7, 8

work page 2025
[19]

Soccernet: A scalable dataset for action spotting in soccer videos

Silvio Giancola, Mohieddine Amine, Tarek Dghaily, and Bernard Ghanem. Soccernet: A scalable dataset for action spotting in soccer videos. InProceedings of the IEEE Con- ference on Computer Vision and Pattern Recognition Work- shops, 2018. 1, 2

work page 2018
[20]

Soccernet 2022 challenges results

Silvio Giancola, Anthony Cioppa, Adrien Deli `ege, Floriane Magera, Vladimir Somers, Le Kang, Xin Zhou, Olivier Bar- nich, Christophe De Vleeschouwer, Alexandre Alahi, et al. Soccernet 2022 challenges results. InProceedings of the 5th International ACM Workshop on Multimedia Content Anal- ysis in Sports, 2022. 2, 8

work page 2022
[21]

ArXiv abs/2508.19182(2025), https://api.semanticscholar.org/CorpusID:280870241

Silvio Giancola, Anthony Cioppa, et al. Soccernet 2025 chal- lenges results.arXiv preprint arXiv:2508.19182, 2025. 2, 4

work page arXiv 2025
[22]

From broadcast to min- imap: Achieving state-of-the-art soccernet game state recon- struction

Vladimir Golovkin, Nikolay Nemtsev, Vasyl Shandyba, Oleg Udin, Nikita Kasatkin, Pavel Kononov, Anton Afanasiev, Sergey Ulasen, and Andrei Boiarov. From broadcast to min- imap: Achieving state-of-the-art soccernet game state recon- struction. InProceedings of the IEEE Conference on Com- puter Vision and Pattern Recognition Workshops, 2025. 4

work page 2025
[23]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Ab- hinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024. 7

work page internal anchor Pith review Pith/arXiv arXiv 2024
[24]

Pnlcalib: Sports field registration via points and lines optimization.arXiv preprint arXiv:2404.08401, 2024

Marc Guti ´errez-P´erez and Antonio Agudo. Pnlcalib: Sports field registration via points and lines optimization.arXiv preprint arXiv:2404.08401, 2024. 2, 3, 6, 7, 8, 15 10

work page arXiv 2024
[25]

Vars: Video assistant referee system for automated soccer decision making from multiple views

Jan Held, Anthony Cioppa, Silvio Giancola, Abdullah Hamdi, Bernard Ghanem, and Marc Van Droogenbroeck. Vars: Video assistant referee system for automated soccer decision making from multiple views. InProceedings of the IEEE Conference on Computer Vision and Pattern Recogni- tion Workshops, 2023. 1, 2

work page 2023
[26]

X-vars: Introducing explainability in football refereeing with multi- modal large language models

Jan Held, Hani Itani, Anthony Cioppa, Silvio Giancola, Bernard Ghanem, and Marc Van Droogenbroeck. X-vars: Introducing explainability in football refereeing with multi- modal large language models. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2024. 1, 2

work page 2024
[27]

Ibrahim, Srikanth Muralidharan, Zhiwei Deng, Arash Vahdat, and Greg Mori

Mostafa S. Ibrahim, Srikanth Muralidharan, Zhiwei Deng, Arash Vahdat, and Greg Mori. A hierarchical deep temporal model for group activity recognition. InProceedings of the IEEE Conference on Computer Vision and Pattern Recogni- tion, 2016. 2

work page 2016
[28]

Adam: A Method for Stochastic Optimization

Diederik P Kingma. Adam: A method for stochastic opti- mization.arXiv preprint arXiv:1412.6980, 2014. 17

work page internal anchor Pith review Pith/arXiv arXiv 2014
[29]

Maria Koshkina and James H. Elder. A general framework for jersey number recognition in sports video. InProceed- ings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2024. 1, 4, 16

work page 2024
[30]

Sports-qa: A large-scale video question answering bench- mark for complex and professional sports,

Haopeng Li, Andong Deng, Qiuhong Ke, Jun Liu, Hos- sein Rahmani, Yulan Guo, Bernt Schiele, and Chen Chen. Sports-qa: A large-scale video question answering bench- mark for complex and professional sports.arXiv preprint arXiv:2401.01505, 2024. 2

work page arXiv 2024
[31]

Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. InProceed- ings of the International Conference on Machine Learning,

work page
[32]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InPro- ceedings of the International Conference on Machine Learn- ing, 2023. 2, 7

work page 2023
[33]

Multisports: A multi-person video dataset of spatio-temporally localized sports actions

Yixuan Li, Lei Chen, Runyu He, Zhenzhi Wang, Gangshan Wu, and Limin Wang. Multisports: A multi-person video dataset of spatio-temporally localized sports actions. InPro- ceedings of the International Conference on Computer Vi- sion, 2021. 2

work page 2021
[34]

Rouge: A package for automatic evaluation of summaries

Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. InText Summarization Branches Out, 2004. 9

work page 2004
[35]

Focal loss for dense object detection

Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Doll´ar. Focal loss for dense object detection. InPro- ceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017. 6

work page 2017
[36]

F3set: Towards analyzing fast, frequent, and fine-grained events from videos

Zhaoyu Liu, Kan Jiang, Murong Ma, Zhe Hou, Yun Lin, and Jin Song Dong. F3set: Towards analyzing fast, frequent, and fine-grained events from videos. InProceedings of the In- ternational Conference on Learning Representations, 2025. 2

work page 2025
[37]

Decoupled weight decay regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InProceedings of the International Confer- ence on Learning Representations, 2019. 17

[38] Jonathon Luiten, Aljosa Osep, Patrick Dendorfer, Philip Torr, Andreas Geiger, Laura Leal-Taixé, and Bastian Leibe. Hota: A higher order metric for evaluating multi-object tracking. International Journal of Computer Vision, 2021. 8

[39] Chunyan Ma, Jizhuang Fan, Jing-Yue Yao, and Tao Zhang. Npu rgbd dataset and a feature-enhanced lstm-dgcn method for action recognition of basketball players+. Applied Sciences, 2021. 2

[40] Floriane Magera, Thomas Hoyoux, Olivier Barnich, and Marc Van Droogenbroeck. A universal protocol to benchmark camera calibration for sports. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2024. 1, 2, 8

[41] Floriane Magera, Thomas Hoyoux, Olivier Barnich, and Marc Van Droogenbroeck. Broadtrack: Broadcast camera tracking for soccer. In Winter Conference on Applications of Computer Vision, 2025. 1

[42] Amir M Mansourian, Vladimir Somers, Christophe De Vleeschouwer, and Shohreh Kasaei. Multi-task learning for joint re-identification, team affiliation, and role classification for sports visual tracking. In Proceedings of the 6th International Workshop on Multimedia Content Analysis in Sports, 2023. 1, 2, 3, 8

[43] Weibo Mao, Chenxin Xu, Qi Zhu, Siheng Chen, and Yanfeng Wang. Leapfrog diffusion model for stochastic trajectory prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2023. 2

[44] Hassan Mkhallati, Anthony Cioppa, Silvio Giancola, Bernard Ghanem, and Marc Van Droogenbroeck. Soccernet-caption: Dense video captioning for soccer broadcasts commentaries. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2023. 1, 2, 16

[45] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. Transactions on Machine Learning Research, 2024. 2

[46] Yulu Pan, Ce Zhang, and Gedas Bertasius. Basket: A large-scale video dataset for fine-grained skill estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2025. 2

[47] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Association for Computational Linguistics,

[48] AJ Piergiovanni and Michael S. Ryoo. Fine-grained activity recognition in baseball videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018. 2

[49] Ji Qi, Jifan Yu, Teng Tu, Kunyu Gao, Yifan Xu, Xinyu Guan, Xiaozhi Wang, Bin Xu, Lei Hou, Juanzi Li, et al. Goal: A challenging knowledge-grounded video captioning benchmark for real-time soccer commentary generation. In Proceedings of the ACM International Conference on Information and Knowledge Management, 2023. 1, 2

[50] Mengshi Qi, Yunhong Wang, Annan Li, and Jiebo Luo. Sports video captioning via attentive motion representation and group relationship modeling. IEEE Transactions on Circuits and Systems for Video Technology, 2020. 2

[51] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, 2021. 2

[52] Jiayuan Rao, Haoning Wu, Chang Liu, Yanfeng Wang, and Weidi Xie. Matchtime: Towards automatic soccer game commentary generation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing,

[53]

1, 2, 3, 4, 5, 7, 8, 9, 15, 16, 17

[54] Jiayuan Rao, Zifeng Li, Haoning Wu, Ya Zhang, Yanfeng Wang, and Weidi Xie. Multi-agent system for comprehensive soccer understanding. In ACM Multimedia, 2025. 1, 2

[55] Jiayuan Rao, Haoning Wu, Hao Jiang, Ya Zhang, Yanfeng Wang, and Weidi Xie. Towards universal soccer video understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2025. 1, 2, 3, 4, 5, 6, 7, 15, 16, 17

[56] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. In Proceedings of the International Conference on Learning Representations,

[57] Ergys Ristani, Francesco Solera, Roger Zou, Rita Cucchiara, and Carlo Tomasi. Performance measures and a data set for multi-target, multi-camera tracking. In Proceedings of the European Conference on Computer Vision, 2016. 8

[58] Dian Shao, Yue Zhao, Bo Dai, and Dahua Lin. Finegym: A hierarchical video dataset for fine-grained action understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020. 2

[59] Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016. 6

[60] Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3. arXiv preprint arXiv:2508.10104, 2025. 2, 7

[61] Vladimir Somers, Victor Joos, Anthony Cioppa, Silvio Giancola, Seyed Abolfazl Ghasemzadeh, Floriane Magera, Baptiste Standaert, Amir M Mansourian, Xin Zhou, Shohreh Kasaei, et al. Soccernet game state reconstruction: End-to-end athlete tracking and identification on a minimap. In Proceedings of the IEEE Conference on Computer Vision and Pattern Reco...

[62] Graham Thomas, Rikke Gade, Thomas B Moeslund, Peter Carr, and Adrian Hilton. Computer vision for sports: Current applications and research topics. Computer Vision and Image Understanding, 2017. 2

[63] Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025. 2, 3, 6, 7, 9, 17

[64] Renaud Vandeghen, Anthony Cioppa, and Marc Van Droogenbroeck. Semi-supervised training to improve player and ball detection in soccer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2022. 1, 2

[65] Rejin Varghese and M Sambath. Yolov8: A novel object detection algorithm with enhanced performance and robustness. In International Conference on Advances in Data Engineering and Intelligent Computing Systems, 2024. 3

[66] Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015. 9

[67] Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2025. 2

[68] Zhe Wang, Petar Veličković, Daniel Hennes, Nenad Tomašev, Laurel Prince, Michael Kaisers, Yoram Bachrach, Romuald Elie, Li Kevin Wenliang, Federico Piccinini, et al. Tacticai: an ai assistant for football tactics. Nature Communications, 2024. 19

[69] Chaoyi Wu, Xiaoman Zhang, Ya Zhang, Hui Hui, Yanfeng Wang, and Weidi Xie. Towards generalist foundation model for radiology by leveraging web-scale 2d&3d medical data. Nature Communications, 2025. 2

[70] Dekun Wu, He Zhao, Xingce Bao, and Richard P Wildes. Sports video analysis on large-scale data. In Proceedings of the European Conference on Computer Vision, 2022. 2

[71] Zeyu Xi, Ge Shi, Xuefen Li, Junchi Yan, Zun Li, Lifang Wu, Zilin Liu, and Liang Wang. A simple yet effective knowledge guided method for entity-aware video captioning on a basketball benchmark. Neurocomputing, 2025

[72] Zeyu Xi, Ge Shi, Haoying Sun, Bowen Zhang, Shuyi Li, and Lifang Wu. Eika: Explicit & implicit knowledge-augmented network for entity-aware sports video captioning. Expert Systems with Applications, 2025

[73] Zeyu Xi, Haoying Sun, Yaofei Wu, Junchi Yan, Haoran Zhang, Lifang Wu, Liang Wang, and Changwen Chen. Player-centric multimodal prompt generation for large language model based identity-aware basketball video captioning. In Proceedings of the International Conference on Computer Vision, 2025. 2

[74] Haotian Xia, Zhengbang Yang, Yuqing Wang, Rhys Tracy, Yun Zhao, Dongdong Huang, Zezhi Chen, Yan Zhu, Yuanfang Wang, and Weining Shen. Sportqa: A benchmark for sports understanding in large language models. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics, 2024. 2

[75] Haotian Xia, Zhengbang Yang, Junbo Zou, Rhys Tracy, Yuqing Wang, Chi Lu, Christopher Lai, Yanjun He, Xun Shao, Zhuoqing Xie, et al. Sportu: A comprehensive sports understanding benchmark for multimodal large language models. In Proceedings of the International Conference on Learning Representations, 2025. 2

[76] Huangbiao Xu, Xiao Ke, Huanqi Wu, Rui Xu, Yuezhou Li, and Wenzhong Guo. Language-guided audio-visual learning for long-term sports assessment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2025. 2

[77] Jinglin Xu, Yongming Rao, Xumin Yu, Guangyi Chen, Jie Zhou, and Jiwen Lu. Finediving: A fine-grained dataset for procedure-aware action quality assessment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2022. 2

[78] Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything v2. In Conference on Neural Information Processing Systems, 2024. 2

[79] Ling You, Wenxuan Huang, Xinni Xie, Xiangyi Wei, Bangyan Li, Shaohui Lin, Yang Li, and Changbo Wang. Timesoccer: An end-to-end multimodal large language model for soccer commentary generation. In ACM Multimedia, 2025. 2

[80] Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2022. 2, 5



[31] [31]

Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. InProceed- ings of the International Conference on Machine Learning,

work page

[32] [32]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InPro- ceedings of the International Conference on Machine Learn- ing, 2023. 2, 7

work page 2023

[33] [33]

Multisports: A multi-person video dataset of spatio-temporally localized sports actions

Yixuan Li, Lei Chen, Runyu He, Zhenzhi Wang, Gangshan Wu, and Limin Wang. Multisports: A multi-person video dataset of spatio-temporally localized sports actions. InPro- ceedings of the International Conference on Computer Vi- sion, 2021. 2

work page 2021

[34] [34]

Rouge: A package for automatic evaluation of summaries

Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. InText Summarization Branches Out, 2004. 9

work page 2004

[35] [35]

Focal loss for dense object detection

Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Doll´ar. Focal loss for dense object detection. InPro- ceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017. 6

work page 2017

[36] [36]

F3set: Towards analyzing fast, frequent, and fine-grained events from videos

Zhaoyu Liu, Kan Jiang, Murong Ma, Zhe Hou, Yun Lin, and Jin Song Dong. F3set: Towards analyzing fast, frequent, and fine-grained events from videos. InProceedings of the In- ternational Conference on Learning Representations, 2025. 2

work page 2025

[37] [37]

Decoupled weight decay regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InProceedings of the International Confer- ence on Learning Representations, 2019. 17

work page 2019

[38] [38]

Hota: A higher order metric for evaluating multi-object tracking.International Journal of Computer Vision, 2021

Jonathon Luiten, Aljosa Osep, Patrick Dendorfer, Philip Torr, Andreas Geiger, Laura Leal-Taix´e, and Bastian Leibe. Hota: A higher order metric for evaluating multi-object tracking.International Journal of Computer Vision, 2021. 8

work page 2021

[39] [39]

Npu rgbd dataset and a feature-enhanced lstm-dgcn method for action recognition of basketball players+.Applied Sci- ences, 2021

Chunyan Ma, Jizhuang Fan, Jing-Yue Yao, and Tao Zhang. Npu rgbd dataset and a feature-enhanced lstm-dgcn method for action recognition of basketball players+.Applied Sci- ences, 2021. 2

work page 2021

[40] [40]

A universal protocol to bench- mark camera calibration for sports

Floriane Magera, Thomas Hoyoux, Olivier Barnich, and Marc Van Droogenbroeck. A universal protocol to bench- mark camera calibration for sports. InProceedings of the IEEE Conference on Computer Vision and Pattern Recogni- tion Workshops, 2024. 1, 2, 8

work page 2024

[41] [41]

Broadtrack: Broadcast camera tracking for soccer

Floriane Magera, Thomas Hoyoux, Olivier Barnich, and Marc Van Droogenbroeck. Broadtrack: Broadcast camera tracking for soccer. InWinter Conference on Applications of Computer Vision, 2025. 1

work page 2025

[42] [42]

Multi-task learning for joint re-identification, team affiliation, and role classifi- cation for sports visual tracking

Amir M Mansourian, Vladimir Somers, Christophe De Vleeschouwer, and Shohreh Kasaei. Multi-task learning for joint re-identification, team affiliation, and role classifi- cation for sports visual tracking. InProceedings of the 6th International Workshop on Multimedia Content Analysis in Sports, 2023. 1, 2, 3, 8

work page 2023

[43] [43]

Leapfrog diffusion model for stochastic trajec- tory prediction

Weibo Mao, Chenxin Xu, Qi Zhu, Siheng Chen, and Yan- feng Wang. Leapfrog diffusion model for stochastic trajec- tory prediction. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2023. 2

work page 2023

[44] [44]

Soccernet- caption: Dense video captioning for soccer broadcasts com- mentaries

Hassan Mkhallati, Anthony Cioppa, Silvio Giancola, Bernard Ghanem, and Marc Van Droogenbroeck. Soccernet- caption: Dense video captioning for soccer broadcasts com- mentaries. InProceedings of the IEEE Conference on Com- puter Vision and Pattern Recognition Workshops, 2023. 1, 2, 16

work page 2023

[45] [45]

Dinov2: Learning robust visual features without supervision

Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. Transactions on Machine Learning Research, 2024. 2

work page 2024

[46] [46]

Basket: A large- scale video dataset for fine-grained skill estimation

Yulu Pan, Ce Zhang, and Gedas Bertasius. Basket: A large- scale video dataset for fine-grained skill estimation. InPro- ceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2025. 2

work page 2025

[47] [47]

Bleu: a method for automatic evaluation of machine translation

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. InAssociation for Computational Linguistics,

work page

[48] [48]

AJ Piergiovanni and Michael S. Ryoo. Fine-grained activ- ity recognition in baseball videos. InProceedings of the IEEE Conference on Computer Vision and Pattern Recog- nition Workshops, 2018. 2

work page 2018

[49] [49]

Goal: A challenging knowledge-grounded video captioning bench- mark for real-time soccer commentary generation

Ji Qi, Jifan Yu, Teng Tu, Kunyu Gao, Yifan Xu, Xinyu Guan, Xiaozhi Wang, Bin Xu, Lei Hou, Juanzi Li, et al. Goal: A challenging knowledge-grounded video captioning bench- mark for real-time soccer commentary generation. InPro- ceedings of the ACM International Conference on Informa- tion and Knowledge Management, 2023. 1, 2 11

work page 2023

[50] [50]

Sports video captioning via attentive motion representation and group relationship modeling.IEEE Transactions on Cir- cuits and Systems for Video Technology, 2020

Mengshi Qi, Yunhong Wang, Annan Li, and Jiebo Luo. Sports video captioning via attentive motion representation and group relationship modeling.IEEE Transactions on Cir- cuits and Systems for Video Technology, 2020. 2

work page 2020

[51] [51]

Learn- ing transferable visual models from natural language super- vision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- ing transferable visual models from natural language super- vision. InProceedings of the International Conference on Machine Learning, 2021. 2

work page 2021

[52] Jiayuan Rao, Haoning Wu, Chang Liu, Yanfeng Wang, and Weidi Xie. Matchtime: Towards automatic soccer game commentary generation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing,

[53] [53]

1, 2, 3, 4, 5, 7, 8, 9, 15, 16, 17

work page

[54] Jiayuan Rao, Zifeng Li, Haoning Wu, Ya Zhang, Yanfeng Wang, and Weidi Xie. Multi-agent system for comprehensive soccer understanding. In ACM Multimedia, 2025. 1, 2

[55] Jiayuan Rao, Haoning Wu, Hao Jiang, Ya Zhang, Yanfeng Wang, and Weidi Xie. Towards universal soccer video understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2025. 1, 2, 3, 4, 5, 6, 7, 15, 16, 17

[56] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. In Proceedings of the International Conference on Learning Representations,

[57] Ergys Ristani, Francesco Solera, Roger Zou, Rita Cucchiara, and Carlo Tomasi. Performance measures and a data set for multi-target, multi-camera tracking. In Proceedings of the European Conference on Computer Vision, 2016. 8

[58] Dian Shao, Yue Zhao, Bo Dai, and Dahua Lin. Finegym: A hierarchical video dataset for fine-grained action understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020. 2

[59] Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016. 6

[60] Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3. arXiv preprint arXiv:2508.10104, 2025. 2, 7

[61] Vladimir Somers, Victor Joos, Anthony Cioppa, Silvio Giancola, Seyed Abolfazl Ghasemzadeh, Floriane Magera, Baptiste Standaert, Amir M Mansourian, Xin Zhou, Shohreh Kasaei, et al. Soccernet game state reconstruction: End-to-end athlete tracking and identification on a minimap. In Proceedings of the IEEE Conference on Computer Vision and Pattern Reco...

work page 2024

[62] Graham Thomas, Rikke Gade, Thomas B Moeslund, Peter Carr, and Adrian Hilton. Computer vision for sports: Current applications and research topics. Computer Vision and Image Understanding, 2017. 2

[63] Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025. 2, 3, 6, 7, 9, 17

[64] Renaud Vandeghen, Anthony Cioppa, and Marc Van Droogenbroeck. Semi-supervised training to improve player and ball detection in soccer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2022. 1, 2

[65] Rejin Varghese and M Sambath. Yolov8: A novel object detection algorithm with enhanced performance and robustness. In International Conference on Advances in Data Engineering and Intelligent Computing Systems, 2024. 3

[66] Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015. 9

[67] Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2025. 2

[68] Zhe Wang, Petar Veličković, Daniel Hennes, Nenad Tomašev, Laurel Prince, Michael Kaisers, Yoram Bachrach, Romuald Elie, Li Kevin Wenliang, Federico Piccinini, et al. Tacticai: an ai assistant for football tactics. Nature Communications, 2024. 19

[69] Chaoyi Wu, Xiaoman Zhang, Ya Zhang, Hui Hui, Yanfeng Wang, and Weidi Xie. Towards generalist foundation model for radiology by leveraging web-scale 2d&3d medical data. Nature Communications, 2025. 2

[70] Dekun Wu, He Zhao, Xingce Bao, and Richard P Wildes. Sports video analysis on large-scale data. In Proceedings of the European Conference on Computer Vision, 2022. 2

[71] Zeyu Xi, Ge Shi, Xuefen Li, Junchi Yan, Zun Li, Lifang Wu, Zilin Liu, and Liang Wang. A simple yet effective knowledge guided method for entity-aware video captioning on a basketball benchmark. Neurocomputing, 2025

[72] Zeyu Xi, Ge Shi, Haoying Sun, Bowen Zhang, Shuyi Li, and Lifang Wu. Eika: Explicit & implicit knowledge-augmented network for entity-aware sports video captioning. Expert Systems with Applications, 2025

[73] Zeyu Xi, Haoying Sun, Yaofei Wu, Junchi Yan, Haoran Zhang, Lifang Wu, Liang Wang, and Changwen Chen. Player-centric multimodal prompt generation for large language model based identity-aware basketball video captioning. In Proceedings of the International Conference on Computer Vision, 2025. 2

[74] Haotian Xia, Zhengbang Yang, Yuqing Wang, Rhys Tracy, Yun Zhao, Dongdong Huang, Zezhi Chen, Yan Zhu, Yuanfang Wang, and Weining Shen. Sportqa: A benchmark for sports understanding in large language models. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics, 2024. 2

[75] Haotian Xia, Zhengbang Yang, Junbo Zou, Rhys Tracy, Yuqing Wang, Chi Lu, Christopher Lai, Yanjun He, Xun Shao, Zhuoqing Xie, et al. Sportu: A comprehensive sports understanding benchmark for multimodal large language models. In Proceedings of the International Conference on Learning Representations, 2025. 2

[76] Huangbiao Xu, Xiao Ke, Huanqi Wu, Rui Xu, Yuezhou Li, and Wenzhong Guo. Language-guided audio-visual learning for long-term sports assessment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2025. 2

[77] Jinglin Xu, Yongming Rao, Xumin Yu, Guangyi Chen, Jie Zhou, and Jiwen Lu. Finediving: A fine-grained dataset for procedure-aware action quality assessment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2022. 2

[78] Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything v2. In Conference on Neural Information Processing Systems, 2024. 2

[79] Ling You, Wenxuan Huang, Xinni Xie, Xiangyi Wei, Bangyan Li, Shaohui Lin, Yang Li, and Changbo Wang. Timesoccer: An end-to-end multimodal large language model for soccer commentary generation. In ACM Multimedia, 2025. 2

[80] Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2022. 2, 5