Recognition: no theorem link
SoccerLens: Grounded Soccer Video Understanding Beyond Accuracy
Pith reviewed 2026-05-13 06:29 UTC · model grok-4.3
The pith
Soccer video models reach high classification accuracy yet ground predictions on relevant visual cues less than half the time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SoccerLens supplies annotated video segments for 13 soccer events, each equipped with structured visual cues at three levels of relevance. An extension of the Chefer attribution method jointly models spatial and temporal attention, yielding metrics that compare model focus against the annotated cues and against spurious regions. Applied to current state-of-the-art soccer vision-language models, the evaluation finds grounding performance below 50 percent even under the loosest cue definitions, together with consistent under-use of temporal information, despite strong classification accuracy.
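To make the comparison concrete, one hedged reading of such metrics (the symbols, masks, and normalization below are illustrative assumptions, not definitions quoted from the paper): given a spatio-temporal attribution map $A(t,x,y)$, binary cue masks $C_\ell$ for relevance level $\ell$, and a mask $S$ for annotated spurious regions, a grounding score and a drift score could be defined as

```latex
G_\ell = \frac{\sum_{t,x,y} A(t,x,y)\, C_\ell(t,x,y)}{\sum_{t,x,y} A(t,x,y)},
\qquad
D = \frac{\sum_{t,x,y} A(t,x,y)\, S(t,x,y)}{\sum_{t,x,y} A(t,x,y)}.
```

Read this way, the headline finding is that $G_\ell$ stays below 0.5 even for the loosest (broadest) cue level, while $D$ tracks attention drifting toward spurious regions.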
What carries the argument
The SoccerLens benchmark, with its three-level structured visual cues, together with an extended Chefer attribution method that measures how well model attention aligns with those cues across space and time.
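A minimal sketch of how "per-frame rollout plus temporal aggregation" could be implemented, together with the cue-alignment scores sketched above (all function names, the mean-over-heads rollout variant, and the mass-ratio metrics are assumptions for illustration, not the authors' code):

```python
import numpy as np

def chefer_rollout_per_frame(attn_maps, grads):
    """Gradient-weighted attention rollout for a single frame.

    attn_maps, grads: per-layer arrays of shape (heads, tokens, tokens),
    collected via hooks on the attention modules (token 0 assumed [CLS]).
    Returns a (tokens - 1,) relevance vector over the frame's patch tokens.
    """
    num_tokens = attn_maps[0].shape[-1]
    relevance = np.eye(num_tokens)
    for attn, grad in zip(attn_maps, grads):
        # Chefer-style update: positive gradient-weighted attention,
        # averaged over heads, plus identity for the residual connection.
        cam = np.clip(grad * attn, 0, None).mean(axis=0)
        cam = cam + np.eye(num_tokens)
        cam = cam / cam.sum(axis=-1, keepdims=True)
        relevance = cam @ relevance
    return relevance[0, 1:]  # relevance of patch tokens w.r.t. [CLS]

def spatiotemporal_attribution(per_frame_attn, per_frame_grads, grid_hw):
    """Stack per-frame relevance maps into a (T, H, W) attribution volume."""
    h, w = grid_hw
    frames = [chefer_rollout_per_frame(a, g).reshape(h, w)
              for a, g in zip(per_frame_attn, per_frame_grads)]
    vol = np.stack(frames)            # simple temporal aggregation: stack frames...
    return vol / (vol.sum() + 1e-8)   # ...and normalize total mass to 1

def grounding_scores(attribution, cue_masks, spurious_mask):
    """Fraction of attribution mass inside each cue level vs. spurious regions."""
    scores = {lvl: float((attribution * mask).sum())
              for lvl, mask in cue_masks.items()}
    scores["spurious_drift"] = float((attribution * spurious_mask).sum())
    return scores
```

A temporal-usage signal could then be derived from the per-frame mass `attribution.sum(axis=(1, 2))`, for example its entropy relative to a uniform spread over frames; under this reading, "underutilized temporal information" would show up as attribution collapsing onto one or two frames.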
If this is right
- Evaluation protocols for video understanding must incorporate grounding checks in addition to accuracy.
- Models must be trained to attend more reliably to relevant spatial and temporal evidence.
- Current approaches risk exploiting shortcuts that fail when scenes become cluttered or viewpoints change.
- Temporal modeling requires explicit attention in future soccer video architectures.
Where Pith is reading between the lines
- Grounding benchmarks of this form could expose similar limitations in other spatio-temporal domains such as traffic or surveillance video.
- Training objectives that directly supervise cue alignment might close the observed gap between accuracy and grounding.
- Practical sports analytics systems would gain reliability once models demonstrate consistent visual grounding.
Load-bearing premise
The three-level structured visual cues correctly identify the meaningful evidence models should attend to and the extended attribution method accurately measures alignment with those cues.
What would settle it
A state-of-the-art model that scores above 60 percent on the strictest grounding metrics while retaining high classification accuracy on the SoccerLens benchmark would challenge the reported gap.
read the original abstract
Vision-language models (VLMs) have recently shown strong potential in soccer video understanding. However, given the high complexity of soccer videos due to large viewpoint variations, rapid shot transitions, and cluttered scenes, it remains unclear whether VLMs rely on meaningful visual evidence or exploit spurious correlations and shortcut learning. Existing evaluation protocols focus primarily on classification accuracy and do not assess visual grounding. To address this limitation, we introduce SoccerLens, a benchmark for grounded soccer video understanding. The benchmark contains annotated video segments spanning $13$ common soccer events, with structured visual cues organized into three levels of semantic relevance. We further extend the attribution method of Chefer [arXiv:2103.15679] to jointly model spatial and temporal attention, and introduce evaluation metrics that measure whether model attention aligns with annotated cues or drifts toward spurious regions. Our evaluation of state-of-the-art soccer VLMs shows that, despite strong classification accuracy, current models fail to exceed $50\%$ grounding performance even under the loosest cue definitions and consistently underutilize temporal information. These results reveal a substantial gap between predictive performance and true visual grounding, highlighting the need for grounded evaluation in complex spatio-temporal domains such as soccer.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SoccerLens, a benchmark for grounded soccer video understanding consisting of annotated segments for 13 soccer events with three-level structured visual cues. It extends the Chefer et al. (2021) attribution method to jointly model spatio-temporal attention and reports that state-of-the-art VLMs achieve strong classification accuracy but fail to exceed 50% grounding performance (even under the loosest cue definitions) while underutilizing temporal information.
Significance. If the grounding metrics and attribution extension prove reliable, the work demonstrates a clear gap between predictive accuracy and visual grounding in complex spatio-temporal video domains. This could shift evaluation practices away from accuracy-only protocols toward grounded assessments, with potential impact on VLM development for sports analytics and similar applications.
major comments (2)
- [Methods (attribution extension)] The central claim of temporal underutilization and the <50% grounding ceiling depends on the extended Chefer attribution method producing faithful spatio-temporal heatmaps. The extension (per-frame rollout plus temporal aggregation) is not shown to preserve gradient flow across shot boundaries or avoid leakage from background motion, which directly risks making the temporal finding an artifact of the measurement rather than a model property.
- [Abstract and Evaluation] The abstract states specific quantitative findings (50% grounding ceiling, temporal underutilization) but the provided text supplies no details on dataset size, annotation process for the three-level cues, inter-annotator agreement, or statistical significance testing. This makes the load-bearing quantitative claims difficult to verify or reproduce.
minor comments (2)
- [References] Ensure the full bibliographic entry for Chefer et al. (arXiv:2103.15679) appears in the references section.
- [Benchmark Description] Clarify the exact definitions and annotation guidelines for the three levels of semantic relevance in the benchmark to support reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below, indicating planned revisions to strengthen the manuscript.
read point-by-point responses
- Referee: [Methods (attribution extension)] The central claim of temporal underutilization and the <50% grounding ceiling depends on the extended Chefer attribution method producing faithful spatio-temporal heatmaps. The extension (per-frame rollout plus temporal aggregation) is not shown to preserve gradient flow across shot boundaries or avoid leakage from background motion, which directly risks making the temporal finding an artifact of the measurement rather than a model property.
Authors: We appreciate this valid methodological concern. Our extension applies the Chefer et al. (2021) rollout independently per frame before temporally aggregating the resulting attribution maps. While the per-frame application inherits the original method's gradient-flow properties, we acknowledge that explicit validation for video-specific issues such as shot boundaries and background motion was not provided. In the revised manuscript we will add: (i) the precise aggregation formula, (ii) qualitative visualizations demonstrating attention continuity across shot changes, and (iii) a controlled experiment on synthetic videos containing known motion patterns to quantify leakage (see the sketch after this exchange). These additions will directly support the claim that the reported temporal underutilization is a model property rather than an artifact. revision: yes
- Referee: [Abstract and Evaluation] The abstract states specific quantitative findings (50% grounding ceiling, temporal underutilization) but the provided text supplies no details on dataset size, annotation process for the three-level cues, inter-annotator agreement, or statistical significance testing. This makes the load-bearing quantitative claims difficult to verify or reproduce.
Authors: We agree that the abstract would benefit from additional context to improve verifiability. The full manuscript already contains a dedicated benchmark-construction section describing the three-level cue annotation process and a results section reporting evaluation metrics. To address the referee's point we will revise the abstract to include a concise clause on dataset scale and annotation reliability, and we will explicitly add inter-annotator agreement statistics together with significance testing in the evaluation section. These changes will make the quantitative claims easier to assess without exceeding abstract length limits. revision: yes
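A rough sketch of the kind of control promised in item (iii), under strong assumptions (the synthetic construction, grid size, and leakage definition below are illustrative, not the authors' planned protocol): render clips in which the event-defining motion is confined to a known foreground region while the background moves independently, then measure how much attribution mass lands outside that region.

```python
import numpy as np

def synthetic_cue_masks(T=16, H=14, W=14):
    """Ground-truth masks for a toy clip on the attribution grid: a small
    'cue' square follows a known trajectory; everything else is background
    that can be rendered with independent distractor motion."""
    cue = np.zeros((T, H, W), dtype=bool)
    for t in range(T):
        r = 2 + t % (H - 4)
        c = 2 + (2 * t) % (W - 4)
        cue[t, r:r + 2, c:c + 2] = True
    return cue, ~cue

def leakage_score(attribution, background_mask):
    """Share of total attribution mass that falls on uninformative background."""
    total = attribution.sum() + 1e-8
    return float((attribution * background_mask).sum() / total)

# Sketch of use: render the clip, run the model and the spatio-temporal
# attribution on it, then check leakage, e.g.
#   cue, bg = synthetic_cue_masks()
#   leak = leakage_score(attribution_volume, bg)   # attribution_volume: (T, H, W)
```

If the leakage score stays low and stable across shot-boundary positions, the temporal-underutilization finding is more plausibly a property of the evaluated models rather than of the measurement.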
Circularity Check
No significant circularity; evaluation metrics are independent measurements on new annotations
full rationale
The paper introduces a new benchmark with three-level human-annotated visual cues for soccer events and extends the external Chefer et al. (2021) attribution method to spatio-temporal heatmaps, then computes alignment metrics against those cues. No equations or definitions reduce the reported grounding scores (<50%) or temporal under-utilization findings to fitted parameters, self-referential inputs, or self-citation chains by construction. The central results are produced by applying the defined benchmark and extended method to VLMs, constituting independent evaluation rather than tautological reduction. Self-citation is absent, so no load-bearing self-reference arises.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Vision-language models for video rely on attention mechanisms whose spatial and temporal focus can be attributed using gradient-based methods.
Reference graph
Works this paper leans on
- [1] S. Abnar and W. Zuidema. Quantifying attention flow in transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4190–4197, 2020.
- [2] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems (NeurIPS), 2022.
- [3]
- [4]
- [5]
- [6] A. Deliège, A. Cioppa, S. Giancola, M. J. Seikavandi, J. V. Dueholm, K. Nasrollahi, B. Ghanem, T. B. Moeslund, and M. Van Droogenbroeck. Soccernet-v2: A dataset and benchmarks for holistic understanding of broadcast soccer videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 4508–4519, 2021.
- [7] T. D’Orazio, M. Leo, N. Mosca, P. Spagnolo, and P. L. Mazzeo. A ball-tracking system for real-time soccer analysis. The Visual Computer, 25:1037–1048, 2009.
- [8]
- [9] S. Giancola, M. Amine, T. Dghaily, and B. Ghanem. Soccernet: A scalable dataset for action spotting in soccer videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1711–1721, 2018.
- [10] S. Giancola, A. Cioppa, A. Deliège, F. Magera, V. Somers, L. Kang, X. Zhou, O. Barnich, C. De Vleeschouwer, A. Alahi, B. Ghanem, M. Van Droogenbroeck, et al. Soccernet 2022 challenges results. In Proceedings of the 5th International ACM Workshop on Multimedia Content Analysis in Sports, pages 75–86. ACM, 2022.
- [11] J. Held, A. Cioppa, S. Giancola, A. Hamdi, B. Ghanem, and M. Van Droogenbroeck. Vars: Video assistant referee system for automated soccer decision making from multiple views. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 5086–5097, 2023.
- [12] J. Held, H. Itani, A. Cioppa, S. Giancola, B. Ghanem, and M. Van Droogenbroeck. X-vars: Introducing explainability in football refereeing with multi-modal large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 3267–3279, 2024.
- [13] D. Karki, M. Ramazanova, A. Cioppa, S. Giancola, and B. Ghanem. Pixels or positions? Benchmarking modalities in group activity recognition. arXiv preprint arXiv:2511.12606, 2025.
- [14]
- [15] J. Li, D. Li, C. Xiong, and S. Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning (ICML), 2022.
- [16]
- [17] W.-L. Lu, J. J. Little, and K. P. Murphy. Learning to track and identify players from broadcast sports videos. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(7):1704–1716, 2013.
- [18] H. Mkhallati, A. Cioppa, S. Giancola, B. Ghanem, and M. Van Droogenbroeck. Soccernet-caption: Dense video captioning for soccer broadcasts commentaries. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 5073–5084, 2023.
- [19] B. T. Naik, M. F. Hashmi, and N. D. Bokde. A comprehensive review of computer vision in sports: Open issues, future trends and research directions. Applied Sciences, 12(9):4429, 2022.
- [20] V. Petsiuk, A. Das, and K. Saenko. Rise: Randomized input sampling for explanation of black-box models. In Proceedings of the British Machine Vision Conference, 2018.
- [21] J. Qi, J. Yu, T. Tu, K. Gao, Y. Xu, X. Guan, X. Wang, Y. Dong, B. Xu, L. Hou, J. Li, J. Tang, W. Guo, H. Liu, and Y. Xu. GOAL: A challenging knowledge-grounded video captioning benchmark for real-time soccer commentary generation. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (CIKM), 2023.
- [22] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (ICML), 2021.
- [23] J. Rao, H. Wu, H. Jiang, Y. Zhang, Y. Wang, and W. Xie. Towards universal soccer video understanding, 2025.
- [24] J. Rao, H. Wu, C. Liu, Y. Wang, and W. Xie. Matchtime: Towards automatic soccer game commentary generation. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024.
- [25] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 618–626, 2017.
- [26] R. Shrestha, K. Kafle, and C. Kanan. A negative case analysis of visual grounding methods for VQA. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8172–8181, Online, July 2020. Association for Computational Linguistics.
- [27]
- [28] M. Sundararajan, A. Taly, and Q. Yan. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning, pages 3319–3328. PMLR, 2017.
- [29]
- [30] H. Wang, Z. Wang, M. Du, F. Yang, Z. Zhang, S. Ding, P. Mardziel, and X. Hu. Score-cam: Score-weighted visual explanations for convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 24–25, 2020.
- [31]
- [32] H. Yang, J. Rao, H. Wu, and W. Xie. Soccermaster: A vision foundation model for soccer understanding. arXiv preprint arXiv:2512.11016, 2025.
- [33] M. Yuksekgonul, F. Bianchi, P. Kalluri, D. Jurafsky, and J. Zou. When and why vision-language models behave like bags-of-words, and what to do about it? In International Conference on Learning Representations (ICLR), 2023.
- [34] X. Zhai, B. Basilico, M. Dehghani, J.-B. Alayrac, C. Cheng, N. Houlsby, and L. Beyer. Sigmoid loss for language-image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 22397–22407, 2023.
- [35] C. Zhao, C. Wang, G. Hu, H. Chen, C. Liu, and J. Tang. Istvt: Interpretable spatial-temporal video transformer for deepfake detection. IEEE Transactions on Information Forensics and Security, 18:1335–1348, 2023.
discussion (0)