AVTrack: Audio-Visual Tracking in Human-centric Complex Scenes

Henghui Ding; Yaoting Wang; Yun Zhou; Zipei Zhang

arxiv: 2606.02724 · v1 · pith:IHE23QM6new · submitted 2026-06-01 · 💻 cs.CV · cs.AI

AVTrack: Audio-Visual Tracking in Human-centric Complex Scenes

Yaoting Wang , Yun Zhou , Zipei Zhang , Henghui Ding This is my paper

Pith reviewed 2026-06-28 14:58 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords audio-visual trackingspeaker trackingaudio-visual instance segmentationdatasetbenchmarkcomplex scenescamera motionvisual occlusions

0 comments

The pith

A new dataset for audio-visual instance segmentation shows existing methods degrade sharply in scenes with camera motion and occlusions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates AVTrack to move beyond existing datasets that use only simple, static audio-visual scenes with coarse labels. It supplies annotations for tracking active speakers across frames in human-centric videos that include moving cameras, occlusions, and changing positions. When representative audio-visual instance segmentation methods are tested on AVTrack, their accuracy falls substantially compared with prior benchmarks. The authors also supply a baseline model. The central purpose is to force future work to develop trackers that perform genuine spatiotemporal and cross-modal reasoning rather than exploit easy co-occurrence cues.

Core claim

AVTrack is a human-centric audio-visual instance segmentation dataset built for dynamic real-world conditions. Evaluations of current AVIS methods on this dataset produce large performance drops, demonstrating that prior benchmarks have overstated robustness by testing only on homogeneous scenes.

What carries the argument

The AVTrack dataset, which supplies fine-grained audio-visual instance segmentation annotations under camera motion, visual occlusions, and speaker position changes.

If this is right

Methods must incorporate explicit handling of camera motion and occlusions to maintain tracking accuracy.
Future benchmarks should prioritize dynamic, multi-speaker scenes over static co-occurrence settings.
The provided baseline offers a starting point for developing models that reason jointly across audio and visual streams over time.
Applications such as surveillance and video editing will require algorithms validated on conditions like those in AVTrack.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the performance gap persists across multiple independent implementations, research attention will shift from single-frame detection to long-term cross-modal association.
The dataset construction choices could be reused to create similar test sets for other multi-modal tasks such as audio-visual action recognition.
Improved methods developed on AVTrack may transfer to related problems like multi-speaker diarization in video.

Load-bearing premise

The chosen conditions of camera motion, occlusions, and position changes are representative of real-world difficulty and that observed performance drops arise from these factors rather than from how the dataset or metrics were constructed.

What would settle it

A re-run of the same methods on AVTrack after correcting any annotation or evaluation artifacts that produces no significant accuracy drop relative to prior datasets.

Figures

Figures reproduced from arXiv: 2606.02724 by Henghui Ding, Yaoting Wang, Yun Zhou, Zipei Zhang.

**Figure 1.** Figure 1: Illustrative samples from our proposed AVTrack benchmark. Audio signals are omitted for visual clarity. AVTrack features challenging human-centric audio-visual scenarios, such as instance scale dynamics, visual occlusion, camera motion change, and relative position change. In contrast, previous datasets are typically dominated by simple settings, such as static cameras and single-instance scenes. A more de… view at source ↗

**Figure 2.** Figure 2: Comparison of AVTrack and AVISeg data distributions across different challenging conditions. Percentages indicate the proportion of each category relative to the total samples. The masked frames shown are sourced from AVTrack. Zoom-in for better visibility of details. 2022) are explicitly designed for human-centric understanding, while differing substantially in task formulation and supervision granularit… view at source ↗

**Figure 3.** Figure 3: Video source distribution in AVTrack. ments and local tracklets within temporally local windows, and then progressively associates tracklets of the same instance across windows to recover global speaker trajectories. AVTracker is implemented as a modular three-stage framework explicitly designed for extensibility, allowing new tools and functional modules to be seamlessly integrated with minimal architec… view at source ↗

**Figure 4.** Figure 4: Overview of AVTracker, a three-stage framework for human-centric AVIS. Stage 1 (Speaker Chunks Aggregation): Speech clips are transcribed with Whisper, and timestamp-aligned transcripts together with speaker embeddings are used to group clips into speaker chunks, reducing redundancy and cost. Speech separation is ignored for clearer visualization. Stage 2 (Local Window Process): For each chunk, SAM3 segmen… view at source ↗

**Figure 5.** Figure 5: Comparison of windowing paradigms. Unlike fixedsize windows, dynamic windows preserve complete semantic units and contextual temporal continuity, enabling more effective and semantically coherent audio-visual correlation. 4.4. Global Window Process While local windows resolve short-term audio-visual correspondence, the same speaker may appear in multiple disjoint chunks. The global window process aggreg… view at source ↗

**Figure 6.** Figure 6: Qualitative comparison between AVTracker predictions and ground truth labels. with M3, with HOTA decreasing from 24.01 to 16.88, showing the importance of compact and informative local tracklets for scalable global association. In addition, disabling dynamic windowing (C2) also results in a performance decline relative to the Base setting (HOTA: 28.85 → 27.45). 5.3. Qualitative Comparison [PITH_FULL_IM… view at source ↗

**Figure 7.** Figure 7: AVTrack dataset construction pipeline. yet comprehensive spatial coverage and significantly reduces the manual annotation burden, while preserving the flexibility required for subsequent human refinement. Human-in-the-Loop Annotation and Verification. To ensure annotation consistency and correctness, all annotators undergo structured training on a held-out subset of the data. Each annotator is required t… view at source ↗

**Figure 8.** Figure 8: An overview of AVTrackFormer: similar to AVISM, it first performs pixel-level cross-modal fusion through AV-PFM. However, AVTrackFormer enables bidirectional interaction between object tokens and audio features in AV-OFM, rather than a unidirectional one. the n-th AV-OFM ×n blocks feed in ×(n-1) feed in ×(n-1) 0 CA 𝑓𝑖 𝐴 𝑄𝑜 𝐴𝑉 𝑄𝑜' 𝑄𝑜 𝑗 𝑄𝐴 𝑗 [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗

**Figure 9.** Figure 9: Audio-Visual Object-level Fusion Module (AV-OFM). AV-OFM is designed to model audio–visual correlations in an interleaved and bidirectional manner. by an object encoder using windowed self-attention (Liu et al., 2021b), producing encoded tokens Qo ′ . Audio cues are incorporated into the temporal modeling via the AudioVisual Object-level Fusion Module (AV-OFM) applied to Qo ′ , resulting in audio-conditio… view at source ↗

**Figure 10.** Figure 10: Qualitative comparison between AVTracker and AVISM. As shown in [PITH_FULL_IMAGE:figures/full_fig_p017_10.png] view at source ↗

read the original abstract

Audio-visual speaker tracking aims to localize and track active speakers by leveraging auditory and visual cues, enabling fine-grained, human-centric scene understanding. This capability is essential for real-world applications such as intelligent video editing, surveillance, and human-computer interaction. However, existing datasets are largely limited to simple or homogeneous audio-visual scenes with coarse annotations. Such oversimplified settings bias evaluation toward static audio-visual co-occurrence, rather than rigorously assessing robust spatiotemporal modeling and cross-modal reasoning in complex, dynamic scenes. To address these limitations, we introduce AVTrack, a human-centric audio-visual instance segmentation (AVIS) dataset designed for dynamic real-world scenarios. AVTrack features diverse and challenging conditions, including camera motion, visual occlusions, and position changes. Evaluations of representative AVIS methods on AVTrack reveal substantial performance degradation, establishing AVTrack as a challenging benchmark for robust human-centric audio-visual scene understanding in complex environments. We further provide a simple yet effective baseline to facilitate future research. Project website: https://FudanCVL.github.io/AVTrack/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

AVTrack is a dataset paper that fills a gap with complex-scene audio-visual tracking examples and shows method degradation, but it needs stronger evidence on construction and evaluation to land the central claim.

read the letter

The paper's main contribution is AVTrack, a new dataset for audio-visual instance segmentation and tracking aimed at human-centric scenes with camera motion, occlusions, and position changes. It reports that representative methods degrade on this data compared to simpler prior sets and includes a basic baseline to get started.

This addresses a real limitation: most existing audio-visual datasets stay too static and homogeneous, which lets models rely on easy co-occurrence instead of handling dynamics or cross-modal reasoning. Releasing data that forces harder conditions is a step forward, and the degradation result, if it holds, gives a clearer picture of where current approaches fall short.

The soft spot is the lack of detail on how the dataset was built. Dataset papers stand or fall on collection methods, annotation process, split statistics, and checks that the performance drop comes from the intended factors rather than labeling artifacts or evaluation choices. The abstract states the intent and the outcome but does not show tables, ablations on controlled subsets, or inter-annotator agreement. The stress-test note flags the same gap. If the full manuscript supplies those elements with reproducible protocols, the claim strengthens; otherwise the benchmark value stays provisional.

This is for groups already working on audio-visual models who need tougher testbeds. A reader focused on surveillance or HCI applications might get practical value from the examples. It deserves peer review because the motivation is sound and the direction matters, even if the current write-up requires more grounding on the data side before it can be used confidently.

Referee Report

2 major / 1 minor

Summary. The paper introduces AVTrack, a new dataset for human-centric audio-visual instance segmentation (AVIS) in complex dynamic scenes characterized by camera motion, visual occlusions, and position changes. It argues that existing datasets are too simple and bias evaluations toward static co-occurrence, evaluates representative AVIS methods on AVTrack to demonstrate substantial performance degradation, and supplies a simple baseline to support future work.

Significance. If the reported degradation holds under rigorous, artifact-free evaluation and the dataset conditions genuinely isolate real-world complexities, AVTrack would provide a valuable benchmark for advancing robust cross-modal spatiotemporal reasoning in audio-visual tracking. The inclusion of a baseline method is a constructive step for reproducibility and follow-on research.

major comments (2)

[Abstract] Abstract: The central claim that 'evaluations of representative AVIS methods on AVTrack reveal substantial performance degradation' is load-bearing for positioning AVTrack as a challenging benchmark, yet the abstract supplies no quantitative results, specific methods or metrics, dataset statistics, or evaluation protocol details. This prevents verification that observed drops arise from the stated factors (camera motion, occlusions, position changes) rather than construction artifacts.
[Dataset and Evaluation sections] Dataset and Evaluation sections: No information is given on the annotation process, train/test splits, controlled ablations isolating individual complexity factors, or comparisons against simpler subsets. Without these, it is impossible to confirm that the dataset faithfully represents real-world conditions or that performance drops are attributable to the intended challenges.

minor comments (1)

The project website link is provided, but the manuscript should explicitly state data availability, licensing, and whether annotations follow standard formats for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that additional details are needed to strengthen the manuscript and will revise accordingly to address the concerns about the abstract and dataset/evaluation sections.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim that 'evaluations of representative AVIS methods on AVTrack reveal substantial performance degradation' is load-bearing for positioning AVTrack as a challenging benchmark, yet the abstract supplies no quantitative results, specific methods or metrics, dataset statistics, or evaluation protocol details. This prevents verification that observed drops arise from the stated factors (camera motion, occlusions, position changes) rather than construction artifacts.

Authors: We agree that including quantitative highlights in the abstract would improve verifiability. In the revision, we will update the abstract to incorporate key dataset statistics (e.g., number of sequences and instances), the specific AVIS methods evaluated, main metrics used, and representative performance degradation numbers. This will allow readers to better assess the claims while maintaining the abstract's conciseness. revision: yes
Referee: [Dataset and Evaluation sections] Dataset and Evaluation sections: No information is given on the annotation process, train/test splits, controlled ablations isolating individual complexity factors, or comparisons against simpler subsets. Without these, it is impossible to confirm that the dataset faithfully represents real-world conditions or that performance drops are attributable to the intended challenges.

Authors: We acknowledge the absence of these details in the current version. We will expand the Dataset section to describe the annotation process (including tools, annotators, and quality control) and the train/test splits. In the Evaluation section, we will add controlled analyses or ablations isolating factors such as camera motion and occlusions, along with comparisons to simpler subsets where feasible. These additions will help attribute performance drops to the intended complexities. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper is a dataset contribution paper whose central claims concern the construction of AVTrack and the empirical observation that existing AVIS methods degrade under its conditions (camera motion, occlusions, position changes). No derivation chain, equations, fitted parameters, or predictions appear in the provided text. No self-citation is invoked as a uniqueness theorem or load-bearing premise, and the argument does not reduce any result to its own inputs by construction. The evaluation claim is therefore independent of the dataset design itself and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the unverified assumption that the introduced dataset accurately captures complex real-world conditions and that method degradation is attributable to those conditions.

axioms (1)

domain assumption Existing datasets are limited to simple or homogeneous scenes with coarse annotations.
Stated in abstract as motivation for new dataset.

pith-pipeline@v0.9.1-grok · 5713 in / 1027 out tokens · 20057 ms · 2026-06-28T14:58:48.852357+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

78 extracted references · 5 canonical work pages

[1]

Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

Audio-visual instance segmentation , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=
[2]

Proceedings of the AAAI conference on artificial intelligence , volume=

Avsegformer: Audio-visual segmentation with transformer , author=. Proceedings of the AAAI conference on artificial intelligence , volume=
[3]

European Conference on Computer Vision , pages=

Audio--visual segmentation , author=. European Conference on Computer Vision , pages=. 2022 , organization=

2022
[4]

International Journal of Computer Vision , volume=

Audio-visual segmentation with semantics , author=. International Journal of Computer Vision , volume=. 2025 , publisher=

2025
[5]

European Conference on Computer Vision , pages=

Stepping stones: A progressive training strategy for audio-visual semantic segmentation , author=. European Conference on Computer Vision , pages=. 2024 , organization=

2024
[6]

Proceedings of the IEEE/CVF international conference on computer vision , pages=

Segment anything , author=. Proceedings of the IEEE/CVF international conference on computer vision , pages=
[7]

European Conference on Computer Vision , pages=

Ref-avs: Refer and segment objects in audio-visual scenes , author=. European Conference on Computer Vision , pages=. 2024 , organization=

2024
[8]

2024 , isbn =

Wang, Yaoting and Liu, Weisong and Li, Guangyao and Ding, Jian and Hu, Di and Li, Xi , title =. 2024 , isbn =. doi:10.1609/aaai.v38i6.28378 , booktitle =

work page doi:10.1609/aaai.v38i6.28378 2024
[9]

European Conference on Computer Vision , pages=

Can Textual Semantics Mitigate Sounding Object Segmentation Preference? , author=. European Conference on Computer Vision , pages=. 2024 , organization=

2024
[10]

2024 , isbn =

Hao, Dawei and Mao, Yuxin and He, Bowen and Han, Xiaodong and Dai, Yuchao and Zhong, Yiran , title =. 2024 , isbn =. doi:10.1609/aaai.v38i3.27978 , booktitle =

work page doi:10.1609/aaai.v38i3.27978 2024
[11]

ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=

Bi-directional modality fusion network for audio-visual event localization , author=. ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=. 2022 , organization=

2022
[12]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Cooperation does matter: Exploring multi-order bilateral relations for audio-visual segmentation , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
[13]

Proceedings of the IEEE/CVF international conference on computer vision , pages=

Video instance segmentation , author=. Proceedings of the IEEE/CVF international conference on computer vision , pages=
[14]

European Conference on Computer Vision , pages=

Video mask transfiner for high-quality video instance segmentation , author=. European Conference on Computer Vision , pages=. 2022 , organization=

2022
[15]

International Journal of Computer Vision , volume=

Occluded video instance segmentation: A benchmark , author=. International Journal of Computer Vision , volume=. 2022 , publisher=

2022
[16]

European conference on computer vision , pages=

Stem-seg: Spatio-temporal embeddings for instance segmentation in videos , author=. European conference on computer vision , pages=. 2020 , organization=

2020
[17]

European Conference on Computer Vision , pages=

Sipmask: Spatial information preservation for fast image and video instance segmentation , author=. European Conference on Computer Vision , pages=. 2020 , organization=

2020
[18]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Sg-net: Spatial granularity network for one-stage video instance segmentation , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
[19]

proceedings of the IEEE/CVF international conference on computer vision , pages=

Crossover learning for fast online video instance segmentation , author=. proceedings of the IEEE/CVF international conference on computer vision , pages=
[20]

Proceedings of the IEEE international conference on computer vision , pages=

Mask r-cnn , author=. Proceedings of the IEEE international conference on computer vision , pages=
[21]

Proceedings of the IEEE/CVF international conference on computer vision , pages=

Sotr: Segmenting objects with transformers , author=. Proceedings of the IEEE/CVF international conference on computer vision , pages=
[22]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

End-to-end video instance segmentation with transformers , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
[23]

European Conference on Computer Vision , pages=

Seqformer: Sequential transformer for video instance segmentation , author=. European Conference on Computer Vision , pages=. 2022 , organization=

2022
[24]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Masked-attention mask transformer for universal image segmentation , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
[25]

Advances in Neural Information Processing Systems , volume=

Video instance segmentation using inter-frame communication transformers , author=. Advances in Neural Information Processing Systems , volume=
[26]

Advances in neural information processing systems , volume=

Vita: Video instance segmentation via object token association , author=. Advances in neural information processing systems , volume=
[27]

IEEE Transactions on Image Processing , volume=

Toward achieving robust low-level and high-level scene parsing , author=. IEEE Transactions on Image Processing , volume=. 2018 , publisher=

2018
[28]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Learning instance occlusion for panoptic segmentation , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
[29]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Self-supervised scene de-occlusion , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
[30]

International Journal of Computer Vision , volume=

Compositional convolutional neural networks: A robust and interpretable model for object recognition under occlusion , author=. International Journal of Computer Vision , volume=. 2021 , publisher=

2021
[31]

Proceedings of the IEEE/CVF international conference on computer vision , pages=

MOSE: A new dataset for video object segmentation in complex scenes , author=. Proceedings of the IEEE/CVF international conference on computer vision , pages=
[32]

Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

Implicit Counterfactual Learning for Audio-Visual Segmentation , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
[33]

Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

Learning to localize sound source in visual scenes , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=
[34]

arXiv preprint arXiv:2310.14778 , year=

Audio-visual speaker tracking: Progress, challenges, and future directions , author=. arXiv preprint arXiv:2310.14778 , year=

arXiv
[35]

Proceedings of the AAAI conference on artificial intelligence , year=

Multi-modal perception attention network with self-supervised learning for audio-visual speaker tracking , author=. Proceedings of the AAAI conference on artificial intelligence , year=
[36]

ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=

Multi-Stage Multimodal Distillation for Audio-Visual Speaker Tracking , author=. ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=. 2025 , organization=

2025
[37]

Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

CAVIS: Context-Aware Video Instance Segmentation , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
[38]

3: An audio-visual corpus for speaker localization and tracking , author=

AV16. 3: An audio-visual corpus for speaker localization and tracking , author=. International Workshop on Machine Learning for Multimodal Interaction , pages=. 2004 , organization=

2004
[39]

IEEE Transactions on Multimedia , volume=

Multi-speaker tracking from an audio--visual sensing device , author=. IEEE Transactions on Multimedia , volume=. 2019 , publisher=

2019
[40]

IEEE/ACM Transactions on Audio, Speech, and Language Processing , volume=

Audio-visual cross-attention network for robotic speaker tracking , author=. IEEE/ACM Transactions on Audio, Speech, and Language Processing , volume=. 2022 , publisher=

2022
[41]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Deep multimodal clustering for unsupervised audiovisual learning , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
[42]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Localizing visual sounds the hard way , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
[43]

arXiv preprint arXiv:2401.14159 , year=

Grounded sam: Assembling open-world models for diverse visual tasks , author=. arXiv preprint arXiv:2401.14159 , year=

Pith/arXiv arXiv
[44]

ICASSP 2020-2020 IEEE international conference on acoustics, speech and signal processing (ICASSP) , pages=

Ava active speaker: An audio-visual dataset for active speaker detection , author=. ICASSP 2020-2020 IEEE international conference on acoustics, speech and signal processing (ICASSP) , pages=. 2020 , organization=

2020
[45]

IEEE Transactions on Circuits and Systems for Video Technology , year=

Learning better video query with sam for video instance segmentation , author=. IEEE Transactions on Circuits and Systems for Video Technology , year=
[46]

arXiv preprint arXiv:2509.22740 , year=

Learning What To Hear: Boosting Sound-Source Association For Robust Audiovisual Instance Segmentation , author=. arXiv preprint arXiv:2509.22740 , year=

arXiv
[47]

International journal of computer vision , volume=

Hota: A higher order metric for evaluating multi-object tracking , author=. International journal of computer vision , volume=. 2021 , publisher=

2021
[48]

Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

Actor and action video segmentation from a sentence , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=
[49]

SoccerNet: A Scalable Dataset for Action Spotting in Soccer Videos , url=

Giancola, Silvio and Amine, Mohieddine and Dghaily, Tarek and Ghanem, Bernard , year=. SoccerNet: A Scalable Dataset for Action Spotting in Soccer Videos , url=. doi:10.1109/cvprw.2018.00223 , booktitle=

work page doi:10.1109/cvprw.2018.00223 2018
[50]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Soccernet-v2: A dataset and benchmarks for holistic understanding of broadcast soccer videos , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
[51]

Roadsaw: A large-scale dataset for camera- based road surface and wetness estimation,

Cioppa, Anthony and Giancola, Silvio and Deliege, Adrien and Kang, Le and Zhou, Xin and Cheng, Zhiyu and Ghanem, Bernard and Van Droogenbroeck, Marc , year=. SoccerNet-Tracking: Multiple Object Tracking Dataset and Benchmark in Soccer Videos , url=. doi:10.1109/cvprw56347.2022.00393 , booktitle=

work page doi:10.1109/cvprw56347.2022.00393 2022
[52]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Youmvos: an actor-centric multi-shot video object segmentation dataset , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
[53]

arXiv preprint arXiv:2504.10174 , year=

LLaVA-ReID: Selective Multi-image Questioner for Interactive Person Re-Identification , author=. arXiv preprint arXiv:2504.10174 , year=

arXiv
[54]

Advances in Neural Information Processing Systems , volume=

Empowering visible-infrared person re-identification with large foundation models , author=. Advances in Neural Information Processing Systems , volume=
[55]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Harnessing the power of mllms for transferable text-to-image person reid , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
[56]

arXiv preprint arXiv:2401.13201 , year=

MLLMReID: multimodal large language model-based person re-identification , author=. arXiv preprint arXiv:2401.13201 , year=

arXiv
[57]

5-vl technical report , author=

Qwen2. 5-vl technical report , author=. arXiv preprint arXiv:2502.13923 , year=

Pith/arXiv arXiv
[58]

International conference on machine learning , pages=

Robust speech recognition via large-scale weak supervision , author=. International conference on machine learning , pages=. 2023 , organization=

2023
[59]

arXiv preprint arXiv:2005.07143 , year=

Ecapa-tdnn: Emphasized channel attention, propagation and aggregation in tdnn based speaker verification , author=. arXiv preprint arXiv:2005.07143 , year=

arXiv 2005
[60]

ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=

Mossformer2: Combining transformer and rnn-free recurrent network for enhanced time-domain monaural speech separation , author=. ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=. 2024 , organization=

2024
[61]

arXiv preprint arXiv:2511.16719 , year=

Sam 3: Segment anything with concepts , author=. arXiv preprint arXiv:2511.16719 , year=

Pith/arXiv arXiv
[62]

Proceedings of the IEEE/CVF international conference on computer vision , pages=

Transreid: Transformer-based object re-identification , author=. Proceedings of the IEEE/CVF international conference on computer vision , pages=
[63]

2014 22nd international conference on pattern recognition , pages=

Deep metric learning for person re-identification , author=. 2014 22nd international conference on pattern recognition , pages=. 2014 , organization=

2014
[64]

Proceedings of the IEEE international conference on computer vision , pages=

Deeply-learned part-aligned representations for person re-identification , author=. Proceedings of the IEEE international conference on computer vision , pages=
[65]

Computer Vision and Image Understanding , volume=

Person re-identification with part prediction alignment , author=. Computer Vision and Image Understanding , volume=. 2021 , publisher=

2021
[66]

A benchmark of facial recognition pipelines and co-usability performances of modules , author=. Bili. 2024 , publisher=

2024
[67]

ACM Transactions on Information Systems , volume=

A survey on the memory mechanism of large language model-based agents , author=. ACM Transactions on Information Systems , volume=. 2025 , publisher=

2025
[68]

arXiv preprint arXiv:2502.12110 , year=

A-mem: Agentic memory for llm agents , author=. arXiv preprint arXiv:2502.12110 , year=

Pith/arXiv arXiv
[69]

Advances in Neural Information Processing Systems , volume=

Reflexion: Language agents with verbal reinforcement learning , author=. Advances in Neural Information Processing Systems , volume=
[70]

arXiv preprint arXiv:2601.12538 , year=

Agentic Reasoning for Large Language Models , author=. arXiv preprint arXiv:2601.12538 , year=

Pith/arXiv arXiv
[71]

arXiv preprint arXiv:2308.03188 , year=

Automatically correcting large language models: Surveying the landscape of diverse self-correction strategies , author=. arXiv preprint arXiv:2308.03188 , year=

arXiv
[72]

arXiv preprint arXiv:2305.11738 , year=

Critic: Large language models can self-correct with tool-interactive critiquing , author=. arXiv preprint arXiv:2305.11738 , year=

Pith/arXiv arXiv
[73]

2024 , issue_date =

Li, Hongsheng and Zhu, Guangming and Zhang, Liang and Jiang, Youliang and Dang, Yixuan and Hou, Haoran and Shen, Peiyi and Zhao, Xia and Shah, Syed Afaq Ali and Bennamoun, Mohammed , title =. 2024 , issue_date =. doi:10.1016/j.neucom.2023.127052 , journal =

work page doi:10.1016/j.neucom.2023.127052 2024
[74]

arXiv preprint arXiv:2501.03230 , year=

Video-of-thought: Step-by-step video reasoning from perception to cognition , author=. arXiv preprint arXiv:2501.03230 , year=

arXiv
[75]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Instance relation graph guided source-free domain adaptive object detection , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
[76]

, booktitle=

VS, Vibashan and Oza, Poojan and Patel, Vishal M. , booktitle=. 2023 , doi=

2023
[77]

European conference on computer vision , pages=

End-to-end object detection with transformers , author=. European conference on computer vision , pages=. 2020 , organization=

2020
[78]

Proceedings of the IEEE/CVF international conference on computer vision , pages=

Swin transformer: Hierarchical vision transformer using shifted windows , author=. Proceedings of the IEEE/CVF international conference on computer vision , pages=

[1] [1]

Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

Audio-visual instance segmentation , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

[2] [2]

Proceedings of the AAAI conference on artificial intelligence , volume=

Avsegformer: Audio-visual segmentation with transformer , author=. Proceedings of the AAAI conference on artificial intelligence , volume=

[3] [3]

European Conference on Computer Vision , pages=

Audio--visual segmentation , author=. European Conference on Computer Vision , pages=. 2022 , organization=

2022

[4] [4]

International Journal of Computer Vision , volume=

Audio-visual segmentation with semantics , author=. International Journal of Computer Vision , volume=. 2025 , publisher=

2025

[5] [5]

European Conference on Computer Vision , pages=

Stepping stones: A progressive training strategy for audio-visual semantic segmentation , author=. European Conference on Computer Vision , pages=. 2024 , organization=

2024

[6] [6]

Proceedings of the IEEE/CVF international conference on computer vision , pages=

Segment anything , author=. Proceedings of the IEEE/CVF international conference on computer vision , pages=

[7] [7]

European Conference on Computer Vision , pages=

Ref-avs: Refer and segment objects in audio-visual scenes , author=. European Conference on Computer Vision , pages=. 2024 , organization=

2024

[8] [8]

2024 , isbn =

Wang, Yaoting and Liu, Weisong and Li, Guangyao and Ding, Jian and Hu, Di and Li, Xi , title =. 2024 , isbn =. doi:10.1609/aaai.v38i6.28378 , booktitle =

work page doi:10.1609/aaai.v38i6.28378 2024

[9] [9]

European Conference on Computer Vision , pages=

Can Textual Semantics Mitigate Sounding Object Segmentation Preference? , author=. European Conference on Computer Vision , pages=. 2024 , organization=

2024

[10] [10]

2024 , isbn =

Hao, Dawei and Mao, Yuxin and He, Bowen and Han, Xiaodong and Dai, Yuchao and Zhong, Yiran , title =. 2024 , isbn =. doi:10.1609/aaai.v38i3.27978 , booktitle =

work page doi:10.1609/aaai.v38i3.27978 2024

[11] [11]

ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=

Bi-directional modality fusion network for audio-visual event localization , author=. ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=. 2022 , organization=

2022

[12] [12]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Cooperation does matter: Exploring multi-order bilateral relations for audio-visual segmentation , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

[13] [13]

Proceedings of the IEEE/CVF international conference on computer vision , pages=

Video instance segmentation , author=. Proceedings of the IEEE/CVF international conference on computer vision , pages=

[14] [14]

European Conference on Computer Vision , pages=

Video mask transfiner for high-quality video instance segmentation , author=. European Conference on Computer Vision , pages=. 2022 , organization=

2022

[15] [15]

International Journal of Computer Vision , volume=

Occluded video instance segmentation: A benchmark , author=. International Journal of Computer Vision , volume=. 2022 , publisher=

2022

[16] [16]

European conference on computer vision , pages=

Stem-seg: Spatio-temporal embeddings for instance segmentation in videos , author=. European conference on computer vision , pages=. 2020 , organization=

2020

[17] [17]

European Conference on Computer Vision , pages=

Sipmask: Spatial information preservation for fast image and video instance segmentation , author=. European Conference on Computer Vision , pages=. 2020 , organization=

2020

[18] [18]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Sg-net: Spatial granularity network for one-stage video instance segmentation , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

[19] [19]

proceedings of the IEEE/CVF international conference on computer vision , pages=

Crossover learning for fast online video instance segmentation , author=. proceedings of the IEEE/CVF international conference on computer vision , pages=

[20] [20]

Proceedings of the IEEE international conference on computer vision , pages=

Mask r-cnn , author=. Proceedings of the IEEE international conference on computer vision , pages=

[21] [21]

Proceedings of the IEEE/CVF international conference on computer vision , pages=

Sotr: Segmenting objects with transformers , author=. Proceedings of the IEEE/CVF international conference on computer vision , pages=

[22] [22]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

End-to-end video instance segmentation with transformers , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

[23] [23]

European Conference on Computer Vision , pages=

Seqformer: Sequential transformer for video instance segmentation , author=. European Conference on Computer Vision , pages=. 2022 , organization=

2022

[24] [24]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Masked-attention mask transformer for universal image segmentation , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

[25] [25]

Advances in Neural Information Processing Systems , volume=

Video instance segmentation using inter-frame communication transformers , author=. Advances in Neural Information Processing Systems , volume=

[26] [26]

Advances in neural information processing systems , volume=

Vita: Video instance segmentation via object token association , author=. Advances in neural information processing systems , volume=

[27] [27]

IEEE Transactions on Image Processing , volume=

Toward achieving robust low-level and high-level scene parsing , author=. IEEE Transactions on Image Processing , volume=. 2018 , publisher=

2018

[28] [28]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Learning instance occlusion for panoptic segmentation , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

[29] [29]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Self-supervised scene de-occlusion , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

[30] [30]

International Journal of Computer Vision , volume=

Compositional convolutional neural networks: A robust and interpretable model for object recognition under occlusion , author=. International Journal of Computer Vision , volume=. 2021 , publisher=

2021

[31] [31]

Proceedings of the IEEE/CVF international conference on computer vision , pages=

MOSE: A new dataset for video object segmentation in complex scenes , author=. Proceedings of the IEEE/CVF international conference on computer vision , pages=

[32] [32]

Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

Implicit Counterfactual Learning for Audio-Visual Segmentation , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

[33] [33]

Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

Learning to localize sound source in visual scenes , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

[34] [34]

arXiv preprint arXiv:2310.14778 , year=

Audio-visual speaker tracking: Progress, challenges, and future directions , author=. arXiv preprint arXiv:2310.14778 , year=

arXiv

[35] [35]

Proceedings of the AAAI conference on artificial intelligence , year=

Multi-modal perception attention network with self-supervised learning for audio-visual speaker tracking , author=. Proceedings of the AAAI conference on artificial intelligence , year=

[36] [36]

ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=

Multi-Stage Multimodal Distillation for Audio-Visual Speaker Tracking , author=. ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=. 2025 , organization=

2025

[37] [37]

Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

CAVIS: Context-Aware Video Instance Segmentation , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

[38] [38]

3: An audio-visual corpus for speaker localization and tracking , author=

AV16. 3: An audio-visual corpus for speaker localization and tracking , author=. International Workshop on Machine Learning for Multimodal Interaction , pages=. 2004 , organization=

2004

[39] [39]

IEEE Transactions on Multimedia , volume=

Multi-speaker tracking from an audio--visual sensing device , author=. IEEE Transactions on Multimedia , volume=. 2019 , publisher=

2019

[40] [40]

IEEE/ACM Transactions on Audio, Speech, and Language Processing , volume=

Audio-visual cross-attention network for robotic speaker tracking , author=. IEEE/ACM Transactions on Audio, Speech, and Language Processing , volume=. 2022 , publisher=

2022

[41] [41]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Deep multimodal clustering for unsupervised audiovisual learning , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

[42] [42]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Localizing visual sounds the hard way , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

[43] [43]

arXiv preprint arXiv:2401.14159 , year=

Grounded sam: Assembling open-world models for diverse visual tasks , author=. arXiv preprint arXiv:2401.14159 , year=

Pith/arXiv arXiv

[44] [44]

ICASSP 2020-2020 IEEE international conference on acoustics, speech and signal processing (ICASSP) , pages=

Ava active speaker: An audio-visual dataset for active speaker detection , author=. ICASSP 2020-2020 IEEE international conference on acoustics, speech and signal processing (ICASSP) , pages=. 2020 , organization=

2020

[45] [45]

IEEE Transactions on Circuits and Systems for Video Technology , year=

Learning better video query with sam for video instance segmentation , author=. IEEE Transactions on Circuits and Systems for Video Technology , year=

[46] [46]

arXiv preprint arXiv:2509.22740 , year=

Learning What To Hear: Boosting Sound-Source Association For Robust Audiovisual Instance Segmentation , author=. arXiv preprint arXiv:2509.22740 , year=

arXiv

[47] [47]

International journal of computer vision , volume=

Hota: A higher order metric for evaluating multi-object tracking , author=. International journal of computer vision , volume=. 2021 , publisher=

2021

[48] [48]

Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

Actor and action video segmentation from a sentence , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

[49] [49]

SoccerNet: A Scalable Dataset for Action Spotting in Soccer Videos , url=

Giancola, Silvio and Amine, Mohieddine and Dghaily, Tarek and Ghanem, Bernard , year=. SoccerNet: A Scalable Dataset for Action Spotting in Soccer Videos , url=. doi:10.1109/cvprw.2018.00223 , booktitle=

work page doi:10.1109/cvprw.2018.00223 2018

[50] [50]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Soccernet-v2: A dataset and benchmarks for holistic understanding of broadcast soccer videos , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

[51] [51]

Roadsaw: A large-scale dataset for camera- based road surface and wetness estimation,

Cioppa, Anthony and Giancola, Silvio and Deliege, Adrien and Kang, Le and Zhou, Xin and Cheng, Zhiyu and Ghanem, Bernard and Van Droogenbroeck, Marc , year=. SoccerNet-Tracking: Multiple Object Tracking Dataset and Benchmark in Soccer Videos , url=. doi:10.1109/cvprw56347.2022.00393 , booktitle=

work page doi:10.1109/cvprw56347.2022.00393 2022

[52] [52]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Youmvos: an actor-centric multi-shot video object segmentation dataset , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

[53] [53]

arXiv preprint arXiv:2504.10174 , year=

LLaVA-ReID: Selective Multi-image Questioner for Interactive Person Re-Identification , author=. arXiv preprint arXiv:2504.10174 , year=

arXiv

[54] [54]

Advances in Neural Information Processing Systems , volume=

Empowering visible-infrared person re-identification with large foundation models , author=. Advances in Neural Information Processing Systems , volume=

[55] [55]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Harnessing the power of mllms for transferable text-to-image person reid , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

[56] [56]

arXiv preprint arXiv:2401.13201 , year=

MLLMReID: multimodal large language model-based person re-identification , author=. arXiv preprint arXiv:2401.13201 , year=

arXiv

[57] [57]

5-vl technical report , author=

Qwen2. 5-vl technical report , author=. arXiv preprint arXiv:2502.13923 , year=

Pith/arXiv arXiv

[58] [58]

International conference on machine learning , pages=

Robust speech recognition via large-scale weak supervision , author=. International conference on machine learning , pages=. 2023 , organization=

2023

[59] [59]

arXiv preprint arXiv:2005.07143 , year=

Ecapa-tdnn: Emphasized channel attention, propagation and aggregation in tdnn based speaker verification , author=. arXiv preprint arXiv:2005.07143 , year=

arXiv 2005

[60] [60]

ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=

Mossformer2: Combining transformer and rnn-free recurrent network for enhanced time-domain monaural speech separation , author=. ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=. 2024 , organization=

2024

[61] [61]

arXiv preprint arXiv:2511.16719 , year=

Sam 3: Segment anything with concepts , author=. arXiv preprint arXiv:2511.16719 , year=

Pith/arXiv arXiv

[62] [62]

Proceedings of the IEEE/CVF international conference on computer vision , pages=

Transreid: Transformer-based object re-identification , author=. Proceedings of the IEEE/CVF international conference on computer vision , pages=

[63] [63]

2014 22nd international conference on pattern recognition , pages=

Deep metric learning for person re-identification , author=. 2014 22nd international conference on pattern recognition , pages=. 2014 , organization=

2014

[64] [64]

Proceedings of the IEEE international conference on computer vision , pages=

Deeply-learned part-aligned representations for person re-identification , author=. Proceedings of the IEEE international conference on computer vision , pages=

[65] [65]

Computer Vision and Image Understanding , volume=

Person re-identification with part prediction alignment , author=. Computer Vision and Image Understanding , volume=. 2021 , publisher=

2021

[66] [66]

A benchmark of facial recognition pipelines and co-usability performances of modules , author=. Bili. 2024 , publisher=

2024

[67] [67]

ACM Transactions on Information Systems , volume=

A survey on the memory mechanism of large language model-based agents , author=. ACM Transactions on Information Systems , volume=. 2025 , publisher=

2025

[68] [68]

arXiv preprint arXiv:2502.12110 , year=

A-mem: Agentic memory for llm agents , author=. arXiv preprint arXiv:2502.12110 , year=

Pith/arXiv arXiv

[69] [69]

Advances in Neural Information Processing Systems , volume=

Reflexion: Language agents with verbal reinforcement learning , author=. Advances in Neural Information Processing Systems , volume=

[70] [70]

arXiv preprint arXiv:2601.12538 , year=

Agentic Reasoning for Large Language Models , author=. arXiv preprint arXiv:2601.12538 , year=

Pith/arXiv arXiv

[71] [71]

arXiv preprint arXiv:2308.03188 , year=

Automatically correcting large language models: Surveying the landscape of diverse self-correction strategies , author=. arXiv preprint arXiv:2308.03188 , year=

arXiv

[72] [72]

arXiv preprint arXiv:2305.11738 , year=

Critic: Large language models can self-correct with tool-interactive critiquing , author=. arXiv preprint arXiv:2305.11738 , year=

Pith/arXiv arXiv

[73] [73]

2024 , issue_date =

Li, Hongsheng and Zhu, Guangming and Zhang, Liang and Jiang, Youliang and Dang, Yixuan and Hou, Haoran and Shen, Peiyi and Zhao, Xia and Shah, Syed Afaq Ali and Bennamoun, Mohammed , title =. 2024 , issue_date =. doi:10.1016/j.neucom.2023.127052 , journal =

work page doi:10.1016/j.neucom.2023.127052 2024

[74] [74]

arXiv preprint arXiv:2501.03230 , year=

Video-of-thought: Step-by-step video reasoning from perception to cognition , author=. arXiv preprint arXiv:2501.03230 , year=

arXiv

[75] [75]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Instance relation graph guided source-free domain adaptive object detection , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

[76] [76]

, booktitle=

VS, Vibashan and Oza, Poojan and Patel, Vishal M. , booktitle=. 2023 , doi=

2023

[77] [77]

European conference on computer vision , pages=

End-to-end object detection with transformers , author=. European conference on computer vision , pages=. 2020 , organization=

2020

[78] [78]

Proceedings of the IEEE/CVF international conference on computer vision , pages=

Swin transformer: Hierarchical vision transformer using shifted windows , author=. Proceedings of the IEEE/CVF international conference on computer vision , pages=