pith. machine review for the scientific record.

arxiv: 2605.02912 · v1 · submitted 2026-04-07 · 💻 cs.CV · cs.AI · cs.LG

Recognition: 2 theorem links · Lean Theorem

Reasoning-Guided Grounding: Elevating Video Anomaly Detection through Multimodal Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:59 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords video anomaly detection · vision-language models · chain-of-thought reasoning · spatial grounding · multimodal learning · interpretable AI · UCF-Crime · zero-shot transfer

The pith

VANGUARD unifies video anomaly classification, spatial grounding, and chain-of-thought reasoning in a single vision-language model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces VANGUARD, a framework that trains a multimodal large language model to classify anomalies in video while also producing spatial localizations of anomalous objects and generating chain-of-thought explanations for its decisions. Earlier video anomaly detection approaches delivered only binary labels or outlier scores, without spatial or logical justification. A three-stage curriculum first warms up classification on frozen features, then adapts for grounding via LoRA, and finally adds reasoning generation, with annotations supplied by a teacher-student pipeline that uses Qwen3-VL-4B for reasoning trajectories and GroundingDINO for boxes. On the UCF-Crime benchmark this yields 94% ROC-AUC and 84% F1 alongside the new interpretability features, and the same model transfers zero-shot to XD-Violence and ShanghaiTech. Ablations indicate that the staged approach and the reasoning objective act as implicit regularizers, yielding more balanced predictions than classification-only training.
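
To make the staged recipe concrete, here is a minimal sketch of how such a curriculum could be wired up, assuming a PyTorch VLM whose forward pass exposes lm_loss and giou_loss terms and whose LoRA adapters are already injected (e.g. via peft). Function names, learning rates, and the loss weight are illustrative assumptions, not the authors' code.

```python
# Illustrative sketch of a three-stage curriculum like VANGUARD's.
# Assumed interfaces: model(**batch) returns an object with .lm_loss and
# .giou_loss; LoRA adapters are already injected (e.g. via peft).
import torch
import torch.nn as nn

def stage1_warmup(head, feat_loader, epochs=1):
    """Stage 1: train only a classification head on frozen backbone features (L_bce).
    feat_loader yields (features, video_label); the VLM backbone stays frozen."""
    opt = torch.optim.AdamW(head.parameters(), lr=1e-3)
    bce = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        for feats, labels in feat_loader:
            loss = bce(head(feats).squeeze(-1), labels.float())
            opt.zero_grad(); loss.backward(); opt.step()

def stage2_grounding(model, mixed_loader, epochs=1, giou_weight=1.0):
    """Stage 2: unfreeze LoRA adapters; ~80% detection / ~20% CoT batches,
    optimizing L_lm + L_giou for spatial grounding."""
    lora = [p for n, p in model.named_parameters() if "lora" in n]
    for p in lora:
        p.requires_grad = True
    opt = torch.optim.AdamW(lora, lr=2e-4)
    for _ in range(epochs):
        for batch in mixed_loader:
            out = model(**batch)                     # tokens include box coords
            loss = out.lm_loss + giou_weight * out.giou_loss
            opt.zero_grad(); loss.backward(); opt.step()

def stage3_reasoning(model, cot_loader, epochs=1):
    """Stage 3: refine chain-of-thought on video-level CoT data only,
    dropping the GIoU term (each stage resumes from the previous checkpoint)."""
    opt = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=1e-4)
    for _ in range(epochs):
        for batch in cot_loader:
            loss = model(**batch).lm_loss            # L_lm only
            opt.zero_grad(); loss.backward(); opt.step()
```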

Core claim

VANGUARD is a single-VLM framework that unifies anomaly classification, spatial grounding, and chain-of-thought reasoning through a three-stage curriculum: classifier warmup on frozen backbone features, LoRA-adapted spatial grounding, and chain-of-thought generation. Trained with structured reasoning trajectories from a Qwen3-VL-4B teacher and bounding-box supervision from GroundingDINO, it attains 94% ROC-AUC and 84% F1 on UCF-Crime while also enabling zero-shot cross-domain generalization.

What carries the argument

The three-stage curriculum that progressively layers classifier warmup, LoRA-adapted spatial grounding, and chain-of-thought generation on a vision-language model, supported by teacher-student annotation for reasoning trajectories.

Load-bearing premise

The teacher-student pipeline using Qwen3-VL-4B generates sufficiently accurate and unbiased structured reasoning trajectories and bounding-box supervision to train the student model without introducing systematic errors.
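
For reference, a minimal sketch of such a pipeline under stated assumptions: teacher.describe_objects and detector.detect are hypothetical stand-ins for Qwen3-VL-4B prompting and GroundingDINO queries, and the record format is invented for illustration.

```python
def annotate_subclip(keyframe, teacher, detector):
    """Return weak labels for one subclip: per-object class, reasoning, and box."""
    # Teacher VLM proposes objects with Normal/Abnormal tags and reasoning text.
    proposals = teacher.describe_objects(keyframe)   # hypothetical interface
    records = []
    for obj in proposals:          # e.g. {"name": "man", "label": "Abnormal", "reasoning": "..."}
        boxes = detector.detect(keyframe, prompt=obj["name"])  # GroundingDINO-style query
        if not boxes:
            continue               # drop objects the detector cannot ground
        records.append({
            "object": obj["name"],
            "label": obj["label"],              # Normal / Abnormal
            "reasoning": obj["reasoning"],      # free-text CoT step
            "box": max(boxes, key=lambda b: b["score"])["xyxy"],  # highest-confidence box
        })
    return records                 # becomes supervision for the student VLM
```

The guard that drops ungrounded objects is one plausible hallucination mitigation; the premise above is precisely that such filtering leaves the surviving labels accurate enough.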

What would settle it

A controlled experiment on UCF-Crime showing that VANGUARD produces geometrically invalid bounding boxes at high rates or that its ROC-AUC drops below 85% when the reasoning stage is removed would falsify the claim that the unified staged training reliably delivers accurate grounding and classification together.
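
That criterion is mechanically checkable. A sketch of the two measurements, assuming per-video anomaly scores from the full and reasoning-ablated models plus the predicted boxes; the 85% AUC floor follows the criterion above, while the 5% invalid-box ceiling is an illustrative choice.

```python
from sklearn.metrics import roc_auc_score

def invalid_box_rate(pred_boxes, frame_w, frame_h):
    """Fraction of predicted boxes that are degenerate or out of frame."""
    def invalid(b):
        x1, y1, x2, y2 = b
        return (x2 <= x1 or y2 <= y1                    # zero/negative extent
                or x1 < 0 or y1 < 0 or x2 > frame_w or y2 > frame_h)
    return sum(invalid(b) for b in pred_boxes) / max(len(pred_boxes), 1)

def falsification_report(labels, scores_full, scores_no_reasoning,
                         pred_boxes, frame_w, frame_h,
                         auc_floor=0.85, bad_box_ceiling=0.05):
    """Apply the two failure conditions stated above."""
    report = {
        "auc_full": roc_auc_score(labels, scores_full),
        "auc_no_reasoning": roc_auc_score(labels, scores_no_reasoning),
        "invalid_box_rate": invalid_box_rate(pred_boxes, frame_w, frame_h),
    }
    report["falsified"] = (report["invalid_box_rate"] > bad_box_ceiling
                           or report["auc_no_reasoning"] < auc_floor)
    return report
```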

Figures

Figures reproduced from arXiv: 2605.02912 by Aishik Konwer, Ankit Parag Shah, Sakshi Agarwal.

Figure 1: VANGUARD-Bench dataset creation pipeline and sample annotations. Left: Starting from UCF-Crime surveillance videos, we extract CLIP-based keyframes to segment each video into temporally distinct subclips. A vision-language model (Qwen3-VL) generates per-object annotations with event classification (Normal/Abnormal) and natural language reasoning. GroundingDINO then localizes each object with bounding boxe…
Figure 2: Overview of VANGUARD's three-stage curriculum training procedure. Stage 1 trains only the classification head on video-level labels using L_bce. Stage 2 unfreezes LoRA adapters and introduces mixed data (80% image-level detection, 20% video-level CoT) with spatial grounding losses L_lm and L_giou. Stage 3 refines reasoning on video-level CoT data only, dropping the GIoU loss. Each stage loads the checkpoint f…
Figure 3: Qualitative spatial grounding results on UCF-Crime test samples. Our trained model accurately localizes objects in surveillance frames by predicting bounding boxes (blue) that closely align with ground-truth annotations (green). Overlapping labels indicate correctly identified object categories. The model demonstrates robust grounding across diverse scenes including residential areas, streets, fire incide…
Figure 4: Qualitative CoT-grounded reasoning for video anomaly detection on UCF-Crime test samples. Our model localizes objects in surveillance frames by predicting bounding boxes and provides per-object reasoning while recording observations. VANGUARD identifies anomalous regions in its analysis and predicts the final answer as “Abnormal” or “Normal.” The results demonstrate robust explainability and interpretabili…
Figure 5: Qualitative comparison of spatial grounding across methods on UCF-Crime test samples. Each row shows a different test video frame. Columns from left to right: ground-truth annotations (green boxes with object labels), SpatialVLM, Kosmos-2, and VANGUARD (red boxes). VANGUARD produces tighter, more accurate bounding boxes with correct object-level anomaly attribution, localizing anomalous entities and their …
Figure 6: Qualitative comparison of CoT-grounded reasoning for video anomaly detection on UCF-Crime test samples. Each row shows sampled frames from a different test video. Columns display the generated reasoning from VANGUARD (Ours), HolmesVAD, and Vad-R1. VANGUARD produces structured, spatially-grounded reasoning, identifying anomalous objects with bounding box coordinates and per-object explanations (e.g., intrude…
Figure 7: Pipeline output for a subclip from Arrest002. The VLM detects two men engaged in physical aggression (Abnormal) alongside passive scene objects: a ladder, floor, and wall (Normal). GroundingDINO localizes each object with bounding boxes; a greedy matching assigns one unique bounding box per object: detections are ranked by confidence and assigned in order, with any candidate whose IoU with an already-assigned box exceeds 0.5 …
Figure 8: Spatial grounding output for an abnormal subclip from Arrest002. GroundingDINO localizes VLM-detected objects on the keyframe: two men involved in a physical altercation (red, Abnormal) and a stationary ladder in the background (green, Normal). Scene-level objects such as “floor” and “wall” were excluded as their bounding boxes exceeded 50% of the frame area. A greedy deduplication step ensures each detect…
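
Figures 7 and 8 spell out a concrete post-processing rule: rank detections by confidence, assign greedily, suppress any candidate whose IoU with an already-assigned box exceeds 0.5, and exclude scene-level boxes covering more than 50% of the frame. A minimal sketch, assuming (x1, y1, x2, y2) pixel boxes:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def assign_boxes(detections, frame_w, frame_h,
                 iou_thresh=0.5, max_area_frac=0.5):
    """detections: list of (confidence, box); returns one box per kept object."""
    frame_area = frame_w * frame_h
    assigned = []
    for score, box in sorted(detections, key=lambda d: -d[0]):
        w, h = box[2] - box[0], box[3] - box[1]
        if w * h > max_area_frac * frame_area:
            continue  # scene-level box (e.g. "floor", "wall"): excluded
        if any(iou(box, kept) > iou_thresh for kept in assigned):
            continue  # duplicate of an already-assigned object
        assigned.append(box)
    return assigned
```
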
Original abstract

Video Anomaly Detection (VAD) has traditionally been framed as binary classification or outlier detection, providing neither interpretable reasoning nor precise spatial localization of anomalous events. While Vision-Language Models (VLMs) offer rich scene understanding, they struggle with reliable spatial grounding - often producing hallucinated or geometrically invalid bounding boxes when asked to localize objects. We propose VANGUARD (Video Anomaly Understanding through Reasoning and Grounding), a framework that unifies anomaly classification, spatial grounding, and chain-of-thought reasoning within a single VLM. VANGUARD introduces a three-stage curriculum that progressively layers training objectives: (1) classifier warmup on frozen backbone features, (2) LoRA-adapted spatial grounding, and (3) chain-of-thought generation. To overcome the sparse annotation typical of VAD benchmarks, we employ a teacher-student annotation pipeline in which a VLM (Qwen3-VL-4B) generates structured per-subclip reasoning trajectories based on manual annotations available from the UCA Dataset. Further, GroundingDINO provides bounding box supervision. On UCF-Crime, VANGUARD achieves 94% ROC-AUC with 84% F1 while simultaneously producing interpretable chain-of-thought explanations and spatial grounding of anomalous objects - capabilities absent from prior VAD methods. Ablations confirm that staged training outperforms monolithic optimization, and that structured reasoning acts as an implicit regularizer yielding more balanced predictions than classification-only fine-tuning. Zero-shot transfer to XD-Violence and ShanghaiTech demonstrates cross-domain generalization without target-domain adaptation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper proposes VANGUARD, a VLM-based framework for video anomaly detection (VAD) that unifies binary classification, spatial grounding of anomalous objects, and chain-of-thought (CoT) reasoning. It introduces a three-stage curriculum (classifier warmup on frozen features, LoRA-adapted grounding, and CoT generation) and a teacher-student annotation pipeline that uses Qwen3-VL-4B to generate structured per-subclip reasoning trajectories plus GroundingDINO for bounding-box labels, derived from UCA Dataset manual annotations. On UCF-Crime the method reports 94% ROC-AUC and 84% F1 while producing interpretable explanations and localizations; ablations are said to show that staged training outperforms monolithic optimization and that reasoning acts as an implicit regularizer. Zero-shot transfer to XD-Violence and ShanghaiTech is also claimed.

Significance. If the reported metrics and the fidelity of the generated supervision hold, the work meaningfully extends VAD beyond binary classification by adding spatial grounding and interpretable reasoning—capabilities absent from prior methods. The curriculum design and the claim that structured CoT serves as regularization are potentially influential for future multimodal VAD research, and the cross-domain zero-shot results suggest practical generalization.

major comments (2)
  1. [§3 (Method), teacher-student annotation pipeline] The 94% ROC-AUC / 84% F1 claims on UCF-Crime and the assertion that reasoning acts as an implicit regularizer both depend on the quality of the Qwen3-VL-4B-generated reasoning trajectories and GroundingDINO bounding boxes used as supervision. No quantitative validation (inter-annotator agreement, error rates, hallucination analysis, or human ratings of the generated labels) is reported; systematic errors in the teacher outputs would be reproduced by the student, directly undermining both performance numbers and the regularization claim.
  2. [§4 (Experiments), ablation and baseline tables] The statement that staged training outperforms monolithic optimization and yields more balanced predictions is load-bearing for the curriculum contribution, yet the ablations lack controls that isolate the effect of teacher-label noise (e.g., comparison against human-annotated subsets or noise-injection experiments). Without such controls it is impossible to attribute gains to genuine anomaly understanding versus artifacts in the generated supervision.
minor comments (3)
  1. The abstract and results section should explicitly reference the specific tables or figures that report the 94% ROC-AUC / 84% F1 numbers, the ablation comparisons, and the zero-shot transfer metrics.
  2. Include error bars, standard deviations across runs, or statistical significance tests for all reported metrics to support the performance claims.
  3. [§3] Notation for the three-stage curriculum (e.g., loss terms for each stage) should be defined consistently between the method description and the experimental implementation details.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. The comments identify important gaps in validation and experimental controls that we agree warrant strengthening. We respond to each major comment below and commit to the corresponding revisions.

Point-by-point responses
  1. Referee: [§3 (Method), teacher-student annotation pipeline] The 94% ROC-AUC / 84% F1 claims on UCF-Crime and the assertion that reasoning acts as an implicit regularizer both depend on the quality of the Qwen3-VL-4B-generated reasoning trajectories and GroundingDINO bounding boxes used as supervision. No quantitative validation (inter-annotator agreement, error rates, hallucination analysis, or human ratings of the generated labels) is reported; systematic errors in the teacher outputs would be reproduced by the student, directly undermining both performance numbers and the regularization claim.

    Authors: We agree that quantitative validation of the teacher-generated supervision is essential to support the reported performance and the regularization claim. In the revised manuscript we will add a dedicated human evaluation subsection. We will sample 300 sub-clips from the UCF-Crime training set and obtain independent ratings from three human annotators on factual accuracy, logical coherence of the reasoning trajectories, and spatial precision of the GroundingDINO boxes relative to the original UCA manual annotations. We will report inter-annotator agreement (Cohen’s kappa), error rates, and a hallucination analysis. These results will be used to qualify the reliability of the supervision and to strengthen the interpretation of the CoT regularization effect. revision: yes

  2. Referee: [§4 (Experiments), ablation and baseline tables] The statement that staged training outperforms monolithic optimization and yields more balanced predictions is load-bearing for the curriculum contribution, yet the ablations lack controls that isolate the effect of teacher-label noise (e.g., comparison against human-annotated subsets or noise-injection experiments). Without such controls it is impossible to attribute gains to genuine anomaly understanding versus artifacts in the generated supervision.

    Authors: We acknowledge that the existing ablations do not isolate the potential influence of label noise from the teacher pipeline. In the revision we will add two new controlled experiments: (1) systematic noise-injection ablations in which we deliberately corrupt a controlled fraction of the reasoning steps and bounding-box labels and measure the resulting degradation in ROC-AUC and F1; (2) a comparison, on any available human-annotated subset, of models trained with teacher-generated versus human labels. These results will be incorporated into the ablation tables and discussion to demonstrate that the benefits of staged training and CoT are not attributable solely to supervision artifacts. revision: yes
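
The first response above commits to an agreement analysis over 300 sampled subclips. A minimal sketch of the promised statistics, assuming each of the three annotators provides binary accept/reject judgments of the teacher labels in a shared order; averaging pairwise Cohen's kappa is one common convention for more than two raters.

```python
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

def mean_pairwise_kappa(ratings):
    """ratings: dict annotator_id -> list of 0/1 judgments, aligned by subclip."""
    pairs = list(combinations(ratings, 2))
    return sum(cohen_kappa_score(ratings[a], ratings[b]) for a, b in pairs) / len(pairs)

def teacher_error_rate(ratings):
    """Share of sampled subclips where a majority of annotators rejected the label."""
    items = list(zip(*ratings.values()))      # per-subclip judgment tuples
    return sum(sum(js) <= len(js) // 2 for js in items) / len(items)
```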
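
The second response promises noise-injection ablations. A sketch of the corruption step, assuming annotation records shaped like the pipeline sketch earlier; train_and_eval is a placeholder for the full retrain-and-score loop and is assumed to return (roc_auc, f1).

```python
import random

def corrupt_labels(records, frac, rng=random.Random(0)):
    """Flip Normal/Abnormal tags and translate boxes for a random fraction of records."""
    noisy = [dict(r) for r in records]
    for r in rng.sample(noisy, int(frac * len(noisy))):
        r["label"] = "Normal" if r["label"] == "Abnormal" else "Abnormal"
        x1, y1, x2, y2 = r["box"]
        dx, dy = rng.uniform(-20, 20), rng.uniform(-20, 20)   # pixel jitter
        r["box"] = (x1 + dx, y1 + dy, x2 + dx, y2 + dy)
    return noisy

def noise_sweep(records, train_and_eval, fracs=(0.0, 0.1, 0.2, 0.4)):
    """Measure metric degradation as supervision noise increases."""
    # train_and_eval(noisy_records) -> (roc_auc, f1) is a placeholder.
    return [(f, *train_and_eval(corrupt_labels(records, f))) for f in fracs]
```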

Circularity Check

0 steps flagged

No significant circularity; purely empirical framework with no derivations

Full rationale

The paper presents an empirical VLM-based framework for video anomaly detection using a three-stage curriculum and a teacher-student pipeline that generates labels via Qwen3-VL-4B and GroundingDINO from UCA Dataset annotations. No mathematical equations, derivations, predictions, or first-principles results appear in the provided text. Performance claims (e.g., 94% ROC-AUC on UCF-Crime) are experimental outcomes on standard benchmarks, not reductions of outputs to fitted inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing elements. The method is self-contained as an applied ML contribution without circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the reliability of automatically generated annotations from a larger VLM and on the assumption that staged optimization improves both accuracy and interpretability over joint training. No free parameters are explicitly named in the abstract.

axioms (1)
  • domain assumption: Standard machine-learning assumptions that the training distribution is representative and that LoRA adaptation preserves useful features from the frozen backbone.
    Implicit in any fine-tuning of large vision-language models on downstream tasks.
invented entities (1)
  • VANGUARD three-stage curriculum (no independent evidence)
    purpose: Progressively train classification, grounding, and reasoning capabilities
    New training procedure introduced by the paper

pith-pipeline@v0.9.0 · 5596 in / 1516 out tokens · 45449 ms · 2026-05-10T18:59:08.703981+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Reference graph

Works this paper leans on

76 extracted references · 11 canonical work pages · 5 internal anchors
