Hawkeye: Training Video-Text LLMs for Grounding Text in Videos. CoRR, abs/2403.10228.
9 representative citing papers
- EvoGround: Self-Evolving Video Agents for Video Temporal Grounding
  A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.
- MarkIt: Training-Free Visual Markers for Precise Video Temporal Grounding
  MarkIt uses a query-to-mask bridge with open-vocabulary segmentation to add visual markers and frame indices to videos, enabling Vid-LLMs to achieve state-of-the-art temporal grounding on moment retrieval and highlight detection benchmarks (a minimal sketch of the frame-indexing idea follows this list).
- OmniVTG: A Large-Scale Dataset and Training Paradigm for Open-World Video Temporal Grounding
  OmniVTG builds a new large-scale open-world video temporal grounding dataset via iterative concept-gap filling and timestamped captioning, paired with a three-stage self-correction chain-of-thought paradigm that yields state-of-the-art zero-shot results on four existing benchmarks.
- Bridging Time and Space: Decoupled Spatio-Temporal Alignment for Video Grounding
  Bridge-STG decouples spatio-temporal alignment via semantic bridging and query-guided localization modules, achieving a state-of-the-art m_vIoU of 34.3 on VidSTG among MLLM-based methods.
- A Paradigm Shift: Fully End-to-End Training for Temporal Sentence Grounding in Videos
  Fully end-to-end training with a sentence-conditioned adapter outperforms frozen-backbone baselines at localizing the video segments that match sentence queries (a task conventionally scored by temporal IoU; see the metric sketch after this list).
- ViLL-E: Video LLM Embeddings for Retrieval
  ViLL-E introduces a dynamic embedding mechanism and joint contrastive-generative training for VideoLLMs, delivering gains of up to 7% in temporal localization and 4% in video retrieval while enabling new zero-shot capabilities.
- UniversalVTG: A Universal and Lightweight Foundation Model for Video Temporal Grounding
  UniversalVTG is a lightweight foundation model for video temporal grounding that achieves state-of-the-art results across five benchmarks while being over 100 times smaller than recent MLLM-based methods.
- How Should Video LLMs Output Time? An Analysis of Efficient Temporal Grounding Paradigms
  A controlled study on compact video LLMs finds that continuous temporal decoding delivers the strongest accuracy-efficiency trade-off for video temporal grounding across three benchmarks (the two output paradigms are sketched after this list).
- Empowering Video Translation using Multimodal Large Language Models
  The paper offers the first focused review of MLLM-based video translation, organized by a three-role taxonomy of Semantic Reasoner, Expressive Performer, and Visual Synthesizer, and outlines open challenges.
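For readers unfamiliar with the marker idea referenced in the MarkIt entry, the sketch below shows only the frame-indexing part in its simplest form: stamping each sampled frame with its index and timestamp before the frames are handed to a video LLM. The helper name and parameters are hypothetical, and the query-to-mask segmentation stage is deliberately omitted; this illustrates the general technique, not the paper's implementation.

```python
from PIL import Image, ImageDraw

def stamp_frame_indices(frames, fps=1.0):
    """Hypothetical helper: overlay each sampled frame with its index and
    timestamp so a video LLM can cite explicit temporal anchors in its answer.
    Assumes the frames were sampled uniformly at `fps` frames per second."""
    stamped = []
    for idx, frame in enumerate(frames):
        frame = frame.copy()
        draw = ImageDraw.Draw(frame)
        draw.text((8, 8), f"frame {idx} | t={idx / fps:.1f}s", fill="yellow")
        stamped.append(frame)
    return stamped

# Usage with dummy frames; in practice these would come from a video decoder.
frames = [Image.new("RGB", (224, 224)) for _ in range(8)]
annotated = stamp_frame_indices(frames, fps=1.0)
```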
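Several entries above report results on temporal grounding benchmarks, which are conventionally scored by the temporal IoU between predicted and ground-truth segments, for example Recall@1 at IoU thresholds of 0.5 or 0.7. The snippet below is a minimal sketch of that metric (function names are my own), not any specific paper's evaluation code.

```python
def temporal_iou(pred, gt):
    """IoU between two (start, end) segments given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_1(preds, gts, threshold=0.5):
    """Fraction of queries whose top-1 predicted segment meets the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)

# One of the two predictions reaches IoU >= 0.5, so R@1 = 0.5.
print(recall_at_1([(2.0, 7.5), (10.0, 12.0)], [(2.5, 8.0), (20.0, 25.0)]))
```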
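The decoding-paradigm study contrasts how a video LLM emits timestamps. The sketch below illustrates the continuous option under my own assumptions (module and dimension names are hypothetical): a small regression head maps a pooled video-query feature to a normalized (start, end) pair, whereas the discrete alternative simply decodes timestamps as text tokens. It is not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class ContinuousGroundingHead(nn.Module):
    """Hypothetical regression head: pooled video-query feature -> normalized
    (start, end) in [0, 1], later rescaled by the video duration."""
    def __init__(self, hidden_dim=768):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 2)
        )

    def forward(self, pooled):
        span = torch.sigmoid(self.proj(pooled))    # (batch, 2) in [0, 1]
        start = span.min(dim=-1).values            # enforce start <= end
        end = span.max(dim=-1).values
        return torch.stack([start, end], dim=-1)

# The discrete alternative has the LLM emit timestamps as text,
# e.g. "from 12.4s to 18.9s", parsed back to floats after generation.
head = ContinuousGroundingHead()
spans = head(torch.randn(4, 768)) * 60.0           # e.g. a 60-second clip
print(spans.shape)                                 # torch.Size([4, 2])
```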