Storyteller: Improving long video description through global audio-visual character identification. arXiv preprint arXiv:2411.07076, 2024.
2 Pith papers cite this work.
Fields: cs.CV (2). Years: 2026 (2). Verdicts: UNVERDICTED (2).
Citing papers:
- Task-Focused Memorization for Multimodal Agents
TaskMem uses RL in two phases to learn a task-focused memorization policy for multimodal agents, yielding 5.3-7.0% VQA accuracy gains on reformulated streaming benchmarks from VideoMME, EgoLife, and EgoTempo.
- Towards Effective Long-Video Event Prediction via Multi-Level Event Semantics Mining
VISTA mines multi-level event semantics via visual prompts, knowledge-enhanced retrieval, and proposal integration to improve long-video event prediction over existing LVLMs.