The paper proposes that reusable agent skills should incorporate visual elements alongside text, introduces three forms of visual skills and an automatic conversion system, and reports better performance on GUI and visual-centric tasks.
In: Proceedings of the AAAI Confer- ence on Artificial Intelligence
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
fields
cs.CV 2years
2026 2verdicts
UNVERDICTED 2representative citing papers
LVSum is a new benchmark for timestamp-aware long video summarization that exposes systematic temporal gaps in existing multimodal large language models.
citing papers explorer
-
Agent Skills Should Go Beyond Text: The Case for Visual Skills
The paper proposes that reusable agent skills should incorporate visual elements alongside text, introduces three forms of visual skills and an automatic conversion system, and reports better performance on GUI and visual-centric tasks.
-
LVSum: A Benchmark for Timestamp-Aware Long Video Summarization
LVSum is a new benchmark for timestamp-aware long video summarization that exposes systematic temporal gaps in existing multimodal large language models.