Learning video representations from large language models

Zhao, Y · 2022 · arXiv 2212.04501

3 Pith papers cite this work. Polarity classification is still being indexed.

representative citing papers

Bridging VideoQA and Video-Guided Agentic Tasks via Generalized Keyframe Extraction

cs.CV · 2026-06-28 · unverdicted · novelty 7.0

Introduces VG-GUIBench benchmark and TASKER keyframe extraction algorithm that improves performance on VideoQA and video-guided agentic tasks.

ChipSeek: Optimizing Verilog Generation via EDA-Integrated Reinforcement Learning

cs.AI · 2025-07-07 · unverdicted · novelty 6.0

ChipSeek is a hierarchical-reward reinforcement learning framework with Curriculum-Guided Dynamic Policy Optimization that integrates EDA simulator feedback to improve LLM-generated RTL code on both functional correctness and PPA metrics.

Agent AI: Surveying the Horizons of Multimodal Interaction

cs.AI · 2024-01-07 · unverdicted · novelty 4.0

The paper defines Agent AI as interactive multimodal systems that perceive grounded data and generate embodied actions, arguing this approach can mitigate hallucinations in foundation models.

citing papers explorer

Showing 3 of 3 citing papers.

Bridging VideoQA and Video-Guided Agentic Tasks via Generalized Keyframe Extraction

cs.CV · 2026-06-28 · unverdicted · none · ref 75

Introduces VG-GUIBench benchmark and TASKER keyframe extraction algorithm that improves performance on VideoQA and video-guided agentic tasks.

ChipSeek: Optimizing Verilog Generation via EDA-Integrated Reinforcement Learning

cs.AI · 2025-07-07 · unverdicted · none · ref 8

ChipSeek is a hierarchical-reward reinforcement learning framework with Curriculum-Guided Dynamic Policy Optimization that integrates EDA simulator feedback to improve LLM-generated RTL code on both functional correctness and PPA metrics.

Agent AI: Surveying the Horizons of Multimodal Interaction

cs.AI · 2024-01-07 · unverdicted · none · ref 213

The paper defines Agent AI as interactive multimodal systems that perceive grounded data and generate embodied actions, arguing this approach can mitigate hallucinations in foundation models.
