GLANCE: A Global-Local Coordination Multi-Agent Framework for Music-Grounded Non-Linear Video Editing
Pith reviewed 2026-05-10 19:08 UTC · model grok-4.3
The pith
GLANCE uses global-local coordination in a multi-agent system to create coherent music-grounded nonlinear video edits.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GLANCE adopts a bi-loop architecture: an outer loop performs long-horizon planning and task-graph construction, while an inner loop applies the Observe-Think-Act-Verify flow to segment-wise editing tasks and their refinement. To address the cross-segment and global conflicts that emerge after subtimeline composition, the authors introduce a dedicated global-local coordination mechanism with both preventive and corrective components: a newly designed context controller, a conflict region decomposition module, and a bottom-up dynamic negotiation mechanism.
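As a rough illustration, the bi-loop structure described in this claim might look like the following. This is a minimal sketch under stated assumptions, not the paper's implementation: every function name, the task-graph-as-sorted-list simplification, and the beat-alignment verify rule are hypothetical stand-ins.

```python
# Hypothetical sketch of the bi-loop architecture: an outer planning loop
# orders segment tasks, and an inner Observe-Think-Act-Verify loop edits each
# segment with bounded refinement. All names and rules here are illustrative.

def plan_task_graph(segments):
    """Outer loop (stand-in): reduce task-graph construction to a simple
    temporal ordering of segment-wise editing tasks."""
    return sorted(segments, key=lambda s: s["start"])

def observe(seg):
    return {"clip": seg["clip"], "beat": seg["beat"]}

def think(obs):
    # Toy policy: plan to cut exactly on the observed music beat.
    return {"cut_at": obs["beat"]}

def act(seg, plan):
    return {**seg, "cut": plan["cut_at"]}

def verify(seg):
    # Toy check: did the cut land on the beat?
    return seg["cut"] == seg["beat"]

def inner_loop(seg, max_retries=3):
    """Inner loop: Observe-Think-Act-Verify with bounded refinement."""
    edited = seg
    for _ in range(max_retries):
        edited = act(seg, think(observe(seg)))
        if verify(edited):
            break
    return edited

def glance_edit(segments):
    """Run the inner loop over the outer loop's plan."""
    return [inner_loop(s) for s in plan_task_graph(segments)]
```

In this toy version verification always succeeds on the first pass; the point is only the control flow, with the outer loop owning ordering and the inner loop owning per-segment refinement.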
What carries the argument
The global-local coordination mechanism, which includes a context controller, conflict region decomposition module, and bottom-up dynamic negotiation to resolve cross-segment and global conflicts after subtimeline composition.
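The corrective side of this mechanism can be pictured as a toy conflict pass over a composed timeline. Everything below is an assumption for illustration: the data shape, the adjacent-pair decomposition, and the priority-based trim rule are stand-ins, since the paper's actual negotiation protocol is not reproduced here.

```python
# Hypothetical sketch: decompose a composed timeline into conflict regions
# (overlapping adjacent segments) and resolve each locally, bottom-up, by
# trimming the lower-priority segment at the boundary. Illustrative only.

def find_conflicts(timeline):
    """Decompose the timeline into overlapping adjacent pairs (conflict regions)."""
    ordered = sorted(timeline, key=lambda s: s["start"])
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if a["end"] > b["start"]]

def negotiate(a, b):
    """Local, bottom-up fix: the lower-priority segment yields at the boundary."""
    if a["priority"] >= b["priority"]:
        b["start"] = a["end"]   # trim the later, lower-priority segment
    else:
        a["end"] = b["start"]   # trim the earlier segment instead

def coordinate(timeline):
    """Corrective pass: negotiate every detected conflict region in place."""
    for a, b in find_conflicts(timeline):
        negotiate(a, b)
    return sorted(timeline, key=lambda s: s["start"])
```

A real system would also need the preventive half (the context controller constraining edits before composition) and global checks that local trims do not introduce new inconsistencies, which is exactly the load-bearing premise questioned below.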
If this is right
- Generated videos achieve stronger alignment with music rhythm, user intent, story completeness, and long-range constraints.
- The system adapts more readily to diverse prompts and heterogeneous source video collections.
- MVEBench and the agent-as-a-judge framework enable scalable, multi-dimensional testing of editing methods.
- Performance improves consistently over fixed-pipeline and retrieval-based baselines under identical backbone models.
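MVEBench's factorized design (task type, prompt specificity, music length) can be pictured as a small difficulty grid; the axis values below are illustrative guesses, not the benchmark's real categories.

```python
# Hypothetical sketch of a factorized benchmark grid in the spirit of MVEBench.
# The axes come from the paper; every concrete value is an assumed placeholder.
from itertools import product

TASK_TYPES = ["mashup", "highlight"]          # assumed
PROMPT_SPECIFICITY = ["vague", "detailed"]    # assumed
MUSIC_LENGTHS = ["short", "long"]             # assumed

def difficulty_grid():
    """Enumerate one benchmark cell per combination of the three factors."""
    return [
        {"task": t, "prompt": p, "music": m}
        for t, p, m in product(TASK_TYPES, PROMPT_SPECIFICITY, MUSIC_LENGTHS)
    ]
```

Factorizing difficulty this way lets per-cell scores show where a method degrades (say, long music with vague prompts) rather than hiding it in an aggregate number.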
Where Pith is reading between the lines
- The same preventive-corrective coordination pattern could apply to other sequential composition tasks such as audio track assembly or slide-deck creation.
- Embedding the bi-loop structure into existing video tools might reduce the number of manual timeline adjustments needed in professional workflows.
- Scaling tests on longer music tracks or larger segment counts would show whether the negotiation step remains efficient.
Load-bearing premise
The global-local coordination mechanism resolves cross-segment and global conflicts after subtimeline composition without introducing new inconsistencies or degrading quality.
What would settle it
Final edited videos that still contain visible rhythm misalignments or narrative gaps, yielding lower quality scores than non-coordinated baselines in both automated metrics and human review.
Original abstract
Music-grounded mashup video creation is a challenging form of video non-linear editing, where a system must compose a coherent timeline from large collections of source videos while aligning with music rhythm, user intent, story completeness, and long-range structural constraints. Existing approaches typically rely on fixed pipelines or simplified retrieval-and-concatenation paradigms, limiting their ability to adapt to diverse prompts and heterogeneous source materials. In this paper, we present GLANCE, a global-local coordination multi-agent framework for music-grounded nonlinear video editing. GLANCE adopts a bi-loop architecture for better editing practice: an outer loop performs long-horizon planning and task-graph construction, and an inner loop adopts the "Observe-Think-Act-Verify" flow for segment-wise editing tasks and their refinements. To address the cross-segment and global conflict emerging after subtimelines composition, we introduce a dedicated global-local coordination mechanism with both preventive and corrective components, which includes a novelly designed context controller, conflict region decomposition module, and a bottom-up dynamic negotiation mechanism. To support rigorous evaluation, we construct MVEBench, a new benchmark that factorizes editing difficulty along task type, prompt specificity, and music length, and propose an agent-as-a-judge evaluation framework for scalable multi-dimensional assessment. Experimental results show that GLANCE consistently outperforms prior research baselines and open-source product baselines under the same backbone models. With GPT-4o-mini as the backbone, GLANCE improves over the strongest baseline by 33.2% and 15.6% on two task settings, respectively. Human evaluation further confirms the quality of the generated videos and validates the effectiveness of the proposed evaluation framework.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents GLANCE, a multi-agent framework for music-grounded non-linear video editing using a bi-loop architecture: an outer loop for long-horizon planning and task-graph construction, and an inner Observe-Think-Act-Verify loop for segment-wise editing and refinement. It introduces a global-local coordination mechanism (context controller, conflict region decomposition module, and bottom-up dynamic negotiation) to resolve cross-segment and global conflicts after subtimeline composition. The work also constructs MVEBench, a new benchmark factorizing difficulty by task type, prompt specificity, and music length, along with an agent-as-a-judge evaluation framework. Experimental results claim consistent outperformance over prior research and open-source baselines under identical backbones, including 33.2% and 15.6% gains with GPT-4o-mini on two task settings, corroborated by human evaluation.
Significance. If the performance claims hold after proper validation, this would be a meaningful contribution to multi-agent coordination for long-horizon creative tasks with complex constraints such as rhythm alignment and story coherence in video mashups. The bi-loop design and preventive/corrective coordination components address real challenges in scalable editing pipelines. The new benchmark and agent-judge method could aid future work, though their value depends on demonstrated robustness beyond the proposed system.
major comments (3)
- [§5 (Experimental Results)] The reported gains of 33.2% and 15.6% with GPT-4o-mini are presented as aggregate scores on MVEBench without ablation studies isolating the context controller, conflict region decomposition module, or bottom-up dynamic negotiation. This is load-bearing for the central claim that the global-local coordination resolves cross-segment conflicts without introducing new inconsistencies, as no conflict-rate metrics, before/after quality comparisons, or component-wise removals are provided.
- [§4.3 (Global-Local Coordination Mechanism)] The description of the bottom-up dynamic negotiation and conflict region decomposition does not include quantitative evidence (e.g., conflict resolution rates or quality degradation scores) showing these components are effective after subtimeline composition. Without such analysis, it remains unclear whether gains derive from the coordination or from the outer-loop planner and inner O-T-A-V loop alone.
- [Evaluation section] MVEBench and the agent-as-a-judge framework are defined within the paper; the manuscript provides only limited human validation of the judge and no external benchmarks or established metrics for comparison. This circularity weakens the reliability of the outperformance claims, as improvements may partly reflect alignment with the self-defined evaluation criteria.
minor comments (2)
- [Abstract] 'novelly designed' is non-idiomatic and should be corrected to 'newly designed'.
- [§3 (Method)] Notation: The O-T-A-V acronym is used without an initial full expansion in the main text, which may reduce clarity for readers unfamiliar with the flow.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment point by point below, providing the strongest honest defense of the manuscript while acknowledging areas where additional evidence or clarification is warranted. We will incorporate revisions as indicated.
Point-by-point responses
-
Referee: [§5 (Experimental Results)] The reported gains of 33.2% and 15.6% with GPT-4o-mini are presented as aggregate scores on MVEBench without ablation studies isolating the context controller, conflict region decomposition module, or bottom-up dynamic negotiation. This is load-bearing for the central claim that the global-local coordination resolves cross-segment conflicts without introducing new inconsistencies, as no conflict-rate metrics, before/after quality comparisons, or component-wise removals are provided.
Authors: We agree that component-wise ablations and conflict-specific metrics would strengthen the evidence for the global-local coordination mechanism. In the revised manuscript, we will add ablation studies that systematically remove the context controller, conflict region decomposition module, and bottom-up dynamic negotiation. We will report the resulting changes in aggregate scores, conflict resolution rates, and before/after quality comparisons on conflict regions using the agent-as-a-judge framework. This will directly address whether the coordination components contribute to resolving inconsistencies. revision: yes
-
Referee: [§4.3 (Global-Local Coordination Mechanism)] The description of the bottom-up dynamic negotiation and conflict region decomposition does not include quantitative evidence (e.g., conflict resolution rates or quality degradation scores) showing these components are effective after subtimeline composition. Without such analysis, it remains unclear whether gains derive from the coordination or from the outer-loop planner and inner O-T-A-V loop alone.
Authors: We acknowledge that quantitative metrics focused on these specific components after subtimeline composition would help isolate their impact. We will add new analysis in the revised version, including conflict resolution rates and quality degradation scores computed before and after the coordination steps. These will be presented alongside comparisons to the base bi-loop architecture to demonstrate that the reported gains are attributable to the preventive and corrective coordination mechanisms. revision: yes
-
Referee: [Evaluation section] MVEBench and the agent-as-a-judge framework are defined within the paper; the manuscript provides only limited human validation of the judge and no external benchmarks or established metrics for comparison. This circularity weakens the reliability of the outperformance claims, as improvements may partly reflect alignment with the self-defined evaluation criteria.
Authors: The manuscript already reports human evaluation results validating both the generated videos and the agent-as-a-judge framework. To mitigate concerns about circularity, we will expand the evaluation section in the revision with further details on the human study protocol, agreement statistics, and correlation analysis between agent and human judgments. We will also explicitly discuss the absence of prior established benchmarks for this task as a limitation while noting that the new benchmark and human validation provide a necessary foundation for the field. revision: partial
Circularity Check
No significant circularity in framework design or evaluation claims
full rationale
The paper presents an empirical multi-agent framework with a new benchmark (MVEBench) and agent-as-a-judge protocol, both introduced in the work. Performance gains (e.g., 33.2% and 15.6% over baselines with GPT-4o-mini) are reported via direct comparison on this benchmark under identical backbones. No mathematical derivation chain, equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The evaluation setup, while internal, applies uniformly to baselines and does not reduce any core claim to a self-definition or construction. This matches the common case of a self-contained empirical contribution without circular reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLM-based agents can reliably follow complex multi-step workflows, including observation, planning, action, and verification, for video editing tasks
invented entities (3)
- Context controller: no independent evidence
- Conflict region decomposition module: no independent evidence
- Bottom-up dynamic negotiation mechanism: no independent evidence