pith. machine review for the scientific record.

arxiv: 2604.11102 · v1 · submitted 2026-04-13 · 💻 cs.CV · cs.MM

Recognition: unknown

OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video

Junfu Pu, Teng Wang, Ying Shan, Yuxin Chen


Pith reviewed 2026-05-10 15:31 UTC · model grok-4.3

classification 💻 cs.CV cs.MM
keywords video-to-script generation · long-form video understanding · omni-modal language model · cinematic script generation · temporally-aware evaluation · chain-of-thought fine-tuning · reinforcement learning

The pith

An 8B-parameter audio-visual model generates detailed hierarchical scripts from long-form cinematic videos at levels matching proprietary systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the video-to-script task, which requires turning long cinematic videos into scene-by-scene scripts that capture character actions, dialogues, expressions, and audio cues with precise timing. To enable progress on this task, the authors release a human-annotated benchmark and an evaluation framework that scores both temporal placement and multi-aspect semantic correctness. They then present OmniScript, an 8B omni-modal model trained in two stages: chain-of-thought fine-tuning to build plot and character reasoning, followed by reinforcement learning whose rewards are computed over specific time segments. Experiments show this smaller model exceeds other open-source systems and reaches parity with closed models such as Gemini 3-Pro on localization and accuracy metrics. The work therefore tests whether careful staged training on combined audio-visual signals can deliver strong narrative understanding without massive scale.
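The RL stage is described only as giving "rewards based on specific time segments." A minimal sketch of what such a reward could look like, assuming an IoU-gated per-segment score — the paper does not give its formula, and the semantic score is stubbed here as exact string match rather than the LLM judge the paper would use:

```python
# Hedged sketch of a "temporally segmented" reward: a predicted event
# only earns its semantic score if it lands in the right time segment.
# This is an illustrative assumption, not the authors' method.

def interval_iou(a, b):
    """Intersection-over-union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def segmented_reward(pred_events, gt_events, iou_threshold=0.5):
    """Average per-segment reward over ground-truth events."""
    if not gt_events:
        return 0.0
    total = 0.0
    for gt in gt_events:
        best = 0.0
        for pred in pred_events:
            iou = interval_iou(pred["span"], gt["span"])
            if iou >= iou_threshold:
                # stand-in for a semantic judge: exact match on the action field
                sem = 1.0 if pred["action"] == gt["action"] else 0.0
                best = max(best, iou * sem)
        total += best
    return total / len(gt_events)

gt = [{"span": (0.0, 5.0), "action": "opens door"},
     {"span": (5.0, 9.0), "action": "sits down"}]
pred = [{"span": (0.5, 5.0), "action": "opens door"},
        {"span": (20.0, 25.0), "action": "sits down"}]  # right text, wrong segment
reward = segmented_reward(pred, gt)  # → 0.45
```

Under this reading, the second prediction earns nothing despite matching the ground-truth text, which is exactly the localization pressure the staged training is meant to apply.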

Core claim

OmniScript is an 8B-parameter omni-modal language model tailored for the video-to-script task on long-form cinematic videos. It is trained via a progressive pipeline that first applies chain-of-thought supervised fine-tuning for plot and character reasoning and then performs reinforcement learning using temporally segmented rewards. Despite its parameter efficiency, OmniScript significantly outperforms larger open-source models and achieves performance comparable to state-of-the-art proprietary models, including Gemini 3-Pro, in both temporal localization and multi-field semantic accuracy.

What carries the argument

The 8B omni-modal model trained through chain-of-thought supervised fine-tuning followed by reinforcement learning on temporally segmented rewards.

If this is right

  • Smaller open-source models become viable for detailed long-video narrative tasks when audio-visual inputs and staged training are used.
  • Hierarchical script output supplies machine-readable breakdowns of actions, dialogue, and timing that go beyond flat captions.
  • Temporally segmented rewards improve event localization inside extended video sequences.
  • Comparable results to larger proprietary systems suggest deployment of script-generation tools is possible with modest compute.
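The hierarchical script output named in the bullets above can be made concrete as nested records. The schema below is an illustrative assumption built from the abstract's field list (actions, dialogues, expressions, audio cues, timing); the benchmark's actual format may differ. The at-least-one-field rule mirrors a constraint stated in the paper's appendix prompts:

```python
# Illustrative V2S script schema, assuming the field set described in
# the abstract. Field names and structure are hypothetical.
from dataclasses import dataclass, field
from typing import Optional, List

@dataclass
class Event:
    start: float                    # seconds, assumed relative to video start
    end: float
    action: Optional[str] = None
    dialogue: Optional[str] = None  # speaker plus line
    expression: Optional[str] = None
    audio_cue: Optional[str] = None

    def __post_init__(self):
        # the paper's appendix prompts require at least one non-empty field
        if not any([self.action, self.dialogue, self.expression, self.audio_cue]):
            raise ValueError("event must carry at least one field")

@dataclass
class Scene:
    index: int
    location: str
    time_period: str                # e.g. "Night"
    events: List[Event] = field(default_factory=list)

scene = Scene(index=1, location="Hospital corridor", time_period="Night",
              events=[Event(12.0, 15.5, action="nurse runs past",
                            audio_cue="alarm blaring")])
```

The point of the nesting is that downstream tools get machine-readable fields per event, rather than a flat caption per clip.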

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Automated script generation could serve as a starting point for film pre-production workflows that currently rely on manual scene breakdowns.
  • The same training approach might extend to other long-sequence understanding problems such as multi-hour lecture or sports video analysis.
  • Integration with existing video editing platforms could allow AI to suggest cuts or audio adjustments based on generated scripts.

Load-bearing premise

The human-annotated benchmark and temporally-aware hierarchical evaluation framework give an unbiased and comprehensive measure of script generation quality for long-form cinematic videos.

What would settle it

Independent human evaluation or testing on a separate long-form video collection where OmniScript shows clear drops in temporal accuracy or semantic completeness compared with Gemini 3-Pro.

Original abstract

Current multimodal large language models (MLLMs) have demonstrated remarkable capabilities in short-form video understanding, yet translating long-form cinematic videos into detailed, temporally grounded scripts remains a significant challenge. This paper introduces the novel video-to-script (V2S) task, aiming to generate hierarchical, scene-by-scene scripts encompassing character actions, dialogues, expressions, and audio cues. To facilitate this, we construct a first-of-its-kind human-annotated benchmark and propose a temporally-aware hierarchical evaluation framework. Furthermore, we present OmniScript, an 8B-parameter omni-modal (audio-visual) language model tailored for long-form narrative comprehension. OmniScript is trained via a progressive pipeline that leverages chain-of-thought supervised fine-tuning for plot and character reasoning, followed by reinforcement learning using temporally segmented rewards. Extensive experiments demonstrate that despite its parameter efficiency, OmniScript significantly outperforms larger open-source models and achieves performance comparable to state-of-the-art proprietary models, including Gemini 3-Pro, in both temporal localization and multi-field semantic accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity check, and an axiom & free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces the video-to-script (V2S) task for generating hierarchical, scene-by-scene scripts from long-form cinematic videos that include actions, dialogues, expressions, and audio cues. It constructs a new human-annotated benchmark, proposes a temporally-aware hierarchical evaluation framework, and presents OmniScript, an 8B-parameter omni-modal model trained via chain-of-thought supervised fine-tuning for plot/character reasoning followed by reinforcement learning with temporally segmented rewards. Experiments claim that OmniScript outperforms larger open-source models and matches proprietary SOTA models such as Gemini 3-Pro on temporal localization and multi-field semantic accuracy.

Significance. If the performance claims hold under independent scrutiny, this would be a notable contribution to long-form multimodal understanding, demonstrating that a relatively small 8B model can approach proprietary frontier performance on a complex narrative task. The introduction of the V2S benchmark and evaluation framework could help standardize assessment of script generation quality, and the progressive training pipeline (CoT SFT + RL) offers a concrete recipe that other researchers could adapt. Parameter efficiency is a practical strength worth highlighting.

major comments (2)
  1. [§4] §4 (Benchmark and Evaluation Framework): The central performance claims rest on a newly introduced human-annotated V2S benchmark and author-defined temporally-aware hierarchical metrics (scene-by-scene decomposition into actions/dialogue/expressions/audio plus temporal segmentation). The training pipeline (CoT SFT for plot/character reasoning + RL with temporally segmented rewards) directly mirrors this structure. This alignment creates a risk that reported gains reflect optimization to the in-house annotation and scoring rules rather than broader generalization; an external benchmark or third-party re-annotation is needed to substantiate the claim of matching Gemini 3-Pro.
  2. [§6] §6 (Experiments): The abstract and results sections assert that OmniScript 'significantly outperforms larger open-source models' and achieves 'performance comparable to Gemini 3-Pro' on both temporal localization and semantic accuracy. However, without reporting results on established external video-understanding benchmarks (e.g., ActivityNet, MovieNet, or standard VQA suites) or providing inter-annotator agreement statistics and release details for the V2S benchmark, it is difficult to rule out benchmark-specific overfitting as the source of the gains.
minor comments (2)
  1. [§4.1] The abstract states 'extensive experiments' but the manuscript should explicitly state the number of videos, total duration, and annotation protocol (including how many annotators per video) in §4.1 to allow reproducibility assessment.
  2. [§4.3] Notation for the hierarchical evaluation scores (e.g., how temporal localization precision is aggregated across fields) could be clarified with a small example in §4.3 or an appendix table.
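The small example the second minor comment asks for could take this shape. Assuming (since the paper's aggregation rule is not reproduced here) that temporally matched event pairs are scored per semantic field and the field scores are macro-averaged, with exact match again standing in for the LLM judge:

```python
# Hedged sketch of one plausible aggregation for the hierarchical
# metric: per-field accuracy over temporally matched event pairs,
# macro-averaged. The paper's actual rule may differ.

FIELDS = ("action", "dialogue", "expression", "audio_cue")

def aggregate(matched_pairs):
    """matched_pairs: list of (pred_event, gt_event) dicts already
    paired by temporal overlap. Returns per-field and macro scores."""
    per_field = {}
    for f in FIELDS:
        # only score a field on pairs where the ground truth defines it
        scored = [(p, g) for p, g in matched_pairs if g.get(f) is not None]
        if not scored:
            continue
        hits = sum(1 for p, g in scored if p.get(f) == g.get(f))
        per_field[f] = hits / len(scored)
    macro = sum(per_field.values()) / len(per_field) if per_field else 0.0
    return per_field, macro

pairs = [
    ({"action": "opens door", "dialogue": "Hello"},
     {"action": "opens door", "dialogue": "Hello there"}),
    ({"action": "sits down"},
     {"action": "sits down"}),
]
per_field, macro = aggregate(pairs)  # action 1.0, dialogue 0.0, macro 0.5
```

Making the aggregation explicit like this would also surface edge cases the referee flags, such as fields absent from the ground truth.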

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment below with clarifications on our benchmark design, evaluation choices, and planned revisions. While we maintain that the V2S task and OmniScript represent a meaningful advance in long-form multimodal narrative understanding, we acknowledge the need for greater transparency on annotation quality and benchmark accessibility.

Point-by-point responses
  1. Referee: [§4] §4 (Benchmark and Evaluation Framework): The central performance claims rest on a newly introduced human-annotated V2S benchmark and author-defined temporally-aware hierarchical metrics (scene-by-scene decomposition into actions/dialogue/expressions/audio plus temporal segmentation). The training pipeline (CoT SFT for plot/character reasoning + RL with temporally segmented rewards) directly mirrors this structure. This alignment creates a risk that reported gains reflect optimization to the in-house annotation and scoring rules rather than broader generalization; an external benchmark or third-party re-annotation is needed to substantiate the claim of matching Gemini 3-Pro.

    Authors: We agree that the structural alignment between the new benchmark, metrics, and training pipeline warrants scrutiny for potential overfitting. The V2S task is novel—no prior benchmark existed for hierarchical, scene-by-scene script generation from long-form cinematic videos that jointly models actions, dialogues, expressions, and audio cues. Our human-annotated dataset was created specifically to enable this task. To strengthen evidence of annotation quality, we will add inter-annotator agreement statistics to the revised manuscript. We also commit to publicly releasing the full V2S benchmark (videos, annotations, and guidelines) upon acceptance, enabling independent verification and third-party re-annotation. While an external benchmark for this exact task is unavailable, the progressive training pipeline and consistent performance across video genres and lengths in our benchmark provide supporting evidence for generalization beyond in-house rules. revision: yes

  2. Referee: [§6] §6 (Experiments): The abstract and results sections assert that OmniScript 'significantly outperforms larger open-source models' and achieves 'performance comparable to Gemini 3-Pro' on both temporal localization and semantic accuracy. However, without reporting results on established external video-understanding benchmarks (e.g., ActivityNet, MovieNet, or standard VQA suites) or providing inter-annotator agreement statistics and release details for the V2S benchmark, it is difficult to rule out benchmark-specific overfitting as the source of the gains.

    Authors: We acknowledge that results on established benchmarks such as ActivityNet or MovieNet would offer useful context. However, these datasets target action recognition, temporal localization of actions, or general video QA and do not evaluate the core V2S requirements: hierarchical script generation with explicit multi-field semantics (actions/dialogues/expressions/audio) and precise temporal segmentation into scenes. Direct comparison is therefore not meaningful, as those benchmarks lack the narrative script output format. In the revision we will add a dedicated discussion clarifying this mismatch and include the requested inter-annotator agreement statistics plus explicit benchmark release commitments. These additions should help demonstrate that performance gains are tied to the new task rather than overfitting. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical claims rest on external model comparisons and human annotations without definitional reduction

Full rationale

The paper introduces the V2S task, a new human-annotated benchmark, a temporally-aware hierarchical evaluation framework, and the OmniScript model trained via CoT SFT followed by RL with temporally segmented rewards. Performance claims compare the 8B model against larger open-source and proprietary models (e.g., Gemini 3-Pro) on temporal localization and semantic accuracy. No equations, self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The training rewards align thematically with the evaluation framework, but this is standard for new-task papers and does not reduce the reported outperformance to an input by construction. The derivation chain is self-contained as an empirical contribution against external baselines.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work relies on standard techniques in multimodal learning and reinforcement learning applied to a new domain. No additional free parameters or invented entities are specified in the abstract.

axioms (1)
  • domain assumption Multimodal large language models can be progressively trained using chain-of-thought supervised fine-tuning for reasoning followed by reinforcement learning with temporally segmented rewards to improve long-form comprehension.
    This is the core training pipeline assumed to work for the task.
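The two-stage recipe this axiom assumes reduces to the following skeleton. Everything below is a stand-in for illustration — the paper's actual model, optimizer, RL algorithm, and reward model are not specified here:

```python
# Minimal two-stage skeleton: CoT SFT, then RL with a segment-level
# reward. StubModel returns canned outputs so both stages run end to end.

class StubModel:
    """Toy stand-in for an 8B omni-modal LM."""
    def nll(self, inputs, target):
        return float(len(target))      # pretend supervised loss
    def sample(self, inputs):
        return "scene 1 ...", -2.3     # (sampled script, log-prob)

def sft_step(model, batch):
    # Stage 1: chain-of-thought SFT on reasoning traces plus the script
    return model.nll(batch["video_audio"], batch["cot_script"])

def rl_step(model, batch, reward_fn):
    # Stage 2: sample a script, score it with a temporally segmented
    # reward, and use the scalar as a REINFORCE-style policy gradient
    script, logprob = model.sample(batch["video_audio"])
    reward = reward_fn(script, batch["gt_events"])
    return -(reward * logprob)         # minimize negative expected reward

model = StubModel()
batch = {"video_audio": None, "cot_script": "abc", "gt_events": []}
loss1 = sft_step(model, batch)                    # 3.0
loss2 = rl_step(model, batch, lambda s, g: 1.0)   # 2.3
```

The axiom is precisely that chaining these two phases yields long-form comprehension that neither phase delivers alone.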

pith-pipeline@v0.9.0 · 5482 in / 1399 out tokens · 86321 ms · 2026-05-10T15:31:26.062971+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Omni-DeepSearch: A Benchmark for Audio-Driven Omni-Modal Deep Search

    cs.SD 2026-05 unverdicted novelty 8.0

    Omni-DeepSearch is a 640-sample benchmark for audio-driven omni-modal search where the best model reaches only 43.44% accuracy, exposing bottlenecks in audio inference, tool use, and cross-modal reasoning.

Reference graph

Works this paper leans on

59 extracted references · 18 canonical work pages · cited by 1 Pith paper · 8 internal anchors

  1. [1]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

  2. [2]

    Condensed movies: Story based retrieval with contextual embeddings

    Max Bain, Arsha Nagrani, Andrew Brown, and Andrew Zisserman. Condensed movies: Story based retrieval with contextual embeddings. InProceedings of the Asian Conference on Computer Vision, 2020

  3. [3]

    Seed1.8 Model Card: Towards Generalized Real-World Agency

    ByteDance Seed Team. Seed 1.8 model card.https://arxiv.org/pdf/2603.20633, 2026

  4. [4]

    Seed 2.0 model card

    ByteDance Seed Team. Seed 2.0 model card. https://lf3-static.bytednsdoc.com/obj/eden-cn/lapzild-tss/ ljhwZthlaukjlkulzlp/seed2/0214/Seed2.0%20Model%20Card.pdf, 2026

  5. [5]

Avocado: An audiovisual video captioner driven by temporal orchestration

    Xinlong Chen, Yue Ding, Weihong Lin, Jingyun Hua, Linli Yao, Yang Shi, Bozhou Li, Yuanxing Zhang, Qiang Liu, Pengfei Wan, et al. Avocado: An audiovisual video captioner driven by temporal orchestration.arXiv preprint arXiv:2510.10395, 2025

  6. [6]

    Diadem: Advancing dialogue descriptions in audiovisual video captioning for multimodal large language models, 2026

    Xinlong Chen, Weihong Lin, Jingyun Hua, Linli Yao, Yue Ding, Bozhou Li, Bohan Zeng, Yang Shi, Qiang Liu, Yuanxing Zhang, et al. Diadem: Advancing dialogue descriptions in audiovisual video captioning for multimodal large language models.arXiv preprint arXiv:2601.19267, 2026

  7. [7]

    Emerging Properties in Unified Multimodal Pretraining

    Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025

  8. [8]

Arc-Hunyuan-Video-7B: Structured video comprehension of real-world shorts

    Yuying Ge, Yixiao Ge, Chen Li, Teng Wang, Junfu Pu, Yizhuo Li, Lu Qiu, Jin Ma, Lisheng Duan, Xinyu Zuo, et al. Arc-hunyuan-video-7b: Structured video comprehension of real-world shorts.arXiv preprint arXiv:2507.20939, 2025

  9. [9]

    Longvale: Vision- audio-language-event benchmark towards time-aware omni-modal perception of long videos

    Tiantian Geng, Jinrui Zhang, Qingni Wang, Teng Wang, Jinming Duan, and Feng Zheng. Longvale: Vision- audio-language-event benchmark towards time-aware omni-modal perception of long videos. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 18959–18969, 2025

  10. [10]

Gemini 2.5 Flash model card

    Google DeepMind. Gemini 2.5 flash model card.https://storage.googleapis.com/deepmind-media/Model-Cards/ Gemini-2-5-Flash-Model-Card.pdf, 2025

  11. [11]

Gemini 2.5 Pro model card

    Google DeepMind. Gemini 2.5 pro model card.https://storage.googleapis.com/deepmind-media/Model-Cards/ Gemini-2-5-Pro-Model-Card.pdf, 2025

  12. [12]

Gemini 3 Flash model card

    Google DeepMind. Gemini 3 flash model card.https://storage.googleapis.com/deepmind-media/Model-Cards/ Gemini-3-Flash-Model-Card.pdf, 2025

  13. [13]

Gemini 3 Pro model card

    Google DeepMind. Gemini 3 pro model card.https://storage.googleapis.com/deepmind-media/Model-Cards/ Gemini-3-Pro-Model-Card.pdf, 2025

  14. [14]

    Ava: A video dataset of spatio-temporally localized atomic visual actions

    Chunhui Gu, Chen Sun, David A Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijaya- narasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, et al. Ava: A video dataset of spatio-temporally localized atomic visual actions. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 6047–6056, 2018

  15. [15]

    Movienet: A holistic dataset for movie understanding

    Qingqiu Huang, Yu Xiong, Anyi Rao, Jiaze Wang, and Dahua Lin. Movienet: A holistic dataset for movie understanding. In European conference on computer vision, pages 709–727. Springer, 2020

  16. [16]

    Dense-captioning events in videos

    Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-captioning events in videos. In Proceedings of the IEEE international conference on computer vision, pages 706–715, 2017

  17. [17]

PTVD: A large-scale plot-oriented multimodal dataset based on television dramas

    Chen Li, Xutan Peng, Teng Wang, Yixiao Ge, Mengyang Liu, Xuyuan Xu, Yexin Wang, and Ying Shan. Ptvd: A large-scale plot-oriented multimodal dataset based on television dramas.arXiv preprint arXiv:2306.14644, 2023

  18. [18]

    Rouge: A package for automatic evaluation of summaries

    Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. InText summarization branches out, pages 74–81, 2004

  19. [19]

Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

  20. [20]

    Bleu: a method for automatic evaluation of machine translation

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. InProceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002

  21. [21]

Arc-Chapter: Structuring hour-long videos into navigable chapters and hierarchical summaries

    Junfu Pu, Teng Wang, Yixiao Ge, Yuying Ge, Chen Li, and Ying Shan. Arc-chapter: Structuring hour-long videos into navigable chapters and hierarchical summaries.arXiv preprint arXiv:2511.14349, 2025

  22. [22]

    Robust speech recognition via large-scale weak supervision

    Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. InInternational conference on machine learning, pages 28492–28518. PMLR, 2023

  23. [23]

    A dataset for movie description

    Anna Rohrbach, Marcus Rohrbach, Niket Tandon, and Bernt Schiele. A dataset for movie description. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3202–3212, 2015

  24. [24]

Movie description

    Anna Rohrbach, Atousa Torabi, Marcus Rohrbach, Niket Tandon, Christopher Pal, Hugo Larochelle, Aaron Courville, and Bernt Schiele. Movie description.International Journal of Computer Vision, 123(1):94–120, 2017

  25. [25]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  26. [26]

LongVU: Spatiotemporal adaptive compression for long video-language understanding

    Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, et al. Longvu: Spatiotemporal adaptive compression for long video-language understanding. arXiv preprint arXiv:2410.17434, 2024

  27. [27]

    Mad: A scalable dataset for language grounding in videos from movie audio descriptions

    Mattia Soldan, Alejandro Pardo, Juan León Alcázar, Fabian Caba, Chen Zhao, Silvio Giancola, and Bernard Ghanem. Mad: A scalable dataset for language grounding in videos from movie audio descriptions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5026–5035, 2022

  28. [28]

video-SALMONN 2: Caption-enhanced audio-visual large language models

    Changli Tang, Yixuan Li, Yudong Yang, Jimin Zhuang, Guangzhi Sun, Wei Li, Zejun Ma, and Chao Zhang. video-salmonn 2: Caption-enhanced audio-visual large language models.arXiv preprint arXiv:2506.15220, 2025

  29. [29]

    Movieqa: Understanding stories in movies through question-answering

    Makarand Tapaswi, Yukun Zhu, Rainer Stiefelhagen, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. Movieqa: Understanding stories in movies through question-answering. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 4631–4640, 2016

  30. [30]

Using descriptive video services to create a large data source for video annotation research

    Atousa Torabi, Christopher Pal, Hugo Larochelle, and Aaron Courville. Using descriptive video services to create a large data source for video annotation research.arXiv preprint arXiv:1503.01070, 2015

  31. [31]

    Moviegraphs: Towards understanding human- centric situations from videos

    Paul Vicol, Makarand Tapaswi, Lluis Castrejon, and Sanja Fidler. Moviegraphs: Towards understanding human- centric situations from videos. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 8581–8590, 2018

  32. [32]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025

  33. [33]

    Qwen3-Omni Technical Report

    Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, et al. Qwen3-omni technical report.arXiv preprint arXiv:2509.17765, 2025

  34. [34]

    Timechat-captioner: Scripting multi-scene videos with time-aware and structural audio-visual captions

    Linli Yao, Yuancheng Wei, Yaojie Zhang, Lei Li, Xinlong Chen, Feifan Song, Ziyue Wang, Kun Ouyang, Yuanxin Liu, Lingpeng Kong, et al. Timechat-captioner: Scripting multi-scene videos with time-aware and structural audio-visual captions. arXiv preprint arXiv:2602.08711, 2026

  35. [35]

MiniCPM-V 4.5: Cooking efficient MLLMs via architecture, data, and training recipe

    Tianyu Yu, Zefan Wang, Chongyi Wang, Fuwei Huang, Wenshuo Ma, Zhihui He, Tianchi Cai, Weize Chen, Yuxiang Huang, Yuanqian Zhao, et al. Minicpm-v 4.5: Cooking efficient mllms via architecture, data, and training recipe. arXiv preprint arXiv:2509.18154, 2025

  36. [36]

    Movie101: A new movie understanding benchmark

    Zihao Yue, Qi Zhang, Anwen Hu, Liang Zhang, Ziheng Wang, and Qin Jin. Movie101: A new movie understanding benchmark. InProceedings of the 61st AnnualMeeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4669–4684, 2023

  37. [37]

    Movie101v2: Improved movie narration benchmark

    Zihao Yue, Yepeng Zhang, Ziheng Wang, and Qin Jin. Movie101v2: Improved movie narration benchmark. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 17081–17095, 2025

  38. [38]

    LLaVA-Video: Video Instruction Tuning With Synthetic Data

    Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Llava-video: Video instruction tuning with synthetic data.arXiv preprint arXiv:2410.02713, 2024

  39. [39]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025

Entries [40]–[59] of the extraction are fragments of the paper's Appendix A prompt text (scene annotation, character tracking, script structuring, and LLM-judge evaluation prompts), not references.