Recognition: no theorem link
LumiVideo: An Intelligent Agentic System for Video Color Grading
Pith reviewed 2026-05-13 21:33 UTC · model grok-4.3
The pith
LumiVideo is an agentic AI system that autonomously color-grades raw log video to near human-expert quality while supporting natural-language refinements.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LumiVideo mimics the cognitive workflow of professional colorists through four stages: Perception, Reasoning, Execution, and Reflection. Given only raw log video, it autonomously produces a cinematic base grade by analyzing physical lighting and semantic content. Its Reasoning engine combines an LLM's internalized cinematic knowledge with Retrieval-Augmented Generation via Tree of Thoughts search to navigate color parameters. The system compiles these into ASC-CDL configurations and a globally consistent 3D LUT rather than editing pixels directly, analytically ensuring temporal consistency. An optional Reflection loop permits refinement through natural language feedback.
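The four-stage loop described above can be sketched as a simple orchestration skeleton. All function names here (`analyze_scene`, `reason_parameters`, `compile_lut`, `critique`) are hypothetical placeholders standing in for the paper's stages, not its actual API:

```python
from dataclasses import dataclass

@dataclass
class SceneAnalysis:
    lighting: str   # e.g. "low-key tungsten"
    semantics: str  # e.g. "night interior, two subjects"

def grade(video_frames, analyze_scene, reason_parameters, compile_lut,
          critique, max_rounds=3):
    """Perception -> Reasoning -> Execution -> optional Reflection."""
    analysis = analyze_scene(video_frames)      # Perception
    params = reason_parameters(analysis)        # Reasoning (LLM + RAG + ToT)
    lut = compile_lut(params)                   # Execution: CDL -> global 3D LUT
    for _ in range(max_rounds):                 # Reflection (optional)
        feedback = critique(lut, video_frames)
        if feedback is None:
            break
        params = reason_parameters(analysis, feedback)
        lut = compile_lut(params)
    return lut
```

Note that the loop refines parameters, never pixels: each round recompiles one global LUT, which is what the paper's temporal-consistency argument rests on.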
What carries the argument
The Reasoning engine that combines LLM cinematic knowledge with RAG and Tree of Thoughts search to select color parameters from scene analysis.
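A beam-style Tree-of-Thoughts search over candidate parameter sets might look like the following sketch. Here `propose` and `score` stand in for the LLM/RAG components and are assumptions on our part, not the paper's implementation:

```python
def tot_search(initial_params, propose, score, depth=2, beam=3):
    """Beam-style Tree-of-Thoughts sketch: at each level, expand every
    surviving candidate into variant parameter sets (propose), rate them
    (score, e.g. an LLM judge or aesthetic model), keep the top `beam`."""
    frontier = [initial_params]
    for _ in range(depth):
        children = [c for p in frontier for c in propose(p)]
        frontier = sorted(children, key=score, reverse=True)[:beam]
    return max(frontier, key=score)
```

The point of the tree structure is that a single greedy LLM guess can land in a poor basin of the non-linear parameter space; branching and pruning lets weaker intermediate candidates survive long enough to reach better final grades.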
If this is right
- Automated grading produces temporally consistent results across an entire clip without per-frame manual corrections.
- Color adjustments become interpretable parameters instead of opaque pixel changes.
- Creators can direct refinements through natural language instructions rather than technical sliders.
- Standard ASC-CDL and 3D LUT outputs integrate directly with existing professional software pipelines.
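For readers unfamiliar with the format, the ASC-CDL transform itself is small: a per-channel slope/offset/power curve followed by a global saturation mix. A minimal sketch, using the standard Rec.709 luma weights (clamping to [0, 1] before the power curve is one common convention):

```python
def apply_cdl(rgb, slope, offset, power, saturation=1.0):
    """Apply an ASC-CDL style transform to one RGB triple in [0, 1].
    Per channel: out = clamp(in * slope + offset) ** power,
    then a global saturation mix against Rec.709 luma."""
    graded = []
    for c, s, o, p in zip(rgb, slope, offset, power):
        v = min(max(c * s + o, 0.0), 1.0)  # clamp before the power curve
        graded.append(v ** p)
    r, g, b = graded
    luma = 0.2126 * r + 0.7152 * g + 0.0722 * b
    return tuple(luma + saturation * (c - luma) for c in graded)
```

Because the whole grade reduces to these few numbers per clip, the parameters are human-readable and portable to any tool that speaks CDL.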
Where Pith is reading between the lines
- The same agentic structure could be adapted to other video post-production steps such as exposure matching or shot balancing.
- Widespread use might lower the barrier for non-experts to achieve broadcast-quality grades on independent projects.
- The benchmark LumiGrade could serve as a shared testbed for comparing future automated grading methods.
- Combining the system with real-time video capture tools might enable on-set preview grading during production.
Load-bearing premise
An LLM's internalized cinematic knowledge, combined with RAG and Tree of Thoughts search, can reliably navigate the non-linear color parameter space to produce high-quality, temporally consistent grades from raw log video.
What would settle it
The claim would fail if professional colorists rated the system's fully automatic grades substantially below human-expert grades on the LumiGrade benchmark videos, or if visible temporal inconsistencies appeared in the output.
read the original abstract
Video color grading is a critical post-production process that transforms flat, log-encoded raw footage into emotionally resonant cinematic visuals. Existing automated methods act as static, black-box executors that directly output edited pixels, lacking both interpretability and the iterative control required by professionals. We introduce LumiVideo, an agentic system that mimics the cognitive workflow of professional colorists through four stages: Perception, Reasoning, Execution, and Reflection. Given only raw log video, LumiVideo autonomously produces a cinematic base grade by analyzing the scene's physical lighting and semantic content. Its Reasoning engine synergizes an LLM's internalized cinematic knowledge with a Retrieval-Augmented Generation (RAG) framework via a Tree of Thoughts (ToT) search to navigate the non-linear color parameter space. Rather than generating pixels, the system compiles the deduced parameters into industry-standard ASC-CDL configurations and a globally consistent 3D LUT, analytically guaranteeing temporal consistency. An optional Reflection loop then allows creators to refine the result via natural language feedback. We further introduce LumiGrade, the first log-encoded video benchmark for evaluating automated grading. Experiments show that LumiVideo approaches human expert quality in fully automatic mode while enabling precise iterative control when directed.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces LumiVideo, an agentic system for automated video color grading that processes raw log footage via a Perception-Reasoning-Execution-Reflection pipeline. The Reasoning stage combines an LLM with RAG and Tree of Thoughts search to derive ASC-CDL parameters and a globally consistent 3D LUT; the system claims to produce temporally consistent cinematic grades that approach human expert quality on a newly introduced LumiGrade benchmark while supporting natural-language iterative refinement.
Significance. If the performance claims hold, the work would offer a meaningful step toward interpretable, controllable AI tools in film post-production that emulate professional colorist workflows rather than acting as opaque pixel transformers, with the analytical LUT guarantee providing a clean solution to temporal consistency.
major comments (3)
- [Experiments] Experiments section: the abstract and introduction assert that LumiVideo approaches human expert quality on LumiGrade, yet no quantitative results (expert preference scores, CIEDE2000, temporal flicker metrics, or statistical tests) are supplied, nor are any baselines (LUT-only, direct regression, commercial auto-graders) or ablations (removing ToT or RAG) reported. This leaves the central performance claim unsupported.
- [Reasoning Engine] Reasoning engine description (Section 3.2): the claim that ToT search reliably navigates the non-linear color-parameter space rests on an unverified mapping from scene semantics to ASC-CDL values; without an ablation comparing ToT against direct LLM prompting or simple RAG retrieval, the contribution of the search strategy cannot be assessed.
- [Benchmark] LumiGrade benchmark introduction: the manuscript provides no details on dataset composition (number of scenes, duration, log-encoding format, source cameras), how expert ground-truth grades were collected, or inter-expert agreement statistics, rendering the benchmark unusable for independent verification of the reported results.
minor comments (2)
- [Figures] Ensure all figures showing graded frames include side-by-side comparisons with expert grades and raw input, with captions stating the exact ASC-CDL parameters used.
- [Execution Stage] Clarify the precise ASC-CDL parameterization (slope, offset, power per channel) and the exact procedure for converting the deduced parameters into the 3D LUT to support reproducibility.
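One plausible CDL-to-LUT procedure — offered only as a sketch, since the paper leaves the conversion unspecified — is to sample the deduced color transform on a uniform RGB grid:

```python
def bake_3d_lut(transform, size=17):
    """Sample a colour transform on a uniform size^3 RGB grid to produce
    a 3D LUT (rows in .cube order: red index varies fastest). Applying
    this single table to every frame, with trilinear interpolation at
    playback, makes temporal consistency structural: identical input
    colours always map to identical outputs, regardless of frame."""
    step = 1.0 / (size - 1)
    lut = []
    for b in range(size):
        for g in range(size):
            for r in range(size):
                lut.append(transform((r * step, g * step, b * step)))
    return lut
```

This also makes the temporal-consistency "guarantee" concrete: it holds exactly because one static table covers the whole clip, which is precisely why the conversion procedure deserves explicit documentation.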
Simulated Author's Rebuttal
We are grateful to the referee for highlighting these important aspects. We will make the suggested revisions to provide stronger empirical support and complete documentation.
read point-by-point responses
-
Referee: [Experiments] Experiments section: the abstract and introduction assert that LumiVideo approaches human expert quality on LumiGrade, yet no quantitative results (expert preference scores, CIEDE2000, temporal flicker metrics, or statistical tests) are supplied, nor are any baselines (LUT-only, direct regression, commercial auto-graders) or ablations (removing ToT or RAG) reported. This leaves the central performance claim unsupported.
Authors: We recognize that the current Experiments section does not include the quantitative results necessary to substantiate the claims made in the abstract and introduction. We will revise this section to report expert preference scores from studies with professional colorists, objective metrics including CIEDE2000 and temporal flicker, statistical significance tests, comparisons to baselines such as LUT-only methods, direct regression, and commercial auto-graders, as well as ablations for the ToT and RAG components. revision: yes
-
Referee: [Reasoning Engine] Reasoning engine description (Section 3.2): the claim that ToT search reliably navigates the non-linear color-parameter space rests on an unverified mapping from scene semantics to ASC-CDL values; without an ablation comparing ToT against direct LLM prompting or simple RAG retrieval, the contribution of the search strategy cannot be assessed.
Authors: The description in Section 3.2 explains the rationale for using ToT to navigate the parameter space, but we agree that an empirical validation through ablation is needed. We will add such an ablation study to the Experiments section, comparing the full Reasoning engine with ToT to versions using direct LLM prompting and RAG retrieval alone. revision: yes
-
Referee: [Benchmark] LumiGrade benchmark introduction: the manuscript provides no details on dataset composition (number of scenes, duration, log-encoding format, source cameras), how expert ground-truth grades were collected, or inter-expert agreement statistics, rendering the benchmark unusable for independent verification of the reported results.
Authors: We will expand the introduction of the LumiGrade benchmark to include comprehensive details on dataset composition (number of scenes, duration, log-encoding formats, source cameras), the methodology for collecting expert ground-truth grades, and inter-expert agreement statistics to enable independent verification. revision: yes
Circularity Check
No circularity detected in derivation chain
full rationale
The paper presents LumiVideo as an agentic pipeline that composes pre-existing external components (LLM knowledge, RAG retrieval, Tree-of-Thoughts search) to map scene semantics onto ASC-CDL parameters and a global 3D LUT. No equations or derivations are shown to reduce to fitted parameters within the paper itself, nor does any load-bearing claim rest on self-citation chains or ansatzes imported from prior author work. The new LumiGrade benchmark is introduced as an external evaluation set rather than a self-referential construct, and the system architecture remains independent of its own outputs. This yields a self-contained description whose central claims can be assessed against external benchmarks without internal circular reduction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption LLMs contain sufficient internalized cinematic knowledge to guide color grading decisions when augmented by RAG
- domain assumption Compiling parameters into ASC-CDL and 3D LUT analytically guarantees temporal consistency
invented entities (2)
- LumiVideo agentic system: no independent evidence
- LumiGrade benchmark: no independent evidence