Holo-World: Unified Camera, Object and Weather Control for Video World Model

Chunfeng Wang; Dachun Kai; Jiahui Yuan; Wei Li; Wenzhang Sun; Xiangchen Yin; Xiaoyan Sun; Yinda Chen; Zijie Liu

arxiv: 2606.20083 · v3 · pith:CSPGQAWFnew · submitted 2026-06-18 · 💻 cs.CV

Holo-World: Unified Camera, Object and Weather Control for Video World Model

Xiangchen Yin , Wenzhang Sun , Jiahui Yuan , Zijie Liu , Yinda Chen , Wei Li , Dachun Kai , Chunfeng Wang

show 1 more author

Xiaoyan Sun

This is my paper

Pith reviewed 2026-07-02 21:54 UTC · model grok-4.3

classification 💻 cs.CV

keywords video world modelcontrollable video generationcamera controlobject controlweather transferunified scene adapterfirst-frame generation

0 comments

The pith

Holo-World generates videos from a single image by jointly controlling camera paths, object motions, and weather states.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to build a video world model that begins with one image and follows explicit instructions for camera movement, object placement, and optional weather changes to output a full video sequence. This setup matters because earlier methods either needed an existing source video that already encoded future structure or handled the different controls in isolation. The work creates a dataset of unified control examples and a model whose adapter splits the task of keeping the underlying scene intact from the task of applying weather-specific looks and effects. If the factorization works, the model can preserve geometric consistency under motion controls while still producing varied environmental conditions from the same starting frame.

Core claim

Holo-World is a unified controllable video world model that starts from a single image and follows explicit camera and object controls plus an optional weather instruction to generate a video that either preserves the source world or transfers it to a target weather state. Its Unified Scene Adapter factorizes world preservation and weather transfer into distinct parameter subspaces, using rendered background, geometry buffers, and object controls to maintain controlled scene structure while modeling weather-dependent appearance and particle effects. Scene-Weather Decomposed CFG guides scene and weather residuals separately to strengthen target weather effects without over-amplifying the full

What carries the argument

Unified Scene Adapter, which factorizes world preservation and weather transfer into distinct parameter subspaces using rendered background, geometry buffers, and object controls.

If this is right

Precise camera and object controls can be maintained with consistent scene structure across generated frames.
Scenes can be transferred into diverse target weather states while the same controls remain in effect.
The model outperforms video-to-video weather editing baselines specifically on weather-state generation quality.
A new dataset supplies aligned supervision for camera, object, and weather signals from diverse source videos.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The separation of structural and appearance subspaces may allow the same adapter design to handle other environmental shifts such as time of day or seasonal changes without retraining the full model.
Because the method requires only a first frame plus controls, it could support interactive editing loops where a user adjusts motion or weather mid-generation and sees immediate consistency.
Extending the first-frame anchor to multi-view or stereo inputs might produce 3D-consistent world models that support novel-view synthesis under the same controls.

Load-bearing premise

Factorizing preservation of scene structure from weather appearance changes inside separate parameter subspaces, anchored only to the first frame and its rendered geometry buffers, is enough to produce videos that obey the given controls.

What would settle it

Videos that visibly break the supplied camera trajectory, move objects to incorrect positions, or fail to produce the instructed weather particles and lighting while still satisfying the structural inputs would falsify the claim.

Figures

Figures reproduced from arXiv: 2606.20083 by Chunfeng Wang, Dachun Kai, Jiahui Yuan, Wei Li, Wenzhang Sun, Xiangchen Yin, Xiaoyan Sun, Yinda Chen, Zijie Liu.

**Figure 1.** Figure 1: Unified state control in Holo-World. Holo-World jointly controls camera motion, object dynamics, and weather state within the same observed world. ABSTRACT Video world models are moving toward preserving an observed world under controllable camera and object motion while allowing its environmental state to change. Yet these controls remain isolated, and weather generation typically relies on a source video… view at source ↗

**Figure 2.** Figure 2: HoloStateData construction pipeline. These annotations are converted into source-side rendered controls and object controls, while paired target-weather videos provide supervision for weather-state transfer. weather text, object masks, camera motion, and dense geometry. Scene construction renders sourceside background RGB, depth, and normal controls, converts object masks into object controls, and associa… view at source ↗

**Figure 3.** Figure 3: Overview of Holo-World. Given a first frame and factorized source-to-state controls, Holo-World decomposes controllable video generation into world-preservation and weather-transfer residual paths. Metrics. On the Real subset, we report VBench-I2V (Huang et al., 2025b) for video quality, rotation error (RotErr), translation error (TransErr) (He et al., 2024) and ObjMC (Wang et al., 2024) for camera and ob… view at source ↗

**Figure 4.** Figure 4: Main qualitative comparison. Real and Weather rows show the source-to-state requirement of preserving the source-controlled world while synthesizing the requested weather state. Results of Weather Transfer. We next evaluate the Weather subset, where Holo-World must change the weather from a single image under camera and object control . This setting is stricter than other video-to-video editing models beca… view at source ↗

**Figure 5.** Figure 5: Core decoupling ablation. The comparison visualizes whether UniSA reduces interference between Real-subset preservation and Weather-subset editing. SW-CFG (weather=2) w/o CFG with CFG SW-CFG (weather=4) SW-CFG (weather=2) w/o CFG with CFG SW-CFG (weather=4) Weather Sample Real Sample [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

**Figure 6.** Figure 6: Guidance decoupling ablation. The comparison visualizes how no CFG, vanilla CFG, and SceneWeather Decomposed CFG balance weather strength and source-world preservation. and raise Weather Alignment and VLM Evaluation on the Weather subset, indicating that explicit geometry anchors help produce weather effects and preserves the controlled scene. UniSA further improves all three background metrics and VLM E… view at source ↗

**Figure 7.** Figure 7: Multi-weather state control. With the same source world and camera/object controls, changing only the target weather prompt produces different weather videos [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗

**Figure 8.** Figure 8: Additional visualization of Holo-World. Real examples visualize background-consistent generation under rendered controls, while Weather examples visualize target weather transfer under the same source-side control interface [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗

**Figure 9.** Figure 9: Weather-family distribution in HoloStateData. The pie chart shows the target weather-family distribution of the Weather training subset used for state-transfer supervision. VLM text annotation. HoloStateData uses Qwen3-VL to produce factorized text conditions with low-randomness decoding (Bai et al., 2025). The annotation pipeline uses two VLM prompts rather than one generic caption. The scene prompt is ge… view at source ↗

**Figure 10.** Figure 10: HoloStateData example. The visualization shows how one video record is converted into the source [PITH_FULL_IMAGE:figures/full_fig_p020_10.png] view at source ↗

read the original abstract

Video world models are moving toward preserving an observed world under controllable camera and object motion while allowing its environmental state to change. Yet these controls remain isolated, and weather generation typically relies on a source video or reconstructed scene that already specifies future structure. We study a first-frame-anchored source-to-state setting, where the model starts from a single image and follows explicit camera and object controls and an optional weather instruction, then generates a video that either preserves the source world or transfers it to a target weather state. To address these challenges, we first build HoloStateData, a state video dataset that turns diverse videos into unified control samples for camera, object, and weather supervision. Second, we introduce Holo-World, a unified controllable video world model that jointly controls the scene from a single image. Its Unified Scene Adapter factorizes world preservation and weather transfer into distinct parameter subspaces, using rendered background, geometry buffers, and object controls to maintain controlled scene structure while modeling weather-dependent appearance and particle effects. Additionally, Scene-Weather Decomposed CFG guides scene and weather residuals separately, strengthening target weather effects without over-amplifying the full condition. Quantitative and qualitative experiments demonstrate that Holo-World maintains precise camera and object controls with consistent scene structure while transferring scenes into diverse target weather states, outperforming video-to-video weather editing baselines on weather-state generation. Our project page is available at https://xiangchenyin.github.io/Holo-World/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper builds a new dataset and adapter to control camera, object, and weather from one image in video generation, but the abstract gives almost no experimental details to judge whether the factorization actually works.

read the letter

The core contribution is HoloStateData, a dataset that turns videos into unified supervision for camera, object, and weather, plus the Unified Scene Adapter that splits preservation from weather transfer using geometry buffers and rendered backgrounds. Scene-Weather Decomposed CFG is the other new piece, meant to strengthen weather effects without messing up the controls. This first-frame-anchored setup is genuinely different from most video-to-video weather editing, which usually starts from a source video that already encodes future structure.

The approach looks practical on paper for simulation-style world models. If the adapter really keeps object trajectories and camera paths stable while swapping weather states, that would be a useful step. The claim of outperforming baselines on weather generation is the part worth checking.

The main weakness is the complete lack of numbers, baselines, or setup details in the abstract. Without seeing metrics, ablations, or failure cases on complex motion, it's impossible to tell if the subspace factorization leaks or if drift appears when the input has no future cues. The stress-test concern about relying on the adapter without source-video structure is still open.

This is for groups already working on controllable video world models who need multi-axis control from minimal input. It deserves a serious referee if the full paper shows solid quantitative results and reproducible code, because the unified-control direction matters even if the current evidence is thin.

Referee Report

2 major / 2 minor

Summary. The paper introduces Holo-World, a unified controllable video world model operating in a first-frame-anchored source-to-state setting. Starting from a single image plus explicit camera/object controls and optional weather instruction, the model generates videos that either preserve the source world or transfer it to a target weather state. Key contributions include the HoloStateData dataset for unified supervision, the Unified Scene Adapter that factorizes world preservation and weather transfer into distinct subspaces via rendered background/geometry buffers and object controls, and Scene-Weather Decomposed CFG to guide residuals separately. The manuscript claims quantitative and qualitative outperformance over video-to-video weather editing baselines while maintaining precise controls and consistent scene structure.

Significance. If the central claims hold, the work would advance video world models by demonstrating unified control without requiring a source video that already encodes future structure, a stricter and more flexible setting than prior video-to-video approaches. The dataset construction and factorization via buffers represent concrete engineering contributions that could support downstream applications in simulation and editing.

major comments (2)

[Method (Unified Scene Adapter and CFG sections)] The central claim that the Unified Scene Adapter factorization (via rendered background, geometry buffers, and object controls) plus Scene-Weather Decomposed CFG is sufficient to produce drift-free videos under explicit controls from a single first-frame input is load-bearing, yet the manuscript provides no quantitative analysis of subspace leakage, buffer rendering accuracy under complex trajectories, or failure cases when the source video is absent. This directly affects whether the first-frame-anchored setting succeeds without future structure leakage.
[Experiments] Quantitative results are asserted to show outperformance on weather-state generation, but without reported metrics, baselines, or dataset splits in the provided description, it is impossible to verify whether the cross-setting comparison is fair or whether the controls remain precise when weather transfer is active.

minor comments (2)

[Method] Clarify the exact rendering pipeline for geometry buffers and how they are conditioned into the adapter; notation for the distinct parameter subspaces is introduced but not formalized.
[Abstract/Experiments] The project page link is given but no supplementary video or failure-case analysis is referenced in the text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below with clarifications from the full manuscript and commit to targeted revisions where the concerns identify gaps in presentation or analysis.

read point-by-point responses

Referee: [Method (Unified Scene Adapter and CFG sections)] The central claim that the Unified Scene Adapter factorization (via rendered background, geometry buffers, and object controls) plus Scene-Weather Decomposed CFG is sufficient to produce drift-free videos under explicit controls from a single first-frame input is load-bearing, yet the manuscript provides no quantitative analysis of subspace leakage, buffer rendering accuracy under complex trajectories, or failure cases when the source video is absent. This directly affects whether the first-frame-anchored setting succeeds without future structure leakage.

Authors: We agree that direct quantitative probes of subspace leakage and buffer accuracy would strengthen the load-bearing claim. The current manuscript demonstrates the factorization through control-consistency metrics (camera pose error, object trajectory alignment) and visual ablations showing preserved structure under weather transfer, which provide indirect evidence against leakage. In revision we will add explicit measurements: cosine similarity between scene and weather residual features to quantify leakage, rendering error of geometry buffers on complex trajectories, and a failure-case study for first-frame inputs lacking future structure. These will appear in an expanded Section 4.3. revision: yes
Referee: [Experiments] Quantitative results are asserted to show outperformance on weather-state generation, but without reported metrics, baselines, or dataset splits in the provided description, it is impossible to verify whether the cross-setting comparison is fair or whether the controls remain precise when weather transfer is active.

Authors: The full manuscript (Section 4 and supplementary material) reports the requested details: FID and CLIP-based weather classification accuracy for state transfer, camera/object control precision (pose RMSE, trajectory IoU), comparisons against video-to-video baselines (e.g., ControlVideo, VideoComposer variants), and the 80/10/10 train/val/test splits of HoloStateData. To improve verifiability we will add a consolidated summary table and explicit discussion of comparison fairness and control precision under active weather transfer in the main text. revision: partial

Circularity Check

0 steps flagged

No circularity detected; model claims are architectural and empirical

full rationale

The paper presents an empirical video generation model (Holo-World) built on a new dataset (HoloStateData), with architectural components such as the Unified Scene Adapter and Scene-Weather Decomposed CFG. No equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claims rest on quantitative/qualitative experiments comparing against baselines, which are externally falsifiable and do not reduce to the inputs by construction. This is the normal case of a self-contained engineering contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities can be extracted beyond high-level model names.

pith-pipeline@v0.9.1-grok · 5815 in / 1125 out tokens · 24714 ms · 2026-07-02T21:54:13.064053+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Reference graph

Works this paper leans on

16 extracted references

[1]

Do not describe weather or lighting here; omit those entirely (they are covered by a separate annotation prompt)

Describe the main theme and setting (such as location and spatial layout of the scene). Do not describe weather or lighting here; omit those entirely (they are covered by a separate annotation prompt). ,→ ,→
[2]

Describe the environment and scene-anchored cues (terrain, architecture, furniture, obstacles, open space, notable static props), then key subjects (appearance, clothing, expression, quantity, ethnicity, posture, etc.), spatial relationships, and camera movements; ,→ ,→ ,→
[3]

Describe relationships and interactions between people, objects, and the environment (positioning, contact, support, occlusion, entry and exit, use of tools or furniture, etc.); ,→ ,→
[4]

Describe any character present and their emotions or activities (such as expressions, postures, etc.), in relation to nearby objects and the scene;,→
[5]

The caption should reflect the style of the video (e.g., cinematic, anime, documentary, vlog, vintage film, etc.);,→
[6]

Use accurate verbs to describe movement; ,→ ,→

Describe the content in chronological order, including scene-level changes (rearranged or moved objects, reconfiguration of space) and how characters or objects move; describe how the camera changes perspective. Use accurate verbs to describe movement; ,→ ,→
[7]

Describe background and environmental details (such as architecture, natural scenery, materials and clutter, etc.) without weather or lighting;,→
[8]

Describe camera motion control (e.g., zoom in, zoom out, push in, pull out, pan right, pan left, truck right, truck left, tilt up, tilt down, pedestal up, pedestal down, arc shot, tracking shot, static shot, and handheld shot, etc.); ,→ ,→
[9]

Only describe what can be determined from the video

Do not describe imagined content. Only describe what can be determined from the video. Avoid listing things. Do not use abstract concepts (love, hate, justice, infinity, joy) as subjects. Use concrete nouns (human, cup, dog, planet, headphones) for more accurate results. Use verbs to describe movement and state changes. Write in plain, conversational lang...
[10]

18 Preprint You are a weather-only caption specialist for video weather editing data

Control the caption length to around 80-150 words; (1) Scene annotation prompt.The prompt extracts source-scene content and motion while excluding weather from the scene text. 18 Preprint You are a weather-only caption specialist for video weather editing data. From the input video (or the specified target), produce one short English line that states only...
[11]

Cloud (cloud-only) -- use one phrase only: -`a few clouds`/`cloudy`/`overcast`
[12]

Rain (precipitation and/or wet ground, independent) -- choose what is visibly present; do not force a pair. Use only phrases from the lists below:,→ - Rain (precipitation):`light rain`/`rain`/`heavy rain` - Puddle / rain ground:`slightly wet rain ground`/`rain ground`/`flooded rain ground`,→ - Compose one line: one matching phrase or, if both rain and wet...
[13]

Snow (falling snow and/or snow on ground, independent) -- same independence as rain: do not require falling snow and ground snow together. Use only:,→ - Falling snow:`light snow`/`snow`/`heavy snow` - Snow-covered ground:`light snow-covered ground`/`snow ground`/`deep snow ground` - One phrase, or both as`{falling-snow phrase} and {ground-snow phrase}`whe...
[14]

drizzle",

Fog (fog-only) -- use one phrase only: -`light fog`/`fog`/`thick fog` Disambiguation rules: - Do not add cloud or fog when the family is Rain or Snow, and do not add rain or snow when the family is Cloud or Fog. This keeps labels disjoint for training (cloudy / rainy / snowy / foggy are separate regimes). ,→ ,→ - Do not mix rain and snow in one line; pick...

2025
[15]

Weather Background: whether the sky, ground, road, buildings, visibility, surface materials, reflections, snow coverage, wetness, fog atmosphere, or other background cues are plausibly edited for the target weather
[16]

score": an integer from 0 to 100,

Weather Dynamics: whether dynamic weather elements such as rain, snow, fog, mist, or moving atmospheric effects appear natural, temporally coherent, and consistent with the target weather. Do not reward a method merely for keeping the background unchanged. A good result should edit the weather background when the target weather requires it. Do not evaluat...

[1] [1]

Do not describe weather or lighting here; omit those entirely (they are covered by a separate annotation prompt)

Describe the main theme and setting (such as location and spatial layout of the scene). Do not describe weather or lighting here; omit those entirely (they are covered by a separate annotation prompt). ,→ ,→

[2] [2]

Describe the environment and scene-anchored cues (terrain, architecture, furniture, obstacles, open space, notable static props), then key subjects (appearance, clothing, expression, quantity, ethnicity, posture, etc.), spatial relationships, and camera movements; ,→ ,→ ,→

[3] [3]

Describe relationships and interactions between people, objects, and the environment (positioning, contact, support, occlusion, entry and exit, use of tools or furniture, etc.); ,→ ,→

[4] [4]

Describe any character present and their emotions or activities (such as expressions, postures, etc.), in relation to nearby objects and the scene;,→

[5] [5]

The caption should reflect the style of the video (e.g., cinematic, anime, documentary, vlog, vintage film, etc.);,→

[6] [6]

Use accurate verbs to describe movement; ,→ ,→

Describe the content in chronological order, including scene-level changes (rearranged or moved objects, reconfiguration of space) and how characters or objects move; describe how the camera changes perspective. Use accurate verbs to describe movement; ,→ ,→

[7] [7]

Describe background and environmental details (such as architecture, natural scenery, materials and clutter, etc.) without weather or lighting;,→

[8] [8]

Describe camera motion control (e.g., zoom in, zoom out, push in, pull out, pan right, pan left, truck right, truck left, tilt up, tilt down, pedestal up, pedestal down, arc shot, tracking shot, static shot, and handheld shot, etc.); ,→ ,→

[9] [9]

Only describe what can be determined from the video

Do not describe imagined content. Only describe what can be determined from the video. Avoid listing things. Do not use abstract concepts (love, hate, justice, infinity, joy) as subjects. Use concrete nouns (human, cup, dog, planet, headphones) for more accurate results. Use verbs to describe movement and state changes. Write in plain, conversational lang...

[10] [10]

18 Preprint You are a weather-only caption specialist for video weather editing data

Control the caption length to around 80-150 words; (1) Scene annotation prompt.The prompt extracts source-scene content and motion while excluding weather from the scene text. 18 Preprint You are a weather-only caption specialist for video weather editing data. From the input video (or the specified target), produce one short English line that states only...

[11] [11]

Cloud (cloud-only) -- use one phrase only: -`a few clouds`/`cloudy`/`overcast`

[12] [12]

Rain (precipitation and/or wet ground, independent) -- choose what is visibly present; do not force a pair. Use only phrases from the lists below:,→ - Rain (precipitation):`light rain`/`rain`/`heavy rain` - Puddle / rain ground:`slightly wet rain ground`/`rain ground`/`flooded rain ground`,→ - Compose one line: one matching phrase or, if both rain and wet...

[13] [13]

Snow (falling snow and/or snow on ground, independent) -- same independence as rain: do not require falling snow and ground snow together. Use only:,→ - Falling snow:`light snow`/`snow`/`heavy snow` - Snow-covered ground:`light snow-covered ground`/`snow ground`/`deep snow ground` - One phrase, or both as`{falling-snow phrase} and {ground-snow phrase}`whe...

[14] [14]

drizzle",

Fog (fog-only) -- use one phrase only: -`light fog`/`fog`/`thick fog` Disambiguation rules: - Do not add cloud or fog when the family is Rain or Snow, and do not add rain or snow when the family is Cloud or Fog. This keeps labels disjoint for training (cloudy / rainy / snowy / foggy are separate regimes). ,→ ,→ - Do not mix rain and snow in one line; pick...

2025

[15] [15]

Weather Background: whether the sky, ground, road, buildings, visibility, surface materials, reflections, snow coverage, wetness, fog atmosphere, or other background cues are plausibly edited for the target weather

[16] [16]

score": an integer from 0 to 100,

Weather Dynamics: whether dynamic weather elements such as rain, snow, fog, mist, or moving atmospheric effects appear natural, temporally coherent, and consistent with the target weather. Do not reward a method merely for keeping the background unchanged. A good result should edit the weather background when the target weather requires it. Do not evaluat...