pith. machine review for the scientific record.

arxiv: 2605.12938 · v1 · submitted 2026-05-13 · 💻 cs.CV · cs.AI · cs.LG

Recognition: 2 theorem links · Lean Theorem

CRePE: Curved Ray Expectation Positional Encoding for Unified-Camera-Controlled Video Generation

Jong Chul Ye, Seonghyun Jin, Sunwoo Park, Youngmin Kim

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 20:02 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords positional encoding · video generation · camera control · unified camera model · diffusion transformer · geometric attention · ray distribution · curved projection

The pith

CRePE represents each image token as a depth-aware positional distribution along its source ray to support unified camera control under the Unified Camera Model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

CRePE introduces a positional encoding that represents each image token as a depth-aware distribution along its source ray. This approach captures the projected-path geometry of wide-angle and fisheye cameras under the Unified Camera Model, going beyond pinhole assumptions. The method adds a Geometric Attention Adapter to frozen video DiTs and uses pseudo-supervision from a monocular geometry model to inject scene-distance information. The reported payoff is more stable camera control and improved geometry-aware and perceptual-quality metrics, with video quality that stays competitive with baselines.

Core claim

CRePE represents each image token as a depth-aware positional distribution along its source ray, providing a Unified Camera Model-compatible positional encoding that captures the projected-path geometry induced by wide-angle and fisheye cameras. It is implemented through a Geometric Attention Adapter added to frozen video DiTs, injecting token-wise scene-distance information into selected attention layers and stabilizing it with pseudo supervision from a monocular geometry foundation model. This design leads to more stable camera control and improves several geometry-aware and perceptual-quality metrics, while remaining competitive on video-quality metrics.

What carries the argument

Curved Ray Expectation Positional Encoding (CRePE), which models each token's position as the expectation of a depth distribution along its curved source ray under the Unified Camera Model.
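
As a rough illustration of the mechanism, the sketch below is one plausible reading consistent with the abstract and the pipeline figure; the Gaussian form of the depth distribution, the RoPE-style phase map φ, and treating (μ_i, σ_i) as the adapter's per-token outputs are editorial assumptions rather than the paper's stated derivation.

```latex
% Editorial sketch, not the paper's formulation. Assumed: Gaussian depth
% distribution q_i, RoPE-style phase map \phi, adapter-predicted (\mu_i, \sigma_i).
\mathbf{e}_{i \to \mathrm{query}}
  = \mathbb{E}_{d \sim q_i}\!\left[
      \phi\!\left( \pi_{\xi}\!\left( \mathbf{R}\, d\, \mathbf{r}(\mathbf{u}_i;\xi) + \mathbf{t} \right) \right)
    \right],
\qquad
q_i(d) = \mathcal{N}\!\left(d;\, \mu_i,\, \sigma_i^{2}\right)
```

Here r(u_i; ξ) is the UCM back-projected ray of key-frame pixel u_i, (R, t) is the key-to-query camera transform, and π_ξ is the UCM projection into the query view; under a pinhole model the ray's projected path is the usual epipolar line, whereas for ξ > 0 it bends into a curve, which is what the expectation has to integrate along.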

If this is right

  • More stable camera control for wide-angle and fisheye lenses compared to pinhole-only encodings.
  • Improved scores on geometry-aware and perceptual-quality metrics while staying competitive on video quality.
  • Better overall average rank than RayRoPE-style endpoint baselines in positional-encoding ablations.
  • Additional support for external radial-map control and source-video motion transfer through Radial MixForcing.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same ray-distribution pathway could let a single model train on footage from mixed or uncalibrated real-world cameras without lens-specific adapters.
  • Self-supervised consistency losses computed across generated frames might eventually replace the external monocular pseudo-supervisor.
  • The expectation formulation may permit direct differentiation through camera parameters to optimize trajectories at inference time.

Load-bearing premise

Pseudo-supervision from a monocular geometry foundation model is sufficient to stabilize the Geometric Attention Adapter without introducing systematic bias in the learned ray distributions.
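
One plausible form of that pseudo-supervision, sketched here purely as an editorial guess (the robust penalty ρ, the weight λ, and the per-token target d̂_i are assumptions, not the paper's stated loss), is a regression of each predicted distribution mean onto the foundation model's radial distance:

```latex
% Illustrative only: \rho (e.g. a Huber penalty), \lambda, and \hat{d}_i are assumed.
\mathcal{L}_{\mathrm{geo}}
  = \frac{\lambda}{N} \sum_{i=1}^{N} \rho\!\left( \mu_i - \hat{d}_i \right),
\qquad
\hat{d}_i = \text{radial distance at token } i \text{ from the monocular geometry model}
```

Written this way, any systematic bias in d̂_i under strong lens distortion propagates directly into μ_i, which is exactly the failure mode this premise has to exclude.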

What would settle it

Generating videos under known fisheye parameters on a synthetic scene with ground-truth ray paths and checking whether object trajectories and line projections match the expected curved geometry only when CRePE is used.
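
A minimal version of that test could look like the sketch below (Python with numpy; the single-parameter UCM projection formula, the chosen intrinsics, and the chord-deviation metric are illustrative assumptions, not the paper's evaluation protocol): project a known straight 3D segment under a fisheye-strength UCM, measure how far its image curves away from a straight chord, and then ask whether generated object tracks reproduce that curvature only when CRePE is enabled.

```python
import numpy as np

def ucm_project(X, fx, fy, cx, cy, xi):
    """Project 3D points with the single-parameter Unified Camera Model.

    Assumed projection: m = (x, y) / (z + xi * ||X||); xi = 0 recovers pinhole.
    """
    d = np.linalg.norm(X, axis=-1, keepdims=True)        # radial distance ||X||
    m = X[..., :2] / (X[..., 2:3] + xi * d)               # normalized image coords
    return np.stack([fx * m[..., 0] + cx, fy * m[..., 1] + cy], axis=-1)

# A straight 3D segment in front of the camera (hypothetical test geometry).
P0, P1 = np.array([-1.0, 0.2, 2.0]), np.array([1.0, 0.2, 4.0])
t = np.linspace(0.0, 1.0, 50)[:, None]
segment = (1.0 - t) * P0 + t * P1

# Reference projection under a fisheye-strength xi; xi = 0 would give a straight line.
intr = dict(fx=600.0, fy=600.0, cx=320.0, cy=240.0)
curve = ucm_project(segment, xi=0.8, **intr)

# Deviation of the curved projection from the straight chord between its endpoints;
# generated object tracks should show a similar deviation only when CRePE is used.
chord = curve[0] + t * (curve[-1] - curve[0])
print(f"max deviation from straight chord: {np.abs(curve - chord).max():.2f} px")
```

Since a pinhole encoding (ξ = 0) predicts the straight chord, a consistent match to the curved reference only when CRePE is active would directly support the UCM-aware claim.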

Figures

Figures reproduced from arXiv: 2605.12938 by Jong Chul Ye, Seonghyun Jin, Sunwoo Park, Youngmin Kim.

Figure 1. Text-to-video camera-conditioned generation with non-pinhole lenses. Each row shows a text-to-video (T2V) generation by CRePE, where the model is conditioned only on the text prompt shown in the figure and the specified camera trajectory/lens parameters; no source video, reference image, or external geometry map is provided at inference. The left side illustrates the target camera setting. The right side s…

Figure 2. Overview of the pipeline. (Diagram labels: Curved-Ray Expectation Positional Encoding; Non-Linear UCM Projection with μ, σ; Geometry Estimator; Input Token Feature; Key Frame; Query Frame; Piece-wise Phase Integration.)

Figure 3. Qualitative camera-control comparison. Generated videos under matched prompts and camera trajectories. The main comparison includes ReCamMaster, UCPE, and CRePE. CRePE follows the requested camera trajectory most stably under non-pinhole lenses.

Figure 4. Qualitative external-radial-map control. CRePE-Mix can consume externally supplied radial-distance maps through the same positional-encoding pathway used during camera-conditioned generation. We show scene-geometry-conditioned generation and source-video motion transfer qualitative results. On PanShot (top) and wild AI-generated videos (bottom), our model shows camera- and geometry-conditioned generation c…

Figure 5. Architecture of the Geometric Attention Adapter. Our adapter follows the UCPE-style camera-conditioning structure, but augments the attention pathway with a CRePE-based geometric attention branch. Following the layer-placement ablation, CRePE is applied to the middle 10 transformer blocks, where recoverable radial-distance information is strongest. The remaining early and late blocks retain the RelRay enco…

Figure 6. Layer-wise radial-distance probing results.

Figure 7. Layer-wise visualization of predicted radial-distance maps. We visualize radial-distance maps decoded from different Wan2.1 layers using the shared geometry head. The layer-15 map, a representative map from the middle CRePE window, shows the sharpest and most coherent scene structure. This supports our choice of applying CRePE and radial-distance supervision to the middle-layer representation selected by t…

Figure 8. Evolution of predicted radial-distance maps during denoising. We visualize CRePE's internal radial-distance maps at multiple denoising steps. Denoising progresses from left to right. The maps evolve from coarse responses to structured scene layouts, suggesting that the ray-position pathway progressively organizes scene geometry during generation.
Original abstract

Camera-conditioned video generation requires positional encoding that remains reliable under changes in camera motion, lens configuration, and scene structure. However, existing attention-level camera encodings either provide ray-only camera signals or rely on pinhole camera geometry, limiting their applicability to general camera control under the Unified Camera Model, including wide-angle and fisheye lenses. To address this limitation, we propose Curved Ray Expectation Positional Encoding (CRePE). CRePE represents each image token as a depth-aware positional distribution along its source ray, providing a Unified Camera Model-compatible positional encoding that captures the projected-path geometry induced by wide-angle and fisheye cameras. CRePE is implemented through a Geometric Attention Adapter added to frozen video DiTs, injecting token-wise scene-distance information into selected attention layers and stabilizing it with pseudo supervision from a monocular geometry foundation model. This design leads to more stable camera control and improves several geometry-aware and perceptual-quality metrics, while remaining competitive on video-quality metrics. Controlled positional-encoding ablations show a better overall average rank than a RayRoPE-style endpoint PE baseline, demonstrating the effectiveness of UCM-aware projected-path integration across diverse camera models. Furthermore, by extending the same positional-encoding pathway to external geometry control through Radial MixForcing, CRePE supports external radial-map control for scene-geometry-conditioned generation and source-video motion transfer beyond camera control.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Curved Ray Expectation Positional Encoding (CRePE) to enable reliable camera control in video diffusion transformers under the Unified Camera Model (UCM), including wide-angle and fisheye lenses. Each image token is represented as a depth-aware positional distribution along its source ray; this is realized by adding a Geometric Attention Adapter to frozen DiTs, stabilized via pseudo-supervision from a monocular geometry foundation model. The method is claimed to improve geometry-aware and perceptual metrics while remaining competitive on video quality, to outperform a RayRoPE-style baseline in average rank across ablations, and to support external radial-map control via Radial MixForcing.

Significance. If the learned ray distributions accurately reflect UCM geometry without systematic bias from the pseudo-labels, the approach would provide a practical route to camera- and geometry-conditioned generation beyond pinhole assumptions, with potential impact on controllable video synthesis pipelines.

major comments (3)
  1. [Method (Geometric Attention Adapter)] Method section (Geometric Attention Adapter and pseudo-supervision): the central claim that CRePE supplies UCM-compatible curved-ray positional encodings rests on the assumption that monocular foundation-model depth labels supply unbiased expectations along actual distorted rays; no quantitative validation (e.g., comparison of learned distributions to ground-truth UCM ray paths on fisheye or wide-angle test data) is provided, leaving open the possibility that the encoding collapses to approximate pinhole behavior.
  2. [Experiments and Ablations] Experiments and ablations: metric gains and the reported better average rank versus the RayRoPE baseline are presented without error bars, standard deviations across runs, or statistical significance tests, so it is impossible to determine whether the observed improvements are reliable or merely within noise.
  3. [Abstract / Method] Abstract and method: the precise formulation of the depth-aware positional distribution (how the expectation is computed from the adapter output and integrated into attention) is not derived in sufficient detail to verify that it captures projected-path geometry rather than simply injecting scalar depth.
minor comments (2)
  1. [Notation] Notation for the ray-distribution parameters should be introduced once and used consistently; currently the distinction between the adapter output and the final positional encoding is unclear on first reading.
  2. [Tables] Tables reporting ablation ranks would benefit from explicit column headers indicating the exact metric being ranked and the number of camera/lens configurations evaluated.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We are grateful to the referee for the thorough and insightful comments, which have helped us identify areas for improvement in the manuscript. Below, we provide a point-by-point response to the major comments. We have made revisions to address each concern.

Point-by-point responses
  1. Referee: Method section (Geometric Attention Adapter and pseudo-supervision): the central claim that CRePE supplies UCM-compatible curved-ray positional encodings rests on the assumption that monocular foundation-model depth labels supply unbiased expectations along actual distorted rays; no quantitative validation (e.g., comparison of learned distributions to ground-truth UCM ray paths on fisheye or wide-angle test data) is provided, leaving open the possibility that the encoding collapses to approximate pinhole behavior.

    Authors: We thank the referee for highlighting this important aspect. The pseudo-supervision from the monocular geometry foundation model is designed to provide depth estimates that align with the actual ray paths under the UCM. However, we acknowledge the value of direct quantitative validation. In the revised manuscript, we will include a comparison of the learned distributions to ground-truth UCM ray paths on fisheye and wide-angle test data to confirm that the encodings do not collapse to pinhole behavior. revision: yes

  2. Referee: Experiments and ablations: metric gains and the reported better average rank versus the RayRoPE baseline are presented without error bars, standard deviations across runs, or statistical significance tests, so it is impossible to determine whether the observed improvements are reliable or merely within noise.

    Authors: We agree that including error bars, standard deviations, and statistical significance tests would strengthen the experimental results. We will rerun the key experiments with multiple random seeds and report the mean and standard deviation, along with p-values from appropriate statistical tests, in the revised manuscript. revision: yes

  3. Referee: Abstract and method: the precise formulation of the depth-aware positional distribution (how the expectation is computed from the adapter output and integrated into attention) is not derived in sufficient detail to verify that it captures projected-path geometry rather than simply injecting scalar depth.

    Authors: We appreciate this feedback on the clarity of the formulation. We will expand the Method section with a detailed derivation of the depth-aware positional distribution, explicitly describing how the expectation is computed from the Geometric Attention Adapter output and integrated into the attention mechanism to capture the projected-path geometry under the UCM, rather than merely injecting scalar depth values. revision: yes

Circularity Check

0 steps flagged

No circularity: the new adapter and external pseudo-supervision keep the derivation independent of its own outputs

Full rationale

The paper defines CRePE as a depth-aware positional distribution along source rays under the Unified Camera Model, realized by adding a Geometric Attention Adapter to frozen video DiTs and stabilizing it via pseudo-depth labels from an external monocular geometry foundation model. No equation or central claim reduces by construction to a parameter fitted inside the paper, nor does any load-bearing step rely on a self-citation chain, imported uniqueness theorem, or ansatz smuggled from prior author work. Ablations compare against an external RayRoPE-style baseline, and the UCM compatibility follows directly from the ray-distribution construction rather than from re-labeling fitted quantities. The derivation therefore remains independent of its own outputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The claim rests on the domain assumption that monocular depth estimates provide reliable pseudo-supervision for ray distributions and on the new invented entity of the Geometric Attention Adapter; no free parameters are explicitly listed but the depth distribution itself functions as one.

free parameters (1)
  • depth distribution parameters
    Parameters defining the positional distribution along each curved ray are introduced to make the encoding work.
axioms (1)
  • domain assumption Unified Camera Model correctly captures lens-induced ray curvature for wide-angle and fisheye cases
    Invoked when claiming compatibility beyond pinhole geometry.
invented entities (2)
  • Curved Ray Expectation Positional Encoding no independent evidence
    purpose: To supply token-wise scene-distance information compatible with non-pinhole cameras
    New encoding scheme introduced in the paper
  • Geometric Attention Adapter no independent evidence
    purpose: To inject the curved-ray information into selected attention layers of frozen DiTs
    New module added to existing architecture

pith-pipeline@v0.9.0 · 5564 in / 1337 out tokens · 59047 ms · 2026-05-14T20:02:32.921479+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · 3 internal anchors

  1. [1]

    Recammaster: Camera-controlled generative rendering from a single video

    Jianhong Bai, Menghan Xia, Xiao Fu, Xintao Wang, Lianrui Mu, Jinwen Cao, Zuozhu Liu, Haoji Hu, Xiang Bai, Pengfei Wan, et al. Recammaster: Camera-controlled generative rendering from a single video. In ICCV, 2025

  2. [2]

    Rayrope: Projective ray positional encoding for multi-view attention, 2026

    Yu Wu, Minsik Jeon, Jen-Hao Rick Chang, Oncel Tuzel, and Shubham Tulsiani. Rayrope: Projective ray positional encoding for multi-view attention, 2026. URL https://arxiv.org/abs/2601.15275

  3. [3]

    Unified camera positional encoding for controlled video generation

    Cheng Zhang, Boying Li, Meng Wei, Yan-Pei Cao, Camilo Cruz Gambardella, Dinh Phung, and Jianfei Cai. Unified camera positional encoding for controlled video generation. arXiv preprint arXiv:2512.07237, 2025

  4. [4]

    Scalable Diffusion Models with Transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers, 2023. URL https://arxiv.org/abs/2212.09748

  5. [5]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025

  6. [6]

    Recammaster: Camera-controlled generative rendering from a single video

    Jianhong Bai, Menghan Xia, Xiao Fu, Xintao Wang, Lianrui Mu, Jinwen Cao, Zuozhu Liu, Haoji Hu, Xiang Bai, Pengfei Wan, and Di Zhang. Recammaster: Camera-controlled generative rendering from a single video, 2025. URL https://arxiv.org/abs/2503.11647

  7. [7]

    Yonosplat: You only need one model for feedforward 3d gaussian splatting

    Botao Ye, Boqi Chen, Haofei Xu, Daniel Barath, and Marc Pollefeys. Yonosplat: You only need one model for feedforward 3d gaussian splatting. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=ImRhA9xmay

  8. [8]

    Cameractrl: Enabling camera control for text-to-video generation, 2025

    Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. Cameractrl: Enabling camera control for text-to-video generation, 2025. URL https://arxiv.org/abs/2404.02101

  9. [9]

    Motionctrl: A unified and flexible motion controller for video generation, 2024

    Zhouxia Wang, Ziyang Yuan, Xintao Wang, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. Motionctrl: A unified and flexible motion controller for video generation, 2024. URL https://arxiv.org/abs/2312.03641

  10. [10]

    Direct-a-video: Customized video generation with user-directed camera movement and object motion

    Shiyuan Yang, Liang Hou, Haibin Huang, Chongyang Ma, Pengfei Wan, Di Zhang, Xiaodong Chen, and Jing Liao. Direct-a-video: Customized video generation with user-directed camera movement and object motion. In Special Interest Group on Computer Graphics and Interactive Techniques Conference Papers, SIGGRAPH ’24, page 1–12. ACM, 2024. doi: 10.1145/3...

  11. [11]

    Cameras as rays: Pose estimation via ray diffusion, 2024

    Jason Y. Zhang, Amy Lin, Moneish Kumar, Tzu-Hsuan Yang, Deva Ramanan, and Shubham Tulsiani. Cameras as rays: Pose estimation via ray diffusion, 2024. URL https://arxiv.org/abs/2402.14817

  12. [12]

    RoFormer: Enhanced Transformer with Rotary Position Embedding

    Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding, 2023. URL https://arxiv.org/abs/2104.09864

  13. [13]

    Eschernet: A generative model for scalable view synthesis, 2024

    Xin Kong, Shikun Liu, Xiaoyang Lyu, Marwan Taher, Xiaojuan Qi, and Andrew J. Davison. Eschernet: A generative model for scalable view synthesis, 2024. URL https://arxiv.org/abs/2402.03908

  14. [14]

    Gta: A geometry-aware attention mechanism for multi-view transformers, 2024

    Takeru Miyato, Bernhard Jaeger, Max Welling, and Andreas Geiger. Gta: A geometry-aware attention mechanism for multi-view transformers, 2024. URL https://arxiv.org/abs/2310.10375

  15. [15]

    Cameras as relative positional encoding, 2025

    Ruilong Li, Brent Yi, Junchen Liu, Hang Gao, Yi Ma, and Angjoo Kanazawa. Cameras as relative positional encoding, 2025. URL https://arxiv.org/abs/2507.10496

  16. [16]

    Light field networks: Neural scene representations with single-evaluation rendering, 2022

    Vincent Sitzmann, Semon Rezchikov, William T. Freeman, Joshua B. Tenenbaum, and Fredo Durand. Light field networks: Neural scene representations with single-evaluation rendering, 2022. URL https://arxiv.org/abs/2106.02634

  17. [17]

    Learning neural light fields with ray-space embedding networks, 2022

    Benjamin Attal, Jia-Bin Huang, Michael Zollhoefer, Johannes Kopf, and Changil Kim. Learning neural light fields with ray-space embedding networks, 2022. URL https://arxiv.org/abs/2112.01523

  18. [18]

    Scene representation transformer: Geometry-free novel view synthesis through set-latent scene representations, 2022

    Mehdi S. M. Sajjadi, Henning Meyer, Etienne Pot, Urs Bergmann, Klaus Greff, Noha Radwan, Suhani Vora, Mario Lucic, Daniel Duckworth, Alexey Dosovitskiy, Jakob Uszkoreit, Thomas Funkhouser, and Andrea Tagliasacchi. Scene representation transformer: Geometry-free novel view synthesis through set-latent scene representations, 2022. URL https://arxiv.org/abs/...

  19. [19]

    Cat3d: Create anything in 3d with multi-view diffusion models, 2024

    Ruiqi Gao, Aleksander Holynski, Philipp Henzler, Arthur Brussee, Ricardo Martin-Brualla, Pratul Srinivasan, Jonathan T. Barron, and Ben Poole. Cat3d: Create anything in 3d with multi-view diffusion models, 2024. URL https://arxiv.org/abs/2405.10314

  21. [21]

    Dust3r: Geometric 3d vision made easy, 2024

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy, 2024. URL https://arxiv.org/abs/2312.14132

  22. [22]

    Grounding image matching in 3d with mast3r, 2024

    Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3d with mast3r, 2024. URL https://arxiv.org/abs/2406.09756

  23. [23]

    Vggt: Visual geometry grounded transformer, 2025

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer, 2025. URL https://arxiv.org/abs/2503.11651

  24. [24]

    Unik3d: Universal camera monocular 3d estimation, 2025

    Luigi Piccinelli, Christos Sakaridis, Mattia Segu, Yung-Hsu Yang, Siyuan Li, Wim Abbeloos, and Luc Van Gool. Unik3d: Universal camera monocular 3d estimation, 2025. URL https://arxiv.org/abs/2503.16591