pith. machine review for the scientific record.

arxiv: 2605.07910 · v2 · submitted 2026-05-08 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

One World, Dual Timeline: Decoupled Spatio-Temporal Gaussian Scene Graph for 4D Cooperative Driving Reconstruction

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:40 UTC · model grok-4.3

classification 💻 cs.CV
keywords 4D scene reconstruction · Gaussian scene graph · cooperative autonomous driving · temporal asynchrony · V2X-Seq · dynamic agents · pose decoupling · photometric loss

The pith

Decoupling pose trajectories per source eliminates the quadratic photometric-loss floor and the ghosting it causes in 4D cooperative scene reconstruction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vehicle and infrastructure cameras in cooperative driving capture the same scene at different times due to independent clocks, breaking the synchronized assumption in existing Gaussian scene graph methods. This leads to conflicting gradients and ghosting on dynamic agents, with photometric loss proven to scale quadratically with agent velocity and time offset. DUST addresses this by sharing a canonical set of Gaussians per agent for appearance while using separate pose trajectories aligned to each source's timestamps. The decoupling makes the gradient kernel block-diagonal, removing interference, and is made practical with anchor-based pose correction and regularized optimization.

Core claim

Any single-timeline formulation incurs an irreducible photometric loss that scales quadratically with agent velocity and cross-source time offset. Sharing a canonical Gaussian set per agent for appearance consistency, while maintaining decoupled pose trajectories aligned to each source's true capture timestamps, makes the pose-gradient kernel block-diagonal and eliminates cross-source interference entirely. The formulation is made practical by a static anchor-based pose correction pipeline and pose-regularized joint optimization.
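In symbols (ours, not the paper's), the two halves of that claim can be stated schematically. Assuming a first-order motion model, a shared pose forces a residual image displacement δ ≈ Π(v Δt) against one of the two views, with Π the camera projection and v the agent velocity; a second-order expansion of the photometric error then yields the quadratic floor, while decoupling removes the off-diagonal blocks coupling vehicle-side (v) and infrastructure-side (i) pose gradients:

\mathcal{L}_{\text{photo}} = \int \big( I(\mathbf{p}) - I(\mathbf{p} - \boldsymbol{\delta}) \big)^2 \, d\mathbf{p}
\;\approx\; \boldsymbol{\delta}^{\top} \Big( \int \nabla I \, \nabla I^{\top} \, d\mathbf{p} \Big) \boldsymbol{\delta}
\;\propto\; \|\mathbf{v}\|^{2} \, \Delta t^{2},
\qquad
K = \begin{pmatrix} K_{vv} & K_{vi} \\ K_{iv} & K_{ii} \end{pmatrix}
\;\longrightarrow\;
\begin{pmatrix} K_{vv} & 0 \\ 0 & K_{ii} \end{pmatrix}.

The paper's own derivation (its Section 3.2, per the rebuttal below) should be taken as authoritative; this is only a reading of the abstract's statement.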

What carries the argument

The DUST decoupled spatio-temporal Gaussian scene graph, which shares canonical Gaussians per agent but decouples pose trajectories to source-specific timestamps to block-diagonalize the pose-gradient kernel.
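As a data-structure sketch, a minimal rendering of that representation; field names and layout are our own invention, not the authors' code:

import numpy as np
from dataclasses import dataclass

@dataclass
class DecoupledAgent:
    # One canonical Gaussian set per agent: geometry and appearance shared by both sources.
    means: np.ndarray       # (N, 3) canonical Gaussian centers
    colors: np.ndarray      # (N, 3) shared appearance parameters
    # Two pose timelines, each keyed by that source's own capture timestamps.
    vehicle_poses: dict     # t_vehicle -> (4, 4) SE(3) pose matrix
    infra_poses: dict       # t_infra   -> (4, 4) SE(3) pose matrix

    def world_means(self, source: str, t: float) -> np.ndarray:
        # At render time the SAME canonical Gaussians are placed by whichever
        # timeline matches the requesting source, so each source's photometric
        # loss only ever touches its own pose variables.
        pose = (self.vehicle_poses if source == "vehicle" else self.infra_poses)[t]
        homo = np.concatenate([self.means, np.ones((len(self.means), 1))], axis=1)
        return (homo @ pose.T)[:, :3]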

If this is right

  • Dynamic-area PSNR increases by 3.2 dB compared to the strongest baseline on V2X-Seq sequences.
  • Fréchet Video Distance decreases by 37.7%.
  • Robust performance is maintained even with larger temporal asynchrony between vehicle and infrastructure data.
  • Cross-source gradient interference is completely removed during optimization of dynamic agents.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • High-velocity agents would benefit most from this decoupling, suggesting applications in high-speed traffic monitoring.
  • Real-world datasets collected without synchronization hardware could now be used directly for high-quality 4D reconstruction.
  • The block-diagonal gradient property might extend to other differentiable rendering techniques facing multi-view timing issues.
  • Aligning annotations via static anchors could help in broader multi-sensor fusion problems beyond driving.

Load-bearing premise

A shared canonical Gaussian set per agent maintains consistent appearance across temporally offset observations, and the static anchor-based pose correction aligns annotations without introducing new errors that would contaminate dynamic reconstruction.

What would settle it

A direct measurement of photometric loss versus the product of agent velocity and time offset in single-timeline models, or a comparison showing whether ghosting remains after applying the decoupled method to highly asynchronous sequences.
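The first of those measurements is cheap to prototype on synthetic data. The toy below (a 1D image, a Gaussian-blob agent, and a single shared pose placed at the midpoint of the two capture times) is entirely our construction, not the paper's protocol; if the claimed theorem holds, the printed loss should grow roughly as (v·Δt)² for small offsets:

import numpy as np

def render(center, xs, sigma=2.0):
    # 1D "image": a Gaussian blob agent centered at `center`.
    return np.exp(-0.5 * ((xs - center) / sigma) ** 2)

xs = np.linspace(-20.0, 20.0, 400)
for v in (0.5, 1.0, 2.0):                  # agent velocity
    for dt in (0.25, 0.5, 1.0):            # cross-source time offset
        obs_a = render(0.0, xs)            # vehicle view at t = 0
        obs_b = render(v * dt, xs)         # infrastructure view at t = dt
        shared = render(v * dt / 2.0, xs)  # best single-timeline placement: midpoint
        loss = np.mean((shared - obs_a) ** 2 + (shared - obs_b) ** 2)
        print(f"v={v:.2f}  dt={dt:.2f}  v*dt={v * dt:.2f}  loss={loss:.6f}")

No optimizer choice can remove that residual without a second timeline, which is the sense in which the loss floor is "irreducible."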

Figures

Figures reproduced from arXiv: 2605.07910 by Haoyu Zhang, Jianping Wang, Kai Wang, Lewei Xie, Xiaoyun Dong, Xinke Li, Yifan Zhang, Yulong Chen, Zongxian Yang.

Figure 1
Figure 1: Comparison of scene representations. Cooperative Input: vehicle and infrastructure sensors capture dynamic scenes asynchronously (top: the black car has passed the pole; bottom: it has not). Classical Representation: aligning these asynchronous observations to one timeline creates conflicting gradients, leading to severe ghosting. Proposed Representation: Our DUST-GSG assigns separate pose timelines to each… view at source ↗
Figure 2
Figure 2: Overview of DUST. DUST first regenerates cooperative labels offline using co-visible static vehicles as anchors. The pose correction initializes a Decoupled Spatio-Temporal Gaussian Scene Graph, where each dynamic agent shares canonical Gaussians but maintains separate vehicle-side and infrastructure-side pose timelines. Joint optimization then refines Gaussian parameters and dual-timeline poses with pose r… view at source ↗
Figure 3
Figure 3: Qualitative comparison on dynamic agents under asynchronous observations. Compared to baselines, our method DUST achieves less ghosting and higher-quality reconstruction of various agents, including vehicles and pedestrians. The insets highlight the details. view at source ↗
Figure 4
Figure 4: Performance comparison between Ours and StreetGS under increasing temporal asynchrony of 0–300 ms (Time Gap). The cumulative growth rate of our method over StreetGS on the specified metrics is illustrated by the red curve. view at source ↗
Figure 5
Figure 5: Qualitative ablation study results across the infrastructure camera from the scene reconstruction task. (a) Our complete model. (b) Without pose initialization before decoupling. (c) Without dual timelines. (d) Without separated poses. (e) Without Ldrift. (f) Without Lsmooth. view at source ↗
Figure 6
Figure 6: Images from both the vehicle and infrastructure perspectives of the dataset. view at source ↗
read the original abstract

Reconstructing dynamic scenes from Vehicle-to-Infrastructure Cooperative Autonomous Driving (VICAD) data is fundamentally complicated by temporal asynchrony: vehicle and infrastructure cameras operate on independent clocks, capturing the same dynamic agent such as cars and pedestrians at different physical times. Existing Gaussian Scene Graph methods implicitly assume synchronized observations and assign a single pose per agent per frame, which is an assumption that breaks in cooperative settings, where the resulting gradient conflicts cause severe ghosting on dynamic agents. We identify this as a representation-level failure, not an optimization artifact: we prove that any single-timeline formulation incurs an irreducible photometric loss scaling quadratically with agent velocity and cross-source time offset. To resolve this, we propose Dust (DecoUpled Spatio-Temporal) Gaussian Scene Graph for 4D Cooperative Driving Reconstruction. DUST Gaussian Scene Graph shares a canonical Gaussian set per agent for appearance consistency, while maintaining decouple pose trajectories aligned to each source's true capture timestamps. We prove that this decoupling enables the pose-gradient kernel block-diagonal, eliminating cross-source interference entirely. To make Dust practical, we further introduce a static anchor-based pose correction pipeline that corrects spatio misalignment between vehicle and infrastructure annotations, and a pose-regularized joint optimization scheme that prevents trajectory jitter and drift during early training. On 26 sequences from V2X-Seq, DUST achieves state-of-the-art performance, improving dynamic-area PSNR by 3.2 dB over the strongest baseline and reducing Fr\'echet Video Distance by 37.7%, with keeping robustness under larger temporal asynchrony.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper claims that single-timeline Gaussian scene graph methods for 4D scene reconstruction suffer from an irreducible quadratic photometric loss due to temporal asynchrony between vehicle and infrastructure sensors in V2X data. It proves that decoupling spatio-temporal pose trajectories per source (while sharing a canonical Gaussian set per agent) produces block-diagonal pose gradients that eliminate cross-source interference. Practical contributions include a static anchor-based pose correction pipeline and pose-regularized joint optimization. On 26 V2X-Seq sequences, DUST reports SOTA results with +3.2 dB dynamic-area PSNR and -37.7% Fréchet Video Distance over baselines, with claimed robustness to larger asynchrony.

Significance. If the proofs hold and the shared canonical Gaussian assumption is valid under real data, this addresses a fundamental representation issue in cooperative driving reconstruction where asynchrony is unavoidable. The block-diagonal gradient property offers a clean structural solution rather than an optimization fix, which could generalize to other multi-source dynamic reconstruction tasks. The empirical gains on a relevant dataset are substantial and suggest practical value for VICAD applications, though significance hinges on verifying the derivations and the absence of confounding errors from pose correction.

major comments (3)
  1. [Abstract and theoretical analysis] The claim that 'any single-timeline formulation incurs an irreducible photometric loss scaling quadratically with agent velocity and cross-source time offset' is load-bearing for the motivation, yet the derivation is presented without explicit steps or assumptions about the rendering model (e.g., how velocity enters the photometric integral); this must be expanded to confirm it is not an artifact of specific Gaussian splatting approximations.
  2. [Method section on decoupling, §3.1 likely] The decoupling is introduced as producing block-diagonal gradients 'by construction,' but the paper must clarify whether the shared canonical Gaussian parameters (optimized jointly) can still induce effective cross-source coupling through appearance gradients when observations are temporally offset, undermining the claim of 'eliminating cross-source interference entirely.'
  3. [Pose correction pipeline and experiments] The static anchor-based correction is asserted to align annotations without new spatial errors, but no quantitative residual error analysis or ablation is provided on how misalignment propagates to dynamic Gaussian optimization; if residuals exceed a threshold, the block-diagonal benefit cannot be isolated from the reported PSNR/FVD gains.
minor comments (3)
  1. [Abstract] 'with keeping robustness' is grammatically unclear and should be rephrased to 'while maintaining robustness'.
  2. [Abstract] Use the standard 'Fréchet Video Distance' spelling instead of 'Fr´echet'.
  3. [Experiments] The paper should include an ablation isolating the contribution of the pose-regularized optimization versus the decoupling itself, as the current results bundle them.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment point by point below, providing clarifications from the manuscript and indicating planned revisions where appropriate to improve clarity and rigor.

read point-by-point responses
  1. Referee: Abstract and theoretical analysis: The claim that 'any single-timeline formulation incurs an irreducible photometric loss scaling quadratically with agent velocity and cross-source time offset' is load-bearing for the motivation, yet the derivation is presented without explicit steps or assumptions about the rendering model (e.g., how velocity enters the photometric integral); this must be expanded to confirm it is not an artifact of specific Gaussian splatting approximations.

    Authors: We agree that explicit expansion of the derivation is needed for full transparency. Section 3.2 derives the quadratic photometric loss by considering the integral of squared intensity differences between rendered and observed pixels under timestamp misalignment. The velocity term enters through a first-order Taylor expansion of the agent's 3D trajectory between the two sources' capture times, leading to a displacement proportional to velocity times Δt; squaring this displacement in the photometric error yields the quadratic scaling. The derivation assumes a standard differentiable rendering model with Gaussian primitives and linear motion between frames. To address the concern, we will expand the proof with all intermediate mathematical steps, list the rendering assumptions explicitly, and add a remark on its applicability beyond Gaussian splatting specifics. revision: yes

  2. Referee: Method section on decoupling: The decoupling is introduced as producing block-diagonal gradients 'by construction,' but the paper must clarify whether the shared canonical Gaussian parameters (optimized jointly) can still induce effective cross-source coupling through appearance gradients when observations are temporally offset, undermining the 'eliminating cross-source interference entirely' claim.

    Authors: The canonical Gaussians per agent encode intrinsic, view-independent appearance and are shared to enforce consistency. However, because each source renders using its own decoupled pose trajectory evaluated at its true capture timestamp, the pose gradients remain block-diagonal by construction (as stated in Theorem 1). Appearance gradients are aggregated across sources but flow only to the shared parameters; they do not back-propagate to couple the per-source pose variables. This separation ensures no cross-source pose interference, and it can be verified mechanically (see the gradient-flow sketch after these responses). We will insert a clarifying paragraph in §3.1 that explicitly distinguishes the gradient flows for poses versus appearance parameters. revision: partial

  3. Referee: Pose correction pipeline and experiments: The static anchor-based correction is asserted to align annotations without new spatial errors, but no quantitative residual error analysis or ablation is provided on how misalignment propagates to dynamic Gaussian optimization; if residuals exceed a threshold, the block-diagonal benefit cannot be isolated from the reported PSNR/FVD gains.

    Authors: We acknowledge the absence of this quantitative analysis in the submitted version. The static anchor pipeline uses fixed scene landmarks to estimate and correct relative pose offsets between sources, but we did not report residual error statistics or controlled ablations. In the revision we will add a new experimental subsection that (i) quantifies post-correction residual pose errors against available ground-truth alignments, (ii) performs an ablation injecting synthetic misalignments of varying magnitude (a rough shape for this injection step is sketched below), and (iii) reports the resulting dynamic-area PSNR and FVD to confirm that the reported gains are driven by the decoupling rather than correction effects. revision: yes
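The gradient-flow separation asserted in response 2 can be checked mechanically. A toy sketch with PyTorch autograd, using a stand-in renderer rather than anything from the paper:

import torch

pose_vehicle = torch.zeros(6, requires_grad=True)  # vehicle-side pose trajectory params
pose_infra = torch.zeros(6, requires_grad=True)    # infrastructure-side pose params
canonical = torch.randn(16, requires_grad=True)    # shared canonical Gaussian params

def render(pose):
    # Stand-in differentiable renderer mixing the shared appearance with one pose.
    return (canonical.sum() + pose.sum()) ** 2

loss_vehicle = render(pose_vehicle)  # vehicle-side photometric loss
g_pose_v, g_pose_i, g_canon = torch.autograd.grad(
    loss_vehicle, [pose_vehicle, pose_infra, canonical], allow_unused=True)

print(g_pose_i)             # None: no path from the vehicle loss to the infra pose,
                            # i.e. the pose-gradient kernel is block-diagonal by construction.
print(g_canon is not None)  # True: appearance gradients still aggregate across sources.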
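The synthetic-misalignment ablation promised in response 3 could take roughly the following shape; the SE(3) perturbation parameterization here is our assumption, not a protocol from the paper:

import numpy as np
from scipy.spatial.transform import Rotation

def perturb_pose(T, rot_deg, trans_m, rng):
    # Compose a small random rotation/translation onto a 4x4 annotation pose.
    noise = np.eye(4)
    noise[:3, :3] = Rotation.from_rotvec(
        rng.normal(0.0, np.deg2rad(rot_deg), 3)).as_matrix()
    noise[:3, 3] = rng.normal(0.0, trans_m, 3)
    return noise @ T

rng = np.random.default_rng(0)
T_annotation = np.eye(4)  # placeholder for an anchor-corrected infra-side pose
for rot_deg, trans_m in [(0.1, 0.01), (0.5, 0.05), (1.0, 0.10)]:
    T_noisy = perturb_pose(T_annotation, rot_deg, trans_m, rng)
    # Re-run optimization with T_noisy in place of the corrected annotation and
    # record dynamic-area PSNR / FVD against the injected magnitude to locate
    # the residual-error threshold the referee asks about.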

Circularity Check

1 step flagged

Decoupling produces block-diagonal gradients by construction; quadratic loss presented as general property

specific steps
  1. self-definitional [Abstract]
    "We prove that this decoupling enables the pose-gradient kernel block-diagonal, eliminating cross-source interference entirely."

    DUST is defined to maintain decoupled pose trajectories aligned to each source's true capture timestamps while sharing a canonical Gaussian set. The block-diagonal kernel property is an immediate algebraic consequence of this parameter separation (no cross terms between sources), so the 'proof' reduces to restating the decoupling definition rather than deriving a non-trivial result.

full rationale

The paper introduces decoupled pose trajectories as an independent structural change to the Gaussian Scene Graph. The claimed proof that this enables a block-diagonal pose-gradient kernel follows directly from the separation of pose parameters by source timestamp, rendering the elimination of cross-source interference tautological with the model definition. The quadratic photometric loss scaling for single-timeline formulations is derived as a general consequence of the shared-pose assumption rather than fitted from the paper's data or self-citations. No load-bearing self-citations, ansatzes, or uniqueness theorems are invoked. This is a minor presentational circularity with the central reconstruction claims remaining independently testable on V2X-Seq.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 1 invented entity

The method rests on standard Gaussian splatting rendering assumptions plus new domain assumptions about appearance consistency across time offsets and the effectiveness of anchor-based correction. One optimization hyperparameter for trajectory regularization is introduced but not quantified in the abstract.

free parameters (1)
  • pose regularization weight
    Added to prevent jitter and drift during early joint optimization; its specific value is chosen during training and not reported in the abstract (a common form for the regularizers it weights is sketched after this ledger).
axioms (2)
  • domain assumption · Any single-timeline Gaussian scene graph formulation incurs photometric loss that scales quadratically with agent velocity and cross-source time offset
    Presented as a proven general property that motivates the need for decoupling.
  • domain assumption · Appearance of each dynamic agent can be captured by a single canonical Gaussian set that remains consistent across temporally offset observations
    Core premise enabling the shared appearance model while poses are decoupled.
invented entities (1)
  • Decoupled spatio-temporal pose trajectories · no independent evidence
    purpose: To provide independent position paths aligned to each camera source's true capture timestamps
    New structural component introduced to eliminate cross-source gradient interference.
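The ledger's lone free parameter weights the pose regularizers that the ablation in Figure 5 labels Ldrift and Lsmooth. The review does not spell them out; a common form such terms take, assumed here rather than quoted from the paper:

\mathcal{L}_{\text{smooth}} = \sum_{t} \big\| \mathbf{p}_{t+1} - 2\,\mathbf{p}_{t} + \mathbf{p}_{t-1} \big\|^{2},
\qquad
\mathcal{L}_{\text{drift}} = \sum_{t} \big\| \mathbf{p}_{t} - \hat{\mathbf{p}}_{t} \big\|^{2},

where p_t is an optimized per-source pose at timestamp t and \hat{p}_t its anchor-corrected initialization; the weight trades these terms against the photometric loss during early training.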

pith-pipeline@v0.9.0 · 5617 in / 1726 out tokens · 80112 ms · 2026-05-12T03:40:51.940225+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · 1 internal anchor

  1. [1]

    Hexplane: A fast representation for dynamic scenes

    Ang Cao and Justin Johnson. Hexplane: A fast representation for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 130–141, 2023

  2. [2]

    Periodic vibration gaussian: Dynamic urban scene reconstruction and real-time rendering

    Yurui Chen, Chun Gu, Junzhe Jiang, Xiatian Zhu, and Li Zhang. Periodic vibration gaussian: Dynamic urban scene reconstruction and real-time rendering. International Journal of Computer Vision, 134(3):83, 2026

  3. [3]

    Omnire: Omni urban scene reconstruction

    Ziyu Chen, Jiawei Yang, Jiahui Huang, Riccardo De Lutio, Janick Martinez Esturo, Boris Ivanovic, Or Litany, Zan Gojcic, Sanja Fidler, Marco Pavone, et al. Omnire: Omni urban scene reconstruction. In International Conference on Learning Representations, volume 2025, pages 85508–85527, 2025

  4. [4]

    K-planes: Explicit radiance fields in space, time, and appearance

    Sara Fridovich-Keil, Giacomo Meanti, Frederik Rahbæk Warburg, Benjamin Recht, and Angjoo Kanazawa. K-planes: Explicit radiance fields in space, time, and appearance. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12479–12488, 2023

  5. [5]

    Dynamic view synthesis from dynamic monocular video

    Chen Gao, Ayush Saraf, Johannes Kopf, and Jia-Bin Huang. Dynamic view synthesis from dynamic monocular video. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5712–5721, 2021

  6. [6]

    Where2comm: Communication-efficient collaborative perception via spatial confidence maps

    Yue Hu, Shaoheng Fang, Zixing Lei, Yiqi Zhong, and Siheng Chen. Where2comm: Communication-efficient collaborative perception via spatial confidence maps. Advances in Neural Information Processing Systems, 35:4874–4886, 2022

  7. [7]

    3d gaussian splatting for real-time radiance field rendering

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, George Drettakis, et al. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1, 2023

  8. [8]

    The hungarian method for the assignment problem

    H. W. Kuhn. The hungarian method for the assignment problem. Naval Research Logistics Quarterly, pages 83–97, 1955

  9. [9]

    Learning distilled collaboration graph for multi-agent perception

    Yiming Li, Shunli Ren, Pengxiang Wu, Siheng Chen, Chen Feng, and Wenjun Zhang. Learning distilled collaboration graph for multi-agent perception. Advances in Neural Information Processing Systems, 34:29541–29552, 2021

  10. [10]

    Neural scene flow fields for space-time view synthesis of dynamic scenes

    Zhengqi Li, Simon Niklaus, Noah Snavely, and Oliver Wang. Neural scene flow fields for space-time view synthesis of dynamic scenes. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6498–6508, 2021

  11. [11]

    Vi-planning: Infrastructure-assisted real-time planning optimization for autonomous driving

    Yang Lu, Jie Wang, Xiaoyun Dong, Ziyao Huang, Bingyi Liu, Jen-Ming Wu, and Jianping Wang. Vi-planning: Infrastructure-assisted real-time planning optimization for autonomous driving. In Proceedings of the 31st Annual International Conference on Mobile Computing and Networking, pages 923–937, 2025

  12. [12]

    Robust collaborative 3d object detection in presence of pose errors

    Yifan Lu, Quanhao Li, Baoan Liu, Mehrdad Dianati, Chen Feng, Siheng Chen, and Yanfeng Wang. Robust collaborative 3d object detection in presence of pose errors. arXiv preprint arXiv:2211.07214, 2022

  13. [13]

    D-nerf: Neural radiance fields for dynamic scenes

    Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. D-nerf: Neural radiance fields for dynamic scenes. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10318–10327, 2021

  14. [14]

    Limited memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) method for the parameter estimation on geographically weighted ordinal logistic regression model (GWOLR)

    Dewi Retno Sari Saputro and Purnami Widyaningsih. Limited memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) method for the parameter estimation on geographically weighted ordinal logistic regression model (GWOLR). In AIP Conference Proceedings, volume 1868, page 040009. AIP Publishing LLC, 2017

  15. [15]

    Vips: Real-time perception fusion for infrastructure-assisted autonomous driving

    Shuyao Shi, Jiahe Cui, Zhehao Jiang, Zhenyu Yan, Guoliang Xing, Jianwei Niu, and Zhenchao Ouyang. Vips: Real-time perception fusion for infrastructure-assisted autonomous driving. In Proceedings of the 28th annual international conference on mobile computing and networking, pages 133–146, 2022

  16. [16]

    Animating rotation with quaternion curves

    Ken Shoemake. Animating rotation with quaternion curves. In Proceedings of the 12th annual conference on Computer graphics and interactive techniques, pages 245–254, 1985

  17. [17]

    Raft: Recurrent all-pairs field transforms for optical flow

    Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. In European conference on computer vision, pages 402–419. Springer, 2020

  18. [18]

    Towards Accurate Generative Models of Video: A New Metric & Challenges

    Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018

  19. [19]

    Image quality assessment: from error visibility to structural similarity

    Zhou Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004

  20. [20]

    4d gaussian splatting for real-time dynamic scene rendering

    Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Xinggang Wang. 4d gaussian splatting for real-time dynamic scene rendering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20310–20320, 2024

  21. [21]

    V2x-real: a large-scale dataset for vehicle-to-everything cooperative perception

    Hao Xiang, Zhaoliang Zheng, Xin Xia, Runsheng Xu, Letian Gao, Zewei Zhou, Xu Han, Xinkai Ji, Mingxi Li, Zonglin Meng, et al. V2x-real: a large-scale dataset for vehicle-to-everything cooperative perception. In European Conference on Computer Vision, pages 455–470. Springer, 2024

  22. [22]

    Cruise: Cooperative reconstruction and editing in v2x scenarios using gaussian splatting

    Haoran Xu, Saining Zhang, Peishuo Li, Baijun Ye, Xiaoxue Chen, Huan-ang Gao, Jv Zheng, Xiaowei Song, Ziqiao Peng, Run Miao, et al. Cruise: Cooperative reconstruction and editing in v2x scenarios using gaussian splatting. In 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 12518–12525. IEEE, 2025

  23. [23]

    V2x-vit: Vehicle-to-everything cooperative perception with vision transformer

    Runsheng Xu, Hao Xiang, Zhengzhong Tu, Xin Xia, Ming-Hsuan Yang, and Jiaqi Ma. V2x-vit: Vehicle-to-everything cooperative perception with vision transformer. In European conference on computer vision, pages 107–124. Springer, 2022

  24. [24]

    Opv2v: An open benchmark dataset and fusion pipeline for perception with vehicle-to-vehicle communication

    Runsheng Xu, Hao Xiang, Xin Xia, Xu Han, Jinlong Li, and Jiaqi Ma. Opv2v: An open benchmark dataset and fusion pipeline for perception with vehicle-to-vehicle communication. In 2022 International Conference on Robotics and Automation (ICRA), pages 2583–2589, 2022

  25. [25]

    Instinct: Instance-level interaction architecture for query-based collaborative perception

    Yunjiang Xu, Lingzhi Li, Jin Wang, Yupeng Ouyang, and Benyuan Yang. Instinct: Instance-level interaction architecture for query-based collaborative perception. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 25464–25473, 2025

  26. [26]

    Street gaussians: Modeling dynamic urban scenes with gaussian splatting

    Yunzhi Yan, Haotong Lin, Chenxu Zhou, Weijie Wang, Haiyang Sun, Kun Zhan, Xianpeng Lang, Xiaowei Zhou, and Sida Peng. Street gaussians: Modeling dynamic urban scenes with gaussian splatting. In European Conference on Computer Vision, pages 156–173. Springer, 2024

  27. [27]

    Emernerf: Emergent spatial-temporal scene decomposition via self-supervision

    Jiawei Yang, Boris Ivanovic, Or Litany, Xinshuo Weng, Seung Wook Kim, Boyi Li, Tong Che, Danfei Xu, Sanja Fidler, Marco Pavone, et al. Emernerf: Emergent spatial-temporal scene decomposition via self-supervision. arXiv preprint arXiv:2311.02077, 2023

  28. [28]

    Bevheight: A robust framework for vision-based roadside 3d object detection

    Lei Yang, Kaicheng Yu, Tao Tang, Jun Li, Kun Yuan, Li Wang, Xinyu Zhang, and Peng Chen. Bevheight: A robust framework for vision-based roadside 3d object detection. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), March 2023

  29. [29]

    Deformable 3d gaussians for high-fidelity monocular dynamic scene reconstruction

    Ziyi Yang, Xinyu Gao, Wen Zhou, Shaohui Jiao, Yuqing Zhang, and Xiaogang Jin. Deformable 3d gaussians for high-fidelity monocular dynamic scene reconstruction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20331–20341, 2024

  30. [30]

    Dair-v2x: A large-scale dataset for vehicle-infrastructure cooperative 3d object detection

    Haibao Yu, Yizhen Luo, Mao Shu, Yiyi Huo, Zebang Yang, Yifeng Shi, Zhenglong Guo, Hanyu Li, Xing Hu, Jirui Yuan, and Zaiqing Nie. Dair-v2x: A large-scale dataset for vehicle-infrastructure cooperative 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21361–21370, 2022

  31. [31]

    V2x-seq: A large-scale sequential dataset for vehicle-infrastructure cooperative perception and forecasting

    Haibao Yu, Wenxian Yang, Hongzhi Ruan, Zhenwei Yang, Yingjuan Tang, Xu Gao, Xin Hao, Yifeng Shi, Yifeng Pan, Ning Sun, Juan Song, Jirui Yuan, Ping Luo, and Zaiqing Nie. V2x-seq: A large-scale sequential dataset for vehicle-infrastructure cooperative perception and forecasting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

  32. [32]

    End-to-end autonomous driving through v2x cooperation

    Haibao Yu, Wenxian Yang, Jiaru Zhong, Zhenwei Yang, Siqi Fan, Ping Luo, and Zaiqing Nie. End-to-end autonomous driving through v2x cooperation. In The 39th Annual AAAI Conference on Artificial Intelligence, 2025

  33. [33]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018

  34. [34]

    Drivinggaussian: Composite gaussian splatting for surrounding dynamic autonomous driving scenes

    Xiaoyu Zhou, Zhiwei Lin, Xiaojun Shan, Yongtao Wang, Deqing Sun, and Ming-Hsuan Yang. Drivinggaussian: Composite gaussian splatting for surrounding dynamic autonomous driving scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21634–21643, 2024

  35. [35]

    On the continuity of rotation representations in neural networks

    Yi Zhou, Connelly Barnes, Jingwan Lu, Jimei Yang, and Hao Li. On the continuity of rotation representations in neural networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5745–5753, 2019