GeoFlow: Enforcing Implicit Geometric Consistency in Video Generation
Pith reviewed 2026-05-20 11:02 UTC · model grok-4.3
The pith
A geometry-consistency reward makes scene geometry an explicit optimization target for text-to-video models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce a geometry-consistency reward that directly measures whether motion in a generated video is compatible with a coherent scene. Our key insight is that in physically consistent videos, background motion should be explainable by rigid camera-induced flow, while independently moving objects should preserve appearance identity along motion trajectories. We operationalize this using optical flow, depth–pose predictions, and feature-based correspondence to separate rigid and dynamic regions and evaluate their respective consistency. Integrating this reward with reinforcement fine-tuning transforms geometric consistency from an emergent property into an explicit optimization objective.
What carries the argument
The geometry-consistency reward that separates rigid background regions from dynamic objects and scores each for physical plausibility using optical flow, depth-pose predictions, and feature-based correspondence.
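To make the "rigid camera-induced flow" idea concrete, here is a minimal NumPy sketch under stated assumptions: a pinhole camera with known intrinsics `K`, a per-pixel depth map for the first frame, and a relative pose `(R, t)` between two frames. The function names `rigid_flow` and `endpoint_error` are hypothetical, and this is not claimed to be the paper's implementation. It only illustrates how depth and pose predict the flow a static background should exhibit, which can then be compared against an observed optical flow.

```python
import numpy as np

def rigid_flow(depth, K, R, t):
    """Pixel displacement a static scene would show under camera motion alone.

    depth: (H, W) depth of frame 1 (assumed positive).
    K:     (3, 3) pinhole intrinsics (assumed known).
    R, t:  rotation (3, 3) and translation (3,) mapping frame-1 camera
           coordinates to frame-2 camera coordinates.
    Returns an (H, W, 2) array of (dx, dy) displacements.
    """
    H, W = depth.shape
    xs, ys = np.meshgrid(np.arange(W, dtype=float), np.arange(H, dtype=float))
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1)  # homogeneous pixels
    rays = pix @ np.linalg.inv(K).T                      # back-project to z = 1
    pts1 = rays * depth[..., None]                       # 3D points, camera 1
    pts2 = pts1 @ R.T + t                                # same points, camera 2
    proj = pts2 @ K.T                                    # project into frame 2
    uv2 = proj[..., :2] / proj[..., 2:3]
    return uv2 - pix[..., :2]

def endpoint_error(flow_a, flow_b):
    """Per-pixel Euclidean distance between two flow fields."""
    return np.linalg.norm(flow_a - flow_b, axis=-1)

# Toy check: fronto-parallel plane at depth 2, camera translated along x.
K = np.array([[100.0, 0.0, 2.0], [0.0, 100.0, 2.0], [0.0, 0.0, 1.0]])
depth = np.full((4, 4), 2.0)
flow = rigid_flow(depth, K, np.eye(3), np.array([0.1, 0.0, 0.0]))
# Every pixel shifts by fx * tx / Z = 100 * 0.1 / 2 = 5 px in x.
```

A large endpoint error between this predicted flow and the observed flow at a pixel suggests either an independently moving object or a geometric artifact, which is why the summary describes a separate treatment for dynamic regions.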
If this is right
- Generated videos exhibit fewer temporal geometric artifacts including object stretching and texture drift under camera motion.
- The method applies to diverse scenes containing both camera movement and independently moving objects.
- The reward can be added to existing video generators without altering their core architecture.
- Perceptual quality metrics remain comparable to the original models after fine-tuning.
Where Pith is reading between the lines
- The same separation of rigid and dynamic regions could be reused to create evaluation benchmarks that specifically target 3D consistency rather than just pixel-level similarity.
- Combining this reward with other explicit signals such as lighting or material consistency might produce videos that obey multiple physical constraints at once.
- Because the reward is model-agnostic, it could be applied during continued training of future larger-scale video generators to maintain consistency as model capacity grows.
Load-bearing premise
The approach assumes that optical flow, depth-pose predictions, and feature-based correspondence can reliably separate rigid background regions from dynamic objects and accurately evaluate their respective consistency.
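The premise can be illustrated with a small NumPy sketch. It assumes that the rigid/dynamic split is obtained by thresholding the residual between observed and camera-induced flow, and that appearance identity is scored as cosine similarity between per-pixel features matched along the flow. The threshold `tau`, the nearest-neighbour sampling, and both function names are assumptions made for illustration; the summary above does not specify these details.

```python
import numpy as np

def split_rigid_dynamic(observed_flow, camera_flow, tau=1.0):
    """Label pixels whose flow is not explained by camera motion as dynamic.

    tau is a hypothetical residual threshold in pixels.
    Returns (rigid_mask, dynamic_mask), both boolean (H, W) arrays.
    """
    residual = np.linalg.norm(observed_flow - camera_flow, axis=-1)
    dynamic = residual > tau
    return ~dynamic, dynamic

def appearance_consistency(feat_a, feat_b, flow, mask):
    """Mean cosine similarity between features at u in frame a and at
    u + flow(u) in frame b, over the pixels selected by mask.

    feat_a, feat_b: (H, W, C) feature maps; flow: (H, W, 2) as (dx, dy).
    Targets are rounded to the nearest pixel and clamped to the image.
    """
    H, W, _ = feat_a.shape
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return float("nan")
    xt = np.clip(np.rint(xs + flow[ys, xs, 0]).astype(int), 0, W - 1)
    yt = np.clip(np.rint(ys + flow[ys, xs, 1]).astype(int), 0, H - 1)
    a, b = feat_a[ys, xs], feat_b[yt, xt]
    denom = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + 1e-8
    return float(((a * b).sum(axis=-1) / denom).mean())
```

If either the flow estimate or the feature extractor degrades on generated footage, both the mask and the similarity score degrade with it, which is exactly the dependency the premise names.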
What would settle it
If fine-tuning a video model with the reward produced no measurable reduction in geometric artifacts, such as object deformation or background warping, on held-out test videos, the reward would not be achieving its claimed effect.
Original abstract
Generating geometrically consistent videos remains an open challenge: text-to-video diffusion models trained on web-scale data treat geometry only implicitly, leading to object deformation, texture drift, and non-rigid backgrounds under camera motion. Existing solutions either improve consistency as a byproduct, apply only to static scenes, or realign the latent space of the model completely. We introduce a geometry-consistency reward that directly measures whether motion in a generated video is compatible with a coherent scene. Our key insight is that in physically consistent videos, background motion should be explainable by rigid camera-induced flow, while independently moving objects should preserve appearance identity along motion trajectories. We operationalize this using optical flow, depth–pose predictions, and feature-based correspondence to separate rigid and dynamic regions and evaluate their respective consistency. Integrating this reward with reinforcement fine-tuning transforms geometric consistency from an emergent property into an explicit optimization objective for video generators. The approach is model agnostic and applies to diverse dynamic scenes containing both camera and object motion. Experiments show substantial reductions in temporal geometric artifacts over strong baselines while preserving perceptual quality. Code and model weights are published.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GeoFlow, a geometry-consistency reward for text-to-video diffusion models. The reward operationalizes physical consistency by using optical flow, depth-pose predictions, and feature-based correspondence to separate rigid background motion (explainable by camera pose) from dynamic object motion (appearance-preserving trajectories). This reward is combined with reinforcement fine-tuning to make geometric consistency an explicit optimization target rather than an emergent property. The method is presented as model-agnostic and applicable to dynamic scenes with both camera and object motion. Experiments are claimed to show substantial reductions in temporal geometric artifacts (object deformation, texture drift, non-rigid backgrounds) while preserving perceptual quality, with code and model weights released.
Significance. If the results hold, the work provides a concrete mechanism for injecting geometric priors into generative video models via RL, moving beyond implicit learning from web data. The reliance on off-the-shelf CV modules for the reward is a strength, as it avoids self-referential or fitted parameters and enables falsifiable evaluation. Releasing code and weights supports reproducibility. The approach could influence downstream applications requiring reliable 3D-consistent video, such as simulation or robotics, provided the reward signal proves robust.
major comments (2)
- [§3] §3 (Reward Formulation): The central claim requires that the geometry reward accurately scores consistency even when inputs contain the very artifacts it targets. Because the optical flow, depth-pose, and correspondence networks are pre-trained on real footage, their behavior on diffusion outputs with deformations and drift must be validated; otherwise the RL stage may optimize against predictor noise rather than scene geometry. The manuscript should include a controlled study (e.g., injecting known geometric errors and measuring reward correlation) to establish that the proxy remains informative.
- [§4] §4 (Experiments): The abstract asserts 'substantial reductions in temporal geometric artifacts' and preservation of perceptual quality, yet the strength of this claim depends on the specific metrics, baselines, and ablations reported. Quantitative results comparing against prior consistency methods, together with ablations isolating the rigid/dynamic separation and the RL reward weighting, are needed to confirm that improvements are attributable to the proposed objective rather than other factors.
minor comments (2)
- [§3] Notation for the combined reward (rigid flow term + dynamic trajectory term) should be introduced with an explicit equation early in §3 to aid readability.
- [Figures] Figure captions describing qualitative results should explicitly label which artifacts are reduced (e.g., 'non-rigid background warping') for direct comparison with the claims.
Simulated Author's Rebuttal
We thank the referee for their insightful comments and constructive feedback on the manuscript. We address each major comment below and outline the revisions planned for the next version.
- Referee: [§3] §3 (Reward Formulation): The central claim requires that the geometry reward accurately scores consistency even when inputs contain the very artifacts it targets. Because the optical flow, depth-pose, and correspondence networks are pre-trained on real footage, their behavior on diffusion outputs with deformations and drift must be validated; otherwise the RL stage may optimize against predictor noise rather than scene geometry. The manuscript should include a controlled study (e.g., injecting known geometric errors and measuring reward correlation) to establish that the proxy remains informative.
  Authors: We agree that validating the reward's behavior on artifact-containing diffusion outputs is essential to ensure the RL stage optimizes for genuine geometric consistency rather than noise from the pre-trained predictors. In the revised manuscript we will add a controlled study: we will start from real videos, synthetically inject graded geometric errors (controlled object deformations, texture drifts, and non-rigid background motion), and report the Pearson correlation between the resulting reward scores and the magnitude of the injected errors. We will also quantify the accuracy drop of the off-the-shelf optical-flow, depth-pose, and correspondence modules when evaluated directly on generated samples.
  Revision: yes
- Referee: [§4] §4 (Experiments): The abstract asserts 'substantial reductions in temporal geometric artifacts' and preservation of perceptual quality, yet the strength of this claim depends on the specific metrics, baselines, and ablations reported. Quantitative results comparing against prior consistency methods, together with ablations isolating the rigid/dynamic separation and the RL reward weighting, are needed to confirm that improvements are attributable to the proposed objective rather than other factors.
  Authors: We acknowledge that stronger quantitative support and targeted ablations are needed to substantiate the claims. In the revised experiments section we will add direct comparisons against prior consistency-enhancement methods, reporting numerical scores on geometric-consistency metrics (rigid-region flow error, appearance-preservation along trajectories) and perceptual-quality metrics (CLIP similarity, FID, and a small-scale user study). We will also include ablations that separately disable the rigid/dynamic separation and vary the RL reward weighting, thereby isolating the contribution of the proposed objective.
  Revision: yes
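The controlled study proposed in the first response (graded injected errors, then a Pearson correlation against reward scores) could be prototyped roughly as follows. This is a sketch only: `severity` is a hypothetical grid of injected-error magnitudes, and `toy_reward` is a deterministic stand-in for the real reward, used so the snippet runs on its own. It is not data or a result from the paper.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

# Hypothetical severity levels of the injected geometric error (0 = clean).
severity = np.array([0.0, 0.25, 0.5, 0.75, 1.0])

# Stand-in for "run the reward on each corrupted video". A real study would
# replace this line with actual reward evaluations; the quadratic form is an
# arbitrary placeholder chosen only to be monotone in severity.
toy_reward = -(severity ** 2)

r = pearson_r(severity, toy_reward)
# An informative proxy should give a strongly negative r:
# more injected error, lower reward.
```

Because Pearson correlation only captures linear association, a rank correlation may be a useful companion statistic if the reward saturates at high error levels.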
Circularity Check
No significant circularity; reward uses external CV modules
The paper's central construction defines a geometry-consistency reward by applying off-the-shelf optical flow, depth-pose, and feature correspondence networks to separate rigid background motion from dynamic objects and score appearance preservation along trajectories. These modules are pre-trained external components whose outputs are treated as measurements of physical consistency; the reward is then used as an RL objective. No equation or step reduces the reward value to a fitted parameter of the generator, a self-definition, or a self-citation chain. The derivation therefore remains self-contained against independent computer-vision benchmarks rather than being forced by construction from the video model's own outputs or prior author work.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: In physically consistent videos, background motion should be explainable by rigid camera-induced flow while independently moving objects preserve appearance identity along motion trajectories.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AlexanderDuality.lean: alexander_duality_circle_linking (echoes?)
  Echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: background motion should be explainable by rigid camera-induced flow, while independently moving objects should preserve appearance identity along motion trajectories... using optical flow, depth–pose predictions, and feature-based correspondence to separate rigid and dynamic regions
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (refines?)
  Refines: relation between the paper passage and the cited Recognition theorem.
  Passage: R_geo = 1/|Ω| Σ Q_geo(u) − 1 ... Q_geo(u) = (1 − min(Ē_epe(u), 1)) · (1 − min(E_depth(u), 1))
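The quoted reward expression can be evaluated directly once per-pixel error maps are available. The sketch below assumes `epe_norm` holds the averaged, normalized endpoint error Ē_epe(u) and `depth_err` holds E_depth(u), both non-negative, and that Ω is the set of pixels selected by an optional boolean mask `valid`. These array names and the masking convention are assumptions; the passage itself elides the surrounding definitions.

```python
import numpy as np

def geometry_reward(epe_norm, depth_err, valid=None):
    """R_geo = mean over Omega of Q_geo(u), minus 1, with
    Q_geo(u) = (1 - min(epe(u), 1)) * (1 - min(depth_err(u), 1)).

    epe_norm, depth_err: (H, W) non-negative error maps (assumed inputs).
    valid: optional boolean (H, W) mask defining Omega; all pixels if None.
    The result lies in [-1, 0]: 0 for perfect agreement, -1 when every
    pixel saturates at least one of the two error terms.
    """
    q = (1.0 - np.minimum(epe_norm, 1.0)) * (1.0 - np.minimum(depth_err, 1.0))
    if valid is not None:
        q = q[valid]
    return float(q.mean() - 1.0)
```

Note that subtracting 1 after averaging is the same as averaging Q_geo(u) − 1, so either reading of the quoted formula gives this value. The product form means a pixel scores well only if both the flow term and the depth term agree, and the `min(·, 1)` clamps bound the penalty any single pixel can contribute.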
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Asim, M., Wewer, C., Wimmer, T., Schiele, B., Lenssen, J.E.: Met3r: Measuring multi-view consistency in generated images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6034–6044 (2025) 12, S2
work page 2025
-
[2]
In: Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition (CVPR)
Bahmani, S., Skorokhodov, I., Qian, G., Siarohin, A., Menapace, W., Tagliasac- chi, A., Lindell, D.B., Tulyakov, S.: AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers. In: Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition (CVPR). pp. 22875–22889 (2025) 9
work page 2025
-
[3]
arXiv preprint arXiv:2407.12781 (2024) 2, 3
Bahmani, S., Skorokhodov, I., Siarohin, A., Menapace, W., Qian, G., Vasilkovsky, M., Lee, H.Y., Wang, C., Zou, J., Tagliasacchi, A., et al.: VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control. arXiv preprint arXiv:2407.12781 (2024) 2, 3
-
[4]
arXiv preprint arXiv:2412.07760 (2024) 3
Bai, J., Xia, M., Wang, X., Yuan, Z., Fu, X., Liu, Z., Hu, H., Wan, P., Zhang, D.: SyncamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints. arXiv preprint arXiv:2412.07760 (2024) 3
-
[5]
arXiv preprint arXiv:2512.03453 (2025) 4, 11
Bai, Y., Fang, S., Yu, C., Wang, F., Huang, Q.: Geovideo: Introducing geometric regularization into video generation model. arXiv preprint arXiv:2512.03453 (2025) 4, 11
-
[6]
1 kontext: Flow matching for in-context image generation and editing in latent space
Batifol, S., Blattmann, A., Boesel, F., Consul, S., Diagne, C., Dockhorn, T., En- glish, J., English, Z., Esser, P., Kulal, S., et al.: Flux. 1 kontext: Flow matching for in-context image generation and editing in latent space. arXiv e-prints pp. arXiv–2506 (2025) 1
work page 2025
-
[7]
In: Proceed- ings of the Computer Vision and Pattern Recognition Conference
Bengtson, J., Nilsson, D., Kahl, F.: Geometric consistency refinement for single im- age novel view synthesis via test-time adaptation of diffusion models. In: Proceed- ings of the Computer Vision and Pattern Recognition Conference. pp. 6399–6408 (2025) 4
work page 2025
-
[8]
Training Diffusion Models with Reinforcement Learning
Black, K., Janner, M., Du, Y., Kostrikov, I., Levine, S.: Training Diffusion Models with Reinforcement Learning. arXiv preprint arXiv:2305.13301 (2023) 4, 8, 12
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[9]
Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., Schnurr, D., Taylor, J., Luhman, T., Luhman, E., Ng, C., Wang, R., Ramesh, A.: Video gener- ation models as world simulators (2024),https://openai.com/research/video- generation-models-as-world-simulators3
work page 2024
-
[10]
Bruce, J., Dennis, M.D., Edwards, A., Parker-Holder, J., Shi, Y., Hughes, E., Lai, M., Mavalankar, A., Steigerwald, R., Apps, C., et al.: Genie: Generative Interac- tive Environments. In: Proceedings of the International Conference on Machine Learning (ICML) (2024) 3 GeoFlow: Enforcing Implicit Geometric Consistency in Video Generation 17
work page 2024
-
[11]
arXiv preprint arXiv:2508.21058 (2025) 3
Cai, S., Yang, C., Zhang, L., Guo, Y., Xiao, J., Yang, Z., Xu, Y., Yang, Z., Yuille, A., Guibas, L., et al.: Mixture of contexts for long video generation. arXiv preprint arXiv:2508.21058 (2025) 3
-
[12]
Cao, C., et al.: MVGenMaster: Scaling Multi-View Generation from Any Image via 3D Priors. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025),https://openaccess.thecvf.com/content/ CVPR2025/papers/Cao_MVGenMaster_Scaling_Multi- View_Generation_from_ Any_Image_via_3D_Priors_CVPR_2025_paper.pdf3
work page 2025
-
[13]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Chan, E.R., Lin, C.Z., Chan, M.A., Nagano, K., Pan, B., De Mello, S., Gallo, O., Guibas, L.J., Tremblay, J., Khamis, S., et al.: Efficient Geometry-Aware 3D Generative Adversarial Networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 16123–16133 (2022) 3
work page 2022
-
[14]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Chan, E.R., Monteiro, M., Kellnhofer, P., Wu, J., Wetzstein, G.: pi-GAN: Pe- riodic Implicit Generative Adversarial Networks for 3D-Aware Image Synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5799–5809 (2021) 3
work page 2021
-
[15]
In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)
Chan,E.R.,Nagano,K.,Chan,M.A.,Bergman,A.W.,Park,J.J.,Levy,A.,Aittala, M., De Mello, S., Karras, T., Wetzstein, G.: Generative Novel View Synthesis with 3D-Aware Diffusion Models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 4217–4229 (2023) 3
work page 2023
-
[16]
Advances in Neural Information Processing Systems37, 24081–24125 (2024) 3
Chen, B., Martí Monsó, D., Du, Y., Simchowitz, M., Tedrake, R., Sitzmann, V.: Diffusion forcing: Next-token prediction meets full-sequence diffusion. Advances in Neural Information Processing Systems37, 24081–24125 (2024) 3
work page 2024
-
[17]
arXiv preprint arXiv:2410.18974 (2024),https://arxiv
Chen, H., Shen, B., Liu, Y., Shi, R., Zhou, L., Lin, C.Z., Gu, J., Su, H., Wetzstein, G., Guibas, L.: 3D-Adapter: Geometry-Consistent Multi-View Diffusion for High- Quality 3D Generation. arXiv preprint arXiv:2410.18974 (2024),https://arxiv. org/abs/2410.189744
-
[18]
Chen, S., Chen, X., Pang, A., Zeng, X., Cheng, W., Fu, Y., Yin, F., Wang, B., Yu, J., Yu, G., et al.: MeshXL: Neural Coordinate Field for Generative 3D Foundation Models.In:AdvancesinNeuralInformationProcessingSystems(NeurIPS).vol.37, pp. 97141–97166 (2024) 2
work page 2024
-
[19]
In: Proceedings of the European Conference on Computer Vision (ECCV)
Chen, Y., Xu, H., Zheng, C., Zhuang, B., Pollefeys, M., Geiger, A., Cham, T.J., Cai, J.: MVSplat: Efficient 3D Gaussian Splatting from Sparse Multi-View Images. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 370–386. Springer (2024) 3
work page 2024
-
[20]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Chen, Z., Tang, J., Dong, Y., Cao, Z., Hong, F., Lan, Y., Wang, T., Xie, H., Wu, T., Saito, S., et al.: 3DTopia-XL: Scaling High-Quality 3D Asset Generation via Primitive Diffusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 26576–26586 (2025) 2
work page 2025
-
[21]
In: Proceedings of the Computer Vision and Pattern Recognition Conference
Chou, G., Zhang, K., Bi, S., Tan, H., Xu, Z., Luan, F., Hariharan, B., Snavely, N.: Generating 3d-consistent videos from unposed internet photos. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 27934–27945 (2025) 3
work page 2025
-
[22]
Du, H., Ye, J., Cong, X., Li, R., Ni, J., Agarwal, A., Zhou, Z., Li, Z., Balestriero, R., Wang, Y.: Videogpa: Distilling geometry priors for 3d-consistent video generation (2026),https://arxiv.org/abs/2601.232864, 11
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[23]
Edelstein, Y., Patashnik, O., Cohen-Bar, D., Zelnik-Manor, L.: Sharp-It: A Multi- view to Multi-view Diffusion Model for 3D Synthesis and Manipulation. In: Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog- nition (CVPR) (2025),https://openaccess.thecvf.com/content/CVPR2025/ 18 J. Ackermann et al. papers/Edelstein_Sharp-It_A_Mu...
work page 2025
-
[24]
In: Advances in Neural Information Processing Systems (NeurIPS)
Gao,J.,Shen,T.,Wang,Z.,Chen,W.,Yin,K.,Li,D.,Litany,O.,Gojcic,Z.,Fidler, S.: GET3D: A Generative Model of High Quality 3D Textured Shapes Learned from Images. In: Advances in Neural Information Processing Systems (NeurIPS). vol. 35, pp. 31841–31854 (2022) 3
work page 2022
-
[25]
CAT3D: Create Anything in 3D with Multi-View Diffusion Models
Gao, R., Holynski, A., Henzler, P., Brussee, A., Martin-Brualla, R., Srinivasan, P., Barron, J.T., Poole, B.: Cat3D: Create Anything in 3D with Multi-View Diffusion Models. arXiv preprint arXiv:2405.10314 (2024) 3
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[26]
GeCo: Evaluating Geometric Consistency for Video Generation via Motion and Structure
Gu, L., Hur, J., Herrmann, C., Zhan, F., Zickler, T., Sun, D., Pfister, H.: Geco: A differentiable geometric consistency metric for video generation. arXiv preprint arXiv:2512.22274 (2025) 4
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[27]
CameraCtrl: Enabling Camera Control for Text-to-Video Generation
He, H., et al.: CameraCtrl: Enabling Camera Control for Text-to-Video Diffusion Models. arXiv preprint arXiv:2404.02101 (2024),https://arxiv.org/abs/2404. 021012, 3
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[28]
CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers
Hong, W., Ding, M., Zheng, W., Liu, X., Tang, J.: Cogvideo: Large-scale pretrain- ing for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868 (2022) 1, 3
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[29]
LRM: Large Reconstruction Model for Single Image to 3D
Hong, Y., Zhang, K., Gu, J., Bi, S., Zhou, Y., Liu, D., Liu, F., Sunkavalli, K., Bui, T., Tan, H.: LRM: Large Reconstruction Model for Single Image to 3D. arXiv preprint arXiv:2311.04400 (2023),https://arxiv.org/abs/2311.044003
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[30]
Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: Lora: Low-rank adaptation of large language models. Iclr1(2), 3 (2022) 9, S6
work page 2022
-
[31]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., et al.: Vbench: Comprehensive benchmark suite for video gener- ative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 21807–21818 (2024) 11, 12, S7
work page 2024
-
[32]
Adam: A Method for Stochastic Optimization
Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014) 9, S6
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[33]
HunyuanVideo: A Systematic Framework For Large Video Generative Models
Kong, W., Tian, Q., Zhang, Z., Min, R., Dai, Z., Zhou, J., Xiong, J., Li, X., Wu, B., Zhang, J., et al.: Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603 (2024) 3
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[34]
In: Advances in Neural Information Processing Systems (NeurIPS)
Kuang, Z., Cai, S., He, H., Xu, Y., Li, H., Guibas, L.J., Wetzstein, G.: Collabo- rative Video Diffusion: Consistent Multi-Video Generation with Camera Control. In: Advances in Neural Information Processing Systems (NeurIPS). vol. 37, pp. 16240–16271 (2024) 3
work page 2024
-
[35]
arXiv preprint arXiv:2510.21615 (2025) 4
Kupyn,O.,Manhardt,F.,Tombari,F.,Rupprecht,C.:Epipolargeometryimproves video generation models. arXiv preprint arXiv:2510.21615 (2025) 4
-
[36]
In: Proceedings of the European Conference on Computer Vision (ECCV)
Leroy,V.,Cabon,Y.,Revaud,J.:GroundingImageMatchingin3DwithMAST3R. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 71–
-
[37]
MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE
Li, J., Cui, Y., Huang, T., Ma, Y., Fan, C., Yang, M., Zhong, Z.: Mixgrpo: Unlock- ing flow-based grpo efficiency with mixed ode-sde. arXiv preprint arXiv:2507.21802 (2025) 4
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[38]
Li, P., et al.: Era3D: High-Resolution Multiview Diffusion Using Efficient Re- arrangement Attention. In: Advances in Neural Information Processing Systems (NeurIPS) (2024),https://proceedings.neurips.cc/paper_files/paper/2024/ file/65a723bf7d8dad838c09178270d30e80-Paper-Conference.pdf3 GeoFlow: Enforcing Implicit Geometric Consistency in Video Generation 19
work page 2024
-
[39]
Liang, Y., et al.: Rich Human Feedback for Text-to-Image Generation. In: Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog- nition (CVPR) (2024),https://openaccess.thecvf.com/content/CVPR2024/ papers/Liang_Rich_Human_Feedback_for_Text- to- Image_Generation_CVPR_ 2024_paper.pdf4
work page 2024
-
[40]
Depth Anything 3: Recovering the Visual Space from Any Views
Lin, H., Chen, S., Liew, J., Chen, D.Y., Li, Z., Shi, G., Feng, J., Kang, B.: Depth anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647 (2025) 5, S1, S7
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[41]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Ling, L., Sheng, Y., Tu, Z., Zhao, W., Xin, C., Wan, K., Yu, L., Guo, Q., Yu, Z., Lu, Y., et al.: DL3DV-10K: A Large-Scale Scene Dataset for Deep Learning-based 3D Vision. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 22166–22176 (2024) 11, S6, S7
work page 2024
-
[43]
Flow Matching for Generative Modeling
Lipman, Y., Chen, R.T., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. arXiv preprint arXiv:2210.02747 (2022) S4
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[44]
Flow-GRPO: Training Flow Matching Models via Online RL
Liu, J., Liu, G., Liang, J., Li, Y., Liu, J., Wang, X., Wan, P., Zhang, D., Ouyang, W.: Flow-GRPO: Training Flow Matching Models via Online RL. arXiv preprint arXiv:2505.05470 (2025) 2, 3, 4, 5, 7, 8
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[45]
Zero-1-to-3: Zero-shot one image to 3d object.arXiv preprint arXiv:2303.11328, 2023
Liu, R., Wu, R., Van Hoorick, B., Tokmakov, P., Zakharov, S., Vondrick, C.: Zero- 1-to-3: Zero-shot One Image to 3D Object. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 9298–9309 (2023), https://arxiv.org/abs/2303.113283
-
[46]
Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
Liu, X., Gong, C., Liu, Q.: Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow. arXiv preprint arXiv:2209.03003 (2023) 7
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[47]
SyncDreamer: Generating Multiview-consistent Images from a Single-view Image
Liu, Y., et al.: SyncDreamer: Generating Multiview-Consistent Images from a Single-View Image. arXiv preprint arXiv:2309.03453 (2023),https://arxiv.org/ abs/2309.034533
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[48]
Interna- tional journal of computer vision60(2), 91–110 (2004) S2
Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Interna- tional journal of computer vision60(2), 91–110 (2004) S2
work page 2004
-
[49]
OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation
Nan, K., Xie, R., Zhou, P., Fan, T., Zheng, Z., Huang, Z., Li, H., Li, J., Li, J.: OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-Video Generation. arXiv preprint arXiv:2407.02371 (2024) 11, S6, S7
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[50]
DINOv2: Learning Robust Visual Features without Supervision
Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023) 5, S7
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[51]
arXiv preprint arXiv:2512.12080 (2025) 3
Po, R., Chan, E.R., Chen, C., Wetzstein, G.: Bagger: Backwards aggregation for mitigating drift in autoregressive video diffusion models. arXiv preprint arXiv:2512.12080 (2025) 3
-
[52]
Long-context state-space video world models.ArXiv, abs/2505.20171, 2025
Po, R., Nitzan, Y., Zhang, R., Chen, B., Dao, T., Shechtman, E., Wetzstein, G., Huang, X.: Long-Context State-Space Video World Models. arXiv preprint arXiv:2505.20171 (2025) 3
-
[53]
Rafailov, R., Sharma, A., Mitchell, E., Manning, C.D., Ermon, S., Finn, C.: Di- rect preference optimization: Your language model is secretly a reward model. In: NeurIPS (2023) 4
work page 2023
-
[54]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022) 1
work page 2022
-
[55]
Sampson, P.D.: Fitting conic sections to “very scattered” data: An iterative refine- ment of the bookstein algorithm. Computer graphics and image processing18(1), 97–108 (1982) 12, S2 20 J. Ackermann et al
work page 1982
-
[56]
arXiv preprint arXiv:2303.07937 (2023) 4
Seo, J., Jang, W., Kwak, M.S., Kim, H., Ko, J., Kim, J., Kim, J.H., Lee, J., Kim, S.: Let 2D Diffusion Model Know 3D-Consistency for Robust Text-to-3D Generation. arXiv preprint arXiv:2303.07937 (2023) 4
-
[57]
Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y.K., Wu, Y., Guo, D.: Deepseekmath: Pushing the limits of mathematical reasoning in open language models. In: arXiv (2024) 4
work page 2024
-
[58]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Zhang, M., Li, Y., Wu, Y., Guo, D.: DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv preprint arXiv:2402.03300 (2024) 7
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[59]
MVDream: Multi-view Diffusion for 3D Generation
Shi, Y., Wang, P., Ye, J., Long, M., Li, K., Yang, X.: MVDream: Multi-view Dif- fusion for 3D Generation. arXiv preprint arXiv:2308.16512 (2023) 3
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[60]
Zero123++: a Single Image to Consistent Multi-view Diffusion Base Model
Shi, Y., Wang, P., Ye, J., Mai, L., Li, K., Yang, X.: Zero123++: Single Image to Consistent Multi-view Diffusion Base Model. arXiv preprint arXiv:2310.15110 (2023),https://arxiv.org/abs/2310.151103
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[61]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Shue, J.R., Chan, E.R., Po, R., Ankner, Z., Wu, J., Wetzstein, G.: 3D Neural Field Generation using Triplane Diffusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 20875–20886 (2023) 3
work page 2023
-
[62]
History-Guided Video Diffusion
Song, K., Chen, B., Simchowitz, M., Du, Y., Tedrake, R., Sitzmann, V.: History- Guided Video Diffusion. arXiv preprint arXiv:2502.06764 (2025) 3
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[63]
Score-Based Generative Modeling through Stochastic Differential Equations
Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., Poole, B.: Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456 (2020) 8
work page internal anchor Pith review Pith/arXiv arXiv 2020
-
[64]
Tang, J., Chen, Z., Chen, X., Wang, T., Zeng, G., Liu, Z.: LGM: Large Multi-View Gaussian Model for High-Resolution 3D Content Creation. In: Proceedings of the European Conference on Computer Vision (ECCV) (2024), https://arxiv.org/abs/2402.05054 3
-
[65]
Gemini: A Family of Highly Capable Multimodal Models
Team, G., Anil, R., Borgeaud, S., Alayrac, J.B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A.M., Hauth, A., Millican, K., et al.: Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023) S6
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[66]
Team, G.: Mochi 1. https://github.com/genmoai/models (2024) 3
work page 2024
-
[67]
Raft: Recurrent all-pairs field transforms for optical flow
Teed, Z., Deng, J.: Raft: Recurrent all-pairs field transforms for optical flow. In: European conference on computer vision. pp. 402–419. Springer (2020) S1, S3
work page 2020
-
[68]
TripoSR: Fast 3D Object Reconstruction from a Single Image
Tochilkin, D., Pankratz, D., Liu, Z., Huang, Z., Letts, A., Li, Y., Liang, D., Laforte, C., Jampani, V., Cao, Y.P.: TripoSR: Fast 3D Object Reconstruction from a Single Image. arXiv preprint arXiv:2403.02151 (2024), https://arxiv.org/abs/2403.02151 3
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[69]
Generative Camera Dolly: Extreme Monocular Dynamic Novel View Synthesis
Van Hoorick, B., Wu, R., Ozguroglu, E., Sargent, K., Liu, R., Tokmakov, P., Dave, A., Zheng, C., Vondrick, C.: Generative Camera Dolly: Extreme Monocular Dynamic Novel View Synthesis. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 313–331. Springer (2024) 3
work page 2024
-
[70]
Wallace, B., et al.: Diffusion Model Alignment Using Direct Preference Optimization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024), https://openaccess.thecvf.com/content/CVPR2024/papers/Wallace_Diffusion_Model_Alignment_Using_Direct_Preference_Optimization_CVPR_2024_paper.pdf 4
work page 2024
-
[71]
Wan: Open and Advanced Large-Scale Video Generative Models
Wan Team: Wan: Open and Advanced Large-Scale Video Generative Models. arXiv preprint arXiv:2503.20314 (2025) 2, 3, 9, S6
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[72]
$\pi^3$: Permutation-Equivariant Visual Geometry Learning
Wang, Y., Zhou, J., Zhu, H., Chang, W., Zhou, Y., Li, Z., Chen, J., Pang, J., Shen, C., He, T.: $\pi^3$: Scalable Permutation-Equivariant Visual Geometry Learning. arXiv preprint arXiv:2507.13347 (2025) S1
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[73]
Waft: Warping-alone field transforms for optical flow
Wang, Y., Deng, J.: Waft: Warping-alone field transforms for optical flow. arXiv preprint arXiv:2506.21526 (2025) 5, S1, S7
-
[74]
Video models are zero-shot learners and reasoners
Wiedemer, T., Li, Y., Vicol, P., Gu, S.S., Matarese, N., Swersky, K., Kim, B., Jaini, P., Geirhos, R.: Video models are zero-shot learners and reasoners. arXiv preprint arXiv:2509.20328 (2025) 3
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[75]
Ic-world: In-context generation for shared world modeling
Wu, F., Wei, J., Li, R., Xu, Y., Li, J., Ye, D., Lin, G.: Ic-world: In-context generation for shared world modeling. arXiv preprint arXiv:2512.02793 (2025) 4
-
[76]
Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling
Wu, H., Wu, D., He, T., Guo, J., Ye, Y., Duan, Y., Bian, J.: Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling. arXiv preprint arXiv:2507.07982 (2025) 2, 4
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[77]
Cat4D: Create Anything in 4D with Multi-View Video Diffusion Models
Wu, R., Gao, R., Poole, B., Trevithick, A., Zheng, C., Barron, J.T., Holynski, A.: Cat4D: Create Anything in 4D with Multi-View Video Diffusion Models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 26057–26068 (2025) 3
work page 2025
-
[78]
Video World Models with Long-term Spatial Memory
Wu, T., Yang, S., Po, R., Xu, Y., Liu, Z., Lin, D., Wetzstein, G.: Video World Models with Long-term Spatial Memory. arXiv preprint arXiv:2506.05284 (2025) 3
-
[79]
WorldMem: Long-Term Consistent World Simulation with Memory
Xiao, Z., Lan, Y., Zhou, Y., Ouyang, W., Yang, S., Zeng, Y., Pan, X.: WorldMem: Long-Term Consistent World Simulation with Memory. arXiv preprint arXiv:2504.12369 (2025) 3
-
[80]
Carve3D: Improving Multi-view Reconstruction Consistency for Diffusion Models with RL Finetuning
Xie, D., Li, J., Tan, H., Sun, X., Shu, Z., Zhou, Y., Bi, S., Pirk, S., Kaufman, A.E.: Carve3D: Improving Multi-view Reconstruction Consistency for Diffusion Models with RL Finetuning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6369–6379 (2024), https://openaccess.thecvf.com/content/CV...
work page 2024
-
[81]
Moving object segmentation: All you need is sam (and flow)
Xie, J., Yang, C., Xie, W., Zisserman, A.: Moving object segmentation: All you need is sam (and flow). In: Proceedings of the Asian conference on computer vision. pp. 162–178 (2024) S2
work page 2024