{"total":49,"items":[
{"citing_arxiv_id":"2605.13838","ref_index":147,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow","primary_cat":"cs.CV","submitted_at":"2026-05-13T17:58:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"R-DMesh uses a VAE with a learned rectification jump offset and Triflow Attention inside a rectified-flow diffusion transformer to produce video-aligned 4D meshes despite initial pose misalignment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2605.12480","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation","primary_cat":"cs.CV","submitted_at":"2026-05-12T17:56:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"OmniNFT introduces modality-wise advantage routing, layer-wise gradient surgery, and region-wise loss reweighting in an online diffusion RL framework to improve audio-video quality, alignment, and synchronization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2605.12038","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"OmniHumanoid: Streaming Cross-Embodiment Video Generation with Paired-Free Adaptation","primary_cat":"cs.CV","submitted_at":"2026-05-12T12:21:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OmniHumanoid factorizes transferable motion learning from embodiment-specific adaptation to enable scalable cross-embodiment video generation without paired data for new humanoids.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2605.11723","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating","primary_cat":"cs.CV","submitted_at":"2026-05-12T08:08:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CaC is a hierarchical spatiotemporal concentrating reward model for video anomalies that reports 25.7% accuracy gains on fine-grained benchmarks and 11.7% anomaly reduction in generated videos via a new dataset and GRPO training with temporal/spatial IoU rewards.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2605.04461","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Stream-T1: Test-Time Scaling for Streaming Video Generation","primary_cat":"cs.CV","submitted_at":"2026-05-06T03:40:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Stream-T1 is a test-time scaling framework for streaming video generation using scaled noise propagation from history, reward pruning across short and long windows, and feedback-guided memory sinking to improve temporal consistency and visual quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2605.02583","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Stylistic Attribute Control in Latent Diffusion Models","primary_cat":"cs.CV","submitted_at":"2026-05-04T13:34:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A technique for parametric stylistic control in latent diffusion models learns disentangled directions from synthetic datasets and applies them via guidance composition while preserving semantics.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2605.02948","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation","primary_cat":"cs.LG","submitted_at":"2026-05-01T16:38:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AsymTalker uses temporal reference encoding and asymmetric knowledge distillation to produce identity-consistent talking head videos up to 600 seconds long at 66 FPS.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2604.26917","ref_index":62,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AnimateAnyMesh++: A Flexible 4D Foundation Model for High-Fidelity Text-Driven Mesh Animation","primary_cat":"cs.CV","submitted_at":"2026-04-29T17:27:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"AnimateAnyMesh++ animates arbitrary 3D meshes from text using an expanded 300K-identity DyMesh-XL dataset, a power-law topology-aware DyMeshVAE-Flex, and a variable-length rectified-flow generator to produce semantically accurate, temporally coherent animations in seconds.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2604.25819","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation","primary_cat":"cs.CV","submitted_at":"2026-04-28T16:28:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Mutual Forcing trains a single native autoregressive audio-video model with mutually reinforcing few-step and multi-step modes via self-distillation to match 50-step baselines at 4-8 steps.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2604.24764","ref_index":25,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"World-R1: Reinforcing 3D Constraints for Text-to-Video Generation","primary_cat":"cs.CV","submitted_at":"2026-04-27T17:59:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"World-R1 uses RL with 3D model feedback and a new text dataset to improve geometric consistency in text-to-video generation while keeping the base model unchanged.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2604.23789","ref_index":10,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation","primary_cat":"cs.CV","submitted_at":"2026-04-26T16:28:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MuSS is a new movie-sourced dataset and benchmark that enables AI models to generate multi-shot videos with improved narrative coherence and subject identity preservation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2604.21291","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Exploring the Role of Synthetic Data Augmentation in Controllable Human-Centric Video Generation","primary_cat":"cs.CV","submitted_at":"2026-04-23T05:10:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Synthetic data complements real data in diffusion-based controllable human video generation, with effective sample selection improving motion realism, temporal consistency, and identity preservation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2604.19636","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation","primary_cat":"cs.CV","submitted_at":"2026-04-21T16:25:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CoInteract adds a human-aware mixture-of-experts and spatially-structured co-generation to a diffusion transformer to synthesize videos with stable structures and physically plausible human-object contacts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2604.18215","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Memorize When Needed: Decoupled Memory Control for Spatially Consistent Long-Horizon Video Generation","primary_cat":"cs.CV","submitted_at":"2026-04-20T13:00:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A decoupled memory branch with hybrid cues, cross-attention, and gating improves spatial consistency and data efficiency in long-horizon camera-trajectory video generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2604.17565","ref_index":29,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models","primary_cat":"cs.CV","submitted_at":"2026-04-19T18:11:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"UniGeo unifies geometric guidance across three levels in video models to reduce geometric drift and improve consistency in camera-controllable image editing.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2604.22828","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MetaEarth3D: Unlocking World-scale 3D Generation with Spatially Scalable Generative Modeling","primary_cat":"cs.CV","submitted_at":"2026-04-19T15:09:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MetaEarth3D is the first generative foundation model for spatially consistent, unbounded 3D scene generation at planetary scale using optical Earth observation data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2604.16272","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"VEFX-Bench: A Holistic Benchmark for Generic Video Editing and Visual Effects","primary_cat":"cs.CV","submitted_at":"2026-04-17T17:28:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VEFX-Bench releases a large human-labeled video editing dataset, a multi-dimensional reward model, and a standardized benchmark that better matches human judgments than generic evaluators.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2604.15829","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Beyond Text Prompts: Precise Concept Erasure through Text-Image Collaboration","primary_cat":"cs.CV","submitted_at":"2026-04-17T08:32:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TICoE achieves more precise and faithful concept erasure in text-to-image models by collaborating text and image data through a convex manifold and hierarchical learning, outperforming prior methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2604.15003","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Flow of Truth: Proactive Temporal Forensics for Image-to-Video Generation","primary_cat":"cs.CV","submitted_at":"2026-04-16T13:27:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Flow of Truth introduces a learnable forensic template and template-guided flow module that follows pixel motion to enable temporal tracing in image-to-video generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2604.14025","ref_index":254,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective","primary_cat":"cs.CV","submitted_at":"2026-04-15T16:07:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"The paper proposes a problem-driven taxonomy for feed-forward 3D scene modeling that groups methods by five core challenges: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temporal-aware modeling.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2604.13793","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"From Synchrony to Sequence: Exo-to-Ego Generation via Interpolation","primary_cat":"cs.CV","submitted_at":"2026-04-15T12:32:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Interpolating exo and ego videos into a single continuous sequence lets diffusion sequence models generate more coherent first-person videos than direct conditioning, even without pose interpolation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2604.13509","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"DiT as Real-Time Rerenderer: Streaming Video Stylization with Autoregressive Diffusion Transformer","primary_cat":"cs.CV","submitted_at":"2026-04-15T05:52:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RTR-DiT distills a bidirectional DiT teacher into an autoregressive few-step model using Self Forcing and Distribution Matching Distillation, plus a reference-preserving KV cache, to enable stable real-time text- and reference-guided video stylization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2604.13425","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"VibeFlow: Versatile Video Chroma-Lux Editing through Self-Supervised Learning","primary_cat":"cs.CV","submitted_at":"2026-04-15T02:51:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VibeFlow performs versatile video chroma-lux editing in zero-shot fashion by self-supervised disentanglement of structure and color-illumination cues inside pre-trained video models, plus residual velocity fields and consistency regularization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2604.13030","ref_index":23,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Generative Refinement Networks for Visual Synthesis","primary_cat":"cs.CV","submitted_at":"2026-04-14T17:59:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GRN uses hierarchical binary quantization and entropy-guided refinement to set new ImageNet records of 0.56 rFID for reconstruction and 1.81 gFID for class-conditional generation while releasing code and models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2604.12255","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ARGen: Affect-Reinforced Generative Augmentation towards Vision-based Dynamic Emotion Perception","primary_cat":"cs.CV","submitted_at":"2026-04-14T04:05:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ARGen generates high-fidelity dynamic facial expression videos using affective semantic injection and adaptive reinforcement diffusion to improve emotion recognition models facing data scarcity and long-tail distributions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2604.11789","ref_index":53,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation","primary_cat":"cs.CV","submitted_at":"2026-04-13T17:55:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"This review organizes literature on large multimodal models and object-centric vision into four themes—understanding, referring segmentation, editing, and generation—while summarizing paradigms, strategies, and challenges like instance permanence and consistent interaction.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2604.11283","ref_index":129,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Empowering Video Translation using Multimodal Large Language Models","primary_cat":"cs.CV","submitted_at":"2026-04-13T10:42:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The paper offers the first focused review of MLLM-based video translation organized by a three-role taxonomy of Semantic Reasoner, Expressive Performer, and Visual Synthesizer, plus open challenges.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2604.16479","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Latent-Compressed Variational Autoencoder for Video Diffusion Models","primary_cat":"cs.CV","submitted_at":"2026-04-12T04:45:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A frequency-based latent compression method for video VAEs yields higher reconstruction quality than channel-reduction baselines at fixed compression ratios.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2604.10127","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VGA-Bench: A Unified Benchmark and Multi-Model Framework for Video Aesthetics and Generation Quality Evaluation","primary_cat":"cs.CV","submitted_at":"2026-04-11T09:44:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VGA-Bench creates a three-tier taxonomy, 1,016-prompt dataset of 60k+ videos, and three multi-task neural models (VAQA-Net, VTag-Net, VGQA-Net) that align with human judgments for video aesthetics and generation quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2604.09231","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Hitem3D 2.0: Multi-View Guided Native 3D Texture Generation","primary_cat":"cs.CV","submitted_at":"2026-04-10T11:40:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Hitem3D 2.0 combines multi-view image synthesis with native 3D texture projection to improve completeness, cross-view consistency, and geometry alignment over prior methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2604.08719","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LMGenDrive: Bridging Multimodal Understanding and Generative World Modeling for End-to-End Driving","primary_cat":"cs.CV","submitted_at":"2026-04-09T19:13:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LMGenDrive unifies LLM-based multimodal understanding with generative world models to output both future driving videos and control signals for end-to-end closed-loop autonomous driving.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2604.07958","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ImVideoEdit: Image-learning Video Editing via 2D Spatial Difference Attention Blocks","primary_cat":"cs.CV","submitted_at":"2026-04-09T08:22:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ImVideoEdit learns video editing from 13K image pairs by decoupling spatial modifications from frozen temporal dynamics in pretrained models, matching larger video-trained systems in fidelity and consistency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2604.07209","ref_index":26,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling","primary_cat":"cs.CV","submitted_at":"2026-04-08T15:31:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"INSPATIO-WORLD is a real-time framework for high-fidelity 4D scene generation and navigation from monocular videos via STAR architecture with implicit caching, explicit geometric constraints, and distribution-matching distillation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2604.07026","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Not all tokens contribute equally to diffusion learning","primary_cat":"cs.CV","submitted_at":"2026-04-08T12:45:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DARE mitigates neglect of important tokens in conditional diffusion models via distribution-rectified guidance and spatial attention alignment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2604.06339","ref_index":228,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Evolution of Video Generative Foundations","primary_cat":"cs.CV","submitted_at":"2026-04-07T18:17:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"PEEKABOO [221], DragNUWA [222], and MotionCtrl [223] achieve control by specifying object trajectories, while some methods [224], [225] adopt 3D trajectory-based control. In addition, several works [226], [227] employ mask-based control to guide object motion. 5.2.3 Camera Motion control. Camera motion control tasks are relatively less explored compared to object motion control. Animatediff [228] and Gen-2 [175] train LoRA [61] on the temporal modules of video generative models to control camera motion. MotionCtrl [223] proposes to control camera motion independently from object motion, and introduces a lightweight Camera Motion Control Module (CMCM) for this purpose. Direct-a-Video [216] incorporates novel temporal cross-attention layers to interpret camera motion parameters,"},
{"citing_arxiv_id":"2604.06010","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"OmniCamera: A Unified Framework for Multi-task Video Generation with Arbitrary Camera Control","primary_cat":"cs.CV","submitted_at":"2026-04-07T16:06:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"OmniCamera disentangles video content and camera motion for multi-task generation with arbitrary camera control via the OmniCAM hybrid dataset and Dual-level Curriculum Co-Training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2604.05961","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"HumANDiff: Articulated Noise Diffusion for Motion-Consistent Human Video Generation","primary_cat":"cs.CV","submitted_at":"2026-04-07T14:55:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"HumANDiff improves motion consistency in human video generation by sampling diffusion noise on an articulated human body template and adding joint appearance-motion prediction plus a geometric consistency loss.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2604.04934","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision","primary_cat":"cs.CV","submitted_at":"2026-04-06T17:59:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Vanast produces coherent garment-transferred human animation videos from a single human image, garment images, and pose guidance video using synthetic triplet supervision and a Dual Module video diffusion transformer architecture.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2604.04787","ref_index":20,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AvatarPointillist: AutoRegressive 4D Gaussian Avatarization","primary_cat":"cs.CV","submitted_at":"2026-04-06T15:56:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AvatarPointillist autoregressively generates adaptive 3D point clouds via Transformer for photorealistic 4D Gaussian avatars from one image, jointly predicting animation bindings and using a conditioned Gaussian decoder.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2604.02467","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VERTIGO: Visual Preference Optimization for Cinematic Camera Trajectory Generation","primary_cat":"cs.CV","submitted_at":"2026-04-02T18:58:56+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VERTIGO post-trains camera trajectory generators with visual preference signals from Unity-rendered previews scored by a cinematically fine-tuned VLM, cutting character off-screen rates from 38% to near zero while improving framing and prompt adherence.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2412.20404","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Open-Sora: Democratizing Efficient Video Production for All","primary_cat":"cs.CV","submitted_at":"2024-12-29T08:52:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Open-Sora releases an open-source video generation model based on a Spatial-Temporal Diffusion Transformer that decouples spatial and temporal attention, supporting text-to-video, image-to-video, and text-to-image tasks with claimed high fidelity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2410.13720","ref_index":21,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Movie Gen: A Cast of Media Foundation Models","primary_cat":"cs.CV","submitted_at":"2024-10-17T16:22:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2409.18869","ref_index":28,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Emu3: Next-Token Prediction is All You Need","primary_cat":"cs.CV","submitted_at":"2024-09-27T16:06:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2409.02048","ref_index":53,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis","primary_cat":"cs.CV","submitted_at":"2024-09-03T16:53:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ViewCrafter tames video diffusion models with point-based 3D guidance and iterative trajectory planning to produce high-fidelity novel views from single or sparse images.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2408.06072","ref_index":77,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer","primary_cat":"cs.CV","submitted_at":"2024-08-12T11:47:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CogVideoX generates coherent 10-second text-to-video outputs at high resolution using a 3D VAE, expert adaptive LayerNorm transformer, progressive training, and a custom data pipeline, claiming state-of-the-art results.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2404.02101","ref_index":113,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CameraCtrl: Enabling Camera Control for Text-to-Video Generation","primary_cat":"cs.CV","submitted_at":"2024-04-02T16:52:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2402.17177","ref_index":149,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models","primary_cat":"cs.CV","submitted_at":"2024-02-27T03:30:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Chen, and B. Guo, \"Ccedit: Creative and controllable video editing via diffusion models,\" arXiv preprint arXiv:2309.16496, 2023. [148] J. Xing, M. Xia, Y. Liu, Y. Zhang, Y. Zhang, Y. He, H. Liu, H. Chen, X. Cun, X. Wang, et al., \"Make-your-video: Customized video generation using textual and structural guidance,\" arXiv preprint arXiv:2306.00943, 2023. [149] Y. Guo, C. Yang, A. Rao, Y. Wang, Y. Qiao, D. Lin, and B. Dai, \"Animatediff: Animate your personalized text-to-image diffusion models without specific tuning,\" arXiv preprint arXiv:2307.04725, 2023. [150] Y. He, M. Xia, H. Chen, X. Cun, Y. Gong, J. Xing, Y. Zhang, X. Wang, C. Weng, Y. Shan, et al., \"Animate-a-story: Storytelling with retrieval-augmented video generation,\" arXiv preprint"},
{"citing_arxiv_id":"2311.15127","ref_index":32,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets","primary_cat":"cs.CV","submitted_at":"2023-11-25T22:28:38+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Stable Video Diffusion scales latent video diffusion models via text-to-image pretraining, video pretraining on curated data, and high-quality finetuning to produce competitive text-to-video and image-to-video results while enabling motion LoRA and multi-view 3D applications.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},
{"citing_arxiv_id":"2310.19512","ref_index":23,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VideoCrafter1: Open Diffusion Models for High-Quality Video Generation","primary_cat":"cs.CV","submitted_at":"2023-10-30T13:12:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Open-source text-to-video and image-to-video diffusion models generate high-quality 1024x576 videos, with the I2V variant claimed as the first to strictly preserve reference image content.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}
],"limit":50,"offset":0}