citation dossier
VideoCrafter1: Open Diffusion Models for High-Quality Video Generation
why this work matters in Pith
Pith has found this work cited in 17 reviewed papers. Its strongest current cluster is cs.CV (16 papers), and the largest review-status bucket among citing papers is UNVERDICTED (16 papers). For highly cited works, this page shows a dossier first and a bounded explorer second; it never tries to render every citing paper at once.
representative citing papers
- VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.
- CMTA detects AI-generated videos by capturing unnatural temporal stability in visual-textual semantic alignment via joint embeddings and multi-grained temporal modeling, outperforming prior methods in cross-generator tests.
- Video diffusion models can be adapted into permutation-invariant generators for sparse novel view synthesis by treating the problem as video completion and removing temporal order cues.
- V2V-Zero adapts frozen VLMs for visual conditioning via hidden states from specification pages, scoring 0.85 on GenEval and 32.7 on a new seven-task benchmark while revealing capability hierarchies in attribute binding and structural control.
- FIS-DiT achieves a 2.11-2.41x speedup on video DiT models in few-step regimes with negligible quality loss by exploiting frame-wise sparsity and consistency through a training-free interleaved execution strategy.
- MAST with spiking neural networks achieves 93.14% mean accuracy detecting AI-generated videos from 10 unseen generators by exploiting smoother pixel residuals and compact semantic trajectories.
- CineAGI is a multi-agent LLM framework that generates multi-scene movies with improved character consistency, narrative coherence, and audio-visual alignment.
- GRN uses hierarchical binary quantization and entropy-guided refinement to set new ImageNet records of 0.56 rFID for reconstruction and 1.81 gFID for class-conditional generation, with code and models released.
- NUMINA improves counting accuracy in text-to-video diffusion models by up to 7.4% via a training-free identify-then-guide framework on the new CountBench dataset.
- ATSS detects AI-generated videos by measuring unnatural repetitive temporal correlations in triple similarity matrices derived from frame visuals and semantic descriptions.
- Diffusion-APO synchronizes training noise with inference trajectories in video diffusion models to improve preference alignment and visual quality.
- A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.
- The paper offers the first focused review of MLLM-based video translation organized by a three-role taxonomy of Semantic Reasoner, Expressive Performer, and Visual Synthesizer, plus open challenges.
- Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.
- This review organizes literature on large multimodal models and object-centric vision into four themes (understanding, referring segmentation, editing, and generation) while summarizing paradigms, strategies, and challenges like instance permanence and consistent interaction.
citing papers explorer
- FreeSpec: Training-Free Long Video Generation via Singular-Spectrum Reconstruction
  FreeSpec uses SVD-based spectral reconstruction to fuse global low-rank and local high-rank features, reducing content drift and preserving temporal dynamics in long video generation. A hedged sketch of this style of spectral fusion appears after this list.
- VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation
  VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.
- CMTA: Leveraging Cross-Modal Temporal Artifacts for Generalizable AI-Generated Video Detection
  CMTA detects AI-generated videos by capturing unnatural temporal stability in visual-textual semantic alignment via joint embeddings and multi-grained temporal modeling, outperforming prior methods in cross-generator tests.
- Novel View Synthesis as Video Completion
  Video diffusion models can be adapted into permutation-invariant generators for sparse novel view synthesis by treating the problem as video completion and removing temporal order cues.
- Beyond Text Prompts: Visual-to-Visual Generation as a Unified Paradigm
  V2V-Zero adapts frozen VLMs for visual conditioning via hidden states from specification pages, scoring 0.85 on GenEval and 32.7 on a new seven-task benchmark while revealing capability hierarchies in attribute binding and structural control.
- FIS-DiT: Breaking the Few-Step Video Inference Barrier via Training-Free Frame Interleaved Sparsity
  FIS-DiT achieves a 2.11-2.41x speedup on video DiT models in few-step regimes with negligible quality loss by exploiting frame-wise sparsity and consistency through a training-free interleaved execution strategy. A sketch of frame-interleaved computation reuse appears after this list.
- Detecting AI-Generated Videos with Spiking Neural Networks
  MAST with spiking neural networks achieves 93.14% mean accuracy detecting AI-generated videos from 10 unseen generators by exploiting smoother pixel residuals and compact semantic trajectories.
- CineAGI: Character-Consistent Movie Creation through LLM-Orchestrated Multi-Modal Generation and Cross-Scene Integration
  CineAGI is a multi-agent LLM framework that generates multi-scene movies with improved character consistency, narrative coherence, and audio-visual alignment.
- Generative Refinement Networks for Visual Synthesis
  GRN uses hierarchical binary quantization and entropy-guided refinement to set new ImageNet records of 0.56 rFID for reconstruction and 1.81 gFID for class-conditional generation, with code and models released.
- When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models
  NUMINA improves counting accuracy in text-to-video diffusion models by up to 7.4% via a training-free identify-then-guide framework on the new CountBench dataset.
- ATSS: Detecting AI-Generated Videos via Anomalous Temporal Self-Similarity
  ATSS detects AI-generated videos by measuring unnatural repetitive temporal correlations in triple similarity matrices derived from frame visuals and semantic descriptions. A sketch of a temporal self-similarity matrix appears after this list.
- Diffusion-APO: Trajectory-Aware Direct Preference Alignment for Video Diffusion Transformers
  Diffusion-APO synchronizes training noise with inference trajectories in video diffusion models to improve preference alignment and visual quality.
- Movie Gen: A Cast of Media Foundation Models
  A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.
- Empowering Video Translation using Multimodal Large Language Models
  The paper offers the first focused review of MLLM-based video translation organized by a three-role taxonomy of Semantic Reasoner, Expressive Performer, and Visual Synthesizer, plus open challenges.
- Show-o2: Improved Native Unified Multimodal Models
  Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.
- LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation
  This review organizes literature on large multimodal models and object-centric vision into four themes (understanding, referring segmentation, editing, and generation) while summarizing paradigms, strategies, and challenges like instance permanence and consistent interaction.
- GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth
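
For readers who want a concrete picture of the spectral fusion the FreeSpec entry describes, here is a minimal sketch. It assumes per-frame feature matrices and an illustrative rank split k; the name spectral_fuse, the split rank, and the additive recombination rule are assumptions for illustration, not the paper's actual formulation.

```python
# Hedged sketch: fuse the top-k (low-rank) spectrum of global features with
# the residual (high-rank) spectrum of local features. The rank split and
# additive recombination are illustrative assumptions, not FreeSpec's method.
import numpy as np

def spectral_fuse(global_feats, local_feats, k):
    """Both inputs are (frames x channels) matrices of the same shape."""
    # Low-rank part: the k largest singular components of the global
    # features, capturing slowly varying scene-level content.
    Ug, sg, Vgt = np.linalg.svd(global_feats, full_matrices=False)
    low_rank = Ug[:, :k] @ np.diag(sg[:k]) @ Vgt[:k, :]
    # High-rank part: everything beyond the k largest components of the
    # local features, carrying fine-grained temporal detail.
    Ul, sl, Vlt = np.linalg.svd(local_feats, full_matrices=False)
    high_rank = Ul[:, k:] @ np.diag(sl[k:]) @ Vlt[k:, :]
    return low_rank + high_rank

# Toy usage: 16 frames x 64 channels.
rng = np.random.default_rng(0)
out = spectral_fuse(rng.normal(size=(16, 64)), rng.normal(size=(16, 64)), k=4)
print(out.shape)  # (16, 64)
```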
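
The interleaved execution idea in the FIS-DiT entry can be pictured with a toy cache-and-reuse loop. Everything here (expensive_transform, the stride-2 schedule, direct reuse of the cached output) is an assumed simplification; the paper's actual schedule and sparsity pattern may differ.

```python
# Hedged sketch: run a heavy per-frame transform only on every `stride`-th
# frame and reuse the cached result for neighbors, exploiting frame-wise
# consistency. The schedule is an assumption, not FIS-DiT's actual design.
import numpy as np

def expensive_transform(frame):
    # Stand-in for a costly DiT block applied to one frame.
    return np.tanh(frame @ frame.T)

def interleaved_apply(frames, stride=2):
    outputs, cached = [], None
    for i, frame in enumerate(frames):
        if i % stride == 0 or cached is None:
            cached = expensive_transform(frame)  # full computation
        outputs.append(cached)                   # reuse for skipped frames
    return outputs

frames = [np.random.default_rng(i).normal(size=(8, 8)) for i in range(6)]
outs = interleaved_apply(frames)
print(len(outs), outs[0].shape)  # 6 (8, 8)
```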
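
Finally, the temporal self-similarity matrices the ATSS entry mentions reduce to pairwise cosine similarities over per-frame features. The repetition_score below is an invented stand-in for the paper's anomaly measure, included only to show what unnaturally repetitive temporal correlations might look like numerically.

```python
# Hedged sketch: a temporal self-similarity matrix over per-frame features,
# plus a toy periodicity score. The score is an illustrative stand-in for
# ATSS's actual anomaly measure.
import numpy as np

def self_similarity(frame_feats):
    """frame_feats is (frames x dim); returns a (frames x frames) matrix of
    cosine similarities between every pair of frames."""
    normed = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    return normed @ normed.T

def repetition_score(sim):
    """Mean similarity averaged over all temporal lags; unnaturally
    repetitive videos stay high even at large lags."""
    lags = range(1, sim.shape[0])
    return float(np.mean([np.diagonal(sim, offset=lag).mean() for lag in lags]))

rng = np.random.default_rng(0)
feats = rng.normal(size=(32, 128))  # 32 frames, 128-dim features per frame
print(repetition_score(self_similarity(feats)))
```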