Open-Sora: Democratizing Efficient Video Production for All
Pith reviewed 2026-05-11 11:55 UTC · model grok-4.3
The pith
Open-Sora delivers an open-source video model that generates up to 15-second clips at 720p using decoupled spatial-temporal attention.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By introducing the Spatial-Temporal Diffusion Transformer (STDiT) that decouples spatial and temporal attention, pairing it with a highly compressive 3D autoencoder, and applying an ad hoc training strategy, the work produces an open-source model capable of high-fidelity video generation up to 15 seconds long at 720p resolution across arbitrary aspect ratios.
What carries the argument
Spatial-Temporal Diffusion Transformer (STDiT), which decouples spatial and temporal attention to process video sequences efficiently within a diffusion framework.
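As a concrete illustration, here is a minimal sketch of the general factorized-attention pattern (an assumption for exposition, not the paper's actual implementation): one cheap attention pass over spatial tokens within each frame, then one over frames at each spatial position, so no single attention call sees the full spatio-temporal token grid.

```python
# Minimal sketch (assumed, not Open-Sora's code) of decoupled spatial-temporal attention.
import torch
import torch.nn as nn


class DecoupledSTBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm_s = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens_per_frame, dim) latent video tokens
        b, t, s, d = x.shape

        # Spatial attention: fold frames into the batch; attend within each frame.
        xs = x.reshape(b * t, s, d)
        h = self.norm_s(xs)
        xs = xs + self.spatial_attn(h, h, h, need_weights=False)[0]
        x = xs.reshape(b, t, s, d)

        # Temporal attention: fold spatial positions into the batch; attend across frames.
        xt = x.permute(0, 2, 1, 3).reshape(b * s, t, d)
        h = self.norm_t(xt)
        xt = xt + self.temporal_attn(h, h, h, need_weights=False)[0]
        return xt.reshape(b, s, t, d).permute(0, 2, 1, 3)


# Cost intuition: full joint attention over t*s tokens scales as O((t*s)^2) per layer,
# while this decoupled pair scales roughly as O(t*s^2 + s*t^2).
x = torch.randn(2, 16, 256, 384)   # 16 frames, 256 tokens per frame, 384-dim
print(DecoupledSTBlock(384)(x).shape)   # torch.Size([2, 16, 256, 384])
```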
If this is right
- Any researcher or developer can now train or adapt video generation models using the released weights and full codebase.
- Content pipelines gain native support for arbitrary aspect ratios and lengths up to 15 seconds without additional licensing.
- Image-to-video and text-to-video workflows become interchangeable within one open framework.
- Further community experiments can directly modify the attention decoupling or autoencoder to test efficiency gains.
Where Pith is reading between the lines
- The same decoupling pattern may extend to other temporal media such as audio or 3D scene generation.
- Public release of the full stack could encourage standardized benchmarks for open video models that currently do not exist.
- Smaller teams might iterate faster on domain-specific fine-tunes once the base model and training recipe are public.
Load-bearing premise
The combination of decoupled attention, the compressive 3D autoencoder, and the chosen training strategy is enough to reach the stated video length, resolution, and quality without relying on closed-source advantages.
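A rough back-of-envelope on why the compression is load-bearing; the strides and patch size below are assumptions for illustration, since the abstract does not state the autoencoder's actual ratios.

```python
# Illustrative token-count arithmetic with assumed compression factors
# (a hypothetical 4x temporal / 8x spatial autoencoder stride and 2x2 latent
# patches); the paper's actual ratios may differ.

def transformer_tokens(frames: int, height: int, width: int,
                       t_stride: int, s_stride: int, patch: int) -> int:
    """Tokens the diffusion transformer attends over after 3D autoencoding
    and patchifying the latent video."""
    lat_t, lat_h, lat_w = frames // t_stride, height // s_stride, width // s_stride
    return lat_t * (lat_h // patch) * (lat_w // patch)

# 15 s at 24 fps, 720p (1280x720): roughly 332M pixel values per channel before
# encoding, 324,000 transformer tokens after it under these assumed ratios.
print(transformer_tokens(360, 720, 1280, t_stride=4, s_stride=8, patch=2))
```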
What would settle it
Independent blind ratings or quantitative metrics showing that videos from Open-Sora match or fall short of closed-source equivalents in visual coherence, motion realism, and artifact levels at the same compute budget.
Original abstract
Vision and language are the two foundational senses for humans, and they build up our cognitive ability and intelligence. While significant breakthroughs have been made in AI language ability, artificial visual intelligence, especially the ability to generate and simulate the world we see, is far lagging behind. To facilitate the development and accessibility of artificial visual intelligence, we created Open-Sora, an open-source video generation model designed to produce high-fidelity video content. Open-Sora supports a wide spectrum of visual generation tasks, including text-to-image generation, text-to-video generation, and image-to-video generation. The model leverages advanced deep learning architectures and training/inference techniques to enable flexible video synthesis, which could generate video content of up to 15 seconds, up to 720p resolution, and arbitrary aspect ratios. Specifically, we introduce Spatial-Temporal Diffusion Transformer (STDiT), an efficient diffusion framework for videos that decouples spatial and temporal attention. We also introduce a highly compressive 3D autoencoder to make representations compact and further accelerate training with an ad hoc training strategy. Through this initiative, we aim to foster innovation, creativity, and inclusivity within the community of AI content creation. By embracing the open-source principle, Open-Sora democratizes full access to all the training/inference/data preparation codes as well as model weights. All resources are publicly available at: https://github.com/hpcaitech/Open-Sora.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper describes the development and open-source release of Open-Sora, a video generation model that supports text-to-image, text-to-video, and image-to-video tasks. It introduces the Spatial-Temporal Diffusion Transformer (STDiT), which decouples spatial and temporal attention for efficiency, a highly compressive 3D autoencoder, and an ad hoc training strategy. The model is claimed to generate high-fidelity videos up to 15 seconds long at up to 720p resolution with arbitrary aspect ratios. All code and model weights are publicly released to democratize access to video production technology.
Significance. If the claimed capabilities are verified through experiments, this work could have substantial impact by making advanced video generation accessible to the broader research community and creators. The open-source aspect is particularly valuable for fostering innovation and allowing independent verification. The architectural choices, such as the decoupled attention in STDiT, may offer insights into efficient video diffusion models.
major comments (1)
- The abstract outlines the model's capabilities and architectural innovations but does not include any quantitative performance metrics, ablation studies, or baseline comparisons. This absence makes it challenging to evaluate the effectiveness of the STDiT and the 3D autoencoder in achieving the stated high-fidelity and efficiency goals.
minor comments (2)
- The phrase 'ad hoc training strategy' is not defined in the abstract; a clear explanation of the training procedure should be provided in the main text to allow reproducibility.
- It would be helpful to include a table comparing Open-Sora with other open-source video generation models in terms of maximum video length, resolution, and training resources required.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which highlights an important aspect of how the manuscript presents its contributions. We have revised the abstract to incorporate key quantitative metrics and will ensure the experimental sections more explicitly reference the ablation studies and baselines.
Point-by-point responses
- Referee: The abstract outlines the model's capabilities and architectural innovations but does not include any quantitative performance metrics, ablation studies, or baseline comparisons. This absence makes it challenging to evaluate the effectiveness of the STDiT and the 3D autoencoder in achieving the stated high-fidelity and efficiency goals.
  Authors: We agree that the abstract would benefit from including representative quantitative results to allow readers to immediately gauge performance. The full manuscript (Sections 4 and 5) already contains detailed evaluations, including FVD and FID scores on standard benchmarks, ablation studies demonstrating the benefits of decoupled spatial-temporal attention in STDiT, efficiency gains from the compressive 3D autoencoder, and direct comparisons against baselines such as other open-source video diffusion models. To address the referee's concern directly, we have revised the abstract to include concise performance highlights (e.g., competitive FVD scores and training efficiency improvements) while preserving its brevity. This change strengthens the summary without misrepresenting the work.
  Revision: yes
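For context on the metrics named above: FVD and FID are both Fréchet distances between Gaussian fits of feature embeddings (video features for FVD, image features for FID). A minimal sketch of that distance, assuming feature matrices have already been extracted by some backbone not shown here:

```python
# Fréchet distance between two sets of feature vectors, as used by FID/FVD.
# Feature extraction (e.g., an Inception or I3D backbone) is assumed upstream.
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):      # sqrtm can return tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```

Lower values mean the generated distribution sits closer to the real one, which is the kind of number a head-to-head comparison against closed-source systems at matched compute would have to report.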
Circularity Check
No significant circularity in claimed derivation chain
Full rationale
The paper presents an engineering contribution: the design, training, and public release of the Open-Sora video generation system together with its STDiT architecture and compressive 3D autoencoder. No mathematical derivation, first-principles prediction, or uniqueness theorem is asserted that reduces by construction to fitted parameters, self-citations, or renamed inputs. All load-bearing claims rest on the described implementation, training procedure, and released code/weights rather than on any self-referential equation or ansatz smuggled via prior work. The work is therefore self-contained as a systems paper.
Axiom & Free-Parameter Ledger
free parameters (1)
- ad hoc training strategy parameters
axioms (1)
- Domain assumption: Diffusion-based generative models can synthesize coherent high-fidelity video when spatial and temporal modeling are appropriately decoupled.
invented entities (2)
- Spatial-Temporal Diffusion Transformer (STDiT): no independent evidence
- Highly compressive 3D autoencoder: no independent evidence
Forward citations
Cited by 42 Pith papers
- Relative Score Policy Optimization for Diffusion Language Models
RSPO interprets reward advantages as targets for relative log-ratios in dLLMs, calibrating noisy estimates to stabilize RLVR training and achieve strong gains on planning tasks with competitive math reasoning performance.
- From Articulated Kinematics to Routed Visual Control for Action-Conditioned Surgical Video Generation
A kinematic-to-visual lifting paradigm combined with hierarchically routed control generates action-conditioned surgical videos with better faithfulness, fidelity, and efficiency.
- Cross-Attention and Encoder-Decoder Transformers: A Logical Characterization
Encoder-decoder transformers are characterized by a temporal logic extending propositional logic with a counting global modality on the encoder and a past modality on the decoder, equivalently via distributed automata.
- OphEdit: Training-Free Text-Guided Editing of Ophthalmic Surgical Videos
OphEdit enables text-guided editing of eye surgery videos without training by injecting preserved attention value tensors into the diffusion denoising process to maintain anatomical structure.
- Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation
Stream-R1 improves distillation of autoregressive streaming video diffusion models by adaptively weighting supervision with a reward model at both rollout and per-pixel levels.
- AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics
AniMatrix generates anime videos using a production knowledge taxonomy, dual-channel conditioning, style-motion curriculum, and deformation-aware preference optimization, outperforming baselines in animator evaluation...
- AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics
AniMatrix generates anime videos using a structured taxonomy of artistic production variables, dual-channel conditioning, a style-motion curriculum, and deformation-aware optimization to prioritize art over physics.
- AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics
AniMatrix generates anime videos by structuring artistic production rules into a controllable taxonomy and training the model to prioritize those rules over physical realism, achieving top scores from professional ani...
- Cutscene Agent: An LLM Agent Framework for Automated 3D Cutscene Generation
Cutscene Agent uses a multi-agent LLM system and a new toolkit for game engine control to automate end-to-end 3D cutscene generation, evaluated on the introduced CutsceneBench.
- Characterizing Vision-Language-Action Models across XPUs: Constraints and Acceleration for On-Robot Deployment
VLA models exhibit a compute-bound VLM phase followed by a memory-bound action phase on edge hardware; DP-Cache and V-AEFusion reduce redundancy and enable pipeline parallelism for up to 6x speedup on NPUs with margin...
- Sparse Forcing: Native Trainable Sparse Attention for Real-time Autoregressive Diffusion Video Generation
Sparse Forcing adds a native trainable sparsity mechanism and PBSA kernel to autoregressive diffusion video models, yielding higher VBench scores and 1.1-1.27x speedups on 5s to 1min generations.
- UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models
UniGeo unifies geometric guidance across three levels in video models to reduce geometric drift and improve consistency in camera-controllable image editing.
- MCSC-Bench: Multimodal Context-to-Script Creation for Realistic Video Production
MCSC-Bench is the first large-scale dataset for the Multimodal Context-to-Script Creation task, requiring models to select relevant shots from redundant materials, plan missing shots, and generate coherent scripts wit...
- ViVa: A Video-Generative Value Model for Robot Reinforcement Learning
ViVa turns a video generator into a value model for robot RL that jointly forecasts future states and task value, yielding better performance on real-world box assembly when integrated with RECAP.
- Not All Frames Deserve Full Computation: Accelerating Autoregressive Video Generation via Selective Computation and Predictive Extrapolation
SCOPE accelerates autoregressive video diffusion up to 4.73x by using a tri-modal cache-predict-recompute scheduler with Taylor extrapolation and selective active-frame computation while preserving output quality.
- FIS-DiT: Breaking the Few-Step Video Inference Barrier via Training-Free Frame Interleaved Sparsity
FIS-DiT achieves 2.11-2.41x speedup on video DiT models in few-step regimes with negligible quality loss by exploiting frame-wise sparsity and consistency through a training-free interleaved execution strategy.
- Not All Tokens Need 40 Steps: Heterogeneous Step Allocation in Diffusion Transformers for Efficient Video Generation
HSA assigns variable denoising steps to spatiotemporal tokens in DiTs based on velocity dynamics, with KV-cache sync and cached Euler updates, outperforming prior caching methods on quality-runtime tradeoffs for T2V a...
- Detecting AI-Generated Videos with Spiking Neural Networks
MAST with spiking neural networks achieves 93.14% mean accuracy detecting AI-generated videos from 10 unseen generators by exploiting smoother pixel residuals and compact semantic trajectories.
- UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors
UniVidX unifies diverse video generation tasks into one conditional diffusion model using stochastic condition masking, decoupled gated LoRAs, and cross-modal self-attention.
- TS-Attn: Temporal-wise Separable Attention for Multi-Event Video Generation
TS-Attn dynamically separates and rearranges attention in existing text-to-video models to improve temporal consistency and prompt adherence for videos with multiple sequential actions.
- UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models
UniGeo adds unified geometric guidance at three levels in video models to reduce geometric drift and improve structural fidelity in camera-controllable image editing.
- Long-CODE: Isolating Pure Long-Context as an Orthogonal Dimension in Video Evaluation
Long-CODE isolates long-context video evaluation with a new benchmark dataset and shot-dynamics metric that correlates better with human judgments on narrative richness and global consistency than short-video metrics.
- VEFX-Bench: A Holistic Benchmark for Generic Video Editing and Visual Effects
VEFX-Bench releases a large human-labeled video editing dataset, a multi-dimensional reward model, and a standardized benchmark that better matches human judgments than generic evaluators.
- Generative Refinement Networks for Visual Synthesis
GRN uses hierarchical binary quantization and entropy-guided refinement to set new ImageNet records of 0.56 rFID for reconstruction and 1.81 gFID for class-conditional generation while releasing code and models.
- Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \rightarrow G$): Vision-Geometry Backbones over Language and Video Models
Vision-geometry backbones using pretrained 3D world models outperform vision-language and video models for robotic manipulation by enabling direct mapping from visual input to geometric actions.
- Latent-Compressed Variational Autoencoder for Video Diffusion Models
A frequency-based latent compression method for video VAEs yields higher reconstruction quality than channel-reduction baselines at fixed compression ratios.
- Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator
Uni-ViGU unifies video generation and understanding by extending a diffusion video generator with unified continuous-discrete flow matching, modality-driven MoE layers, and bidirectional training stages that repurpose...
- INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling
INSPATIO-WORLD is a real-time framework for high-fidelity 4D scene generation and navigation from monocular videos via STAR architecture with implicit caching, explicit geometric constraints, and distribution-matching...
- DiffHDR: Re-Exposing LDR Videos with Video Diffusion Models
DiffHDR converts LDR videos to HDR by formulating the task as generative radiance inpainting in a video diffusion model's latent space, using Log-Gamma encoding and synthesized training data to achieve better fidelity...
- SCMAPR: Self-Correcting Multi-Agent Prompt Refinement for Complex-Scenario Text-to-Video Generation
SCMAPR is a self-correcting multi-agent prompt refinement framework that boosts text-to-video alignment and quality in complex scenarios, with reported gains on VBench, EvalCrafter, and a new T2V-Complexity benchmark.
- GENSERVE: Efficient Co-Serving of Heterogeneous Diffusion Model Workloads
GENSERVE improves SLO attainment by up to 44% for co-serving heterogeneous T2I and T2V diffusion workloads via step-level preemption, elastic parallelism, and joint scheduling.
- Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms
Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.
- Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning
Single-stage fine-tuning of a video model to generate actions as latent frames plus future states and values yields state-of-the-art robot policy performance on LIBERO, RoboCasa, and bimanual tasks.
- SkyReels-V2: Infinite-length Film Generative Model
SkyReels-V2 produces infinite-length film videos via MLLM-based captioning, progressive pretraining, motion RL, and diffusion forcing with non-decreasing noise schedules.
- Improving Video Generation with Human Feedback
A human preference dataset and VideoReward model enable Flow-DPO and Flow-NRG to produce smoother, better-aligned videos from text prompts in flow-based generators.
- Micro-Defects Expose Macro-Fakes: Detecting AI-Generated Images via Local Distributional Shifts
MDMF detects AI-generated images by learning patch-level forensic signatures and quantifying their distributional discrepancies with MMD, yielding larger separation than global methods when micro-defects are present.
- Video Generation with Predictive Latents
PV-VAE improves video latent spaces for generation by unifying reconstruction with future-frame prediction, reporting 52% faster convergence and 34.42 FVD gain over Wan2.2 VAE on UCF101.
- Motion-Aware Caching for Efficient Autoregressive Video Generation
MotionCache speeds up autoregressive video generation by 6.28x on SkyReels-V2 and 1.64x on MAGI-1 via motion-weighted cache reuse based on inter-frame differences, with negligible quality loss on VBench.
- Training-free, Perceptually Consistent Low-Resolution Previews with High-Resolution Image for Efficient Workflows of Diffusion Models
A commutator-zero condition enables training-free generation of perceptually consistent low-resolution previews for high-resolution diffusion model outputs, achieving up to 33% computation reduction.
- From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data
A survey introduces an interface-centric taxonomy for video-to-control methods in robotic manipulation and identifies the robotics integration layer as the central open challenge.
- EduStory: A Unified Framework for Pedagogically-Consistent Multi-Shot STEM Instructional Video Generation
EduStory combines pedagogical state modeling, structured script control, and new evaluation metrics to generate consistent multi-shot STEM videos while introducing the EduVideoBench diagnostic benchmark.
- Elucidating the SNR-t Bias of Diffusion Probabilistic Models
Diffusion models have an SNR-timestep mismatch during inference that the authors mitigate with per-frequency differential correction, raising generation quality across IDDPM, ADM, DDIM and others.
Reference graph
Works this paper leans on
[1] M. Bain, A. Nagrani, G. Varol, and A. Zisserman, "Frozen in time: A joint video and image encoder for end-to-end retrieval," in IEEE International Conference on Computer Vision, 2021.
[2] A. Blattmann et al., "Stable video diffusion: Scaling latent video diffusion models to large datasets," arXiv preprint arXiv:2311.15127, 2023.
[3] T. Brooks et al., "Video generation models as world simulators," 2024. [Online]. Available: https://openai.com/research/video-generation-models-as-world-simulators
[4] J. Chen et al., "PixArt-alpha: Fast training of diffusion transformer for photorealistic text-to-image synthesis," arXiv preprint arXiv:2310.00426, 2023.
[5] J. Chen et al., "PixArt-sigma: Weak-to-strong training of diffusion transformer for 4K text-to-image generation," in European Conference on Computer Vision, Springer, 2025, pp. 74–91.
[6] T.-S. Chen et al., "Panda-70M: Captioning 70M videos with multiple cross-modality teachers," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 13320–13331.
[7] PySceneDetect contributors, "Video cut detection and analysis tool," 2024. [Online]. Available: https://github.com/Breakthrough/PySceneDetect
[8] T. Dao, D. Fu, S. Ermon, A. Rudra, and C. Ré, "FlashAttention: Fast and memory-efficient exact attention with IO-awareness," Advances in Neural Information Processing Systems, vol. 35, pp. 16344–16359, 2022.
[9] M. Dehghani et al., "Patch n' Pack: NaViT, a vision transformer for any aspect ratio and resolution," Advances in Neural Information Processing Systems, vol. 36, 2024.
[10] P. Esser et al., "Scaling rectified flow transformers for high-resolution image synthesis," in Forty-first International Conference on Machine Learning, 2024.
[11] Y. Guo et al., "AnimateDiff: Animate your personalized text-to-image diffusion models without specific tuning," arXiv preprint arXiv:2307.04725, 2023.
[12] A. Gupta et al., "Photorealistic video generation with diffusion models," in European Conference on Computer Vision, Springer, 2025, pp. 393–411.
[13] J. Ho, A. Jain, and P. Abbeel, "Denoising diffusion probabilistic models," Advances in Neural Information Processing Systems, vol. 33, pp. 6840–6851, 2020.
[14] W. Hong, M. Ding, W. Zheng, X. Liu, and J. Tang, "CogVideo: Large-scale pretraining for text-to-video generation via transformers," arXiv preprint arXiv:2205.15868, 2022.
[15] Z. Huang et al., "VBench: Comprehensive benchmark suite for video generative models," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 21807–21818.
[16] M. Liao, Z. Zou, Z. Wan, C. Yao, and X. Bai, "Real-time scene text detection with differentiable binarization and adaptive scale fusion," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 1, pp. 919–931, 2022.
[17] B. Lin et al., "Open-Sora Plan: Open-source large video generation model," arXiv preprint arXiv:2412.00131, 2024.
[18] Y. Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le, "Flow matching for generative modeling," arXiv preprint arXiv:2210.02747, 2022.
[19] Z. Lu et al., "FiT: Flexible vision transformer for diffusion model," arXiv preprint arXiv:2402.12376, 2024.
[20] X. Ma et al., "Latte: Latent diffusion transformer for video generation," arXiv preprint arXiv:2401.03048, 2024.
[21] I. Molybog et al., "A theory on Adam instability in large-scale machine learning," arXiv preprint arXiv:2304.09871, 2023.
[22] W. Peebles and S. Xie, "Scalable diffusion models with transformers," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 4195–4205.
[23] D. Podell et al., "SDXL: Improving latent diffusion models for high-resolution image synthesis," arXiv preprint arXiv:2307.01952, 2023.
[24] C. Raffel et al., "Exploring the limits of transfer learning with a unified text-to-text transformer," Journal of Machine Learning Research, vol. 21, no. 140, pp. 1–67, 2020.
[25] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, "High-resolution image synthesis with latent diffusion models," 2021. arXiv: 2112.10752 [cs.CV].
[26] C. Schuhmann et al., "LAION-5B: An open large-scale dataset for training next generation image-text models," Advances in Neural Information Processing Systems, vol. 35, pp. 25278–25294, 2022.
[27] U. Singer et al., "Make-A-Video: Text-to-video generation without text-video data," arXiv preprint arXiv:2209.14792, 2022.
[28] A. Stergiou and R. Poppe, "AdaPool: Exponential adaptive pooling for information-retaining downsampling," 2021.
[29] J. Su, Y. Lu, S. Pan, A. Murtadha, B. Wen, and Y. Liu, "RoFormer: Enhanced transformer with rotary position embedding," Neurocomputing, 2023.
[30] Y. Tay et al., "UL2: Unifying language learning paradigms," arXiv preprint arXiv:2205.05131, 2022.
[31]
[32] W. Wang et al., "VideoFactory: Swap attention in spatiotemporal diffusions for text-to-video generation," 2023.
[33] Y. Wang et al., "LaVie: High-quality video generation with cascaded latent diffusion models," International Journal of Computer Vision, pp. 1–20, 2024.
[34] H. Xu et al., "Unifying flow, stereo and depth estimation," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
[35] L. Xu, Y. Zhao, D. Zhou, Z. Lin, S. K. Ng, and J. Feng, "PLLaVA: Parameter-free LLaVA extension from images to videos for video dense captioning," arXiv preprint arXiv:2404.16994, 2024.
[36] D. Yang et al., "Vript: A video is worth thousands of words," 2024. arXiv: 2406.06040 [cs.CV].
[37] L. Yu et al., "Language model beats diffusion -- tokenizer is key to visual generation," arXiv preprint arXiv:2310.05737, 2023.
[38] D. J. Zhang et al., "Show-1: Marrying pixel and latent diffusion models for text-to-video generation," International Journal of Computer Vision, pp. 1–15, 2024.