hub Baseline reference

Omni-video: Democra- tizing unified video understanding and generation

Zhiyu Tan, Hao Yang, Luozheng Qin, Jia Gong, Mengping Yang, Hao Li · 2025 · arXiv 2507.06119

Baseline reference. 60% of citing Pith papers use this work as a benchmark or comparison.

13 Pith papers citing it

Baseline 60% of classified citations

read on arXiv browse 13 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

baseline 3 background 2

citation-polarity summary

baseline 3 background 2

representative citing papers

Goku: A Million-Scale Universal Dataset and Benchmark for Instruction-Based Video Editing

cs.CV · 2026-06-29 · unverdicted · novelty 7.0 · 2 refs

Goku provides a 2M-pair dataset for multi-task structural video editing, Goku-Edit model with MLLM and dual-branch design, and Goku-Bench yielding up to 8% gains in instruction following.

Lumos-Nexus: Efficient Frequency Bridging with Homogeneous Latent Space for Video Unified Models

cs.CV · 2026-05-29 · unverdicted · novelty 6.0

Lumos-Nexus is a training-efficient video generation framework using two-stage alignment of a lightweight model followed by progressive frequency bridging to a high-fidelity generator in homogeneous latent space, plus the new VR-Bench for reasoning evaluation.

SANA-Streaming: Real-time Streaming Video Editing with Hybrid Diffusion Transformer

cs.CV · 2026-05-28 · unverdicted · novelty 6.0

SANA-Streaming delivers 1280x704 streaming video editing at 24 FPS end-to-end on an RTX 5090 using hybrid DiT blocks, cycle-reverse training, and mixed-precision quantization.

Reasoning to Align: Implicit Reasoning in Diffusion Transformers for Video Editing

cs.CV · 2026-05-23 · unverdicted · novelty 6.0

RVEDiT improves DiT-based video editing by granularity-routed token conditioning and reference-anchored attention alignment to achieve better temporal coherence and localized edits.

UniVL: Unified Vision-Language Embedding for Spatially Grounded Contextual Image Generation

cs.CV · 2026-05-20 · unverdicted · novelty 6.0

UniVL unifies vision and language into one mask-rendered input processed by an OCR backbone to condition diffusion models for spatially grounded image generation without a standalone text encoder.

Lance: Unified Multimodal Modeling by Multi-Task Synergy

cs.CV · 2026-05-18 · unverdicted · novelty 6.0 · 2 refs

Lance presents a dual-stream mixture-of-experts model with modality-aware positional encoding and staged multi-task training that outperforms prior open-source unified models on image and video generation while keeping strong understanding performance.

InsEdit: Towards Instruction-based Visual Editing via Data-Efficient Video Diffusion Models Adaptation

cs.CV · 2026-04-09 · unverdicted · novelty 6.0

InsEdit adapts a video diffusion backbone for text-instruction video editing via Mutual Context Attention, achieving SOTA open-source results with O(100K) data while also supporting image editing.

ImVideoEdit: Image-learning Video Editing via 2D Spatial Difference Attention Blocks

cs.CV · 2026-04-09 · unverdicted · novelty 6.0

ImVideoEdit learns video editing from 13K image pairs by decoupling spatial modifications from frozen temporal dynamics in pretrained models, matching larger video-trained systems in fidelity and consistency.

LLaMo: Scaling Pretrained Language Models for Unified Motion Understanding and Generation with Continuous Autoregressive Tokens

cs.CV · 2026-02-12 · unverdicted · novelty 6.0

LLaMo scales pretrained LLMs for unified motion-language tasks by encoding motion into continuous causal latents and adding a flow-matching head for real-time autoregressive generation and captioning.

Bridging Video Understanding and Generation in a Unified Framework

cs.CV · 2026-06-30 · unverdicted · novelty 5.0

Vega unifies video understanding and generation via shared vocabulary and hybrid autoregressive-diffusion architecture, reporting strong results on VBench and VideoMME.

Bernini: Latent Semantic Planning for Video Diffusion

cs.CV · 2026-05-21 · unverdicted · novelty 5.0

Bernini is a framework that uses an MLLM planner to output semantic representations for a DiT renderer to generate or edit videos, reporting SOTA benchmark performance.

Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE

cs.CV · 2026-05-04 · unverdicted · novelty 4.0

Mamoda2.5 is a 25B-parameter DiT-MoE unified AR-Diffusion model that reaches top video generation and editing benchmarks with 4-step inference up to 95.9x faster than baselines.

Valley3: Scaling Omni Foundation Models for E-commerce

cs.AI · 2026-05-02 · unverdicted · novelty 4.0

Valley3 is an omni MLLM for e-commerce that uses a four-stage pre-training pipeline plus post-training for controllable reasoning and agentic search, outperforming baselines on e-commerce benchmarks while staying competitive on general ones.

citing papers explorer

Showing 2 of 2 citing papers after filters.

Lance: Unified Multimodal Modeling by Multi-Task Synergy cs.CV · 2026-05-18 · unverdicted · none · ref 100 · 2 links
Lance presents a dual-stream mixture-of-experts model with modality-aware positional encoding and staged multi-task training that outperforms prior open-source unified models on image and video generation while keeping strong understanding performance.
ImVideoEdit: Image-learning Video Editing via 2D Spatial Difference Attention Blocks cs.CV · 2026-04-09 · unverdicted · none · ref 31
ImVideoEdit learns video editing from 13K image pairs by decoupling spatial modifications from frozen temporal dynamics in pretrained models, matching larger video-trained systems in fidelity and consistency.

Omni-video: Democra- tizing unified video understanding and generation

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer