{"total":13,"items":[{"citing_arxiv_id":"2606.31326","ref_index":51,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Bridging Video Understanding and Generation in a Unified Framework","primary_cat":"cs.CV","submitted_at":"2026-06-30T08:29:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Vega unifies video understanding and generation via shared vocabulary and hybrid autoregressive-diffusion architecture, reporting strong results on VBench and VideoMME.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.30599","ref_index":29,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Goku: A Million-Scale Universal Dataset and Benchmark for Instruction-Based Video Editing","primary_cat":"cs.CV","submitted_at":"2026-06-29T17:38:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Goku provides a 2M-pair dataset for multi-task structural video editing, Goku-Edit model with MLLM and dual-branch design, and Goku-Bench yielding up to 8% gains in instruction following.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.31603","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Lumos-Nexus: Efficient Frequency Bridging with Homogeneous Latent Space for Video Unified Models","primary_cat":"cs.CV","submitted_at":"2026-05-29T17:59:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Lumos-Nexus is a training-efficient video generation framework using two-stage alignment of a lightweight model followed by progressive frequency bridging to a high-fidelity generator in homogeneous latent space, plus the new VR-Bench for reasoning evaluation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30409","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SANA-Streaming: Real-time Streaming Video Editing with Hybrid Diffusion Transformer","primary_cat":"cs.CV","submitted_at":"2026-05-28T17:59:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SANA-Streaming delivers 1280x704 streaming video editing at 24 FPS end-to-end on an RTX 5090 using hybrid DiT blocks, cycle-reverse training, and mixed-precision quantization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.24674","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Reasoning to Align: Implicit Reasoning in Diffusion Transformers for Video Editing","primary_cat":"cs.CV","submitted_at":"2026-05-23T17:22:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RVEDiT improves DiT-based video editing by granularity-routed token conditioning and reference-anchored attention alignment to achieve better temporal coherence and localized edits.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22344","ref_index":66,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Bernini: Latent Semantic Planning for Video Diffusion","primary_cat":"cs.CV","submitted_at":"2026-05-21T11:30:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Bernini is a framework that uses an MLLM planner to output semantic representations for a DiT renderer to generate or edit videos, reporting SOTA benchmark performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21611","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"UniVL: Unified Vision-Language Embedding for Spatially Grounded Contextual Image Generation","primary_cat":"cs.CV","submitted_at":"2026-05-20T18:17:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"UniVL unifies vision and language into one mask-rendered input processed by an OCR backbone to condition diffusion models for spatially grounded image generation without a standalone text encoder.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18678","ref_index":100,"ref_count":4,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Lance: Unified Multimodal Modeling by Multi-Task Synergy","primary_cat":"cs.CV","submitted_at":"2026-05-18T17:18:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Lance presents a dual-stream mixture-of-experts model with modality-aware positional encoding and staged multi-task training that outperforms prior open-source unified models on image and video generation while keeping strong understanding performance.","context_count":2,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"(Image to Text) UND. (Video to Text) GEN. (Image) GEN. (Video) Emergent GeneralizationCap. Per. Rea. Cap. Per. Rea. T2I Edit S2I T2V I2V Edit S2V Non-native Unified MetaQuery-XL [89]✓ ✓ ✓ ✓ ✓ SEED-X [33]✓ ✓ ✓ ✓ ✓ TokenFlow-XL [93]✓ ✓ ✓ ✓ ILLUME [111]✓ ✓ ✓ ✓ ✓ InternVL-U [104]✓ ✓ ✓ ✓ ✓ UniVideo [120]✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓✓ Native Unified Chameleon [101]✓ ✓ ✓ ✓ L WM [70]✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ Janus [124]✓ ✓ ✓ ✓ Janus-Pro [14]✓ ✓ ✓ ✓ Transfusion [150]✓ ✓ ✓ ✓ Emu3 [116]✓ ✓ ✓△ △ △✓ ✓ Show-o [134]✓ ✓ ✓ ✓ ✓ Show-o2 [135]✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓△ Bagel [23]✓ ✓ ✓ ✓ ✓ ✓✓ Mogao [64]✓ ✓ ✓ ✓△ △ HaploOmni [133]✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ VILA-U [131]✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ HunyuanImage 3.0 [8]△ △ △✓ ✓ Emu3.5 [19]✓ ✓ ✓△ △ △✓ ✓△ △ △✓ TUNA [78]✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓"},{"citing_arxiv_id":"2605.02641","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE","primary_cat":"cs.CV","submitted_at":"2026-05-04T14:26:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Mamoda2.5 is a 25B-parameter DiT-MoE unified AR-Diffusion model that reaches top video generation and editing benchmarks with 4-step inference up to 95.9x faster than baselines.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"video editing compounds the efficiency challenge of generation: beyond the cost of denoising the output video, the model must also encode the source video as conditioning input, significantly increasing both memory footprint and inference latency. Recent open-source efforts such as VInO [17], which couples a VLM with an MMDiT backbone, and OmniVideo [18], which connects an MLLM to a diffusion decoder via a lightweight adapter, have begun to explore unified video editing. However, these models adopt dense architectures and still exhibit limited performance in complex scenarios involving large motion, multi-object manipulation, or fine-grained instruction adherence. Motivated by these observations, we introduce Mamoda2."},{"citing_arxiv_id":"2605.01278","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Valley3: Scaling Omni Foundation Models for E-commerce","primary_cat":"cs.AI","submitted_at":"2026-05-02T06:25:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Valley3 is an omni MLLM for e-commerce that uses a four-stage pre-training pipeline plus post-training for controllable reasoning and agentic search, outperforming baselines on e-commerce benchmarks while staying competitive on general ones.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08646","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"InsEdit: Towards Instruction-based Visual Editing via Data-Efficient Video Diffusion Models Adaptation","primary_cat":"cs.CV","submitted_at":"2026-04-09T17:59:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"InsEdit adapts a video diffusion backbone for text-instruction video editing via Mutual Context Attention, achieving SOTA open-source results with O(100K) data while also supporting image editing.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"consistency over time or edits that happen only in part of a video. An emerging line of work tries to unify visual understanding, generation, and editing across different modalities. Representative examples include UniWorld [21], DreamVE [40], InstructX [26], Uni- Video [35], OmniV2V [19], VACE [13], UNIC [44], EditVerse [14], UniVid [24], and Kling-Omni [31]. These studies suggest that image and video editing can benefit from shared backbones and shared instruction-following ability. However, existing unified frameworks focus mainly on sharing the architecture, whereas our focus is to adapt a video generation backbone and redesign the video training data itself. In our setting, image editing is an additional capabil-"},{"citing_arxiv_id":"2604.07958","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ImVideoEdit: Image-learning Video Editing via 2D Spatial Difference Attention Blocks","primary_cat":"cs.CV","submitted_at":"2026-04-09T08:22:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ImVideoEdit learns video editing from 13K image pairs by decoupling spatial modifications from frozen temporal dynamics in pretrained models, matching larger video-trained systems in fidelity and consistency.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"design, the training is memory-efficient. It consumes only approximately 20 GB of VRAM per GPU, making it acces- sible to trainImVideoEditeven on a single 3090 GPU. Baseline SettingsTo comprehensively evaluate the su- periority ofImVideoEdit, we benchmark our framework against several recent state-of-the-art video editing mod- els, including V ACE(1.3B & 14B) [13], OmniVideo2-1.3B [31, 40], Lucy-Edit-Dev [33], Kiwi-Edit [17], DITTO [2], and ICVE [16]. To ensure a strictly fair comparison, all baseline methods are evaluated utilizing their official code- bases and default inference hyperparameters. Evaluation Dataset.We construct a meticulously curated testing benchmark encompassing 10 predefined video edit- ing categories, with 25 high-quality samples allocated for"},{"citing_arxiv_id":"2602.12370","ref_index":60,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LLaMo: Scaling Pretrained Language Models for Unified Motion Understanding and Generation with Continuous Autoregressive Tokens","primary_cat":"cs.CV","submitted_at":"2026-02-12T20:02:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLaMo scales pretrained LLMs for unified motion-language tasks by encoding motion into continuous causal latents and adding a flow-matching head for real-time autoregressive generation and captioning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}