{"total":12,"items":[{"citing_arxiv_id":"2606.01022","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ProductWebGen: Benchmarking Multimodal Product Webpage Generation","primary_cat":"cs.CV","submitted_at":"2026-05-31T05:25:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces ProductWebGen benchmark for multimodal product webpage generation, compares editing-based vs unified-model workflows on 500 samples, and releases ProductWebGen-1k SFT dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18678","ref_index":112,"ref_count":4,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Lance: Unified Multimodal Modeling by Multi-Task Synergy","primary_cat":"cs.CV","submitted_at":"2026-05-18T17:18:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Lance presents a dual-stream mixture-of-experts model with modality-aware positional encoding and staged multi-task training that outperforms prior open-source unified models on image and video generation while keeping strong understanding performance.","context_count":2,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"We further provide qualitative results for both image and video editing in Figure 12. For image editing, Lance achieves visually coherent image editing with well-preserved structures and realistic 17 (a) VBench Metrics Part I Models Params. Quality Score Semantic Score Subj. Consist. Bkg. Consist. Temp. Flicker. Motion Smooth. Dynamic Degree Aesthetic Quality Imaging Quality Object Class Generation-only Models ModelScope [113] 1.7B 78.05 66.54 89.87 95.29 98.28 95.79 66.39 52.06 58.57 82.25 LaVie [117] 3B 78.78 70.31 91.41 97.47 98.30 96.38 49.72 54.94 61.90 91.82 Show-1 [144] 6B 80.42 72.98 95.53 98.02 99.12 98.24 44.44 57.35 58.66 93.07 AnimateDiff-V2 [36] - 82.90 69.75 95.30 97.68 98.75 97.76 40.83 67.16 70.10 90.90 VideoCrafter-2.0 [10] - 82.20 73.42 96.85 98.22 98."},{"citing_arxiv_id":"2605.12500","ref_index":130,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture","primary_cat":"cs.CV","submitted_at":"2026-05-12T17:59:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"68 90.29 88.77 85.18 BAGEL [28] 7B 88.94 90.37 91.29 90.82 88.67 85.07 LongCat-Next [125] 68BA3B - - - - - 84.66 OneCAT [66] 9BA3B - - - - - 84.53 Mogao [74] 7B - - - - - 84.33 Janus-Pro [18] 7B 86.90 88.90 89.40 89.32 89.48 84.19 SD3-Medium [34] 2B 87.90 91.01 88.83 80.70 88.68 84.08 FLUX.1-dev [60] 12B 74.35 90.00 88.96 90.87 88.33 83.84 Ovis-U1 [130] 1.2B 82.37 90.08 88.68 93.35 85.20 83.72 OmniGen2 [141] 4B 88.81 88.83 90.18 89.37 90.27 83.57 BLIP3-o [14] 1.4B - - - - - 81.60 UniWorld-V1 [75] 12B 83.64 88.39 88.44 89.27 87.22 81.38 Table 6 Quantitative evaluation results on DPG-Bench. The parameters of the generation component are denoted as# Params;A in this column denotes activated parameters, e."},{"citing_arxiv_id":"2605.11400","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"UniPath: Adaptive Coordination of Understanding and Generation for Unified Multimodal Reasoning","primary_cat":"cs.MM","submitted_at":"2026-05-12T01:43:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"UniPath adaptively models coordination-path diversity in unified multimodal models by training a path-conditioned executor and using a lightweight planner for input-dependent selection, improving performance over fixed strategies.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.24763","ref_index":41,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation","primary_cat":"cs.CV","submitted_at":"2026-04-27T17:59:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Tuna-2 shows that direct pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive generation and stronger understanding at scale.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17850","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"UniCSG: Unified High-Fidelity Content-Constrained Style-Driven Generation via Staged Semantic and Frequency Disentanglement","primary_cat":"cs.CV","submitted_at":"2026-04-20T05:59:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"UniCSG adds staged semantic disentanglement and frequency-aware reconstruction to DiT diffusion models to improve content preservation and style fidelity in both text- and reference-guided generation.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"mentary measures. For style consistency[30], we use FID[12], CLIP-T[27], and CSD[29] to assess style alignment from multiple perspectives. Baselines.We compare UniCSG against eight representative and state-of- the-art baselines: OmniConsistency[31], OmniStyle[34], flux1.Kontext-dev[15], Nano-banana[4], DreamOmni2[40], OmniGen2[38], BAGEL[5], and Ovis-U1[32]. 10 Yang et al. T able 1.Quantitative results for text-guided style transfer on CSG-Bench. Method FID↓CSD↑CLIP-T↑CLIP-I↑DINO↑DreamSim↑ OmniConsistency[31]117.0790.503 0.266 0.705 0.487 0.712 OmniStyle[34] 134.415 0.275 0.247 0.737 0.620 0.715 flux1.Kontext-dev[15] 120.649 0.430 0.251 0.772 0.588 0.741 Nano-banana[4] 128.162 0.534 0.2490.792 0.644 0."},{"citing_arxiv_id":"2604.17565","ref_index":68,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models","primary_cat":"cs.CV","submitted_at":"2026-04-19T18:11:08+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Image editing aims to modify visual content in a controllable manner [34]. Early diffusion-based approaches rely on training-free inversion [12,20,31,36,53-55,59,62,80]andmodelfine-tuning[8,13,35,37,85,87,91].Recently, this field has been further advanced by large-scale text-to-image foundation models [3,25,41,42,63,69,83,90] and unified auto-regressive architectures [17, 21,22,45,46,50,68,73,74,78] for fine-grained semantic control. While excelling at appearance manipulation, these approaches generally lack explicit spatial viewpoint control. Moving beyond general semantic editing, recent works explore camera-controllable image editing [7,16,26,49,52,60,61,77]. However, as these methods are predominantly based on the image diffusion models, they often face"},{"citing_arxiv_id":"2604.10949","ref_index":62,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Pseudo-Unification: Entropy Probing Reveals Divergent Information Patterns in Unified Multimodal Models","primary_cat":"cs.CV","submitted_at":"2026-04-13T03:46:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Unified multimodal models exhibit pseudo-unification due to modality-asymmetric entropy encoding and pattern-split responses between text and image generation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[60] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko- reit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. Attention is all you need. 2017. 2 [61] Elena V oita, Rico Sennrich, and Ivan Titov. The bottom- up evolution of representations in the transformer: A study with machine translation and language modeling objectives. 2019. 3 [62] Guo-Hua Wang, Shanshan Zhao, Xinjie Zhang, Liangfu Cao, Pengxin Zhan, Lunhao Duan, Shiyin Lu, Minghao Fu, Xiao- hao Chen, Jianshan Zhao, et al. Ovis-u1 technical report. arXiv preprint arXiv:2506.23044, 2025. 2 [63] Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is"},{"citing_arxiv_id":"2604.04487","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Training-Free Image Editing with Visual Context Integration and Concept Alignment","primary_cat":"cs.CV","submitted_at":"2026-04-06T07:26:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VicoEdit performs training-free image editing by transforming source images directly with visual context and concept-alignment-guided posterior sampling, outperforming training-based methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.06663","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PlanViz: Evaluating Planning-Oriented Image Generation and Editing for Computer-Use Tasks","primary_cat":"cs.CV","submitted_at":"2026-02-06T12:47:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PlanViz is a new benchmark with three sub-tasks and PlanScore metric to evaluate planning-oriented image generation and editing by unified multimodal models for computer-use tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.07064","ref_index":79,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OmniFysics: Towards Physical Intelligence Evolution via Omni-Modal Signal Processing and Network Optimization","primary_cat":"cs.CV","submitted_at":"2026-02-05T14:04:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"OmniFysics is an omni-modal network using a dynamic physical data engine and evolutive tuning to improve performance on multimodal benchmarks and physics-oriented tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.01554","ref_index":60,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"InfoTok: Information-Theoretic Regularization for Capacity-Constrained Shared Visual Tokenization in Unified MLLMs","primary_cat":"cs.LG","submitted_at":"2026-02-02T02:47:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"InfoTok uses mutual information constraints to regularize shared visual tokenization in unified MLLMs, improving both understanding and generation performance without extra training data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}