{"total":46,"items":[{"citing_arxiv_id":"2606.31711","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Arena-T2I Hard: Benchmarking and Improving Faithfulness with Dependency-Aware Checklist","primary_cat":"cs.AI","submitted_at":"2026-06-30T14:17:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Arena-T2I Hard benchmark with ~30 decomposed constraints per prompt and a dependency-aware checklist reward yields better faithfulness-aesthetics trade-off than single-reward or weighted-sum baselines on SD3.5-Medium and FLUX.1-dev.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.31082","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Fleet: Few Shots Lead Effective AI-generated Image Detection","primary_cat":"cs.CV","submitted_at":"2026-06-30T03:15:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Fleet achieves dynamic few-shot adaptation for AIGI detection via avoidance routing in decoupled subspaces, raising accuracy from 20.4% to 73.1% on new generators like Doubao Seedream 4.0 with 10 shots.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.30054","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Illuminating Unified Multimodal Model for Free-form Interleaved Text-Image Generation","primary_cat":"cs.CV","submitted_at":"2026-06-29T09:45:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"ILLUME-X is a unified multimodal model that generates free-form interleaved text-image sequences via an expanded data pipeline, progressive self-adaptive training, and ILScore evaluation, claiming 
outperformance over prior unified models on style transfer, image decomposition, and storytelling.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.27608","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Qwen-Image-2.0-RL Technical Report","primary_cat":"cs.CV","submitted_at":"2026-06-25T23:49:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"Applies RLHF with composite VLM-based reward models and on-policy distillation to a diffusion model, reporting benchmark gains of +2.61 on Qwen-Image-Bench and Elo improvements of +78/+93.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.21030","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"FlowCodec: One-Step Flow Prior for Generative Image Compression","primary_cat":"eess.IV","submitted_at":"2026-06-19T01:44:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FlowCodec decouples latent compression from one-step flow-based transport to plug pretrained text-to-image models into ultra-low-bitrate codecs with under 0.54% trainable parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.20100","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"WeGenBench: A Multidimensional Diagnostic Benchmark towards Text-to-Image Model Optimization","primary_cat":"cs.CV","submitted_at":"2026-06-18T11:20:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"WeGenBench provides 4000 bilingual prompts with scene and tag annotations plus VLM-derived metrics to locate specific deficiencies in 
text-to-image models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.13679","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"InterleaveThinker: Reinforcing Agentic Interleaved Generation","primary_cat":"cs.CV","submitted_at":"2026-06-11T17:59:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"InterleaveThinker is the first multi-agent pipeline enabling interleaved generation in any image generator through planner-critic agents, SFT on custom datasets, and GRPO RL with accuracy and step-wise rewards.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.13289","ref_index":30,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers","primary_cat":"cs.CV","submitted_at":"2026-06-11T12:46:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HYDRA-X presents the first unified multimodal model using a single ViT for holistic image-video tokenization, with ablations on attention and compression plus a latent-level editing improvement.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.08492","ref_index":2,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Seeing is Believing: Aligning Prompt Rewriting with Visual Anchors for Text-to-Image Generation","primary_cat":"cs.CV","submitted_at":"2026-06-07T07:34:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"FaithRewriter is a prompt-enhancement framework that uses an MLLM-generated image as a visual anchor to guide LLM-based rewriting, producing 
prompts more faithful to user intent than fluency-only baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.05949","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Faithful, Enriched, and Precise: Benchmarking Natural-Science Illustration Generation by T2I models","primary_cat":"cs.CV","submitted_at":"2026-06-04T09:49:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces FEPBench benchmark to evaluate T2I models on instruction faithfulness, reasoning enrichment, and semantic precision for natural-science illustrations using atom set annotations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.05730","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TextWand: A Unified Framework for Scene Text Editing","primary_cat":"cs.CV","submitted_at":"2026-06-04T05:43:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TextWand unifies scene text removal, generation and replacement via rendering/erasure decomposition, ORPE for layout fidelity, RAS for clean erasure, and the new TextWand-Bench dataset, claiming superior accuracy and quality over prior models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.03168","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"JAVEDIT: Joint Audio-Visual Instruction-Guided Video Editing with Agentic Data Curation","primary_cat":"cs.CV","submitted_at":"2026-06-02T05:26:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"JAVEdit-100k is the first large-scale dataset for 
instruction-guided joint audio-visual video editing, accompanied by JAVEditBench and the JAVEdit model that outperforms baselines on five of six metrics.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00188","ref_index":37,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PaintBench: Deterministic Evaluation of Precise Visual Editing","primary_cat":"cs.GR","submitted_at":"2026-05-29T16:01:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PaintBench provides a scalable deterministic benchmark for precise visual editing operations, revealing that even the best of 11 models achieves only 17.1% mIoU and that scores correlate strongly with applied data visualization editing performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28091","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Qwen-Image-Bench: From Generation to Creation in Text-to-Image Evaluation","primary_cat":"cs.CV","submitted_at":"2026-05-27T07:46:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Qwen-Image-Bench introduces a hierarchical creator-centric benchmark with 1000 prompts, 23 sub-capabilities, and a Q-Judger model that scores images on 56 verifiable facets to distinguish T2I models on fidelity and creativity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23522","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Precise: SDE-Consistent Stochastic Sampling for RL Post-Training of Flow-Matching 
Models","primary_cat":"cs.LG","submitted_at":"2026-05-22T11:37:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Precise is a new SDE-consistent stochastic sampler that balances exploration and stability for RL post-training of flow-matching models via a novel posterior-mean approximation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22344","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Bernini: Latent Semantic Planning for Video Diffusion","primary_cat":"cs.CV","submitted_at":"2026-05-21T11:30:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Bernini is a framework that uses an MLLM planner to output semantic representations for a DiT renderer to generate or edit videos, reporting SOTA benchmark performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21605","ref_index":6,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"GenEvolve: Self-Evolving Image Generation Agents via Tool-Orchestrated Visual Experience Distillation","primary_cat":"cs.CV","submitted_at":"2026-05-20T18:12:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"GenEvolve introduces a self-evolving agent framework for image generation using tool-orchestrated trajectories and Visual Experience Distillation to achieve claimed SOTA results on benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21573","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Lens: Rethinking Training Efficiency for Foundational Text-to-Image 
Models","primary_cat":"cs.CV","submitted_at":"2026-05-20T17:59:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Lens is a 3.8B-parameter text-to-image model that reaches competitive or superior performance to >6B-parameter systems using 19.3% of the training compute of Z-Image through a densely captioned 800M dataset, multi-resolution batching, semantic VAE, strong language encoder, RL fine-tuning, and 4-step","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21487","ref_index":17,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning","primary_cat":"cs.CV","submitted_at":"2026-05-20T17:59:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Uni-Edit introduces a data synthesis pipeline turning VQA data into reasoning-intensive editing instructions, enabling single-task tuning that boosts all three capabilities in models like BAGEL and Janus-Pro.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18678","ref_index":8,"ref_count":4,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Lance: Unified Multimodal Modeling by Multi-Task Synergy","primary_cat":"cs.CV","submitted_at":"2026-05-18T17:18:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Lance presents a dual-stream mixture-of-experts model with modality-aware positional encoding and staged multi-task training that outperforms prior open-source unified models on image and video generation while keeping strong understanding performance.","context_count":2,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"UniVideo [120]✓ ✓ ✓ 
✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓✓ Native Unified Chameleon [101]✓ ✓ ✓ ✓ L WM [70]✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ Janus [124]✓ ✓ ✓ ✓ Janus-Pro [14]✓ ✓ ✓ ✓ Transfusion [150]✓ ✓ ✓ ✓ Emu3 [116]✓ ✓ ✓△ △ △✓ ✓ Show-o [134]✓ ✓ ✓ ✓ ✓ Show-o2 [135]✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓△ Bagel [23]✓ ✓ ✓ ✓ ✓ ✓✓ Mogao [64]✓ ✓ ✓ ✓△ △ HaploOmni [133]✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ VILA-U [131]✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ HunyuanImage 3.0 [8]△ △ △✓ ✓ Emu3.5 [19]✓ ✓ ✓△ △ △✓ ✓△ △ △✓ TUNA [78]✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ TUNA-2 [79]✓ ✓ ✓ ✓ ✓ Lance (Ours) ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ Table 1 Comparison of multimodal unified models by supported task categories. ✓ indicates explicit support; △ indicates description-only support without official code; blank cells indicate no explicit report. Cap., Per., Rea."},{"citing_arxiv_id":"2605.17834","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Stabilizing, Scaling & Enhancing MeanFlow for Large-scale Diffusion Distillation","primary_cat":"cs.CV","submitted_at":"2026-05-18T04:16:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Stabilizes MeanFlow for large-scale diffusion distillation via discrete warm-up and trajectory alignment, reporting better results on FLUX.1-dev and HunyuanImage 3.0.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17311","ref_index":51,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SpecSem-Net: Integrating Spectral and Semantic Features for Robust AI-generated Video Detection","primary_cat":"cs.CV","submitted_at":"2026-05-17T08:02:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SpecSem-Net integrates Fourier-based spectral filtering with semantic-guided gated merging to detect AI-generated videos, reporting 87.25% accuracy on a new benchmark of five commercial generators and 95.59% on public 
datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14876","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning","primary_cat":"cs.CV","submitted_at":"2026-05-14T14:22:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CLVR framework adds closed-loop visual verification, proxy prompt reinforcement learning, and delta-space weight merge to improve complex text-to-image generation over single-step or unverified multi-step baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13565","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Qwen-Image-VAE-2.0 Technical Report","primary_cat":"cs.CV","submitted_at":"2026-05-13T14:04:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Qwen-Image-VAE-2.0 achieves state-of-the-art high-compression image reconstruction and superior diffusability for diffusion models, with a new text-rich document benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12500","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture","primary_cat":"cs.CV","submitted_at":"2026-05-12T17:59:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify 
architecture.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"8 14.3 / 60.1 GPT-Image-1 [100] - 21.4 / 60.2 6.8 / 48.6 8.6 / 41.0 7.8 / 60.0 11.2 / 52.4 Open-source Models SenseNova-U1 8B 61.6 / 81.6 47.5 / 72.8 46.3 / 74.6 3.5 / 17.9 39.7 / 61.7 SenseNova-U1 8BA3B 50.9 / 76.7 35.5 / 60.9 24.5 / 58.7 2.0 / 11.5 28.2 / 51.9 Emu3.5 [23] 32B 30.4 / 63.4 14.2 / 52.6 7.0 / 33.6 1.2 / 11.0 13.2 / 40.2 HunyuanImage-3.0 [11] 80BA13B 27.8 / 65.0 13.8 / 53.6 10.2 / 39.6 0.0 / 2.0 13.0 / 40.1 Z-Image [7] 6B 26.8 / 69.2 2.6 / 47.6 2.8 / 45.0 0.6 / 13.2 8.2 / 43.8 Qwen-Image-2512 [139] 20B 22.2 / 70.6 1.2 / 47.8 1.8 / 39.2 0.0 / 6.4 6.3 / 41.0 FLUX.2-dev [61] 32B 17.2 / 67.8 1.2 / 49.2 1.0 / 43.0 0.0 / 8.2 4.9 / 42.0 Qwen-Image [139] 20B 10.4 / 51.2 0.2 / 22.2 0.6 / 17.6 0."},{"citing_arxiv_id":"2605.10730","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Qwen-Image-2.0 Technical Report","primary_cat":"cs.CV","submitted_at":"2026-05-11T15:34:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Qwen-Image-2.0 unifies high-fidelity image generation and precise editing by coupling Qwen3-VL with a Multimodal Diffusion Transformer, improving text rendering, photorealism, and complex prompt following over prior versions.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"Here, $G_\\theta$ is used broadly: $x_\\theta$ may be the final clean sample obtained after the full few-step student trajectory, or a clean state directly predicted from an intermediate student state conditioned on $c$. The gradient of the DMD objective $\\ell_{\\mathrm{DMD}}(\\theta)$ with respect to the student parameters $\\theta$ is then given by $\\nabla_\\theta \\ell_{\\mathrm{DMD}}(\\theta) = \\mathbb{E}_{c\\sim p(c),\\ \\epsilon\\sim\\mathcal{N}(0,I),\\ \\xi\\sim\\mathcal{N}(0,I),\\ t\\sim p(t)}\\left[\\left(s_{\\mathrm{fake}}(x_t,t,c) - s_{\\mathrm{real}}(x_t,t,c)\\right)\\nabla_\\theta x_\\theta\\right]$, (4) Figure 11: Qualitative comparison between the multi-step teacher and the few-step distilled student. 
The top row shows images generated by Qwen-Image-2.0-RL with 40 sampling steps, while the bottom row shows images generated by Qwen-Image-2.0-Distillation with only 4 NFEs. Across diverse prompts, including portraits, landscapes, and natural scenes, the 4-NFE student preserves visual quality, semantic"},{"citing_arxiv_id":"2605.07402","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"InsHuman: Towards Natural and Identity-Preserving Human Insertion","primary_cat":"cs.CV","submitted_at":"2026-05-08T07:58:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"InsHuman proposes Human-Background Adaptive Fusion, Face-to-Face ID-Preserving, and Bidirectional Data Pairing to enable natural human insertion in images without altering identity.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"through training-time supervision: it constructs one-to-one face correspondences via the Hungarian 6 Table 1:Quantitative comparisons among InsHuman and other image editing models.InsHuman achieved best or second-best performance in the following metrics. Method IDS↑BM % ↓PCE % ↓BD % ↓BL % ↓FR % ↓ FLUX.2 [5]0.610.76 11.45 6.87 5.34 21.37 DreamOmni2[6] 0.26 60.30 29.00 11.45 3.05 79.39 HunyuanImage-3.0-instruct[7] 0.21 5.34 25.95 8.400.7633.59 OmniGen2[8] 0.28 13.00 31.30 15.27 4.58 46.56 Qwen-Image-Edit-2509[9] 0.50 21.37 29.01 7.63 13.00 59.54 InsHuman (Ours)0.550.76 3.82 3.823.0510.69 algorithm and minimizes identity feature distances per matched pair, with no injection required at inference. 
To our knowledge, we are the first to simultaneously learn the interactive relationship"},{"citing_arxiv_id":"2605.05204","ref_index":6,"ref_count":3,"confidence":0.98,"is_internal_anchor":true,"paper_title":"D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models","primary_cat":"cs.CV","submitted_at":"2026-05-06T17:59:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"D-OPSD formulates supervised fine-tuning of step-distilled diffusion models as on-policy self-distillation by having the model act as both teacher (with multimodal context) and student (with text-only context) on its own roll-outs.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"com/black-forest-labs/flux (2023) [4] Black Forest Labs: FLUX.2: Frontier Visual Intelligence. https://bfl.ai/blog/flux-2 (2025) [5] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877-1901 (2020) [6] Cao, S., Chen, H., Chen, P., Cheng, Y., Cui, Y., Deng, X., Dong, Y., Gong, K., Gu, T., Gu, X., et al.: Hunyuanimage 3.0 technical report. arXiv preprint arXiv:2509.23951 (2025) [7] Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. 
In: Proceedings of the"},{"citing_arxiv_id":"2604.28185","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling","primary_cat":"cs.CV","submitted_at":"2026-04-30T17:59:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemphasizing perceptual quality.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"image generation, where human judgments are often more reliable as pairwise comparisons than as absolute scores. Diffusion-DPO (Wallace et al., 2024) adapts direct preference optimization to diffusion models by replacing explicit reward maximization with a preference loss over winner-loser pairs $D=\\{(c,x_w,x_l)\\}$: $\\mathcal{L}_{\\mathrm{DPO}} = -\\mathbb{E}_{(x_w,x_l)\\sim D}\\left[\\log\\sigma\\left(\\beta\\log\\frac{\\pi_\\theta(x_w)}{\\pi_{\\mathrm{ref}}(x_w)} - \\beta\\log\\frac{\\pi_\\theta(x_l)}{\\pi_{\\mathrm{ref}}(x_l)}\\right)\\right]$. (8) The conceptual shift is important: alignment no longer depends on first fitting a separate reward model, but instead updates the generator directly to increase the relative likelihood of preferred outputs. Recent variants such as VideoDPO (Liu et al., 2025e) extend this logic from single images to trajectory-level preferences. 
In unified settings, DPO-style post-training is increasingly used to rebalance understanding and generation"},{"citing_arxiv_id":"2604.25636","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Refinement via Regeneration: Enlarging Modification Space Boosts Image Refinement in Unified Multimodal Models","primary_cat":"cs.CV","submitted_at":"2026-04-28T13:36:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Refinement via Regeneration (RvR) reformulates image refinement in unified multimodal models as conditional regeneration using prompt and semantic tokens from the initial image, yielding higher alignment scores than editing-based methods.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"In a Van Gogh oil painting, next to a huge white polar bear stands a small and exquisite black tuxedo penguin, looking at the iceberg in the distance together. In the watercolor style, a fox has crystal-like horns and its tail is not fluffy, but consists of flowing nebulae. The Statue of Liberty looked confused. Instead of a torch, she was holding a melting ice cream in her hand… Fig. 2: Qualitative examples before and after RvR refinement. 1 Introduction Modern text-to-image (T2I) generation models [5,12,18,22,37,45,53] have made remarkable progress in synthesizing high-fidelity images from natural language. Nevertheless, reliably following complex prompts remains a major challenge, particularly when prompts involve multiple objects, diverse attributes, or fine-grained relationships [17,24,52]. 
To improve prompt-image alignment, recent studies have explored refinement approaches based on unified multimodal models"},{"citing_arxiv_id":"2604.19748","ref_index":2,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Tstars-Tryon 1.0: Robust and Realistic Virtual Try-On for Diverse Fashion Items","primary_cat":"cs.CV","submitted_at":"2026-04-21T17:59:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Tstars-Tryon 1.0 is a deployed virtual try-on system claiming high robustness, photorealism, multi-reference flexibility, and near real-time speed for diverse fashion items.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19858","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Wan-Image: Pushing the Boundaries of Generative Visual Intelligence","primary_cat":"cs.CV","submitted_at":"2026-04-21T17:58:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Wan-Image is a unified multi-modal system that integrates LLMs and diffusion transformers to deliver professional-grade image generation features including complex typography, multi-subject consistency, and precise editing, outperforming several prior models in human tests.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18168","ref_index":49,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation","primary_cat":"cs.CV","submitted_at":"2026-04-20T12:28:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"By requiring and using highly discriminative LLM text features, the work enables 
the first effective one-step text-conditioned image generation with MeanFlow.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"diffusion [1, 2] to flow matching [3, 4, 34, 35] optimization, and the evolution of text encoders from early text foundation models [15, 16] to LLMs [17, 18, 36-39]. Representative works such as the Stable Diffusion [4, 40, 41] and PixArt [42-44] series have continuously improved image generation capabilities. Recent large-scale models like FLUX [45, 46], Nano Banana [47], Qwen-Image [48], and HunyuanImage 3.0 [49] have demonstrated the ability to synthesize complex content and accurately edit images. To enhance semantic understanding and instruction following abilities of generative models, models such as Playground v3 [50], SANA-1.5 [14], and BLIP3o-NEXT [51] focus on integrating LLMs [17,18] effectively into the generation framework. Meanwhile, given that high-quality image synthesis typically requires multi-"},{"citing_arxiv_id":"2604.15871","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"UniEditBench: A Unified and Cost-Effective Benchmark for Image and Video Editing via Distilled MLLMs","primary_cat":"cs.CV","submitted_at":"2026-04-17T09:21:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"UniEditBench unifies image and video editing evaluation with a nine-plus-eight operation taxonomy and cost-effective 4B/8B distilled MLLM evaluators that align with human judgments.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"DNAEdit[49]4.40/4.402.64/2.584.37/4.41 3.76/3.79 3.79/3.80 Structured editing InstructPix2Pix[5] 1.65/1.74 1.65/1.79 1.34/1.39 1.23/1.29 1.47/1.55 Step1X-Edit[32]4.53/4.47 3.88/3.87 4.58/4.58 4.01/3.94 4.25/4.21 ICEdit[56] 4.29/4.25 3.08/3.01 4.26/4.21 3.60/3.58 3.81/3.76 control_v11e_sd15_ip2p[55] 1.64/1.62 
1.66/1.80 1.18/1.21 1.07/1.09 1.39/1.43 MLLM Hunyuan[7]4.62/4.69 4.63/4.614.10/4.164.13/4.15 4.37/4.40 Qwen-Image-edit[45] 4.54/4.53 4.36/4.35 4.17/4.18 4.09/4.09 4.29/4.29 Bagel_with_thinking[11] 4.20/4.14 3.64/3.64 4.29/4.34 3.52/3.52 3.91/3.91 Bagel_without_thinking[11] 4.17/4.17 3.71/3.704.37/4.383.60/3.58 3.96/3.96 UniWorld[28] 4.24/4.30 3.85/3.85 3.98/4.13 3.69/3.78 3.94/4.02 Omni-Gen2[46] 4.39/4."},{"citing_arxiv_id":"2604.11006","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Towards Realistic 3D Emission Materials: Dataset, Baseline, and Evaluation for Emission Texture Generation","primary_cat":"cs.CV","submitted_at":"2026-04-13T05:14:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"The work creates the first dataset and baseline for generating emission textures on 3D objects to reproduce glowing materials from input images.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.04911","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SpatialEdit: Benchmarking Fine-Grained Image Spatial Editing","primary_cat":"cs.CV","submitted_at":"2026-04-06T17:54:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SpatialEdit provides a benchmark, large synthetic dataset, and baseline model for precise object and camera spatial manipulations in images, with the model beating priors on spatial editing.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Furthermore, video-based world models [26,50] remain significantly inferior to image-based spatial editing models in performing fine-grained spatial manipu- lation guided by text instructions. 
Additionally, our model can also serve as a practical enhancement tool for single-view reconstruction. 2 Related Work Image Editing and Generative Models. Diffusion-based generative models [9,41,56] have greatly improved the fidelity and controllability of image editing. Instruction-based editing [12,42,59] modifies images according to natural language instructions while preserving overall semantics, but relies on large-scale instruction-faithful supervision. Early pipelines such as InstructP2P [8] combine prompt engineering with diffusion editing operators like Prompt-to-Prompt [22]"},{"citing_arxiv_id":"2604.03400","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Banana100: Breaking NR-IQA Metrics by 100 Iterative Image Replications with Nano Banana Pro","primary_cat":"cs.CV","submitted_at":"2026-04-03T19:01:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Banana100 dataset shows that none of 21 popular NR-IQA metrics consistently rate images degraded by 100 iterative edits lower than clean originals.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"tual image super-resolution. In Proceedings of the European conference on computer vision (ECCV) workshops, pages 0-0, 2018. 6 [10] Sebastian Bosse, Dominique Maniry, Klaus-Robert Müller, Thomas Wiegand, and Wojciech Samek. Deep neural networks for no-reference and full-reference image quality assessment. IEEE Transactions on image processing, 27(1): 206-219, 2017. 6 [11] Siyu Cao, Hangting Chen, Peng Chen, Yiji Cheng, Yutao Cui, Xinchi Deng, Ying Dong, Kipper Gong, Tianpeng Gu, Xiusen Gu, et al. Hunyuanimage 3.0 technical report. arXiv preprint arXiv:2509.23951, 2025. 1 [12] Chaofeng Chen. Model Cards for IQA-PyTorch - pyiqa 0.1.13 documentation. https://iqa- pytorch. 
readthedocs.io/en/latest/ModelCard.html, 2024."},{"citing_arxiv_id":"2604.03061","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Can Nano Banana 2 Replace Traditional Image Restoration Models? An Evaluation of Its Performance on Image Restoration Tasks","primary_cat":"cs.CV","submitted_at":"2026-04-03T14:33:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Nano Banana 2 delivers competitive perceptual quality on image restoration but produces over-enhanced results that diverge from input fidelity in ways standard metrics miss.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"InCVPR, pages 18392-18402, 2023. [4] xAI. Grok imagine image.https : / / docs . x . ai/developers/models/grok-imagine-image, 2026. Accessed: 2026-03-22. [5] Siyu Cao, Hangting Chen, Peng Chen, Yiji Cheng, Yutao Cui, Xinchi Deng, Ying Dong, Kipper Gong, Tianpeng Gu, Xiusen Gu, et al. Hunyuanimage 3.0 technical report.arXiv preprint arXiv:2509.23951, 2025. [6] Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, et al. Seedream 4.0: Toward next- generation multimodal image generation.arXiv preprint arXiv:2509.20427, 2025. 
1 [7] James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee,"},{"citing_arxiv_id":"2603.28767","ref_index":45,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Gen-Searcher: Reinforcing Agentic Search for Image Generation","primary_cat":"cs.CV","submitted_at":"2026-03-30T17:59:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Gen-Searcher is the first trained search-augmented image generation agent using SFT followed by GRPO reinforcement learning with dual text-image rewards, delivering 15-16 point gains on knowledge-intensive benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.00607","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"IdGlow: Dynamic Identity Modulation for Multi-Subject Generation","primary_cat":"cs.CV","submitted_at":"2026-02-28T11:56:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"IdGlow is a progressive two-stage diffusion framework that uses task-adaptive timestep scheduling, temporal gating, VLM prompt synthesis, and group-level DPO to balance identity preservation and scene coherence in multi-subject image generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.00122","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VDE Bench: Evaluating The Capability of Image Editing Models to Modify Visual 
Documents","primary_cat":"cs.CV","submitted_at":"2026-01-27T16:51:05+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.01593","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Beyond Patches: Global-aware Autoregressive Model for Multimodal Few-Shot Font Generation","primary_cat":"cs.CV","submitted_at":"2026-01-04T16:46:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GAR-Font is a global-aware autoregressive framework for multimodal few-shot font generation that adds global tokenization, a language-style adapter, and post-refinement to improve style coherence over patch-based methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.07584","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LongCat-Image Technical Report","primary_cat":"cs.CV","submitted_at":"2025-12-08T14:26:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LongCat-Image delivers a compact 6B-parameter bilingual image generation model that sets new standards for Chinese character rendering accuracy and photorealism while remaining efficient and fully open-source.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.22699","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion 
Transformer","primary_cat":"cs.CV","submitted_at":"2025-11-27T18:52:07+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer Z-Image Team, Alibaba Group Abstract The landscape of high-performance image generation models is currently dominated by proprietary systems, such as Nano Banana Pro [27] and Seedream 4.0 [64]. Leading open-source alternatives, including Qwen-Image [76], Hunyuan-Image-3.0 [8] and FLUX.2 [35], are characterized by massive parameter counts (20B to 80B), making them impractical for inference, and fine-tuning on consumer-grade hardware. To address this gap, we propose Z-Image, an efficient 6B-parameter foundation generative model built upon a Scalable Single-Stream Diffusion Transformer (S3-DiT) architecture that challenges the \"scale-at-all-costs\" paradigm."},{"citing_arxiv_id":"2511.22663","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AIA: Rethinking Architecture Decoupling Strategy In Unified Multimodal Model","primary_cat":"cs.CV","submitted_at":"2025-11-27T17:55:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AIA loss teaches unified multimodal models task-specific cross-modal attention patterns to reduce conflicts between image understanding and generation without architecture decoupling.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.18870","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"HunyuanVideo 1.5 Technical
Report","primary_cat":"cs.CV","submitted_at":"2025-11-24T08:22:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HunyuanVideo 1.5 delivers state-of-the-art open-source text-to-video and image-to-video generation with an 8.3B parameter DiT model featuring SSTA attention, glyph-aware encoding, and progressive training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}