{"total":15,"items":[{"citing_arxiv_id":"2605.23518","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VINS-120K: Ultra High-Resolution Image Editing with A Large-Scale Dataset","primary_cat":"cs.CV","submitted_at":"2026-05-22T11:33:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VINS-120K supplies the first large-scale set of instruction-image-edited-image triplets at ultra-high resolution together with an adaptation strategy that improves detail synthesis.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21573","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models","primary_cat":"cs.CV","submitted_at":"2026-05-20T17:59:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Lens is a 3.8B-parameter text-to-image model that reaches competitive or superior performance to >6B-parameter systems using 19.3% of the training compute of Z-Image through a densely captioned 800M dataset, multi-resolution batching, semantic VAE, strong language encoder, RL fine-tuning, and 4-step","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21090","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TextSculptor: Training and Benchmarking Scene Text Editing","primary_cat":"cs.CV","submitted_at":"2026-05-20T12:22:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TextSculptor supplies an automated data synthesis pipeline yielding 3.2M samples plus a four-task benchmark that raises open-source scene text editing 
performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16810","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Training-Free Occluded Text Rendering via Glyph Priors and Attention-Guided Semantic Blending","primary_cat":"cs.CV","submitted_at":"2026-05-16T04:58:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A restarted dual-stream inference approach with glyph priors and attention-guided masks improves occluded text rendering in pretrained diffusion models without fine-tuning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12500","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture","primary_cat":"cs.CV","submitted_at":"2026-05-12T17:59:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"SenseNova-U1 extends beyond accurate text rendering to more challenging text-centric generation scenarios that require jointly satisfying fine-grained textual constraints, compositional reasoning, and global instruction consistency. Text-centric Generation. We evaluate text-centric generation, with a focus on long-text rendering, multi-region text generation, and complex text-conditioned instruction following. 
We conduct experiments on CVTG-2K [32] and 20 Model # Params Overall↑ Basic Advanced Design Avg Attr Rel Rsn Avg ARel ARsn RRsn Style Text RealW Closed-source Models GPT-Image-1 [100] - 89.15 90.75 91.33 84.57 96.32 88.55 87.07 87.22 85.59 90.00 89.83 89.73 Seedream 3.0 [38] - 86.02 87.07 90.50 89.85 80.86 79.16 79.76 77.23 75.64 100.00 97.17 83.21 DALL-E 3 [5] - 74.96 78.72 79.50 80."},{"citing_arxiv_id":"2605.12013","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"L2P: Unlocking Latent Potential for Pixel Generation","primary_cat":"cs.CV","submitted_at":"2026-05-12T12:01:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"L2P repurposes pre-trained LDMs for direct pixel generation via large-patch tokenization and shallow-layer training on synthetic data, matching source performance with 8-GPU training and enabling native 4K output.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11723","ref_index":10,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating","primary_cat":"cs.CV","submitted_at":"2026-05-12T08:08:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CaC presents a new spatiotemporal concentrating reward model for video anomalies, built on a novel large-scale dataset and three-stage training with RL and IoU rewards, claiming 25.7% accuracy gains and 11.7% anomaly reduction.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"vision and pattern recognition, pages 24185-24198, 2024. [9] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. 
Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025. [10] Nikai Du, Zhennan Chen, Shan Gao, Zhizhou Chen, Xi Chen, Zhengkai Jiang, Jian Yang, and Ying Tai. Textcrafter: Accurately rendering multiple texts in complex visual scenes. arXiv preprint arXiv:2503.23461, 2025. [11] Google. Gemini-3-pro. https://deepmind.google/models/gemini/pro/, 2025. [12] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu,"},{"citing_arxiv_id":"2605.11061","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer","primary_cat":"cs.CV","submitted_at":"2026-05-11T17:59:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A pixel-space Diffusion Transformer with Unified Transformer architecture unifies image generation, editing, and personalization in an end-to-end model that maps all inputs to a shared token space and scales from 8B to over 200B parameters.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":", Li, K., Gou, C., Li, F., Wang, Z., Zhong, S., Yu, W., Nie, X., Song, Z., et al.: Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683 (2025) [10] Douze, M., Guzhva, A., Deng, C., Johnson, J., Szilvasy, G., Mazaré, P.E., Lomeli, M., Hosseini, L., Jégou, H.: The faiss library. arXiv preprint arXiv:2401.08281 (2024) [11] Du, N., Chen, Z., Gao, S., Chen, Z., Chen, X., Jiang, Z., Yang, J., Tai, Y.: Textcrafter: Accurately rendering multiple texts in complex visual scenes. 
arXiv preprint arXiv:2503.23461 (2025) [12] Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al.: Scaling rectified flow transformers for high-resolution image"},{"citing_arxiv_id":"2604.24953","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ViPO: Visual Preference Optimization at Scale","primary_cat":"cs.CV","submitted_at":"2026-04-27T19:49:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Poly-DPO improves robustness to noisy preference data in visual models, and the new ViPO dataset enables superior performance, with the method reducing to standard DPO on high-quality data.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"|−(1−p)(1+αp)| of Poly-DPO adapts to different data characteristics through the α parameter, where p=σ(z) represents the model's confidence in preferring the chosen response. The visualization reveals three distinct optimization regimes that directly correspond to our experimental findings. When α > 0 (blue and purple curves), the gradient is amplified in the region p∈[0.5,0.8], maintaining substantial parameter updates even for moderately confident predictions. 
This enhancement proves crucial for noisy datasets like Pick-a-Pic V2, where only 20.79% of samples show consistent preferences across evaluation dimensions; the sustained gradient (approximately 2-3× stronger than standard DPO at p≈0.6 when α=8) prevents premature convergence on spurious patterns and encourages continued exploration to identify"},{"citing_arxiv_id":"2604.20796","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model","primary_cat":"cs.CV","submitted_at":"2026-04-22T17:20:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLaDA2.0-Uni unifies multimodal understanding and generation inside one discrete diffusion large language model with a semantic tokenizer, MoE backbone, and diffusion decoder.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.15654","ref_index":72,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Zero to Detail: A Progressive Spectral Decoupling Paradigm for UHD Image Restoration with New Benchmark","primary_cat":"cs.CV","submitted_at":"2026-04-17T03:13:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A new framework called ERR decomposes UHD image restoration into three frequency stages with specialized sub-networks and introduces the LSUHDIR benchmark dataset of over 82,000 images.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Lin, \"Diffusion model for camouflaged object detection,\" arXiv preprint arXiv:2308.00303, 2023. [71] Z. Chen, Y. Li, H. Wang, Z. Chen, Z. Jiang, J. Li, Q. Wang, J. Yang, and Y. 
Tai, \"Ragd: Regional-aware diffusion model for text-to-image generation,\" in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 19331-19341. [72] N. Du, Z. Chen, S. Gao, Z. Chen, X. Chen, Z. Jiang, J. Yang, and Y. Tai, \"Textcrafter: Accurately rendering multiple texts in complex visual scenes,\" arXiv preprint arXiv:2503.23461, 2025. [73] Z. Chen, J. Zhu, X. Chen, J. Zhang, X. Hu, H. Zhao, C. Wang, J. Yang, and Y. Tai, \"Dip: Taming diffusion models in pixel space,\" arXiv preprint arXiv:2511.18822, 2025."},{"citing_arxiv_id":"2604.06870","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details","primary_cat":"cs.CV","submitted_at":"2026-04-08T09:32:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RefineAnything is a multimodal diffusion model using Focus-and-Refine crop-and-resize with blended paste-back to achieve high-fidelity local image refinement and near-perfect background preservation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"competitive baselines and near-perfect background preservation, establishing a practical solution for high-precision local refinement. Project Page: https://limuloo.github.io/RefineAnything/. Keywords: Image Generation · Image Editing · Multimodal Learning 1 Introduction Image generation has advanced rapidly, and modern models offer substantially improved controllability [4,8,9,11,12,19-24,26-31,36,40,43,46,51,53-65,67]. Yet a practical failure mode still frequently blocks real-world deployment: local detail collapse. As shown in Fig. 1, fine-grained elements such as printed text, logos, and thin structures are often distorted or inconsistent, even when the global composition is plausible. 
This issue is particularly damaging in high-stakes applications where small details carry key information, such as e-commerce product images"},{"citing_arxiv_id":"2512.07584","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LongCat-Image Technical Report","primary_cat":"cs.CV","submitted_at":"2025-12-08T14:26:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LongCat-Image delivers a compact 6B-parameter bilingual image generation model that sets new standards for Chinese character rendering accuracy and photorealism while remaining efficient and fully open-source.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.22699","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer","primary_cat":"cs.CV","submitted_at":"2025-11-27T18:52:07+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"Image-Turbo and Z-Image-Edit), we conducted extensive experiments across multiple authoritative benchmarks. These evaluations cover general image generation, fine-grained instruction following, text rendering in both English and Chinese, and instruction-based image editing. 5.2.1. Text-to-Image Generation CVTG-2K. To evaluate our model's performance on text rendering tasks, we conduct quantitative experiments on the CVTG-2K benchmark [17]. CVTG-2K is a specialized benchmark designed for Complex Visual Text Generation, encompassing diverse scenarios with varying numbers of text regions. As presented in Table 5, our model achieves superior performance on CVTG-2K across all evaluation metrics. 
Specifically, Z-Image attains the highest average Word Accuracy score of 0.8671, outperforming compet-"},{"citing_arxiv_id":"2510.26583","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Emu3.5: Native Multimodal Models are World Learners","primary_cat":"cs.CV","submitted_at":"2025-10-30T15:11:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Emu3.5 is a native multimodal world model pre-trained on over 10 trillion vision-language tokens with next-token prediction, post-trained via reinforcement learning, and accelerated by Discrete Diffusion Adaptation for efficient interleaved generation and world exploration.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"across four publicly accessible benchmarks: GenEval [34], DPG-bench [41], OneIG-Bench [12] and TIIF [105]. These benchmarks offer a thorough assessment of the model's capacity to produce high-quality images that align semantically with given text prompts. To evaluate the model's text rendering capability, we conduct evaluation both on English and Chinese text generation. For English text rendering, we utilize the LeX-Bench [129], CVTG-2K [26] benchmark to test the readability of rendered English text. For Chinese text rendering, we performed an evaluation using LongText-Bench [33]. This benchmark is designed to assess how well models render longer texts in both English and Chinese. 6https://github.com/LAION-AI/aesthetic-predictor 7https://github.com/Breakthrough/PySceneDetect 18 Quantitative Evaluation."}],"limit":50,"offset":0}