{"total":10,"items":[{"citing_arxiv_id":"2605.23531","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PixIE: Prompted Pixel-Space Low-Light Image Enhancement","primary_cat":"cs.CV","submitted_at":"2026-05-22T11:50:40+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23518","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VINS-120K: Ultra High-Resolution Image Editing with A Large-Scale Dataset","primary_cat":"cs.CV","submitted_at":"2026-05-22T11:33:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VINS-120K supplies the first large-scale set of instruction-image-edited-image triplets at ultra-high resolution together with an adaptation strategy that improves detail synthesis.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20147","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PixVerve: Advancing Native UHR Image Generation to 100MP with a Large-Scale High-Quality Dataset","primary_cat":"cs.CV","submitted_at":"2026-05-19T17:35:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PixVerve introduces a 95K ultra-high-resolution image-text dataset and training strategies that enable native 100-megapixel text-to-image generation together with a new evaluation benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17759","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FrequencyBooster: Full-Frequency Modeling for High-Fidelity Pixel 
Diffusion","primary_cat":"cs.CV","submitted_at":"2026-05-18T02:25:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"FrequencyBooster reports state-of-the-art FID scores of 1.60 at 256x256 and 1.69 at 512x512 for pixel diffusion by using a specialized decoder for full-frequency modeling.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15908","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RaPD: Resolution-Agnostic Pixel Diffusion via Semantics-Enriched Implicit Representations","primary_cat":"cs.CV","submitted_at":"2026-05-15T12:45:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RaPD enables resolution-agnostic image generation by diffusing in a semantics-enriched continuous Neural Image Field latent space using semantic guidance and a coordinate-queried attention renderer.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15741","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HyperDiT: Hyper-Connected Transformers for High-Fidelity Pixel-Space Diffusion","primary_cat":"cs.CV","submitted_at":"2026-05-15T08:51:55+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12500","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify 
Architecture","primary_cat":"cs.CV","submitted_at":"2026-05-12T17:59:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[18] Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811, 2025. [19] Yangyi Chen, Xingyao Wang, Hao Peng, and Heng Ji. A single transformer for scalable vision-language modeling. Transactions on Machine Learning Research, 2024. [20] Zhennan Chen, Junwei Zhu, Xu Chen, Jiangning Zhang, Xiaobin Hu, Hanzhen Zhao, Chengjie Wang, Jian Yang, and Ying Tai. Dip: Taming diffusion models in pixel space. arXiv preprint arXiv:2511.18822, 2025. [21] Ethan Chern, Jiadi Su, Yan Ma, and Pengfei Liu. 
Anole: An open, autoregressive, native large multimodal models for interleaved image-text generation."},{"citing_arxiv_id":"2604.17850","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"UniCSG: Unified High-Fidelity Content-Constrained Style-Driven Generation via Staged Semantic and Frequency Disentanglement","primary_cat":"cs.CV","submitted_at":"2026-04-20T05:59:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"UniCSG adds staged semantic disentanglement and frequency-aware reconstruction to DiT diffusion models to improve content preservation and style fidelity in both text- and reference-guided generation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"employing asynchronous denoising such as SFD[24]. In contrast, we retain the standard encoder and decouple objectives via staged training. Stage 1 learns semantic disentanglement under degraded inputs. Stage 2 refines details guided by the learned scaffold. 2.3 Frequency Disentanglement Diffusion Transformers have popularized high/low-frequency separation. DeCo[22], DiP[3], PixelDiT[46], and JiT[20] assign different components to low-frequency semantics versus high-frequency textures, aligning with Transformers' strengths in low-frequency modeling. Common strategies employ heavy downsampling[22,46] or embedding bottlenecks[20] to constrain the backbone to low frequencies. 
Lightweight decoders[22] or auxiliary paths[46] then restore high-frequency tex-"},{"citing_arxiv_id":"2604.15654","ref_index":73,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Zero to Detail: A Progressive Spectral Decoupling Paradigm for UHD Image Restoration with New Benchmark","primary_cat":"cs.CV","submitted_at":"2026-04-17T03:13:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A new framework called ERR decomposes UHD image restoration into three frequency stages with specialized sub-networks and introduces the LSUHDIR benchmark dataset of over 82,000 images.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Tai, \"Ragd: Regional-aware diffusion model for text-to-image generation,\" in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 19 331-19 341. [72] N. Du, Z. Chen, S. Gao, Z. Chen, X. Chen, Z. Jiang, J. Yang, and Y. Tai, \"Textcrafter: Accurately rendering multiple texts in complex visual scenes,\" arXiv preprint arXiv:2503.23461, 2025. [73] Z. Chen, J. Zhu, X. Chen, J. Zhang, X. Hu, H. Zhao, C. Wang, J. Yang, and Y. Tai, \"Dip: Taming diffusion models in pixel space,\" arXiv preprint arXiv:2511.18822, 2025. [74] Z. Chen, X. Zhang, T.-Z. Xiang, and Y. Tai, \"Adaptive guidance learning for camouflaged object detection,\" arXiv preprint arXiv:2405.02824, 2024. [75] S. Lu, Z. Lian, Z. 
Zhou, S."},{"citing_arxiv_id":"2604.06870","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details","primary_cat":"cs.CV","submitted_at":"2026-04-08T09:32:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RefineAnything is a multimodal diffusion model using Focus-and-Refine crop-and-resize with blended paste-back to achieve high-fidelity local image refinement and near-perfect background preservation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"competitive baselines and near-perfect background preservation, establishing a practical solution for high-precision local refinement. Project Page: https://limuloo.github.io/RefineAnything/. Keywords: Image Generation · Image Editing · Multimodal Learning 1 Introduction Image generation has advanced rapidly, and modern models offer substantially improved controllability [4,8,9,11,12,19-24,26-31,36,40,43,46,51,53-65,67]. Yet a practical failure mode still frequently blocks real-world deployment: local detail collapse. As shown in Fig. 1, fine-grained elements such as printed text, logos, and thin structures are often distorted or inconsistent, even when the global composition is plausible. This issue is particularly damaging in high-stakes applications where small details carry key information, such as e-commerce product images"}],"limit":50,"offset":0}