{"total":15,"items":[{"citing_arxiv_id":"2605.23895","ref_index":34,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"From Activation to Causality: Discovery of Causal Visual Representations in the Human Brain","primary_cat":"cs.CV","submitted_at":"2026-05-22T17:56:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"BrainCause recovers known visual localizations and finds new candidate representations by validating causal specificity via counterfactual stimuli and encoding models, showing activation alone produces many false positives.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21484","ref_index":6,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"One-Step Distillation of Discrete Diffusion Image Generators via Fixed-Point Iteration","primary_cat":"cs.CV","submitted_at":"2026-05-20T17:59:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Fixed-Point Distillation constructs one-step correction targets for discrete diffusion generators via partial corruption and single teacher refinement, lifted into continuous features with a multi-bandwidth drift loss and straight-through estimation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21090","ref_index":2,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"TextSculptor: Training and Benchmarking Scene Text Editing","primary_cat":"cs.CV","submitted_at":"2026-05-20T12:22:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TextSculptor supplies an automated data synthesis pipeline yielding 3.2M samples plus a four-task benchmark that raises open-source scene text editing 
performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19511","ref_index":17,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Are Watermarked Images Editable? SafeMark for Watermark-Preserving Text-Guided Image Editing","primary_cat":"cs.CV","submitted_at":"2026-05-19T08:08:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SafeMark integrates a thresholded watermark-decoding loss into diffusion editors to enable text-guided edits that preserve embedded watermarks with high bit accuracy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19319","ref_index":9,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"SWEET: Sparse World Modeling with Image Editing for Embodied Task Execution","primary_cat":"cs.CV","submitted_at":"2026-05-19T03:54:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SWEET is a one-shot sparse visual planning framework that progressively generates manipulation keyframes via image editing conditioned on language and spatial guidance, then converts them to actions with a diffusion predictor, showing better fidelity and lower cost than video models on DROID and Rob","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17019","ref_index":4,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"StreamingEffect: Real-Time Human-Centric Video Effect Generation","primary_cat":"cs.CV","submitted_at":"2026-05-16T14:45:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"StreamingEffect enables real-time 720p human-centric video effect generation on one GPU 
via teacher-student distillation, keyframe control, and a new 130K video dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13122","ref_index":2,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Early Semantic Grounding in Image Editing Models for Zero-Shot Referring Image Segmentation","primary_cat":"cs.CV","submitted_at":"2026-05-13T07:48:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Pretrained instruction-based image editing models exhibit early foreground-background separability that enables a training-free framework for zero-shot referring image segmentation using a single denoising step.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13062","ref_index":2,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling","primary_cat":"cs.CV","submitted_at":"2026-05-13T06:33:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Edit-Compass and EditReward-Compass are new unified benchmarks for fine-grained image editing evaluation and realistic reward modeling in reinforcement learning optimization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12271","ref_index":22,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm","primary_cat":"cs.CV","submitted_at":"2026-05-12T15:35:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Proposes V2V-Zero, a training-free framework replacing text conditioning with 
VLM final-layer hidden states from visual pages, achieving 0.85 on GenEval and 32.7/100 on new Simple-V2V Bench across models including video extension.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[20] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022. [21] Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18392-18402, 2023. [22] Michal Geyer, Omer Bar-Tal, Shai Bagon, and Tali Dekel. Tokenflow: Consistent diffusion features for consistent video editing. arXiv preprint arXiv:2307.10373, 2023. [23] Max Ku, Cong Wei, Weiming Ren, Harry Yang, and Wenhu Chen. Anyv2v: A tuning-free framework for any video-to-video editing tasks. arXiv preprint arXiv:2403.14468, 2024. [24] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala."},{"citing_arxiv_id":"2605.12038","ref_index":2,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"OmniHumanoid: Streaming Cross-Embodiment Video Generation with Paired-Free Adaptation","primary_cat":"cs.CV","submitted_at":"2026-05-12T12:21:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OmniHumanoid factorizes transferable motion learning from embodiment-specific adaptation to enable scalable cross-embodiment video generation without paired data for new humanoids.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"• We construct a motion-aligned synthetic cross-embodiment dataset across diverse humanoids, motions, scenes, and viewpoints, and validate generalization to unseen embodiments, actions, and environments on both synthetic and real-world benchmarks. 
2 Related Work 2.1 Video Generation Models In recent years, diffusion models have been widely applied to image synthesis [25], image editing [2, 7, 9], video generation [1, 27, 3, 6, 20, 19, 21, 22, 17], and procedural generation [28, 29, 31, 39]. Within video generation, the currently prevalent DiT architecture [8] has progressively surpassed earlier GAN-based [23] and UNet-based methods [6, 28] by significantly enhancing visual fidelity and temporal consistency. Today, rapidly evolving DiT frameworks form the foundation of cutting-edge"},{"citing_arxiv_id":"2605.09688","ref_index":74,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"ConFixGS: Learning to Fix Feedforward 3D Gaussian Splatting with Confidence-Aware Diffusion Priors in Driving Scenes","primary_cat":"cs.CV","submitted_at":"2026-05-10T18:18:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ConFixGS repairs feedforward 3D Gaussian Splatting with confidence-aware diffusion priors, delivering up to 3.68 dB PSNR gains and halved FID scores on Waymo, nuScenes, and KITTI novel view synthesis tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"splatting for driving scene reconstruction from flexible surround-view input. In Proceedings of the AAAI Conference on Artificial Intelligence, 2025. [73] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684-10695, 2022. [74] Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18392-18402, 2023. 
[75] Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3d object."},{"citing_arxiv_id":"2605.08354","ref_index":5,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria","primary_cat":"cs.AI","submitted_at":"2026-05-08T18:05:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Auto-Rubric as Reward externalizes VLM preferences into structured rubrics and applies Rubric Policy Optimization to create more reliable binary rewards for multimodal generation, outperforming pairwise models on text-to-image and editing benchmarks.","context_count":1,"top_context_role":"other","top_context_polarity":"unclear","context_text":"Juntang Zhuang, Joyce Lee, Yufei Guo, Wesam Manassra, Prafulla Dhariwal, Casey Chu, Yunxin Jiao, and Aditya Ramesh. Improving image generation with better captions. arXiv preprint arXiv:2310.07685, 2023. [4] Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning. arXiv preprint arXiv:2305.13301, 2023. [5] Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18392-18402, 2023. [6] Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, et al."},{"citing_arxiv_id":"2605.08250","ref_index":2,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Why Do DiT Editors Drift? 
Plug-and-Play Low Frequency Alignment in VAE Latent Space","primary_cat":"cs.CV","submitted_at":"2026-05-07T16:33:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"VAE-LFA suppresses semantic drift in multi-turn DiT image editing by low-pass filtering latent discrepancies and aligning low-frequency components to an EMA of previous rounds in VAE space.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"We hope that we have provided meaningful insight into the future development of image generation and editing models. Funding disclosure. The authors received no specific funding for this work. References [1] Gal Almog, Ariel Shamir, and Ohad Fried. Reed-vae: Re-encode decode training for iterative image editing with diffusion models. In Computer Graphics Forum, volume 44, page e70020. Wiley Online Library, 2025. [2] Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18392-18402, 2023. 
[3] Tianyu Chen, Yasi Zhang, Zhi Zhang, Peiyu Yu, Shu Wang, Zhendong Wang, Kevin Lin, Xiaofei Wang, Zhengyuan Yang, Linjie Li, et al."},{"citing_arxiv_id":"2605.05781","ref_index":4,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Steering Visual Generation in Unified Multimodal Models with Understanding Supervision","primary_cat":"cs.CV","submitted_at":"2026-05-07T07:20:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Using understanding tasks as direct supervision during post-training improves image generation and editing in unified multimodal models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.17726","ref_index":5,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Slot-MLLM: Object-Centric Visual Tokenization for Multimodal LLM","primary_cat":"cs.CV","submitted_at":"2025-05-23T10:43:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Slot-MLLM introduces a slot-attention-based object-centric visual tokenizer with Q-Former encoder, diffusion decoder, and residual vector quantization for improved local visual comprehension and generation in multimodal LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}