{"total":14,"items":[{"citing_arxiv_id":"2606.26872","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SpatialFlow-GRPO: Where Spatial Credit Drives Image Editing","primary_cat":"cs.CV","submitted_at":"2026-06-25T10:58:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SpatialFlow-GRPO adds region-level reward feedback and spatial alignment to Flow-GRPO-style RL for image editing, reporting gains on GEdit-Bench, ImgEdit-Bench, and a new MultiEditBench.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00931","ref_index":75,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CV-Arena: An Open Benchmark for Instructional Computer Vision Problem Solving with Human-AI Collaborative Preferences","primary_cat":"cs.CV","submitted_at":"2026-05-30T23:37:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CV-Arena is a new 12K-pair benchmark for instruction-guided real-image editing with 16 task types, CogRetriever curation, and Active Elo mixed human-AI evaluation that finds gaps in 21 models and presents CV-Agent.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.25378","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CollectionLoRA: Collecting 50 Effects in 1 LoRA via Multi-Teacher On-Policy Distillation","primary_cat":"cs.CV","submitted_at":"2026-05-25T03:07:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A multi-teacher distillation framework that packs 50 effect LoRAs and fast sampling into a single adapter while aiming to avoid concept 
interference.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16951","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Edit-GRPO: A Locality-Preserving Policy Optimization Framework for Image Editing","primary_cat":"cs.CV","submitted_at":"2026-05-16T12:05:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Edit-GRPO decouples editing and preservation objectives via region-specific signals in a policy optimization framework to improve locality in image editing tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15181","ref_index":49,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing","primary_cat":"cs.CV","submitted_at":"2026-05-14T17:58:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A planner-orchestrator system learns long-horizon image editing by maximizing outcome-based rewards from a vision-language judge and refining plans from successful trajectories.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13062","ref_index":52,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling","primary_cat":"cs.CV","submitted_at":"2026-05-13T06:33:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Edit-Compass and EditReward-Compass are new unified benchmarks for fine-grained image editing evaluation and realistic reward modeling in reinforcement learning 
optimization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08703","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RewardHarness: Self-Evolving Agentic Post-Training","primary_cat":"cs.AI","submitted_at":"2026-05-09T05:32:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RewardHarness self-evolves a tool-and-skill library from 100 preference examples to reach 47.4% accuracy on image-edit evaluation, beating GPT-5, and yields stronger RL-tuned models.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"0-Flash as a closed-source Sub-Agent replacement (Table 1); unless otherwise stated, each reported REWARDHARNESS variant uses the Library evolved with that fixed Sub-Agent. 3.1 Main Results on Image-Editing Evaluation We evaluate preference judgment accuracy on two established benchmarks for instruction-guided image editing evaluation: EditReward-Bench [29], which reports ranking accuracy at K=2, 3, and 4, and GenAI-Bench [9]. Main results. Table 1 compares REWARDHARNESS against proprietary models (GPT-4o, GPT-5, Gemini, Claude) and open-source baselines (Qwen2.5-VL, MiMo-VL, EditReward) on EditReward-Bench (K=2/3/4) and GenAI-Bench. 
With a frozen Qwen2.5-VL-7B Sub-Agent, REWARDHARNESS achieves 45.7 average accuracy, outperforming all listed baselines on average, including the strongest"},{"citing_arxiv_id":"2605.08354","ref_index":43,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria","primary_cat":"cs.AI","submitted_at":"2026-05-08T18:05:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Auto-Rubric as Reward externalizes VLM preferences into structured rubrics and applies Rubric Policy Optimization to create more reliable binary rewards for multimodal generation, outperforming pairwise models on text-to-image and editing benchmarks.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"evaluative reliability and downstream performance. 4.1 Experimental Setup Evaluation Benchmarks. Evaluator fidelity is measured on three established testbeds: MM-RewardBench2 [16], which provides fine-grained diagnostic splits across multimodal reward scenarios; HPDv3 (test set) [28], a large-scale text-to-image preference corpus comprising 14,400 pairwise human judgments; and EditReward-Bench [43], specifically curated to probe instruction adherence in image editing. For generative quality assessment, we adopt GenEval [11], DPG-Bench [15], TIIF (test-mini-short) [40], and UniGenBench++ [37] for text-to-image synthesis, complemented by GEdit-Bench [24] and ImgEdit [49] for editing tasks. 
Baselines and Implementation. For human preference evaluation, we compare against a suite of"},{"citing_arxiv_id":"2605.07477","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ReasonEdit: Towards Interpretable Image Editing Evaluation via Reinforcement Learning","primary_cat":"cs.CV","submitted_at":"2026-05-08T09:23:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ReasonEdit uses a new CoT dataset and reinforcement learning to produce interpretable, human-aligned evaluations of text-guided image edits.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"81 67.30 0.3821 0.3628 GPT-5 [37] 59.61 47.27 75.24 0.4085 0.4175 Gemini-2.0-Flash [12] 53.32 44.31 69.58 0.2369 0.3512 Gemini-2.5-Flash [6] 57.01 47.63 70.26 0.4162 0.4258 Qwen3-VL (7B) [45] 44.58 34.28 42.51 0.2426 0.3860 InternVL3.5 (7B) [39] 45.17 36.92 45.18 0.2071 0.3091 EditScore (Qwen3) [47] 50.24 42.18 72.30 0.3062 0.3805 EditReward (MiMo) [20] 65.72 63.62 71.25 0.3520 0.4643 LMM4Edit [59] 63.27 65.21 70.64 0.3726 0.5012 ReasonEdit (Ours) 83.90 72.22 78.48 0.7566 0.5830 6.2 RE-Reward performance To validate our reward model's alignment with human preferences, we compare its correlation with human annotations against diverse baselines-including vision-language metrics and state-of-the-art"},{"citing_arxiv_id":"2605.07457","ref_index":55,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"EditRefiner: A Human-Aligned Agentic Framework for Image Editing Refinement","primary_cat":"cs.CV","submitted_at":"2026-05-08T09:05:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EditRefiner uses a perception-reasoning-action-evaluation agent loop and the EditFHF-15K human feedback dataset to refine text-guided image edits more accurately than prior 
methods.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"4156 0.3559 0.2961 0.3680★Gemini-3.1-Pro [15] 0.3115 0.2874 0.3386 0.2927 0.2618 0.2966 0.3009 0.2396 0.3704 0.3410 0.2927 0.3651✩LMM4Edit [61] 0.3509 0.2424 0.3651 0.3385 0.2109 0.3457 0.3555 0.2145 0.3552 0.3883 0.3006 0.3851✩EditScore (Qwen3) [33] 0.2068 0.1515 0.1842 0.2632 0.2262 0.1316 0.2886 0.2451 0.1274 0.2829 0.2270 0.2497✩EditReward (MiMo) [55] 0.3276 0.2519 0.3357 0.0508 0.0336 0.0683 0.2627 0.2087 0.2679 0.3037 0.2647 0.3251✩EditHF [60] 0.2082 0.1427 0.1416 0.1325 0.0886 0.1191 0.1520 0.1023 0.1365 0.1412 0.1225 0.1323Ours (Ovis2.5-9B)✻ 0.72330.53710.74710.71000.52810.72120.67970.41260.72950.76430.66260.7826Ours (InternVL3.5-8B)✻ 0.73350.50440.72780.72440.54090.73530.69520.50880.73320."},{"citing_arxiv_id":"2605.06535","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Sparkle: Realizing Lively Instruction-Guided Video Background Replacement via Decoupled Guidance","primary_cat":"cs.CV","submitted_at":"2026-05-07T16:35:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Sparkle supplies a large-scale dataset and benchmark for instruction-driven video background replacement, enabling models that generate more natural and temporally consistent new scenes than earlier approaches.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.25477","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DDA-Thinker: Decoupled Dual-Atomic Reinforcement Learning for Reasoning-Driven Image Editing","primary_cat":"cs.CV","submitted_at":"2026-04-28T10:30:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DDA-Thinker decouples planning from generation and applies dual-atomic RL 
with checklist-based rewards to boost reasoning in image editing, yielding competitive results on RISE-Bench and KRIS-Bench.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"inspection, the resulting data forms our initial supervised training set D_SFT. 2) Difficulty-Aware Refinement: To construct a more effective training curriculum for the reinforcement learning stage, we perform difficulty-aware rejection sampling, as depicted in Stage 2 of Figure 2. This strategy is motivated by the observation that not all samples provide equally useful learning signals [46]. Trivial examples offer diminishing returns, while excessively difficult or noisy ones, where the model consistently fails, can introduce misleading gradients and hinder policy updates [46]. Our goal is therefore to curate a dataset of moderately difficult samples, which represent an effective \"sweet spot\" for learning [46], [47]. Specifically, we employ"},{"citing_arxiv_id":"2604.16272","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VEFX-Bench: A Holistic Benchmark for Generic Video Editing and Visual Effects","primary_cat":"cs.CV","submitted_at":"2026-04-17T17:28:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VEFX-Bench releases a large human-labeled video editing dataset, a multi-dimensional reward model, and a standardized benchmark that better matches human judgments than generic evaluators.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"IVE-Bench [15] provide instructions without edited outputs; OpenVE [16] offers scale but relies heavily on automated generation and filtering rather than human annotation; VE-Bench [17] included edited videos and human scores but reduced quality to a single scalar and are built on older editing systems. 
On the reward-model side, prior work focuses on image editing or video generation quality rather than video editing itself [18, 19]. As a result, there is a pressing need for a benchmark and evaluator that jointly capture instruction faithfulness, rendering quality, and preservation of unedited content. To address these gaps, we introduce VEFX-Dataset, VEFX-Reward, and VEFX-Bench. VEFX-Dataset contains 5,049 human-annotated video editing examples spanning 9 major categories and 32 fine-grained subcategories."},{"citing_arxiv_id":"2512.13592","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Image Diffusion Preview with Consistency Solver","primary_cat":"cs.LG","submitted_at":"2025-12-15T17:47:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ConsistencySolver enables high-quality low-step diffusion previews by adapting general linear multistep methods into a lightweight RL-optimized solver, matching multistep DPM-Solver FID with 47% fewer steps and cutting user interaction time by nearly 50%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}