Image editing models fail zero-shot visual planning on abstract mazes and queen puzzles but generalize after finetuning, yet still cannot match human zero-shot efficiency.
Title resolution pending
4 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.CV 4verdicts
UNVERDICTED 4representative citing papers
PGT generates synthetic tasks via geometric overlays on images to supply dense visual supervision, improving spatial and relational understanding in MLLMs by up to 20% on targeted benchmarks.
A Nash equilibrium framework for training-free multimodal step verification that uses cross-modal agreement and disagreement signals for filtering and ranking reasoning steps.
mPLUG-Owl3 introduces hyper attention blocks to integrate vision and language for long image-sequence understanding and reports SOTA results on single-image, multi-image, and video benchmarks.
citing papers explorer
-
Probing Visual Planning in Image Editing Models
Image editing models fail zero-shot visual planning on abstract mazes and queen puzzles but generalize after finetuning, yet still cannot match human zero-shot efficiency.
-
PGT: Procedurally Generated Tasks for improving visual grounding in MLLMs
PGT generates synthetic tasks via geometric overlays on images to supply dense visual supervision, improving spatial and relational understanding in MLLMs by up to 20% on targeted benchmarks.
-
A Nash Equilibrium Framework For Training-Free Multimodal Step Verification
A Nash equilibrium framework for training-free multimodal step verification that uses cross-modal agreement and disagreement signals for filtering and ranking reasoning steps.
-
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models
mPLUG-Owl3 introduces hyper attention blocks to integrate vision and language for long image-sequence understanding and reports SOTA results on single-image, multi-image, and video benchmarks.