Title resolution pending

BLINK: Multimodal Large Language Models Can See but Not Perceive , author= · 2024

4 Pith papers cite this work. Polarity classification is still indexing.

4 Pith papers citing it

browse 4 citing papers

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

representative citing papers

Probing Visual Planning in Image Editing Models

cs.CV · 2026-04-23 · unverdicted · novelty 7.0

Image editing models fail zero-shot visual planning on abstract mazes and queen puzzles but generalize after finetuning, yet still cannot match human zero-shot efficiency.

PGT: Procedurally Generated Tasks for improving visual grounding in MLLMs

cs.CV · 2026-05-22 · unverdicted · novelty 6.0

PGT generates synthetic tasks via geometric overlays on images to supply dense visual supervision, improving spatial and relational understanding in MLLMs by up to 20% on targeted benchmarks.

A Nash Equilibrium Framework For Training-Free Multimodal Step Verification

cs.CV · 2026-05-19 · unverdicted · novelty 5.0

A Nash equilibrium framework for training-free multimodal step verification that uses cross-modal agreement and disagreement signals for filtering and ranking reasoning steps.

mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models

cs.CV · 2024-08-09 · unverdicted · novelty 5.0

mPLUG-Owl3 introduces hyper attention blocks to integrate vision and language for long image-sequence understanding and reports SOTA results on single-image, multi-image, and video benchmarks.

citing papers explorer

Showing 4 of 4 citing papers.

Probing Visual Planning in Image Editing Models cs.CV · 2026-04-23 · unverdicted · none · ref 47
Image editing models fail zero-shot visual planning on abstract mazes and queen puzzles but generalize after finetuning, yet still cannot match human zero-shot efficiency.
PGT: Procedurally Generated Tasks for improving visual grounding in MLLMs cs.CV · 2026-05-22 · unverdicted · none · ref 44
PGT generates synthetic tasks via geometric overlays on images to supply dense visual supervision, improving spatial and relational understanding in MLLMs by up to 20% on targeted benchmarks.
A Nash Equilibrium Framework For Training-Free Multimodal Step Verification cs.CV · 2026-05-19 · unverdicted · none · ref 53
A Nash equilibrium framework for training-free multimodal step verification that uses cross-modal agreement and disagreement signals for filtering and ranking reasoning steps.
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models cs.CV · 2024-08-09 · unverdicted · none · ref 165
mPLUG-Owl3 introduces hyper attention blocks to integrate vision and language for long image-sequence understanding and reports SOTA results on single-image, multi-image, and video benchmarks.

Title resolution pending

fields

years

verdicts

representative citing papers

citing papers explorer