Why are visually-grounded language models bad at image classification? In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

Yuhui Zhang, Alyssa Unell, Xiaohan Wang, Dhruba Ghosh, Yuchang Su, Ludwig Schmidt, Serena Yeung-Levy · 2024

1 Pith paper cites this work. Polarity classification is still indexing.


representative citing papers

Bridging Coarse and Fine Recognition: A Hybrid Approach for Open-Ended Multi-Granularity Object Recognition in Interactive Educational Games

cs.CV · 2026-04-18 · unverdicted · novelty 5.0

HyMOR combines an MLLM for coarse open-ended recognition with CLIP for fine-grained domain objects, achieving near-CLIP fine-grained performance, 2.5% better general recognition, and a 23.2% overall SBert gain on a new TBO textbook dataset.

citing papers explorer

Showing 1 of 1 citing paper.

Bridging Coarse and Fine Recognition: A Hybrid Approach for Open-Ended Multi-Granularity Object Recognition in Interactive Educational Games

cs.CV · 2026-04-18 · unverdicted · none · ref 7
