WE-MATH benchmark reveals most LMMs rely on rote memorization for visual math while GPT-4o has shifted toward knowledge generalization.
Large multilingual models pivot zero-shot multimodal learning across languages
5 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
Adversarial images transfer across languages in MLLMs while apparent safety in weaker languages stems from comprehension and visual-grounding failures rather than genuine alignment.
TASM proposes a task-aware structured memory framework using task-vector compression, bipartite token merging, and a Core Memory plus Latent Bank hierarchy to enable efficient dynamic multi-modal in-context learning.
MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.
This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.
citing papers explorer
- A Survey on Multimodal Large Language Models