WE-MATH benchmark reveals most LMMs rely on rote memorization for visual math while GPT-4o has shifted toward knowledge generalization.
Large multilingual models pivot zero-shot multimodal learning across languages
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
representative citing papers
Adversarial images transfer across languages in MLLMs while apparent safety in weaker languages stems from comprehension and visual-grounding failures rather than genuine alignment.
TASM proposes a task-aware structured memory framework using task-vector compression, bipartite token merging, and a Core Memory plus Latent Bank hierarchy to enable efficient dynamic multi-modal in-context learning.
MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.
This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.
citing papers explorer
-
Exploring Adversarial Robustness and Safety Alignment in Multilingual Multi-Modal Large Language Models
Adversarial images transfer across languages in MLLMs while apparent safety in weaker languages stems from comprehension and visual-grounding failures rather than genuine alignment.
-
Task-Aware Structured Memory for Dynamic Multi-modal In-Context Learning
TASM proposes a task-aware structured memory framework using task-vector compression, bipartite token merging, and a Core Memory plus Latent Bank hierarchy to enable efficient dynamic multi-modal in-context learning.