Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.
Mixed citations
Mural: multimodal, multitask retrieval across languages
Mixed citation behavior. Most common role is background (60%).
citation-role summary
citation-polarity summary
representative citing papers
Socratic Models compose zero-shot multimodal reasoning by prompting pretrained language and vision models to exchange information and enable new capabilities without finetuning.
LLMs form an inner monologue from closed-loop language feedback to improve high-level instruction completion in simulated and real robotic rearrangement and kitchen manipulation tasks.
InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.
The survey identifies a key tension in multilingual vision-language models between language neutrality via contrastive learning and cultural awareness via diverse data, with most benchmarks relying on translation-based evaluation.
citing papers explorer
-
Flamingo: a Visual Language Model for Few-Shot Learning
Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.
-
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
Socratic Models compose zero-shot multimodal reasoning by prompting pretrained language and vision models to exchange information and enable new capabilities without finetuning.
-
Inner Monologue: Embodied Reasoning through Planning with Language Models
LLMs form an inner monologue from closed-loop language feedback to improve high-level instruction completion in simulated and real robotic rearrangement and kitchen manipulation tasks.
-
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.
-
Multilingual Vision-Language Models, A Survey
The survey identifies a key tension in multilingual vision-language models between language neutrality via contrastive learning and cultural awareness via diverse data, with most benchmarks relying on translation-based evaluation.