Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.
hub Canonical reference
Flamingo: a Visual Language Model for Few-Shot Learning
Canonical reference. 88% of citing Pith papers cite this work as background.
abstract
Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research. We introduce Flamingo, a family of Visual Language Models (VLM) with this ability. We propose key architectural innovations to: (i) bridge powerful pretrained vision-only and language-only models, (ii) handle sequences of arbitrarily interleaved visual and textual data, and (iii) seamlessly ingest images or videos as inputs. Thanks to their flexibility, Flamingo models can be trained on large-scale multimodal web corpora containing arbitrarily interleaved text and images, which is key to endow them with in-context few-shot learning capabilities. We perform a thorough evaluation of our models, exploring and measuring their ability to rapidly adapt to a variety of image and video tasks. These include open-ended tasks such as visual question-answering, where the model is prompted with a question which it has to answer; captioning tasks, which evaluate the ability to describe a scene or an event; and close-ended tasks such as multiple-choice visual question-answering. For tasks lying anywhere on this spectrum, a single Flamingo model can achieve a new state of the art with few-shot learning, simply by prompting the model with task-specific examples. On numerous benchmarks, Flamingo outperforms models fine-tuned on thousands of times more task-specific data.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research. We introduce Flamingo, a family of Visual Language Models (VLM) with this ability. We propose key architectural innovations to: (i) bridge powerful pretrained vision-only and language-only models, (ii) handle sequences of arbitrarily interleaved visual and textual data, and (iii) seamlessly ingest images or videos as inputs. Thanks to their flexibility, Flamingo models can be trained on large-scale multimodal web corpora containing
co-cited works
representative citing papers
Task vectors from weight differences allow arithmetic operations to edit pre-trained models, improving multiple tasks simultaneously and enabling analogical inference on unseen tasks.
A spiked signal-plus-noise model yields separation ratios that partition multimodal problems into four regimes where alignment, prediction, both, or neither succeed.
Introduces OMTG benchmark with C-Acc and EtF1 metrics, a 56k dataset, and caption/temporal rewards, reaching 43.65% EtF1 SOTA on the new bench.
Face-Feature Tuning is a label-free logit remapping method that reduces FPR/TPR gaps across groups in deepfake detection while preserving overall accuracy.
Injecting pre-computed layout priors from RT-DETR into VLM prompts raises markdown F1 from 0.37 to 0.92 on a 10k-page OOD benchmark and cuts infinite-loop failures across domains.
PRISM-VL improves VLM performance by grounding on RAW-derived Meas.-XYZ inputs and exposure-bracketed supervision, gaining +0.1074 BLEU and +4.46% LLM-Judge accuracy over an RGB baseline on a held-out benchmark.
VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.
COHERENCE is a new benchmark for measuring MLLMs' ability to recover fine-grained image-text correspondences in interleaved multimodal contexts.
Geo2Sound generates geographically realistic soundscapes from satellite imagery via geospatial attribute modeling, semantic hypothesis expansion, and geo-acoustic alignment, achieving SOTA FAD of 1.765 on a new 20k-pair benchmark.
Alignment of vision-language models with human V1-V3 early visual cortex negatively predicts resistance to sycophantic gaslighting attacks.
Bottleneck Tokens paired with a masked generative objective achieve state-of-the-art unified multimodal retrieval performance among 2B-scale models on the MMEB-V2 benchmark with 78 datasets.
LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.
BLIP-2 bootstraps vision-language pre-training from frozen image encoders and LLMs via a lightweight two-stage Querying Transformer, delivering SOTA results with 54x fewer trainable parameters than Flamingo80B on zero-shot VQAv2.
LAION-5B is an openly released dataset of 5.85 billion CLIP-filtered image-text pairs that enables replication of foundational vision-language models.
PaLI jointly scales a 4B-parameter vision transformer with language models on a new 10B multilingual image-text dataset to reach state-of-the-art results on vision-language tasks while keeping a simple modular design.
Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.
Audit of four VideoQA benchmarks reveals text-only shortcuts in VLMs; new diagnostics Blind Gap, Visual Gain, and Shortcut Score quantify and filter visual dependence.
TacGen reports that vision-plus-generated-touch representations outperform vision-only models on mass, density, hardness, force, and manipulation tasks, with controls indicating the touch channel accounts for most of the gain.
S-Agent augments VLMs with spatial tools, scene and agent memory for evidence accumulation on multi-view and video tasks, and produces an 8B model via SFT on its own trajectories that beats same-scale baselines.
Q-K=V projection sharing in transformers matches standard QKV performance with 50% KV cache reduction and combines with GQA/MQA for up to 96.9% reduction across vision and language tasks.
MOSS-Video-Preview introduces a cross-attention architecture and synthesized real-time QA data to enable continuous perception, answer revision, and faster inference in video-language models compared to decoder-only designs.
VRCD prioritizes visually complementary positions during parallel decoding in dMLLMs by measuring attention overlap with the new Visual Redundancy Index, yielding accuracy gains over confidence-based baselines on M^3CoT and MMBench.
Stream mines streaming media to create and release StreamDial, a dataset of 87,498 structured task-oriented dialogue sessions across automotive, restaurant, and hotel domains using persona construction, Conversational Blueprints, and RAG.
citing papers explorer
-
When to Align, When to Predict: A Phase Diagram for Multimodal Learning
A spiked signal-plus-noise model yields separation ratios that partition multimodal problems into four regimes where alignment, prediction, both, or neither succeed.
-
Towards One-to-Many Temporal Grounding
Introduces OMTG benchmark with C-Acc and EtF1 metrics, a 56k dataset, and caption/temporal rewards, reaching 43.65% EtF1 SOTA on the new bench.
-
Toward Calibrated, Fair, and accurate Deepfake Detection
Face-Feature Tuning is a label-free logit remapping method that reduces FPR/TPR gaps across groups in deepfake detection while preserving overall accuracy.
-
Structured Layout Priors for Robust Out-of-Distribution Visual Document Understanding
Injecting pre-computed layout priors from RT-DETR into VLM prompts raises markdown F1 from 0.37 to 0.92 on a 10k-page OOD benchmark and cuts infinite-loop failures across domains.
-
Allegory of the Cave: Measurement-Grounded Vision-Language Learning
PRISM-VL improves VLM performance by grounding on RAW-derived Meas.-XYZ inputs and exposure-bracketed supervision, gaining +0.1074 BLEU and +4.46% LLM-Judge accuracy over an RGB baseline on a held-out benchmark.
-
VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation
VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.
-
COHERENCE: Benchmarking Fine-Grained Image-Text Alignment in Interleaved Multimodal Contexts
COHERENCE is a new benchmark for measuring MLLMs' ability to recover fine-grained image-text correspondences in interleaved multimodal contexts.
-
Geo2Sound: A Scalable Geo-Aligned Framework for Soundscape Generation from Satellite Imagery
Geo2Sound generates geographically realistic soundscapes from satellite imagery via geospatial attribute modeling, semantic hypothesis expansion, and geo-acoustic alignment, achieving SOTA FAD of 1.765 on a new 20k-pair benchmark.
-
Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation
Alignment of vision-language models with human V1-V3 early visual cortex negatively predicts resistance to sycophantic gaslighting attacks.
-
Bottleneck Tokens for Unified Multimodal Retrieval
Bottleneck Tokens paired with a masked generative objective achieve state-of-the-art unified multimodal retrieval performance among 2B-scale models on the MMEB-V2 benchmark with 78 datasets.
-
From Accuracy to Visual Dependence: Auditing and Filtering Modality Collapse in Traffic VideoQA
Audit of four VideoQA benchmarks reveals text-only shortcuts in VLMs; new diagnostics Blind Gap, Visual Gain, and Shortcut Score quantify and filter visual dependence.
-
TacGen: Touch Is a Necessary Dimension of Physical-World Representation -- Addressing Tactile Data Scarcity with Scalable Vision-to-Touch Alignment and Generation
TacGen reports that vision-plus-generated-touch representations outperform vision-only models on mass, density, hardness, force, and manipulation tasks, with controls indicating the touch channel accounts for most of the gain.
-
S-Agent: Spatial Tool-Use Elicits Reasoning for Spatial Intelligence
S-Agent augments VLMs with spatial tools, scene and agent memory for evidence accumulation on multi-view and video tasks, and produces an 8B model via SFT on its own trajectories that beats same-scale baselines.
-
Do Transformers Need Three Projections? Systematic Study of QKV Variants
Q-K=V projection sharing in transformers matches standard QKV performance with 50% KV cache reduction and combines with GQA/MQA for up to 96.9% reduction across vision and language tasks.
-
MOSS-Video-Preview: Toward Real-Time Video Understanding via Cross-Attention
MOSS-Video-Preview introduces a cross-attention architecture and synthesized real-time QA data to enable continuous perception, answer revision, and faster inference in video-language models compared to decoder-only designs.
-
Visual-Redundancy-Controlled Parallel Decoding for Diffusion-Based Multimodal Large Language Models
VRCD prioritizes visually complementary positions during parallel decoding in dMLLMs by measuring attention overlap with the new Visual Redundancy Index, yielding accuracy gains over confidence-based baselines on M^3CoT and MMBench.
-
STREAM: A Data-Centric Framework for Mining High-Value Task-Oriented Dialogues from Streaming Media
Stream mines streaming media to create and release StreamDial, a dataset of 87,498 structured task-oriented dialogue sessions across automotive, restaurant, and hotel domains using persona construction, Conversational Blueprints, and RAG.
-
From Demonstrations to Rewards: Test-Time Prompt Optimization for VLM Reward Models
Demo2Reward optimizes VLM reward model language instructions at test time from a few demonstrations to reduce false positives and enable policy learning in simulated and real robotic tasks without manual reward design.
-
A Readiness-Driven Runtime for Pipeline-Parallel Training under Runtime Variability
RRFP introduces a readiness-driven runtime for pipeline parallelism that uses schedules as hints and ready-set arbitration to improve utilization under runtime variability, reporting up to 2.77x speedup on multimodal workloads.
-
Can Multimodal Large Language Models Understand Pathologic Movements? A Pilot Study on Seizure Semiology
MLLMs achieve zero-shot recognition of seizure semiological features better than fine-tuned vision models on most tested features, with signal enhancement and faithful explanations.
-
A Meta Reinforcement Learning Approach to Goals-Based Wealth Management
MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.
-
Majorization-Guided Test-Time Adaptation for Vision-Language Models under Modality-Specific Shift
MG-MTTA improves VLM accuracy under modality-specific shifts by replacing pure entropy minimization with majorization-guided adaptation that incorporates a reliability-aware gate prior.
-
MTT-Bench: Predicting Social Dominance in Mice via Multimodal Large Language Models
Fine-tuned multimodal LLMs predict mouse social dominance from raw tube test videos with high agreement to traditional rankings.
-
Foveated Reasoning: Stateful, Action-based Visual Focusing for Vision-Language Models
Foveated Reasoner integrates foveation as stateful actions inside the autoregressive decoding loop of vision-language models, trained via cold-start supervision then reinforcement learning to achieve higher accuracy at low token budgets.
-
Text Steganography with Dynamic Codebook and Multimodal Large Language Model
A black-box text steganography method using a dynamic codebook generated by multimodal LLMs and reject-sampling feedback achieves higher embedding capacity and text quality than prior white-box and fixed-codebook black-box approaches.
-
AIM: Asymmetric Information Masking for Visual Question Answering Continual Learning
AIM applies modality-specific masks to balance stability and plasticity in asymmetric VLMs, achieving SOTA average performance and reduced forgetting on continual VQA v2 and GQA while preserving generalization to novel compositions.
-
Bridging the Usability Gap: Lessons from Interpreting Studies for Machine Interpreting Design
Machine interpreting should shift from fidelity metrics to three design priorities—agency, grounding, and experience—drawn from interpreting studies to close the usability gap with human-mediated communication.
-
Multimodal Alignment and Preference Optimization for Zero-Shot Conditional RNA Generation
Moirain models use multimodal SFT and DPO to generate novel RNA sequences with superior protein binding affinities in a zero-shot conditional setting.
-
Agentic AI for Remote Sensing: Technical Challenges and Research Directions
Position paper identifies structural challenges in applying generic agentic AI to Earth Observation and outlines design principles for EO-native agents focused on geospatial state and validity.
-
Integration of Object Detection and Small VLMs for Construction Safety Hazard Identification
Detection-guided prompting raises small VLM hazard F1 from 34.5% to 50.6% and BERTScore from 0.61 to 0.82 on construction images with only 2.5 ms added latency.
-
Emotive Architectures: The Role of LLMs in Adjusting Work Environments
LLMs can turn static work settings into emotion-responsive hybrid environments that support focus and well-being.