pith. machine review for the scientific record.

arxiv: 2204.14198 · v2 · submitted 2022-04-29 · 💻 cs.CV · cs.AI · cs.LG

Recognition: 2 theorem links


Flamingo: a Visual Language Model for Few-Shot Learning

Aida Nematzadeh, Andrew Brock, Andrew Zisserman, Antoine Miech, Arthur Mensch, Eliza Rutherford, Iain Barr, Jacob Menick, Jean-Baptiste Alayrac, Jeff Donahue, Karel Lenc, Karen Simonyan, Katie Millican, Malcolm Reynolds, Marianne Monteiro, Mikolaj Binkowski, Oriol Vinyals, Pauline Luc, Ricardo Barreira, Roman Ring, Sahand Sharifzadeh, Sebastian Borgeaud, Serkan Cabi, Sina Samangooei, Tengda Han, Yana Hasson, Zhitao Gong

Pith reviewed 2026-05-12 04:16 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords visual language model · few-shot learning · multimodal · in-context learning · visual question answering · image captioning · video understanding

The pith

A single Flamingo visual language model reaches new state-of-the-art results on image and video tasks simply by receiving a few task examples in its prompt.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Flamingo as a family of models that combine strong vision and language components to adapt to new tasks with minimal labeled data. Training occurs on large web collections where text and images appear mixed in natural order, allowing the model to learn in-context adaptation. Evaluation covers visual question answering, captioning of scenes or events, and multiple-choice questions on both still images and video clips. A sympathetic reader would care because most current systems demand thousands of task-specific examples for each new use case, whereas Flamingo suggests a single pretrained model can handle many uses through prompting alone. If the results hold, this reduces the data and compute barriers to deploying capable multimodal systems.

Core claim

Flamingo models, after pretraining on large-scale multimodal web corpora with arbitrarily interleaved text and images, can be prompted with a small number of task-specific examples to achieve state-of-the-art performance across a spectrum of vision-language tasks, including open-ended visual question answering, image and video captioning, and multiple-choice visual question answering. They often exceed the results of models fine-tuned on thousands of times more labeled data for each individual task.

What carries the argument

Architectural innovations that bridge pretrained vision-only and language-only models to process sequences of arbitrarily interleaved visual and textual data while accepting images or videos as input.
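The load-bearing architectural piece is the tanh-gated cross-attention layer inserted between frozen language-model blocks: the gate is initialized to zero, so the pretrained LM is exactly preserved at the start of training and visual conditioning is learned gradually. A minimal NumPy sketch of the gating idea (single-head, simplified; the name `alpha` and the shapes are illustrative, not the paper's exact parameterization):

```python
import numpy as np

def cross_attention(x, v, Wq, Wk, Wv):
    """Single-head cross-attention: text tokens x attend to visual tokens v."""
    q, k, vals = x @ Wq, v @ Wk, v @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)  # softmax over visual tokens
    return w @ vals

def gated_xattn(x, v, Wq, Wk, Wv, alpha):
    # Residual tanh gate: with alpha = 0 the layer is the identity,
    # so the frozen pretrained LM is left untouched at initialization.
    return x + np.tanh(alpha) * cross_attention(x, v, Wq, Wk, Wv)

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(4, d))   # 4 text-token embeddings
v = rng.normal(size=(3, d))   # 3 visual tokens from the vision encoder
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

y0 = gated_xattn(x, v, Wq, Wk, Wv, alpha=0.0)  # identity at init
y1 = gated_xattn(x, v, Wq, Wk, Wv, alpha=1.0)  # gate opened by training
```

The zero-init gate is what lets strong frozen language and vision backbones be stitched together without destabilizing either at the start of training.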

If this is right

  • A single model handles both open-ended tasks such as describing scenes and close-ended tasks such as multiple-choice questions through the same prompting approach.
  • Performance on captioning, visual question answering, and video understanding improves by adding more examples directly in the input without any parameter updates.
  • The same pretrained weights apply to both still images and video inputs without task-specific retraining.
  • Flamingo sets new benchmark levels on numerous vision-language datasets while using far less task-specific data than prior approaches.
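The prompting interface behind these points amounts to interleaving support (image, text) pairs ahead of the query image and letting the model continue the text. A toy sketch of that prompt construction (the `<image>` marker and helper are illustrative, not Flamingo's actual tokenization):

```python
def build_few_shot_prompt(support, query_prefix=""):
    """support: list of (image_ref, target_text) pairs supplied in-context.
    Returns one interleaved prompt; the model continues the text after the
    final <image> marker with no parameter updates."""
    lines = [f"<image> {text}" for _image, text in support]
    lines.append(f"<image> {query_prefix}")  # query image, completion follows
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    [("shots/dog.jpg", "A dog catching a frisbee."),
     ("shots/beach.jpg", "Two children building a sandcastle.")])
```

Adapting to a new task then means swapping the support pairs, not retraining anything.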

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach implies that one model could replace many separately fine-tuned systems if users supply fresh examples at inference time for each new application.
  • Similar interleaved-data pretraining might extend in-context adaptation to additional modalities such as audio or 3D scenes.
  • Downstream users could experiment with task variants by changing only the prompt examples rather than collecting new training sets.

Load-bearing premise

Training on large-scale multimodal web data with freely interleaved text and images produces in-context few-shot abilities that transfer to new downstream tasks without overfitting to patterns in the pretraining distribution.

What would settle it

If Flamingo with a handful of prompt examples fails to match or exceed the accuracy of a model fine-tuned on thousands of task-specific examples when both are tested on the same held-out set of image and video benchmarks, the central claim would not hold.

read the original abstract

Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research. We introduce Flamingo, a family of Visual Language Models (VLM) with this ability. We propose key architectural innovations to: (i) bridge powerful pretrained vision-only and language-only models, (ii) handle sequences of arbitrarily interleaved visual and textual data, and (iii) seamlessly ingest images or videos as inputs. Thanks to their flexibility, Flamingo models can be trained on large-scale multimodal web corpora containing arbitrarily interleaved text and images, which is key to endow them with in-context few-shot learning capabilities. We perform a thorough evaluation of our models, exploring and measuring their ability to rapidly adapt to a variety of image and video tasks. These include open-ended tasks such as visual question-answering, where the model is prompted with a question which it has to answer; captioning tasks, which evaluate the ability to describe a scene or an event; and close-ended tasks such as multiple-choice visual question-answering. For tasks lying anywhere on this spectrum, a single Flamingo model can achieve a new state of the art with few-shot learning, simply by prompting the model with task-specific examples. On numerous benchmarks, Flamingo outperforms models fine-tuned on thousands of times more task-specific data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Flamingo, a family of Visual Language Models (VLMs) for few-shot learning on image and video tasks. It proposes architectural innovations to bridge pretrained vision-only and language-only models, handle arbitrarily interleaved visual and textual sequences, and ingest images or videos. The models are trained on large-scale multimodal web corpora with interleaved text and images to enable in-context few-shot capabilities. Evaluations across open-ended tasks (VQA, captioning) and close-ended tasks (multiple-choice VQA) show a single Flamingo model achieving new state-of-the-art few-shot performance via prompting with task examples, often outperforming models fine-tuned on orders-of-magnitude more task-specific data.

Significance. If the empirical results hold after addressing controls for data contamination and statistical rigor, this would be a significant contribution to multimodal learning. It would establish that large-scale pretraining on interleaved web data can produce robust in-context adaptation across diverse visual tasks without task-specific fine-tuning, advancing flexible few-shot multimodal systems and reducing data requirements for adaptation.

major comments (2)
  1. [§4 (Experiments and Results)] The performance tables reporting few-shot results on benchmarks such as VQAv2, COCO, and OK-VQA do not include error bars, standard deviations, or details on multiple runs/ablations. This makes it impossible to determine whether the claimed outperformance over fine-tuned baselines is statistically reliable or could arise from evaluation variance.
  2. [§3 (Model and Training)] No analysis of potential overlap or decontamination between the large-scale multimodal web pretraining corpus and the downstream benchmarks (COCO, VQAv2, OK-VQA, TextVQA, etc.) is reported. Since the central claim attributes gains to learned in-context few-shot learning rather than memorization, this omission is load-bearing and requires explicit checks to rule out leakage as an alternative explanation.
minor comments (2)
  1. [Abstract] The abstract and introduction refer to 'a family of' models but do not specify parameter counts or the exact model sizes evaluated, which would aid in interpreting scaling behavior and reproducibility.
  2. [§2 (Architecture)] Notation for components such as the perceiver resampler and gated cross-attention layers is introduced without immediate cross-references to the equations defining their operation, reducing clarity for readers unfamiliar with the architecture.
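Part of major comment 1 can be addressed without multi-seed retraining: re-run the few-shot evaluation under every ordering of the support examples and report the spread alongside the mean. A toy sketch of that check (the `evaluate` stub is hypothetical; a real run would score the model on the benchmark):

```python
import itertools
import statistics

def scores_over_orderings(evaluate, support, queries):
    """Evaluate the same query set under every permutation of the
    in-context support examples, so that prompt-order variance can be
    reported next to the headline mean."""
    return [evaluate(list(order), queries)
            for order in itertools.permutations(support)]

# Hypothetical stub: accuracy shifts slightly with where ex_a appears.
def evaluate(order, queries):
    return 0.60 + 0.01 * order.index("ex_a")

scores = scores_over_orderings(evaluate, ["ex_a", "ex_b", "ex_c"], ["q1", "q2"])
mean = statistics.mean(scores)       # headline number
spread = statistics.pstdev(scores)   # variance attributable to ordering
```

With k support examples this costs k! evaluations, so in practice a random sample of orderings would stand in for the full set.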

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights important aspects of statistical reliability and data integrity. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core claims.

read point-by-point responses
  1. Referee: [§4 (Experiments and Results)] The performance tables reporting few-shot results on benchmarks such as VQAv2, COCO, and OK-VQA do not include error bars, standard deviations, or details on multiple runs/ablations. This makes it impossible to determine whether the claimed outperformance over fine-tuned baselines is statistically reliable or could arise from evaluation variance.

    Authors: We agree that error bars and details on run variance would improve interpretability. Due to the high computational cost of training and evaluating these large-scale models, we report results from single training runs, which is standard practice in this domain. In the revision we will add a dedicated limitations paragraph in §4 discussing this constraint, along with variance estimates obtained from multiple few-shot prompt orderings and from ablations on smaller Flamingo variants. These additions will clarify the robustness of the reported gains while acknowledging that full multi-seed statistics for the largest models remain impractical. revision: partial

  2. Referee: [§3 (Model and Training)] No analysis of potential overlap or decontamination between the large-scale multimodal web pretraining corpus and the downstream benchmarks (COCO, VQAv2, OK-VQA, TextVQA, etc.) is reported. Since the central claim attributes gains to learned in-context few-shot learning rather than memorization, this omission is load-bearing and requires explicit checks to rule out leakage as an alternative explanation.

    Authors: We recognize that explicit decontamination analysis is necessary to support the claim that performance stems from in-context learning. The current manuscript does not contain such an analysis. In the revised version we will add a new subsection (and appendix) that quantifies n-gram and image-level overlap between the pretraining corpus and each benchmark, reports the fraction of contaminated examples, and shows that Flamingo retains strong few-shot performance on the non-overlapping subsets. These checks will directly address the possibility of memorization. revision: yes
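The n-gram overlap check the rebuttal commits to can be sketched directly: flag a benchmark example as contaminated when any of its n-grams appears verbatim in the pretraining text. The 8-gram threshold and helper below are illustrative choices, not the paper's protocol:

```python
def ngrams(text, n=8):
    """Set of word-level n-grams for a text; empty if text is shorter than n."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contaminated_fraction(benchmark_texts, pretrain_docs, n=8):
    """Fraction of benchmark examples sharing any verbatim n-gram with the
    pretraining corpus; reporting scores on the clean complement is the
    check that rules out memorization."""
    corpus = set()
    for doc in pretrain_docs:
        corpus |= ngrams(doc, n)
    hits = sum(1 for ex in benchmark_texts if ngrams(ex, n) & corpus)
    return hits / len(benchmark_texts)

frac = contaminated_fraction(
    ["a b c d e f g h", "totally novel question"],
    ["web page saying a b c d e f g h i j"])
```

Image-level overlap needs a separate perceptual-hash or embedding-similarity pass, since n-grams only cover the text side of the corpus.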

Circularity Check

0 steps flagged

No circularity in derivation chain; claims rest on empirical evaluation

full rationale

The paper presents architectural choices for bridging vision and language models and handling interleaved sequences, then reports empirical few-shot performance on held-out benchmarks such as VQA, captioning, and multiple-choice tasks. No equations, fitted parameters, or self-referential definitions are described that would make any result equivalent to its inputs by construction. The central claim—that a single model achieves SOTA few-shot results by prompting—depends on training data and benchmark evaluations that are external to any internal derivation, satisfying the criteria for a self-contained, non-circular presentation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the empirical effectiveness of the proposed architecture and training regime; no explicit free parameters, axioms, or invented entities are introduced in the abstract beyond standard assumptions about pretrained models and web-scale data.

pith-pipeline@v0.9.0 · 5651 in / 1104 out tokens · 30675 ms · 2026-05-12T04:16:37.977355+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Forward citations

Cited by 35 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Editing Models with Task Arithmetic

    cs.LG 2022-12 accept novelty 8.0

    Task vectors from weight differences allow arithmetic operations to edit pre-trained models, improving multiple tasks simultaneously and enabling analogical inference on unseen tasks.

  2. Allegory of the Cave: Measurement-Grounded Vision-Language Learning

    cs.AI 2026-05 unverdicted novelty 7.0

    PRISM-VL improves VLM performance by grounding on RAW-derived Meas.-XYZ inputs and exposure-bracketed supervision, gaining +0.1074 BLEU and +4.46% LLM-Judge accuracy over an RGB baseline on a held-out benchmark.

  3. VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation

    cs.CV 2026-05 unverdicted novelty 7.0

    VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.

  4. COHERENCE: Benchmarking Fine-Grained Image-Text Alignment in Interleaved Multimodal Contexts

    cs.CV 2026-04 unverdicted novelty 7.0

    COHERENCE is a new benchmark for measuring MLLMs' ability to recover fine-grained image-text correspondences in interleaved multimodal contexts.

  5. COHERENCE: Benchmarking Fine-Grained Image-Text Alignment in Interleaved Multimodal Contexts

    cs.CV 2026-04 unverdicted novelty 7.0

    COHERENCE is a benchmark for MLLMs' fine-grained image-text alignment in interleaved multimodal contexts across four domains, with 6161 questions and six-type error analysis.

  6. Geo2Sound: A Scalable Geo-Aligned Framework for Soundscape Generation from Satellite Imagery

    cs.MM 2026-04 unverdicted novelty 7.0

    Geo2Sound generates geographically realistic soundscapes from satellite imagery via geospatial attribute modeling, semantic hypothesis expansion, and geo-acoustic alignment, achieving SOTA FAD of 1.765 on a new 20k-pa...

  7. Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation

    cs.CV 2026-04 conditional novelty 7.0

    Alignment of vision-language models with human V1-V3 early visual cortex negatively predicts resistance to sycophantic gaslighting attacks.

  8. Bottleneck Tokens for Unified Multimodal Retrieval

    cs.LG 2026-04 unverdicted novelty 7.0

    Bottleneck Tokens paired with a masked generative objective achieve state-of-the-art unified multimodal retrieval performance among 2B-scale models on the MMEB-V2 benchmark with 78 datasets.

  9. Visual Instruction Tuning

    cs.CV 2023-04 unverdicted novelty 7.0

    LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.

  10. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

    cs.CV 2023-01 unverdicted novelty 7.0

    BLIP-2 bootstraps vision-language pre-training from frozen image encoders and LLMs via a lightweight two-stage Querying Transformer, delivering SOTA results with 54x fewer trainable parameters than Flamingo80B on zero...

  11. LAION-5B: An open large-scale dataset for training next generation image-text models

    cs.CV 2022-10 accept novelty 7.0

    LAION-5B is an openly released dataset of 5.85 billion CLIP-filtered image-text pairs that enables replication of foundational vision-language models.

  12. A Generalist Agent

    cs.AI 2022-05 accept novelty 7.0

    Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.

  13. Can Multimodal Large Language Models Understand Pathologic Movements? A Pilot Study on Seizure Semiology

    cs.CV 2026-05 unverdicted novelty 6.0

    MLLMs achieve zero-shot recognition of seizure semiological features better than fine-tuned vision models on most tested features, with signal enhancement and faithful explanations.

  14. A Meta Reinforcement Learning Approach to Goals-Based Wealth Management

    cs.LG 2026-05 unverdicted novelty 6.0

    MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.

  15. Agentic AI for Remote Sensing: Technical Challenges and Research Directions

    cs.CV 2026-04 unverdicted novelty 6.0

    Agentic AI faces structural challenges in remote sensing due to geospatial data properties and workflow constraints, requiring EO-native agents built around structured state, tool-aware reasoning, and validity-aware e...

  16. Majorization-Guided Test-Time Adaptation for Vision-Language Models under Modality-Specific Shift

    cs.CV 2026-04 unverdicted novelty 6.0

    MG-MTTA improves VLM accuracy under modality-specific shifts by replacing pure entropy minimization with majorization-guided adaptation that incorporates a reliability-aware gate prior.

  17. MTT-Bench: Predicting Social Dominance in Mice via Multimodal Large Language Models

    eess.IV 2026-04 unverdicted novelty 6.0

    Fine-tuned multimodal LLMs predict mouse social dominance from raw tube test videos with high agreement to traditional rankings.

  18. Foveated Reasoning: Stateful, Action-based Visual Focusing for Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    Foveated Reasoner integrates foveation as stateful actions inside the autoregressive decoding loop of vision-language models, trained via cold-start supervision then reinforcement learning to achieve higher accuracy a...

  19. Text Steganography with Dynamic Codebook and Multimodal Large Language Model

    cs.CR 2026-04 unverdicted novelty 6.0

    A black-box text steganography method using a dynamic codebook generated by multimodal LLMs and reject-sampling feedback achieves higher embedding capacity and text quality than prior white-box and fixed-codebook blac...

  20. AIM: Asymmetric Information Masking for Visual Question Answering Continual Learning

    cs.CV 2026-04 unverdicted novelty 6.0

    AIM applies modality-specific masks to balance stability and plasticity in asymmetric VLMs, achieving SOTA average performance and reduced forgetting on continual VQA v2 and GQA while preserving generalization to nove...

  21. LLaVA-Video: Video Instruction Tuning With Synthetic Data

    cs.CV 2024-10 unverdicted novelty 6.0

    LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.

  22. Vision Transformers Need Registers

    cs.CV 2023-09 unverdicted novelty 6.0

    Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.

  23. Improving Factuality and Reasoning in Language Models through Multiagent Debate

    cs.CL 2023-05 unverdicted novelty 6.0

    Multiagent debate among LLMs improves mathematical reasoning, strategic reasoning, and factual accuracy while reducing hallucinations.

  24. MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

    cs.CV 2023-03 unverdicted novelty 6.0

    MM-REACT uses textual prompts to let ChatGPT collaborate with external vision experts for zero-shot multimodal reasoning and action on advanced visual tasks.

  25. PaLM-E: An Embodied Multimodal Language Model

    cs.LG 2023-03 conditional novelty 6.0

    PaLM-E is a single 562B-parameter multimodal model that performs embodied reasoning tasks like robotic manipulation planning and visual question answering by interleaving vision, state, and text inputs with positive t...

  26. Inner Monologue: Embodied Reasoning through Planning with Language Models

    cs.RO 2022-07 unverdicted novelty 6.0

    LLMs form an inner monologue from closed-loop language feedback to improve high-level instruction completion in simulated and real robotic rearrangement and kitchen manipulation tasks.

  27. Language Models (Mostly) Know What They Know

    cs.CL 2022-07 unverdicted novelty 6.0

    Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

  28. Emergent Abilities of Large Language Models

    cs.CL 2022-06 unverdicted novelty 6.0

    Emergent abilities are capabilities present in large language models but absent in smaller ones and cannot be predicted by extrapolating smaller model performance.

  29. Agentic AI for Remote Sensing: Technical Challenges and Research Directions

    cs.CV 2026-04 unverdicted novelty 5.0

    Agentic AI for remote sensing requires new designs centered on structured geospatial state, tool-aware reasoning, verifier-guided execution, and physical validity rather than generic extensions.

  30. Galactica: A Large Language Model for Science

    cs.CL 2022-11 unverdicted novelty 5.0

    Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.

  31. Integration of Object Detection and Small VLMs for Construction Safety Hazard Identification

    cs.CV 2026-04 unverdicted novelty 4.0

    Detection-guided prompting raises small VLM hazard F1 from 34.5% to 50.6% and BERTScore from 0.61 to 0.82 on construction images with only 2.5 ms added latency.

  32. PaliGemma: A versatile 3B VLM for transfer

    cs.CV 2024-07 unverdicted novelty 4.0

    PaliGemma is an open 3B VLM based on SigLIP and Gemma that achieves strong performance on nearly 40 diverse open-world tasks including benchmarks, remote-sensing, and segmentation.

  33. VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

    cs.CV 2024-06 unverdicted novelty 4.0

    VideoLLaMA 2 improves video LLMs via a new STC connector for spatial-temporal dynamics and joint audio training, reaching competitive results on video QA and captioning benchmarks.

  34. Improved Baselines with Visual Instruction Tuning

    cs.CV 2023-10 conditional novelty 4.0

    Simple changes to LLaVA using CLIP-ViT-L-336px, an MLP connector, and academic VQA data yield state-of-the-art results on 11 benchmarks with only 1.2M public examples and one-day training on 8 A100 GPUs.

  35. Emotive Architectures: The Role of LLMs in Adjusting Work Environments

    cs.HC 2026-04 unverdicted novelty 3.0

    LLMs can turn static work settings into emotion-responsive hybrid environments that support focus and well-being.

Reference graph

Works this paper leans on

165 extracted references · 165 canonical work pages · cited by 33 Pith papers · 17 internal anchors

  1. [1]

    CM3: A causal masked multimodal model of the internet

    Armen Aghajanyan, Bernie Huang, Candace Ross, Vladimir Karpukhin, Hu Xu, Naman Goyal, Dmytro Okhonko, Mandar Joshi, Gargi Ghosh, Mike Lewis, and Luke Zettlemoyer. CM3: A causal masked multimodal model of the internet. arXiv:2201.07520, 2022

  2. [2]

    Self-supervised multimodal versatile networks

    Jean-Baptiste Alayrac, Adria Recasens, Rosalia Schneider, Relja Arandjelović, Jason Ramapuram, Jeffrey De Fauw, Lucas Smaira, Sander Dieleman, and Andrew Zisserman. Self-supervised multimodal versatile networks. Conference on Neural Information Processing Systems, 2020

  3. [3]

    VQA: Visual question answering

    Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In International Conference on Computer Vision, 2015

  4. [4]

    ReZero is all you need: Fast convergence at large depth

    Thomas Bachlechner, Bodhisattwa Prasad Majumder, Henry Mao, Gary Cottrell, and Julian McAuley. ReZero is all you need: Fast convergence at large depth. In Uncertainty in Artificial Intelligence, 2021

  5. [5]

    Frozen in time: A joint video and image encoder for end-to-end retrieval

    Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In International Conference on Computer Vision, 2021

  6. [6]

    Learning feed-forward one-shot learners

    Luca Bertinetto, João F. Henriques, Jack Valmadre, Philip Torr, and Andrea Vedaldi. Learning feed-forward one-shot learners. Conference on Neural Information Processing Systems, 2016

  7. [7]

    Meta-learning with differentiable closed-form solvers

    Luca Bertinetto, Joao F. Henriques, Philip H. S. Torr, and Andrea Vedaldi. Meta-learning with differentiable closed-form solvers. arXiv:1805.08136, 2018

  8. [8]

    JAX: composable transformations of Python+NumPy programs

    James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax

  9. [9]

    Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition

    John S. Bridle. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In Neurocomputing, 1990

  10. [10]

    High-performance large-scale image recognition without normalization

    Andrew Brock, Soham De, Samuel L. Smith, and Karen Simonyan. High-performance large-scale image recognition without normalization. arXiv:2102.06171, 2021

  11. [11]

    Language models are few-shot learners

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Lit...

  12. [12]

    Gender shades: Intersectional accuracy disparities in commercial gender classification

    Joy Buolamwini and Timnit Gebru. Gender shades: Intersectional accuracy disparities in commercial gender classification. In ACM Conference on Fairness, Accountability, and Transparency, 2018

  13. [13]

    End-to-end object detection with transformers

    Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European Conference on Computer Vision, 2020

  14. [14]

    Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts

    Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In IEEE Computer Vision and Pattern Recognition, 2021

  15. [15]

    Microsoft COCO Captions: Data Collection and Evaluation Server

    Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv:1504.00325, 2015

  16. [16]

    UNITER: Universal image-text representation learning

    Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. UNITER: Universal image-text representation learning. In European Conference on Computer Vision, 2020

  17. [17]

    Unifying vision-and-language tasks via text generation

    Jaemin Cho, Jie Lei, Hao Tan, and Mohit Bansal. Unifying vision-and-language tasks via text generation. In International Conference on Machine Learning, 2021

  18. [18]

    PaLM: Scaling language modeling with Pathways

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradb...

  19. [19]

    Enabling multimodal generation on clip via vision-language knowledge distillation

    Wenliang Dai, Lu Hou, Lifeng Shang, Xin Jiang, Qun Liu, and Pascale Fung. Enabling multimodal generation on clip via vision-language knowledge distillation. In ACL Findings, 2022

  20. [20]

    Visual dialog

    Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José MF Moura, Devi Parikh, and Dhruv Batra. Visual dialog. In IEEE Computer Vision and Pattern Recognition, 2017

  21. [21]

    Does object recognition work for everyone?

    Terrance De Vries, Ishan Misra, Changhan Wang, and Laurens Van der Maaten. Does object recognition work for everyone? In IEEE Computer Vision and Pattern Recognition, 2019

  22. [22]

    VirTex: Learning visual representations from textual annotations

    Karan Desai and Justin Johnson. VirTex: Learning visual representations from textual annotations. In IEEE Computer Vision and Pattern Recognition, 2021

  23. [23]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805, 2018

  24. [24]

    CrossTransformers: spatially-aware few-shot transfer

    Carl Doersch, Ankush Gupta, and Andrew Zisserman. CrossTransformers: spatially-aware few-shot transfer. Conference on Neural Information Processing Systems, 2020

  25. [25]

    Long-term recurrent convolutional networks for visual recognition and description

    Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. In IEEE Computer Vision and Pattern Recognition, 2015

  26. [26]

    MAGMA–multimodal augmentation of generative models through adapter-based finetuning

    Constantin Eichenberg, Sidney Black, Samuel Weinbach, Letitia Parcalabescu, and Anette Frank. MAGMA–multimodal augmentation of generative models through adapter-based finetuning. arXiv:2112.05253, 2021

  27. [27]

    Model-agnostic meta-learning for fast adaptation of deep networks

    Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, 2017

  28. [28]

    VIOLET: End-to-end video-language transformers with masked visual-token modeling

    Tsu-Jui Fu, Linjie Li, Zhe Gan, Kevin Lin, William Yang Wang, Lijuan Wang, and Zicheng Liu. VIOLET: End-to-end video-language transformers with masked visual-token modeling. arXiv:2111.12681, 2021

  29. [29]

    Large-scale adversarial training for vision-and-language representation learning

    Zhe Gan, Yen-Chun Chen, Linjie Li, Chen Zhu, Yu Cheng, and Jingjing Liu. Large-scale adversarial training for vision-and-language representation learning. In Conference on Neural Information Processing Systems, 2020

  30. [30]

    Datasheets for datasets

    Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for datasets. Communications of the ACM, 2021

  31. [31]

    Meta-learning probabilistic inference for prediction

    Jonathan Gordon, John Bronskill, Matthias Bauer, Sebastian Nowozin, and Richard E. Turner. Meta-learning probabilistic inference for prediction. arXiv:1805.09921, 2018

  32. [32]

    Generating sequences with recurrent neural networks

    Alex Graves. Generating sequences with recurrent neural networks. arXiv:1308.0850, 2013

  33. [33]

    Doing more with less: meta-reasoning and meta-learning in humans and machines

    Thomas L. Griffiths, Frederick Callaway, Michael B. Chang, Erin Grant, Paul M. Krueger, and Falk Lieder. Doing more with less: meta-reasoning and meta-learning in humans and machines. Current Opinion in Behavioral Sciences, 2019

  34. [34]

    KAT: A knowledge augmented transformer for vision-and-language

    Liangke Gui, Borui Wang, Qiuyuan Huang, Alex Hauptmann, Yonatan Bisk, and Jianfeng Gao. KAT: A knowledge augmented transformer for vision-and-language. arXiv:2112.08614, 2021

  35. [35]

    VizWiz grand challenge: Answering visual questions from blind people

    Danna Gurari, Qing Li, Abigale J. Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P. Bigham. VizWiz grand challenge: Answering visual questions from blind people. In IEEE Computer Vision and Pattern Recognition, 2018

  36. [36]

    Transformer language models without positional encodings still learn positional information

    Adi Haviv, Ori Ram, Ofir Press, Peter Izsak, and Omer Levy. Transformer language models without positional encodings still learn positional information. arXiv:2203.16634, 2022

  37. [37]

    Women also snowboard: Overcoming bias in captioning models

    Lisa Anne Hendricks, Kaylee Burns, Kate Saenko, Trevor Darrell, and Anna Rohrbach. Women also snowboard: Overcoming bias in captioning models. In European Conference on Computer Vision, 2018

  38. [38]

    Decoupling the role of data, attention, and losses in multimodal transformers

    Lisa Anne Hendricks, John Mellor, Rosalia Schneider, Jean-Baptiste Alayrac, and Aida Nematzadeh. Decoupling the role of data, attention, and losses in multimodal transformers. Annual Meeting of the Association for Computational Linguistics, 2021

  39. [39]

    Gaussian Error Linear Units (GELUs)

    Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs). arXiv:1606.08415, 2016

  40. [40]

    Haiku: Sonnet for JAX

    Tom Hennigan, Trevor Cai, Tamara Norman, and Igor Babuschkin. Haiku: Sonnet for JAX. URL http://github.com/deepmind/dm-haiku

  42. [42]

    Long short-term memory

    Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 1997

  43. [43]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Eric Noland, Tom Hennigan, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre. Training compute-optimal large language models, 2022

  44. [44]

    Parameter-efficient transfer learning for NLP

    Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning, 2019

  45. [45]

    Universal language model fine-tuning for text classification

    Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. arXiv:1801.06146, 2018

  46. [46]

    Scaling up vision-language pre-training for image captioning

    Xiaowei Hu, Zhe Gan, Jianfeng Wang, Zhengyuan Yang, Zicheng Liu, Yumao Lu, and Lijuan Wang. Scaling up vision-language pre-training for image captioning. arXiv:2111.12233, 2021

  47. [47]

    Attention on attention for image captioning

    Lun Huang, Wenmin Wang, Jie Chen, and Xiao-Yong Wei. Attention on attention for image captioning. In International Conference on Computer Vision, 2019

  48. [48]

    Global pooling, more than meets the eye: Position information is encoded channel-wise in CNNs

    Md Amirul Islam, Matthew Kowal, Sen Jia, Konstantinos G. Derpanis, and Neil D. B. Bruce. Global pooling, more than meets the eye: Position information is encoded channel-wise in CNNs. In International Conference on Computer Vision, 2021

  49. [49]

    Perceiver: General perception with iterative attention

    Andrew Jaegle, Felix Gimeno, Andy Brock, Oriol Vinyals, Andrew Zisserman, and Joao Carreira. Perceiver: General perception with iterative attention. In International Conference on Machine Learning, 2021

  50. [50]

    Mural: multimodal, multitask retrieval across languages

    Aashi Jain, Mandy Guo, Krishna Srinivasan, Ting Chen, Sneha Kudugunta, Chao Jia, Yinfei Yang, and Jason Baldridge. MURAL: multimodal, multitask retrieval across languages. arXiv:2109.05125, 2021

  51. [51]

    Scaling up visual and vision-language representation learning with noisy text supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. arXiv:2102.05918, 2021

  52. [52]

    All in one: Exploring unified video-language pre-training

    Alex Jinpeng Wang, Yixiao Ge, Rui Yan, Yuying Ge, Xudong Lin, Guanyu Cai, Jianping Wu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. All in one: Exploring unified video-language pre-training. arXiv:2203.07303, 2022

  53. [53]

    Exploring the Limits of Language Modeling

    Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. arXiv:1602.02410, 2016

  54. [54]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv:2001.08361, 2020

  55. [55]

    The Hateful Memes Challenge: Detecting hate speech in multimodal memes

    Douwe Kiela, Hamed Firooz, Aravind Mohan, Vedanuj Goswami, Amanpreet Singh, Pratik Ringshia, and Davide Testuggine. The Hateful Memes Challenge: Detecting hate speech in multimodal memes. Conference on Neural Information Processing Systems, 2020

  56. [56]

    Few-shot classification by recycling deep learning

    Hugo Larochelle. Few-shot classification by recycling deep learning. Invited Talk at the S2D-OLAD Workshop, ICLR 2021, 2021. URL https://slideslive.com/38955350/fewshot-classification-by-recycling-deep-learning

  57. [58]

    Align before fuse: Vision and language representation learning with momentum distillation

    Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before fuse: Vision and language representation learning with momentum distillation. In Conference on Neural Information Processing Systems, 2021

  58. [59]

    BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. arXiv:2201.12086, 2022

  59. [60]

    HERO: Hierarchical encoder for video+language omni-representation pre-training

    Linjie Li, Yen-Chun Chen, Yu Cheng, Zhe Gan, Licheng Yu, and Jingjing Liu. HERO: Hierarchical encoder for video+language omni-representation pre-training. arXiv:2005.00200, 2020

  60. [61]

    Prefix-Tuning: Optimizing Continuous Prompts for Generation

    Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. arXiv:2101.00190, 2021

  61. [62]

    Oscar: Object-semantics aligned pre-training for vision-language tasks

    Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, and Jianfeng Gao. Oscar: Object-semantics aligned pre-training for vision-language tasks. In European Conference on Computer Vision, 2020

  62. [63]

    A multimodal framework for the detection of hateful memes

    Phillip Lippe, Nithin Holla, Shantanu Chandra, Santhosh Rajamanickam, Georgios Antoniou, Ekaterina Shutova, and Helen Yannakoudakis. A multimodal framework for the detection of hateful memes. arXiv:2012.12871, 2020

  63. [64]

    What makes good in-context examples for GPT-3?

    Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, and Weizhu Chen. What makes good in-context examples for GPT-3? arXiv:2101.06804, 2021

  64. [65]

    Optimization of image description metrics using policy gradient methods

    Siqi Liu, Zhenhai Zhu, Ning Ye, Sergio Guadarrama, and Kevin Murphy. Optimization of image description metrics using policy gradient methods. In International Conference on Computer Vision, 2017

  65. [66]

    Enhancing textual cues in multi-modal transformers for VQA

    Yu Liu, Lianghua Huang, Liuyihang Song, Bin Wang, Yingya Zhang, and Pan Pan. Enhancing textual cues in multi-modal transformers for VQA. VizWiz Challenge 2021, 2021

  66. [67]

    ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks

    Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Conference on Neural Information Processing Systems, 2019

  67. [68]

    UniVL: A unified video and language pre-training model for multimodal understanding and generation

    Huaishao Luo, Lei Ji, Botian Shi, Haoyang Huang, Nan Duan, Tianrui Li, Jason Li, Taroon Bharti, and Ming Zhou. UniVL: A unified video and language pre-training model for multimodal understanding and generation. arXiv:2002.06353, 2020

  68. [69]

    VC-GPT: Visual conditioned GPT for end-to-end generative vision-and-language pre-training

    Ziyang Luo, Yadong Xi, Rongsheng Zhang, and Jing Ma. VC-GPT: Visual conditioned GPT for end-to-end generative vision-and-language pre-training. arXiv:2201.12723, 2022

  69. [70]

    OK-VQA: A visual question answering benchmark requiring external knowledge

    Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. OK-VQA: A visual question answering benchmark requiring external knowledge. In IEEE Computer Vision and Pattern Recognition, 2019

  70. [71]

    Categorization and naming in children: Problems of induction

    Ellen M. Markman. Categorization and naming in children: Problems of induction. MIT Press, 1989

  71. [72]

    Catastrophic interference in connectionist networks: The sequential learning problem

    Michael McCloskey and Neil J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. The Psychology of Learning and Motivation, 1989

  72. [73]

    Teaching language models to support answers with verified quotes

    Jacob Menick, Maja Trebacz, Vladimir Mikulik, John Aslanides, Francis Song, Martin Chadwick, Mia Glaese, Susannah Young, Lucy Campbell-Gillingham, Geoffrey Irving, and Nat McAleese. Teaching language models to support answers with verified quotes. arXiv:2203.11147, 2022

  73. [74]

    RareAct: A video dataset of unusual interactions

    Antoine Miech, Jean-Baptiste Alayrac, Ivan Laptev, Josef Sivic, and Andrew Zisserman. RareAct: A video dataset of unusual interactions. arXiv:2008.01018, 2020

  74. [75]

    End-to-end learning of visual representations from uncurated instructional videos

    Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, and Andrew Zisserman. End-to-end learning of visual representations from uncurated instructional videos. In IEEE Computer Vision and Pattern Recognition, 2020

  75. [76]

    Recurrent neural network based language model

    Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Černocký, and Sanjeev Khudanpur. Recurrent neural network based language model. Interspeech, 2010

  76. [77]

    Rethinking the role of demonstrations: What makes in-context learning work?

    Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. Rethinking the role of demonstrations: What makes in-context learning work? arXiv:2202.12837, 2022

  77. [78]

    Model cards for model reporting

    Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. Model cards for model reporting. In ACM Conference on Fairness, Accountability, and Transparency, 2019

  78. [79]

    ClipCap: CLIP prefix for image captioning

    Ron Mokady, Amir Hertz, and Amit H. Bermano. ClipCap: CLIP prefix for image captioning. arXiv:2111.09734, 2021

  79. [80]

    Large-scale pretraining for visual dialog: A simple state-of-the-art baseline

    Vishvak Murahari, Dhruv Batra, Devi Parikh, and Abhishek Das. Large-scale pretraining for visual dialog: A simple state-of-the-art baseline. In European Conference on Computer Vision, 2020

  80. [81]

    True few-shot learning with language models

    Ethan Perez, Douwe Kiela, and Kyunghyun Cho. True few-shot learning with language models. Conference on Neural Information Processing Systems, 2021

Showing first 80 references.