pith. machine review for the scientific record.

arxiv: 2204.14198 · v2 · submitted 2022-04-29 · 💻 cs.CV · cs.AI · cs.LG

Recognition: 2 theorem links


Flamingo: a Visual Language Model for Few-Shot Learning

Aida Nematzadeh, Andrew Brock, Andrew Zisserman, Antoine Miech, Arthur Mensch, Eliza Rutherford, Iain Barr, Jacob Menick, Jean-Baptiste Alayrac, Jeff Donahue, Karel Lenc, Karen Simonyan, Katie Millican, Malcolm Reynolds, Marianne Monteiro, Mikolaj Binkowski, Oriol Vinyals, Pauline Luc, Ricardo Barreira, Roman Ring, Sahand Sharifzadeh, Sebastian Borgeaud, Serkan Cabi, Sina Samangooei, Tengda Han, Yana Hasson, Zhitao Gong

Pith reviewed 2026-05-12 04:16 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords visual language model · few-shot learning · multimodal · in-context learning · visual question answering · image captioning · video understanding

The pith

A single Flamingo visual language model reaches new state-of-the-art results on image and video tasks simply by receiving a few task examples in its prompt.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Flamingo as a family of models that combine strong vision and language components to adapt to new tasks with minimal labeled data. Training occurs on large web collections where text and images appear mixed in natural order, allowing the model to learn in-context adaptation. Evaluation covers visual question answering, captioning of scenes or events, and multiple-choice questions on both still images and video clips. A sympathetic reader would care because most current systems demand thousands of task-specific examples for each new use case, whereas Flamingo suggests a single pretrained model can handle many uses through prompting alone. If the results hold, this reduces the data and compute barriers to deploying capable multimodal systems.

Core claim

Flamingo models, after pretraining on large-scale multimodal web corpora with arbitrarily interleaved text and images, can be prompted with a small number of task-specific examples to achieve state-of-the-art performance across a spectrum of vision-language tasks, including open-ended visual question answering, image and video captioning, and multiple-choice visual question answering. They often exceed the results of models fine-tuned on thousands of times more labeled data for each individual task.

What carries the argument

Architectural innovations that bridge pretrained vision-only and language-only models to process sequences of arbitrarily interleaved visual and textual data while accepting images or videos as input.
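The load-bearing architectural piece is the tanh-gated cross-attention layer inserted between frozen language-model blocks: the gate is initialized to zero, so the pretrained LM is exactly preserved at the start of training and visual conditioning is learned gradually. A minimal NumPy sketch of the gating idea (single-head, simplified; the name `alpha` and the shapes are illustrative, not the paper's exact parameterization):

```python
import numpy as np

def cross_attention(x, v, Wq, Wk, Wv):
    """Single-head cross-attention: text tokens x attend to visual tokens v."""
    q, k, vals = x @ Wq, v @ Wk, v @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)  # softmax over visual tokens
    return w @ vals

def gated_xattn(x, v, Wq, Wk, Wv, alpha):
    # Residual tanh gate: with alpha = 0 the layer is the identity,
    # so the frozen pretrained LM is left untouched at initialization.
    return x + np.tanh(alpha) * cross_attention(x, v, Wq, Wk, Wv)

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(4, d))   # 4 text-token embeddings
v = rng.normal(size=(3, d))   # 3 visual tokens from the vision encoder
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

y0 = gated_xattn(x, v, Wq, Wk, Wv, alpha=0.0)  # identity at init
y1 = gated_xattn(x, v, Wq, Wk, Wv, alpha=1.0)  # gate opened by training
```

The zero-init gate is what lets strong frozen language and vision backbones be stitched together without destabilizing either at the start of training.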

If this is right

  • A single model handles both open-ended tasks such as describing scenes and close-ended tasks such as multiple-choice questions through the same prompting approach.
  • Performance on captioning, visual question answering, and video understanding improves by adding more examples directly in the input without any parameter updates.
  • The same pretrained weights apply to both still images and video inputs without task-specific retraining.
  • Flamingo sets new benchmark levels on numerous vision-language datasets while using far less task-specific data than prior approaches.
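The prompting interface behind these points amounts to interleaving support (image, text) pairs ahead of the query image and letting the model continue the text. A toy sketch of that prompt construction (the `<image>` marker and helper are illustrative, not Flamingo's actual tokenization):

```python
def build_few_shot_prompt(support, query_prefix=""):
    """support: list of (image_ref, target_text) pairs supplied in-context.
    Returns one interleaved prompt; the model continues the text after the
    final <image> marker with no parameter updates."""
    lines = [f"<image> {text}" for _image, text in support]
    lines.append(f"<image> {query_prefix}")  # query image, completion follows
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    [("shots/dog.jpg", "A dog catching a frisbee."),
     ("shots/beach.jpg", "Two children building a sandcastle.")])
```

Adapting to a new task then means swapping the support pairs, not retraining anything.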

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach implies that one model could replace many separately fine-tuned systems if users supply fresh examples at inference time for each new application.
  • Similar interleaved-data pretraining might extend in-context adaptation to additional modalities such as audio or 3D scenes.
  • Downstream users could experiment with task variants by changing only the prompt examples rather than collecting new training sets.

Load-bearing premise

Training on large-scale multimodal web data with freely interleaved text and images produces in-context few-shot abilities that transfer to new downstream tasks without overfitting to patterns in the pretraining distribution.

What would settle it

If Flamingo with a handful of prompt examples fails to match or exceed the accuracy of a model fine-tuned on thousands of task-specific examples when both are tested on the same held-out set of image and video benchmarks, the central claim would not hold.

read the original abstract

Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research. We introduce Flamingo, a family of Visual Language Models (VLM) with this ability. We propose key architectural innovations to: (i) bridge powerful pretrained vision-only and language-only models, (ii) handle sequences of arbitrarily interleaved visual and textual data, and (iii) seamlessly ingest images or videos as inputs. Thanks to their flexibility, Flamingo models can be trained on large-scale multimodal web corpora containing arbitrarily interleaved text and images, which is key to endow them with in-context few-shot learning capabilities. We perform a thorough evaluation of our models, exploring and measuring their ability to rapidly adapt to a variety of image and video tasks. These include open-ended tasks such as visual question-answering, where the model is prompted with a question which it has to answer; captioning tasks, which evaluate the ability to describe a scene or an event; and close-ended tasks such as multiple-choice visual question-answering. For tasks lying anywhere on this spectrum, a single Flamingo model can achieve a new state of the art with few-shot learning, simply by prompting the model with task-specific examples. On numerous benchmarks, Flamingo outperforms models fine-tuned on thousands of times more task-specific data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Flamingo, a family of Visual Language Models (VLMs) for few-shot learning on image and video tasks. It proposes architectural innovations to bridge pretrained vision-only and language-only models, handle arbitrarily interleaved visual and textual sequences, and ingest images or videos. The models are trained on large-scale multimodal web corpora with interleaved text and images to enable in-context few-shot capabilities. Evaluations across open-ended tasks (VQA, captioning) and close-ended tasks (multiple-choice VQA) show a single Flamingo model achieving new state-of-the-art few-shot performance via prompting with task examples, often outperforming models fine-tuned on orders-of-magnitude more task-specific data.

Significance. If the empirical results hold after addressing controls for data contamination and statistical rigor, this would be a significant contribution to multimodal learning. It would establish that large-scale pretraining on interleaved web data can produce robust in-context adaptation across diverse visual tasks without task-specific fine-tuning, advancing flexible few-shot multimodal systems and reducing data requirements for adaptation.

major comments (2)
  1. [§4 (Experiments and Results)] The performance tables reporting few-shot results on benchmarks such as VQAv2, COCO, and OK-VQA do not include error bars, standard deviations, or details on multiple runs/ablations. This makes it impossible to determine whether the claimed outperformance over fine-tuned baselines is statistically reliable or could arise from evaluation variance.
  2. [§3 (Model and Training)] No analysis of potential overlap or decontamination between the large-scale multimodal web pretraining corpus and the downstream benchmarks (COCO, VQAv2, OK-VQA, TextVQA, etc.) is reported. Since the central claim attributes gains to learned in-context few-shot learning rather than memorization, this omission is load-bearing and requires explicit checks to rule out leakage as an alternative explanation.
minor comments (2)
  1. [Abstract] The abstract and introduction refer to 'a family of' models but do not specify parameter counts or the exact model sizes evaluated, which would aid in interpreting scaling behavior and reproducibility.
  2. [§2 (Architecture)] Notation for components such as the perceiver resampler and gated cross-attention layers is introduced without immediate cross-references to the equations defining their operation, reducing clarity for readers unfamiliar with the architecture.
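Part of major comment 1 can be addressed without multi-seed retraining: re-run the few-shot evaluation under every ordering of the support examples and report the spread alongside the mean. A toy sketch of that check (the `evaluate` stub is hypothetical; a real run would score the model on the benchmark):

```python
import itertools
import statistics

def scores_over_orderings(evaluate, support, queries):
    """Evaluate the same query set under every permutation of the
    in-context support examples, so that prompt-order variance can be
    reported next to the headline mean."""
    return [evaluate(list(order), queries)
            for order in itertools.permutations(support)]

# Hypothetical stub: accuracy shifts slightly with where ex_a appears.
def evaluate(order, queries):
    return 0.60 + 0.01 * order.index("ex_a")

scores = scores_over_orderings(evaluate, ["ex_a", "ex_b", "ex_c"], ["q1", "q2"])
mean = statistics.mean(scores)       # headline number
spread = statistics.pstdev(scores)   # variance attributable to ordering
```

With k support examples this costs k! evaluations, so in practice a random sample of orderings would stand in for the full set.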

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights important aspects of statistical reliability and data integrity. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core claims.

read point-by-point responses
  1. Referee: [§4 (Experiments and Results)] The performance tables reporting few-shot results on benchmarks such as VQAv2, COCO, and OK-VQA do not include error bars, standard deviations, or details on multiple runs/ablations. This makes it impossible to determine whether the claimed outperformance over fine-tuned baselines is statistically reliable or could arise from evaluation variance.

    Authors: We agree that error bars and details on run variance would improve interpretability. Due to the high computational cost of training and evaluating these large-scale models, we report results from single training runs, which is standard practice in this domain. In the revision we will add a dedicated limitations paragraph in §4 discussing this constraint, along with variance estimates obtained from multiple few-shot prompt orderings and from ablations on smaller Flamingo variants. These additions will clarify the robustness of the reported gains while acknowledging that full multi-seed statistics for the largest models remain impractical. revision: partial

  2. Referee: [§3 (Model and Training)] No analysis of potential overlap or decontamination between the large-scale multimodal web pretraining corpus and the downstream benchmarks (COCO, VQAv2, OK-VQA, TextVQA, etc.) is reported. Since the central claim attributes gains to learned in-context few-shot learning rather than memorization, this omission is load-bearing and requires explicit checks to rule out leakage as an alternative explanation.

    Authors: We recognize that explicit decontamination analysis is necessary to support the claim that performance stems from in-context learning. The current manuscript does not contain such an analysis. In the revised version we will add a new subsection (and appendix) that quantifies n-gram and image-level overlap between the pretraining corpus and each benchmark, reports the fraction of contaminated examples, and shows that Flamingo retains strong few-shot performance on the non-overlapping subsets. These checks will directly address the possibility of memorization. revision: yes
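The n-gram overlap check the rebuttal commits to can be sketched directly: flag a benchmark example as contaminated when any of its n-grams appears verbatim in the pretraining text. The 8-gram threshold and helper below are illustrative choices, not the paper's protocol:

```python
def ngrams(text, n=8):
    """Set of word-level n-grams for a text; empty if text is shorter than n."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contaminated_fraction(benchmark_texts, pretrain_docs, n=8):
    """Fraction of benchmark examples sharing any verbatim n-gram with the
    pretraining corpus; reporting scores on the clean complement is the
    check that rules out memorization."""
    corpus = set()
    for doc in pretrain_docs:
        corpus |= ngrams(doc, n)
    hits = sum(1 for ex in benchmark_texts if ngrams(ex, n) & corpus)
    return hits / len(benchmark_texts)

frac = contaminated_fraction(
    ["a b c d e f g h", "totally novel question"],
    ["web page saying a b c d e f g h i j"])
```

Image-level overlap needs a separate perceptual-hash or embedding-similarity pass, since n-grams only cover the text side of the corpus.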

Circularity Check

0 steps flagged

No circularity in derivation chain; claims rest on empirical evaluation

full rationale

The paper presents architectural choices for bridging vision and language models and handling interleaved sequences, then reports empirical few-shot performance on held-out benchmarks such as VQA, captioning, and multiple-choice tasks. No equations, fitted parameters, or self-referential definitions are described that would make any result equivalent to its inputs by construction. The central claim—that a single model achieves SOTA few-shot results by prompting—depends on training data and benchmark evaluations that are external to any internal derivation, satisfying the criteria for a self-contained, non-circular presentation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the empirical effectiveness of the proposed architecture and training regime; no explicit free parameters, axioms, or invented entities are introduced in the abstract beyond standard assumptions about pretrained models and web-scale data.

pith-pipeline@v0.9.0 · 5651 in / 1104 out tokens · 30675 ms · 2026-05-12T04:16:37.977355+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Forward citations

Cited by 35 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Editing Models with Task Arithmetic

    cs.LG 2022-12 accept novelty 8.0

    Task vectors from weight differences allow arithmetic operations to edit pre-trained models, improving multiple tasks simultaneously and enabling analogical inference on unseen tasks.

  2. Allegory of the Cave: Measurement-Grounded Vision-Language Learning

    cs.AI 2026-05 unverdicted novelty 7.0

    PRISM-VL improves VLM performance by grounding on RAW-derived Meas.-XYZ inputs and exposure-bracketed supervision, gaining +0.1074 BLEU and +4.46% LLM-Judge accuracy over an RGB baseline on a held-out benchmark.

  3. VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation

    cs.CV 2026-05 unverdicted novelty 7.0

    VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.

  4. COHERENCE: Benchmarking Fine-Grained Image-Text Alignment in Interleaved Multimodal Contexts

    cs.CV 2026-04 unverdicted novelty 7.0

    COHERENCE is a new benchmark for measuring MLLMs' ability to recover fine-grained image-text correspondences in interleaved multimodal contexts.

  5. COHERENCE: Benchmarking Fine-Grained Image-Text Alignment in Interleaved Multimodal Contexts

    cs.CV 2026-04 unverdicted novelty 7.0

    COHERENCE is a benchmark for MLLMs' fine-grained image-text alignment in interleaved multimodal contexts across four domains, with 6161 questions and six-type error analysis.

  6. Geo2Sound: A Scalable Geo-Aligned Framework for Soundscape Generation from Satellite Imagery

    cs.MM 2026-04 unverdicted novelty 7.0

    Geo2Sound generates geographically realistic soundscapes from satellite imagery via geospatial attribute modeling, semantic hypothesis expansion, and geo-acoustic alignment, achieving SOTA FAD of 1.765 on a new 20k-pa...

  7. Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation

    cs.CV 2026-04 conditional novelty 7.0

    Alignment of vision-language models with human V1-V3 early visual cortex negatively predicts resistance to sycophantic gaslighting attacks.

  8. Bottleneck Tokens for Unified Multimodal Retrieval

    cs.LG 2026-04 unverdicted novelty 7.0

    Bottleneck Tokens paired with a masked generative objective achieve state-of-the-art unified multimodal retrieval performance among 2B-scale models on the MMEB-V2 benchmark with 78 datasets.

  9. Visual Instruction Tuning

    cs.CV 2023-04 unverdicted novelty 7.0

    LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.

  10. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

    cs.CV 2023-01 unverdicted novelty 7.0

    BLIP-2 bootstraps vision-language pre-training from frozen image encoders and LLMs via a lightweight two-stage Querying Transformer, delivering SOTA results with 54x fewer trainable parameters than Flamingo80B on zero...

  11. LAION-5B: An open large-scale dataset for training next generation image-text models

    cs.CV 2022-10 accept novelty 7.0

    LAION-5B is an openly released dataset of 5.85 billion CLIP-filtered image-text pairs that enables replication of foundational vision-language models.

  12. A Generalist Agent

    cs.AI 2022-05 accept novelty 7.0

    Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.

  13. Can Multimodal Large Language Models Understand Pathologic Movements? A Pilot Study on Seizure Semiology

    cs.CV 2026-05 unverdicted novelty 6.0

    MLLMs achieve zero-shot recognition of seizure semiological features better than fine-tuned vision models on most tested features, with signal enhancement and faithful explanations.

  14. A Meta Reinforcement Learning Approach to Goals-Based Wealth Management

    cs.LG 2026-05 unverdicted novelty 6.0

    MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.

  15. Agentic AI for Remote Sensing: Technical Challenges and Research Directions

    cs.CV 2026-04 unverdicted novelty 6.0

    Agentic AI faces structural challenges in remote sensing due to geospatial data properties and workflow constraints, requiring EO-native agents built around structured state, tool-aware reasoning, and validity-aware e...

  16. Majorization-Guided Test-Time Adaptation for Vision-Language Models under Modality-Specific Shift

    cs.CV 2026-04 unverdicted novelty 6.0

    MG-MTTA improves VLM accuracy under modality-specific shifts by replacing pure entropy minimization with majorization-guided adaptation that incorporates a reliability-aware gate prior.

  17. MTT-Bench: Predicting Social Dominance in Mice via Multimodal Large Language Models

    eess.IV 2026-04 unverdicted novelty 6.0

    Fine-tuned multimodal LLMs predict mouse social dominance from raw tube test videos with high agreement to traditional rankings.

  18. Foveated Reasoning: Stateful, Action-based Visual Focusing for Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    Foveated Reasoner integrates foveation as stateful actions inside the autoregressive decoding loop of vision-language models, trained via cold-start supervision then reinforcement learning to achieve higher accuracy a...

  19. Text Steganography with Dynamic Codebook and Multimodal Large Language Model

    cs.CR 2026-04 unverdicted novelty 6.0

    A black-box text steganography method using a dynamic codebook generated by multimodal LLMs and reject-sampling feedback achieves higher embedding capacity and text quality than prior white-box and fixed-codebook blac...

  20. AIM: Asymmetric Information Masking for Visual Question Answering Continual Learning

    cs.CV 2026-04 unverdicted novelty 6.0

    AIM applies modality-specific masks to balance stability and plasticity in asymmetric VLMs, achieving SOTA average performance and reduced forgetting on continual VQA v2 and GQA while preserving generalization to nove...

  21. LLaVA-Video: Video Instruction Tuning With Synthetic Data

    cs.CV 2024-10 unverdicted novelty 6.0

    LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.

  22. Vision Transformers Need Registers

    cs.CV 2023-09 unverdicted novelty 6.0

    Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.

  23. Improving Factuality and Reasoning in Language Models through Multiagent Debate

    cs.CL 2023-05 unverdicted novelty 6.0

    Multiagent debate among LLMs improves mathematical reasoning, strategic reasoning, and factual accuracy while reducing hallucinations.

  24. MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

    cs.CV 2023-03 unverdicted novelty 6.0

    MM-REACT uses textual prompts to let ChatGPT collaborate with external vision experts for zero-shot multimodal reasoning and action on advanced visual tasks.

  25. PaLM-E: An Embodied Multimodal Language Model

    cs.LG 2023-03 conditional novelty 6.0

    PaLM-E is a single 562B-parameter multimodal model that performs embodied reasoning tasks like robotic manipulation planning and visual question answering by interleaving vision, state, and text inputs with positive t...

  26. Inner Monologue: Embodied Reasoning through Planning with Language Models

    cs.RO 2022-07 unverdicted novelty 6.0

    LLMs form an inner monologue from closed-loop language feedback to improve high-level instruction completion in simulated and real robotic rearrangement and kitchen manipulation tasks.

  27. Language Models (Mostly) Know What They Know

    cs.CL 2022-07 unverdicted novelty 6.0

    Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

  28. Emergent Abilities of Large Language Models

    cs.CL 2022-06 unverdicted novelty 6.0

    Emergent abilities are capabilities present in large language models but absent in smaller ones and cannot be predicted by extrapolating smaller model performance.

  29. Agentic AI for Remote Sensing: Technical Challenges and Research Directions

    cs.CV 2026-04 unverdicted novelty 5.0

    Agentic AI for remote sensing requires new designs centered on structured geospatial state, tool-aware reasoning, verifier-guided execution, and physical validity rather than generic extensions.

  30. Galactica: A Large Language Model for Science

    cs.CL 2022-11 unverdicted novelty 5.0

    Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.

  31. Integration of Object Detection and Small VLMs for Construction Safety Hazard Identification

    cs.CV 2026-04 unverdicted novelty 4.0

    Detection-guided prompting raises small VLM hazard F1 from 34.5% to 50.6% and BERTScore from 0.61 to 0.82 on construction images with only 2.5 ms added latency.

  32. PaliGemma: A versatile 3B VLM for transfer

    cs.CV 2024-07 unverdicted novelty 4.0

    PaliGemma is an open 3B VLM based on SigLIP and Gemma that achieves strong performance on nearly 40 diverse open-world tasks including benchmarks, remote-sensing, and segmentation.

  33. VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

    cs.CV 2024-06 unverdicted novelty 4.0

    VideoLLaMA 2 improves video LLMs via a new STC connector for spatial-temporal dynamics and joint audio training, reaching competitive results on video QA and captioning benchmarks.

  34. Improved Baselines with Visual Instruction Tuning

    cs.CV 2023-10 conditional novelty 4.0

    Simple changes to LLaVA using CLIP-ViT-L-336px, an MLP connector, and academic VQA data yield state-of-the-art results on 11 benchmarks with only 1.2M public examples and one-day training on 8 A100 GPUs.

  35. Emotive Architectures: The Role of LLMs in Adjusting Work Environments

    cs.HC 2026-04 unverdicted novelty 3.0

    LLMs can turn static work settings into emotion-responsive hybrid environments that support focus and well-being.

Reference graph

Works this paper leans on

165 extracted references · 165 canonical work pages · cited by 33 Pith papers · 17 internal anchors

  1. [1]

    CM3: A causal masked multimodal model of the internet

    Armen Aghajanyan, Bernie Huang, Candace Ross, Vladimir Karpukhin, Hu Xu, Naman Goyal, Dmytro Okhonko, Mandar Joshi, Gargi Ghosh, Mike Lewis, and Luke Zettlemoyer. CM3: A causal masked multimodal model of the internet. arXiv:2201.07520, 2022

  2. [2]

    Self-supervised multimodal versatile networks

    Jean-Baptiste Alayrac, Adria Recasens, Rosalia Schneider, Relja Arandjelović, Jason Ramapuram, Jeffrey De Fauw, Lucas Smaira, Sander Dieleman, and Andrew Zisserman. Self-supervised multimodal versatile networks. Conference on Neural Information Processing Systems, 2020

  3. [3]

    VQA: Visual question answering

    Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In International Conference on Computer Vision, 2015

  4. [4]

    ReZero is all you need: Fast convergence at large depth

    Thomas Bachlechner, Bodhisattwa Prasad Majumder, Henry Mao, Gary Cottrell, and Julian McAuley. ReZero is all you need: Fast convergence at large depth. In Uncertainty in Artificial Intelligence, 2021

  5. [5]

    Frozen in time: A joint video and image encoder for end-to-end retrieval

    Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In International Conference on Computer Vision, 2021

  6. [6]

    Learning feed-forward one-shot learners

    Luca Bertinetto, João F. Henriques, Jack Valmadre, Philip Torr, and Andrea Vedaldi. Learning feed-forward one-shot learners. Conference on Neural Information Processing Systems, 2016

  7. [7]

    Meta-learning with differentiable closed-form solvers

    Luca Bertinetto, Joao F. Henriques, Philip H. S. Torr, and Andrea Vedaldi. Meta-learning with differentiable closed-form solvers. arXiv:1805.08136, 2018

  8. [8]

    JAX: composable transformations of Python+NumPy programs

    James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax

  9. [9]

    Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition

    John S. Bridle. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In Neurocomputing, 1990

  10. [10]

    High-performance large-scale image recognition without normalization

    Andrew Brock, Soham De, Samuel L. Smith, and Karen Simonyan. High-performance large-scale image recognition without normalization. arXiv:2102.06171, 2021

  11. [11]

    Language models are few-shot learners

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Lit...

  12. [12]

    Gender shades: Intersectional accuracy disparities in commercial gender classification

    Joy Buolamwini and Timnit Gebru. Gender shades: Intersectional accuracy disparities in commercial gender classification. In ACM Conference on Fairness, Accountability, and Transparency, 2018

  13. [13]

    End-to-end object detection with transformers

    Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European Conference on Computer Vision, 2020

  14. [14]

    Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts

    Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In IEEE Computer Vision and Pattern Recognition, 2021

  15. [15]

    Microsoft COCO Captions: Data Collection and Evaluation Server

    Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv:1504.00325, 2015

  16. [16]

    UNITER: Universal image-text representation learning

    Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. UNITER: Universal image-text representation learning. In European Conference on Computer Vision, 2020

  17. [17]

    Unifying vision-and-language tasks via text generation

    Jaemin Cho, Jie Lei, Hao Tan, and Mohit Bansal. Unifying vision-and-language tasks via text generation. In International Conference on Machine Learning, 2021

  18. [18]

    PaLM: Scaling language modeling with Pathways

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradb...

  19. [19]

    Enabling multimodal generation on clip via vision-language knowledge distillation

    Wenliang Dai, Lu Hou, Lifeng Shang, Xin Jiang, Qun Liu, and Pascale Fung. Enabling multimodal generation on clip via vision-language knowledge distillation. In ACL Findings, 2022

  20. [20]

    Visual dialog

    Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José MF Moura, Devi Parikh, and Dhruv Batra. Visual dialog. In IEEE Computer Vision and Pattern Recognition, 2017

  21. [21]

    Does object recognition work for everyone?

    Terrance De Vries, Ishan Misra, Changhan Wang, and Laurens Van der Maaten. Does object recognition work for everyone? In IEEE Computer Vision and Pattern Recognition, 2019

  22. [22]

    VirTex: Learning visual representations from textual annotations

    Karan Desai and Justin Johnson. VirTex: Learning visual representations from textual annotations. In IEEE Computer Vision and Pattern Recognition, 2021

  23. [23]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805, 2018

  24. [24]

    CrossTransformers: spatially-aware few-shot transfer

    Carl Doersch, Ankush Gupta, and Andrew Zisserman. CrossTransformers: spatially-aware few-shot transfer. Conference on Neural Information Processing Systems, 2020

  25. [25]

    Long-term recurrent convolutional networks for visual recognition and description

    Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. In IEEE Computer Vision and Pattern Recognition, 2015

  26. [26]

    MAGMA–multimodal augmentation of generative models through adapter-based finetuning

    Constantin Eichenberg, Sidney Black, Samuel Weinbach, Letitia Parcalabescu, and Anette Frank. MAGMA–multimodal augmentation of generative models through adapter-based finetuning. arXiv:2112.05253, 2021

  27. [27]

    Model-agnostic meta-learning for fast adaptation of deep networks

    Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, 2017

  28. [28]

    VIOLET: End-to-end video-language transformers with masked visual-token modeling

    Tsu-Jui Fu, Linjie Li, Zhe Gan, Kevin Lin, William Yang Wang, Lijuan Wang, and Zicheng Liu. VIOLET: End-to-end video-language transformers with masked visual-token modeling. arXiv:2111.12681, 2021

  29. [29]

    Large-scale adversarial training for vision-and-language representation learning

    Zhe Gan, Yen-Chun Chen, Linjie Li, Chen Zhu, Yu Cheng, and Jingjing Liu. Large-scale adversarial training for vision-and-language representation learning. In Conference on Neural Information Processing Systems, 2020

  30. [30]

    Datasheets for datasets

    Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for datasets. Communications of the ACM, 2021

  31. [31]

    Meta-learning probabilistic inference for prediction

    Jonathan Gordon, John Bronskill, Matthias Bauer, Sebastian Nowozin, and Richard E. Turner. Meta-learning probabilistic inference for prediction. arXiv:1805.09921, 2018

  32. [32]

    Generating sequences with recurrent neural networks

    Alex Graves. Generating sequences with recurrent neural networks. arXiv:1308.0850, 2013

  33. [33]

    Doing more with less: meta-reasoning and meta-learning in humans and machines

    Thomas L. Griffiths, Frederick Callaway, Michael B. Chang, Erin Grant, Paul M. Krueger, and Falk Lieder. Doing more with less: meta-reasoning and meta-learning in humans and machines. Current Opinion in Behavioral Sciences, 2019

  34. [34]

    KAT: A knowledge augmented transformer for vision-and-language

    Liangke Gui, Borui Wang, Qiuyuan Huang, Alex Hauptmann, Yonatan Bisk, and Jianfeng Gao. KAT: A knowledge augmented transformer for vision-and-language. arXiv:2112.08614, 2021

  35. [35]

    VizWiz grand challenge: Answering visual questions from blind people

    Danna Gurari, Qing Li, Abigale J. Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P. Bigham. VizWiz grand challenge: Answering visual questions from blind people. In IEEE Computer Vision and Pattern Recognition, 2018

  36. [36]

    Transformer language models without positional encodings still learn positional information

    Adi Haviv, Ori Ram, Ofir Press, Peter Izsak, and Omer Levy. Transformer language models without positional encodings still learn positional information. arXiv:2203.16634, 2022

  37. [37]

    Women also snowboard: Overcoming bias in captioning models

    Lisa Anne Hendricks, Kaylee Burns, Kate Saenko, Trevor Darrell, and Anna Rohrbach. Women also snowboard: Overcoming bias in captioning models. In European Conference on Computer Vision, 2018

  38. [38]

    Decoupling the role of data, attention, and losses in multimodal transformers

    Lisa Anne Hendricks, John Mellor, Rosalia Schneider, Jean-Baptiste Alayrac, and Aida Nematzadeh. Decoupling the role of data, attention, and losses in multimodal transformers. Annual Meeting of the Association for Computational Linguistics, 2021

  39. [39]

    Gaussian Error Linear Units (GELUs)

    Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs). arXiv:1606.08415, 2016

  40. [40]

    Haiku: Sonnet for JAX

    Tom Hennigan, Trevor Cai, Tamara Norman, and Igor Babuschkin. Haiku: Sonnet for JAX. URL http://github.com/deepmind/dm-haiku

  42. [42]

    Long short-term memory

    Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 1997

  43. [43]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Eric Noland, Tom Hennigan, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre. Training compute-optimal large language models, 2022

  44. [44]

    Parameter-efficient transfer learning for NLP

    Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning, 2019

  45. [45]

    Universal language model fine-tuning for text classification

    Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. arXiv:1801.06146, 2018

  46. [46]

    Scaling up vision-language pre-training for image captioning

    Xiaowei Hu, Zhe Gan, Jianfeng Wang, Zhengyuan Yang, Zicheng Liu, Yumao Lu, and Lijuan Wang. Scaling up vision-language pre-training for image captioning. arXiv:2111.12233, 2021

  47. [47]

    Attention on attention for image captioning

    Lun Huang, Wenmin Wang, Jie Chen, and Xiao-Yong Wei. Attention on attention for image captioning. In International Conference on Computer Vision, 2019

  48. [48]

    Global pooling, more than meets the eye: Position information is encoded channel-wise in CNNs

    Md Amirul Islam, Matthew Kowal, Sen Jia, Konstantinos G. Derpanis, and Neil D. B. Bruce. Global pooling, more than meets the eye: Position information is encoded channel-wise in CNNs. In International Conference on Computer Vision, 2021

  49. [49]

    Perceiver: General perception with iterative attention

    Andrew Jaegle, Felix Gimeno, Andy Brock, Oriol Vinyals, Andrew Zisserman, and Joao Carreira. Perceiver: General perception with iterative attention. In International Conference on Machine Learning, 2021

  50. [50]

    Mural: multimodal, multitask retrieval across languages

    Aashi Jain, Mandy Guo, Krishna Srinivasan, Ting Chen, Sneha Kudugunta, Chao Jia, Yinfei Yang, and Jason Baldridge. MURAL: multimodal, multitask retrieval across languages. arXiv:2109.05125, 2021

  51. [51]

    Scaling up visual and vision-language representation learning with noisy text supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. arXiv:2102.05918, 2021

  52. [52]

    All in one: Exploring unified video-language pre-training

    Alex Jinpeng Wang, Yixiao Ge, Rui Yan, Yuying Ge, Xudong Lin, Guanyu Cai, Jianping Wu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. All in one: Exploring unified video-language pre-training. arXiv:2203.07303, 2022

  53. [53]

    Exploring the Limits of Language Modeling

    Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. arXiv:1602.02410, 2016

  54. [54]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv:2001.08361, 2020

  55. [55]

    The Hateful Memes Challenge: Detecting hate speech in multimodal memes

    Douwe Kiela, Hamed Firooz, Aravind Mohan, Vedanuj Goswami, Amanpreet Singh, Pratik Ringshia, and Davide Testuggine. The Hateful Memes Challenge: Detecting hate speech in multimodal memes. Conference on Neural Information Processing Systems, 2020

  56. [56]

    Few-shot classification by recycling deep learning

    Hugo Larochelle. Few-shot classification by recycling deep learning. Invited Talk at the S2D-OLAD Workshop, ICLR 2021, 2021. URL https://slideslive.com/38955350/fewshot-classification-by-recycling-deep-learning

  57. [58]

    Align before fuse: Vision and language representation learning with momentum distillation

    Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before fuse: Vision and language representation learning with momentum distillation. In Conference on Neural Information Processing Systems, 2021

  58. [59]

    BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. arXiv:2201.12086, 2022

  59. [60]

    HERO: Hierarchical encoder for video+language omni-representation pre-training

    Linjie Li, Yen-Chun Chen, Yu Cheng, Zhe Gan, Licheng Yu, and Jingjing Liu. HERO: Hierarchical encoder for video+language omni-representation pre-training. arXiv:2005.00200, 2020

  60. [61]

    Prefix-Tuning: Optimizing Continuous Prompts for Generation

    Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. arXiv:2101.00190, 2021

  61. [62]

    Oscar: Object-semantics aligned pre-training for vision-language tasks

    Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, and Jianfeng Gao. Oscar: Object-semantics aligned pre-training for vision-language tasks. In European Conference on Computer Vision, 2020

  62. [63]

    A multimodal framework for the detection of hateful memes

    Phillip Lippe, Nithin Holla, Shantanu Chandra, Santhosh Rajamanickam, Georgios Antoniou, Ekaterina Shutova, and Helen Yannakoudakis. A multimodal framework for the detection of hateful memes. arXiv:2012.12871, 2020

  63. [64]

    What makes good in-context examples for GPT-3?

    Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, and Weizhu Chen. What makes good in-context examples for GPT-3? arXiv:2101.06804, 2021

  64. [65]

    Optimization of image description metrics using policy gradient methods

    Siqi Liu, Zhenhai Zhu, Ning Ye, Sergio Guadarrama, and Kevin Murphy. Optimization of image description metrics using policy gradient methods. In International Conference on Computer Vision, 2017

  65. [66]

    Enhancing textual cues in multi-modal transformers for VQA

    Yu Liu, Lianghua Huang, Liuyihang Song, Bin Wang, Yingya Zhang, and Pan Pan. Enhancing textual cues in multi-modal transformers for VQA. VizWiz Challenge 2021, 2021

  66. [67]

    ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks

    Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Conference on Neural Information Processing Systems, 2019

  67. [68]

    UniVL: A unified video and language pre-training model for multimodal understanding and generation

    Huaishao Luo, Lei Ji, Botian Shi, Haoyang Huang, Nan Duan, Tianrui Li, Jason Li, Taroon Bharti, and Ming Zhou. UniVL: A unified video and language pre-training model for multimodal understanding and generation. arXiv:2002.06353, 2020

  68. [69]

    VC-GPT: Visual conditioned GPT for end-to-end generative vision-and-language pre-training

    Ziyang Luo, Yadong Xi, Rongsheng Zhang, and Jing Ma. VC-GPT: Visual conditioned GPT for end-to-end generative vision-and-language pre-training. arXiv:2201.12723, 2022

  69. [70]

    OK-VQA: A visual question answering benchmark requiring external knowledge

    Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. OK-VQA: A visual question answering benchmark requiring external knowledge. In IEEE Computer Vision and Pattern Recognition, 2019

  70. [71]

    Categorization and naming in children: Problems of induction

    Ellen M. Markman. Categorization and naming in children: Problems of induction. MIT Press, 1989

  71. [72]

    Catastrophic interference in connectionist networks: The sequential learning problem

    Michael McCloskey and Neil J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. The Psychology of Learning and Motivation, 1989

  72. [73]

    Teaching language models to support answers with verified quotes

    Jacob Menick, Maja Trebacz, Vladimir Mikulik, John Aslanides, Francis Song, Martin Chadwick, Mia Glaese, Susannah Young, Lucy Campbell-Gillingham, Geoffrey Irving, and Nat McAleese. Teaching language models to support answers with verified quotes. arXiv:2203.11147, 2022

  73. [74]

    RareAct: A video dataset of unusual interactions

    Antoine Miech, Jean-Baptiste Alayrac, Ivan Laptev, Josef Sivic, and Andrew Zisserman. RareAct: A video dataset of unusual interactions. arXiv:2008.01018, 2020

  74. [75]

    End-to-end learning of visual representations from uncurated instructional videos

    Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, and Andrew Zisserman. End-to-end learning of visual representations from uncurated instructional videos. In IEEE Computer Vision and Pattern Recognition, 2020

  75. [76]

    Recurrent neural network based language model

    Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Černocký, and Sanjeev Khudanpur. Recurrent neural network based language model. Interspeech, 2010

  76. [77]

    Rethinking the role of demonstrations: What makes in-context learning work?

    Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. Rethinking the role of demonstrations: What makes in-context learning work? arXiv:2202.12837, 2022

  77. [78]

    Model cards for model reporting

    Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. Model cards for model reporting. In ACM Conference on Fairness, Accountability, and Transparency, 2019

  78. [79]

    ClipCap: CLIP prefix for image captioning

    Ron Mokady, Amir Hertz, and Amit H. Bermano. ClipCap: CLIP prefix for image captioning. arXiv:2111.09734, 2021

  79. [80]

    Large-scale pretraining for visual dialog: A simple state-of-the-art baseline

    Vishvak Murahari, Dhruv Batra, Devi Parikh, and Abhishek Das. Large-scale pretraining for visual dialog: A simple state-of-the-art baseline. In European Conference on Computer Vision, 2020

  80. [81]

    True few-shot learning with language models

    Ethan Perez, Douwe Kiela, and Kyunghyun Cho. True few-shot learning with language models. Conference on Neural Information Processing Systems, 2021

Showing first 80 references.