pith. sign in

hub Canonical reference

Flamingo: a Visual Language Model for Few-Shot Learning

Canonical reference. 88% of citing Pith papers cite this work as background.

67 Pith papers citing it
Background 88% of classified citations
abstract

Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research. We introduce Flamingo, a family of Visual Language Models (VLM) with this ability. We propose key architectural innovations to: (i) bridge powerful pretrained vision-only and language-only models, (ii) handle sequences of arbitrarily interleaved visual and textual data, and (iii) seamlessly ingest images or videos as inputs. Thanks to their flexibility, Flamingo models can be trained on large-scale multimodal web corpora containing arbitrarily interleaved text and images, which is key to endow them with in-context few-shot learning capabilities. We perform a thorough evaluation of our models, exploring and measuring their ability to rapidly adapt to a variety of image and video tasks. These include open-ended tasks such as visual question-answering, where the model is prompted with a question which it has to answer; captioning tasks, which evaluate the ability to describe a scene or an event; and close-ended tasks such as multiple-choice visual question-answering. For tasks lying anywhere on this spectrum, a single Flamingo model can achieve a new state of the art with few-shot learning, simply by prompting the model with task-specific examples. On numerous benchmarks, Flamingo outperforms models fine-tuned on thousands of times more task-specific data.

hub tools

citation-role summary

background 16 baseline 1

citation-polarity summary

claims ledger

  • abstract Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research. We introduce Flamingo, a family of Visual Language Models (VLM) with this ability. We propose key architectural innovations to: (i) bridge powerful pretrained vision-only and language-only models, (ii) handle sequences of arbitrarily interleaved visual and textual data, and (iii) seamlessly ingest images or videos as inputs. Thanks to their flexibility, Flamingo models can be trained on large-scale multimodal web corpora containing

co-cited works

clear filters

representative citing papers

Editing Models with Task Arithmetic

cs.LG · 2022-12-08 · accept · novelty 8.0

Task vectors from weight differences allow arithmetic operations to edit pre-trained models, improving multiple tasks simultaneously and enabling analogical inference on unseen tasks.

Towards One-to-Many Temporal Grounding

cs.CV · 2026-06-04 · unverdicted · novelty 7.0

Introduces OMTG benchmark with C-Acc and EtF1 metrics, a 56k dataset, and caption/temporal rewards, reaching 43.65% EtF1 SOTA on the new bench.

Bottleneck Tokens for Unified Multimodal Retrieval

cs.LG · 2026-04-13 · unverdicted · novelty 7.0

Bottleneck Tokens paired with a masked generative objective achieve state-of-the-art unified multimodal retrieval performance among 2B-scale models on the MMEB-V2 benchmark with 78 datasets.

Visual Instruction Tuning

cs.CV · 2023-04-17 · unverdicted · novelty 7.0

LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.

PaLI: A Jointly-Scaled Multilingual Language-Image Model

cs.CV · 2022-09-14 · conditional · novelty 7.0

PaLI jointly scales a 4B-parameter vision transformer with language models on a new 10B multilingual image-text dataset to reach state-of-the-art results on vision-language tasks while keeping a simple modular design.

A Generalist Agent

cs.AI · 2022-05-12 · accept · novelty 7.0

Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.

citing papers explorer

Showing 9 of 9 citing papers after filters.

  • Visual Instruction Tuning cs.CV · 2023-04-17 · unverdicted · none · ref 2 · internal anchor

    LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.

  • Majorization-Guided Test-Time Adaptation for Vision-Language Models under Modality-Specific Shift cs.CV · 2026-04-27 · unverdicted · none · ref 1 · internal anchor

    MG-MTTA improves VLM accuracy under modality-specific shifts by replacing pure entropy minimization with majorization-guided adaptation that incorporates a reliability-aware gate prior.

  • LLaVA-Video: Video Instruction Tuning With Synthetic Data cs.CV · 2024-10-03 · unverdicted · none · ref 1 · internal anchor

    LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.

  • Demystifying CLIP Data cs.CV · 2023-09-28 · accept · none · ref 46 · internal anchor

    MetaCLIP curates balanced 400M-pair subsets from CommonCrawl that outperform CLIP data, reaching 70.8% zero-shot ImageNet accuracy on ViT-B versus CLIP's 68.3%.

  • MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action cs.CV · 2023-03-20 · unverdicted · none · ref 2 · internal anchor

    MM-REACT uses textual prompts to let ChatGPT collaborate with external vision experts for zero-shot multimodal reasoning and action on advanced visual tasks.

  • Agentic AI for Remote Sensing: Technical Challenges and Research Directions cs.CV · 2026-04-27 · unverdicted · none · ref 1 · 2 links · internal anchor

    Position paper identifies structural challenges in applying generic agentic AI to Earth Observation and outlines design principles for EO-native agents focused on geospatial state and validity.

  • Integration of Object Detection and Small VLMs for Construction Safety Hazard Identification cs.CV · 2026-04-06 · unverdicted · none · ref 38 · internal anchor

    Detection-guided prompting raises small VLM hazard F1 from 34.5% to 50.6% and BERTScore from 0.61 to 0.82 on construction images with only 2.5 ms added latency.

  • PaliGemma: A versatile 3B VLM for transfer cs.CV · 2024-07-10 · unverdicted · none · ref 6 · internal anchor

    PaliGemma is an open 3B VLM based on SigLIP and Gemma that achieves strong performance on nearly 40 diverse open-world tasks including benchmarks, remote-sensing, and segmentation.

  • Improved Baselines with Visual Instruction Tuning cs.CV · 2023-10-05 · conditional · none · ref 2 · internal anchor

    Simple changes to LLaVA using CLIP-ViT-L-336px, an MLP connector, and academic VQA data yield state-of-the-art results on 11 benchmarks with only 1.2M public examples and one-day training on 8 A100 GPUs.