OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models
Recognition: 2 theorem links · Lean Theorem
Pith reviewed 2026-05-14 01:47 UTC · model grok-4.3
The pith
OpenFlamingo delivers open-source vision-language models that reach 80-89 percent of Flamingo performance across seven datasets.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce OpenFlamingo, a family of autoregressive vision-language models ranging from 3B to 9B parameters. OpenFlamingo is an ongoing effort to produce an open-source replication of DeepMind's Flamingo models. On seven vision-language datasets, OpenFlamingo models average between 80 and 89 percent of corresponding Flamingo performance. This technical report describes our models, training data, hyperparameters, and evaluation suite, and we share the models and code publicly.
What carries the argument
The OpenFlamingo model family, which uses autoregressive next-token prediction over interleaved image and text sequences to support few-shot multimodal reasoning.
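To make the carrying mechanism concrete, the following is a minimal, illustrative PyTorch sketch of a Flamingo-style training step: an autoregressive next-token loss over text, with interleaved visual features injected through a tanh-gated cross-attention block. It is a toy simplification under assumed shapes and names (a single cross-attention block, random inputs, hypothetical module names), not the released OpenFlamingo implementation.

```python
# Illustrative sketch only: a toy Flamingo-style objective (autoregressive text loss,
# images injected via gated cross-attention). Not the released OpenFlamingo code;
# sizes, names, and the single cross-attention block are simplifications.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedCrossAttention(nn.Module):
    def __init__(self, dim: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # tanh gate starts at zero, as in Flamingo

    def forward(self, text_hidden: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        # text_hidden: (B, T, D) language hidden states; visual_tokens: (B, V, D) image features
        attended, _ = self.attn(text_hidden, visual_tokens, visual_tokens)
        return text_hidden + torch.tanh(self.gate) * attended

class TinyInterleavedLM(nn.Module):
    def __init__(self, vocab_size: int = 1000, dim: int = 128, n_heads: int = 8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.xattn = GatedCrossAttention(dim, n_heads)
        self.block = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, text_ids: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        h = self.xattn(self.embed(text_ids), visual_tokens)   # condition text on images
        causal = nn.Transformer.generate_square_subsequent_mask(text_ids.size(1))
        h = self.block(h, src_mask=causal)                    # causal self-attention over text
        return self.head(h)

# One step on a fake interleaved batch: in practice text_ids would carry special
# <image> markers and visual_tokens would come from a frozen vision encoder
# followed by a perceiver resampler.
model = TinyInterleavedLM()
text_ids = torch.randint(0, 1000, (2, 16))
visual_tokens = torch.randn(2, 64, 128)
logits = model(text_ids, visual_tokens)
loss = F.cross_entropy(logits[:, :-1].reshape(-1, 1000),  # predict token t+1 from tokens <= t
                       text_ids[:, 1:].reshape(-1))
loss.backward()
```

Starting the gate at zero means the language model's behavior is unchanged at initialization, which is the property the Flamingo design relies on when keeping the pretrained language backbone frozen.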
If this is right
- Researchers can now train comparable multimodal models from public resources without starting from proprietary checkpoints.
- The 80-89 percent performance band shows that core few-shot vision-language capabilities transfer to open training pipelines.
- Scaling experiments across the 3B to 9B range supply baselines for studying how size affects multimodal task accuracy.
- The shared evaluation suite enables direct head-to-head comparisons of new open models against the Flamingo reference.
- Public code and weights allow the community to iterate on data mixtures, training schedules, and architectural tweaks.
Where Pith is reading between the lines
- If further open training closes the remaining gap, the results would indicate that Flamingo's gains come mainly from architecture and scale rather than from exclusive data or methods.
- The released framework could serve as a template for open replications of other large closed multimodal systems.
- Extending the same open training recipe to higher parameter counts or additional modalities such as video would test how far the replication approach generalizes.
Load-bearing premise
The reported performance numbers were measured under evaluation conditions comparable to the original Flamingo models and the released code and data suffice for independent reproduction.
What would settle it
Re-running the released code and data on the seven datasets under the same evaluation protocol and obtaining average scores below 70 percent of Flamingo's reported numbers would undermine the replication claim.
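A minimal sketch of that check, assuming per-dataset scores have already been collected from a re-run of the released evaluation suite; the dataset names and all numbers below are placeholders, not values from the report.

```python
# Hypothetical falsification check: average each OpenFlamingo score as a percentage
# of the corresponding Flamingo score over the seven datasets; an average below 70%
# would undermine the replication claim. All numbers here are placeholders.
flamingo_scores = {          # reference scores reported for Flamingo (placeholders)
    "coco": 100.0, "flickr30k": 70.0, "vqav2": 55.0, "okvqa": 45.0,
    "textvqa": 35.0, "vizwiz": 30.0, "hatefulmemes": 60.0,
}
openflamingo_scores = {      # scores from re-running the released code (placeholders)
    "coco": 85.0, "flickr30k": 60.0, "vqav2": 48.0, "okvqa": 38.0,
    "textvqa": 28.0, "vizwiz": 24.0, "hatefulmemes": 52.0,
}

relative = [100.0 * openflamingo_scores[d] / flamingo_scores[d] for d in flamingo_scores]
average_relative = sum(relative) / len(relative)
print(f"average relative performance: {average_relative:.1f}% of Flamingo")
if average_relative < 70.0:
    print("below the 70% threshold: replication claim undermined under this protocol")
```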
Original abstract
We introduce OpenFlamingo, a family of autoregressive vision-language models ranging from 3B to 9B parameters. OpenFlamingo is an ongoing effort to produce an open-source replication of DeepMind's Flamingo models. On seven vision-language datasets, OpenFlamingo models average between 80 - 89% of corresponding Flamingo performance. This technical report describes our models, training data, hyperparameters, and evaluation suite. We share our models and code at https://github.com/mlfoundations/open_flamingo.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces OpenFlamingo, an open-source family of autoregressive vision-language models (3B–9B parameters) intended as a replication of DeepMind’s Flamingo. It reports that these models achieve 80–89% of the corresponding Flamingo performance on average across seven vision-language datasets, provides details on architecture, training data, hyperparameters, and evaluation protocols, and releases models and code at the linked GitHub repository.
Significance. If the relative performance numbers hold under matched evaluation conditions, the work is significant as the first public, reproducible alternative to the closed-source Flamingo models. The explicit release of code, models, and training details directly addresses reproducibility concerns in large-scale vision-language research and lowers the barrier for follow-on work.
major comments (3)
- [Abstract / Evaluation section] The central claim that OpenFlamingo reaches 80–89% of Flamingo performance on seven datasets rests on the unverified assumption that few-shot prompts, example selection, image preprocessing, shot counts, and metric computation match those used in the original (closed-source) Flamingo evaluations. The manuscript describes its own suite and shares code, but provides no explicit side-by-side protocol table or other verification that would allow independent confirmation of equivalence.
- [Evaluation section] Reported performance figures lack error bars, multiple random seeds, and statistical comparisons against the Flamingo baselines. Without these, it is impossible to determine whether the observed 80–89% range reflects a stable gap or is sensitive to evaluation variance.
- [Training section] The manuscript describes hyperparameters and data but contains no systematic ablations (e.g., effect of vision encoder choice, cross-attention layers, or data mixture ratios), making it difficult to isolate which design decisions are responsible for closing most of the gap to Flamingo.
minor comments (2)
- [Abstract] The GitHub URL in the abstract should be accompanied by a permanent archive link (e.g., Zenodo DOI) to guard against repository changes.
- [Model Architecture] Notation for model sizes (3B, 9B) should be defined consistently with parameter counts reported in Table 1 or the model architecture section.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We address each major comment below and clarify our evaluation approach, limitations, and the scope of this work as an open-source replication effort.
Point-by-point responses
-
Referee: [Abstract / Evaluation section] The central claim that OpenFlamingo reaches 80–89% of Flamingo performance on seven datasets rests on the unverified assumption that few-shot prompts, example selection, image preprocessing, shot counts, and metric computation match those used in the original (closed-source) Flamingo evaluations. The manuscript describes its own suite and shares code, but provides no explicit side-by-side protocol table or other verification that would allow independent confirmation of equivalence.
Authors: We agree that a side-by-side protocol comparison would improve transparency. While we cannot access the closed-source Flamingo code for exact verification, our evaluation suite was designed to follow the protocols described in the Flamingo paper as closely as possible, including shot counts, prompt templates, and metrics. The released code at https://github.com/mlfoundations/open_flamingo contains the precise evaluation scripts, data loaders, and preprocessing steps used. In the revision we will add an explicit comparison table in the Evaluation section detailing our choices against the Flamingo paper descriptions, along with any unavoidable differences. revision: yes
-
Referee: [Evaluation section] Reported performance figures lack error bars, multiple random seeds, and statistical comparisons against the Flamingo baselines. Without these, it is impossible to determine whether the observed 80–89% range reflects a stable gap or is sensitive to evaluation variance.
Authors: We acknowledge that error bars and multi-seed statistics would strengthen the presentation. However, the computational cost of repeated full evaluations on 3B–9B models across seven datasets is prohibitive. We followed the single-run reporting convention common in large-scale vision-language papers and observed consistent relative performance across diverse tasks, which suggests the gap is not driven by outlier variance. In the revision we will add a limitations paragraph noting this constraint and the consistency evidence (a variance-estimation sketch follows these responses). revision: partial
-
Referee: [Training section] The manuscript describes hyperparameters and data but contains no systematic ablations (e.g., effect of vision encoder choice, cross-attention layers, or data mixture ratios), making it difficult to isolate which design decisions are responsible for closing most of the gap to Flamingo.
Authors: We agree that systematic ablations would be informative. The primary objective of this technical report is to release reproducible models and code that close most of the gap to Flamingo, rather than to perform an exhaustive ablation study that would require orders-of-magnitude more compute. We do describe key hyperparameter choices and data mixtures in the Training section. In the revision we will expand the discussion to highlight the design decisions we found most impactful during development, while noting that a full ablation study remains future work. revision: partial
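On the variance question in the second exchange above, one low-cost way to quantify single-run evaluation uncertainty is a bootstrap confidence interval over per-example scores. The sketch below is illustrative only, uses fabricated placeholder data, and is not part of the released evaluation suite.

```python
# Bootstrap confidence interval for a single-run accuracy estimate (illustrative only).
# per_example stands in for 0/1 correctness over one dataset's evaluation examples.
import random

def bootstrap_ci(per_example, n_resamples=10_000, alpha=0.05, seed=0):
    rng = random.Random(seed)
    n = len(per_example)
    means = []
    for _ in range(n_resamples):
        sample = [per_example[rng.randrange(n)] for _ in range(n)]  # resample with replacement
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Fake per-example results: 5,000 examples with roughly 52% accuracy (placeholder data).
rng = random.Random(1)
per_example = [1 if rng.random() < 0.52 else 0 for _ in range(5000)]
low, high = bootstrap_ci(per_example)
print(f"accuracy = {sum(per_example) / len(per_example):.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```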
Circularity Check
No circularity; empirical replication claims rest on external baselines without internal self-definition or fitted predictions
Full rationale
The paper is a technical report describing an open-source replication of Flamingo models, including architecture, training data, hyperparameters, and an evaluation suite. The headline result (80-89% relative performance on seven datasets) is an empirical measurement against the closed-source Flamingo baseline on public datasets. No equations, derivations, or first-principles results are presented that reduce to the paper's own inputs by construction. There are no self-definitional quantities, no parameters fitted to a subset and then relabeled as predictions, and no load-bearing self-citations. The comparison assumes protocol equivalence, but this is an external validity concern rather than circularity per the enumerated patterns. The derivation chain consists of standard model training and benchmarking steps that are independently verifiable from the released code and data.
Forward citations
Cited by 25 Pith papers
-
Challenging Vision-Language Models with Physically Deployable Multimodal Semantic Lighting Attacks
MSLA is the first physically deployable attack that uses adversarial lighting to break semantic alignment in VLMs such as CLIP, LLaVA, and BLIP, causing classification failures and hallucinations in real scenes.
-
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.
-
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.
-
MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts
MathVista benchmark shows GPT-4V achieves 49.9% accuracy on visual mathematical reasoning tasks, outperforming other models but trailing humans by 10.4%.
-
BlockVLA: Accelerating Autoregressive VLA via Block Diffusion Finetuning
BlockVLA accelerates autoregressive VLA models by 3.3x using block diffusion finetuning, with faster training convergence and better early performance on long-horizon robotic tasks.
-
AffectGPT-RL: Revealing Roles of Reinforcement Learning in Open-Vocabulary Emotion Recognition
AffectGPT-RL applies reinforcement learning to optimize non-differentiable emotion wheel metrics in open-vocabulary multimodal emotion recognition, yielding performance gains and state-of-the-art results on basic emot...
-
QCalEval: Benchmarking Vision-Language Models for Quantum Calibration Plot Understanding
Introduces QCalEval benchmark showing best zero-shot VLM score of 72.3 on quantum calibration plots, with fine-tuning and in-context learning effects varying by model type.
-
Region-Grounded Report Generation for 3D Medical Imaging: A Fine-Grained Dataset and Graph-Enhanced Framework
Introduces VietPET-RoI dataset with fine-grained RoI annotations for Vietnamese 3D PET/CT and HiRRA graph framework that improves report generation by modeling region dependencies, claiming large gains over prior models.
-
Revealing Physical-World Semantic Vulnerabilities: Universal Adversarial Patches for Infrared Vision-Language Models
UCGP is a universal physical adversarial patch that compromises cross-modal semantic alignment in IR-VLMs through curved-grid parameterization and representation-space disruption.
-
Omni-NegCLIP: Enhancing CLIP with Front-Layer Contrastive Fine-Tuning for Comprehensive Negation Understanding
Omni-NegCLIP improves CLIP's negation understanding by up to 52.65% on presence-based and 12.50% on absence-based tasks through front-layer fine-tuning with specialized contrastive losses.
-
When Surfaces Lie: Exploiting Wrinkle-Induced Attention Shift to Attack Vision-Language Models
A wrinkle-field perturbation method creates photorealistic non-rigid image changes that degrade state-of-the-art VLMs on image captioning and VQA more effectively than prior baselines.
-
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving ...
-
S2H-DPO: Hardness-Aware Preference Optimization for Vision-Language Models
S2H-DPO generates hierarchical prompt-driven preference pairs to improve multi-image reasoning in VLMs while keeping single-image performance intact.
-
Phantasia: Context-Adaptive Backdoors in Vision Language Models
Phantasia is a new backdoor attack on VLMs that dynamically aligns malicious outputs with input context to achieve higher stealth and state-of-the-art success rates compared to static-pattern attacks.
-
CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning
CoME-VL fuses contrastive and self-supervised vision encoders via entropy-guided multi-layer aggregation and RoPE cross-attention to improve vision-language model performance on benchmarks.
-
CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation
CogACT is a new VLA model that uses a conditioned diffusion action transformer to achieve over 35% higher average success rates than OpenVLA in simulation and 55% in real-robot experiments while generalizing to new ro...
-
Long Context Transfer from Language to Vision
Extending language model context length enables LMMs to process over 200K visual tokens from long videos without video training, achieving SOTA on Video-MME via dense frame sampling.
-
Otter: A Multi-Modal Model with In-Context Instruction Tuning
Otter is a multi-modal model instruction-tuned on the MIMIC-IT dataset of over 3 million in-context instruction-response pairs to improve convergence and generalization on tasks with multiple images and videos.
-
LiteMedCoT-VL: Parameter-Efficient Adaptation for Medical Visual Question Answering
LiteMedCoT-VL distills chain-of-thought from a 235B model to 2B VLMs via LoRA, reaching 64.9% accuracy on PMC-VQA and beating a 4B zero-shot baseline by 11 points.
-
Make Your LVLM KV Cache More Lightweight
LightKV compresses vision-token KV cache in LVLMs to 55% size via prompt-guided cross-modality aggregation, halving cache memory, cutting compute 40%, and maintaining performance on benchmarks.
-
VLA Foundry: A Unified Framework for Training Vision-Language-Action Models
VLA Foundry provides a single training stack for VLA models and releases open models that match prior closed-source performance or outperform baselines on multi-task manipulation in simulation.
-
Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation
Firebolt-VL introduces an LFM-based decoder and token-grid correlation to achieve linear-time vision-language inference with improved fine-grained grounding.
-
A Patch-based Cross-view Regularized Framework for Backdoor Defense in Multimodal Large Language Models
A patch-augmented cross-view regularization method reduces backdoor attack success rates in multimodal LLMs by enforcing output differences between original and perturbed views while using entropy constraints to prese...
-
Hallucination of Multimodal Large Language Models: A Survey
The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
-
Beyond Standard Benchmarks: A Systematic Audit of Vision-Language Model's Robustness to Natural Semantic Variation Across Diverse Tasks
Robust CLIP models amplify vulnerabilities to natural adversarial scenarios while standard CLIP shows large performance drops on natural language-induced adversarial examples in zero-shot classification, segmentation,...