BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
Pith reviewed 2026-05-12 00:04 UTC · model grok-4.3
The pith
BLIP-2 connects frozen image encoders and large language models with a lightweight Querying Transformer to bootstrap efficient vision-language pre-training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
BLIP-2 is a generic pre-training strategy that freezes a pre-trained image encoder and a pre-trained large language model, then trains only a lightweight Querying Transformer in two stages to bridge the modality gap. The first stage bootstraps vision-language representation learning from the frozen image encoder. The second stage bootstraps vision-to-language generative learning from the frozen language model. This produces models that reach state-of-the-art results on vision-language tasks despite using significantly fewer trainable parameters than existing approaches; for example, BLIP-2 outperforms Flamingo80B by 8.7 percent on zero-shot VQAv2 with 54 times fewer trainable parameters. The resulting models also exhibit emerging zero-shot image-to-text generation that follows natural language instructions.
What carries the argument
The Querying Transformer (Q-Former), a small transformer module that extracts a fixed set of visual query embeddings from the frozen image encoder and feeds them as input to the frozen language model.
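A minimal sketch of that interface, assuming PyTorch. The layer counts, widths, and the use of a stock TransformerDecoder are illustrative stand-ins: the actual Q-Former is a BERT-initialized encoder with interleaved cross-attention, so this is a simplification, not the paper's architecture.

```python
import torch
import torch.nn as nn

class QFormerBridge(nn.Module):
    """Sketch of a Q-Former-style bridge: a fixed set of learned query
    embeddings cross-attends to frozen image features, and the resulting
    query outputs are projected into the frozen LLM's input space."""

    def __init__(self, num_queries=32, dim=768, llm_dim=2560, num_layers=6):
        super().__init__()
        # The learned queries are the only image-conditioned tokens the LLM
        # will see; their count is fixed regardless of image resolution.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=12, batch_first=True)
        # Decoder layers give the queries self-attention plus cross-attention
        # over the image features (passed as the decoder "memory").
        self.blocks = nn.TransformerDecoder(layer, num_layers=num_layers)
        # Linear projection from Q-Former width to the LLM embedding width.
        self.to_llm = nn.Linear(dim, llm_dim)

    def forward(self, image_feats):
        # image_feats: (batch, num_patches, dim) from the frozen image encoder.
        q = self.queries.unsqueeze(0).expand(image_feats.size(0), -1, -1)
        q = self.blocks(q, memory=image_feats)
        # (batch, num_queries, llm_dim): soft visual prompts prepended to
        # the frozen LLM's text token embeddings.
        return self.to_llm(q)
```

Because the number of queries is fixed (32 in the released BLIP-2 models) while image features grow with resolution, the bridge doubles as an information bottleneck that keeps the LLM's visual prompt short.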
If this is right
- Zero-shot visual question answering performance can exceed that of models with orders of magnitude more trainable parameters.
- Zero-shot image-to-text generation becomes possible that follows free-form natural language instructions.
- Pre-training compute is limited to the small Querying Transformer rather than the full size of the image encoder or language model (see the freezing sketch after this list).
- The same bridging approach works across different choices of frozen image encoders and language models.
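To make the compute point concrete, here is a hedged PyTorch sketch of the freezing discipline; image_encoder, llm, and the small dummy modules are illustrative stand-ins, not the paper's code.

```python
import torch
import torch.nn as nn

# Stand-ins for the real backbones; in practice these are large pre-trained
# models loaded from checkpoints, not freshly initialized modules.
image_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=4)
llm = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=2560, nhead=32, batch_first=True),
    num_layers=4)
bridge = QFormerBridge()  # trainable bridge from the sketch above

def freeze(module):
    # Frozen backbone: no gradient updates, and eval() so stochastic layers
    # such as dropout behave deterministically during bridge training.
    for p in module.parameters():
        p.requires_grad = False
    module.eval()

freeze(image_encoder)
freeze(llm)

# Only the bridge is handed to the optimizer, so optimizer state and weight
# updates scale with the Q-Former alone. Activations (and, in stage 2,
# gradients on the way back to the queries) still flow through the frozen LLM.
optimizer = torch.optim.AdamW(
    [p for p in bridge.parameters() if p.requires_grad], lr=1e-4)

def n_params(m):
    return sum(p.numel() for p in m.parameters())

print(f"trainable (bridge): {n_params(bridge):,}")
print(f"frozen (backbones): {n_params(image_encoder) + n_params(llm):,}")
```

On this accounting, the 54x figure from the abstract is a ratio of trainable-parameter counts, which is also how the circularity check below reads it.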
Where Pith is reading between the lines
- Modality gaps between separately trained models may be bridgeable modularly, reducing the need to retrain large components when new data or tasks appear.
- The separation of representation learning and generative alignment into two stages could be applied to connect other frozen models such as audio encoders to language models.
- Efficiency gains from bootstrapping suggest that scaling curves for multimodal systems should consider the cost of the bridge module separately from the frozen backbones.
Load-bearing premise
A small Querying Transformer trained on frozen components can learn sufficient alignment between image features and language model inputs without any end-to-end updates to the large frozen models.
What would settle it
An end-to-end fine-tuned version of the same base image encoder and language model on identical pre-training data and compute budget would need to be evaluated on zero-shot VQAv2 to check whether the frozen approach loses accuracy that joint training recovers.
read the original abstract
The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pre-trained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative learning from a frozen language model. BLIP-2 achieves state-of-the-art performance on various vision-language tasks, despite having significantly fewer trainable parameters than existing methods. For example, our model outperforms Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters. We also demonstrate the model's emerging capabilities of zero-shot image-to-text generation that can follow natural language instructions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces BLIP-2, a vision-language pre-training approach that bootstraps from off-the-shelf frozen image encoders and frozen large language models by inserting a lightweight Querying Transformer (Q-Former). The Q-Former is trained in two stages: first using image-text contrastive and matching objectives on the frozen vision encoder, then using language modeling objectives on the frozen LLM. The method reports state-of-the-art results across vision-language tasks while using far fewer trainable parameters than prior work; the headline empirical claim is an 8.7% gain over Flamingo-80B on zero-shot VQAv2 with 54× fewer trainable parameters, plus emerging zero-shot image-to-text generation that follows natural-language instructions.
Significance. If the central empirical claims hold, the work demonstrates a practical and computationally efficient route to high-performing vision-language models that avoids end-to-end training of billion-parameter backbones. The two-stage bootstrapping strategy and the parameter-efficiency result are the primary contributions; they directly address the prohibitive cost of full multimodal pre-training and could influence future model design by showing that a modest bridging module can extract usable visual information from frozen encoders. The zero-shot instruction-following capability is an additional positive signal.
major comments (2)
- [§3.2–3.3] The two-stage Q-Former training procedure is described in detail, yet no ablation is presented that unfreezes either the image encoder or the LLM (or both) and measures the resulting change in downstream performance. Without this comparison it is impossible to determine whether the reported performance ceiling is limited by the frozen-backbone constraint or whether the Q-Former truly extracts all necessary visual information.
- [Abstract] The 8.7% zero-shot VQAv2 improvement over Flamingo-80B is presented as a key result, but the visible text provides no table or section that lists the exact training data, evaluation splits, prompt templates, or hyper-parameters used for both models. This information is load-bearing for verifying that the efficiency advantage is not an artifact of mismatched experimental conditions.
minor comments (2)
- The notation for the Q-Former queries and the two-stage loss functions should be introduced with explicit equations and a diagram that shows which components are frozen at each stage; a candidate form for the contrastive term is sketched after this list.
- Figure captions and table footnotes should explicitly state the number of trainable parameters for every compared model so that the 54× claim can be checked at a glance.
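As a concrete starting point for that revision, here is a hedged sketch of the stage-1 image-text contrastive term in standard symmetric InfoNCE form. BLIP-2's reported stage-1 training also includes image-text matching and image-grounded generation losses, and computes image-text similarity as a maximum over the query outputs, so the s(I, T) below is a placeholder rather than the paper's exact definition.

```latex
% Sketch of a symmetric image-text contrastive (ITC) loss over a batch of
% N pairs. s(I,T): similarity between Q-Former query outputs for image I
% and the text embedding for T; \tau: a (typically learned) temperature.
\mathcal{L}_{\mathrm{ITC}} = -\frac{1}{2N}\sum_{i=1}^{N}\left[
  \log\frac{\exp\left(s(I_i, T_i)/\tau\right)}
           {\sum_{j=1}^{N}\exp\left(s(I_i, T_j)/\tau\right)}
  + \log\frac{\exp\left(s(I_i, T_i)/\tau\right)}
             {\sum_{j=1}^{N}\exp\left(s(I_j, T_i)/\tau\right)}
\right]
```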
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. The comments highlight important aspects of our experimental design and presentation. We address each major comment below and propose targeted revisions to strengthen the manuscript.
read point-by-point responses
- Referee: [§3.2–3.3] The two-stage Q-Former training procedure is described in detail, yet no ablation is presented that unfreezes either the image encoder or the LLM (or both) and measures the resulting change in downstream performance. Without this comparison it is impossible to determine whether the reported performance ceiling is limited by the frozen-backbone constraint or whether the Q-Former truly extracts all necessary visual information.
Authors: We agree that a direct ablation unfreezing the image encoder or LLM would provide valuable additional evidence. However, such experiments would require training models with billions of parameters end-to-end, which is computationally prohibitive and directly contradicts the paper's central goal of demonstrating high performance while keeping the backbones frozen. Our results already show that the lightweight Q-Former can extract sufficient visual information to achieve state-of-the-art zero-shot performance. We will add a new paragraph in Section 4 (or a dedicated limitations subsection) discussing the rationale for the frozen setting, the expected trade-offs of unfreezing, and why we consider the current results sufficient to support our claims. [revision: partial]
- Referee: [Abstract] The 8.7% zero-shot VQAv2 improvement over Flamingo-80B is presented as a key result, but the visible text provides no table or section that lists the exact training data, evaluation splits, prompt templates, or hyper-parameters used for both models. This information is load-bearing for verifying that the efficiency advantage is not an artifact of mismatched experimental conditions.
Authors: We thank the referee for pointing out the need for greater transparency. The training data, evaluation splits, and prompt templates for BLIP-2 are detailed in Sections 4.1, 4.2, and the appendix; the Flamingo-80B numbers are taken directly from the original Flamingo paper using the identical zero-shot VQAv2 protocol. To eliminate any ambiguity, we will insert a new table (or expanded subsection in Section 4) that explicitly tabulates the data sources, splits, prompts, and hyper-parameter settings for both models, along with citations to the Flamingo paper for the comparison numbers. [revision: yes]
Circularity Check
No significant circularity; central claims are empirical benchmarks against external models
full rationale
The paper's derivation consists of a two-stage training procedure for the Q-Former on frozen ViT and LLM backbones, with performance evaluated on standard external benchmarks (VQAv2, etc.) and compared to Flamingo-80B. These results do not reduce to any internal fitted parameter or self-defined quantity by construction. No equations or steps in the provided text exhibit self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations that replace independent verification. The efficiency claim (54x fewer parameters) is a direct count of trainable parameters, not a derived prediction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Frozen pre-trained image encoders and large language models retain useful cross-modal information that a lightweight interface can extract without further updating the large models.
invented entities (1)
- Querying Transformer (Q-Former): no independent evidence
Forward citations
Cited by 45 Pith papers
- UniShield: Unified Face Attack Detection via KG-Informed Multimodal Reasoning
UniShield introduces a knowledge-graph-informed multimodal framework that improves unified detection of physical and digital face attacks through instruction tuning and consistency-optimized reasoning.
- Geo2Sound: A Scalable Geo-Aligned Framework for Soundscape Generation from Satellite Imagery
Geo2Sound generates geographically realistic soundscapes from satellite imagery via geospatial attribute modeling, semantic hypothesis expansion, and geo-acoustic alignment, achieving SOTA FAD of 1.765 on a new 20k-pa...
- Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation
Alignment of vision-language models with human V1-V3 early visual cortex negatively predicts resistance to sycophantic gaslighting attacks.
- Bottleneck Tokens for Unified Multimodal Retrieval
Bottleneck Tokens paired with a masked generative objective achieve state-of-the-art unified multimodal retrieval performance among 2B-scale models on the MMEB-V2 benchmark with 78 datasets.
- Revealing Physical-World Semantic Vulnerabilities: Universal Adversarial Patches for Infrared Vision-Language Models
UCGP is a universal physical adversarial patch that compromises cross-modal semantic alignment in IR-VLMs through curved-grid parameterization and representation-space disruption.
- 3D-VLA: A 3D Vision-Language-Action Generative World Model
3D-VLA is a new embodied foundation model that uses a 3D LLM plus aligned diffusion models to generate future images and point clouds for improved reasoning and action planning in 3D environments.
- Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model
Vim is a bidirectional Mamba vision backbone that outperforms DeiT in accuracy on standard tasks while being substantially faster and more memory-efficient for high-resolution images.
- VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
VoxPoser uses LLMs to compose 3D value maps via VLM interaction for model-based synthesis of robust robot trajectories on open-set language-specified manipulation tasks.
- Evaluating Object Hallucination in Large Vision-Language Models
Large vision-language models exhibit severe object hallucination that varies with training instructions, and the proposed POPE polling method evaluates it more stably and flexibly than prior approaches.
- VideoChat: Chat-Centric Video Understanding
VideoChat integrates video models and LLMs via a learnable interface for chat-based spatiotemporal and causal video reasoning, trained on a new video-centric instruction dataset.
- Visual Instruction Tuning
LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.
- Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
Visual ChatGPT integrates visual foundation models with ChatGPT via prompts to enable multi-step image understanding, generation, and editing in conversational interactions.
- Can Multimodal Large Language Models Understand Pathologic Movements? A Pilot Study on Seizure Semiology
MLLMs achieve zero-shot recognition of seizure semiological features better than fine-tuned vision models on most tested features, with signal enhancement and faithful explanations.
- VisInject: Disruption != Injection -- A Dual-Dimension Evaluation of Universal Adversarial Attacks on Vision-Language Models
Universal adversarial attacks cause output perturbation 90 times more often than precise target injection in VLMs, with only 2 verbatim successes out of 6615 tests.
- MTT-Bench: Predicting Social Dominance in Mice via Multimodal Large Language Models
Fine-tuned multimodal LLMs predict mouse social dominance from raw tube test videos with high agreement to traditional rankings.
- ReactBench: A Benchmark for Topological Reasoning in MLLMs on Chemical Reaction Diagrams
ReactBench benchmark shows MLLMs suffer over 30% performance drop on complex topological reasoning tasks versus basic ones when evaluated on chemical reaction diagrams.
- AIM: Asymmetric Information Masking for Visual Question Answering Continual Learning
AIM applies modality-specific masks to balance stability and plasticity in asymmetric VLMs, achieving SOTA average performance and reduced forgetting on continual VQA v2 and GQA while preserving generalization to nove...
- CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning
CoME-VL fuses contrastive and self-supervised vision encoders via entropy-guided multi-layer aggregation and RoPE cross-attention to improve vision-language model performance on benchmarks.
- Qwen3-Omni Technical Report
Qwen3-Omni is a unified multimodal model that achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22 without degrading performance on text, image, or video relative to single-mo...
- LLaVA-Video: Video Instruction Tuning With Synthetic Data
LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.
- Are We on the Right Way for Evaluating Large Vision-Language Models?
Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6...
- Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
Biased noise sampling for rectified flows combined with a bidirectional text-image transformer architecture yields state-of-the-art high-resolution text-to-image results that scale predictably with model size.
- ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
A new 1.2M-caption dataset generated via GPT-4V improves LMMs on MME and MMBench by 222.8/22.0/22.3 and 2.7/1.3/1.5 points respectively when used for supervised fine-tuning.
- Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.
- IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models
IP-Adapter adds effective image prompting to text-to-image diffusion models using a lightweight decoupled cross-attention adapter that works alongside text prompts and other controls.
- MMBench: Is Your Multi-modal Model an All-around Player?
MMBench is a new bilingual benchmark that uses curated questions, CircularEval, and LLM-assisted answer conversion to provide objective, fine-grained evaluation of vision-language models.
- Kosmos-2: Grounding Multimodal Large Language Models to the World
Kosmos-2 grounds text to image regions by encoding refer expressions as Markdown links to sequences of location tokens and trains on a new GrIT dataset of grounded image-text pairs.
- Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
A new dataset of 400k visual instructions including negative examples at three semantic levels reduces hallucinations in models like MiniGPT-4 when used for fine-tuning while improving benchmark performance.
- MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
MME is a manually annotated benchmark evaluating MLLMs on perception and cognition across 14 subtasks to avoid data leakage and support fair model comparisons.
- MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
MiniGPT-4 shows that aligning a frozen vision encoder to Vicuna via one projection layer plus a second-stage detailed-description fine-tune produces GPT-4-like vision-language abilities including detailed captions, cr...
- HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face
HuggingGPT is an agent system where ChatGPT plans and orchestrates calls to Hugging Face models to solve complex multi-modal AI tasks.
- EVA-CLIP: Improved Training Techniques for CLIP at Scale
EVA-CLIP delivers improved CLIP training recipes that yield 82.0% zero-shot ImageNet-1K accuracy for a 5B-parameter model after only 9 billion samples.
- MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
MM-REACT uses textual prompts to let ChatGPT collaborate with external vision experts for zero-shot multimodal reasoning and action on advanced visual tasks.
- VLA Foundry: A Unified Framework for Training Vision-Language-Action Models
VLA Foundry provides a single training stack for VLA models and releases open models that match prior closed-source performance or outperform baselines on multi-task manipulation in simulation.
- Qwen3.5-Omni Technical Report
Qwen3.5-Omni scales an omnimodal model to hundreds of billions of parameters with 256k context, introduces ARIA for stable speech synthesis, and reports SOTA performance on 215 audio-visual benchmarks while adding mul...
- ClickAIXR: On-Device Multimodal Vision-Language Interaction with Real-World Objects in Extended Reality
ClickAIXR combines controller-based object selection in XR with on-device VLM inference to enable private, precise multimodal queries about real objects.
- Qwen2.5-Omni Technical Report
Qwen2.5-Omni presents a multimodal model with block-wise encoders, TMRoPE position embeddings, and a Thinker-Talker architecture that enables simultaneous text and streaming speech generation while matching text perfo...
- Hallucination of Multimodal Large Language Models: A Survey
The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
- InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.
- A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions
The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.
- SmoGVLM: A Small, Graph-enhanced Vision-Language Model
A graph-enhanced 1.3B-parameter VLM achieves up to 16.24% gains and outperforms larger VLMs by integrating structured knowledge via GNNs.
- A Study of Failure Modes in Two-Stage Human-Object Interaction Detection
A diagnostic study shows that two-stage HOI models fail differently across scene configurations like multi-person and rare interactions, revealing that aggregate benchmark accuracy does not imply robust visual reasoning.
- Improved Baselines with Visual Instruction Tuning
Simple changes to LLaVA using CLIP-ViT-L-336px, an MLP connector, and academic VQA data yield state-of-the-art results on 11 benchmarks with only 1.2M public examples and one-day training on 8 A100 GPUs.
- OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models
OpenFlamingo provides open-source autoregressive vision-language models that achieve 80-89% of Flamingo performance on seven vision-language datasets.
- From Pixels to Prompts: Vision-Language Models
An explanatory book offering a clear mental map of Vision-Language Models to help readers move from buzzwords to practical understanding.
Reference graph
Works this paper leans on
- [1] Flamingo: a Visual Language Model for Few-Shot Learning
Alayrac, J., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., Ring, R., Rutherford, E., Cabi, S., Han, T., Gong, Z., Samangooei, S., Monteiro, M., Menick, J., Borgeaud, S., Brock, A., Nematzadeh, A., Sharifzadeh, S., Binkowski, M., Barreira, R., Vinyals, O., Zisserman, A., and Simonyan, K. Flamingo...
- [2] PaLI: A jointly-scaled multilingual language-image model
Chen, X., Wang, X., Changpinyo, S., Piergiovanni, A. J., Padlewski, P., Salz, D., Goodman, S., Grycner, A., Mustafa, B., Beyer, L., Kolesnikov, A., Puigcerver, J., Ding, N., Rong, K...
- [3] Unifying vision-and-language tasks via text generation
Cho, J., Lei, J., Tan, H., and Bansal, M. Unifying vision-and-language tasks via text generation. arXiv preprint arXiv:2102.02779.
- [4] Scaling Instruction-Finetuned Language Models
Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, E., Wang, X., Dehghani, M., Brahma, S., Webson, A., Gu, S. S., Dai, Z., Suzgun, M., Chen, X., Chowdhery, A., Narang, S., Mishra, G., Yu, A., Zhao, V. Y., Huang, Y., Dai, A. M., Yu, H., Petrov, S., Chi, E. H., Dean, J., Devlin, J., Roberts, A., Zhou, D., Le, Q. V., and Wei, J. Scaling...
- [5] EVA: Exploring the limits of masked visual representation learning at scale
Fang, Y., Wang, W., Xie, B., Sun, Q., Wu, L., Wang, X., Huang, T., Wang, X., and Cao, Y. EVA: Exploring the limits of masked visual representation learning at scale. arXiv preprint arXiv:2211.07636.
- [6] Training compute-optimal large language models
Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. d. L., Hendricks, L. A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., Driessche, G. v. d., Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., Rae, J. W., Vinyals, O., and Sifre, L. Training compute-optimal large language models. arXiv prepr...
- [7] Scaling up visual and vision-language representation learning with noisy text supervision
Jia, C., Yang, Y., Xia, Y., Chen, Y.-T., Parekh, Z., Pham, H., Le, Q. V., Sung, Y., Li, Z., and Duerig, T. Scaling up visual and vision-language representation learning with noisy text supervision. arXiv preprint arXiv:2102.05918.
- [8] Decoupled Weight Decay Regularization
Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
- [9] Learning Transferable Visual Models From Natural Language Supervision
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020.
- [10] LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs
Schuhmann, C., Vencu, R., Beaumont, R., Kaczmarczyk, R., Mullis, C., Katta, A., Coombes, T., Jitsev, J., and Komatsuzaki, A. LAION-400M: Open dataset of CLIP-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114.
- [11] VLMo: Unified vision-language pre-training with mixture-of-modality-experts
Wang, W., Bao, H., Dong, L., ...
- [12] CoCa: Contrastive captioners are image-text foundation models
Yu, J., Wang, Z., Vasudevan, V., Yeung, L., Seyedhosseini, M., and Wu, Y. CoCa: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917.
- [13] Florence: A new foundation model for computer vision
Yuan, L., Chen, D., Chen, Y., Codella, N., Dai, X., Gao, J., Hu, H., Huang, X., Li, B., Li, C., Liu, C., Liu, M., Liu, Z., Lu, Y., Shi, Y., Wang, L., Wang, J., Xiao, B., Xiao, Z., Yang, J., Zeng, M., Zhou, L., and Zhang, P. Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432.
- [14] VinVL: Making visual representations matter in vision-language models
Zhang, P., Li, X., Hu, X., Yang, J., Zhang, L., Wang, L., Choi, Y., and Gao, J. VinVL: Making visual representations matter in vision-language models. arXiv preprint arXiv:2101.00529.
- [15] OPT: Open Pre-trained Transformer Language Models
Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M. T., Li, X., Lin, X. V., Mihaylov, T., Ott, M., Shleifer, S., Shuster, K., Simig, D., Koura, P. S., Sridhar, A., Wang, T., and Zettlemoyer, L. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.