pith. machine review for the scientific record.

arxiv: 2604.18452 · v1 · submitted 2026-04-20 · 💻 cs.CV · cs.CL

Recognition: unknown

ESsEN: Training Compact Discriminative Vision-Language Transformers in a Low-Resource Setting

Casey Kennington, Clayton Fields

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 04:25 UTC · model grok-4.3

classification 💻 cs.CV · cs.CL
keywords vision-language models · compact transformers · low-resource training · two-tower architecture · CNN integration · discriminative tasks · parameter efficiency

The pith

ESsEN is a compact two-tower vision-language model trained end-to-end with few resources that matches larger models on discriminative tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops methods for building smaller vision-language models that still work well when training data and compute power are scarce. It finds that two-tower designs, which keep vision and language processing separate before combining them, outperform single-tower versions in these limited settings. Adding standard convolutional networks to the vision side further reduces the number of parameters needed while preserving performance. The authors also show that the module fusing the two modalities can be resized or reshaped without hurting results. Their resulting model, ESsEN, serves as an example that such compact systems can be trained from scratch and reach competitive accuracy on English image-text tasks using only a small fraction of the parameters common in larger models.

Core claim

In low-resource settings for discriminative English vision-language tasks, two-tower encoder models are superior to one-tower encoders. Incorporating traditional convolutional networks into the two-tower transformer architecture helps produce parameter-efficient models. The cross-modal fusion module of two-tower encoders can vary significantly in shape and size while producing the same results. These findings enable ESsEN, a compact vision-language model that can be trained end-to-end with relatively few resources and matches other models on several tasks while using only a fraction of their parameters.

What carries the argument

Two-tower encoder architecture with CNN integration, in which separate vision and language encoders process inputs before a flexible cross-modal fusion step.
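
The paper does not reproduce its implementation here; the following is a minimal PyTorch sketch of the two-tower pattern described above, with a small CNN stem as the vision tower and cross-attention as the fusion step. All dimensions, layer counts, the vocabulary size, and the binary classification head are illustrative assumptions, not ESsEN's actual configuration.

```python
# Minimal sketch of a two-tower vision-language encoder with a CNN vision tower
# and a cross-attention fusion module. All sizes are illustrative assumptions,
# not the ESsEN configuration reported in the paper.
import torch
import torch.nn as nn


class TwoTowerVLEncoder(nn.Module):
    def __init__(self, vocab_size=30522, d_model=256, n_fusion_layers=2, n_heads=4):
        super().__init__()
        # Vision tower: a small CNN stem stands in for a full vision transformer,
        # which is where the parameter savings would come from.
        self.vision_tower = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(64, d_model, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        # Language tower: token embeddings plus a shallow transformer encoder.
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.text_tower = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=4 * d_model,
                                       batch_first=True),
            num_layers=2,
        )
        # Cross-modal fusion: text tokens attend over visual tokens. Depth and
        # width here are the knobs the paper says can vary without hurting accuracy.
        self.fusion = nn.ModuleList([
            nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            for _ in range(n_fusion_layers)
        ])
        self.classifier = nn.Linear(d_model, 2)  # e.g. a binary discriminative head

    def forward(self, images, token_ids):
        v = self.vision_tower(images)                    # (B, d, H', W')
        v = v.flatten(2).transpose(1, 2)                 # (B, H'*W', d) visual tokens
        t = self.text_tower(self.text_embed(token_ids))  # (B, L, d) text tokens
        for attn in self.fusion:
            fused, _ = attn(query=t, key=v, value=v)     # text attends to vision
            t = t + fused                                # residual fusion
        return self.classifier(t.mean(dim=1))            # pooled discriminative logits


# Usage: a forward pass on dummy inputs.
model = TwoTowerVLEncoder()
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 30522, (2, 16)))
print(logits.shape)  # torch.Size([2, 2])
```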

If this is right

  • Two-tower encoders outperform one-tower encoders in low-resource discriminative English vision-language tasks.
  • Adding CNNs to the two-tower transformer produces more parameter-efficient vision-language models.
  • The fusion module between towers can be resized or reshaped without changing task performance.
  • ESsEN reaches comparable accuracy to larger models while using only a fraction of the parameters.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same two-tower plus CNN pattern could be tested on non-English languages or non-discriminative tasks to check whether the efficiency gains persist.
  • Vision-language capabilities become more feasible in hardware-constrained settings such as mobile devices or robots, without requiring large pre-trained backbones.
  • Systematically varying fusion-module size offers a direct way to trade compute for accuracy on a per-deployment basis (a rough sketch of this knob follows below).
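
As a rough illustration of that last point, the snippet below counts the parameters in a hypothetical cross-attention fusion stack as its width and depth vary. The specific shapes are assumptions chosen for illustration, not configurations reported in the paper.

```python
# Illustrative only: how a fusion module's parameter budget scales with its
# width (d_model) and depth (n_layers). Each cross-attention layer contributes
# roughly 4 * d_model**2 weights.
import torch.nn as nn


def fusion_params(d_model: int, n_heads: int, n_layers: int) -> int:
    """Count parameters in a stack of cross-attention layers used as a fusion module."""
    fusion = nn.ModuleList([
        nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        for _ in range(n_layers)
    ])
    return sum(p.numel() for p in fusion.parameters())


# Sweep a few hypothetical fusion shapes to see the size trade-off.
for d_model, n_layers in [(128, 1), (256, 2), (512, 4)]:
    print(f"d_model={d_model}, layers={n_layers}: "
          f"{fusion_params(d_model, n_heads=4, n_layers=n_layers):,} fusion params")
```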

Load-bearing premise

The low-resource conditions and the specific discriminative English tasks tested are representative of vision-language modeling more broadly.

What would settle it

An experiment that trains a one-tower model under identical low-resource conditions on the same tasks and finds it matches or exceeds the two-tower performance would falsify the claimed superiority.
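
A minimal sketch of what such a controlled comparison might look like; every budget number, task name, and hyperparameter below is a placeholder chosen for illustration, not a value taken from the paper.

```python
# Hypothetical controlled-comparison protocol: both architectures share the same
# data budget, parameter ceiling, and schedule, so any remaining gap can be
# attributed to the one-tower vs. two-tower split rather than to resources.
matched_budget = {
    "train_pairs": 500_000,        # identical low-resource image-text budget
    "max_params": 50_000_000,      # identical parameter ceiling
    "epochs": 10,
    "optimizer": {"name": "adamw", "lr": 1e-4, "weight_decay": 0.01},
    "eval_tasks": ["image-text matching", "visual question answering"],
}

runs = [{"arch": arch, **matched_budget} for arch in ("one_tower", "two_tower_cnn")]
for run in runs:
    print(run["arch"], run["train_pairs"], run["max_params"])
```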

Figures

Figures reproduced from arXiv: 2604.18452 by Casey Kennington, Clayton Fields.

Figure 1: Simple visual representation of two-tower and … [figure not reproduced here]
Original abstract

Vision-language modeling is rapidly increasing in popularity with an ever expanding list of available models. In most cases, these vision-language models have parameters in the tens of billions, which is necessary for some needs, but in many cases smaller models are necessary (e.g., on edge devices or independent robotic platforms). Unfortunately, there is little research in producing light-weight models or in training them with small datasets. Inspired by the language learning progression and data sparsity in child development, in this paper, we address both of these goals in a systematic fashion. We show that two-tower encoder models are superior to one-tower encoders in low-resource settings for discriminative English tasks. We show also that incorporating traditional convolutional networks into the two-tower transformer architecture can help produce parameter efficient vision-language models. Finally, we show that the cross-modal fusion module of two-tower encoders can vary significantly in shape and size while producing the same results. In addition, we present ESsEN, a compact vision-language model that can be trained end-to-end with relatively few resources that performs as well on several tasks with only a fraction of the parameters compared to other models. The experimental results and the tools we present here make vision-language modeling more accessible to a wider variety of researchers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces ESsEN, a compact two-tower vision-language transformer that integrates CNNs for parameter efficiency. It claims two-tower encoders outperform one-tower encoders in low-resource settings for discriminative English tasks, that CNN integration produces more efficient models, that cross-modal fusion modules can vary in shape/size with equivalent results, and that ESsEN achieves comparable task performance with far fewer parameters while being trainable end-to-end with limited resources, inspired by child language acquisition.

Significance. If the empirical claims hold under rigorous verification, the work would meaningfully advance accessible vision-language modeling by showing that compact two-tower designs with CNN components can match larger models in data-scarce regimes. This could enable deployment on edge devices and broaden participation in VL research, with the flexibility result on fusion modules offering a practical design insight.

major comments (3)
  1. [Experimental section] The central claims of superiority and parameter efficiency rest on comparisons whose dataset sizes, exact baselines, training protocols, hyperparameter schedules, and statistical tests are not specified, preventing verification of the reported performance gaps and efficiency gains.
  2. [§4 (or equivalent results section)] Evaluation is restricted to discriminative English tasks without ablations or controls testing one-tower failure modes, multilingual settings, generative tasks, or out-of-distribution data; this makes the architectural recommendation (two-tower + CNN) load-bearing only for the tested regime and risks overgeneralization.
  3. [Abstract and model description] The claim that ESsEN 'performs as well on several tasks with only a fraction of the parameters' requires explicit tables of parameter counts, metrics, and baselines; without these, the efficiency advantage cannot be assessed quantitatively.
minor comments (2)
  1. [Model architecture] Notation for the two-tower architecture and fusion module could be clarified with a diagram or explicit equations to distinguish the CNN integration from standard transformer blocks.
  2. [Introduction] The inspiration from child development is mentioned but not operationalized; a brief discussion of how data sparsity or progression is mimicked in the training schedule would strengthen the narrative.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We appreciate the emphasis on verifiability, scope, and quantitative clarity. We address each major comment below and will revise the manuscript to strengthen these aspects where possible.

Point-by-point responses
  1. Referee: [Experimental section] The central claims of superiority and parameter efficiency rest on comparisons whose dataset sizes, exact baselines, training protocols, hyperparameter schedules, and statistical tests are not specified, preventing verification of the reported performance gaps and efficiency gains.

    Authors: We agree that the experimental details were insufficiently specified in the initial submission, which hinders verification. In the revised manuscript, we will add a dedicated Experimental Setup subsection that explicitly reports dataset sizes, the precise baselines and their implementations, full training protocols, hyperparameter schedules, and any statistical tests or significance measures used. This will enable full reproducibility and assessment of the claimed performance and efficiency gains. revision: yes

  2. Referee: [§4 (or equivalent results section)] Evaluation is restricted to discriminative English tasks without ablations or controls testing one-tower failure modes, multilingual settings, generative tasks, or out-of-distribution data; this makes the architectural recommendation (two-tower + CNN) load-bearing only for the tested regime and risks overgeneralization.

    Authors: Our work is deliberately scoped to low-resource discriminative English tasks, motivated by the data-sparse progression observed in child language acquisition. We acknowledge that this limits the generalizability of the two-tower + CNN recommendation. We will add a Limitations section that explicitly discusses the tested regime, notes the absence of multilingual/generative/OOD evaluations, and outlines directions for future work. Where feasible within our resource constraints, we will include additional ablations on one-tower failure modes. The core claims remain tied to the evaluated setting. revision: partial

  3. Referee: [Abstract and model description] The claim that ESsEN 'performs as well on several tasks with only a fraction of the parameters' requires explicit tables of parameter counts, metrics, and baselines; without these, the efficiency advantage cannot be assessed quantitatively.

    Authors: We agree that an explicit quantitative comparison is required to substantiate the efficiency claim. The revised manuscript will include a new table (placed in the model description or results section) that reports parameter counts for ESsEN alongside all baselines, together with the corresponding task metrics. This will allow direct, quantitative assessment of the parameter-efficiency advantage. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical comparisons are self-contained

Full rationale

The paper's central claims rest on direct experimental comparisons of two-tower vs. one-tower encoders, CNN integration, and cross-modal fusion variants under low-resource English discriminative tasks, plus the introduction and benchmarking of the ESsEN model. No equations, derivations, or first-principles predictions are presented that reduce by construction to fitted inputs, self-definitions, or self-citation chains. Architecture choices and performance results are tested independently on held-out data rather than being renamed or forced by prior self-referential assumptions.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claims rest on standard machine-learning training assumptions plus the domain assumption that two-tower separation plus CNNs will remain advantageous under data sparsity; no new physical entities are postulated and free parameters are the usual ML hyperparameters whose specific values are not enumerated in the abstract.

free parameters (1)
  • model hyperparameters and training schedule
    Typical in end-to-end neural training; exact values for ESsEN not provided in abstract.
axioms (1)
  • domain assumption: Two-tower encoders outperform one-tower encoders for discriminative vision-language tasks under low data regimes
    Invoked as the first main finding and motivated by child-language-learning analogy.
invented entities (1)
  • ESsEN architecture (no independent evidence)
    purpose: Compact end-to-end trainable vision-language model
    New named model presented; no independent falsifiable evidence outside the paper's own experiments.

pith-pipeline@v0.9.0 · 5522 in / 1240 out tokens · 61255 ms · 2026-05-10T04:25:03.631781+00:00 · methodology

discussion (0)

