Recognition: no theorem link
Mind the Gap No More: Achieving Zero-Gap Multimodal Integration via One Tokenizer
Pith reviewed 2026-05-16 12:46 UTC · model grok-4.3
The pith
Unified vocabularies keep multimodal models in a zero-gap state across every hidden layer.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors formally characterize the geometric modality gap and prove that native architectures using a unified vocabulary intrinsically maintain a zero-gap state across all hidden layers. They introduce One Tokenizer, which maps all modalities directly into a shared token space, and show on a DNA–text testbed that it delivers superior performance for deep biological reasoning compared with encoder-based modular counterparts.
What carries the argument
One Tokenizer, a native architecture that maps every modality directly into one shared token space and thereby maintains zero geometric gap across layers.
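The summary above does not fix the tokenizer itself, so the following is a minimal illustrative sketch rather than the authors' implementation: one vocabulary that covers both ordinary word pieces and DNA k-mers, so text and sequence inputs become IDs into a single shared embedding table with no modality-specific encoder in the path. The k-mer size, the toy vocabulary, and the whitespace text tokenizer are assumptions chosen for brevity.

```python
from itertools import product

# Minimal sketch (not the paper's implementation): a single vocabulary that
# covers both English word pieces and DNA k-mers, so every input becomes IDs
# into one shared embedding table and no modality-specific encoder is needed.
DNA_KMER_SIZE = 3  # hypothetical choice; not specified in the summary above
TEXT_TOKENS = ["<bos>", "<eos>", "the", "gene", "regulates", "expression"]
DNA_TOKENS = ["".join(kmer) for kmer in product("ACGT", repeat=DNA_KMER_SIZE)]
VOCAB = {tok: i for i, tok in enumerate(TEXT_TOKENS + DNA_TOKENS)}

def tokenize_text(s: str) -> list[int]:
    """Whitespace stand-in for a real subword tokenizer (e.g. BPE)."""
    return [VOCAB[w] for w in s.lower().split() if w in VOCAB]

def tokenize_dna(seq: str) -> list[int]:
    """Non-overlapping k-mers mapped into the same ID space as text."""
    return [VOCAB[seq[i:i + DNA_KMER_SIZE]]
            for i in range(0, len(seq) - DNA_KMER_SIZE + 1, DNA_KMER_SIZE)]

# Both modalities land in one token stream; downstream layers never see a
# modality boundary, which is the architectural point the claim rests on.
prompt = tokenize_text("the gene regulates expression") + tokenize_dna("ACGTTGCA")
print(prompt)
```

In practice one would train a subword tokenizer (e.g. BPE, as in Sennrich et al. [20]) jointly over both corpora rather than hand-building the vocabulary; the hand-built table here only makes the shared-ID-space idea concrete.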
Load-bearing premise
The geometric modality gap is the main limit on deep cross-modal reasoning, and a single unified tokenizer removes it without introducing new capacity or training problems.
What would settle it
A carefully tuned modular encoder-based model matching or exceeding One Tokenizer performance on the identical DNA-text benchmark would show the gap is not the dominant bottleneck.
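Beyond matched benchmark scores, such a comparison could also measure the gap itself in both models. The paper's exact definition is not given in this summary, so the sketch below uses one common proxy, the distance between the modality centroids at each hidden layer; the array shapes, the random stand-in hidden states, and the helper name `modality_gap` are assumptions for illustration.

```python
import numpy as np

def modality_gap(text_states: np.ndarray, dna_states: np.ndarray) -> float:
    """Centroid distance between two modalities' hidden states at one layer.

    Both arrays have shape (num_tokens, hidden_dim), e.g. the layer-l states of
    text tokens and DNA tokens from the same forward pass. This is one common
    proxy for a 'geometric modality gap', not necessarily the paper's definition.
    """
    return float(np.linalg.norm(text_states.mean(axis=0) - dna_states.mean(axis=0)))

# Hypothetical usage: layer_states[l] holds both modalities' hidden states at
# layer l, as returned by a forward pass that exposes all hidden layers.
rng = np.random.default_rng(0)
layer_states = [
    {"text": rng.normal(size=(32, 16)), "dna": rng.normal(loc=0.5, size=(32, 16))}
    for _ in range(4)
]
gaps = [modality_gap(s["text"], s["dna"]) for s in layer_states]
print(gaps)  # a genuinely zero-gap architecture should keep these near zero at every layer
```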
read the original abstract
A central challenge in developing Multimodal Large Language Models (MLLMs) is effectively integrating heterogeneous inputs into a cohesive reasoning engine. Current paradigms predominantly rely on modular architectures that introduce modality-specific encoders and cross-modal fusion mechanisms. However, these designs are fundamentally bottlenecked by a geometric modality gap, forcing the LLM to expend significant computational capacity on geometric reconciliation rather than deep cross-modal reasoning. In this work, we formally characterize this modality gap and theoretically demonstrate that native architectures, specifically those employing a unified vocabulary, intrinsically maintain a zero-gap state across all hidden layers. Guided by these theoretical findings, we propose One Tokenizer, a native architecture that maps all modalities directly into a shared token space. We empirically validate this framework on a DNA–text multimodal testbed. Our extensive evaluations reveal that by achieving seamless integration within the LLM's native latent space, One Tokenizer consistently outperforms encoder-based modular counterparts, providing a fundamentally superior framework for deep biological reasoning.
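The abstract asserts a formal characterization of the gap without stating it here. One plausible formalization of a per-layer gap, offered only as an illustration (the symbols, the centroid-based definition, and the modality partition are assumptions, not the paper's), is:

```latex
% Illustrative formalization, not taken from the paper.
% h^{(\ell)}(x): layer-\ell hidden state of token x;
% X_text, X_dna: the token sets of the two modalities in a batch.
\[
  \Delta^{(\ell)} \;=\;
  \Bigl\lVert\,
    \mathbb{E}_{x \in \mathcal{X}_{\mathrm{text}}}\!\bigl[h^{(\ell)}(x)\bigr]
    \;-\;
    \mathbb{E}_{x \in \mathcal{X}_{\mathrm{dna}}}\!\bigl[h^{(\ell)}(x)\bigr]
  \,\Bigr\rVert_2 .
\]
```

Under this reading, the zero-gap claim is that the quantity stays at (or negligibly near) zero for every layer; the question the referee raises below is which assumptions beyond a shared embedding table are needed for that to hold independently of joint training.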
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that modular MLLMs suffer from a geometric modality gap due to separate encoders and fusion modules, while native architectures using a unified vocabulary (proposed as One Tokenizer) intrinsically maintain a zero-gap state across all hidden layers. It provides a theoretical characterization of the gap and shows that One Tokenizer outperforms encoder-based models on a DNA-text multimodal testbed for biological reasoning.
Significance. If the zero-gap result holds without hidden assumptions on joint training, the work would offer a substantial alternative paradigm for MLLM design, removing the computational overhead of geometric reconciliation and enabling more direct cross-modal reasoning. The DNA-text empirical results, if robust, would indicate practical gains in biological applications where sequence and textual data must be integrated deeply.
major comments (2)
- [theoretical demonstration] The central theoretical claim (abstract and theoretical demonstration) that a unified vocabulary 'intrinsically' maintains zero-gap across hidden layers lacks any derivation, equations, or explicit assumptions; without showing how token sharing alone aligns embedding vectors and subsequent states (independent of joint optimization), the result risks circularity with training dynamics.
- [empirical validation] The empirical section reports that One Tokenizer 'consistently outperforms' encoder-based counterparts on the DNA-text testbed, but provides no data statistics, baseline descriptions, controls, or quantitative metrics, leaving the superiority claim without verifiable support.
minor comments (1)
- [abstract] The abstract refers to 'extensive evaluations' and 'seamless integration' without defining the exact architecture details or loss functions used in One Tokenizer.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas where additional rigor and detail are needed. We address each major comment below and will incorporate the suggested revisions in the next version of the paper.
read point-by-point responses
-
Referee: [theoretical demonstration] The central theoretical claim (abstract and theoretical demonstration) that a unified vocabulary 'intrinsically' maintains zero-gap across hidden layers lacks any derivation, equations, or explicit assumptions; without showing how token sharing alone aligns embedding vectors and subsequent states (independent of joint optimization), the result risks circularity with training dynamics.
Authors: We acknowledge that the current manuscript presents the zero-gap property at a conceptual level without full mathematical derivations. In the revision, we will add a dedicated theoretical section containing explicit equations for embedding vector alignment and hidden-state evolution under a shared vocabulary. The derivation will state all assumptions explicitly (including that alignment follows from identical tokenization and embedding lookup independent of any joint optimization) and will separate the architectural property from training dynamics to avoid circularity. revision: yes
-
Referee: [empirical validation] The empirical section reports that One Tokenizer 'consistently outperforms' encoder-based counterparts on the DNA-text testbed, but provides no data statistics, baseline descriptions, controls, or quantitative metrics, leaving the superiority claim without verifiable support.
Authors: We agree that the empirical claims require substantially more supporting detail. The revised manuscript will expand the experimental section to report full dataset statistics, precise descriptions of all encoder-based baselines and fusion modules, experimental controls for fair comparison, and quantitative results including tables with performance metrics, standard deviations, and statistical tests. revision: yes
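The response above commits to reporting metrics with standard deviations and statistical tests. A minimal sketch of what that kind of reporting could look like is given below; the per-seed scores are placeholder numbers, not results from the paper, and Welch's t-test is only one reasonable choice of test.

```python
import numpy as np
from scipy import stats

# Placeholder per-seed scores; not results from the paper.
one_tokenizer = np.array([0.81, 0.79, 0.83, 0.80, 0.82])
modular_baseline = np.array([0.76, 0.77, 0.75, 0.78, 0.74])

print(f"One Tokenizer:    {one_tokenizer.mean():.3f} +/- {one_tokenizer.std(ddof=1):.3f}")
print(f"Modular baseline: {modular_baseline.mean():.3f} +/- {modular_baseline.std(ddof=1):.3f}")

# Welch's t-test; a paired test would be more appropriate if seeds/splits are matched.
t_stat, p_value = stats.ttest_ind(one_tokenizer, modular_baseline, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```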
Circularity Check
Theoretical zero-gap demonstration remains independent of fitted parameters or self-referential definitions
full rationale
The paper's central theoretical step formally characterizes the geometric modality gap in modular encoder-based designs and then shows that native unified-vocabulary architectures maintain a zero-gap state across hidden layers. This characterization is presented as a direct consequence of the architectural choice (shared token space) rather than a fitted quantity or a result imported solely via self-citation. No equations or claims in the abstract reduce the zero-gap property to a redefinition of the input gap itself, nor does the empirical DNA–text evaluation rely on parameters tuned to the same testbed used for the theoretical claim. The derivation chain therefore remains self-contained with respect to external benchmarks and does not exhibit any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: A geometric modality gap exists as the central bottleneck in modular multimodal architectures.
invented entities (1)
- One Tokenizer: no independent evidence
Reference graph
Works this paper leans on
- [1] [Alayrac et al., 2022] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
- [2] [Chen et al., 2023] Feilong Chen, Minglun Han, Haozhi Zhao, Qingyang Zhang, Jing Shi, Shuang Xu, and Bo Xu. X-LLM: Bootstrapping advanced large language models by treating multi-modalities as foreign languages. arXiv preprint arXiv:2305.04160, 2023.
- [3] [Chen et al., 2024] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024.
- [4] [Dalla-Torre et al., 2025] Hugo Dalla-Torre, Liam Gonzalez, Javier Mendoza-Revilla, Nicolas Lopez Carranza, Adam Henryk Grzywaczewski, Francesco Oteri, Christian Dallago, Evan Trop, Bernardo P de Almeida, Hassan Sirelkhatim, et al. Nucleotide Transformer: building and evaluating robust foundation models for human genomics. Nature Methods, 22(2):287–297, 2025.
- [5] [de Almeida et al., 2025] Bernardo P de Almeida, Guillaume Richard, Hugo Dalla-Torre, Christopher Blum, Lorenz Hexemer, Priyanka Pandey, Stefan Laurent, Chandana Rajesh, Marie Lopez, Alexandre Laterre, et al. A multimodal conversational agent for DNA, RNA and protein tasks. Nature Machine Intelligence, pages 1–14, 2025.
- [6] [Dhanasekar et al., 2025] Shashi Dhanasekar, Akash Saranathan, and Pengtao Xie. GeneChat: A multi-modal large language model for gene function prediction. bioRxiv, 2025.
- [7] [Duan et al., 2025] Qihao Duan, Bingding Huang, Zhenqiao Song, Irina Lehmann, Lei Gu, Roland Eils, and Benjamin Wild. JanusDNA: A powerful bi-directional hybrid DNA foundation model. arXiv preprint arXiv:2505.17257, 2025.
- [8] [Fallahpour et al., 2025] Adibvafa Fallahpour, Andrew Magnuson, Purav Gupta, Shihao Ma, Jack Naimer, Arnav Shah, Haonan Duan, Omar Ibrahim, Hani Goodarzi, Chris J Maddison, et al. BioReason: Incentivizing multimodal biological reasoning within a DNA-LLM model. arXiv preprint arXiv:2505.23579, 2025.
- [9] [Ji et al., 2021] Yanrong Ji, Zhihan Zhou, Han Liu, and Ramana V Davuluri. DNABERT: pre-trained bidirectional encoder representations from transformers model for DNA-language in genome. Bioinformatics, 37(15):2112–2120, 2021.
- [10] [Li et al., 2023] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, pages 19730–19742. PMLR, 2023.
- [11] [Liu et al., 2023] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023.
- [12] [Liu et al., 2024] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024.
- [13] [McInnes et al., 2018] Leland McInnes, John Healy, and James Melville. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018.
- [14] [Naveed et al., 2025] Humza Naveed, Asad Ullah Khan, Shi Qiu, Muhammad Saqib, Saeed Anwar, Muhammad Usman, Naveed Akhtar, Nick Barnes, and Ajmal Mian. A comprehensive overview of large language models. ACM Transactions on Intelligent Systems and Technology, 16(5):1–72, 2025.
- [15] [Pugliese et al., 2025] Raffaele Pugliese, Silvia Badini, Emanuele Frontoni, and Stefano Regondi. Generative artificial intelligence for advancing discovery and design in biomateriomics. Intelligent Computing, 4:0117, 2025.
- [16] [Radford et al., 2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
- [17] [Refahi et al., 2025] Mohammadsaleh Refahi, Mahdi Abavisani, Bahrad A Sokhansanj, James R Brown, and Gail Rosen. Context-aware regularization with Markovian integration for attention-based nucleotide analysis. arXiv preprint arXiv:2507.09378, 2025.
- [18] [Schiff et al., 2024] Yair Schiff, Chia-Hsiang Kao, Aaron Gokaslan, Tri Dao, Albert Gu, and Volodymyr Kuleshov. Caduceus: Bi-directional equivariant long-range DNA sequence modeling. Proceedings of Machine Learning Research, 235:43632, 2024.
- [19] [Schulman et al., 2022] John Schulman, Barret Zoph, Christina Kim, Jacob Hilton, Jacob Menick, Jiayi Weng, Juan Felipe Ceron Uribe, Liam Fedus, Luke Metz, Michael Pokorny, et al. ChatGPT: Optimizing language models for dialogue. OpenAI Blog, 2(4), 2022.
- [20] [Sennrich et al., 2016] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, 2016.
- [21] [Su et al., 2023] Yixuan Su, Tian Lan, Huayang Li, Jialu Xu, Yan Wang, and Deng Cai. PandaGPT: One model to instruction-follow them all. arXiv preprint arXiv:2305.16355, 2023.
- [22] [Sun et al., 2023] Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Emu: Generative pretraining in multimodality. arXiv preprint arXiv:2307.05222, 2023.
- [23] [Team, 2024] Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818, 2024.
- [24] [Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
- [25] [Wang et al., 2024] Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869, 2024.
- [26] [Wang et al., 2025] Aowen Wang, Jiaqi Li, Hongyu Dong, Bocheng Xu, Qingyu Yin, Yanchao Xu, Jie Fu, and Junbo Zhao. OmniReg-GPT: a high-efficiency foundation model for comprehensive genomic sequence understanding. Nature Communications, 16(1):10139, 2025.
- [27] [Yang et al., 2024] Xiaodong Yang, Guole Liu, Guihai Feng, Dechao Bu, Pengfei Wang, Jie Jiang, Shubai Chen, Qinmeng Yang, Hefan Miao, Yiyang Zhang, et al. GeneCompass: deciphering universal gene regulatory mechanisms with a knowledge-informed cross-species foundation model. Cell Research, 34(12):830–845, 2024.
- [28] [Yang et al., 2025] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
- [29] [Zhai et al., 2023] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023.
- [30] [Zhang et al., 2023a] Hang Zhang, Xin Li, and Lidong Bing. Video-LLaMA: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858, 2023.
- [31] [Zhang et al., 2023b] Xiaoman Zhang, Chaoyi Wu, Ziheng Zhao, Weixiong Lin, Ya Zhang, Yanfeng Wang, and Weidi Xie. PMC-VQA: Visual instruction tuning for medical visual question answering. arXiv preprint arXiv:2305.10415, 2023.
- [32] [Zhang et al., 2025] Xinjie Zhang, Jintao Guo, Shanshan Zhao, Minghao Fu, Lunhao Duan, Jiakui Hu, Yong Xien Chng, Guo-Hua Wang, Qing-Guo Chen, Zhao Xu, et al. Unified multimodal understanding and generation models: Advances, challenges, and opportunities. arXiv preprint arXiv:2505.02567, 2025.
discussion (0)