Recognition: no theorem link
Mind the Gap No More: Achieving Zero-Gap Multimodal Integration via One Tokenizer
Pith reviewed 2026-05-16 12:46 UTC · model grok-4.3
The pith
Unified vocabularies keep multimodal models in a zero-gap state across every hidden layer.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors formally characterize the geometric modality gap and prove that native architectures using a unified vocabulary intrinsically maintain a zero-gap state across all hidden layers. They introduce One Tokenizer, which maps all modalities directly into a shared token space, and show on a DNA–text testbed that it delivers superior performance for deep biological reasoning compared with encoder-based modular counterparts.
What carries the argument
One Tokenizer, a native architecture that maps every modality directly into one shared token space and thereby maintains zero geometric gap across layers.
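The summary above does not fix the tokenizer itself, so the following is a minimal illustrative sketch rather than the authors' implementation: one vocabulary that covers both ordinary word pieces and DNA k-mers, so text and sequence inputs become IDs into a single shared embedding table with no modality-specific encoder in the path. The k-mer size, the toy vocabulary, and the whitespace text tokenizer are assumptions chosen for brevity.

```python
from itertools import product

# Minimal sketch (not the paper's implementation): a single vocabulary that
# covers both English word pieces and DNA k-mers, so every input becomes IDs
# into one shared embedding table and no modality-specific encoder is needed.
DNA_KMER_SIZE = 3  # hypothetical choice; not specified in the summary above
TEXT_TOKENS = ["<bos>", "<eos>", "the", "gene", "regulates", "expression"]
DNA_TOKENS = ["".join(kmer) for kmer in product("ACGT", repeat=DNA_KMER_SIZE)]
VOCAB = {tok: i for i, tok in enumerate(TEXT_TOKENS + DNA_TOKENS)}

def tokenize_text(s: str) -> list[int]:
    """Whitespace stand-in for a real subword tokenizer (e.g. BPE)."""
    return [VOCAB[w] for w in s.lower().split() if w in VOCAB]

def tokenize_dna(seq: str) -> list[int]:
    """Non-overlapping k-mers mapped into the same ID space as text."""
    return [VOCAB[seq[i:i + DNA_KMER_SIZE]]
            for i in range(0, len(seq) - DNA_KMER_SIZE + 1, DNA_KMER_SIZE)]

# Both modalities land in one token stream; downstream layers never see a
# modality boundary, which is the architectural point the claim rests on.
prompt = tokenize_text("the gene regulates expression") + tokenize_dna("ACGTTGCA")
print(prompt)
```

In practice one would train a subword tokenizer (e.g. BPE, as in Sennrich et al. [20]) jointly over both corpora rather than hand-building the vocabulary; the hand-built table here only makes the shared-ID-space idea concrete.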
Load-bearing premise
The geometric modality gap is the main limit on deep cross-modal reasoning, and a single unified tokenizer removes it without introducing new capacity or training problems.
What would settle it
A carefully tuned modular encoder-based model matching or exceeding One Tokenizer performance on the identical DNA-text benchmark would show the gap is not the dominant bottleneck.
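Beyond matched benchmark scores, such a comparison could also measure the gap itself in both models. The paper's exact definition is not given in this summary, so the sketch below uses one common proxy, the distance between the modality centroids at each hidden layer; the array shapes, the random stand-in hidden states, and the helper name `modality_gap` are assumptions for illustration.

```python
import numpy as np

def modality_gap(text_states: np.ndarray, dna_states: np.ndarray) -> float:
    """Centroid distance between two modalities' hidden states at one layer.

    Both arrays have shape (num_tokens, hidden_dim), e.g. the layer-l states of
    text tokens and DNA tokens from the same forward pass. This is one common
    proxy for a 'geometric modality gap', not necessarily the paper's definition.
    """
    return float(np.linalg.norm(text_states.mean(axis=0) - dna_states.mean(axis=0)))

# Hypothetical usage: layer_states[l] holds both modalities' hidden states at
# layer l, as returned by a forward pass that exposes all hidden layers.
rng = np.random.default_rng(0)
layer_states = [
    {"text": rng.normal(size=(32, 16)), "dna": rng.normal(loc=0.5, size=(32, 16))}
    for _ in range(4)
]
gaps = [modality_gap(s["text"], s["dna"]) for s in layer_states]
print(gaps)  # a genuinely zero-gap architecture should keep these near zero at every layer
```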
read the original abstract
A central challenge in developing Multimodal Large Language Models (MLLMs) is effectively integrating heterogeneous inputs into a cohesive reasoning engine. Current paradigms predominantly rely on modular architectures that introduce modality-specific encoders and cross-modal fusion mechanisms. However, these designs are fundamentally bottlenecked by a geometric modality gap, forcing the LLM to expend significant computational capacity on geometric reconciliation rather than deep cross-modal reasoning. In this work, we formally characterize this modality gap and theoretically demonstrate that native architectures, specifically those employing a unified vocabulary, intrinsically maintain a zero-gap state across all hidden layers. Guided by these theoretical findings, we propose One Tokenizer, a native architecture that maps all modalities directly into a shared token space. We empirically validate this framework on a DNA–text multimodal testbed. Our extensive evaluations reveal that by achieving seamless integration within the LLM's native latent space, One Tokenizer consistently outperforms encoder-based modular counterparts, providing a fundamentally superior framework for deep biological reasoning.
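The abstract asserts a formal characterization of the gap without stating it here. One plausible formalization of a per-layer gap, offered only as an illustration (the symbols, the centroid-based definition, and the modality partition are assumptions, not the paper's), is:

```latex
% Illustrative formalization, not taken from the paper.
% h^{(\ell)}(x): layer-\ell hidden state of token x;
% X_text, X_dna: the token sets of the two modalities in a batch.
\[
  \Delta^{(\ell)} \;=\;
  \Bigl\lVert\,
    \mathbb{E}_{x \in \mathcal{X}_{\mathrm{text}}}\!\bigl[h^{(\ell)}(x)\bigr]
    \;-\;
    \mathbb{E}_{x \in \mathcal{X}_{\mathrm{dna}}}\!\bigl[h^{(\ell)}(x)\bigr]
  \,\Bigr\rVert_2 .
\]
```

Under this reading, the zero-gap claim is that the quantity stays at (or negligibly near) zero for every layer; the question the referee raises below is which assumptions beyond a shared embedding table are needed for that to hold independently of joint training.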
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that modular MLLMs suffer from a geometric modality gap due to separate encoders and fusion modules, while native architectures using a unified vocabulary (proposed as One Tokenizer) intrinsically maintain a zero-gap state across all hidden layers. It provides a theoretical characterization of the gap and shows that One Tokenizer outperforms encoder-based models on a DNA-text multimodal testbed for biological reasoning.
Significance. If the zero-gap result holds without hidden assumptions on joint training, the work would offer a substantial alternative paradigm for MLLM design, removing the computational overhead of geometric reconciliation and enabling more direct cross-modal reasoning. The DNA-text empirical results, if robust, would indicate practical gains in biological applications where sequence and textual data must be integrated deeply.
major comments (2)
- [theoretical demonstration] The central theoretical claim (abstract and theoretical demonstration) that a unified vocabulary 'intrinsically' maintains zero-gap across hidden layers lacks any derivation, equations, or explicit assumptions; without showing how token sharing alone aligns embedding vectors and subsequent states (independent of joint optimization), the result risks circularity with training dynamics.
- [empirical validation] The empirical section reports that One Tokenizer 'consistently outperforms' encoder-based counterparts on the DNA-text testbed, but provides no data statistics, baseline descriptions, controls, or quantitative metrics, leaving the superiority claim without verifiable support.
minor comments (1)
- [abstract] The abstract refers to 'extensive evaluations' and 'seamless integration' without defining the exact architecture details or loss functions used in One Tokenizer.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas where additional rigor and detail are needed. We address each major comment below and will incorporate the suggested revisions in the next version of the paper.
read point-by-point responses
-
Referee: [theoretical demonstration] The central theoretical claim (abstract and theoretical demonstration) that a unified vocabulary 'intrinsically' maintains zero-gap across hidden layers lacks any derivation, equations, or explicit assumptions; without showing how token sharing alone aligns embedding vectors and subsequent states (independent of joint optimization), the result risks circularity with training dynamics.
Authors: We acknowledge that the current manuscript presents the zero-gap property at a conceptual level without full mathematical derivations. In the revision, we will add a dedicated theoretical section containing explicit equations for embedding vector alignment and hidden-state evolution under a shared vocabulary. The derivation will state all assumptions explicitly (including that alignment follows from identical tokenization and embedding lookup independent of any joint optimization) and will separate the architectural property from training dynamics to avoid circularity. revision: yes
-
Referee: [empirical validation] The empirical section reports that One Tokenizer 'consistently outperforms' encoder-based counterparts on the DNA-text testbed, but provides no data statistics, baseline descriptions, controls, or quantitative metrics, leaving the superiority claim without verifiable support.
Authors: We agree that the empirical claims require substantially more supporting detail. The revised manuscript will expand the experimental section to report full dataset statistics, precise descriptions of all encoder-based baselines and fusion modules, experimental controls for fair comparison, and quantitative results including tables with performance metrics, standard deviations, and statistical tests. revision: yes
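The response above commits to reporting metrics with standard deviations and statistical tests. A minimal sketch of what that kind of reporting could look like is given below; the per-seed scores are placeholder numbers, not results from the paper, and Welch's t-test is only one reasonable choice of test.

```python
import numpy as np
from scipy import stats

# Placeholder per-seed scores; not results from the paper.
one_tokenizer = np.array([0.81, 0.79, 0.83, 0.80, 0.82])
modular_baseline = np.array([0.76, 0.77, 0.75, 0.78, 0.74])

print(f"One Tokenizer:    {one_tokenizer.mean():.3f} +/- {one_tokenizer.std(ddof=1):.3f}")
print(f"Modular baseline: {modular_baseline.mean():.3f} +/- {modular_baseline.std(ddof=1):.3f}")

# Welch's t-test; a paired test would be more appropriate if seeds/splits are matched.
t_stat, p_value = stats.ttest_ind(one_tokenizer, modular_baseline, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```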
Circularity Check
Theoretical zero-gap demonstration remains independent of fitted parameters or self-referential definitions
full rationale
The paper's central theoretical step formally characterizes the geometric modality gap in modular encoder-based designs and then shows that native unified-vocabulary architectures maintain a zero-gap state across hidden layers. This characterization is presented as a direct consequence of the architectural choice (shared token space) rather than a fitted quantity or a result imported solely via self-citation. No equations or claims in the abstract reduce the zero-gap property to a redefinition of the input gap itself, nor does the empirical DNA–text evaluation rely on parameters tuned to the same testbed used for the theoretical claim. The derivation chain therefore remains self-contained with respect to external benchmarks and does not exhibit any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: A geometric modality gap exists as the central bottleneck in modular multimodal architectures.
invented entities (1)
- One Tokenizer: no independent evidence
Reference graph
Works this paper leans on
- [1] [Alayrac et al., 2022] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
- [2] [Chen et al., 2023] Feilong Chen, Minglun Han, Haozhi Zhao, Qingyang Zhang, Jing Shi, Shuang Xu, and Bo Xu. X-LLM: Bootstrapping advanced large language models by treating multi-modalities as foreign languages. arXiv preprint arXiv:2305.04160, 2023.
- [3] [Chen et al., 2024] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024.
- [4] [Dalla-Torre et al., 2025] Hugo Dalla-Torre, Liam Gonzalez, Javier Mendoza-Revilla, Nicolas Lopez Carranza, Adam Henryk Grzywaczewski, Francesco Oteri, Christian Dallago, Evan Trop, Bernardo P de Almeida, Hassan Sirelkhatim, et al. Nucleotide Transformer: building and evaluating robust foundation models for human genomics. Nature Methods, 22(2):287–297, 2025.
- [5] [de Almeida et al., 2025] Bernardo P de Almeida, Guillaume Richard, Hugo Dalla-Torre, Christopher Blum, Lorenz Hexemer, Priyanka Pandey, Stefan Laurent, Chandana Rajesh, Marie Lopez, Alexandre Laterre, et al. A multimodal conversational agent for DNA, RNA and protein tasks. Nature Machine Intelligence, pages 1–14, 2025.
- [6] [Dhanasekar et al., 2025] Shashi Dhanasekar, Akash Saranathan, and Pengtao Xie. GeneChat: A multi-modal large language model for gene function prediction. bioRxiv, 2025.
- [7] [Duan et al., 2025] Qihao Duan, Bingding Huang, Zhenqiao Song, Irina Lehmann, Lei Gu, Roland Eils, and Benjamin Wild. JanusDNA: A powerful bi-directional hybrid DNA foundation model. arXiv preprint arXiv:2505.17257, 2025.
- [8] [Fallahpour et al., 2025] Adibvafa Fallahpour, Andrew Magnuson, Purav Gupta, Shihao Ma, Jack Naimer, Arnav Shah, Haonan Duan, Omar Ibrahim, Hani Goodarzi, Chris J Maddison, et al. BioReason: Incentivizing multimodal biological reasoning within a DNA-LLM model. arXiv preprint arXiv:2505.23579, 2025.
- [9] [Ji et al., 2021] Yanrong Ji, Zhihan Zhou, Han Liu, and Ramana V Davuluri. DNABERT: pre-trained bidirectional encoder representations from transformers model for DNA-language in genome. Bioinformatics, 37(15):2112–2120, 2021.
- [10] [Li et al., 2023] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, pages 19730–19742. PMLR, 2023.
- [11] [Liu et al., 2023] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023.
- [12] [Liu et al., 2024] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024.
- [13] [McInnes et al., 2018] Leland McInnes, John Healy, and James Melville. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018.
- [14] [Naveed et al., 2025] Humza Naveed, Asad Ullah Khan, Shi Qiu, Muhammad Saqib, Saeed Anwar, Muhammad Usman, Naveed Akhtar, Nick Barnes, and Ajmal Mian. A comprehensive overview of large language models. ACM Transactions on Intelligent Systems and Technology, 16(5):1–72, 2025.
- [15] [Pugliese et al., 2025] Raffaele Pugliese, Silvia Badini, Emanuele Frontoni, and Stefano Regondi. Generative artificial intelligence for advancing discovery and design in biomateriomics. Intelligent Computing, 4:0117, 2025.
- [16] [Radford et al., 2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
- [17] [Refahi et al., 2025] Mohammadsaleh Refahi, Mahdi Abavisani, Bahrad A Sokhansanj, James R Brown, and Gail Rosen. Context-aware regularization with Markovian integration for attention-based nucleotide analysis. arXiv preprint arXiv:2507.09378, 2025.
- [18] [Schiff et al., 2024] Yair Schiff, Chia-Hsiang Kao, Aaron Gokaslan, Tri Dao, Albert Gu, and Volodymyr Kuleshov. Caduceus: Bi-directional equivariant long-range DNA sequence modeling. Proceedings of Machine Learning Research, 235:43632, 2024.
- [19] [Schulman et al., 2022] John Schulman, Barret Zoph, Christina Kim, Jacob Hilton, Jacob Menick, Jiayi Weng, Juan Felipe Ceron Uribe, Liam Fedus, Luke Metz, Michael Pokorny, et al. ChatGPT: Optimizing language models for dialogue. OpenAI Blog, 2(4), 2022.
- [20] [Sennrich et al., 2016] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, 2016.
- [21] [Su et al., 2023] Yixuan Su, Tian Lan, Huayang Li, Jialu Xu, Yan Wang, and Deng Cai. PandaGPT: One model to instruction-follow them all. arXiv preprint arXiv:2305.16355, 2023.
- [22] [Sun et al., 2023] Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Emu: Generative pretraining in multimodality. arXiv preprint arXiv:2307.05222, 2023.
- [23] [Team, 2024] Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818, 2024.
- [24] [Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
- [25] [Wang et al., 2024] Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869, 2024.
- [26] [Wang et al., 2025] Aowen Wang, Jiaqi Li, Hongyu Dong, Bocheng Xu, Qingyu Yin, Yanchao Xu, Jie Fu, and Junbo Zhao. OmniReg-GPT: a high-efficiency foundation model for comprehensive genomic sequence understanding. Nature Communications, 16(1):10139, 2025.
- [27] [Yang et al., 2024] Xiaodong Yang, Guole Liu, Guihai Feng, Dechao Bu, Pengfei Wang, Jie Jiang, Shubai Chen, Qinmeng Yang, Hefan Miao, Yiyang Zhang, et al. GeneCompass: deciphering universal gene regulatory mechanisms with a knowledge-informed cross-species foundation model. Cell Research, 34(12):830–845, 2024.
- [28] [Yang et al., 2025] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
- [29] [Zhai et al., 2023] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023.
- [30] [Zhang et al., 2023a] Hang Zhang, Xin Li, and Lidong Bing. Video-LLaMA: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858, 2023.
- [31] [Zhang et al., 2023b] Xiaoman Zhang, Chaoyi Wu, Ziheng Zhao, Weixiong Lin, Ya Zhang, Yanfeng Wang, and Weidi Xie. PMC-VQA: Visual instruction tuning for medical visual question answering. arXiv preprint arXiv:2305.10415, 2023.
- [32] [Zhang et al., 2025] Xinjie Zhang, Jintao Guo, Shanshan Zhao, Minghao Fu, Lunhao Duan, Jiakui Hu, Yong Xien Chng, Guo-Hua Wang, Qing-Guo Chen, Zhao Xu, et al. Unified multimodal understanding and generation models: Advances, challenges, and opportunities. arXiv preprint arXiv:2505.02567, 2025.
discussion (0)