Recognition: 2 theorem links
InfoTok: Information-Theoretic Regularization for Capacity-Constrained Shared Visual Tokenization in Unified MLLMs
Pith reviewed 2026-05-16 08:06 UTC · model grok-4.3
The pith
InfoTok imposes mutual-information constraints on shared visual tokens to improve both understanding and generation in unified MLLMs without extra data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
InfoTok is an information-regularized tokenization mechanism grounded in the Information Bottleneck principle. It explicitly controls information flow from images to shared tokens by imposing mutual-information constraints, instantiated via variational IB and HSIC estimators, thereby encouraging compression of task-irrelevant variation while preserving the cross-modal consistency needed for both understanding and generation.
What carries the argument
InfoTok, the information-regularized tokenization mechanism that imposes mutual-information constraints on the shared visual tokenizer to enforce a compression-versus-relevance trade-off.
If this is right
- Shared visual tokens become more reusable across understanding and generation tasks under explicit information constraints.
- No additional training data is required to obtain consistent gains in both modalities.
- The same capacity-constrained perspective can be applied to other tokenization stages inside unified models.
- Practical MI estimators such as variational IB and HSIC suffice to realize the regularization in high-dimensional visual settings.
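The HSIC-based alternative named above can be sketched with the standard biased empirical estimator; the kernel choice, bandwidth, and function names here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def rbf_gram(x, sigma=1.0):
    """Gram matrix of a Gaussian (RBF) kernel over the rows of x."""
    sq = np.sum(x ** 2, axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * x @ x.T
    return np.exp(-dist2 / (2.0 * sigma ** 2))

def hsic_biased(x, y, sigma=1.0):
    """Biased empirical HSIC: trace(K H L H) / (n - 1)^2.
    Near zero when x and y are independent; larger under dependence."""
    n = x.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    k, l = rbf_gram(x, sigma), rbf_gram(y, sigma)
    return float(np.trace(k @ h @ l @ h)) / (n - 1) ** 2
```

Because every operation here is differentiable, the same expression can be used directly as a penalty inside a training loss, which is what makes HSIC attractive as a surrogate for intractable mutual information.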
Where Pith is reading between the lines
- The same regularization idea could be tested on tokenizers for video or audio to check whether capacity-aware compression generalizes across modalities.
- If the MI constraints prove stable, they might reduce the need for hand-tuned architecture choices in future unified models.
- Extending the approach to later layers of the language model itself could further tighten the overall information budget.
Load-bearing premise
The chosen differentiable estimators for mutual information sufficiently enforce the intended constraints and produce tokens that simultaneously support semantic abstraction and visual detail.
What would settle it
Applying InfoTok to any of the three tested unified MLLMs and observing no gain or a net loss on standard image-understanding and image-generation benchmarks would falsify the central claim.
Figures
read the original abstract
Unified multimodal large language models (MLLMs) aim to unify image understanding and image generation within a single framework, where a shared visual tokenizer serves as the sole interface that maps high-dimensional images into a limited token budget for downstream multimodal reasoning and synthesis. However, existing shared-token designs are largely architecture-driven and lack an explicit criterion for what information should be preserved to simultaneously support semantic abstraction and visual detail. In this paper, we adopt a capacity-constrained perspective, viewing the shared tokenizer as a compute-bounded learner whose finite representational budget should prioritize reusable structure over hard-to-exploit high-entropy variations and redundancy. Motivated by this view, we propose InfoTok, an information-regularized tokenization mechanism grounded in the Information Bottleneck (IB) principle. InfoTok explicitly controls information flow from images to shared tokens to multimodal outputs by imposing mutual-information (MI) constraints that enforce a principled trade-off between compression and task relevance, while also encouraging cross-modal consistency. Because MI is intractable for high-dimensional visual representations, we instantiate InfoTok with practical, differentiable dependence estimators, including a variational IB formulation and a Hilbert–Schmidt Independence Criterion (HSIC)-based alternative. Integrated into three representative unified MLLMs without introducing any additional training data, InfoTok consistently improves both image understanding and generation performance. These results support information-regularized visual tokenization as a sound basis for token learning in unified MLLMs.
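The compression-versus-relevance trade-off described in the abstract is conventionally written as the Information Bottleneck Lagrangian; the formulation below is the standard Tishby-style objective, offered for orientation rather than as the paper's exact notation:

```latex
\min_{p(z \mid x)} \; \mathcal{L}_{\mathrm{IB}} \;=\; I(X; Z) \;-\; \beta \, I(Z; Y),
```

where $X$ is the input image, $Z$ the shared visual tokens, and $Y$ the task-relevant output: minimizing $I(X;Z)$ enforces compression, maximizing $I(Z;Y)$ preserves task relevance, and $\beta$ sets the trade-off.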
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes InfoTok, an information-regularized tokenization mechanism for shared visual tokenizers in unified MLLMs. Grounded in the Information Bottleneck principle, it imposes mutual-information constraints via differentiable estimators (variational IB and HSIC) to enforce a compression-relevance trade-off that supports both semantic abstraction and visual detail. The approach is integrated into three representative unified MLLMs without extra training data and is reported to yield consistent gains in image understanding and generation performance.
Significance. If the central claim holds, InfoTok offers a principled, architecture-agnostic way to regularize capacity-constrained tokenizers using information theory, addressing a gap in existing shared-token designs. The no-extra-data integration and dual-task improvements would be practically valuable for unified MLLMs. However, significance hinges on whether the surrogate MI estimators reliably enforce the intended constraints or merely provide generic regularization; without that verification the contribution reduces to an empirical regularization trick.
major comments (2)
- [Abstract, §3] Abstract and §3 (method): The central claim that variational IB and HSIC estimators enforce explicit MI constraints for a principled compression-relevance trade-off is load-bearing, yet the manuscript provides no quantitative verification (e.g., estimated I(image; tokens) bounds, ablation on estimator tightness, or comparison against true MI proxies) that these surrogates actually achieve the intended capacity control in high-dimensional visual feature regimes. If the estimators exhibit the known high variance/bias documented for HSIC and variational bounds on visual data, observed gains may not stem from the information-theoretic mechanism.
- [§4] §4 (experiments): The statement that InfoTok 'consistently improves' both understanding and generation across three models lacks reported effect sizes, statistical significance, or controls that isolate the MI regularization from other training differences. Without these, the cross-model claim cannot be evaluated as evidence for the IB-based design.
minor comments (2)
- [§3] Notation for the two MI estimators should be unified and clearly distinguished from the true mutual information I(·;·) to avoid implying exact enforcement.
- [Abstract] The abstract claims 'no additional training data' but does not specify whether the original MLLM training recipes were held exactly constant or whether hyper-parameters were re-tuned for the new regularizers.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the need for stronger validation of the information-theoretic claims and more rigorous experimental reporting. We address each point below and will revise the manuscript to incorporate additional analyses and controls.
read point-by-point responses
-
Referee: [Abstract, §3] Abstract and §3 (method): The central claim that variational IB and HSIC estimators enforce explicit MI constraints for a principled compression-relevance trade-off is load-bearing, yet the manuscript provides no quantitative verification (e.g., estimated I(image; tokens) bounds, ablation on estimator tightness, or comparison against true MI proxies) that these surrogates actually achieve the intended capacity control in high-dimensional visual feature regimes. If the estimators exhibit the known high variance/bias documented for HSIC and variational bounds on visual data, observed gains may not stem from the information-theoretic mechanism.
Authors: We agree that explicit verification of the MI constraints would strengthen the central claim. The variational IB formulation provides a tractable surrogate lower bound on the relevant mutual information terms by design, and HSIC serves as a kernel-based dependence measure whose consistency properties are established in the literature. However, we acknowledge the absence of direct quantitative checks such as reported I(image; tokens) estimates or tightness ablations in the current manuscript. In the revision we will add these: (i) estimated MI values computed via the variational bounds and HSIC on held-out visual features before and after regularization, (ii) sensitivity analysis varying the regularization coefficients to demonstrate the compression-relevance trade-off, and (iii) discussion of known estimator biases with empirical evidence that performance gains scale with the strength of the MI terms rather than generic regularization. revision: yes
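For a diagonal-Gaussian token encoder, the variational compression penalty the authors promise to report in item (i) has a closed form: the KL divergence to a standard-normal prior upper-bounds the I(image; tokens) term. The sketch below is a generic illustration under that encoder assumption, not the paper's code.

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """Per-sample KL( N(mu, diag(exp(logvar))) || N(0, I) ).
    Averaged over a dataset, this upper-bounds the compression
    term I(X; Z) in a variational IB objective."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar, axis=-1)
```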
-
Referee: [§4] §4 (experiments): The statement that InfoTok 'consistently improves' both understanding and generation across three models lacks reported effect sizes, statistical significance, or controls that isolate the MI regularization from other training differences. Without these, the cross-model claim cannot be evaluated as evidence for the IB-based design.
Authors: We accept that the current experimental section would benefit from greater statistical rigor. The reported improvements are based on standard benchmark metrics across three distinct unified MLLM architectures, but we did not include effect sizes, multiple-run statistics, or explicit ablations that remove only the IB/HSIC terms while keeping all other training elements fixed. In the revised version we will: (i) report mean improvements with standard deviations over at least three random seeds, (ii) include statistical significance tests (e.g., paired t-tests or Wilcoxon tests) for the key metrics, (iii) add a dedicated ablation table isolating the contribution of the mutual-information regularizers, and (iv) provide controls that compare against equivalent-capacity models trained with non-information-theoretic regularizers to better attribute gains to the IB mechanism. revision: yes
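The paired tests proposed in item (ii) are straightforward to run over matched per-seed benchmark scores; the helper name and score arrays below are hypothetical, and SciPy's stats module supplies both tests.

```python
from scipy import stats

def paired_significance(baseline, treated):
    """Two-sided paired tests over matched per-seed scores:
    parametric (paired t-test) and non-parametric (Wilcoxon)."""
    _, t_p = stats.ttest_rel(treated, baseline)
    _, w_p = stats.wilcoxon(treated, baseline)
    return {"paired_t_p": float(t_p), "wilcoxon_p": float(w_p)}
```

Note that with only three seeds, as pledged in the rebuttal, the two-sided Wilcoxon test cannot fall below p = 0.25, so the parametric test or more seeds would be needed to support significance claims.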
Circularity Check
No circularity; derivation adopts external IB principle with standard estimators
full rationale
The paper grounds InfoTok in the established Information Bottleneck principle and instantiates it via known differentiable estimators (variational IB and HSIC) because exact MI is intractable. No equations or steps are shown that reduce claimed performance gains to a fitted parameter renamed as prediction, a self-defined quantity, or a self-citation chain. The capacity constraint is enforced through external regularization objectives whose validity rests on prior literature rather than the present work's outputs. Empirical improvements on three unified MLLMs without extra data constitute independent evidence, not a tautology. No load-bearing self-citation, uniqueness theorem, or ansatz smuggling is visible in the derivation chain.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Mutual information for high-dimensional visual data can be reliably estimated via variational IB and HSIC without introducing bias that harms downstream task performance.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
InfoTok explicitly controls information flow ... by imposing mutual-information (MI) constraints that enforce a principled trade-off between compression and task relevance ... instantiated with ... variational IB formulation and ... HSIC
-
IndisputableMonolith/Foundation/LogicAsFunctionalEquation.lean · Translation Theorem / J-uniqueness corollary · echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
L_IB = I(Z;I) − β I(Z;Y_GT) ... compactness term ... KL upper bounds ... sufficiency lower bounds
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.