pith. sign in

arxiv: 2606.05552 · v1 · pith:QO6QV5NOnew · submitted 2026-06-04 · 💻 cs.LG · cs.AI· cs.GR

Balancing Image Compression and Generation with Bootstrapped Tokenization

Pith reviewed 2026-06-28 03:04 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.GR
keywords image tokenizationself-bootstrapped learningimage generationgenerative modelstoken decompositionglobal local separationefficiency in generationreconstruction quality
0
0 comments X

The pith

SelfBootTok decomposes image tokens into global and local groups so generators need only global tokens for better efficiency and quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that standard image tokenizers mix all granularities in each token, creating redundancy and complicating generator training. SelfBootTok uses self-bootstrapped learning to cleanly separate information into global and local token groups, allowing the model to predict local details exclusively from global tokens. This shifts the burden of visual details to the tokenizer, so the generator operates on fewer tokens. A sympathetic reader would care because the approach promises lower computation for high-quality image tasks while scaling to stronger results with added data or parameters.

Core claim

Through self-bootstrapped learning, the model predicts local details exclusively from global tokens, shifting the burden of visual details from the generator to the tokenizer. Consequently, our generator is far more efficient, requiring only global tokens and reducing computation by approximately 40%, while delivering superior reconstruction and generation. Moreover, this paradigm scales elegantly: by leveraging more data or parameters to self-supervise local representation learning, SelfBootTok achieves a new state-of-the-art gFID score of 1.56 using only 64 tokens.

What carries the argument

SelfBootTok, a tokenization method that uses self-bootstrapped learning to decompose image information into global and local token groups and predict local details from global tokens alone.

If this is right

  • Generators require only global tokens and cut computation by approximately 40%.
  • Reconstruction and generation quality improve over methods that mix all information in tokens.
  • State-of-the-art gFID of 1.56 is reached using only 64 tokens.
  • Performance scales further by using more data or parameters to improve local representation learning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same global-local separation might reduce token needs in video or audio generation tasks.
  • Downstream models could train faster because they receive less mixed information per token.
  • Further scaling of the bootstrapping step might allow competitive results with even fewer than 64 global tokens.

Load-bearing premise

That self-bootstrapped learning can reliably predict local details exclusively from global tokens without significant loss of information or the need for additional mechanisms.

What would settle it

An experiment where generators using only SelfBootTok global tokens produce images with gFID scores substantially higher than 1.56 or visibly missing critical local details that cannot be recovered from the globals.

Figures

Figures reproduced from arXiv: 2606.05552 by Hao Jiang, Haozhe Chi, Jinghan Li, Jing Wang, Wu Sheng, Yadong Mu, Yi Ma.

Figure 1
Figure 1. Figure 1: Illustration of our self-bootstrapped learning paradigm. Compared to classical 1D image [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: ImageNet-1K 256×256 generation results of generative models trained with 64 tokens. We include a versatile classes of images such as animals, plants and food. Our method achieve pleasant generation outcomes with efficient token representation and local details. 2.1 Image tokenization Image tokenization methods can be broadly categorized by their latent space type (discrete vs. contin￾uous) and spatial stru… view at source ↗
Figure 3
Figure 3. Figure 3: Overview of the proposed SelfbootTok pipeline. The input image is first encoded into a set [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Illustration of parallel training pipeline. Our method supports training larger scale tokenizer [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 6
Figure 6. Figure 6: Scaling results for the local 2D aligner [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
read the original abstract

Despite progress in image tokenization, standard methods encode redundant information by mixing all granularities within each token, thus redundancy persists between tokens. The mix of information of different granularity also complicates the training of generators. This paper introduces SelfBootTok, a method that resolves this by cleanly decomposing information into global and local token groups. Through self-bootstrapped learning, the model predicts local details exclusively from global tokens, shifting the burden of visual details from the generator to the tokenizer. Consequently, our generator is far more efficient, requiring only global tokens and reducing computation by approximately 40%, while delivering superior reconstruction and generation. Moreover, this paradigm scales elegantly: by leveraging more data or parameters to self-supervise local representation learning, SelfBootTok achieves a new state-of-the-art gFID score of 1.56 using only 64 tokens.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces SelfBootTok, a tokenization method that uses self-bootstrapped learning to decompose image information into separate global and local token groups. This allows the generator to operate exclusively on global tokens, yielding an approximate 40% reduction in computation while improving reconstruction and generation quality. The approach is claimed to scale with more data or parameters, achieving a new state-of-the-art gFID score of 1.56 using only 64 tokens.

Significance. If the claimed clean separation of global and local information holds without hidden leakage or parameter sharing, the result would be significant for efficient image generation pipelines, as it shifts detail modeling to the tokenizer and reduces generator compute. The reported gFID improvement with minimal tokens would represent a notable advance in token-efficient generative modeling.

major comments (2)
  1. [Abstract] Abstract: The central claim that self-bootstrapped learning enables the generator to use only global tokens with a 40% compute reduction requires a description of the bootstrapping objective, loss terms, and training schedule to enforce separation; none are supplied, leaving the efficiency and gFID=1.56 results unverifiable.
  2. [Abstract] Abstract: The assertion that local details are predicted 'exclusively from global tokens' with 'no additional mechanisms' is load-bearing for the decomposition claim, yet the abstract provides no equations, architecture diagram, or ablation showing that the bootstrapping predictor does not share parameters with the generator or rely on joint optimization of local tokens.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback on the abstract. We agree that the abstract should better convey the key methodological elements to support the central claims and will revise it accordingly while preserving its brevity.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that self-bootstrapped learning enables the generator to use only global tokens with a 40% compute reduction requires a description of the bootstrapping objective, loss terms, and training schedule to enforce separation; none are supplied, leaving the efficiency and gFID=1.56 results unverifiable.

    Authors: We agree the abstract would benefit from a concise description of the bootstrapping process. The full manuscript details the objective (self-supervised prediction of local tokens from global tokens via an auxiliary predictor), the loss (reconstruction loss plus bootstrapping MSE term), and schedule (tokenizer pretraining followed by generator training on global tokens only) in Sections 3.2 and 4. We will add one sentence to the abstract summarizing these elements to improve verifiability at the abstract level. revision: yes

  2. Referee: [Abstract] Abstract: The assertion that local details are predicted 'exclusively from global tokens' with 'no additional mechanisms' is load-bearing for the decomposition claim, yet the abstract provides no equations, architecture diagram, or ablation showing that the bootstrapping predictor does not share parameters with the generator or rely on joint optimization of local tokens.

    Authors: The abstract's length constraints preclude equations or diagrams, but the manuscript provides these in Figure 2 (separate lightweight predictor MLP with dedicated parameters, no sharing with the generator) and Section 5.3 (ablations isolating the bootstrapping component and confirming no joint optimization of local tokens during generator training). We will revise the abstract to include a short clarifying clause referencing the dedicated predictor to strengthen the separation claim. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained

full rationale

The provided abstract and description introduce SelfBootTok via self-bootstrapped learning to separate global/local tokens, with efficiency and gFID claims presented as outcomes of the method. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the text that would reduce any result to its inputs by construction. The central claims rest on the described training process rather than tautological redefinitions, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; full paper required for complete ledger.

pith-pipeline@v0.9.1-grok · 5687 in / 1041 out tokens · 47663 ms · 2026-06-28T03:04:34.145651+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

66 extracted references · 36 canonical work pages · 14 internal anchors

  1. [1]

    Scalable diffusion models with transformers

    Peebles, William, Xie, Saining. Scalable diffusion models with transformers. Proceedings of the IEEE/CVF international conference on computer vision:4195–4205, 2023

  2. [2]

    Flow Matching for Generative Modeling

    Lipman, Yaron, Chen, Ricky TQ, Ben-Hamu, Heli, Nickel, Maximilian, Le, Matt. Flow match- ing for generative modeling. arXiv preprint arXiv:2210.02747, 2022. 9

  3. [3]

    Discrete flow matching

    Gat, Itai, Remez, Tal, Shaul, Neta, Kreuk, Felix, Chen, Ricky TQ, Synnaeve, Gabriel, Adi, Yossi, Lipman, Yaron. Discrete flow matching. Advances in Neural Information Processing Systems 37:133345–133385, 2024

  4. [4]

    Flow matching in latent space

    Dao, Quan, Phung, Hao, Nguyen, Binh, Tran, Anh. Flow matching in latent space. arXiv preprint arXiv:2307.08698, 2023

  5. [5]

    Visual autoregressive modeling: Scalable image generation via next-scale prediction

    Tian, Keyu, Jiang, Yi, Yuan, Zehuan, Peng, Bingyue, Wang, Liwei. Visual autoregressive modeling: Scalable image generation via next-scale prediction. Advances in neural information processing systems 37:84839–84865, 2024

  6. [6]

    Masked autoencoders are effective tokenizers for diffusion models

    Chen, Hao, Han, Yujin, Chen, Fangyi, Li, Xiang, Wang, Yidong, Wang, Jindong, Wang, Ze, Liu, Zicheng, Zou, Difan, Raj, Bhiksha. Masked autoencoders are effective tokenizers for diffusion models. Forty-second International Conference on Machine Learning, 2025

  7. [7]

    Autoregressive image generation without vector quantization

    Li, Tianhong, Tian, Yonglong, Li, He, Deng, Mingyang, He, Kaiming. Autoregressive image generation without vector quantization. Advances in Neural Information Processing Systems 37:56424–56445, 2024

  8. [8]

    Hieratok: Multi-scale visual tokenizer improves image reconstruction and generation

    Chen, Cong, Huang, Ziyuan, Zou, Cheng, Zhu, Muzhi, Ji, Kaixiang, Liu, Jiajia, Chen, Jing- dong, Chen, Hao, Shen, Chunhua. Hieratok: Multi-scale visual tokenizer improves image reconstruction and generation. arXiv preprint arXiv:2509.23736, 2025

  9. [9]

    Flowar: Scale-wise autoregressive image generation meets flow matching

    Ren, Sucheng, Yu, Qihang, He, Ju, Shen, Xiaohui, Yuille, Alan, Chen, Liang-Chieh. Flowar: Scale-wise autoregressive image generation meets flow matching. arXiv preprint arXiv:2412.15205, 2024

  10. [10]

    Reconstruction vs

    Yao, Jingfeng, Yang, Bin, Wang, Xinggang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. Proceedings of the Computer Vision and Pattern Recognition Conference:15703–15712, 2025

  11. [11]

    Representation Entanglement for Generation: Training Diffusion Transformers Is Much Easier Than You Think

    Wu, Ge, Zhang, Shen, Shi, Ruijing, Gao, Shanghua, Chen, Zhenyuan, Wang, Lei, Chen, Zhaowei, Gao, Hongcheng, Tang, Yao, Yang, Jian, others. Representation Entanglement for Generation: Training Diffusion Transformers Is Much Easier Than You Think. arXiv preprint arXiv:2507.01467, 2025

  12. [12]

    Albergo, Nicholas M

    Nanye Ma, Mark Goldstein, Michael S. Albergo, Nicholas M. Boffi, Eric Vanden-Eijnden, Sain- ing Xie. SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers. 2024.https://arxiv.org/abs/2401.08740

  13. [13]

    Taming transformers for high-resolution image synthesis

    Esser, Patrick, Rombach, Robin, Ommer, Bjorn. Taming transformers for high-resolution image synthesis. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition:12873–12883, 2021

  14. [14]

    Neural discrete representation learning

    Van Den Oord, Aaron, Vinyals, Oriol, others. Neural discrete representation learning. Advances in neural information processing systems 30, 2017

  15. [15]

    An image is worth 32 tokens for reconstruction and generation

    Yu, Qihang, Weber, Mark, Deng, Xueqing, Shen, Xiaohui, Cremers, Daniel, Chen, Liang-Chieh. An image is worth 32 tokens for reconstruction and generation. Advances in Neural Information Processing Systems 37:128940–128966, 2024

  16. [16]

    Softvq-vae: Efficient 1-dimensional continuous tokenizer

    Chen, Hao, Wang, Ze, Li, Xiang, Sun, Ximeng, Chen, Fangyi, Liu, Jiang, Wang, Jindong, Raj, Bhiksha, Liu, Zicheng, Barsoum, Emad. Softvq-vae: Efficient 1-dimensional continuous tokenizer. Proceedings of the Computer Vision and Pattern Recognition Conference:28358– 28370, 2025

  17. [17]

    Gigatok: Scaling visual tokenizers to 3 billion parameters for autoregressive image generation

    Xiong, Tianwei, Liew, Jun Hao, Huang, Zilong, Feng, Jiashi, Liu, Xihui. Gigatok: Scaling visual tokenizers to 3 billion parameters for autoregressive image generation. arXiv preprint arXiv:2504.08736, 2025

  18. [18]

    Spectral image tokenizer

    Esteves, Carlos, Suhail, Mohammed, Makadia, Ameesh. Spectral image tokenizer. Proceedings of the IEEE/CVF International Conference on Computer Vision:17181–17190, 2025. 10

  19. [19]

    FlexTok: Resampling Images into 1D Token Sequences of Flexible Length

    Bachmann, Roman, Allardice, Jesse, Mizrahi, David, Fini, Enrico, Kar, O˘guzhan Fatih, Amirloo, Elmira, El-Nouby, Alaaeldin, Zamir, Amir, Dehghan, Afshin. FlexTok: Resampling Images into 1D Token Sequences of Flexible Length. Forty-second International Conference on Machine Learning, 2025

  20. [20]

    Democratizing text-to-image masked generative models with compact text-aware one-dimensional tokens

    Kim, Dongwon, He, Ju, Yu, Qihang, Yang, Chenglin, Shen, Xiaohui, Kwak, Suha, Chen, Liang-Chieh. Democratizing text-to-image masked generative models with compact text-aware one-dimensional tokens. arXiv preprint arXiv:2501.07730, 2025

  21. [21]

    Flowtok: Flowing seamlessly across text and image tokens

    He, Ju, Yu, Qihang, Liu, Qihao, Chen, Liang-Chieh. Flowtok: Flowing seamlessly across text and image tokens. arXiv preprint arXiv:2503.10772, 2025

  22. [22]

    Vector-quantized Image Modeling with Improved VQGAN

    Yu, Jiahui, Li, Xin, Koh, Jing Yu, Zhang, Han, Pang, Ruoming, Qin, James, Ku, Alexander, Xu, Yuanzhong, Baldridge, Jason, Wu, Yonghui. Vector-quantized image modeling with improved vqgan. arXiv preprint arXiv:2110.04627, 2021

  23. [23]

    Auto-Encoding Variational Bayes

    Kingma, Diederik P, Welling, Max. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013

  24. [24]

    High- resolution image synthesis with latent diffusion models

    Rombach, Robin, Blattmann, Andreas, Lorenz, Dominik, Esser, Patrick, Ommer, Björn. High- resolution image synthesis with latent diffusion models. Proceedings of the IEEE/CVF confer- ence on computer vision and pattern recognition:10684–10695, 2022

  25. [25]

    Maskgit: Masked generative image transformer

    Chang, Huiwen, Zhang, Han, Jiang, Lu, Liu, Ce, Freeman, William T. Maskgit: Masked generative image transformer. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition:11315–11325, 2022

  26. [26]

    One-d-piece: Image tokenizer meets quality-controllable compression

    Miwa, Keita, Sasaki, Kento, Arai, Hidehisa, Takahashi, Tsubasa, Yamaguchi, Yu. One-d-piece: Image tokenizer meets quality-controllable compression. arXiv preprint arXiv:2501.10064, 2025

  27. [27]

    Highly Compressed Tokenizer Can Generate Without Training

    Beyer, L Lao, Li, Tianhong, Chen, Xinlei, Karaman, Sertac, He, Kaiming. Highly Compressed Tokenizer Can Generate Without Training. arXiv preprint arXiv:2506.08257, 2025

  28. [28]

    Maskbit: Embedding-free image generation via bit tokens

    Weber, Mark, Yu, Lijun, Yu, Qihang, Deng, Xueqing, Shen, Xiaohui, Cremers, Daniel, Chen, Liang-Chieh. Maskbit: Embedding-free image generation via bit tokens. arXiv preprint arXiv:2409.16211, 2024

  29. [29]

    Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

    Yu, Lijun, Lezama, José, Gundavarapu, Nitesh B, Versari, Luca, Sohn, Kihyuk, Minnen, David, Cheng, Yong, Birodkar, Vighnesh, Gupta, Agrim, Gu, Xiuye, others. Language Model Beats Diffusion–Tokenizer is Key to Visual Generation. arXiv preprint arXiv:2310.05737, 2023

  30. [30]

    Imagefolder: Autoregressive image generation with folded tokens

    Li, Xiang, Qiu, Kai, Chen, Hao, Kuen, Jason, Gu, Jiuxiang, Raj, Bhiksha, Lin, Zhe. Imagefolder: Autoregressive image generation with folded tokens. arXiv preprint arXiv:2410.01756, 2024

  31. [31]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Li, Junnan, Li, Dongxu, Savarese, Silvio, Hoi, Steven. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. International conference on machine learning:19730–19742, 2023

  32. [32]

    Self-bootstrapped visual-language model for knowledge selection and question answering

    Hao, Dongze, Wang, Qunbo, Guo, Longteng, Jiang, Jie, Liu, Jing. Self-bootstrapped visual-language model for knowledge selection and question answering. arXiv preprint arXiv:2404.13947, 2024

  33. [33]

    VILA 2: VILA Augmented VILA

    Fang, Yunhao, Zhu, Ligeng, Lu, Yao, Wang, Yan, Molchanov, Pavlo, Kautz, Jan, Cho, Jang Hyun, Pavone, Marco, Han, Song, Yin, Hongxu. VILA 2: VILA Augmented VILA. arXiv preprint arXiv:2407.17453, 2024

  34. [34]

    VideoJudge: Bootstrapping Enables Scalable Supervision of MLLM-as-a-Judge for Video Understanding

    Waheed, Abdul, Wu, Zhen, Alharthi, Dareen, Kim, Seungone, Raj, Bhiksha. VideoJudge: Bootstrapping Enables Scalable Supervision of MLLM-as-a-Judge for Video Understanding. arXiv preprint arXiv:2509.21451, 2025

  35. [35]

    Bootstrapping Grounded Chain-of-Thought in Multimodal LLMs for Data-Efficient Model Adaptation

    Xia, Jiaer, Tong, Bingkui, Zang, Yuhang, Shao, Rui, Zhou, Kaiyang. Bootstrapping Grounded Chain-of-Thought in Multimodal LLMs for Data-Efficient Model Adaptation. arXiv preprint arXiv:2507.02859, 2025. 11

  36. [36]

    Lowis3d: Language-driven open-world instance-level 3d scene understanding

    Ding, Runyu, Yang, Jihan, Xue, Chuhui, Zhang, Wenqing, Bai, Song, Qi, Xiaojuan. Lowis3d: Language-driven open-world instance-level 3d scene understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence 46:8517–8533, 2024

  37. [37]

    Flamingo: a visual language model for few-shot learning

    Alayrac, Jean-Baptiste, Donahue, Jeff, Luc, Pauline, Miech, Antoine, Barr, Iain, Hasson, Yana, Lenc, Karel, Mensch, Arthur, Millican, Katherine, Reynolds, Malcolm, others. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems 35:23716–23736, 2022

  38. [38]

    OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models

    Awadalla, Anas, Gao, Irena, Gardner, Josh, Hessel, Jack, Hanafy, Yusuf, Zhu, Wanrong, Marathe, Kalyani, Bitton, Yonatan, Gadre, Samir, Sagawa, Shiori, others. Openflamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390, 2023

  39. [39]

    Instructblip: Towards general-purpose vision-language models with instruction tuning

    Dai, Wenliang, Li, Junnan, Li, Dongxu, Tiong, Anthony, Zhao, Junqi, Wang, Weisheng, Li, Boyang, Fung, Pascale N, Hoi, Steven. Instructblip: Towards general-purpose vision-language models with instruction tuning. Advances in neural information processing systems 36:49250– 49267, 2023

  40. [40]

    VideoChat: Chat-Centric Video Understanding

    Li, KunChang, He, Yinan, Wang, Yi, Li, Yizhuo, Wang, Wenhai, Luo, Ping, Wang, Yali, Wang, Limin, Qiao, Yu. Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023

  41. [41]

    Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

    Lin, Bin, Ye, Yang, Zhu, Bin, Cui, Jiaxi, Ning, Munan, Jin, Peng, Yuan, Li. Video- llava: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122, 2023

  42. [42]

    Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

    Maaz, Muhammad, Rasheed, Hanoona, Khan, Salman, Khan, Fahad Shahbaz. Video-chatgpt: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424, 2023

  43. [43]

    Visual instruction tuning

    Liu, Haotian, Li, Chunyuan, Wu, Qingyang, Lee, Yong Jae. Visual instruction tuning. Advances in neural information processing systems 36:34892–34916, 2023

  44. [44]

    GPT-4 Technical Report

    Achiam, Josh, Adler, Steven, Agarwal, Sandhini, Ahmad, Lama, Akkaya, Ilge, Aleman, Flo- rencia Leoni, Almeida, Diogo, Altenschmidt, Janko, Altman, Sam, Anadkat, Shyamal, others. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  45. [45]

    Gemini: A Family of Highly Capable Multimodal Models

    Team, Gemini, Anil, Rohan, Borgeaud, Sebastian, Alayrac, Jean-Baptiste, Yu, Jiahui, Soricut, Radu, Schalkwyk, Johan, Dai, Andrew M, Hauth, Anja, Millican, Katie, others. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

  46. [46]

    Internvideo2: Scaling foundation models for multimodal video understanding

    Wang, Yi, Li, Kunchang, Li, Xinhao, Yu, Jiashuo, He, Yinan, Chen, Guo, Pei, Baoqi, Zheng, Rongkun, Wang, Zun, Shi, Yansong, others. Internvideo2: Scaling foundation models for multimodal video understanding. European Conference on Computer Vision:396–416, 2024

  47. [47]

    Cumo: Scaling multimodal llm with co-upcycled mixture-of-experts

    Li, Jiachen, Wang, Xinyao, Zhu, Sijie, Kuo, Chia-Wen, Xu, Lu, Chen, Fan, Jain, Jitesh, Shi, Humphrey, Wen, Longyin. Cumo: Scaling multimodal llm with co-upcycled mixture-of-experts. Advances in Neural Information Processing Systems 37:131224–131246, 2024

  48. [48]

    Svit: Scaling up visual instruction tuning

    Zhao, Bo, Wu, Boya, He, Muyang, Huang, Tiejun. Svit: Scaling up visual instruction tuning. arXiv preprint arXiv:2307.04087, 2023

  49. [49]

    Longllava: Scaling multi-modal llms to 1000 images efficiently via a hybrid architecture

    Wang, Xidong, Song, Dingjie, Chen, Shunian, Chen, Junyin, Cai, Zhenyang, Zhang, Chen, Sun, Lichao, Wang, Benyou. Longllava: Scaling multi-modal llms to 1000 images efficiently via a hybrid architecture. arXiv preprint arXiv:2409.02889, 2024

  50. [50]

    An empir- ical study of scaling instruct-tuned large multimodal models

    Lu, Yadong, Li, Chunyuan, Liu, Haotian, Yang, Jianwei, Gao, Jianfeng, Shen, Yelong. An empir- ical study of scaling instruct-tuned large multimodal models. arXiv preprint arXiv:2309.09958, 2023

  51. [51]

    Auroracap: Efficient, performant video detailed captioning and a new benchmark

    Chai, Wenhao, Song, Enxin, Du, Yilun, Meng, Chenlin, Madhavan, Vashisht, Bar-Tal, Omer, Hwang, Jenq-Neng, Xie, Saining, Manning, Christopher D. Auroracap: Efficient, performant video detailed captioning and a new benchmark. arXiv preprint arXiv:2410.03051, 2024. 12

  52. [52]

    LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

    Li, Feng, Zhang, Renrui, Zhang, Hao, Zhang, Yuanhan, Li, Bo, Li, Wei, Ma, Zejun, Li, Chunyuan. Llava-next-interleave: Tackling multi-image, video, and 3d in large multimodal models. arXiv preprint arXiv:2407.07895, 2024

  53. [53]

    Learnings from Scaling Visual Tokenizers for Reconstruction and Generation

    Philippe Hansen-Estruch, David Yan, Ching-Yao Chung, Orr Zohar, Jialiang Wang, Tingbo Hou, Tao Xu, Sriram Vishwanath, Peter Vajda, Xinlei Chen. Learnings from Scaling Visual Tokenizers for Reconstruction and Generation. 2025.https://arxiv.org/abs/2501.09755

  54. [54]

    DINOv2: Learning Robust Visual Features without Supervision

    Oquab, Maxime, Darcet, Timothée, Moutakanni, Théo, V o, Huy, Szafraniec, Marc, Khalidov, Vasil, Fernandez, Pierre, Haziza, Daniel, Massa, Francisco, El-Nouby, Alaaeldin, others. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023

  55. [55]

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    Tschannen, Michael, Gritsenko, Alexey, Wang, Xiao, Naeem, Muhammad Ferjad, Alabdul- mohsin, Ibrahim, Parthasarathy, Nikhil, Evans, Talfan, Beyer, Lucas, Xia, Ye, Mustafa, Basil, others. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025

  56. [56]

    Self-supervised learning from images with a joint- embedding predictive architecture

    Assran, Mahmoud, Duval, Quentin, Misra, Ishan, Bojanowski, Piotr, Vincent, Pascal, Rabbat, Michael, LeCun, Yann, Ballas, Nicolas. Self-supervised learning from images with a joint- embedding predictive architecture. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition:15619–15629, 2023

  57. [57]

    A style-based generator architecture for generative adversarial networks

    Karras, Tero, Laine, Samuli, Aila, Timo. A style-based generator architecture for generative adversarial networks. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition:4401–4410, 2019

  58. [58]

    Ana- lyzing and improving the image quality of stylegan

    Karras, Tero, Laine, Samuli, Aittala, Miika, Hellsten, Janne, Lehtinen, Jaakko, Aila, Timo. Ana- lyzing and improving the image quality of stylegan. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition:8110–8119, 2020

  59. [59]

    Regularizing gener- ative adversarial networks under limited data

    Tseng, Hung-Yu, Jiang, Lu, Liu, Ce, Yang, Ming-Hsuan, Yang, Weilong. Regularizing gener- ative adversarial networks under limited data. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition:7921–7931, 2021

  60. [60]

    Consistency regularization for generative adversarial networks

    Zhang, Han, Zhang, Zizhao, Odena, Augustus, Lee, Honglak. Consistency regularization for generative adversarial networks. arXiv preprint arXiv:1910.12027, 2019

  61. [61]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium

    Heusel, Martin, Ramsauer, Hubert, Unterthiner, Thomas, Nessler, Bernhard, Hochreiter, Sepp. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems 30, 2017

  62. [62]

    Imagenet: A large- scale hierarchical image database

    Deng, Jia, Dong, Wei, Socher, Richard, Li, Li-Jia, Li, Kai, Fei-Fei, Li. Imagenet: A large- scale hierarchical image database. 2009 IEEE conference on computer vision and pattern recognition:248–255, 2009

  63. [63]

    Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

    Sun, Peize, Jiang, Yi, Chen, Shoufa, Zhang, Shilong, Peng, Bingyue, Luo, Ping, Yuan, Zehuan. Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525, 2024

  64. [64]

    Autoregressive image generation using residual quantization

    Lee, Doyup, Kim, Chiheon, Kim, Saehoon, Cho, Minsu, Han, Wook-Shin. Autoregressive image generation using residual quantization. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition:11523–11532, 2022

  65. [65]

    Mage: Masked generative encoder to unify representation learning and image synthesis

    Li, Tianhong, Chang, Huiwen, Mishra, Shlok, Zhang, Han, Katabi, Dina, Krishnan, Dilip. Mage: Masked generative encoder to unify representation learning and image synthesis. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition:2142–2152, 2023

  66. [66]

    All are worth words: A vit backbone for diffusion models

    Bao, Fan, Nie, Shen, Xue, Kaiwen, Cao, Yue, Li, Chongxuan, Su, Hang, Zhu, Jun. All are worth words: A vit backbone for diffusion models. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition:22669–22679, 2023. 13