Balancing Image Compression and Generation with Bootstrapped Tokenization
Pith reviewed 2026-06-28 03:04 UTC · model grok-4.3
The pith
SelfBootTok decomposes image tokens into global and local groups so generators need only global tokens for better efficiency and quality.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Through self-bootstrapped learning, the model predicts local details exclusively from global tokens, shifting the burden of visual details from the generator to the tokenizer. Consequently, our generator is far more efficient, requiring only global tokens and reducing computation by approximately 40%, while delivering superior reconstruction and generation. Moreover, this paradigm scales elegantly: by leveraging more data or parameters to self-supervise local representation learning, SelfBootTok achieves a new state-of-the-art gFID score of 1.56 using only 64 tokens.
What carries the argument
SelfBootTok, a tokenization method that uses self-bootstrapped learning to decompose image information into global and local token groups and predict local details from global tokens alone.
If this is right
- Generators require only global tokens and cut computation by approximately 40%.
- Reconstruction and generation quality improve over methods that mix all information in tokens.
- State-of-the-art gFID of 1.56 is reached using only 64 tokens.
- Performance scales further by using more data or parameters to improve local representation learning.
Where Pith is reading between the lines
- The same global-local separation might reduce token needs in video or audio generation tasks.
- Downstream models could train faster because they receive less mixed information per token.
- Further scaling of the bootstrapping step might allow competitive results with even fewer than 64 global tokens.
Load-bearing premise
That self-bootstrapped learning can reliably predict local details exclusively from global tokens without significant loss of information or the need for additional mechanisms.
What would settle it
An experiment where generators using only SelfBootTok global tokens produce images with gFID scores substantially higher than 1.56 or visibly missing critical local details that cannot be recovered from the globals.
Figures
read the original abstract
Despite progress in image tokenization, standard methods encode redundant information by mixing all granularities within each token, thus redundancy persists between tokens. The mix of information of different granularity also complicates the training of generators. This paper introduces SelfBootTok, a method that resolves this by cleanly decomposing information into global and local token groups. Through self-bootstrapped learning, the model predicts local details exclusively from global tokens, shifting the burden of visual details from the generator to the tokenizer. Consequently, our generator is far more efficient, requiring only global tokens and reducing computation by approximately 40%, while delivering superior reconstruction and generation. Moreover, this paradigm scales elegantly: by leveraging more data or parameters to self-supervise local representation learning, SelfBootTok achieves a new state-of-the-art gFID score of 1.56 using only 64 tokens.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SelfBootTok, a tokenization method that uses self-bootstrapped learning to decompose image information into separate global and local token groups. This allows the generator to operate exclusively on global tokens, yielding an approximate 40% reduction in computation while improving reconstruction and generation quality. The approach is claimed to scale with more data or parameters, achieving a new state-of-the-art gFID score of 1.56 using only 64 tokens.
Significance. If the claimed clean separation of global and local information holds without hidden leakage or parameter sharing, the result would be significant for efficient image generation pipelines, as it shifts detail modeling to the tokenizer and reduces generator compute. The reported gFID improvement with minimal tokens would represent a notable advance in token-efficient generative modeling.
major comments (2)
- [Abstract] Abstract: The central claim that self-bootstrapped learning enables the generator to use only global tokens with a 40% compute reduction requires a description of the bootstrapping objective, loss terms, and training schedule to enforce separation; none are supplied, leaving the efficiency and gFID=1.56 results unverifiable.
- [Abstract] Abstract: The assertion that local details are predicted 'exclusively from global tokens' with 'no additional mechanisms' is load-bearing for the decomposition claim, yet the abstract provides no equations, architecture diagram, or ablation showing that the bootstrapping predictor does not share parameters with the generator or rely on joint optimization of local tokens.
Simulated Author's Rebuttal
We thank the referee for the detailed feedback on the abstract. We agree that the abstract should better convey the key methodological elements to support the central claims and will revise it accordingly while preserving its brevity.
read point-by-point responses
-
Referee: [Abstract] Abstract: The central claim that self-bootstrapped learning enables the generator to use only global tokens with a 40% compute reduction requires a description of the bootstrapping objective, loss terms, and training schedule to enforce separation; none are supplied, leaving the efficiency and gFID=1.56 results unverifiable.
Authors: We agree the abstract would benefit from a concise description of the bootstrapping process. The full manuscript details the objective (self-supervised prediction of local tokens from global tokens via an auxiliary predictor), the loss (reconstruction loss plus bootstrapping MSE term), and schedule (tokenizer pretraining followed by generator training on global tokens only) in Sections 3.2 and 4. We will add one sentence to the abstract summarizing these elements to improve verifiability at the abstract level. revision: yes
-
Referee: [Abstract] Abstract: The assertion that local details are predicted 'exclusively from global tokens' with 'no additional mechanisms' is load-bearing for the decomposition claim, yet the abstract provides no equations, architecture diagram, or ablation showing that the bootstrapping predictor does not share parameters with the generator or rely on joint optimization of local tokens.
Authors: The abstract's length constraints preclude equations or diagrams, but the manuscript provides these in Figure 2 (separate lightweight predictor MLP with dedicated parameters, no sharing with the generator) and Section 5.3 (ablations isolating the bootstrapping component and confirming no joint optimization of local tokens during generator training). We will revise the abstract to include a short clarifying clause referencing the dedicated predictor to strengthen the separation claim. revision: yes
Circularity Check
No significant circularity; derivation self-contained
full rationale
The provided abstract and description introduce SelfBootTok via self-bootstrapped learning to separate global/local tokens, with efficiency and gFID claims presented as outcomes of the method. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the text that would reduce any result to its inputs by construction. The central claims rest on the described training process rather than tautological redefinitions, making the derivation self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Scalable diffusion models with transformers
Peebles, William, Xie, Saining. Scalable diffusion models with transformers. Proceedings of the IEEE/CVF international conference on computer vision:4195–4205, 2023
2023
-
[2]
Flow Matching for Generative Modeling
Lipman, Yaron, Chen, Ricky TQ, Ben-Hamu, Heli, Nickel, Maximilian, Le, Matt. Flow match- ing for generative modeling. arXiv preprint arXiv:2210.02747, 2022. 9
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[3]
Discrete flow matching
Gat, Itai, Remez, Tal, Shaul, Neta, Kreuk, Felix, Chen, Ricky TQ, Synnaeve, Gabriel, Adi, Yossi, Lipman, Yaron. Discrete flow matching. Advances in Neural Information Processing Systems 37:133345–133385, 2024
2024
-
[4]
Dao, Quan, Phung, Hao, Nguyen, Binh, Tran, Anh. Flow matching in latent space. arXiv preprint arXiv:2307.08698, 2023
-
[5]
Visual autoregressive modeling: Scalable image generation via next-scale prediction
Tian, Keyu, Jiang, Yi, Yuan, Zehuan, Peng, Bingyue, Wang, Liwei. Visual autoregressive modeling: Scalable image generation via next-scale prediction. Advances in neural information processing systems 37:84839–84865, 2024
2024
-
[6]
Masked autoencoders are effective tokenizers for diffusion models
Chen, Hao, Han, Yujin, Chen, Fangyi, Li, Xiang, Wang, Yidong, Wang, Jindong, Wang, Ze, Liu, Zicheng, Zou, Difan, Raj, Bhiksha. Masked autoencoders are effective tokenizers for diffusion models. Forty-second International Conference on Machine Learning, 2025
2025
-
[7]
Autoregressive image generation without vector quantization
Li, Tianhong, Tian, Yonglong, Li, He, Deng, Mingyang, He, Kaiming. Autoregressive image generation without vector quantization. Advances in Neural Information Processing Systems 37:56424–56445, 2024
2024
-
[8]
Hieratok: Multi-scale visual tokenizer improves image reconstruction and generation
Chen, Cong, Huang, Ziyuan, Zou, Cheng, Zhu, Muzhi, Ji, Kaixiang, Liu, Jiajia, Chen, Jing- dong, Chen, Hao, Shen, Chunhua. Hieratok: Multi-scale visual tokenizer improves image reconstruction and generation. arXiv preprint arXiv:2509.23736, 2025
-
[9]
Flowar: Scale-wise autoregressive image generation meets flow matching
Ren, Sucheng, Yu, Qihang, He, Ju, Shen, Xiaohui, Yuille, Alan, Chen, Liang-Chieh. Flowar: Scale-wise autoregressive image generation meets flow matching. arXiv preprint arXiv:2412.15205, 2024
-
[10]
Reconstruction vs
Yao, Jingfeng, Yang, Bin, Wang, Xinggang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. Proceedings of the Computer Vision and Pattern Recognition Conference:15703–15712, 2025
2025
-
[11]
Wu, Ge, Zhang, Shen, Shi, Ruijing, Gao, Shanghua, Chen, Zhenyuan, Wang, Lei, Chen, Zhaowei, Gao, Hongcheng, Tang, Yao, Yang, Jian, others. Representation Entanglement for Generation: Training Diffusion Transformers Is Much Easier Than You Think. arXiv preprint arXiv:2507.01467, 2025
-
[12]
Nanye Ma, Mark Goldstein, Michael S. Albergo, Nicholas M. Boffi, Eric Vanden-Eijnden, Sain- ing Xie. SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers. 2024.https://arxiv.org/abs/2401.08740
-
[13]
Taming transformers for high-resolution image synthesis
Esser, Patrick, Rombach, Robin, Ommer, Bjorn. Taming transformers for high-resolution image synthesis. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition:12873–12883, 2021
2021
-
[14]
Neural discrete representation learning
Van Den Oord, Aaron, Vinyals, Oriol, others. Neural discrete representation learning. Advances in neural information processing systems 30, 2017
2017
-
[15]
An image is worth 32 tokens for reconstruction and generation
Yu, Qihang, Weber, Mark, Deng, Xueqing, Shen, Xiaohui, Cremers, Daniel, Chen, Liang-Chieh. An image is worth 32 tokens for reconstruction and generation. Advances in Neural Information Processing Systems 37:128940–128966, 2024
2024
-
[16]
Softvq-vae: Efficient 1-dimensional continuous tokenizer
Chen, Hao, Wang, Ze, Li, Xiang, Sun, Ximeng, Chen, Fangyi, Liu, Jiang, Wang, Jindong, Raj, Bhiksha, Liu, Zicheng, Barsoum, Emad. Softvq-vae: Efficient 1-dimensional continuous tokenizer. Proceedings of the Computer Vision and Pattern Recognition Conference:28358– 28370, 2025
2025
-
[17]
Gigatok: Scaling visual tokenizers to 3 billion parameters for autoregressive image generation
Xiong, Tianwei, Liew, Jun Hao, Huang, Zilong, Feng, Jiashi, Liu, Xihui. Gigatok: Scaling visual tokenizers to 3 billion parameters for autoregressive image generation. arXiv preprint arXiv:2504.08736, 2025
-
[18]
Spectral image tokenizer
Esteves, Carlos, Suhail, Mohammed, Makadia, Ameesh. Spectral image tokenizer. Proceedings of the IEEE/CVF International Conference on Computer Vision:17181–17190, 2025. 10
2025
-
[19]
FlexTok: Resampling Images into 1D Token Sequences of Flexible Length
Bachmann, Roman, Allardice, Jesse, Mizrahi, David, Fini, Enrico, Kar, O˘guzhan Fatih, Amirloo, Elmira, El-Nouby, Alaaeldin, Zamir, Amir, Dehghan, Afshin. FlexTok: Resampling Images into 1D Token Sequences of Flexible Length. Forty-second International Conference on Machine Learning, 2025
2025
-
[20]
Democratizing text-to-image masked generative models with compact text-aware one-dimensional tokens
Kim, Dongwon, He, Ju, Yu, Qihang, Yang, Chenglin, Shen, Xiaohui, Kwak, Suha, Chen, Liang-Chieh. Democratizing text-to-image masked generative models with compact text-aware one-dimensional tokens. arXiv preprint arXiv:2501.07730, 2025
-
[21]
Flowtok: Flowing seamlessly across text and image tokens
He, Ju, Yu, Qihang, Liu, Qihao, Chen, Liang-Chieh. Flowtok: Flowing seamlessly across text and image tokens. arXiv preprint arXiv:2503.10772, 2025
-
[22]
Vector-quantized Image Modeling with Improved VQGAN
Yu, Jiahui, Li, Xin, Koh, Jing Yu, Zhang, Han, Pang, Ruoming, Qin, James, Ku, Alexander, Xu, Yuanzhong, Baldridge, Jason, Wu, Yonghui. Vector-quantized image modeling with improved vqgan. arXiv preprint arXiv:2110.04627, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[23]
Auto-Encoding Variational Bayes
Kingma, Diederik P, Welling, Max. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013
work page internal anchor Pith review Pith/arXiv arXiv 2013
-
[24]
High- resolution image synthesis with latent diffusion models
Rombach, Robin, Blattmann, Andreas, Lorenz, Dominik, Esser, Patrick, Ommer, Björn. High- resolution image synthesis with latent diffusion models. Proceedings of the IEEE/CVF confer- ence on computer vision and pattern recognition:10684–10695, 2022
2022
-
[25]
Maskgit: Masked generative image transformer
Chang, Huiwen, Zhang, Han, Jiang, Lu, Liu, Ce, Freeman, William T. Maskgit: Masked generative image transformer. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition:11315–11325, 2022
2022
-
[26]
One-d-piece: Image tokenizer meets quality-controllable compression
Miwa, Keita, Sasaki, Kento, Arai, Hidehisa, Takahashi, Tsubasa, Yamaguchi, Yu. One-d-piece: Image tokenizer meets quality-controllable compression. arXiv preprint arXiv:2501.10064, 2025
-
[27]
Highly Compressed Tokenizer Can Generate Without Training
Beyer, L Lao, Li, Tianhong, Chen, Xinlei, Karaman, Sertac, He, Kaiming. Highly Compressed Tokenizer Can Generate Without Training. arXiv preprint arXiv:2506.08257, 2025
-
[28]
Maskbit: Embedding-free image generation via bit tokens
Weber, Mark, Yu, Lijun, Yu, Qihang, Deng, Xueqing, Shen, Xiaohui, Cremers, Daniel, Chen, Liang-Chieh. Maskbit: Embedding-free image generation via bit tokens. arXiv preprint arXiv:2409.16211, 2024
-
[29]
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
Yu, Lijun, Lezama, José, Gundavarapu, Nitesh B, Versari, Luca, Sohn, Kihyuk, Minnen, David, Cheng, Yong, Birodkar, Vighnesh, Gupta, Agrim, Gu, Xiuye, others. Language Model Beats Diffusion–Tokenizer is Key to Visual Generation. arXiv preprint arXiv:2310.05737, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[30]
Imagefolder: Autoregressive image generation with folded tokens
Li, Xiang, Qiu, Kai, Chen, Hao, Kuen, Jason, Gu, Jiuxiang, Raj, Bhiksha, Lin, Zhe. Imagefolder: Autoregressive image generation with folded tokens. arXiv preprint arXiv:2410.01756, 2024
-
[31]
Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models
Li, Junnan, Li, Dongxu, Savarese, Silvio, Hoi, Steven. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. International conference on machine learning:19730–19742, 2023
2023
-
[32]
Self-bootstrapped visual-language model for knowledge selection and question answering
Hao, Dongze, Wang, Qunbo, Guo, Longteng, Jiang, Jie, Liu, Jing. Self-bootstrapped visual-language model for knowledge selection and question answering. arXiv preprint arXiv:2404.13947, 2024
-
[33]
Fang, Yunhao, Zhu, Ligeng, Lu, Yao, Wang, Yan, Molchanov, Pavlo, Kautz, Jan, Cho, Jang Hyun, Pavone, Marco, Han, Song, Yin, Hongxu. VILA 2: VILA Augmented VILA. arXiv preprint arXiv:2407.17453, 2024
-
[34]
VideoJudge: Bootstrapping Enables Scalable Supervision of MLLM-as-a-Judge for Video Understanding
Waheed, Abdul, Wu, Zhen, Alharthi, Dareen, Kim, Seungone, Raj, Bhiksha. VideoJudge: Bootstrapping Enables Scalable Supervision of MLLM-as-a-Judge for Video Understanding. arXiv preprint arXiv:2509.21451, 2025
-
[35]
Bootstrapping Grounded Chain-of-Thought in Multimodal LLMs for Data-Efficient Model Adaptation
Xia, Jiaer, Tong, Bingkui, Zang, Yuhang, Shao, Rui, Zhou, Kaiyang. Bootstrapping Grounded Chain-of-Thought in Multimodal LLMs for Data-Efficient Model Adaptation. arXiv preprint arXiv:2507.02859, 2025. 11
-
[36]
Lowis3d: Language-driven open-world instance-level 3d scene understanding
Ding, Runyu, Yang, Jihan, Xue, Chuhui, Zhang, Wenqing, Bai, Song, Qi, Xiaojuan. Lowis3d: Language-driven open-world instance-level 3d scene understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence 46:8517–8533, 2024
2024
-
[37]
Flamingo: a visual language model for few-shot learning
Alayrac, Jean-Baptiste, Donahue, Jeff, Luc, Pauline, Miech, Antoine, Barr, Iain, Hasson, Yana, Lenc, Karel, Mensch, Arthur, Millican, Katherine, Reynolds, Malcolm, others. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems 35:23716–23736, 2022
2022
-
[38]
OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models
Awadalla, Anas, Gao, Irena, Gardner, Josh, Hessel, Jack, Hanafy, Yusuf, Zhu, Wanrong, Marathe, Kalyani, Bitton, Yonatan, Gadre, Samir, Sagawa, Shiori, others. Openflamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[39]
Instructblip: Towards general-purpose vision-language models with instruction tuning
Dai, Wenliang, Li, Junnan, Li, Dongxu, Tiong, Anthony, Zhao, Junqi, Wang, Weisheng, Li, Boyang, Fung, Pascale N, Hoi, Steven. Instructblip: Towards general-purpose vision-language models with instruction tuning. Advances in neural information processing systems 36:49250– 49267, 2023
2023
-
[40]
VideoChat: Chat-Centric Video Understanding
Li, KunChang, He, Yinan, Wang, Yi, Li, Yizhuo, Wang, Wenhai, Luo, Ping, Wang, Yali, Wang, Limin, Qiao, Yu. Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[41]
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Lin, Bin, Ye, Yang, Zhu, Bin, Cui, Jiaxi, Ning, Munan, Jin, Peng, Yuan, Li. Video- llava: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[42]
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
Maaz, Muhammad, Rasheed, Hanoona, Khan, Salman, Khan, Fahad Shahbaz. Video-chatgpt: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[43]
Visual instruction tuning
Liu, Haotian, Li, Chunyuan, Wu, Qingyang, Lee, Yong Jae. Visual instruction tuning. Advances in neural information processing systems 36:34892–34916, 2023
2023
-
[44]
Achiam, Josh, Adler, Steven, Agarwal, Sandhini, Ahmad, Lama, Akkaya, Ilge, Aleman, Flo- rencia Leoni, Almeida, Diogo, Altenschmidt, Janko, Altman, Sam, Anadkat, Shyamal, others. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[45]
Gemini: A Family of Highly Capable Multimodal Models
Team, Gemini, Anil, Rohan, Borgeaud, Sebastian, Alayrac, Jean-Baptiste, Yu, Jiahui, Soricut, Radu, Schalkwyk, Johan, Dai, Andrew M, Hauth, Anja, Millican, Katie, others. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[46]
Internvideo2: Scaling foundation models for multimodal video understanding
Wang, Yi, Li, Kunchang, Li, Xinhao, Yu, Jiashuo, He, Yinan, Chen, Guo, Pei, Baoqi, Zheng, Rongkun, Wang, Zun, Shi, Yansong, others. Internvideo2: Scaling foundation models for multimodal video understanding. European Conference on Computer Vision:396–416, 2024
2024
-
[47]
Cumo: Scaling multimodal llm with co-upcycled mixture-of-experts
Li, Jiachen, Wang, Xinyao, Zhu, Sijie, Kuo, Chia-Wen, Xu, Lu, Chen, Fan, Jain, Jitesh, Shi, Humphrey, Wen, Longyin. Cumo: Scaling multimodal llm with co-upcycled mixture-of-experts. Advances in Neural Information Processing Systems 37:131224–131246, 2024
2024
-
[48]
Svit: Scaling up visual instruction tuning
Zhao, Bo, Wu, Boya, He, Muyang, Huang, Tiejun. Svit: Scaling up visual instruction tuning. arXiv preprint arXiv:2307.04087, 2023
-
[49]
Longllava: Scaling multi-modal llms to 1000 images efficiently via a hybrid architecture
Wang, Xidong, Song, Dingjie, Chen, Shunian, Chen, Junyin, Cai, Zhenyang, Zhang, Chen, Sun, Lichao, Wang, Benyou. Longllava: Scaling multi-modal llms to 1000 images efficiently via a hybrid architecture. arXiv preprint arXiv:2409.02889, 2024
-
[50]
An empir- ical study of scaling instruct-tuned large multimodal models
Lu, Yadong, Li, Chunyuan, Liu, Haotian, Yang, Jianwei, Gao, Jianfeng, Shen, Yelong. An empir- ical study of scaling instruct-tuned large multimodal models. arXiv preprint arXiv:2309.09958, 2023
-
[51]
Auroracap: Efficient, performant video detailed captioning and a new benchmark
Chai, Wenhao, Song, Enxin, Du, Yilun, Meng, Chenlin, Madhavan, Vashisht, Bar-Tal, Omer, Hwang, Jenq-Neng, Xie, Saining, Manning, Christopher D. Auroracap: Efficient, performant video detailed captioning and a new benchmark. arXiv preprint arXiv:2410.03051, 2024. 12
-
[52]
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
Li, Feng, Zhang, Renrui, Zhang, Hao, Zhang, Yuanhan, Li, Bo, Li, Wei, Ma, Zejun, Li, Chunyuan. Llava-next-interleave: Tackling multi-image, video, and 3d in large multimodal models. arXiv preprint arXiv:2407.07895, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[53]
Learnings from Scaling Visual Tokenizers for Reconstruction and Generation
Philippe Hansen-Estruch, David Yan, Ching-Yao Chung, Orr Zohar, Jialiang Wang, Tingbo Hou, Tao Xu, Sriram Vishwanath, Peter Vajda, Xinlei Chen. Learnings from Scaling Visual Tokenizers for Reconstruction and Generation. 2025.https://arxiv.org/abs/2501.09755
-
[54]
DINOv2: Learning Robust Visual Features without Supervision
Oquab, Maxime, Darcet, Timothée, Moutakanni, Théo, V o, Huy, Szafraniec, Marc, Khalidov, Vasil, Fernandez, Pierre, Haziza, Daniel, Massa, Francisco, El-Nouby, Alaaeldin, others. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[55]
Tschannen, Michael, Gritsenko, Alexey, Wang, Xiao, Naeem, Muhammad Ferjad, Alabdul- mohsin, Ibrahim, Parthasarathy, Nikhil, Evans, Talfan, Beyer, Lucas, Xia, Ye, Mustafa, Basil, others. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[56]
Self-supervised learning from images with a joint- embedding predictive architecture
Assran, Mahmoud, Duval, Quentin, Misra, Ishan, Bojanowski, Piotr, Vincent, Pascal, Rabbat, Michael, LeCun, Yann, Ballas, Nicolas. Self-supervised learning from images with a joint- embedding predictive architecture. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition:15619–15629, 2023
2023
-
[57]
A style-based generator architecture for generative adversarial networks
Karras, Tero, Laine, Samuli, Aila, Timo. A style-based generator architecture for generative adversarial networks. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition:4401–4410, 2019
2019
-
[58]
Ana- lyzing and improving the image quality of stylegan
Karras, Tero, Laine, Samuli, Aittala, Miika, Hellsten, Janne, Lehtinen, Jaakko, Aila, Timo. Ana- lyzing and improving the image quality of stylegan. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition:8110–8119, 2020
2020
-
[59]
Regularizing gener- ative adversarial networks under limited data
Tseng, Hung-Yu, Jiang, Lu, Liu, Ce, Yang, Ming-Hsuan, Yang, Weilong. Regularizing gener- ative adversarial networks under limited data. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition:7921–7931, 2021
2021
-
[60]
Consistency regularization for generative adversarial networks
Zhang, Han, Zhang, Zizhao, Odena, Augustus, Lee, Honglak. Consistency regularization for generative adversarial networks. arXiv preprint arXiv:1910.12027, 2019
-
[61]
Gans trained by a two time-scale update rule converge to a local nash equilibrium
Heusel, Martin, Ramsauer, Hubert, Unterthiner, Thomas, Nessler, Bernhard, Hochreiter, Sepp. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems 30, 2017
2017
-
[62]
Imagenet: A large- scale hierarchical image database
Deng, Jia, Dong, Wei, Socher, Richard, Li, Li-Jia, Li, Kai, Fei-Fei, Li. Imagenet: A large- scale hierarchical image database. 2009 IEEE conference on computer vision and pattern recognition:248–255, 2009
2009
-
[63]
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
Sun, Peize, Jiang, Yi, Chen, Shoufa, Zhang, Shilong, Peng, Bingyue, Luo, Ping, Yuan, Zehuan. Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[64]
Autoregressive image generation using residual quantization
Lee, Doyup, Kim, Chiheon, Kim, Saehoon, Cho, Minsu, Han, Wook-Shin. Autoregressive image generation using residual quantization. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition:11523–11532, 2022
2022
-
[65]
Mage: Masked generative encoder to unify representation learning and image synthesis
Li, Tianhong, Chang, Huiwen, Mishra, Shlok, Zhang, Han, Katabi, Dina, Krishnan, Dilip. Mage: Masked generative encoder to unify representation learning and image synthesis. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition:2142–2152, 2023
2023
-
[66]
All are worth words: A vit backbone for diffusion models
Bao, Fan, Nie, Shen, Xue, Kaiwen, Cao, Yue, Li, Chongxuan, Su, Hang, Zhu, Jun. All are worth words: A vit backbone for diffusion models. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition:22669–22679, 2023. 13
2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.