arxiv: 2605.14028 · v1 · pith:RNLXT3JAnew · submitted 2026-05-13 · 💻 cs.CV

Unified Pix Token And Word Token Generative Language Model

Haun Leung, Zinan Wang

Pith reviewed 2026-05-15 05:45 UTC · model grok-4.3

classification 💻 cs.CV

keywords: pixel token, word token, generative language model, visual detail recognition, color folding, attention approximation, unsupervised pretraining, multimodal model

The pith

A new generative language model assigns each image pixel its own token to unify visual and textual inputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes unifying pixel tokens and word tokens inside a single generative language model. Instead of relying on patch-based vision encoders such as those from CLIP or SigLIP, the model gives every pixel an individual token embedding. Additional mechanisms include color folding, a global conditional attention approximation, and unsupervised pretraining on images alone. Experiments indicate the model performs adequately even when kept small and trained on limited data. A reader would care because the approach offers a direct route to finer visual detail handling, such as reading small text or numbers, without first passing images through a separate large encoder.

Core claim

The authors claim that a generative language model can treat every pixel as an independent token with its own embedding, combine it with word tokens, apply color folding and global conditional attention approximation, and still reach usable performance after unsupervised image pretraining on a small model with limited data; they further assert that performance will continue to rise with increased scale following the scaling law.
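
The "global conditional attention approximation" is not specified in the material reviewed here; the Figure 1 caption only mentions a "simple local window mask multi-head self attention operation". As a minimal sketch under stated assumptions (raster-order flattening, a square window of hypothetical `radius`, and a causal constraint, none of which are confirmed by the source), a local attention mask over pixel tokens could be built like this:

```python
def local_causal_mask(height, width, radius):
    """Return mask[i][j] = True when pixel token i may attend to pixel token j.

    Assumptions (not from the paper): pixels are flattened in raster order,
    the window is a square of side 2 * radius + 1, and attention is causal.
    """
    n = height * width
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        yi, xi = divmod(i, width)
        for j in range(i + 1):  # causal: only the current and earlier positions
            yj, xj = divmod(j, width)
            if abs(yi - yj) <= radius and abs(xi - xj) <= radius:
                mask[i][j] = True
    return mask


mask = local_causal_mask(height=4, width=4, radius=1)
attended = [sum(row) for row in mask]  # number of keys each query can see
```

Each query sees a bounded number of keys regardless of image size, which is the property that would make per-pixel token sequences tractable.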

What carries the argument

The unified pix-token and word-token architecture in which each pixel receives its own embedding and is processed together with textual tokens inside the same generative language model.
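
To make the idea of a shared token sequence concrete, here is a hedged sketch. The vocabulary size, the id layout (pix-token ids placed after word-token ids), the word-then-image ordering, and the helper names `pix_token_id` and `unify` are all illustrative assumptions; the source does not say how ids are assigned.

```python
WORD_VOCAB_SIZE = 32_000    # hypothetical text vocabulary size
PIX_VOCAB_SIZE = 256 ** 3   # one id per 24-bit RGB color (no color folding here)


def pix_token_id(r, g, b):
    """Map an 8-bit RGB pixel to a token id placed after the word-token ids."""
    return WORD_VOCAB_SIZE + ((r << 16) | (g << 8) | b)


def unify(word_ids, image):
    """Concatenate word-token ids with one pix-token id per pixel (raster order)."""
    pix_ids = [pix_token_id(*px) for row in image for px in row]
    return list(word_ids) + pix_ids


image = [[(255, 0, 0), (0, 255, 0)],
         [(0, 0, 255), (0, 0, 0)]]
sequence = unify([17, 42], image)  # 2 word tokens followed by 4 pix tokens
```

With this layout a single embedding table of size `WORD_VOCAB_SIZE + PIX_VOCAB_SIZE` could serve both modalities, although a table with roughly 16.8 million color entries would be large.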

If this is right

The model can recognize small text and numbers in images more readily than current patch-based multimodal systems.
Usable visual understanding is achievable with a small parameter count and limited training data.
Performance is expected to improve steadily as model size and data volume increase according to the scaling law.
The same architecture can serve as the backbone for multimodal generative tasks that require precise visual detail.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Pixel-level tokenization could allow the model to generate or edit images at finer spatial resolutions than patch-based approaches.
The method may reduce dependence on separate large vision encoders in future multimodal systems.
Direct application to OCR-heavy or diagram-understanding benchmarks would provide a clear test of the detail advantage.

Load-bearing premise

That assigning a dedicated token embedding to every pixel, together with color folding and the attention approximation, produces better visual detail understanding than patch-based encoders.
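
The source names color folding but does not define it. One plausible reading, suggested only by the Figure 2 caption about small changes to RGB channel values, is that nearby colors share a token. The sketch below implements that reading as plain per-channel quantization; `fold_color`, `folded_vocab_size`, and the `bits` parameter are hypothetical, and the paper's actual technique may differ.

```python
def fold_color(r, g, b, bits=4):
    """Quantize each 8-bit channel to `bits` bits and pack the result into one id.

    This is an assumed stand-in for color folding, not the paper's definition.
    """
    shift = 8 - bits
    rq, gq, bq = r >> shift, g >> shift, b >> shift
    return (rq << (2 * bits)) | (gq << bits) | bq


def folded_vocab_size(bits=4):
    """Number of distinct color ids after folding."""
    return 2 ** (3 * bits)


# Two colors that differ by one step per channel fold to the same id.
same = fold_color(200, 100, 50) == fold_color(201, 101, 51)
```

Under this assumption, 4 bits per channel would shrink the color vocabulary from 16,777,216 entries to 4,096, at the cost of merging colors that differ only in the low bits.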

What would settle it

A side-by-side evaluation on images containing small text or numbers that measures whether the new model recognizes those details more accurately than a standard ViT-based encoder of comparable size.
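
A side-by-side comparison of this kind could be scored with a simple exact-match metric. The sketch below is an illustration only: the function names and the choice of exact match after whitespace stripping are assumptions, and no model outputs or results are implied.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions that equal their reference after stripping whitespace."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


def compare(pixel_model_preds, patch_model_preds, references):
    """Score two models on the same references and report the accuracy gap."""
    acc_pixel = exact_match_accuracy(pixel_model_preds, references)
    acc_patch = exact_match_accuracy(patch_model_preds, references)
    return {"pixel": acc_pixel, "patch": acc_patch, "delta": acc_pixel - acc_patch}
```

For the comparison to be informative, both models would need to be of comparable size and evaluated on the same images, with the transcribed small text or numbers as references.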

Figures

Figures reproduced from arXiv: 2605.14028 by Haun Leung, Zinan Wang.

**Figure 1.** Pix Token and Pix Token Embedding. … in that it does not require complex irregular window attention, full attention, or token embedding merging. It only requires a simple local window mask multi-head self-attention operation. Swin Transformer and Qwen2.5-VL, like ViT, divide images into token embeddings based on patches, which are a type of fake token embedding without adjustable parameters during training. B…

**Figure 2.** Small changes to the RGB channel values of a color.

**Figure 3.** Visualization of Color Folding. 3.3 Unified Pix Token And Word Token Model: We propose a unified pix token and word token generative language model. It unifies the pix token into the generative language model, where only word tokens are used. The core ideas of our model are: one is unifying pix tokens and word tokens, and the other is using local conditional self-attention to approximate global conditional self atten…

**Figure 4.** Unified Pix Token And Word Token Model overview.

**Figure 5.** Unified Pix Token And Word Token Model Process.

**Figure 6.** Image-Only Unsupervised Pretraining Curves.

Abstract

Since the emergence of the Vision Transformer (ViT), it has been widely used in generative language models and generative visual models. In particular, in the current state-of-the-art open-source multimodal models, a ViT obtained by the CLIP or SigLIP method serves as the vision encoder backbone that helps them acquire visual understanding capabilities. But this method leads to limitations in visual understanding of details, such as difficulty in recognizing small text or numbers in images. To address these issues, we propose a new model that unifies pix tokens and word tokens in a generative language model. The new model also features a separate token embedding for each pixel of the image, color folding, global conditional attention approximation, and image unsupervised pretraining. We conducted image unsupervised pretraining experiments using our new model to explore its potential. The experimental results show that it has good performance even in small model and with limited training data. We believe our model also conforms to the scaling law: as long as model parameters and training data increase, its performance will continue to improve.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The per-pixel token idea targets a real limitation in patch-based vision encoders for fine details, but the abstract supplies no metrics or comparisons to show it actually works better.

The one thing to know is that this paper proposes giving every pixel its own token embedding in a generative multimodal model, along with color folding and a global conditional attention approximation, to improve handling of small text and numbers that current ViT and CLIP encoders miss. It unifies these pix tokens with word tokens and reports unsupervised image pretraining results on a small model with limited data, claiming good performance and expecting gains from scaling.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a unified generative language model integrating pixel tokens and word tokens. It introduces per-pixel token embeddings for images, color folding, global conditional attention approximation, and unsupervised image pretraining. The authors claim that experiments demonstrate good performance even with small models and limited training data, and assert that the approach will continue to improve in line with scaling laws as model size and data increase.

Significance. If the per-pixel tokenization and associated mechanisms demonstrably improve fine-grained visual understanding over patch-based encoders such as ViT or CLIP, the work could address a recognized limitation in current multimodal generative models. The focus on unsupervised pretraining with small-scale resources also aligns with interest in efficient training paradigms. However, the absence of any quantitative support leaves the potential significance unevaluated.

major comments (2)

[Experiments] Experiments section: The manuscript asserts that unsupervised pretraining 'show[s] that it has good performance even in small model and with limited training data,' yet reports no metrics, baselines, error bars, ablation studies, or task-specific results (e.g., OCR accuracy on text-in-image). This directly undermines evaluation of the central claim.
[Model Description] Model architecture: No equations, complexity analysis, or implementation details are supplied for the per-pixel token embedding scheme, color folding, or global conditional attention approximation. Without these, it is impossible to assess whether the approach overcomes the stated ViT/CLIP limitations or is computationally viable.
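
The computational-viability concern can be illustrated with back-of-the-envelope arithmetic. The image size (224 × 224), patch size (14), and window size (7) below are illustrative choices, not values taken from the manuscript, and the counts are query-key pairs per attention head and layer rather than measured costs.

```python
def token_count(height, width, patch=1):
    """Number of image tokens when each patch x patch block becomes one token."""
    return (height // patch) * (width // patch)


def full_attention_pairs(n):
    """Query-key pairs for full self-attention over n tokens."""
    return n * n


def window_attention_pairs(n, window):
    """Upper bound on pairs when each token sees at most window * window keys."""
    return n * window * window


patch_tokens = token_count(224, 224, patch=14)   # patch-based, ViT-style
pixel_tokens = token_count(224, 224)             # one token per pixel

patch_full = full_attention_pairs(patch_tokens)
pixel_full = full_attention_pairs(pixel_tokens)
pixel_local = window_attention_pairs(pixel_tokens, window=7)
```

Under these assumptions, per-pixel tokenization yields 50,176 tokens against 256 for patches. Full attention over the pixel tokens would need about 2.5 billion pairs, while a 7 × 7 local window needs about 2.5 million, which is why the details of the attention approximation matter for assessing the approach.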

minor comments (1)

[Abstract] The abstract and main text repeatedly use the phrase 'good performance' without defining the evaluation protocol or comparison points.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed review and valuable comments on our manuscript. We address the major concerns point by point below and will revise the paper to incorporate the suggested improvements.

Referee: [Experiments] Experiments section: The manuscript asserts that unsupervised pretraining 'show[s] that it has good performance even in small model and with limited training data,' yet reports no metrics, baselines, error bars, ablation studies, or task-specific results (e.g., OCR accuracy on text-in-image). This directly undermines evaluation of the central claim.

Authors: We agree with the referee that the experiments section requires more rigorous quantitative support. The current manuscript presents preliminary results from unsupervised pretraining to illustrate the model's potential with small-scale resources, but lacks specific metrics and comparisons. In the revised version, we will include quantitative metrics (such as reconstruction loss or downstream task accuracies like OCR on text-in-image), baselines (e.g., comparisons to ViT-based models), error bars from multiple runs, and ablation studies on the proposed components. This will allow proper evaluation of the central claims.

revision: yes

Referee: [Model Description] Model architecture: No equations, complexity analysis, or implementation details are supplied for the per-pixel token embedding scheme, color folding, or global conditional attention approximation. Without these, it is impossible to assess whether the approach overcomes the stated ViT/CLIP limitations or is computationally viable.

Authors: We acknowledge the need for more detailed technical descriptions. The manuscript introduces these concepts at a high level, but we will expand the model architecture section to include mathematical formulations (equations) for per-pixel token embeddings, the color folding technique, and the global conditional attention approximation. We will also provide a complexity analysis (e.g., time and space complexity) and implementation details such as the tokenization process and attention mechanisms. This will help demonstrate how the approach addresses the fine-grained visual understanding limitations of patch-based methods like ViT and CLIP, and assess its computational viability.

revision: yes

Circularity Check

0 steps flagged

No circularity: architecture proposal and scaling-law belief are stated without self-referential derivations or fitted inputs renamed as predictions

The paper proposes a new generative model unifying per-pixel token embeddings with word tokens, adding color folding and global conditional attention approximation, then reports unsupervised pretraining results on a small model with limited data. No equations, derivation chain, or first-principles results are presented that reduce to the inputs by construction. The scaling-law statement is explicitly labeled a belief rather than derived from model equations. No self-citations, uniqueness theorems, or ansatzes smuggled via prior work appear in the provided text. The central claim of improved detail understanding is an empirical assertion resting on the architectural choice and pretraining, not on any circular reduction of predictions to fitted parameters or self-referential definitions. This is a standard non-circular proposal of a new design.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The proposal rests on the untested assumption that per-pixel tokenization plus the listed mechanisms will overcome the detail limitations of patch-based ViT encoders; no free parameters, axioms, or new entities are explicitly quantified in the abstract.

axioms (1)

domain assumption: Patch-based vision encoders inherently limit fine-detail recognition in images.
Stated as the core limitation the new model is designed to address.

invented entities (1)

pix token (no independent evidence)
purpose: Individual token embedding for each pixel to capture fine details.
New representational unit introduced to replace patch tokens.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages

[1] Attention is All you Need. Neural Information Processing Systems.
[2] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. 2020.
[3] Learning Transferable Visual Models From Natural Language Supervision. 2021.
[4] Sigmoid Loss for Language Image Pre-Training. 2023.
[5] Visual Instruction Tuning. 2023.
[6] Kimi K2.5: Visual Agentic Intelligence. 2026.
[7] Qwen2.5-VL Technical Report. 2025.
[8] Generative Pretraining from Pixels. 37th International Conference on Machine Learning: ICML 2020, Online, 13-18 July 2020, Part 3 of 15.
[9] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. 2021.
[10] SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. 2021.
[11] TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation. 2022.
[12] SeaFormer++: Squeeze-enhanced Axial Transformer for Mobile Visual Recognition. 2025.
[13] In Advances in Neural Information Processing Systems. MIT Press.
[14] Video (language) modeling: a baseline for generative models of natural videos. 2016.
[15] Language Models are Few-Shot Learners.