arxiv: 2604.11575 · v1 · submitted 2026-04-13 · 💻 cs.CL

Recognition: unknown

MIXAR: Scaling Autoregressive Pixel-based Language Models to Multiple Languages and Scripts

Chen Hu , Yintao Tai , Antonio Vergari , Frank Keller , Alessandro Suglia

Authors on Pith no claims yet

Pith reviewed 2026-05-10 15:46 UTC · model grok-4.3

classification 💻 cs.CL

keywords pixel-based language modelsmultilingual language modelsautoregressive modelstokenization alternativesmultilingual NLPscript diversityorthographic robustnessLAMBADA benchmark

0 comments

The pith

A generative pixel-based language model trained on eight languages and scripts improves multilingual task performance and handles unseen languages more robustly than prior approaches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that training an autoregressive model directly on pixels, rather than tokens, can scale effectively to multiple languages with different writing systems. This matters because tokenization often creates barriers when scripts vary widely, limiting how well models generalize across languages. If the approach holds, it points toward language models that require less language-specific engineering and perform better on both understanding and generation tasks. The authors support this by evaluating the resulting model, called MIXAR, against earlier pixel-based and tokenizer-based systems on a range of multilingual benchmarks.

Core claim

MIXAR is the first generative pixel-based language model trained on eight different languages that use a range of scripts; it delivers substantial gains on both discriminative and generative multilingual tasks, remains effective on languages never seen during training, and shows further improvements in generative performance such as LAMBADA together with greater resistance to orthographic attacks once scaled to 0.5 billion parameters.

What carries the argument

The MIXAR autoregressive architecture that ingests text as pixel images, allowing it to process diverse scripts without any tokenization step.

If this is right

Substantial performance gains appear on both discriminative and generative multilingual tasks relative to earlier pixel-based and tokenizer-based models.
The model exhibits robustness on languages absent from its training data.
Scaling to 0.5 billion parameters produces additional gains on generative benchmarks such as LAMBADA.
Robustness to orthographic attacks increases with model scale.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Pixel-level processing may reduce the preprocessing overhead that tokenizers impose when new scripts are added.
The same architecture could be tested on even larger numbers of scripts to determine whether the observed generalization continues.
If pixel representations prove sufficient, downstream applications might avoid maintaining separate tokenizers for each language family.

Load-bearing premise

That training on pixels from eight languages is enough to overcome the perceptual differences between scripts and produce generalization and robustness without tokenization.

What would settle it

A head-to-head test in which a tokenizer-based model trained on the identical eight-language data outperforms MIXAR on the same multilingual discriminative and generative tasks, or where MIXAR shows no advantage on a controlled set of previously unseen languages.

Figures

Figures reproduced from arXiv: 2604.11575 by Alessandro Suglia, Antonio Vergari, Chen Hu, Frank Keller, Yintao Tai.

**Figure 1.** Figure 1: MIXAR: a Transformer-based decoder-only architecture that uses rendered text as input to learn across multiple languages. Encoding language as pixels, enables MIXAR to be robust to visual attacks as well. size, as pixels are able to accommodate different writing systems while being more robust to visual changes (Rust et al., 2022). However, both PIXEL and PIXAR were trained only on English, ignoring the po… view at source ↗

**Figure 2.** Figure 2: A patch of size 8×8 pixels cannot capture fine-grained details for Chinese (top), Korean (middle) and Japanese (bottom) characters, while a 32×32 pixel patch can. To this end, we experiment with two different model sizes: a smaller model with 116M parameters containing a stack of 12 Transformer layers (roughly comparable to PIXAR’s model size), and a larger 477M parameter model featuring 24 Transform… view at source ↗

**Figure 3.** Figure 3: MIXAR can handle multilingual text as image, as shown here for examples of correct completions (black) for German, Spanish and Italian prompts (gray) on LAMBADA [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗

**Figure 5.** Figure 5: Examples of 5 latin languages contains in pretraining dataset. 20% letters in these [PITH_FULL_IMAGE:figures/full_fig_p018_5.png] view at source ↗

**Figure 6.** Figure 6: This image shows the comparison of patch size 8 and 32 for all pretraining scripts. [PITH_FULL_IMAGE:figures/full_fig_p021_6.png] view at source ↗

**Figure 7.** Figure 7: This image shows the correct samples of English, Chinese and Japanese bAbI tasks [PITH_FULL_IMAGE:figures/full_fig_p022_7.png] view at source ↗

**Figure 8.** Figure 8: This image shows the wrong samples of English, Chinese and Japanese bAbI tasks [PITH_FULL_IMAGE:figures/full_fig_p022_8.png] view at source ↗

**Figure 9.** Figure 9: This image shows some correct samples of LAMBADA task of eight pretraining [PITH_FULL_IMAGE:figures/full_fig_p023_9.png] view at source ↗

**Figure 10.** Figure 10: This image shows some wrong samples of LAMBADA task of eight pretraining [PITH_FULL_IMAGE:figures/full_fig_p024_10.png] view at source ↗

read the original abstract

Pixel-based language models are gaining momentum as alternatives to traditional token-based approaches, promising to circumvent tokenization challenges. However, the inherent perceptual diversity across languages poses a significant hurdle for multilingual generalization in pixel space. This paper introduces MIXAR, the first generative pixel-based language model trained on eight different languages utilizing a range of different scripts. We empirically evaluate MIXAR against previous pixel-based models as well as comparable tokenizer-based models, demonstrating substantial performance improvement on discriminative and generative multilingual tasks. Additionally, we show how MIXAR is robust to languages never seen during the training. These results are further strengthened when scaling the model to 0.5B parameters which not only improves its capabilities in generative tasks like LAMBADA but also its robustness when challenged with input perturbations such as orthographic attacks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MIXAR claims to be the first generative pixel-based LM on eight languages with varied scripts and shows scaling plus robustness gains, but the abstract gives no data details or numbers to back the multilingual claims.

read the letter

The core thing to know is that this paper trains an autoregressive pixel-based model on eight languages spanning different scripts and reports gains over prior pixel models plus tokenizer baselines on both discriminative and generative tasks. It also tests generalization to unseen languages and orthographic perturbations, with further improvements at 0.5B scale including better LAMBADA scores and attack resistance. That direction is worth tracking if pixel approaches are to move past English-only toy settings.

Referee Report

2 major / 1 minor

Summary. The paper introduces MIXAR, the first generative pixel-based autoregressive language model trained on eight languages spanning multiple scripts. It claims substantial performance gains over prior pixel-based and tokenizer-based models on both discriminative and generative multilingual tasks, robustness to languages and scripts unseen during training, and additional benefits from scaling to 0.5B parameters, including improved LAMBADA scores and greater resistance to orthographic attacks.

Significance. If the empirical claims are substantiated with detailed, reproducible results, this would represent a meaningful advance in multilingual language modeling by showing that pixel-based autoregressive models can address script diversity without tokenization. The scaling behavior and robustness findings, if rigorously demonstrated, would provide concrete evidence for the advantages of pixel representations in handling perceptual variation across languages.

major comments (2)

[Abstract and §4] Abstract and §4 (Experiments): The manuscript asserts 'substantial performance improvement' and 'robustness' on multilingual tasks but supplies no quantitative metrics, baseline models, evaluation details, error bars, or statistical significance tests. Without these, the central empirical claims cannot be assessed for magnitude or reliability.
[§3] §3 (Data and Training): No information is provided on training data composition, including per-language or per-script data volumes, balance across the eight languages, or rendering details such as image resolution and font choices. This information is load-bearing for the robustness claims, as dominance by a subset of scripts (e.g., Latin) could confound apparent generalization to unseen languages rather than demonstrating an inherent advantage of the pixel approach.

minor comments (1)

[Abstract] The abstract references LAMBADA without clarifying whether the standard English version or a multilingual adaptation is used, and does not specify the exact orthographic attack types evaluated.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important areas for improving the clarity and completeness of our empirical claims and data description. We address each major comment below and will incorporate revisions to strengthen the manuscript.

read point-by-point responses

Referee: [Abstract and §4] Abstract and §4 (Experiments): The manuscript asserts 'substantial performance improvement' and 'robustness' on multilingual tasks but supplies no quantitative metrics, baseline models, evaluation details, error bars, or statistical significance tests. Without these, the central empirical claims cannot be assessed for magnitude or reliability.

Authors: We agree that the abstract would be strengthened by including specific quantitative results. In the revised manuscript, we will update the abstract to report key metrics (e.g., accuracy or perplexity improvements on the multilingual tasks), name the main baseline models, and briefly note the evaluation setup. For §4, the experiments section already includes comparisons against prior pixel-based models and tokenizer-based models on both discriminative and generative tasks, along with results demonstrating robustness to unseen languages and benefits from scaling to 0.5B parameters. However, we acknowledge the value of additional rigor: we will add error bars from multiple random seeds, more explicit evaluation details (datasets, prompts, and metrics), and statistical significance tests for the reported gains. These changes will allow readers to better assess the magnitude and reliability of the improvements. revision: yes
Referee: [§3] §3 (Data and Training): No information is provided on training data composition, including per-language or per-script data volumes, balance across the eight languages, or rendering details such as image resolution and font choices. This information is load-bearing for the robustness claims, as dominance by a subset of scripts (e.g., Latin) could confound apparent generalization to unseen languages rather than demonstrating an inherent advantage of the pixel approach.

Authors: We agree that these details are essential for interpreting the robustness results and for ruling out potential confounds from data imbalance. In the revised version, we will substantially expand §3 to include a per-language and per-script breakdown of the training data volumes, the overall balance across the eight languages and scripts, and the rendering specifications (image resolution and font choices used for each script). This added information will clarify the data composition and support that the observed robustness to unseen languages and scripts stems from the pixel-based modeling approach rather than from Latin-script dominance. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical claims without derivations or self-referential reductions

full rationale

The paper presents MIXAR as a trained generative model evaluated on multilingual tasks, with claims of robustness to unseen languages and scaling benefits supported by experimental results. No equations, parameter fittings presented as predictions, uniqueness theorems, or ansatzes appear in the provided text. Central claims rest on benchmark comparisons and training descriptions rather than any step that reduces by construction to its own inputs or prior self-citations. This is a standard empirical ML paper whose results are externally falsifiable via replication on the stated tasks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no information on free parameters, axioms, or invented entities; the contribution is presented as an empirical extension of prior pixel-based models.

pith-pipeline@v0.9.0 · 5438 in / 1070 out tokens · 44743 ms · 2026-05-10T15:46:51.856716+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

33 extracted references · 22 canonical work pages · 10 internal anchors

[1]

Do all languages cost the same? tokenization in the era of commercial language models

Orevaoghene Ahia, Sachin Kumar, Hila Gonen, Jungo Kasai, David R Mortensen, Noah A Smith, and Yulia Tsvetkov. Do all languages cost the same? tokenization in the era of commercial language models. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 9904–9923,

2023
[2]

Xnli: Evaluating cross-lingual sentence representations

Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. Xnli: Evaluating cross-lingual sentence representations. InProceedings of the 2018 conference on empirical methods in natural language processing, pp. 2475–2485,

2018
[3]

Revis- iting pre-trained models for Chinese natural language processing

Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Shijin Wang, and Guoping Hu. Revis- iting pre-trained models for Chinese natural language processing. In Trevor Cohn, Yulan He, and Yang Liu (eds.),Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 657–668, Online, November

2020
[4]

doi: 10.18653/v1/2020.findings-emnlp.58

Association for Computational Lin- guistics. doi: 10.18653/v1/2020.findings-emnlp.58. URL https://aclanthology.org/ 2020.findings-emnlp.58/. Falcon Dai and Zheng Cai. Glyph-aware embedding of chinese characters. InProceedings of the First Workshop on Subword and Character Level Models in NLP, pp. 64–69,

work page doi:10.18653/v1/2020.findings-emnlp.58 2020
[5]

Bert: Pre-training of deep bidirectional transformers for language understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. InProceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pp. 4171–4186,

2019
[6]

From variational to deterministic autoencoders.arXiv preprint arXiv:1903.12436,

Partha Ghosh, Mehdi SM Sajjadi, Antonio Vergari, Michael Black, and Bernhard Sch¨olkopf. From variational to deterministic autoencoders.arXiv preprint arXiv:1903.12436,

work page arXiv 1903
[7]

Under review

12 Preprint. Under review. Andreas Grivas, Lorenzo Loconte, Emile van Krieken, Piotr Nawrot, Yu Zhao, Euan Wielewski, Pasquale Minervini, Edoardo Ponti, and Antonio Vergari. Fast and expressive multi-token prediction with probabilistic circuits.arXiv preprint arXiv:2511.11346,

work page arXiv
[8]

Benchmarking Neural Network Robustness to Common Corruptions and Perturbations

Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations.arXiv preprint arXiv:1903.12261,

work page internal anchor Pith review arXiv 1903
[9]

Multilingual pretraining for pixel language models

Ilker Kesen, Jonas F Lotz, Ingo Ziegler, Phillip Rust, and Desmond Elliott. Multilingual pretraining for pixel language models. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 29582–29599,

2025
[10]

textless-lib: A library for textless spoken language processing

Eugene Kharitonov, Jade Copet, Kushal Lakhotia, Tu-Anh Nguyen, Paden Tomasello, Ann Lee, Ali Elkahky, Wei-Ning Hsu, Abdelrahman Mohamed, Emmanuel Dupoux, et al. textless-lib: A library for textless spoken language processing. InProceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language...

2022
[11]

Auto-Encoding Variational Bayes

Diederik P Kingma and Max Welling. Auto-encoding variational bayes.arXiv preprint arXiv:1312.6114,

work page internal anchor Pith review Pith/arXiv arXiv
[12]

S entence P iece: A simple and language independent subword tokenizer and detokenizer for neural text processing

URL https://arxiv. org/abs/1808.06226. Sander Land and Max Bartolo. Fishing for magikarp: Automatically detecting under-trained tokens in large language models. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 11631–11646,

work page internal anchor Pith review arXiv 2024
[13]

Vocabulary attack to hijack large language model applications.arXiv preprint arXiv:2404.02637,

Patrick Levi and Christoph P Neumann. Vocabulary attack to hijack large language model applications.arXiv preprint arXiv:2404.02637,

work page arXiv
[14]

Visually grounded reasoning across languages and cultures

Fangyu Liu, Emanuele Bugliarello, Edoardo Maria Ponti, Siva Reddy, Nigel Collier, and Desmond Elliott. Visually grounded reasoning across languages and cultures. InPro- ceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 10467–10485,

2021
[15]

Learning character-level composi- tionality with visual features.arXiv preprint arXiv:1704.04859,

Frederick Liu, Han Lu, Chieh Lo, and Graham Neubig. Learning character-level composi- tionality with visual features.arXiv preprint arXiv:1704.04859,

work page arXiv
[16]

SGDR: Stochastic Gradient Descent with Warm Restarts

URLhttps://arxiv.org/abs/1608.03983. Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization,

work page internal anchor Pith review arXiv
[17]

Decoupled Weight Decay Regularization

URL https://arxiv.org/abs/1711.05101. Jonas Lotz, Elizabeth Salesky, Phillip Rust, and Desmond Elliott. Text rendering strategies for pixel language models. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 10155–10172,

work page internal anchor Pith review Pith/arXiv arXiv 2023
[18]

Overcoming vocabulary constraints with pixel-level fallback.arXiv preprint arXiv:2504.02122,

Jonas F Lotz, Hendra Setiawan, Stephan Peitz, and Yova Kementchedjhieva. Overcoming vocabulary constraints with pixel-level fallback.arXiv preprint arXiv:2504.02122,

work page arXiv
[19]

The LAMBADA dataset: Word prediction requiring a broad discourse context

Denis Paperno, Germ ´an Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fern´andez. The lambada dataset: Word prediction requiring a broad discourse context.arXiv preprint arXiv:1606.06031,

work page Pith review arXiv
[20]

Language modelling with pixels.arXiv preprint arXiv:2207.06991,

Phillip Rust, Jonas F Lotz, Emanuele Bugliarello, Elizabeth Salesky, Miryam de Lhoneux, and Desmond Elliott. Language modelling with pixels.arXiv preprint arXiv:2207.06991,

work page arXiv
[21]

Robust open-vocabulary translation from visual text representations

Elizabeth Salesky, David Etter, and Matt Post. Robust open-vocabulary translation from visual text representations. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 7235–7252,

2021
[22]

Multilingual pixel repre- sentations for translation and effective cross-lingual transfer

Elizabeth Salesky, Neha Verma, Philipp Koehn, and Matt Post. Multilingual pixel repre- sentations for translation and effective cross-lingual transfer. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 13845–13861,

2023
[23]

Neural Machine Translation of Rare Words with Subword Units

Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. InProceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1715–1725, 2016a. Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units...

work page internal anchor Pith review arXiv
[24]

GLU Variants Improve Transformer

Noam Shazeer. Glu variants improve transformer.arXiv preprint arXiv:2002.05202,

work page internal anchor Pith review arXiv 2002
[25]

Super characters: A conversion from sentiment classification to image classification.arXiv preprint arXiv:1810.07653,

Baohua Sun, Lin Yang, Patrick Dong, Wenhan Zhang, Jason Dong, and Charles Young. Super characters: A conversion from sentiment classification to image classification.arXiv preprint arXiv:1810.07653,

work page arXiv
[26]

Chinesebert: Chinese pretraining enhanced by glyph and pinyin information.arXiv preprint arXiv:2106.16038,

Zijun Sun, Xiaoya Li, Xiaofei Sun, Yuxian Meng, Xiang Ao, Qing He, Fei Wu, and Jiwei Li. Chinesebert: Chinese pretraining enhanced by glyph and pinyin information.arXiv preprint arXiv:2106.16038,

work page arXiv
[27]

Pixar: Auto-regressive language modeling in pixel space.arXiv preprint arXiv:2401.03321,

Yintao Tai, Xiyang Liao, Alessandro Suglia, and Antonio Vergari. Pixar: Auto-regressive language modeling in pixel space.arXiv preprint arXiv:2401.03321,

work page arXiv
[28]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth´ee Lacroix, Baptiste Rozi `ere, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971,

work page internal anchor Pith review Pith/arXiv arXiv
[29]

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

14 Preprint. Under review. Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461,

work page internal anchor Pith review arXiv
[30]

Towards ai-complete question answering: A set of prerequisite toy tasks

Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M Rush, Bart Van Merri¨enboer, Armand Joulin, and Tomas Mikolov. Towards ai-complete question answering: A set of prerequisite toy tasks.arXiv preprint arXiv:1502.05698,

work page arXiv
[31]

URL https://arxiv.org/abs/1609. 08144. An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388,

work page internal anchor Pith review Pith/arXiv arXiv
[32]

(2023) and increase the patch size to 32×32 pixels as an essential step to correctly represent these languages

For this reason, we go beyond what was studied Lotz et al. (2023) and increase the patch size to 32×32 pixels as an essential step to correctly represent these languages. While this facilitates encoding more complex scripts, it also increases the complexity of training due to the increased image resolution. Moreover, modeling a higher dimensional distribu...

2023
[33]

stage1 MNLI QQP QNLI SST-2 COLA STSB MRPC RTE WNLI 85MPIXARstage1lr 3e-5 3e-5 3e-5 3e-5 3e-5 3e-5 6e-5 3e-5 3e-5 116MMIXARstage1lr 3e-5 3e-5 3e-5 3e-5 3e-5 3e-5 6e-5 3e-5 6e-5 477MMIXARstage1lr 3e-5 3e-5 3e-5 3e-5 6e-5 3e-5 6e-5 3e-5 6e-5 Weight decay 0.1 0.1 0.1 0.01 0.01 0.01 0.01 0.01 0.01 Optimizer AdamW Warmup Linear warmup Warmup steps 1000 1000 500...

2000