Optical Context Compression Is Just (Bad) Autoencoding
DeepSeek-OCR shows that rendered text can be reconstructed from a small number of vision tokens, sparking excitement about using vision as a compression medium for long textual contexts. But this pipeline requires rendering token embeddings to pixels and compressing from there -- discarding learned representations in favor of an image the vision encoder must then recover from. We ask whether this detour helps. Comparing DeepSeek-OCR's vision encoder against near-zero-parameter mean pooling and a learned hierarchical encoder, we find it does not. For reconstruction, simple direct methods match or surpass vision at every compression ratio. For language modeling, vision performs comparably to truncation -- a baseline that simply discards context -- and loses to the hierarchical encoder at every compression ratio. As expected, all compression methods outperform truncation for factual recall, but vision never surpasses the best direct baseline. The excitement around optical context compression outpaces the evidence. Code and checkpoints are available at https://github.com/ivnle/bad-autoencoding.
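As a concrete point of reference, here is a minimal sketch of what the near-zero-parameter mean-pooling baseline from the abstract could look like: token embeddings are averaged over non-overlapping windows, so a sequence of n embeddings becomes roughly n/r compressed vectors at compression ratio r. The function name, tensor shapes, and padding choice are illustrative assumptions, not the authors' implementation (see the linked repository for that).

```python
# Hypothetical sketch of a mean-pooling context-compression baseline.
# Assumption: embeddings are [seq_len, d_model]; `ratio` is the target
# compression ratio. This is not the paper's actual code.
import torch


def mean_pool_compress(embeddings: torch.Tensor, ratio: int) -> torch.Tensor:
    """Compress [seq_len, d_model] embeddings by a factor of `ratio`.

    Each output vector is the mean of `ratio` consecutive input vectors,
    so seq_len tokens become ceil(seq_len / ratio) compressed vectors.
    """
    seq_len, d_model = embeddings.shape
    pad = (-seq_len) % ratio  # pad so seq_len divides evenly into windows
    if pad:
        # Zero padding slightly dilutes the mean of the final window;
        # a masked mean would avoid this bias.
        embeddings = torch.cat([embeddings, embeddings.new_zeros(pad, d_model)])
    # Group into non-overlapping windows of size `ratio`, average each window.
    return embeddings.view(-1, ratio, d_model).mean(dim=1)


# Example: 1,024 token embeddings compressed 16x -> 64 vectors.
x = torch.randn(1024, 768)
z = mean_pool_compress(x, ratio=16)
print(z.shape)  # torch.Size([64, 768])
```

The point of this baseline is its parameter count: the pooling itself learns nothing, which makes it a natural floor against which the vision-encoder detour and the learned hierarchical encoder can be compared.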
Forward citations
Cited by 2 Pith papers
- Visual Text Compression as Measure Transport
  Framing visual text compression as measure transport decomposes encoding loss into precision and coverage costs, enabling a label-free routing rule that matches oracle performance on 17 of 24 NLP datasets while using ...
- Memory as Metabolism: A Design for Companion Knowledge Systems
  This paper designs a companion knowledge system with TRIAGE, DECAY, CONTEXTUALIZE, CONSOLIDATE, and AUDIT operations plus memory gravity and minority-hypothesis retention to give contradictory evidence a path to updat...