Pith Number

pith:XLMUT3UJ

pith:2026:XLMUT3UJRA4NW74L6ILZWM47Z2
not attested · not anchored · not stored · refs pending

Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation

Belinda Zeng, Jonas Schult, Luke Zettlemoyer, Mengzhao Chen, Ping Luo, Sen He, Shoufa Chen, Tao Xiang, Tianhong Li, Weiming Ren, Wenhu Chen, Xiaoke Huang, Yatai Ji, Yuren Cong, Zhiheng Liu

Tuna-2 shows that simple pixel patch embeddings can replace pretrained vision encoders for unified multimodal understanding and generation.

arxiv:2604.24763 v2 · 2026-04-27 · cs.CV

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{XLMUT3UJRA4NW74L6ILZWM47Z2}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 strongest claim

Tuna-2 achieves state-of-the-art performance in multimodal benchmarks, demonstrating that unified pixel-space modelling can fully compete with latent-space approaches for high-quality image generation. Moreover, while the encoder-based variant converges faster in early pretraining, Tuna-2's encoder-free design achieves stronger multimodal understanding at scale, particularly on tasks requiring fine-grained visual perception.

C2 weakest assumption

That simple patch embedding layers applied directly to pixels can extract sufficient visual features for both high-quality generation and fine-grained understanding without the inductive biases or pretraining provided by dedicated vision encoders.

C3 one line summary

Tuna-2 shows pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive or superior results on understanding and generation benchmarks.

Cited by

5 papers in Pith

Receipt and verification
First computed 2026-05-20T00:04:32.947791Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

bad949ee898838db7f8bf2179b339fceb6b0809c8883ffb59838abe81b908e48

Aliases

arxiv: 2604.24763 · arxiv_version: 2604.24763v2 · doi: 10.48550/arxiv.2604.24763 · pith_short_12: XLMUT3UJRA4N · pith_short_16: XLMUT3UJRA4NW74L · pith_short_8: XLMUT3UJ
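The identifiers on this page appear to be related: base32-encoding the bytes of the canonical hash and keeping the first 26 characters reproduces the Pith Number, and the short aliases are prefixes of it. The sketch below checks this for the values shown here. The relationship is inferred from this record alone, not taken from a published specification, and the function name `pith_number_from_hash` is hypothetical.

```python
import base64

# Values copied from this record.
CANONICAL_HASH = "bad949ee898838db7f8bf2179b339fceb6b0809c8883ffb59838abe81b908e48"
PITH_NUMBER = "XLMUT3UJRA4NW74L6ILZWM47Z2"


def pith_number_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the digest bytes and keep the first `length` characters.

    Assumption: this rule is inferred from the values on this page only.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


full = pith_number_from_hash(CANONICAL_HASH)
print("matches Pith Number:", full == PITH_NUMBER)

# The short aliases listed above are prefixes of the full number.
for n in (8, 12, 16):
    print(f"pith_short_{n}:", full[:n])
```

Whether the builder truncates at 26 characters for every record, or handles collisions differently, is not stated on this page.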
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/XLMUT3UJRA4NW74L6ILZWM47Z2 \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: bad949ee898838db7f8bf2179b339fceb6b0809c8883ffb59838abe81b908e48
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "6da0401ac5d63b909a123cde4a092dd4683b560b4692dc78f5060d4c533dfc94",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2026-04-27T17:59:56Z",
    "title_canon_sha256": "fcd3d766fa645fb882bbaf1a54ecfdd2df1a45fc353c92e79193608fca6e7dc7"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2604.24763",
    "kind": "arxiv",
    "version": 2
  }
}
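The verification command above can also be run offline against a saved copy of the record. The sketch below applies the same canonicalization as the one-liner (sorted keys, compact separators, no ASCII escaping, UTF-8 bytes) and hashes the result with SHA-256. The function name `canonical_sha256` is hypothetical, and the record literal is copied from this page.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Hash a record using the canonical form from the verification one-liner."""
    canonical = json.dumps(
        record,
        sort_keys=True,           # key order must not affect the hash
        separators=(",", ":"),    # no whitespace between tokens
        ensure_ascii=False,       # keep non-ASCII characters unescaped
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Canonical record as shown on this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "6da0401ac5d63b909a123cde4a092dd4683b560b4692dc78f5060d4c533dfc94",
        "cross_cats_sorted": [],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.CV",
        "submitted_at": "2026-04-27T17:59:56Z",
        "title_canon_sha256": "fcd3d766fa645fb882bbaf1a54ecfdd2df1a45fc353c92e79193608fca6e7dc7",
    },
    "schema_version": "1.0",
    "source": {"id": "2604.24763", "kind": "arxiv", "version": 2},
}

expected = "bad949ee898838db7f8bf2179b339fceb6b0809c8883ffb59838abe81b908e48"
digest = canonical_sha256(record)
print(digest)
print("matches canonical hash:", digest == expected)
```

If the record shown here is complete and the published hash was computed the same way, the two values should agree; the sketch reports the comparison rather than assuming it.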