Pith Number

pith:WCZCW6CG

pith:2026:WCZCW6CGUOLLC5I6HYOF7TDI5X

not attested · not anchored · not stored · refs resolved

Learning to See What You Need: Gaze Attention for Multimodal Large Language Models

Byeongho Heo, Dongyoon Han, Geonmo Gu, Jaegul Choo, Junha Song, Sangdoo Yun

Multimodal LLMs can match or exceed full dense attention by dynamically restricting focus to a small number of task-relevant gaze regions and using up to 90 percent fewer visual key-value entries.

arxiv:2605.13080 v1 · 2026-05-13 · cs.CV

Add to your LaTeX paper

\usepackage{pith}
\pithnumber{WCZCW6CGUOLLC5I6HYOF7TDI5X}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live · download bundle · merged state

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
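
The page does not spell out the merge algorithm, so what follows is only a minimal Python sketch of what a deterministic merge can look like. The event fields (`id`, `at`, `field`, `value`) and the last-writer-wins rule are assumptions made for illustration, not the Pith schema. The point is that deduplicating events and sorting them by a total order before folding makes the result independent of the order in which a mirror received them.

```python
def merge_events(events):
    """Fold events into a state dict, independent of input order and duplicates.

    Hypothetical event shape: {"id": str, "at": str, "field": str, "value": any}.
    """
    # Deduplicate by event id (assumes events sharing an id are identical).
    unique = {e["id"]: e for e in events}
    # Total order: timestamp first, event id as a tie-breaker.
    ordered = sorted(unique.values(), key=lambda e: (e["at"], e["id"]))
    state = {}
    for e in ordered:
        state[e["field"]] = e["value"]  # later events overwrite earlier ones
    return state


events = [
    {"id": "e2", "at": "2026-05-18T04:00:00Z", "field": "citations", "value": "open"},
    {"id": "e1", "at": "2026-05-18T03:08:58Z", "field": "refs", "value": "resolved"},
]
# Reordering or duplicating the input does not change the merged state.
print(merge_events(events) == merge_events(events[::-1] + events))
```

Two mirrors that hold the same set of events would then arrive at the same state, whatever the delivery order.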

Claims

C1 · strongest claim

Gaze Attention matches or surpasses dense-attention baselines, while using up to 90% fewer visual KV entries in the attention computation.

C2 · weakest assumption

That spatially grouping visual embeddings into compact gaze regions, dynamically selecting them via lightweight descriptors, and appending learnable context tokens is sufficient to preserve all task-critical information without performance loss.

C3 · one line summary

Gaze Attention groups visual embeddings into selectable regions and dynamically restricts attention to task-relevant ones, matching dense baselines with up to 90% fewer visual KV entries via added context tokens.
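
To make the mechanism in the claims concrete, here is a toy, single-query Python sketch of region-restricted attention. It is not the authors' implementation. Contiguous fixed-size regions, mean-key region descriptors, dot-product top-k selection, and the function names are all assumptions chosen for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_vector(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gaze_attention(query, keys, values, region_size, top_k, ctx_keys, ctx_values):
    """Attend only to the top_k regions plus appended context tokens.

    Assumptions (not from the paper): regions are contiguous chunks of
    region_size tokens, and a region's descriptor is the mean of its keys.
    Returns (output vector, selected region indices, visual KV entries used).
    """
    regions = [list(range(s, min(s + region_size, len(keys))))
               for s in range(0, len(keys), region_size)]
    descriptors = [mean_vector([keys[i] for i in r]) for r in regions]
    scores = [dot(query, d) for d in descriptors]
    ranked = sorted(range(len(regions)), key=lambda r: scores[r], reverse=True)
    selected = sorted(ranked[:top_k])
    idx = [i for r in selected for i in regions[r]]
    # Keys/values of the selected regions, followed by the context tokens.
    k_used = [keys[i] for i in idx] + ctx_keys
    v_used = [values[i] for i in idx] + ctx_values
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, k) * scale for k in k_used])
    out = [sum(w * v[j] for w, v in zip(weights, v_used))
           for j in range(len(v_used[0]))]
    return out, selected, len(idx)


# 100 visual tokens in 10 regions; only region 3 aligns with the query.
keys = [[1.0, 0.0, 0.0, 0.0] if 30 <= i < 40 else [0.0, 1.0, 0.0, 0.0]
        for i in range(100)]
values = [[float(i)] for i in range(100)]
out, selected, used = gaze_attention(
    [1.0, 0.0, 0.0, 0.0], keys, values, region_size=10, top_k=1,
    ctx_keys=[[0.0, 0.0, 0.0, 0.0]], ctx_values=[[0.0]])
print(selected, used, 1 - used / len(keys))
```

In this toy setup, selecting one of ten regions leaves 10 of 100 visual KV entries in the attention computation, which is 90% fewer. The single context token stands in for the learnable context tokens named in the summary.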

References

194 extracted · 194 resolved · 40 Pith anchors

[1] Visual Instruction Tuning · NeurIPS

[2] Improved baselines with visual instruction tuning · CVPR

[3] LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models · arXiv:2407.07895

[4] LLaVA-OneVision: Easy visual task transfer · TMLR

[5] Learning transferable visual models from natural language supervision · ICML

Formal links

2 machine-checked theorem links

Receipt and verification

First computed	2026-05-18T03:08:58.705781Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

b0b22b7846a396b1751e3e1c5fcc68ede9e8841e21e0449badf7ef3201478b77

Aliases

arxiv: 2605.13080 · arxiv_version: 2605.13080v1 · doi: 10.48550/arxiv.2605.13080 · pith_short_12: WCZCW6CGUOLL · pith_short_16: WCZCW6CGUOLLC5I6 · pith_short_8: WCZCW6CG
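
In this record, the Pith Number coincides with the first 26 characters of the RFC 4648 base32 encoding of the canonical hash, and the short aliases are prefixes of that string. The page does not state this as a rule, so the Python sketch below treats it as an assumption that happens to hold for this record. The function names are hypothetical.

```python
import base64

CANONICAL_HASH = "b0b22b7846a396b1751e3e1c5fcc68ede9e8841e21e0449badf7ef3201478b77"


def pith_number_from_hash(hex_digest, length=26):
    """Assumed derivation: leading characters of the base32-encoded digest."""
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


def short_aliases(pith_number, lengths=(8, 12, 16)):
    """Prefix aliases, named like the pith_short_N entries listed above."""
    return {f"pith_short_{n}": pith_number[:n] for n in lengths}


pith = pith_number_from_hash(CANONICAL_HASH)
print(pith)  # WCZCW6CGUOLLC5I6HYOF7TDI5X
print(short_aliases(pith))
```

If the relationship holds in general, a client could check that a Pith Number and its canonical hash belong together without a network call.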

Agent API

Resolver JSON · Graph JSON · Events JSON · Schema · Signing key

Verify this Pith Number yourself

# Fetch the record, extract the canonical record, re-serialize it as compact
# sorted-key JSON, and print the SHA-256 digest of the result.
curl -sH 'Accept: application/ld+json' https://pith.science/pith/WCZCW6CGUOLLC5I6HYOF7TDI5X \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: b0b22b7846a396b1751e3e1c5fcc68ede9e8841e21e0449badf7ef3201478b77

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "d5061722eb07690452f1db69d5249046c32d2804a577bf13591dc33afc8bd85e",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2026-05-13T06:54:09Z",
    "title_canon_sha256": "b87ba34be4779e971d0208fedc4745687a3fcfe54e350e1e099b5b6c8fe747f8"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.13080",
    "kind": "arxiv",
    "version": 1
  }
}
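
The same check can be run offline against the record printed above. This Python sketch mirrors the canonicalization used in the one-liner under "Verify this Pith Number yourself" (sorted keys, compact separators, UTF-8). The function name is hypothetical, and whether the digest matches depends on the service using exactly this serialization.

```python
import hashlib
import json


def canonical_sha256(record):
    """SHA-256 over compact JSON with sorted keys, as in the curl pipeline."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"),
                           ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


record = {
    "metadata": {
        "abstract_canon_sha256": "d5061722eb07690452f1db69d5249046c32d2804a577bf13591dc33afc8bd85e",
        "cross_cats_sorted": [],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.CV",
        "submitted_at": "2026-05-13T06:54:09Z",
        "title_canon_sha256": "b87ba34be4779e971d0208fedc4745687a3fcfe54e350e1e099b5b6c8fe747f8",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.13080", "kind": "arxiv", "version": 1},
}

# Compare the printed digest with the value listed under "Canonical hash".
print(canonical_sha256(record))
```

Because keys are sorted before hashing, the order in which fields appear in a downloaded copy does not affect the digest.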