Pith Number

pith:E4ST4TYI

pith:2024:E4ST4TYIFZXUNH4NLCXSENDMNM
not attested · not anchored · not stored · refs resolved

LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models

Bo Li, Chunyuan Li, Fanyi Pu, Jingkang Yang, Joshua Adrian Cahyono, Kaichen Zhang, Kairui Hu, Peiyuan Zhang, Shuai Liu, Yuanhan Zhang, Ziwei Liu

Evaluating large multimodal models requires balancing wide task coverage, low computational cost, and zero data contamination in benchmarks.

arxiv:2407.12772 v2 · 2024-07-17 · cs.CL · cs.CV

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{E4ST4TYIFZXUNH4NLCXSENDMNM}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 strongest claim

Our work highlights the importance of considering the evaluation trilemma and provides practical solutions to navigate the trade-offs in evaluating large multi-modal models.

C2 weakest assumption

That the live data sources and pruning rules in LMMS-EVAL LITE and LIVEBENCH truly deliver zero contamination and maintain coverage without introducing new selection biases or missing important capabilities.

C3 one-line summary

LMMS-EVAL delivers a standardized multimodal evaluation framework with lite and live variants that target the trade-offs among coverage, cost, and zero contamination.

References

24 extracted · 24 resolved · 3 Pith anchors

[1] InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning · arXiv:2305.06500
[2] InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD
[3] MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models · arXiv:2306.13394
[4] Making LLaMA SEE and Draw with SEED Tokenizer (2023)
[5] A Diagram Is Worth a Dozen Images · arXiv:1603.07396

Formal links

2 machine-checked theorem links

Cited by

23 papers in Pith

Receipt and verification
First computed: 2026-05-17T23:38:15.008955Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05) · public key
Schema: pith-number/v1.0

Canonical hash

27253e4f082e6f469f8d58af22346c6b1fce493d856b203c45919958e98d8fa0

Aliases

arxiv: 2407.12772 · arxiv_version: 2407.12772v2 · doi: 10.48550/arxiv.2407.12772 · pith_short_12: E4ST4TYIFZXU · pith_short_16: E4ST4TYIFZXUNH4N · pith_short_8: E4ST4TYI
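
The values on this page suggest how the identifier and its short aliases relate to the canonical hash: the 26-character Pith Number matches the unpadded Base32 (RFC 4648) encoding of the first 16 bytes of the SHA-256 digest, and each `pith_short_N` alias is the first N characters of that identifier. This is an observation from this one record, not a documented rule. The function names below are hypothetical, and the 16-byte truncation is an assumption inferred from the identifier length.

```python
import base64

# Canonical hash as printed on this page.
CANONICAL_HASH = "27253e4f082e6f469f8d58af22346c6b1fce493d856b203c45919958e98d8fa0"


def pith_number_from_hash(hex_digest: str) -> str:
    """Assumed rule: unpadded Base32 of the first 16 digest bytes."""
    raw = bytes.fromhex(hex_digest)[:16]
    return base64.b32encode(raw).decode("ascii").rstrip("=")


def short_aliases(pith_number: str) -> dict:
    """Assumed rule: each short alias is a prefix of the full identifier."""
    return {f"pith_short_{n}": pith_number[:n] for n in (8, 12, 16)}


full = pith_number_from_hash(CANONICAL_HASH)
print(full)  # E4ST4TYIFZXUNH4NLCXSENDMNM
print(short_aliases(full))
```

If the rule holds in general, a client could check that an identifier and a canonical hash belong together without contacting the service. Treat that as a convenience check only until the scheme is confirmed by the schema documentation.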
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/E4ST4TYIFZXUNH4NLCXSENDMNM \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 27253e4f082e6f469f8d58af22346c6b1fce493d856b203c45919958e98d8fa0
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "4e942c277c09028045710149d7f7bf8b6da6f6a2028782aa5a1e0e123c0fb3bd",
    "cross_cats_sorted": [
      "cs.CV"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2024-07-17T17:51:53Z",
    "title_canon_sha256": "9304da0fd6df0a43a304bc7313fd5757487e366e8650beec5abefe326d06a3a2"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2407.12772",
    "kind": "arxiv",
    "version": 2
  }
}
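
The shell pipeline in the verification step can also be reproduced offline. The sketch below applies the same canonicalization that the one-liner uses (sorted keys, compact separators, non-ASCII characters left unescaped, UTF-8 encoding) to a copy of the canonical record above and compares the digest with the published hash. The names `canonical_bytes` and `canonical_hash` are hypothetical helpers. Whether the digest matches depends on the copied record being identical to what the service hashed, so the script reports the result rather than assuming it.

```python
import hashlib
import json

# Copy of the canonical record shown on this page.
CANONICAL_RECORD = {
    "metadata": {
        "abstract_canon_sha256": "4e942c277c09028045710149d7f7bf8b6da6f6a2028782aa5a1e0e123c0fb3bd",
        "cross_cats_sorted": ["cs.CV"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.CL",
        "submitted_at": "2024-07-17T17:51:53Z",
        "title_canon_sha256": "9304da0fd6df0a43a304bc7313fd5757487e366e8650beec5abefe326d06a3a2",
    },
    "schema_version": "1.0",
    "source": {"id": "2407.12772", "kind": "arxiv", "version": 2},
}

# Hash published on this page.
EXPECTED_HASH = "27253e4f082e6f469f8d58af22346c6b1fce493d856b203c45919958e98d8fa0"


def canonical_bytes(record: dict) -> bytes:
    """Serialize with sorted keys and no insignificant whitespace."""
    text = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return text.encode("utf-8")


def canonical_hash(record: dict) -> str:
    """SHA-256 hex digest of the canonical serialization."""
    return hashlib.sha256(canonical_bytes(record)).hexdigest()


digest = canonical_hash(CANONICAL_RECORD)
print(digest)
print("match" if digest == EXPECTED_HASH else "mismatch")
```

Sorting keys makes the serialization independent of the order in which a mirror happens to store the fields, which is what lets two parties compute the same hash from the same record.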