pith. machine review for the scientific record.
Pith Number

pith:JD2ZQ3EI

pith:2024:JD2ZQ3EIWO2MYOVWEQJMBIJN7O
not attested · not anchored · not stored · refs resolved

VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks

Daniel Fried, Graham Neubig, Jing Yu Koh, Lawrence Jang, Ming Chong Lim, Po-Yu Huang, Robert Lo, Ruslan Salakhutdinov, Shuyan Zhou, Vikram Duvvur

VisualWebArena shows that multimodal agents still struggle with visually grounded web tasks.

arxiv:2401.13649 v2 · 2024-01-24 · cs.LG · cs.CL · cs.CV

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
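The merge algorithm itself is not specified on this page. As a rough illustration of what "deterministic merge" can mean, the sketch below replays events in a fixed total order so that any mirror holding the same events derives the same state. The event fields (`id`, `ts`, `field`, `value`) and the last-writer-wins rule are assumptions made for the example, not Pith's actual event schema or algorithm.

```python
from typing import Any


def merge_events(canonical: dict[str, Any], events: list[dict[str, Any]]) -> dict[str, Any]:
    """Replay events in a fixed total order (hypothetical schema, see lead-in)."""
    # De-duplicate by event id so a re-delivered event has no extra effect.
    # Assumes an id always identifies the same event content.
    unique = {e["id"]: e for e in events}
    # Sort by (timestamp, id): ties on timestamp are broken by id,
    # so the order does not depend on how the events arrived.
    ordered = sorted(unique.values(), key=lambda e: (e["ts"], e["id"]))
    state = dict(canonical)
    for e in ordered:
        state[e["field"]] = e["value"]  # last writer in the total order wins
    return state


base = {"citations": "open"}
events = [
    {"id": "e1", "ts": "2026-05-18T00:00:00Z", "field": "citations", "value": "closed"},
    {"id": "e2", "ts": "2026-05-19T00:00:00Z", "field": "replications", "value": "open"},
]
# Arrival order does not change the merged state.
print(merge_events(base, events) == merge_events(base, list(reversed(events))))
```

The point of the design is that the result depends only on the set of events, never on the order or number of times a mirror received them.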

Claims

C1 strongest claim

Through extensive quantitative and qualitative analysis, we identify several limitations of text-only LLM agents, and reveal gaps in the capabilities of state-of-the-art multimodal language agents.

C2 weakest assumption

That the chosen websites and task templates are sufficiently representative of the visual and interaction challenges encountered in real-world web use.

C3 one line summary

VisualWebArena benchmark demonstrates that state-of-the-art multimodal agents still exhibit significant limitations on visually grounded web tasks.

References

26 extracted · 26 resolved · 5 Pith anchors

[1] Scaling Instruction-Finetuned Language Models · arXiv:2210.11416
[2] Gemini: A Family of Highly Capable Multimodal Models 2023 · arXiv:2312.11805
[3] Language models can solve computer tasks · NeurIPS
[4] Improved Baselines with Visual Instruction Tuning 2023 · arXiv:2310.03744
[5] GAIA: a benchmark for General AI Assistants 2023 · arXiv:2311.12983

Formal links

2 machine-checked theorem links

Cited by

20 papers in Pith

Receipt and verification
First computed 2026-05-17T23:38:13.706760Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

48f5986c88b3b4cc3ab62412c0a12dfb879cad22d6d6ea688bd1aba900c7a54c

Aliases

arxiv: 2401.13649 · arxiv_version: 2401.13649v2 · doi: 10.48550/arxiv.2401.13649 · pith_short_12: JD2ZQ3EIWO2M · pith_short_16: JD2ZQ3EIWO2MYOVW · pith_short_8: JD2ZQ3EI
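The identifiers on this page are consistent with a simple relationship: the 26-character Pith Number matches the first 26 characters of the RFC 4648 base32 encoding of the canonical SHA-256 digest, and the `pith_short_*` aliases are prefixes of it. This is inferred from the values shown here, not from a documented rule, so treat the sketch below as an observation rather than a specification. The function name `pith_id_from_hash` is made up for the example.

```python
import base64


def pith_id_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode a hex digest and keep the first `length` characters."""
    raw = bytes.fromhex(hex_digest)
    return base64.b32encode(raw).decode("ascii")[:length]


canonical_hash = "48f5986c88b3b4cc3ab62412c0a12dfb879cad22d6d6ea688bd1aba900c7a54c"
full_id = pith_id_from_hash(canonical_hash)
print(full_id)        # JD2ZQ3EIWO2MYOVWEQJMBIJN7O
print(full_id[:8])    # JD2ZQ3EI          (pith_short_8)
print(full_id[:12])   # JD2ZQ3EIWO2M      (pith_short_12)
print(full_id[:16])   # JD2ZQ3EIWO2MYOVW  (pith_short_16)
```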
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/JD2ZQ3EIWO2MYOVWEQJMBIJN7O \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 48f5986c88b3b4cc3ab62412c0a12dfb879cad22d6d6ea688bd1aba900c7a54c
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "09a974b7f3b516863a9fc0ccfb802d41251178a15031c78910b671d935ac6d7f",
    "cross_cats_sorted": [
      "cs.CL",
      "cs.CV"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2024-01-24T18:35:21Z",
    "title_canon_sha256": "e72169dc7b8a326afcf8786f234787d837b6ecd811d6f82b47c1099b40105905"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2401.13649",
    "kind": "arxiv",
    "version": 2
  }
}
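For readers who prefer to check the hash offline, the following is a minimal Python sketch of the same canonicalization the shell pipeline above performs: keys sorted, compact separators, non-ASCII left unescaped, UTF-8 bytes, then SHA-256. The record literal is copied from the JSON on this page. The helper name `canonical_sha256` is an assumption for the example, and the printed comparison is left for the reader to run rather than asserted here.

```python
import hashlib
import json

EXPECTED = "48f5986c88b3b4cc3ab62412c0a12dfb879cad22d6d6ea688bd1aba900c7a54c"


def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and compact separators, then hash the UTF-8 bytes."""
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


record = {
    "metadata": {
        "abstract_canon_sha256": "09a974b7f3b516863a9fc0ccfb802d41251178a15031c78910b671d935ac6d7f",
        "cross_cats_sorted": ["cs.CL", "cs.CV"],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.LG",
        "submitted_at": "2024-01-24T18:35:21Z",
        "title_canon_sha256": "e72169dc7b8a326afcf8786f234787d837b6ecd811d6f82b47c1099b40105905",
    },
    "schema_version": "1.0",
    "source": {"id": "2401.13649", "kind": "arxiv", "version": 2},
}

digest = canonical_sha256(record)
print(digest)
print("matches expected:", digest == EXPECTED)
```

Because keys are sorted before hashing, the digest does not depend on the key order in which a mirror happens to store the record.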