Pith Number
pith:JD2ZQ3EI
pith:2024:JD2ZQ3EIWO2MYOVWEQJMBIJN7O
not attested
not anchored
not stored
refs resolved
VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks
VisualWebArena shows that multimodal agents still struggle with visually grounded web tasks.
arxiv:2401.13649 v2 · 2024-01-24 · cs.LG · cs.CL · cs.CV
Record completeness
1 Bitcoin timestamp
2 Internet Archive
3 Author claim
4 Citations
5 Replications
✓
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same
current state with the deterministic merge algorithm.
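The page does not specify the merge algorithm or the event schema, so the following is only a minimal sketch of what a deterministic merge could look like in general. The event shape (`kind`, `value`), the content-addressed `event_id`, and the `flags` state are all hypothetical, and signature verification is omitted. The point it illustrates is that deduplicating events and applying them in a sorted order makes the result independent of arrival order, so any mirror holding the same bundle arrives at the same state.

```python
import hashlib
import json


def event_id(event: dict) -> str:
    """Content-address an event (hypothetical scheme, not Pith's actual one)."""
    blob = json.dumps(event, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(blob).hexdigest()


def merge(canonical_record: dict, events: list) -> dict:
    """Fold events into a state without depending on their arrival order."""
    # Deduplicate by content id so replayed events have no extra effect.
    unique = {event_id(e): e for e in events}
    state = {"record": canonical_record, "flags": {}}
    # Apply in sorted-id order; for two events of the same kind, the one
    # with the larger id wins, which is arbitrary but reproducible.
    for eid in sorted(unique):
        e = unique[eid]
        state["flags"][e["kind"]] = e["value"]
    return state


# Hypothetical events, loosely modeled on the status flags shown on this page.
events = [
    {"kind": "anchored", "value": False},
    {"kind": "stored", "value": False},
    {"kind": "refs_resolved", "value": True},
]
state = merge({"source": {"id": "2401.13649"}}, events)
```

Calling `merge` with the same events in any order, or with duplicates added, returns an identical `state`.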
Claims
C1 strongest claim
Through extensive quantitative and qualitative analysis, we identify several limitations of text-only LLM agents, and reveal gaps in the capabilities of state-of-the-art multimodal language agents.
C2 weakest assumption
That the chosen websites and task templates are sufficiently representative of the visual and interaction challenges encountered in real-world web use.
C3 one-line summary
The VisualWebArena benchmark demonstrates that state-of-the-art multimodal agents still exhibit significant limitations on visually grounded web tasks.
References
[1] Scaling Instruction-Finetuned Language Models
[2] Gemini: A Family of Highly Capable Multimodal Models
[3] Language models can solve computer tasks. NeurIPS
[4] Improved Baselines with Visual Instruction Tuning
[5] GAIA: a benchmark for General AI Assistants
Receipt and verification
| Field | Value |
|---|---|
| First computed | 2026-05-17T23:38:13.706760Z |
| Builder | pith-number-builder-2026-05-17-v1 |
| Signature | Pith Ed25519 (pith-v1-2026-05) · public key |
| Schema | pith-number/v1.0 |
Canonical hash
48f5986c88b3b4cc3ab62412c0a12dfb879cad22d6d6ea688bd1aba900c7a54c
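The identifiers at the top of this record appear to be derived from this hash: Base32-encoding the digest bytes (RFC 4648 alphabet) and keeping the first 26 characters gives the long identifier, and the first 8 give the short one. This relationship is inferred from the values on this page, not from any stated specification, and the helper name `pith_id_from_hash` is hypothetical. A minimal sketch:

```python
import base64

CANONICAL_HASH = "48f5986c88b3b4cc3ab62412c0a12dfb879cad22d6d6ea688bd1aba900c7a54c"


def pith_id_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the digest bytes and keep the leading characters.

    The 26-character length is an assumption read off this page's identifier.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


full_id = pith_id_from_hash(CANONICAL_HASH)
short_id = full_id[:8]
print(full_id)   # JD2ZQ3EIWO2MYOVWEQJMBIJN7O
print(short_id)  # JD2ZQ3EI
```

If the relationship holds for other records, a mirror could check that an identifier matches a record's canonical hash without contacting the server.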
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/JD2ZQ3EIWO2MYOVWEQJMBIJN7O \
| jq -c '.canonical_record' \
| python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 48f5986c88b3b4cc3ab62412c0a12dfb879cad22d6d6ea688bd1aba900c7a54c
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "09a974b7f3b516863a9fc0ccfb802d41251178a15031c78910b671d935ac6d7f",
    "cross_cats_sorted": [
      "cs.CL",
      "cs.CV"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2024-01-24T18:35:21Z",
    "title_canon_sha256": "e72169dc7b8a326afcf8786f234787d837b6ecd811d6f82b47c1099b40105905"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2401.13649",
    "kind": "arxiv",
    "version": 2
  }
}
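The verification one-liner above can also be run offline against the record as listed here. The sketch below applies the same canonicalization as that command (sorted keys, compact separators, `ensure_ascii=False`, UTF-8) and hashes the result. The function name `canonical_sha256` is hypothetical. If the JSON shown on this page is the same as what the server returns, the printed digest should equal the canonical hash; that equality is not asserted here.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and compact separators, then SHA-256 the UTF-8 bytes."""
    blob = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


# The canonical record as listed on this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "09a974b7f3b516863a9fc0ccfb802d41251178a15031c78910b671d935ac6d7f",
        "cross_cats_sorted": ["cs.CL", "cs.CV"],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.LG",
        "submitted_at": "2024-01-24T18:35:21Z",
        "title_canon_sha256": "e72169dc7b8a326afcf8786f234787d837b6ecd811d6f82b47c1099b40105905",
    },
    "schema_version": "1.0",
    "source": {"id": "2401.13649", "kind": "arxiv", "version": 2},
}

digest = canonical_sha256(record)
print(digest)  # compare with the canonical hash listed above
```

Because keys are sorted before hashing, the digest does not depend on the key order of the input. Changing any value, such as the `version`, produces a different digest.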