Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks

2026 · cs.LG · arXiv 2604.24964

3 Pith papers cite this work. Polarity classification is still indexing.


abstract

Existing web agent benchmarks have largely converged on short, single-site tasks on which frontier models are approaching saturation. However, real-world web use consists of long-horizon, multi-site workflows. Common web navigation tasks, such as comparing products across different domains, planning trips across multiple services, or summarizing information from multiple search queries, require sustained context and cross-site reasoning over potentially hours of browsing. To capture and evaluate such behaviors, we introduce Odysseys: a benchmark of 200 long-horizon web tasks derived from real-world browsing sessions and evaluated on the live Internet. We find that binary pass/fail evaluation is inadequate for long-horizon settings and introduce a rubric-based evaluation, annotating each Odysseys task with an average of 6.1 graded rubrics. We demonstrate that this yields higher agreement with humans and provides a more fine-grained signal than commonly used trajectory-level LLM-as-a-judge evaluation metrics. We test several leading frontier models and find that the strongest models achieve a success rate of 44.5%, which leaves substantial room for improvement. Beyond task success, we argue that efficiency is a first-class concern for long-horizon agents. We introduce a Trajectory Efficiency metric (rubric score per step) and find that even frontier agents achieve only 1.15%, marking an evident need for agents that can succeed efficiently and not simply eventually. Odysseys isolates the critical evaluation of long-horizon proficiency in open-web environments, providing a realistic benchmark to measure progress towards computer-use agents that can potentially operate productively for hours. We release our tasks, evaluation scripts, and other results at https://odysseys-website.pages.dev
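
The abstract describes Trajectory Efficiency only as "rubric score per step". A minimal sketch of one way to compute such a metric follows. It assumes that each rubric is judged as simply satisfied or not, and that the rubric score is the percentage of satisfied rubrics. The abstract does not specify either detail, and the paper's graded rubrics may allow partial credit. The names `Rubric`, `rubric_score`, and `trajectory_efficiency` are hypothetical and do not come from the Odysseys evaluation scripts.

```python
from dataclasses import dataclass


@dataclass
class Rubric:
    """One evaluation criterion for a task (hypothetical structure)."""
    description: str
    satisfied: bool  # assumption: binary judgment per rubric


def rubric_score(rubrics: list[Rubric]) -> float:
    """Percentage of rubrics satisfied (assumed definition of rubric score)."""
    if not rubrics:
        raise ValueError("a task needs at least one rubric")
    return 100.0 * sum(r.satisfied for r in rubrics) / len(rubrics)


def trajectory_efficiency(rubrics: list[Rubric], num_steps: int) -> float:
    """Rubric score divided by the number of agent steps taken."""
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    return rubric_score(rubrics) / num_steps


# Illustrative values only, not results from the paper.
example = [
    Rubric("compared prices on two sites", True),
    Rubric("selected the cheaper option", True),
    Rubric("reported shipping time", True),
    Rubric("checked return policy", False),
    Rubric("summarized findings", False),
    Rubric("stayed within budget", False),
]
score = rubric_score(example)                    # 3 of 6 rubrics satisfied
efficiency = trajectory_efficiency(example, 50)  # score spread over 50 steps
```

Under this definition, an agent that reaches the same rubric score in fewer steps gets a proportionally higher efficiency, which is the distinction the abstract draws between succeeding efficiently and succeeding eventually.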

representative citing papers

SentinelBench: A Benchmark for Long-Running Monitoring Agents

cs.AI · 2026-06-03 · unverdicted · novelty 7.0

SentinelBench is a new benchmark for time-evolving monitoring tasks in web environments, measuring task completion, reaction time, and resource use with baselines from three models and two harnesses.

Skim: Speculative Execution for Fast and Efficient Web Agents

cs.AI · 2026-05-15 · unverdicted · novelty 7.0

Skim profiles website patterns offline to enable fast-path speculative execution for web agents, cutting median cost by 1.9x and latency by 33.4% with no accuracy loss on benchmarks.

Beyond Domains: Reusing Web Skills via Transferable Interaction Patterns

cs.AI · 2026-06-16 · unverdicted · novelty 6.0

SkillMigrator reduces LLM-action counts by 8-10% on WebArena and Mind2Web by transferring web skills via layout-matched transferable interaction patterns.

citing papers explorer


SentinelBench: A Benchmark for Long-Running Monitoring Agents · cs.AI · 2026-06-03 · unverdicted · none · ref 9 · internal anchor
Skim: Speculative Execution for Fast and Efficient Web Agents · cs.AI · 2026-05-15 · unverdicted · none · ref 14 · internal anchor
Beyond Domains: Reusing Web Skills via Transferable Interaction Patterns · cs.AI · 2026-06-16 · unverdicted · none · ref 30 · internal anchor
