ColorBrowserAgent: Complex Long-Horizon Browser Agent with Adaptive Knowledge Evolution

Huarong Deng, Jiamu Zhou, Jihong Wang, Jun Wang, Teng Wang, Weiming Zhang, Weinan Zhang, Weiwen Liu, Xingyu Lou, Zhuosheng Zhang

classification 💻 cs.HC

keywords agent, colorbrowseragent, knowledge, long-horizon, automation, browser, challenges, complex

With the advancement of vision-language models, web automation has made significant progress. However, deploying autonomous agents in real-world settings remains challenging, primarily because of two problems: site heterogeneity, where generalist models lack domain-specific priors for diverse interfaces, and long-horizon instability, where decision drift accumulates over extended interactions. To address these challenges, we introduce ColorBrowserAgent (Complex Long-Horizon Browser Agent), a knowledge-evolving agent for robust web automation. It combines two synergistic mechanisms: human-in-the-loop knowledge adaptation, which transforms sparse human feedback into reusable domain knowledge, and knowledge-aligned progressive summarization, which stabilizes long interactions through memory compression. Extensive experiments on WebArena and WebChoreArena, together with an industrial deployment, show that ColorBrowserAgent consistently outperforms strong baselines. It achieves a state-of-the-art success rate of 71.2% on WebArena and maintains 47.4% performance under a zero-shot transfer setting on WebChoreArena. In commercial deployment, it improves user satisfaction by a relative 19.3%, verifying its robustness in real-world scenarios.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent Benchmark
cs.AI, 2026-04, unverdicted, novelty 7.0

WebForge is an automated multi-agent framework that creates realistic and reproducible browser agent benchmarks at scale, demonstrated via a 934-task benchmark that reveals distinct model capability profiles through m...