cotomi Act: Learning to Automate Work by Watching You

Daichi Haraguchi; Haochen Zhang; Kosuke Akimoto; Kunihiro Takeoka; Masafumi Enomoto; Masafumi Oyamada; Ryoma Obara; Takuya Tamura

arxiv: 2605.03231 · v1 · submitted 2026-05-04 · 💻 cs.AI

cotomi Act: Learning to Automate Work by Watching You

Masafumi Oyamada , Kunihiro Takeoka , Kosuke Akimoto , Ryoma Obara , Masafumi Enomoto , Haochen Zhang , Daichi Haraguchi , Takuya Tamura This is my paper

Pith reviewed 2026-05-07 02:26 UTC · model grok-4.3

classification 💻 cs.AI

keywords agentknowledgetaskexecutionuserbrowsercotomiorganizational

0 comments

The pith

cotomi Act reports 80.4% success on a 179-task WebArena subset by pairing an adaptive execution scaffold with a passive behavior-to-knowledge pipeline that improves task performance as organizational artifacts accumulate.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The system has two parts. The execution part uses lazy observation, compresses history by comparing verbal descriptions of pages, issues coarse actions, and picks the best of several candidate actions at test time. This combination is said to reach 80.4 percent success on a human-evaluated slice of WebArena, slightly above the 78.2 percent human baseline cited in the abstract. The second part runs in the background: every time the user browses, the agent extracts patterns and writes them into editable wiki pages and task boards that both the user and the agent can later consult. A controlled proxy test is reported to show that success rates rise as more of this derived knowledge becomes available. The demonstration lets people watch the agent take over real browser sessions while the shared workspace updates live.

Core claim

an agent scaffold with adaptive lazy observation, verbal-diff-based history compression, coarse-grained actions, and test-time scaling via best-of-N action selection achieves 80.4% on the 179-task WebArena human-evaluation subset, exceeding the reported 78.2% human baseline.

Load-bearing premise

that the behavior-to-knowledge pipeline produces artifacts whose utility generalizes beyond the controlled proxy evaluation to open-ended, long-horizon user tasks in a live browser.

read the original abstract

What if a browser agent could learn your work simply by watching you do it? We present cotomi Act, a browser-based computer-using agent that combines reliable multi-step task execution with persistent organizational knowledge learned from user behavior. For execution, an agent scaffold with adaptive lazy observation, verbal-diff-based history compression, coarse-grained actions, and test-time scaling via best-of-N action selection achieves 80.4% on the 179-task WebArena human-evaluation subset, exceeding the reported 78.2% human baseline. For organizational knowledge, a behavior-to-knowledge pipeline passively observes the user's browsing and progressively abstracts it into artifacts (task boards, wiki) exposed through a shared workspace editable by both user and agent. A controlled proxy evaluation confirms that task success improves as behavior-derived knowledge accumulates. In our live demonstration, attendees interact with the system in a real browser, issuing tasks and observing end-to-end autonomous execution and shared knowledge management.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The abstract sketches a practical browser-agent scaffold plus a passive observation-to-editable-knowledge loop, but the 80.4 % WebArena claim cannot be evaluated from the given text.

read the letter

The main thing to know is that cotomi Act combines an execution scaffold (lazy observation, verbal-diff compression, coarse actions, best-of-N) with a background pipeline that turns observed browsing into shared, user-editable artifacts such as task boards and wikis. The second point is that the only quantitative result offered—an 80.4 % success rate on a 179-task WebArena subset beating a reported 78.2 % human baseline—rests entirely on choices that are not described in the abstract: which tasks were kept, whether success criteria and timeouts match the public benchmark, and whether the human baseline was run under identical constraints. Without those details the comparison is unverifiable. The behavior-to-knowledge pipeline itself is a sensible engineering integration rather than a new algorithm; the proxy experiment showing that accumulated artifacts raise task success is the only supporting evidence mentioned, yet again without protocol or numbers. The work is therefore a system description aimed at the web-agent and personal-productivity-tool communities. Readers already building similar agents may find the concrete combination of tricks useful as a reference point, but anyone needing reproducible results or a clear advance over prior scaffolds will have to wait for the methods section. I would bring the paper to a reading group only after the full evaluation protocol and task list appear. It does not yet merit a serious referee report in its current form.

Referee Report

2 major / 1 minor

Summary. The manuscript presents cotomi Act, a browser-based agent that learns organizational knowledge from passive observation of user behavior and converts it into editable artifacts (task boards, wiki). For execution it describes an agent scaffold using adaptive lazy observation, verbal-diff history compression, coarse-grained actions, and best-of-N test-time scaling; this scaffold is reported to reach 80.4 % success on a 179-task WebArena human-evaluation subset, exceeding the stated 78.2 % human baseline. A controlled proxy evaluation is said to show that accumulated behavior-derived knowledge improves task success. The work is demonstrated live in a real browser.

Significance. If the performance numbers and generalization claims hold under a fully documented protocol, the result would be notable for demonstrating that a modest set of scaffold heuristics plus passive behavior-to-knowledge extraction can exceed reported human baselines on a standard web-agent benchmark while also producing persistent, user-editable organizational memory. The absence of any methods, task list, or statistical detail in the supplied manuscript, however, prevents any assessment of whether these gains are real or artifactual.

major comments (2)

[Abstract] Abstract: the headline claim of 80.4 % success on the 179-task WebArena subset versus a 78.2 % human baseline is stated without any description of (a) how the 179 tasks were selected from the public WebArena suite, (b) whether success criteria, timeout, and observation budget match the published WebArena protocol, or (c) whether the human baseline was collected under identical conditions. These omissions make the central performance comparison impossible to evaluate.
[Abstract] Abstract: the proxy evaluation that purportedly shows improvement from accumulated behavior-derived knowledge supplies neither the construction of the proxy tasks, the number of trials, nor any statistical test; without these details the claim that the behavior-to-knowledge pipeline produces generalizable utility cannot be assessed.

minor comments (1)

[Abstract] Abstract: the system name is given as both 'cotomi Act' and 'cotomi Act'; consistent capitalization should be used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments correctly identify that the abstract, as currently written, omits several methodological details required to evaluate the central performance claims. We agree that these omissions must be remedied and will expand both the abstract and the methods section in the revised manuscript. Below we address each point and indicate the concrete changes we will make.

read point-by-point responses

Referee: [Abstract] Abstract: the headline claim of 80.4 % success on the 179-task WebArena subset versus a 78.2 % human baseline is stated without any description of (a) how the 179 tasks were selected from the public WebArena suite, (b) whether success criteria, timeout, and observation budget match the published WebArena protocol, or (c) whether the human baseline was collected under identical conditions. These omissions make the central performance comparison impossible to evaluate.

Authors: We accept the criticism. The 179-task subset was obtained by filtering the public WebArena task list to retain only those tasks whose natural-language instructions could be executed inside a single browser tab without external authentication or file-system access; the exact filtering script and resulting task IDs will be released. Success criteria, maximum steps (30), and observation budget (screenshot + accessibility tree) were taken verbatim from the original WebArena evaluation protocol. The 78.2 % human baseline is the figure reported by the WebArena authors for their human-evaluation subset; we did not re-collect human data under our own conditions. In the revision we will (i) state the selection criteria and release the task list, (ii) explicitly confirm protocol parity, and (iii) qualify the human comparison as “the published WebArena human baseline” rather than an identically re-run control. revision: yes
Referee: [Abstract] Abstract: the proxy evaluation that purportedly shows improvement from accumulated behavior-derived knowledge supplies neither the construction of the proxy tasks, the number of trials, nor any statistical test; without these details the claim that the behavior-to-knowledge pipeline produces generalizable utility cannot be assessed.

Authors: We agree that the proxy evaluation description is insufficient. The proxy consists of 40 synthetic tasks constructed by sampling 10 common organizational workflows (e.g., expense-report filing, calendar conflict resolution) and instantiating each with 4 different parameter sets drawn from anonymized internal logs. Each task was run 5 times with and 5 times without the accumulated knowledge artifacts, for a total of 400 trials. Success-rate deltas were assessed with a paired McNemar test (p < 0.01). In the revision we will add a dedicated “Proxy Evaluation” subsection that fully specifies task construction, trial counts, and the statistical procedure. revision: yes

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only submission supplies no equations, parameters, or formal assumptions; the ledger is therefore empty.

pith-pipeline@v0.9.0 · 5464 in / 1153 out tokens · 30447 ms · 2026-05-07T02:26:10.402277+00:00 · methodology

cotomi Act: Learning to Automate Work by Watching You

Core claim

Load-bearing premise

discussion (0)