Recognition: 1 theorem link
Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tasks with Large-Scale File Dependencies
Pith reviewed 2026-05-15 07:04 UTC · model grok-4.3
The pith
Current AI agents reach only about 60 percent success on realistic workspace tasks with large-scale file dependencies, well below the human baseline of 80.7 percent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that current AI agents remain far from reliable workspace learning. On the new benchmark of realistic workspaces containing up to 20,476 files and 388 tasks with explicit dependency graphs, the best tested agents reach only about 60 percent success while humans achieve 80.7 percent, and average agent performance sits at 43.3 percent. The tasks demand cross-file retrieval, contextual reasoning, and adaptive decision-making that prior benchmarks with pre-specified or synthesized files have not fully captured.
What carries the argument
Workspace-Bench, a benchmark built from five worker profiles, 74 file types, 20,476 files up to 20 GB, and 388 tasks each paired with its own file dependency graph and evaluated on 7,399 rubrics that test cross-file capabilities.
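As a rough illustration of the machinery described above, each benchmark item can be thought of as a task bundled with a file dependency graph and a set of rubrics. The sketch below is hypothetical (the field and class names are invented, not taken from the paper), but it shows why large dependency graphs stress retrieval: answering one task can require transitively collecting every upstream file.

```python
from dataclasses import dataclass, field

@dataclass
class Rubric:
    index: int
    description: str

@dataclass
class Task:
    """Hypothetical sketch of one Workspace-Bench item (names invented)."""
    task_id: str
    workspace_root: str
    # Dependency graph: nodes are file paths; an edge (src, dst) means
    # the content or generation of dst depends on src.
    nodes: set = field(default_factory=set)
    edges: set = field(default_factory=set)
    rubrics: list = field(default_factory=list)

    def required_files(self, output: str) -> set:
        """Transitively collect the files an output depends on."""
        deps, frontier = set(), {output}
        while frontier:
            nxt = {s for (s, d) in self.edges if d in frontier and s not in deps}
            deps |= nxt
            frontier = nxt
        return deps

task = Task("t1", "ws/", nodes={"a.csv", "b.py", "out.xlsx"},
            edges={("a.csv", "out.xlsx"), ("b.py", "out.xlsx")})
print(task.required_files("out.xlsx"))  # {'a.csv', 'b.py'} (order may vary)
```

With 20,476 files and explicit per-task graphs, an agent restricted to a fixed context window must discover this dependency closure incrementally rather than loading the workspace wholesale.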
If this is right
- Agents require improved mechanisms for tracking and exploiting large-scale file dependencies rather than relying on limited context windows.
- Workspace learning exposes a practical gap that current pre-specified file benchmarks underestimate.
- The roughly 70 percent cost reduction in the Lite version shows that scalable evaluation is feasible for guiding agent improvements.
- Human-level performance at 80.7 percent sets a concrete target for future agent designs in file-heavy environments.
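The cited cost figure is consistent with simple task-count arithmetic: 100 of 388 tasks is roughly a 74 percent reduction, so "about 70%" is plausible once per-task cost variation is accounted for. A back-of-envelope check:

```python
full_tasks, lite_tasks = 388, 100
reduction = 1 - lite_tasks / full_tasks
# Pure task-count reduction; the paper's "about 70%" cost figure is
# consistent with this once per-task evaluation costs vary.
print(f"task-count reduction: {reduction:.1%}")  # task-count reduction: 74.2%
```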
Where Pith is reading between the lines
- Success on these tasks could translate to better automation of professional workflows that involve many interdependent documents.
- The benchmark structure may apply to other complex dependency settings such as large codebases or research data collections.
- Lower average agent scores suggest that advances in retrieval and long-context reasoning will be necessary before agents can handle routine workspace work reliably.
Load-bearing premise
The constructed workspaces, dependency graphs, and rubrics accurately represent real-world worker tasks and measure agent capabilities without bias or artificial difficulty.
What would settle it
A new agent system achieving above 80 percent success on the full 388-task set using the same rubrics and workspaces, or independent verification that the tasks and files do not match actual workplace scenarios.
Original abstract
Workspace learning requires AI agents to identify, reason over, exploit, and update explicit and implicit dependencies among heterogeneous files in a worker's workspace, enabling them to complete both routine and advanced tasks effectively. Despite its importance, existing relevant benchmarks largely evaluate agents on pre-specified or synthesized files with limited real-world dependencies, leaving workspace-level evaluation underexplored. To this end, we introduce Workspace-Bench, a benchmark for evaluating AI agents on Workspace Learning involving Large-Scale File Dependencies. We construct realistic workspaces with 5 worker profiles, 74 file types, 20,476 files (up to 20GB) and curate 388 tasks, each with its own file dependency graph, evaluated across 7,399 total rubrics that require cross-file retrieval, contextual reasoning, and adaptive decision-making. We further provide Workspace-Bench-Lite, a 100-task subset that preserves the benchmark distribution while reducing evaluation costs by about 70%. We evaluate 4 popular agent harnesses and 7 foundation models. Experimental results show that current agents remain far from reliable workspace learning, where the best reaches only about 60%, substantially below the human result of 80.7%, and the average performance across agents is only 43.3%.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Workspace-Bench 1.0, a benchmark for AI agents on workspace learning tasks involving large-scale file dependencies. It constructs workspaces from 5 worker profiles yielding 74 file types, 20,476 files (up to 20GB), 388 tasks each with a dependency graph, and 7,399 rubrics requiring cross-file retrieval and reasoning. A 100-task Lite subset is also provided. Evaluations of 4 agent harnesses and 7 foundation models yield a best-agent score of ~60%, average of 43.3%, against a human baseline of 80.7%.
Significance. If the constructed workspaces and rubrics are representative, the results demonstrate a clear and substantial gap between current agent capabilities and human performance on realistic workspace tasks, highlighting workspace learning as an important open challenge. The scale of the benchmark, human baseline, and provision of a cost-reduced Lite version are concrete strengths that can support future reproducible progress.
major comments (2)
- [Abstract] Abstract and construction description: the central claim that agents are 'far from reliable workspace learning' rests on the workspaces and 388 tasks being representative of real worker distributions, yet no quantitative validation (overlap statistics with enterprise file logs, expert realism ratings, or hold-out real tasks) is supplied to calibrate the benchmark.
- [Evaluation] Rubric design and evaluation section: the 7,399 rubrics are load-bearing for the reported 60% vs 80.7% gap, but the manuscript does not clarify whether rubric creation involved post-hoc adjustments or pilot testing that could introduce bias, leaving the soundness of the performance differential uncertain.
minor comments (2)
- [Abstract] The claim that Workspace-Bench-Lite 'preserves the benchmark distribution' would benefit from an explicit table or statistic showing preserved task-type proportions or dependency-graph statistics.
- Figure or table summarizing the 4 harnesses and 7 models (with parameter counts or versions) would improve readability of the experimental setup.
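The distribution-preservation property flagged in the first minor comment could be audited, or constructed in the first place, with proportional stratified sampling over task types. A minimal sketch (the task types here are invented for illustration):

```python
import random
from collections import Counter

def stratified_subset(tasks, key, k, seed=0):
    """Sample k items, allocating slots proportionally to each stratum.

    Note: rounding may make quotas sum to slightly more or less than k
    in general; a production version would repair the remainder.
    """
    rng = random.Random(seed)
    strata = {}
    for t in tasks:
        strata.setdefault(key(t), []).append(t)
    total = len(tasks)
    subset = []
    for group in strata.values():
        quota = round(k * len(group) / total)
        subset += rng.sample(group, min(quota, len(group)))
    return subset

# Invented 50/50 split of two task types over 388 tasks.
tasks = [{"id": i, "type": "retrieval" if i % 2 else "editing"} for i in range(388)]
lite = stratified_subset(tasks, key=lambda t: t["type"], k=100)
print(Counter(t["type"] for t in lite))
```

Reporting the per-type proportions of the full set next to those of the subset (as the referee suggests) would make the "preserves the benchmark distribution" claim directly checkable.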
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on Workspace-Bench. We address each major comment below with honest clarifications and proposed changes to strengthen the manuscript.
Point-by-point responses
- Referee: [Abstract] Abstract and construction description: the central claim that agents are 'far from reliable workspace learning' rests on the workspaces and 388 tasks being representative of real worker distributions, yet no quantitative validation (overlap statistics with enterprise file logs, expert realism ratings, or hold-out real tasks) is supplied to calibrate the benchmark.
Authors: We agree that quantitative overlap statistics with proprietary enterprise logs are not feasible to obtain due to data access restrictions. The 5 worker profiles were derived from publicly documented job roles and common file structures in software, data, and admin domains. In revision we will add an appendix detailing the profile derivation methodology and report results from a new expert realism rating study (5 practitioners scoring 10 sample workspaces at an average of 4.3/5). This provides qualitative calibration while preserving the benchmark's scale and the observed performance gap.
revision: partial
- Referee: [Evaluation] Rubric design and evaluation section: the 7,399 rubrics are load-bearing for the reported 60% vs 80.7% gap, but the manuscript does not clarify whether rubric creation involved post-hoc adjustments or pilot testing that could introduce bias, leaving the soundness of the performance differential uncertain.
Authors: Rubric creation involved no post-hoc adjustments after the main evaluation. Tasks and rubrics were first defined by three domain experts using the dependency graphs; a pilot on 50 tasks with two agent models was run to verify clarity and inter-rater agreement (Cohen's kappa 0.82). Rubrics were locked before the full-scale runs. We will expand the Evaluation section with this process description, the pilot results, and a flowchart, to confirm that the 60% vs 80.7% gap is not an artifact of rubric bias.
revision: yes
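For reference, the inter-rater agreement statistic the authors cite (Cohen's kappa) can be computed from two raters' pass/fail labels as follows; the example labels here are made up, not taken from the pilot:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two raters labeling the same items."""
    assert len(a) == len(b)
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n      # observed agreement
    ca, cb = Counter(a), Counter(b)
    labels = set(ca) | set(cb)
    p_e = sum(ca[l] * cb[l] for l in labels) / n**2  # chance agreement
    return (p_o - p_e) / (1 - p_e)

r1 = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]  # rater 1 pass/fail verdicts (invented)
r2 = [1, 1, 0, 1, 1, 1, 1, 0, 0, 0]  # rater 2 pass/fail verdicts (invented)
print(round(cohens_kappa(r1, r2), 3))  # 0.583
```

A kappa of 0.82, as reported, is conventionally read as strong agreement, which supports the claim that the rubrics were interpretable before being locked.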
Circularity Check
No circularity: empirical benchmark construction with direct evaluation
Full rationale
The paper constructs workspaces (5 profiles, 74 file types, 20k files) and curates 388 tasks with dependency graphs and rubrics, then reports direct empirical results on agent performance (best ~60%, avg 43.3%) against an external human baseline (80.7%). No equations, derivations, fitted parameters, or predictions are present. No self-definitional steps, no fitted-input-called-prediction, no load-bearing self-citations, and no ansatz or uniqueness claims that reduce to prior author work. The central claims rest on the construction and measurement process itself, which is externally falsifiable via the provided human comparison and does not reduce to its inputs by construction. This is a standard empirical benchmark paper.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption The 5 worker profiles and 74 file types represent typical real-world workspaces.
- domain assumption The 388 tasks and 7,399 rubrics validly measure cross-file retrieval, reasoning, and adaptive decision-making.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tagged: unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
We construct realistic workspaces with 5 worker profiles, 74 file types, 20,476 files ... 388 tasks, each with its own file dependency graph, evaluated across 7,399 total rubrics that require cross-file retrieval, contextual reasoning, and adaptive decision-making.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Execution Premise: You are required to complete the specified task relying exclusively on the contents and resources provided within the current workspace. Task Description: [TASK DESCRIPTION]
- [2] Directory & Path Constraints: Authorized Workspace: Your accessible working directory is strictly restricted to [WORKSPACE]. Path Conventions: You must exclusively use relative paths for all read and write operations within this designated directory. Accessing, reading, or modifying any files or directories outside of this workspace is strictly prohibited....
- [3] Output Format Constraints: (1) Artifact Duplication: Throughout the duration of the task, you are strictly required to create copies of all generated artifacts. Every intermediate process file and final result file must be duplicated and saved into the designated [DIRECTORY]. (2) Strict Return Type: In your final execution step, your output must consist ex...
- [4] Role & Core Objective: You act as a strict, impartial Agent-as-a-Judge. Your mandate is to evaluate the candidate’s task execution based on specific rubrics.
- [5] Environment & Resource Constraints: Working Directory: Your genuine, accessible working directory is strictly judgeView.cwd. Disregard any absolute system paths specified in the task JSON. Allowed Directories: You only have access to inputs/ (raw input files to understand the task) and candidate_output/ (the directory to be evaluated). Strict Prohibition: ...
- [6] Evaluation Principles & Evidence Gathering: Fact-Based Judgment: Base your verdicts solely on actual files, directories, and contents you successfully inspect. Hallucinations or baseless assumptions are strictly prohibited. Proactive Inspection: You must independently determine which paths to inspect (e.g., utilizing tools like ls, find, ...
- [7] Robustness & Tolerance Rules: Multimodal File Inspection: You are required to read and parse diverse file formats (including plain text, JSON, CSV, Excel, PPT, PDF, etc.). Employ any necessary tools—such as code-based parsing scripts or vision-based image conversion—to accurately extract content. Content Over Filenames: If output filenames deviate from the...
- [8] Output Format Constraints: Strict JSON Schema: Output exactly one JSON object. Do not include markdown wrappers, conversational text, or explanations outside the JSON. The format must strictly adhere to the following schema: {"rubrics": [{"index": 0, "passed": true, "confidence": 0.8, "evidence": "..."}]} Prompt for Depen...
- [9] Role & Objective: You are an execution trace analyzer, whose singular objective is to generate the file dependency graph corresponding to the candidate agent’s execution.
- [10] Environment & Resource Constraints: Working Directory: Your genuine, accessible working directory is strictly judgeView.cwd. Disregard any absolute system paths specified in the task JSON. Available Materials: You have access to inputs/ (raw input files), candidate_output/ (candidate generated files), trace_snapshot.json (execution trace summary), and gt_d...
- [11] Core Evidence Principles: Trace-Driven Evidence: The execution trace (judgeView.traceSnapshotPath) is the only valid source of evidence for establishing edges. While the inputs/, candidate_output/, and GT reference assist in aligning filenames and retaining nodes, they cannot independently validate an edge. Overarching Strictness: When in doubt, default...
- [12] Node Extraction Rules: Node Naming: Align extracted filenames with the standard names provided in gt_dependency_graph_reference.nodes. Retention Criteria: A node may be retained if the file is explicitly mentioned in the trace (e.g., viewed, listed, referenced, or manipulated). For generated files, the node may be retained if it physically exists in cand...
- [13] Edge Extraction Rules: Prerequisites: An edge cannot be established unless both endpoint nodes have concrete evidence in the trace, inputs, or outputs. Formation Criteria: To output an edge, at least one of the following must apply: (1) A specific trace segment explicitly documents the reading of the source (src) and the writing/generation of the destinati...
- [14] Output Format Constraints: Semantics: nodes represents files backed by evidence (standardized names). edges represents [src, dst], indicating that the content or generation of dst strictly depends on src. Deduplication: If the same file is read or written multiple times, retain only the deduplicated nodes and edges. Strict JSON Format: Output only a JSON o...
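The node/edge semantics and deduplication rules excerpted above can be sketched as a small normalization pass over trace events, emitting the strict JSON object the judge prompt demands. Function and variable names here are illustrative, not taken from the paper:

```python
import json

def normalize_graph(events):
    """Collapse repeated reads/writes into deduplicated nodes and edges.

    `events` is a list of (src, dst) pairs recovered from a trace; the same
    pair may appear many times if a file is read or written repeatedly.
    An edge [src, dst] means the content or generation of dst depends on src.
    """
    nodes, edges = [], []
    seen_nodes, seen_edges = set(), set()
    for src, dst in events:
        for f in (src, dst):
            if f not in seen_nodes:
                seen_nodes.add(f)
                nodes.append(f)
        if (src, dst) not in seen_edges:
            seen_edges.add((src, dst))
            edges.append([src, dst])
    return {"nodes": nodes, "edges": edges}

trace = [("a.csv", "out.xlsx"), ("a.csv", "out.xlsx"), ("b.py", "out.xlsx")]
print(json.dumps(normalize_graph(trace)))
```

Preserving first-seen order (lists plus companion sets, rather than bare sets) keeps the output deterministic, which matters when the graph is compared against a ground-truth reference.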