pith. machine review for the scientific record.

arxiv: 2505.19662 · v3 · submitted 2025-05-26 · 💻 cs.AI · cs.CV


FieldWorkArena: Agentic AI Benchmark for Real Field Work Tasks

classification 💻 cs.AI cs.CV
keywords: agentic, evaluation, real-world, fieldworkarena, performance, tasks, work, benchmark
abstract

This paper introduces FieldWorkArena, a benchmark for agentic AI targeting real-world field work. With the recent rise in demand for agentic AI, such agents are being built to detect and document safety hazards, procedural violations, and other critical incidents in real-world manufacturing and retail environments. Whereas most agentic AI benchmarks focus on performance in simulated or digital environments, this work addresses the fundamental challenge of evaluating agents in the real world. The authors improve on previous evaluation functions to assess the performance of agentic AI across diverse real-world tasks. The dataset comprises images and videos captured on site in factories, warehouses, and retail stores, and the tasks were developed through interviews with site workers and managers. Evaluation results confirm that performance evaluation accounting for the characteristics of multimodal LLMs (MLLMs) such as GPT-4o is feasible. Furthermore, the study identifies both the effectiveness and the limitations of the proposed evaluation methodology. The complete dataset and evaluation program are publicly accessible at https://en-documents.research.global.fujitsu.com/fieldworkarena/

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work, sorted by Pith novelty score.

  1. Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack

    cs.AI 2026-05 conditional novelty 7.0

    BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.