pith. machine review for the scientific record. sign in

arxiv: 2605.13625 · v1 · submitted 2026-05-13 · 💻 cs.AI

Recognition: unknown

How to Interpret Agent Behavior

Authors on Pith no claims yet

Pith reviewed 2026-05-14 18:19 UTC · model grok-4.3

classification 💻 cs.AI
keywords agent behaviortaxonomyautonomous agentstrajectory analysisgrounded theoryfailure modesagent oversightbehavioral profiles
0
0 comments X

The pith

ACTONOMY taxonomy structures agent behavior into 10 actions and 120 categories for consistent interpretation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops ACTONOMY, a hierarchical taxonomy for analyzing the runtime behavior of autonomous agents from their trajectories. Using Grounded Theory, it defines 10 actions, 46 subactions, and 120 leaf categories to describe what agents do. An accompanying open repository includes an automated pipeline to apply the taxonomy to traces and a protocol for extensions. Experiments demonstrate that this approach allows comparison of behavioral profiles between different agents and characterization of one agent's performance across varied trajectories, revealing indicators of failure modes. The goal is to give researchers, designers, and users a shared vocabulary to better understand and oversee long-running agents.

Core claim

ACTONOMY is a taxonomy developed through Grounded Theory with a three-level hierarchy consisting of 10 actions, 46 subactions, and 120 leaf categories. Paired with an open repository and automated analysis pipeline, it enables systematic description of agent trajectories. The taxonomy supports comparing behavioral profiles across agents and characterizing a single agent's behavior over diverse trajectories to surface patterns linked to failure modes.

What carries the argument

The ACTONOMY taxonomy, a three-level hierarchy of agent actions developed via Grounded Theory, that provides standardized categories for interpreting unstructured natural-language trajectories.

Load-bearing premise

A taxonomy created from Grounded Theory on a limited set of agent traces will stay comprehensive and unbiased for new agents, tasks, and longer trajectories.

What would settle it

If new agent trajectories require introducing many categories outside the existing 120 and beyond what the extension protocol can handle without major revision, the claim of broad applicability would be falsified.

Figures

Figures reproduced from arXiv: 2605.13625 by Daniel Khashabi, Heyuan Huang, Jen-tse Huang, Jie Gao, Kaiser Sun, Katherine Van Koevering, Mark Dredze, Sijie Ji, Weiyan Shi, Zhuoran Lu, Ziang Xiao.

Figure 1
Figure 1. Figure 1: Why do we need Act·ONOMY? Act·ONOMY can be used to label agent trajectories with human-readable action tags; we use a 13-turn SWE-bench trajectory as a running ex￾ample. Top: A phase overview of the trajectory on pylint-dev/pylint-5859, with color-coded regions marking distinct turns. Middle: Three pivotal turns annotated with Act·ONOMY sub-action tags: Turn 4 (confirm) verifies the bug and pivots to code … view at source ↗
Figure 2
Figure 2. Figure 2: The Act·ONOMY: 10 main actions and 46 subactions. Within each category, sub-actions are ordered by descending frequency. Italicized rows (freq. “—”) marking sub-actions retained by theoretical motivation but not yet observed in the construction corpus. Freq. is the share of paper￾grounded behavior-description sentences (n=120) drawn from the paper construction set. • Automatic analysis tool. We provide an … view at source ↗
Figure 3
Figure 3. Figure 3: An at-a-glance map of Act·ONOMY. To build a shared vocabulary and con￾ceptual framework for describing and an￾alyzing agent behavior, we constructed Act·ONOMY through a grounded theory approach. Because the taxonomy is our primary contribution, we present it first and detail our methodology in §3. 2.1 Taxonomy Overview Act·ONOMY organizes agent behaviors into 10 main actions, 46 subactions, and 120 leaf in… view at source ↗
Figure 4
Figure 4. Figure 4: Action co-occurrence. To examine how Act·ONOMY generalizes be￾yond its construction set, we applied it to 3,455 behavioral descriptions automatically extracted from 211 behavior-related agent papers curated by the awesome-language-agents GitHub list,5 spanning safety, evaluation, software en￾gineering, computer use, and web automation. Two patterns stand out. (1) Frequency is top-heavy at the Action level … view at source ↗
Figure 5
Figure 5. Figure 5: An Overview of our Grounded Theory [7] pipeline to construct Act· [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Three agents show distinct behavioral profiles. (a) Action distribution for each agent; [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Two SWE-agent trajectories produce contrasting behavioral shapes. Stacked bars show per-turn Act·ONOMY categories assigned by Automated-Trace-Analysis-Tool, accompa￾nied by its automatically generated natural-language session summaries. The callout zooms in on the leaf-level, quote-grounded labels that pinpoint specific behaviors driving the agent’s decision. agent acknowledges the need for verification, r… view at source ↗
Figure 8
Figure 8. Figure 8: Large-scale distribution and co-occurrence of Act· [PITH_FULL_IMAGE:figures/full_fig_p020_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Interface of Automated-Trace-Analysis-Tool. 22 [PITH_FULL_IMAGE:figures/full_fig_p022_9.png] view at source ↗
read the original abstract

Autonomous agents such as Claude Code and Codex now operate for hours or even days. Understanding their runtime behavior has become critical for downstream tasks such as diagnosing inefficiencies, fixing bugs, and ensuring better oversight. A primary way to gain this understanding is analyzing the reasoning trajectories and execution traces these agents generate. Yet such data remains in unstructured natural-language form, making it difficult for humans to interpret at scale. We introduce ACT*ONOMY (a combination of Action and Taxonomy), a taxonomy for describing and analyzing agent behavior at runtime. ACT*ONOMY has two components: (1) the taxonomy itself, developed through Grounded Theory and structured as a three-level hierarchy of 10 actions, 46 subactions, and 120 leaf categories; and (2) an open repository that hosts the living taxonomy, provides an automated analysis pipeline that applies it to agent trajectories analysis, and defines an extension protocol for customization and growth. Our experiments show that ACTONOMY can compare behavioral profiles across agents and characterize a single agent's behavior across diverse trajectories, surfacing patterns indicative of failure modes. By providing a shared vocabulary, ACT*ONOMY helps researchers, agent designers, and end users interpret agent behavior more consistently, enabling better oversight and control.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Circularity Check

0 steps flagged

No significant circularity; taxonomy is inductively derived

full rationale

The paper's central contribution is a taxonomy (ACTONOMY) constructed bottom-up via Grounded Theory from a set of agent traces, yielding a fixed three-level hierarchy of 10 actions, 46 subactions, and 120 leaf categories. This inductive process does not invoke fitted parameters, self-referential equations, uniqueness theorems, or self-citations as load-bearing premises. The subsequent experiments apply the resulting taxonomy to new trajectories for profiling and failure-mode detection, but the taxonomy itself is not defined in terms of those outputs; the derivation chain remains independent and externally falsifiable against additional traces. No steps match the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The taxonomy rests on the empirical claim that agent behavior can be usefully partitioned into the derived categories; no numerical free parameters or new physical entities are introduced.

axioms (1)
  • domain assumption Agent runtime behavior can be decomposed into a finite, hierarchical set of actions and subactions that are observable in natural-language trajectories.
    Invoked when the authors apply Grounded Theory to build the three-level structure.

pith-pipeline@v0.9.0 · 5545 in / 1214 out tokens · 71438 ms · 2026-05-14T18:19:50.443478+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

64 extracted references · 40 canonical work pages · 10 internal anchors

  1. [1]

    J. R. Anderson and C. Lebiere. The newell test for a theory of cognition.Behavioral and brain Sciences, 26(5):587–601, 2003

  2. [2]

    Zhan, Q., Fang, R., Panchal, H

    Y . Belinkov. Probing classifiers: Promises, shortcomings, and advances.Computational Linguistics, 48(1):207–219, Mar. 2022. doi: 10.1162/coli a 00422. URLhttps: //aclanthology.org/2022.cl-1.7/

  3. [3]

    V . P. Bhardwaj. Agentassay: Token-efficient regression testing for non-deterministic ai agent workflows, 2026. URLhttps://zenodo.org/doi/10.5281/zenodo.18842011

  4. [4]

    S. R. Bowman, J. Hyun, E. Perez, E. Chen, C. Pettit, S. Heiner, K. Luko ˇsi¯ut˙e, A. Askell, A. Jones, A. Chen, A. Goldie, A. Mirhoseini, C. McKinnon, C. Olah, D. Amodei, D. Amodei, D. Drain, D. Li, E. Tran-Johnson, J. Kernion, J. Kerr, J. Mueller, J. Ladish, J. Landau, K. Ndousse, L. Lovitt, N. Elhage, N. Schiefer, N. Joseph, N. Mercado, N. DasSarma, R. ...

  5. [5]

    Braun and V

    V . Braun and V . Clarke. Using thematic analysis in psychology.Qualitative research in psy- chology, 3(2):77–101, 2006

  6. [6]

    Why Do Multi-Agent LLM Systems Fail?

    M. Cemri, M. Z. Pan, S. Yang, L. A. Agrawal, B. Chopra, R. Tiwari, K. Keutzer, A. Parameswaran, D. Klein, K. Ramchandran, M. Zaharia, J. E. Gonzalez, and I. Stoica. Why do multi-agent llm systems fail?, 2025. URLhttps://arxiv.org/abs/2503.13657

  7. [7]

    Charmaz.Constructing grounded theory: A practical guide through qualitative analysis

    K. Charmaz.Constructing grounded theory: A practical guide through qualitative analysis. sage, 2006

  8. [8]

    K. Charmaz. Grounded theory.Qualitative psychology: A practical guide to research methods, 3(2015):53–84, 2015

  9. [9]

    Chong, H

    P. Chong, H. Abichandani, J. Shen, A. Ghosh, M. P. Moe, Y . Mai, and D. Dahlmeier. Talk, evaluate, diagnose: User-aware agent evaluation with automated error analysis, 2026. URL https://arxiv.org/abs/2603.15483

  10. [10]

    Clarke and V

    V . Clarke and V . Braun. Thematic analysis.The journal of positive psychology, 12(3):297–298, 2017

  11. [11]

    org/posts/baJyjpktzmcmRfosq/ stitching-saes-of-different-sizes

    A. Conmy, A. N. Mavor-Parker, A. Lynch, S. Heimersheim, and A. Garriga-Alonso. Towards automated circuit discovery for mechanistic interpretability, 2023. URLhttps://arxiv.or g/abs/2304.14997

  12. [12]

    Cunningham, A

    H. Cunningham, A. Ewart, L. Riggs, R. Huben, and L. Sharkey. Sparse autoencoders find highly interpretable features in language models, 2023. URLhttps://arxiv.org/abs/23 09.08600

  13. [13]

    Deshpande, V

    D. Deshpande, V . Gangal, H. Mehta, J. Krishnan, A. Kannappan, and R. Qian. Trail: Trace reasoning and agentic issue localization, 2025. URLhttps://arxiv.org/abs/2505.086 38

  14. [14]

    Desmond, J

    M. Desmond, J. Y . Lee, I. Ibrahim, J. M. Johnson, A. Sil, J. MacNair, and R. Puri. Agent trajectory explorer: Visualizing and providing feedback on agent trajectories. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 29634–29636, 2025

  15. [15]

    Fischer and C

    T. Fischer and C. Biemann. Exploring large language models for qualitative data analysis. In M. H ¨am¨al¨ainen, E. ¨Ohman, S. Miyagawa, K. Alnajjar, and Y . Bizzoni, editors,Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities, pages 423–437, Miami, USA, Nov. 2024. Association for Computational Linguistics....

  16. [16]

    J. Gao, Y . Guo, G. Lim, T. Zhang, Z. Zhang, T. J.-J. Li, and S. T. Perrault. Collabcoder: a lower-barrier, rigorous workflow for inductive collaborative qualitative analysis with large language models. InProceedings of the 2024 CHI conference on human factors in computing systems, pages 1–29, 2024

  17. [17]

    C. Hu, L. Zhang, Y . Lim, A. Wadhwani, A. Peters, and D. Kang. Repro-bench: Can agentic ai systems assess the reproducibility of social science research?, 2025. URLhttps://arxiv. org/abs/2507.18901

  18. [18]

    X. Hu, Z. Zhao, S. Wei, Z. Chai, Q. Ma, G. Wang, X. Wang, J. Su, J. Xu, M. Zhu, Y . Cheng, J. Yuan, J. Li, K. Kuang, Y . Yang, H. Yang, and F. Wu. Infiagent-dabench: Evaluating agents on data analysis tasks, 2024. URLhttps://arxiv.org/abs/2401.05507

  19. [19]

    F. Jia, Z. Ye, S. Lai, K. Shu, J. Gu, A. Bibi, Z. Hu, D. Jurgens, J. Evans, P. H. Torr, et al. Can large language model agents simulate human trust behavior?Advances in neural information processing systems, 37:15674–15729, 2024

  20. [20]

    MedVR: Annotation-Free Medical Visual Reasoning via Agentic Reinforcement Learning

    Z. Jiang, H. Guo, C. Fang, C. Xiao, X. Hu, L. Sun, and M. Xu. Medvr: Annotation-free medical visual reasoning via agentic reinforcement learning, 2026. URLhttps://arxiv.or g/abs/2604.08203

  21. [21]

    C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan. Swe-bench: Can language models resolve real-world github issues? InInternational Conference on Learn- ing Representations, volume 2024, pages 54107–54157, 2024

  22. [22]

    Siegel, Nitya Nadgir, and Arvind Narayanan

    S. Kapoor and A. Narayanan. Ai agents that matter.arXiv preprint arXiv:2407.01502, 2024

  23. [23]

    T. Kwa, B. West, J. Becker, A. Deng, K. Garcia, M. Hasin, S. Jawhar, M. Kinniment, N. Rush, S. V . Arx, R. Bloom, T. Broadley, H. Du, B. Goodrich, N. Jurkovic, L. H. Miles, S. Nix, T. Lin, N. Parikh, D. Rein, L. J. K. Sato, H. Wijk, D. M. Ziegler, E. Barnes, and L. Chan. Measuring ai ability to complete long software tasks, 2026. URLhttps://arxiv.org/abs/...

  24. [24]

    K. Li, J. Shi, Y . Xiao, M. Jiang, J. Sun, Y . Wu, D. Fu, S. Xia, X. Cai, T. Xu, et al. Agencybench: Benchmarking the frontiers of autonomous agents in 1m-token real-world contexts.arXiv preprint arXiv:2601.11044, 2026

  25. [25]

    R. Li, J. Xiong, X. He, J. Zhao, J. Lv, H. Fang, L. Qi, and X. Wang. Chathls: Towards systematic design automation and optimization for high-level synthesis, 2026. URLhttps: //arxiv.org/abs/2507.00642

  26. [26]

    X. Li, J. Gao, S. Lin, X. Zhou, C. Zhang, B. Cheng, J. Han, and B. Wang. Human or machine? a preliminary turing test for speech-to-speech interaction, 2026. URLhttps://arxiv.org/ abs/2602.24080

  27. [27]

    J. Liao, Y . Feng, Y . Zheng, J. Zhao, S. Wang, and J. Zheng. My words imply your opinion: Reader agent-based propagation enhancement for personalized implicit emotion analysis. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16156–16172, 2025

  28. [28]

    J. Liu, C. Huang, Z. Guan, W. Lei, and Y . Deng. E2edev: Benchmarking large language models in end-to-end software development task, 2025

  29. [29]

    W. Liu, S. An, J. Lu, M. Wu, T. Li, X. Wang, C. Lv, X. Zheng, D. Yin, X. Sun, and X. Huang. Tell me what you don’t know: Enhancing refusal capabilities of role-playing agents via representation space analysis and editing. In W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, editors,Findings of the Association for Computational Linguistics: ACL 2025, pag...

  30. [30]

    X. Liu, H. Yu, H. Zhang, Y . Xu, X. Lei, H. Lai, Y . Gu, H. Ding, K. Men, K. Yang, et al. Agent- bench: Evaluating llms as agents. InInternational Conference on Learning Representations, volume 2024, pages 52989–53046, 2024. 11

  31. [31]

    C. Lu, C. Lu, R. T. Lange, J. Foerster, J. Clune, and D. Ha. The ai scientist: Towards fully automated open-ended scientific discovery, 2024. URLhttps://arxiv.org/abs/2408.0 6292

  32. [32]

    C. Ma, J. Zhang, Z. Zhu, C. Yang, Y . Yang, Y . Jin, Z. Lan, L. Kong, and J. He. Agent- board: An analytical evaluation board of multi-turn llm agents. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural Infor- mation Processing Systems, volume 37, pages 74325–74362. Curran Associates, Inc., 202...

  33. [33]

    W. Ma, Y . Yang, Q. Hu, S. Ying, Z. Jin, B. Du, Z. Xing, T. Li, J. Shi, Y . Liu, and L. Jiang. Re- thinking testing for llm applications: Characteristics, challenges, and a lightweight interaction protocol, 2025. URLhttps://arxiv.org/abs/2508.20737

  34. [34]

    arXiv preprint arXiv:2502.04358 , year=

    E. Meyerson and X. Qiu. Position: Scaling llm agents requires asymptotic analysis with llm primitives, 2025. URLhttps://arxiv.org/abs/2502.04358

  35. [35]

    Newell.Unified theories of cognition

    A. Newell.Unified theories of cognition. Harvard University Press, 1994

  36. [36]

    J. Ou, A. Uzunoglu, B. V . Durme, and D. Khashabi. Worldapis: The world is worth how many apis? a thought experiment, 2025. URLhttps://arxiv.org/abs/2407.07778

  37. [37]

    Parfenova, A

    A. Parfenova, A. Marfurt, J. Pfeffer, and A. Denzler. Text annotation via inductive coding: Comparing human experts to LLMs in qualitative data analysis. In L. Chiruzzo, A. Ritter, and L. Wang, editors,Findings of the Association for Computational Linguistics: NAACL 2025, pages 6471–6484, Albuquerque, New Mexico, Apr. 2025. Association for Computational L...

  38. [38]

    D. Paul, D. Murphy, M. Gritta, R. Cardenas, V . Prokhorov, L. S. Bolliger, A. Toker, R. Miles, A.-M. Oncescu, J. A. Sivakumar, P. Borchert, I. Elezi, M. Zhang, K. Y . Lee, G. Zhang, J. Wang, and G. Lampouras. A benchmark for deep information synthesis, 2026. URLhttps://arxi v.org/abs/2602.21143

  39. [39]

    H. N. Phan, T. N. Nguyen, P. X. Nguyen, and N. D. Q. Bui. Hyperagent: Generalist software engineering agents to solve coding tasks at scale, 2025. URLhttps://arxiv.org/abs/24 09.16299

  40. [40]

    C. Qin, X. Feng, W. Ma, X. Feng, and L. Kong. Implicitmembench: Measuring unconscious behavioral adaptation in large language models, 2026. URLhttps://arxiv.org/abs/26 04.08064

  41. [41]

    Rahwan, M

    I. Rahwan, M. Cebrian, N. Obradovich, J. Bongard, J.-F. Bonnefon, C. Breazeal, J. W. Cran- dall, N. A. Christakis, I. D. Couzin, M. O. Jackson, N. R. Jennings, E. Kamar, I. M. Kloumann, H. Larochelle, D. Lazer, R. McElreath, A. Mislove, D. C. Parkes, A. S. Pentland, M. E. Roberts, A. Shariff, J. B. Tenenbaum, and M. Wellman. Machine behaviour.Nature, 568 ...

  42. [42]

    T. Rehan. Test-driven ai agent definition (tdad): Compiling tool-using agents from behavioral specifications, 2026. URLhttps://arxiv.org/abs/2603.08806

  43. [43]

    M. T. Ribeiro, T. Wu, C. Guestrin, and S. Singh. Beyond accuracy: Behavioral testing of nlp models with checklist, 2020. URLhttps://arxiv.org/abs/2005.04118

  44. [44]

    R. D. Santi, F. A. Joseph, N. Liniger, M. Mutti, and A. Krause. Geometric active exploration in markov decision processes: the benefit of abstraction, 2024. URLhttps://arxiv.org/ abs/2407.13364

  45. [45]

    H. Shi, Z. Sun, X. Yuan, M.-A. C ˆot´e, and B. Liu. OPEx: A component-wise analysis of LLM-centric agents in embodied instruction following. In L.-W. Ku, A. Martins, and V . Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Com- putational Linguistics (Volume 1: Long Papers), pages 622–636, Bangkok, Thailand, Aug

  46. [46]

    doi: 10.18653/v1/2024.acl-long.37

    Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.37. URL https://aclanthology.org/2024.acl-long.37/. 12

  47. [47]

    Shinn, F

    N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao. Reflexion: Language agents with verbal reinforcement learning.Advances in Neural Information Processing Systems, 36, 2023

  48. [48]

    Y . Song, D. Yin, X. Yue, J. Huang, S. Li, and B. Y . Lin. Trial and error: Exploration-based trajectory optimization of LLM agents. In L.-W. Ku, A. Martins, and V . Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7584–7600, Bangkok, Thailand, Aug. 2024. Association ...

  49. [49]

    Sreedhar and L

    K. Sreedhar and L. Chilton. Simulating human strategic behavior: Comparing single and multi-agent llms. 2024. URLhttps://arxiv.org/abs/2402.08189

  50. [50]

    T. R. Sumers, S. Yao, K. Narasimhan, and T. L. Griffiths. Cognitive architectures for language agents, 2024. URLhttps://arxiv.org/abs/2309.02427

  51. [51]

    Triantafyllou, A

    S. Triantafyllou, A. Sukovic, D. Mandal, and G. Radanovic. Agent-specific effects: A causal effect propagation analysis in multi-agent mdps, 2024. URLhttps://arxiv.org/abs/23 10.11334

  52. [52]

    Vernon, C

    D. Vernon, C. von Hofsten, and L. Fadiga. Desiderata for developmental cognitive architec- tures.Biologically Inspired Cognitive Architectures, 18:116–127, 2016

  53. [53]

    K. Wang, A. Variengien, A. Conmy, B. Shlegeris, and J. Steinhardt. Interpretability in the wild: a circuit for indirect object identification in gpt-2 small, 2022. URLhttps://arxiv.org/ abs/2211.00593

  54. [54]

    X. Wang, B. Li, Y . Song, F. F. Xu, X. Tang, M. Zhuge, J. Pan, Y . Song, B. Li, J. Singh, H. H. Tran, F. Li, R. Ma, M. Zheng, B. Qian, Y . Shao, N. Muennighoff, Y . Zhang, B. Hui, J. Lin, R. Brennan, H. Peng, H. Ji, and G. Neubig. Openhands: An open platform for ai software developers as generalist agents, 2025. URLhttps://arxiv.org/abs/2407.16741

  55. [55]

    Y . Wang, R. Xu, K. Zheng, T. Zhang, J. N. Kogundi, S. Hans, and V . Ustun. Gameplayqa: A benchmarking framework for decision-dense pov-synced multi-video understanding of 3d virtual agents, 2026. URLhttps://arxiv.org/abs/2603.24329

  56. [56]

    Q. Wu, G. Bansal, J. Zhang, Y . Wu, B. Li, E. Zhu, L. Jiang, X. Zhang, S. Zhang, J. Liu, A. H. Awadallah, R. W. White, D. Burger, and C. Wang. Autogen: Enabling next-gen llm applications via multi-agent conversation, 2023. URLhttps://arxiv.org/abs/2308.081 55

  57. [57]

    Y . Xiao, J. Liu, Y . Zheng, X. Xie, J. Hao, M. Li, R. Wang, F. Ni, Y . Li, J. Luo, S. Jiao, and J. Peng. Cellagent: An llm-driven multi-agent framework for automated single-cell data analysis, 2024. URLhttps://arxiv.org/abs/2407.09811

  58. [58]

    Y .-A. Xiao, P. Gao, C. Peng, and Y . Xiong. Reducing cost of llm agents with trajectory reduc- tion. 2026. doi: https://doi.org/10.1145/3797084. URLhttps://arxiv.org/abs/2509.2 3586

  59. [59]

    H. Yang, J. Liu, C. Huang, F. Wu, W. Lei, and S.-K. Ng. Metro: Towards strategy induction from expert dialogue transcripts for non-collaborative dialogues, 2026. URLhttps://arxi v.org/abs/2604.11427

  60. [60]

    J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press. Swe- agent: Agent-computer interfaces enable automated software engineering.Advances in Neural Information Processing Systems, 37:50528–50652, 2024

  61. [61]

    S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y . Cao. React: Synergizing reasoning and acting in language models.arXiv preprint arXiv:2210.03629, 2023

  62. [62]

    S. Yi, J. Nguyen, H. Xu, T. Lim, A. Well, M. Markey, and Y . Ding. Auto-ta: Towards scalable automated thematic analysis (ta) via multi-agent large language models with reinforcement learning, 2025. URLhttps://arxiv.org/abs/2506.23998. 13

  63. [63]

    A. Zhou, K. Yan, M. Shlapentokh-Rothman, H. Wang, and Y .-X. Wang. Language agent tree search unifies reasoning acting and planning in language models. 2024. URLhttps: //arxiv.org/abs/2310.04406

  64. [64]

    WebArena: A Realistic Web Environment for Building Autonomous Agents

    S. Zhou, F. F. Xu, H. Zhu, X. Zhou, R. Lo, A. Sridhar, X. Cheng, T. Ou, Y . Bisk, D. Fried, et al. Webarena: A realistic web environment for building autonomous agents.arXiv preprint arXiv:2307.13854, 2023. 14 Contents A Limitations 16 B Broader Impacts 16 C Corpus and Annotation Details 16 D Discovery Qualitative Analyst 17 E Codebook Evolution 18 F Larg...