TraceGraph: Shared Decision Landscapes for Diagnosing and Improving Agent Trajectories

Ge Zhang; Junjie Nian; Kang Chen; Yixin Cao; Yugang Jiang

arxiv: 2605.31308 · v1 · pith:JBEWQHBInew · submitted 2026-05-29 · 💻 cs.AI

Junjie Nian, Kang Chen, Ge Zhang, Yixin Cao, Yugang Jiang

Pith reviewed 2026-06-28 22:20 UTC · model grok-4.3

classification 💻 cs.AI

keywords: agent trajectories, decision landscapes, trap regions, SWE-bench, recovery policies, graph framework, process evaluation, agent benchmarks

The pith

TraceGraph builds shared graphs over agent trajectories to identify trap regions and apply recovery policies that raise resolved rates on SWE-bench.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TraceGraph to convert pooled multi-model agent rollouts into graphs of observable action-observation states. These graphs overlay productive cores and trap regions and tag each trajectory with Access, Trap exposure, and Repair events. The resulting landscapes expose navigation differences across models that aggregate scores hide and distinguish benchmarks by whether they reward trap avoidance or recovery. The same structure supplies a runtime detector for historical trap states that triggers lightweight continuation policies, raising official resolved rates from 40.4 percent to 43.5 percent on per-provider fired subsets and from 41.0 percent to 44.8 percent on common-fired instances.

Core claim

TraceGraph builds a graph over observable action-observation states from pooled rollouts before model identity is introduced. It then overlays outcome-informed productive cores and trap regions and summarizes each rollout with three events: Access, Trap exposure, and Repair. Across five benchmark splits the graphs profile navigation differences hidden by aggregate scores and show that splits differ in whether they reward avoiding traps or recovering from them. The same landscape motivates a trap-aware recovery pipeline in which a runtime detector fires on states matching historical trap regions and lightweight continuation policies are evaluated from the same prefix, raising resolved rates o...

What carries the argument

TraceGraph, the graph constructed over pooled action-observation states with overlaid productive cores and trap regions that summarizes trajectories by Access, Trap exposure, and Repair events.
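A minimal sketch of the construction described above may help fix ideas. It assumes that a state is an exact (action, observation) pair, that trap and core roles come from per-state resolved rates with hypothetical thresholds (`core_min`, `trap_max`), and that Access means reaching the core, Trap exposure means entering a trap state, and Repair means reaching the core after a trap. None of these specifics are given in the abstract; the function names and toy data are illustrative only.

```python
from collections import defaultdict

def build_graph(rollouts):
    """Pool rollouts into one graph; node keys ignore which model produced them."""
    edges = defaultdict(int)              # (state, next_state) -> traversal count
    visits = defaultdict(lambda: [0, 0])  # state -> [resolved rollouts, failed rollouts]
    for _model, steps, resolved in rollouts:
        for state in set(steps):          # count each rollout once per state
            visits[state][0 if resolved else 1] += 1
        for src, dst in zip(steps, steps[1:]):
            edges[(src, dst)] += 1
    return edges, visits

def label_regions(visits, core_min=0.7, trap_max=0.4):
    """Overlay outcome-informed roles; both thresholds are hypothetical."""
    core, trap = set(), set()
    for state, (ok, bad) in visits.items():
        rate = ok / (ok + bad)
        if rate >= core_min:
            core.add(state)
        elif rate <= trap_max:
            trap.add(state)
    return core, trap

def summarize(steps, core, trap):
    """Tag one rollout with the three events (assumed definitions)."""
    first_trap = next((i for i, s in enumerate(steps) if s in trap), None)
    exposed = first_trap is not None
    return {
        "access": any(s in core for s in steps),
        "trap_exposure": exposed,
        "repair": exposed and any(s in core for s in steps[first_trap + 1:]),
    }

# Toy states: (action, observation) pairs, invented for illustration.
LS = ("ls", "src tests")
EDIT = ("edit foo.py", "saved")
PASS = ("pytest", "1 passed")
GREP = ("grep bar", "no match")
FAIL = ("pytest", "1 failed")

rollouts = [
    ("model_a", [LS, EDIT, PASS], True),
    ("model_b", [LS, GREP, FAIL], False),
    ("model_c", [LS, GREP, EDIT, PASS], True),
    ("model_a", [LS, GREP, FAIL], False),
]

edges, visits = build_graph(rollouts)
core, trap = label_regions(visits)
profiles = [summarize(steps, core, trap) for _, steps, _ in rollouts]
```

In this toy pool the third rollout enters a trap state and still reaches the core, so it is the only one tagged with a Repair event. Exact-match state keys are the simplest choice; any real pipeline would need some normalization of observations.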

If this is right

Agent trajectories can be compared across models on a single shared landscape rather than by aggregate pass rates.
Benchmarks can be distinguished by whether success depends more on avoiding traps or on recovering after entering them.
Historical trap regions can be used at runtime to select among continuation policies without retraining the original agent.
Provider-specific active components can be activated once a trap state is detected.
Process-level events supply a vocabulary for diagnosing where models diverge on the same task.
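The runtime use of historical trap regions listed above can be sketched as a lookup followed by a policy choice. This is an illustrative sketch, not the paper's detector: the normalization rule, the `TrapDetector` class, the provider names, and the policy labels are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Step:
    action: str
    observation: str

def normalize(step):
    """Hypothetical canonical form: lowercase and collapse whitespace."""
    return (
        " ".join(step.action.lower().split()),
        " ".join(step.observation.lower().split()),
    )

class TrapDetector:
    """Fires when the latest state of a prefix matches a stored trap state."""

    def __init__(self, trap_states):
        self.trap_states = frozenset(trap_states)

    def fires(self, prefix):
        return bool(prefix) and normalize(prefix[-1]) in self.trap_states

def next_instruction(prefix, provider, detector, policies, default="generic_recovery"):
    """Return a continuation policy label if the detector fires, else None.

    The original agent is left untouched; only the continuation changes.
    """
    if not detector.fires(prefix):
        return None
    return policies.get(provider, default)

# Toy usage with invented trap states and provider-specific policies.
detector = TrapDetector({("grep bar", "no match")})
policies = {"provider_a": "reread_failing_test"}
prefix = [Step("ls", "src tests"), Step("GREP   bar", "No match")]
choice = next_instruction(prefix, "provider_a", detector, policies)
```

Keeping the detector as a frozen set of states makes it easy to freeze before evaluation, which matters for the data-separation concerns raised later on this page.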

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The graph construction could be applied to other multi-model agent evaluations to create comparable decision maps.
The method may help isolate whether performance gains come from better trap avoidance or from stronger recovery once inside a trap.
Extending the detector to operate on partial trajectories could allow earlier intervention before full failure.

Load-bearing premise

Trap regions identified from pooled historical rollouts remain stable and detectable at runtime on new trajectories without the detector being tuned to the same data.

What would settle it

Applying the runtime trap detector and selected continuation policies to a fresh collection of trajectories and measuring no increase in resolved rate or frequent firing on states outside the historical trap regions.
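One way to make this test concrete is a small report over fresh trajectories. The sketch below assumes each record carries four boolean fields (`in_trap`, `fired`, `base`, `policy`) obtained from a frozen detector and paired baseline and policy runs; the field names and the toy records are hypothetical.

```python
def fired_subset_report(records):
    """Summarize detector behavior and paired outcomes on fresh trajectories.

    Each record is a dict with boolean fields:
      in_trap - the state truly lies in a historical trap region
      fired   - the frozen detector fired on this trajectory
      base    - the unmodified continuation resolved the task
      policy  - the recovery continuation resolved the task
    """
    fired = [r for r in records if r["fired"]]
    if not fired:
        return None
    n = len(fired)
    base = sum(r["base"] for r in fired) / n
    with_policy = sum(r["policy"] for r in fired) / n
    return {
        "firing_rate": n / len(records),
        "off_region_fire_rate": sum(not r["in_trap"] for r in fired) / n,
        "baseline": base,
        "with_policy": with_policy,
        "gain_pp": 100 * (with_policy - base),
        "helped": sum(r["policy"] and not r["base"] for r in fired),
        "hurt": sum(r["base"] and not r["policy"] for r in fired),
    }

# Toy records, invented to show a null result on held-out data.
records = [
    {"in_trap": True, "fired": True, "base": False, "policy": True},
    {"in_trap": True, "fired": True, "base": True, "policy": True},
    {"in_trap": False, "fired": True, "base": False, "policy": False},
    {"in_trap": True, "fired": True, "base": True, "policy": False},
    {"in_trap": False, "fired": False, "base": True, "policy": True},
]
report = fired_subset_report(records)
```

Here one trajectory is helped and one is hurt, so the gain is zero, and a quarter of the fires land outside the historical trap regions. A result of that shape on a fresh collection is the outcome that would count against the load-bearing premise.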

Original abstract

Agent benchmarks increasingly record rich interaction trajectories, yet evaluation often reduces each rollout to a pass rate or reward score. We introduce TraceGraph, a graph-based framework that turns released multi-model agent trajectories into shared decision landscapes. For each task, TraceGraph builds a graph over observable action-observation states from pooled rollouts before model identity is introduced. It then overlays outcome-informed productive cores and trap regions, and summarizes each rollout with three events: Access, Trap exposure, and Repair. Across trajectories spanning five benchmark splits, TraceGraph profiles reveal navigation differences hidden by aggregate scores and show that splits differ in whether they reward avoiding traps or recovering from them. The same TraceGraph landscape also motivates a trap-aware recovery pipeline for SWE-bench: a runtime detector fires on states matching historical trap regions, then lightweight continuation policies are evaluated from the same prefix. On fired states, the best pooled single-factor policy raises official resolved rate from 40.4% to 43.5% on the per-provider fired subset and from 41.0% to 44.8% on common-fired instances, with provider-specific active components. Overall, TraceGraph provides a process vocabulary for asking what agent benchmarks test, where models diverge on a shared landscape, and how failure regions can guide downstream improvement.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TraceGraph gives a shared graph view of agent trajectories that surfaces trap regions, but the SWE-bench gains rest on an unclear data split that risks circularity.

The core contribution is turning pooled agent rollouts into one graph over action-observation states, then labeling trap and productive regions from outcomes. This produces a common landscape that lets you compare how different models move through the same tasks and see whether a benchmark rewards trap avoidance or recovery.

The paper shows this view can expose differences across benchmark splits that aggregate scores hide. It also reports a concrete recovery step on SWE-bench: a detector fires on states that match historical trap regions and switches to a better continuation policy, lifting resolved rate from 40.4% to 43.5% on the per-provider fired subset.

The main weakness is that the abstract gives no evidence the fired test cases are disjoint from the rollouts used to build the graph and pick the policies. If the same trajectories supply both the trap labels and the measured gains, the improvement could be an in-sample artifact rather than a runtime fix. The description of how trap regions are validated or how the detector is frozen is also missing.

This work is aimed at people who run agent benchmarks and want a finer-grained way to diagnose failures. The numbers are specific enough that a referee should see the full methods section, especially the data-handling details. If the split is clean, the framing is worth engaging; if not, the empirical claim needs reworking.

Referee Report

2 major / 1 minor

Summary. The paper introduces TraceGraph, a graph-based framework that aggregates pooled multi-model agent trajectories into shared decision landscapes per task before introducing model identity. It identifies outcome-informed productive cores and trap regions, summarizes each rollout via Access, Trap exposure, and Repair events, profiles navigation differences across five benchmark splits, and presents a trap-aware recovery pipeline for SWE-bench in which a runtime detector matches states to historical trap regions and applies lightweight continuation policies, reporting official resolved-rate gains from 40.4% to 43.5% on the per-provider fired subset and from 41.0% to 44.8% on common-fired instances.

Significance. If the reported gains survive a proper train/test separation, TraceGraph supplies a concrete process vocabulary and shared-landscape representation that moves evaluation beyond aggregate pass rates toward diagnosing where models diverge and how failure regions can be exploited for targeted recovery. The use of official resolved rates on SWE-bench subsets is a positive feature.

major comments (2)

[Abstract] The trap-aware recovery pipeline reports resolved-rate gains on 'fired states' without stating whether the trajectories used to construct the TraceGraph, define trap regions, and select continuation policies are disjoint from the trajectories supplying those fired states. This omission directly undermines interpretation of the 3.1 pp and 3.8 pp improvements as out-of-sample runtime gains rather than in-sample artifacts.
[Abstract] No description is given of how trap regions were defined from the pooled rollouts, how the detector was validated, or whether continuation-policy selection was performed after inspecting the improvement metric on the same data; these details are load-bearing for the central empirical claim.
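The 3.1 pp and 3.8 pp figures quoted in the first comment follow from the rates reported in the abstract. A short check, with a hypothetical helper name, also gives the relative improvement; because the abstract does not report subset sizes, no significance test can be derived from these numbers alone.

```python
def gain(before, after):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    diff = after - before
    return round(diff, 1), round(100 * diff / before, 1)

# Rates as reported in the abstract (official resolved rate, percent).
per_provider = gain(40.4, 43.5)   # per-provider fired subset
common_fired = gain(41.0, 44.8)   # common-fired instances
```

With these inputs the absolute gains are 3.1 and 3.8 percentage points, and the relative gains are about 7.7% and 9.3% of the respective baselines.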

minor comments (1)

[Abstract] The abstract refers to 'five benchmark splits' without naming the splits or citing the exact datasets and versions used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review and for identifying points where the abstract must be more explicit about data separation and methodological choices. These clarifications are necessary to support the central empirical claims. We will revise the abstract accordingly.

Referee: [Abstract] The trap-aware recovery pipeline reports resolved-rate gains on 'fired states' without stating whether the trajectories used to construct the TraceGraph, define trap regions, and select continuation policies are disjoint from the trajectories supplying those fired states. This omission directly undermines interpretation of the 3.1 pp and 3.8 pp improvements as out-of-sample runtime gains rather than in-sample artifacts.

Authors: We agree the abstract must state the relationship between the trajectories used for TraceGraph construction, trap definition, policy selection, and the fired states on which gains are measured. In revision we will add an explicit sentence clarifying whether these sets are disjoint (or the degree of overlap) and will note any implications for interpreting the reported gains as out-of-sample. If the sets overlap, we will also qualify the results accordingly rather than claiming runtime generalization.
Revision: yes

Referee: [Abstract] No description is given of how trap regions were defined from the pooled rollouts, how the detector was validated, or whether continuation-policy selection was performed after inspecting the improvement metric on the same data; these details are load-bearing for the central empirical claim.

Authors: We will expand the abstract with a brief clause describing the definition of trap regions (outcome-informed states reached disproportionately by failing trajectories), the validation approach used for the runtime detector, and the procedure for selecting continuation policies (including whether selection inspected the same improvement metric). These additions will be kept concise while making the load-bearing choices transparent.
Revision: yes

Circularity Check

1 step flagged

Recovery-rate gains measured on fired states defined from the same pooled rollouts used to identify trap regions

specific steps

fitted input called prediction [Abstract]
"a runtime detector fires on states matching historical trap regions, then lightweight continuation policies are evaluated from the same prefix. On fired states, the best pooled single-factor policy raises official resolved rate from 40.4% to 43.5% on the per-provider fired subset and from 41.0% to 44.8% on common-fired instances"

Trap regions are identified from the pooled historical rollouts. The abstract does not say whether the fired states on which the continuation policies are tested come from separate rollouts. If they come from the same rollouts, the improvement is computed on the data used to define the detector, which reduces the result to an in-sample statistic rather than an independent prediction.

full rationale

The abstract describes building TraceGraph and trap regions from pooled rollouts, then firing a detector on matching states and reporting resolved-rate lifts (40.4%→43.5%, 41.0%→44.8%) on the resulting fired subset. No train/test split, frozen detector, or disjoint evaluation set is stated, so the reported improvement is evaluated on data that supplied the trap definitions themselves. This matches the fitted_input_called_prediction pattern with a single load-bearing step.
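The missing separation could be checked mechanically. The sketch below assumes trajectories are grouped by a string task ID and shows one possible deterministic task-level split plus an overlap report; the function names, the hash-based bucketing, and the ID format are hypothetical and are not taken from the paper.

```python
import hashlib

def split_tasks(task_ids, eval_fraction=0.5):
    """Deterministically assign each task to the build set or the held-out set.

    Splitting by task (not by rollout) keeps every rollout of a task on one side.
    """
    build, held_out = [], []
    for tid in task_ids:
        # First digest byte mapped to [0, 1); stable across runs and machines.
        bucket = hashlib.sha256(tid.encode("utf-8")).digest()[0] / 256.0
        (held_out if bucket < eval_fraction else build).append(tid)
    return build, held_out

def overlap_report(build_ids, eval_ids):
    """Report which evaluation IDs also appear in the graph-construction set."""
    build, evaluated = set(build_ids), set(eval_ids)
    shared = build & evaluated
    return {
        "shared": sorted(shared),
        "fraction_of_eval": len(shared) / len(evaluated) if evaluated else 0.0,
        "disjoint": not shared,
    }

# Toy usage with invented task IDs.
tasks = [f"task-{i}" for i in range(20)]
build, held_out = split_tasks(tasks)
clean = overlap_report(build, held_out)
leaky = overlap_report(["a", "b", "c"], ["c", "d"])
```

A report with `disjoint` set to false, as in the `leaky` example, is the situation this audit flags: gains measured on the shared IDs would be in-sample.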

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the assumption that observable action-observation states are sufficient to define decision landscapes and that outcome labels from pooled rollouts can be treated as stable region annotations; no explicit free parameters or invented entities are named in the abstract.

axioms (1)

domain assumption: Observable action-observation pairs form a sufficient state space for building shared decision graphs across models.
Invoked when the paper states that graphs are built over observable states before model identity is introduced.
