TraceGraph: Shared Decision Landscapes for Diagnosing and Improving Agent Trajectories
Pith reviewed 2026-06-28 22:20 UTC · model grok-4.3
The pith
TraceGraph builds shared graphs over agent trajectories to identify trap regions and apply recovery policies that raise resolved rates on SWE-bench.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TraceGraph builds a graph over observable action-observation states from pooled rollouts before model identity is introduced. It then overlays outcome-informed productive cores and trap regions and summarizes each rollout with three events: Access, Trap exposure, and Repair. Across five benchmark splits the graphs profile navigation differences hidden by aggregate scores and show that splits differ in whether they reward avoiding traps or recovering from them. The same landscape motivates a trap-aware recovery pipeline in which a runtime detector fires on states matching historical trap regions and lightweight continuation policies are evaluated from the same prefix, raising resolved rates o
What carries the argument
TraceGraph, the graph constructed over pooled action-observation states with overlaid productive cores and trap regions that summarizes trajectories by Access, Trap exposure, and Repair events.
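A minimal sketch of this construction, under assumptions the summary does not confirm: a state is taken to be a plain (action, observation) pair, and a trap region is any state visited mostly by failing rollouts. The function names `build_graph`, `label_regions`, and `summarize`, the 0.8 and 0.2 thresholds, and the reading of Repair as "a core state reached after the first trap state" are all hypothetical illustrations, not the paper's definitions.

```python
from collections import defaultdict

# Hypothetical data model: a rollout is (steps, resolved), where steps is a
# list of (action, observation) pairs. Model identity is deliberately absent.

def build_graph(rollouts):
    """Pool rollouts into per-state outcome counts and directed edge counts."""
    visits = defaultdict(lambda: [0, 0])  # state -> [failing, passing] rollouts
    edges = defaultdict(int)              # (state, next_state) -> count
    for steps, resolved in rollouts:
        for state in set(steps):          # count each rollout once per state
            visits[state][int(resolved)] += 1
        for a, b in zip(steps, steps[1:]):
            edges[(a, b)] += 1
    return visits, edges

def label_regions(visits, trap_at=0.8, core_at=0.2, min_visits=2):
    """Assumed rule: high failure share -> trap, low failure share -> core."""
    traps, cores = set(), set()
    for state, (fail, ok) in visits.items():
        total = fail + ok
        if total < min_visits:
            continue                      # too few visits to label
        share = fail / total
        if share >= trap_at:
            traps.add(state)
        elif share <= core_at:
            cores.add(state)
    return traps, cores

def summarize(steps, traps, cores):
    """Reduce one rollout to the three events: Access, Trap exposure, Repair."""
    access = any(s in cores for s in steps)
    trap_idx = next((i for i, s in enumerate(steps) if s in traps), None)
    exposed = trap_idx is not None
    # Assumed reading of Repair: a core state is reached after the first trap.
    repair = exposed and any(s in cores for s in steps[trap_idx + 1:])
    return {"access": access, "trap_exposure": exposed, "repair": repair}
```

Because the graph is built before outcomes are split by model, the same `traps` and `cores` sets can be reused to summarize every model's rollouts on that task.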
If this is right
- Agent trajectories can be compared across models on a single shared landscape rather than by aggregate pass rates.
- Benchmarks can be distinguished by whether success depends more on avoiding traps or on recovering after entering them.
- Historical trap regions can be used at runtime to select among continuation policies without retraining the original agent.
- Provider-specific active components can be activated once a trap state is detected.
- Process-level events supply a vocabulary for diagnosing where models diverge on the same task.
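The runtime use described in the list above can be sketched as two small steps: detect the first step of a live prefix that matches a historical trap state, then compare continuation policies started from that same prefix. This is an illustrative sketch only; exact set membership as the matching rule, the policy names, and the function names `first_trap_step` and `pick_policy` are assumptions, not the paper's detector.

```python
def first_trap_step(prefix, trap_states):
    """Return the index of the first step matching a historical trap state."""
    for i, state in enumerate(prefix):
        if state in trap_states:          # assumed: exact match on the state
            return i
    return None                           # detector does not fire

def pick_policy(outcomes):
    """Pick the continuation policy with the highest resolved rate.

    outcomes maps a (hypothetical) policy name to a list of booleans, one per
    continuation started from the same fired prefix.
    """
    rates = {name: sum(r) / len(r) for name, r in outcomes.items() if r}
    best = max(rates, key=rates.get)
    return best, rates
```

Note that choosing the best policy on the same fired states it is later scored on will tend to overstate its gain, so the selection data and the scoring data would ideally differ.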
Where Pith is reading between the lines
- The graph construction could be applied to other multi-model agent evaluations to create comparable decision maps.
- The method may help isolate whether performance gains come from better trap avoidance or from stronger recovery once inside a trap.
- Extending the detector to operate on partial trajectories could allow earlier intervention before full failure.
Load-bearing premise
Trap regions identified from pooled historical rollouts remain stable and detectable at runtime on new trajectories without the detector being tuned to the same data.
What would settle it
Applying the runtime trap detector and selected continuation policies to a fresh collection of trajectories and measuring no increase in resolved rate or frequent firing on states outside the historical trap regions.
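The two quantities named in this test can be computed as follows. This is a sketch under an assumed record schema: each fresh fired state carries the hypothetical fields `fired_in_trap`, `baseline_resolved`, and `policy_resolved`, none of which come from the paper.

```python
def settle_check(records):
    """Summarize a fresh evaluation of the detector and a continuation policy.

    records: non-empty list of dicts with boolean keys
      'fired_in_trap'     - the fired state lies inside a historical trap region
      'baseline_resolved' - the unmodified continuation resolved the task
      'policy_resolved'   - the policy continuation resolved the task
    """
    n = len(records)
    base = sum(r["baseline_resolved"] for r in records) / n
    pol = sum(r["policy_resolved"] for r in records) / n
    off = sum(not r["fired_in_trap"] for r in records) / n
    return {
        "delta_pp": round(100 * (pol - base), 1),  # resolved-rate change in pp
        "off_region_rate": off,                    # share of firings outside traps
    }
```

A `delta_pp` near zero or a large `off_region_rate` on fresh trajectories would count against the load-bearing premise.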
Original abstract
Agent benchmarks increasingly record rich interaction trajectories, yet evaluation often reduces each rollout to a pass rate or reward score. We introduce TraceGraph, a graph-based framework that turns released multi-model agent trajectories into shared decision landscapes. For each task, TraceGraph builds a graph over observable action-observation states from pooled rollouts before model identity is introduced. It then overlays outcome-informed productive cores and trap regions, and summarizes each rollout with three events: Access, Trap exposure, and Repair. Across trajectories spanning five benchmark splits, TraceGraph profiles reveal navigation differences hidden by aggregate scores and show that splits differ in whether they reward avoiding traps or recovering from them. The same TraceGraph landscape also motivates a trap-aware recovery pipeline for SWE-bench: a runtime detector fires on states matching historical trap regions, then lightweight continuation policies are evaluated from the same prefix. On fired states, the best pooled single-factor policy raises official resolved rate from 40.4% to 43.5% on the per-provider fired subset and from 41.0% to 44.8% on common-fired instances, with provider-specific active components. Overall, TraceGraph provides a process vocabulary for asking what agent benchmarks test, where models diverge on a shared landscape, and how failure regions can guide downstream improvement.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces TraceGraph, a graph-based framework that aggregates pooled multi-model agent trajectories into shared decision landscapes per task before introducing model identity. It identifies outcome-informed productive cores and trap regions, summarizes each rollout via Access, Trap exposure, and Repair events, profiles navigation differences across five benchmark splits, and presents a trap-aware recovery pipeline for SWE-bench in which a runtime detector matches states to historical trap regions and applies lightweight continuation policies, reporting official resolved-rate gains from 40.4% to 43.5% on the per-provider fired subset and from 41.0% to 44.8% on common-fired instances.
Significance. If the reported gains survive a proper train/test separation, TraceGraph supplies a concrete process vocabulary and shared-landscape representation that moves evaluation beyond aggregate pass rates toward diagnosing where models diverge and how failure regions can be exploited for targeted recovery. The use of official resolved rates on SWE-bench subsets is a positive feature.
major comments (2)
- [Abstract] Abstract: the trap-aware recovery pipeline reports resolved-rate gains on 'fired states' without stating whether the trajectories used to construct the TraceGraph, define trap regions, and select continuation policies are disjoint from the trajectories supplying those fired states. This omission directly undermines interpretation of the 3.1 pp and 3.8 pp improvements as out-of-sample runtime gains rather than in-sample artifacts.
- [Abstract] Abstract: no description is given of how trap regions were defined from the pooled rollouts, how the detector was validated, or whether continuation-policy selection was performed after inspecting the improvement metric on the same data; these details are load-bearing for the central empirical claim.
minor comments (1)
- [Abstract] The abstract refers to 'five benchmark splits' without naming the splits or citing the exact datasets and versions used.
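The percentage-point figures quoted in the major comments follow directly from the rates in the abstract. The short check below reproduces them; the helper name `pp_gain` is illustrative. These are absolute differences measured only on fired subsets, and the summary does not state the subset sizes.

```python
def pp_gain(before, after):
    """Percentage-point difference between two rates given in percent."""
    return round(after - before, 1)

per_provider = pp_gain(40.4, 43.5)   # per-provider fired subset -> 3.1
common_fired = pp_gain(41.0, 44.8)   # common-fired instances    -> 3.8
```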
Simulated Author's Rebuttal
We thank the referee for the careful review and for identifying points where the abstract must be more explicit about data separation and methodological choices. These clarifications are necessary to support the central empirical claims. We will revise the abstract accordingly.
- Referee: [Abstract] Abstract: the trap-aware recovery pipeline reports resolved-rate gains on 'fired states' without stating whether the trajectories used to construct the TraceGraph, define trap regions, and select continuation policies are disjoint from the trajectories supplying those fired states. This omission directly undermines interpretation of the 3.1 pp and 3.8 pp improvements as out-of-sample runtime gains rather than in-sample artifacts.
  Authors: We agree the abstract must state the relationship between the trajectories used for TraceGraph construction, trap definition, policy selection, and the fired states on which gains are measured. In revision we will add an explicit sentence clarifying whether these sets are disjoint (or the degree of overlap) and will note any implications for interpreting the reported gains as out-of-sample. If the sets overlap, we will also qualify the results accordingly rather than claiming runtime generalization.
  revision: yes
- Referee: [Abstract] Abstract: no description is given of how trap regions were defined from the pooled rollouts, how the detector was validated, or whether continuation-policy selection was performed after inspecting the improvement metric on the same data; these details are load-bearing for the central empirical claim.
  Authors: We will expand the abstract with a brief clause describing the definition of trap regions (outcome-informed states reached disproportionately by failing trajectories), the validation approach used for the runtime detector, and the procedure for selecting continuation policies (including whether selection inspected the same improvement metric). These additions will be kept concise while making the load-bearing choices transparent.
  revision: yes
Circularity Check
Recovery-rate gains measured on fired states defined from the same pooled rollouts used to identify trap regions
specific steps
- fitted input called prediction [Abstract]
  "a runtime detector fires on states matching historical trap regions, then lightweight continuation policies are evaluated from the same prefix. On fired states, the best pooled single-factor policy raises official resolved rate from 40.4% to 43.5% on the per-provider fired subset and from 41.0% to 44.8% on common-fired instances"
  Trap regions are identified from the pooled historical rollouts; the same rollouts supply the fired states on which the continuation policies are tested and the resolved-rate gains are measured. The improvement is therefore computed on the identical data used to define the detector, reducing the result to an in-sample statistic rather than an independent prediction.
full rationale
The abstract describes building TraceGraph and trap regions from pooled rollouts, then firing a detector on matching states and reporting resolved-rate lifts (40.4%→43.5%, 41.0%→44.8%) on the resulting fired subset. No train/test split, frozen detector, or disjoint evaluation set is stated, so the reported improvement is evaluated on data that supplied the trap definitions themselves. This matches the fitted_input_called_prediction pattern with a single load-bearing step.
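One possible way to remove this circularity is to split rollouts within each task, so that trap regions are defined on one part and the detector is scored on the other. The sketch below assumes a hypothetical `(task_id, rollout_id)` schema and splits per task because the graphs are built per task; the function name, the 50/50 default, and the fixed seed are illustrative choices, not the paper's protocol.

```python
import random
from collections import defaultdict

def split_rollouts_per_task(rollouts, eval_fraction=0.5, seed=0):
    """Split (task_id, rollout_id) pairs into build and held-out sets.

    The build set would be used to construct the graph and define trap
    regions; the held-out set supplies the states the frozen detector is
    scored on. Every task keeps at least one build rollout.
    """
    by_task = defaultdict(list)
    for task_id, rollout_id in rollouts:
        by_task[task_id].append(rollout_id)
    rng = random.Random(seed)             # fixed seed for a reproducible split
    build, held_out = [], []
    for task_id in sorted(by_task):
        ids = sorted(by_task[task_id])
        rng.shuffle(ids)
        cut = max(1, int(len(ids) * (1 - eval_fraction)))
        build += [(task_id, r) for r in ids[:cut]]
        held_out += [(task_id, r) for r in ids[cut:]]
    return build, held_out
```

With such a split, gains reported on `held_out` would no longer be computed on the rollouts that defined the trap regions.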
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Observable action-observation pairs form a sufficient state space for building shared decision graphs across models.