pith. machine review for the scientific record.

arxiv: 2603.23806 · v2 · submitted 2026-03-25 · 💻 cs.SE · cs.AI

Recognition: no theorem link

Willful Disobedience: Automatically Detecting Failures in Agentic Traces

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 01:17 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords AI agents · agentic traces · behavioral rules · trace compliance · specification violations · outcome benchmarks · multi-step evaluation

The pith

AgentPex extracts behavioral rules from agent prompts to automatically flag procedural violations in traces that outcome-only scores overlook.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AgentPex as a method to evaluate long multi-turn agent executions, called agentic traces. It pulls explicit behavioral rules from the original prompts and system instructions, then checks each trace step against those rules for compliance. This matters because outcome-only benchmarks miss failures such as incorrect workflow routing, unsafe tool calls, or violations of prompt-specified rules that occur mid-execution. When run on 424 τ²-bench traces spanning the telecom, retail, and airline domains across different models, the tool separates model behaviors and reveals violations invisible to outcome metrics alone. The approach supplies domain-level and metric-level breakdowns so developers can diagnose failure patterns at scale.

Core claim

AgentPex extracts behavioral rules from agent prompts and system instructions, then uses these specifications to automatically evaluate traces for compliance, distinguishing agent behavior across models and surfacing specification violations not captured by outcome-only scoring.

What carries the argument

Rule extraction from prompts and instructions that converts them into executable compliance checks for full traces.
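The extraction-then-checking idea can be sketched in a few lines. This is an illustration of the concept only, not AgentPex's actual implementation: it scans a system prompt for explicit directive language ("always", "never", "must") and flags trace steps that call a tool named in a "never" rule. All names and the trace shape are hypothetical.

```python
import re

# Directive sentences like "Never call cancel_flight without user consent."
DIRECTIVE = re.compile(r"(?:You\s+)?(always|never|must)\s+(.+?)[.!]?$", re.IGNORECASE)

def extract_rules(system_prompt: str) -> list[tuple[str, str]]:
    """Return (modality, constraint) pairs for explicit directive sentences."""
    rules = []
    for sentence in re.split(r"(?<=[.!])\s+", system_prompt):
        m = DIRECTIVE.match(sentence.strip())
        if m:
            rules.append((m.group(1).lower(), m.group(2)))
    return rules

def check_step(step: dict, rules: list[tuple[str, str]]) -> list[str]:
    """Flag a trace step whose tool call appears in a 'never' constraint."""
    violations = []
    for modality, constraint in rules:
        if modality == "never" and step.get("tool", "") in constraint:
            violations.append(f"never {constraint}")
    return violations

prompt = "You must confirm before booking. Never call cancel_flight without user consent."
rules = extract_rules(prompt)
trace = [{"tool": "search_flights"}, {"tool": "cancel_flight"}]
flags = [v for step in trace for v in check_step(step, rules)]
```

A real extractor (per the paper, an LLM-driven one) must also handle implicit constraints and paraphrase, which is exactly the fidelity question the referee raises below.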

If this is right

  • AgentPex identifies incorrect workflow routing and unsafe tool usage that final success scores ignore.
  • It separates performance differences between models on identical tasks by compliance rate.
  • It supplies per-domain and per-metric breakdowns that point to specific agent weaknesses.
  • Developers gain scalable inspection of multi-step traces without manual review of every turn.
  • The method highlights prompt-specified rules that current benchmarks do not test.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The extraction step could be applied to live agent logs for ongoing compliance monitoring in production.
  • Explicit rule statements in prompts may become a standard practice to improve automatic verifiability.
  • Teams might select or fine-tune models based on measured compliance to the extracted rule set rather than task accuracy alone.
  • Integration into agent development pipelines could catch violations during testing before deployment.

Load-bearing premise

Rules pulled from prompts and instructions accurately and completely represent the intended behavioral specifications without meaningful loss or misreading during extraction.

What would settle it

Human reviewers examine the same set of traces and find that AgentPex either misses major violations or flags behaviors that humans judge as compliant with the original prompt intent.

Figures

Figures reproduced from arXiv: 2603.23806 by Benjamin Zorn, Reshabh K Sharma, Shraddha Barke.

Figure 1. The agent reaches the correct final outcome but …

Figure 2. Extracted output and transition rules (subset) for …

Figure 3. The AgentPex pipeline from raw trace import through specification extraction to evaluation and aggregate scoring.

Figure 4. Example extracted rules per specification type from …

Figure 5. Traces sorted by AgentPex aggregate (low to high) alongside the τ² composite score. Low-τ² traces concentrate among low AgentPex scores, showing agreement without ground truth.

Figure 6. Per-metric scores across models. Similar metric …

Figure 8. ROC and Precision–Recall curves for output_spec when classifying τ² failures. Red dots indicate candidate score thresholds (50–80). ROC-AUC = 0.680. Dashed lines denote random (ROC) and base-rate (PR) baselines.

Figure 9. τ² composite and AgentPex aggregate scores are similar across models.
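The pipeline shape in Figure 3 (trace import → specification extraction → per-step evaluation → aggregate scoring) can be sketched as follows. The class names, the 0-100 scoring convention, and the toy rule checker are all illustrative assumptions, not AgentPex's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class TraceReport:
    """Per-trace result: one 0/1 compliance flag per step, plus a 0-100 aggregate."""
    trace_id: str
    step_results: list  # 1 = step compliant, 0 = violation
    aggregate: float = field(init=False)

    def __post_init__(self):
        n = len(self.step_results)
        self.aggregate = 100.0 * sum(self.step_results) / n if n else 100.0

def evaluate_trace(trace_id, steps, rule_checker):
    """Run every step through the rule checker and aggregate into one score."""
    results = [0 if rule_checker(step) else 1 for step in steps]
    return TraceReport(trace_id, results)

# Toy checker: flag any step that calls a tool outside an allow-list.
allowed = {"search_flights", "book_flight"}
checker = lambda step: step.get("tool") not in allowed
report = evaluate_trace("t1", [{"tool": "search_flights"}, {"tool": "transfer_funds"}], checker)
```

Scores of this shape are what Figures 5-9 aggregate across traces, domains, and models.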
read the original abstract

AI agents are increasingly embedded in real software systems, where they execute multi-step workflows through multi-turn dialogue, tool invocations, and intermediate decisions. These long execution histories, called agentic traces, make validation difficult. Outcome-only benchmarks can miss critical procedural failures, such as incorrect workflow routing, unsafe tool usage, or violations of prompt-specified rules. This paper presents AgentPex, an AI-powered tool designed to systematically evaluate agentic traces. AgentPex extracts behavioral rules from agent prompts and system instructions, then uses these specifications to automatically evaluate traces for compliance. We evaluate AgentPex on 424 traces from ${\tau}^2$-bench across models in telecom, retail, and airline customer service. Our results show that AgentPex distinguishes agent behavior across models and surfaces specification violations that are not captured by outcome-only scoring. It also provides fine-grained analysis by domain and metric, enabling developers to understand agent strengths and weaknesses at scale. The source code of AgentPex is available at https://github.com/microsoft/agentpex.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces AgentPex, an AI-powered tool that extracts behavioral rules from agent prompts and system instructions, then automatically evaluates compliance in multi-turn agentic traces. On 424 traces from the τ²-bench benchmark spanning telecom, retail, and airline customer-service domains, the authors claim that AgentPex distinguishes behavioral differences across models and identifies specification violations missed by outcome-only scoring, while also enabling fine-grained per-domain and per-metric analysis.

Significance. If the rule-extraction step proves reliable, the work would offer a practical advance for agent evaluation by moving beyond binary outcome metrics to procedural and specification-based checks. This could help developers diagnose specific failure modes (e.g., unsafe tool use or workflow routing errors) at scale, especially as agentic systems are deployed in production software.

major comments (2)
  1. [Methods / Evaluation] The central claim rests on the fidelity of automatically extracted behavioral rules, yet the manuscript provides no human-annotated ground truth, coverage metrics, or error rates for the extraction process (e.g., handling of implicit constraints or ambiguous phrasing). Because downstream compliance scoring and the reported model distinctions on the 424 traces depend entirely on these rules, the absence of validation directly weakens the evidence for the strongest claim.
  2. [Results] No statistical significance tests, confidence intervals, or inter-model comparison details are reported for the claimed distinctions across models. The abstract states that AgentPex “distinguishes agent behavior,” but without these analyses it is unclear whether observed differences exceed what would be expected from extraction noise or sampling variation.
minor comments (2)
  1. [Implementation] The GitHub link is provided, but the manuscript does not describe the exact prompting strategy or LLM used for rule extraction, which would aid reproducibility.
  2. [Figures/Tables] Figure captions and table headers could more explicitly link each metric to the extracted rules that produced it.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We will revise the manuscript to include validation for the rule extraction process and statistical analyses for the model comparisons as detailed in our point-by-point responses below.

read point-by-point responses
  1. Referee: [Methods / Evaluation] The central claim rests on the fidelity of automatically extracted behavioral rules, yet the manuscript provides no human-annotated ground truth, coverage metrics, or error rates for the extraction process (e.g., handling of implicit constraints or ambiguous phrasing). Because downstream compliance scoring and the reported model distinctions on the 424 traces depend entirely on these rules, the absence of validation directly weakens the evidence for the strongest claim.

    Authors: We acknowledge this limitation in the current version. To address it, we will conduct a human evaluation of the extracted rules. We plan to sample 50 prompts from each domain, extract rules using AgentPex, and have two independent annotators label whether the rules accurately capture the behavioral constraints in the prompt (including implicit ones). We will report agreement metrics (Cohen's kappa) and accuracy. This new analysis will be added to the Methods section and will support the reliability of the downstream results. revision: yes
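The proposed two-annotator agreement analysis can be sketched in a few lines; the labels below are illustrative placeholders, not data from the paper.

```python
from collections import Counter

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in set(labels_a) | set(labels_b)) / n**2
    return (observed - expected) / (1 - expected)

# Illustrative labels: does each extracted rule faithfully capture the prompt?
a = ["faithful", "faithful", "unfaithful", "faithful", "unfaithful", "faithful"]
b = ["faithful", "unfaithful", "unfaithful", "faithful", "unfaithful", "faithful"]
kappa = cohens_kappa(a, b)
```

Kappa near 0 would mean agreement no better than chance; values above roughly 0.6 are conventionally read as substantial agreement.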

  2. Referee: [Results] No statistical significance tests, confidence intervals, or inter-model comparison details are reported for the claimed distinctions across models. The abstract states that AgentPex “distinguishes agent behavior,” but without these analyses it is unclear whether observed differences exceed what would be expected from extraction noise or sampling variation.

    Authors: We agree that statistical rigor is necessary. In the revision, we will add bootstrap confidence intervals for the violation rates and perform pairwise statistical tests (e.g., Fisher's exact test) between models to determine if the observed differences are significant. These will be reported in the Results section with p-values and effect sizes. We believe this will clarify that the distinctions are not due to noise. revision: yes
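A percentile-bootstrap confidence interval of the kind proposed here can be sketched as follows. The violation flags are illustrative, not the paper's data; a pairwise significance test between two models would typically call `scipy.stats.fisher_exact` on the two violation counts.

```python
import random

def bootstrap_ci(outcomes: list[int], n_boot: int = 2000, alpha: float = 0.05,
                 seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap CI for a violation rate (outcomes are 0/1 flags)."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample with replacement, recompute the rate each time, take percentiles.
    rates = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_boot))
    lo = rates[int((alpha / 2) * n_boot)]
    hi = rates[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Illustrative per-trace violation flags for one model (1 = violated a rule).
flags = [1] * 30 + [0] * 70
lo, hi = bootstrap_ci(flags)
```

If the intervals for two models do not overlap, the compliance difference is unlikely to be extraction or sampling noise, which is the point the referee presses.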

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper describes AgentPex as a tool that extracts behavioral rules from agent prompts and system instructions then applies them to evaluate traces against an external benchmark (τ²-bench with 424 traces). No equations, fitted parameters, self-citations, or uniqueness theorems are presented that reduce the central claims (distinguishing model behaviors and surfacing violations) to inputs by construction. The evaluation reports empirical distinctions across models and domains without the results being statistically forced by the extraction method itself. The derivation chain is self-contained against the external benchmark and does not rely on renaming known results or smuggling ansatzes via citation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the premise that behavioral rules can be reliably extracted from natural-language prompts and that these rules form a sufficient specification for compliance checking. No free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption Prompts and system instructions contain extractable behavioral rules that fully capture intended agent constraints.
    Invoked when the method turns prompts into specifications for evaluation.

pith-pipeline@v0.9.0 · 5481 in / 1258 out tokens · 28002 ms · 2026-05-15T01:17:33.782555+00:00 · methodology

discussion (0)


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ContextCov: Deriving and Enforcing Executable Constraints from Agent Instruction Files

    cs.SE 2026-02 unverdicted novelty 6.0

    ContextCov compiles agent instruction files into static, runtime, and architectural guardrails, raising constraint compliance to 88.3% on SWE-bench Lite tasks versus 67% and 50.3% for prompt and reflection baselines.

  2. Learning Correct Behavior from Examples: Validating Sequential Execution in Autonomous Agents

    cs.AI 2026-05 unverdicted novelty 5.0

    A new algorithm learns correct agent behavior models from few traces by combining dominator analysis, LLMs, and automata to validate sequential executions with high accuracy.

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · cited by 2 Pith papers

  1. [1]

https://www.langchain.com/langsmith/observability

    LangSmith: AI Agent & LLM Observability Platform — langchain.com. https://www.langchain.com/langsmith/observability. [Accessed 24-03-2026]

  2. [2]

Agentic ai: a comprehensive survey of architectures, applications, and future directions. Artificial Intelligence Review, 59(1), November 2025

    Mohamad Abou Ali, Fadi Dornaika, and Jinan Charafeddine. Agentic ai: a comprehensive survey of architectures, applications, and future directions. Artificial Intelligence Review, 59(1), November 2025

  3. [3]

Constitutional ai: Harmlessness from ai feedback

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, K...

  4. [4]

    Agentrx: Diagnosing ai agent failures from execution trajectories, 2026

    Shraddha Barke, Arnav Goyal, Alind Khare, Avaljot Singh, Suman Nath, and Chetan Bansal. Agentrx: Diagnosing ai agent failures from execution trajectories, 2026

  5. [5]

τ²-bench: Evaluating conversational agents in a dual-control environment, 2025

    Victor Barres, Honghua Dong, Soham Ray, Xujie Si, and Karthik Narasimhan. τ²-bench: Evaluating conversational agents in a dual-control environment, 2025

  6. [6]

    Humans or llms as the judge? a study on judgement biases

    Guiming Hardy Chen, Shunian Chen, Ziche Liu, Feng Jiang, and Benyou Wang. Humans or llms as the judge? a study on judgement biases. 2024

  7. [7]

    Teaching large language models to self-debug

    Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. Teaching large language models to self-debug. 2023

  8. [8]

    Trail: Trace reasoning and agentic issue localization

Darshan Deshpande, Varun Gangal, Hersh Mehta, Jitin Krishnan, Anand Kannappan, and Rebecca Qian. Trail: Trace reasoning and agentic issue localization. 2025

  9. [9]

    GitHub Copilot: Your ai pair programmer

GitHub. GitHub Copilot: Your ai pair programmer. https://github.com/features/copilot, 2026. Product page for GitHub Copilot

  10. [10]

    Prompts are programs

Tommy Guy, Peli de Halleux, Reshabh K Sharma, and Ben Zorn. Prompts are programs. https://blog.sigplan.org/2024/10/22/prompts-are-programs/, 2024. SIGPLAN Perspectives Blog, [Accessed 28-02-2026]

  11. [11]

    Metagpt: Meta programming for a multi-agent collaborative framework

    Sirui Hong, Mingchen Zhuge, Jiaqi Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. Metagpt: Meta programming for a multi-agent collaborative framework. 2024

  12. [12]

A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, 43(2):1–55, January 2025

    Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, 43(2):1–55, January 2025

  13. [13]

Swe-bench: Can language models resolve real-world github issues?

    Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? 2024

  14. [14]

Dspy: Compiling declarative language model calls into self-improving pipelines

    Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. Dspy: Compiling declarative language model calls into self-improving pipelines. 2023

  15. [15]

    Agentbench: Evaluating llms as agents

    Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. Agentbench: Evaluating llms as agents. 2025

  16. [16]

    G-eval: Nlg evaluation using gpt-4 with better human alignment

Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-eval: Nlg evaluation using gpt-4 with better human alignment. In Proceedings of the 2023 conference on empirical methods in natural language processing, pages 2511–2522, 2023

  17. [17]

Applying "design by contract"

    B. Meyer. Applying "design by contract". Computer, 25(10):40–51, 1992

  18. [18]

    Gaia: a benchmark for general ai assistants

    Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. Gaia: a benchmark for general ai assistants. 2023

  19. [19]

    Microsoft 365 Copilot: Your everyday ai companion

Microsoft. Microsoft 365 Copilot: Your everyday ai companion. https://www.microsoft.com/en-us/microsoft-365/copilot, 2026. Product page for Microsoft 365 Copilot

  20. [20]

    Factscore: Fine-grained atomic evaluation of factual precision in long form text generation

    Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. Factscore: Fine-grained atomic evaluation of factual precision in long form text generation. 2023

  21. [21]

Generative agents: Interactive simulacra of human behavior

    Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior. 2023

  22. [22]

    The berkeley function calling leaderboard (bfcl): From tool use to agentic evaluation of large language models

    Shishir G Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E Gonzalez. The berkeley function calling leaderboard (bfcl): From tool use to agentic evaluation of large language models. InForty-second International Conference on Machine Learning, 2025

  23. [23]

Gorilla: Large language model connected with massive apis

    Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. Gorilla: Large language model connected with massive apis. 2023

  24. [24]

    Toolllm: Facilitating large language models to master 16000+ real-world apis

    Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. Toolllm: Facilitating large language models to master 16000+ real-world apis. 2023

  25. [25]

    The future of ai agents: Market projections and industry trends

Microsoft Research. The future of ai agents: Market projections and industry trends. https://www.microsoft.com/en-us/microsoft-365/blog/2025/11/18/microsoft-ignite-2025-copilot-and-agents-built-to-power-the-frontier-firm/,

  26. [26]

    Industry projection for AI agent adoption

  27. [27]

    Adaptive testing and debugging of nlp models

    Marco Tulio Ribeiro and Scott Lundberg. Adaptive testing and debugging of nlp models. InProceedings of the 60th annual meeting of the association for computational linguistics (volume 1: long papers), pages 3253–3267, 2022

  28. [28]

Beyond accuracy: Behavioral testing of nlp models with checklist, 2020

    Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. Beyond accuracy: Behavioral testing of nlp models with checklist, 2020

  29. [29]

    Toolformer: Language models can teach themselves to use tools

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. 2023

  30. [30]

    Defending language models against image-based prompt attacks via user-provided specifications

    Reshabh K Sharma, Vinayak Gupta, and Dan Grossman. Defending language models against image-based prompt attacks via user-provided specifications. In 2024 IEEE Security and Privacy Workshops (SPW), pages 112–131. IEEE, 2024

  31. [31]

    Spml: A dsl for defending language models against prompt attacks, 2024

    Reshabh K Sharma, Vinayak Gupta, and Dan Grossman. Spml: A dsl for defending language models against prompt attacks, 2024

  32. [32]

    Promptpex: Automatic test generation for language model prompts, 2026

    Reshabh K Sharma, Jonathan De Halleux, Shraddha Barke, Dan Grossman, and Benjamin Zorn. Promptpex: Automatic test generation for language model prompts, 2026

  33. [33]

Reflexion: Language agents with verbal reinforcement learning

    Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. 2023

  34. [34]

Specifications: The missing link to making the development of llm systems an engineering discipline

    Ion Stoica, Matei Zaharia, Joseph Gonzalez, Ken Goldberg, Koushik Sen, Hao Zhang, Anastasios Angelopoulos, Shishir G. Patil, Lingjiao Chen, Wei-Lin Chiang, and Jared Q. Davis. Specifications: The missing link to making the development of llm systems an engineering discipline, 2024

  35. [35]

    Large language models are not fair evaluators

    Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. Large language models are not fair evaluators. 2023

  36. [36]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. 2023

  37. [37]

    Autogen: Enabling next-gen llm applications via multi-agent conversation

    Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W White, Doug Burger, and Chi Wang. Autogen: Enabling next-gen llm applications via multi-agent conversation. 2023

  38. [38]

    Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments

    Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. NeurIPS 2024, 2024

  39. [39]

    React: Synergizing reasoning and acting in language models, 2023

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models, 2023

  40. [40]

Judging llm-as-a-judge with mt-bench and chatbot arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena. 2023

  41. [41]

    Webarena: A realistic web environment for building autonomous agents

    Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. Webarena: A realistic web environment for building autonomous agents. 2024
