Recognition: unknown
Measuring the Unmeasurable: Markov Chain Reliability for LLM Agents
Pith reviewed 2026-05-08 02:47 UTC · model grok-4.3
The pith
Fitting LLM agent execution traces to absorbing Markov chains accurately predicts their success-time distributions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Agent traces, once clustered into states, can be represented by an absorbing DTMC whose transition matrix yields a first-passage distribution that matches observed reliability behavior. On the seven evaluated frameworks the fitted chains produce analytic reliability decay curves that stay within 0.053 of held-out empirical curves, the two-sample KS test on the first-passage CDF accepts the model in all cases, and posterior and bootstrap interval estimates agree closely.
What carries the argument
The TraceToChain pipeline, which performs automatic state clustering, Laplace-smoothed MLE transition estimation for an absorbing DTMC, and composite AIC-plus-KS validation while reporting Dirichlet posterior and bootstrap uncertainty intervals.
Load-bearing premise
Agent execution traces, after automatic clustering, satisfy the Markov property well enough that an absorbing DTMC captures the true success-time distribution and supports extrapolation to unseen traces.
What would settle it
A new collection of agent frameworks processed with the identical 50/50 protocol that yields either a KS p-value below 0.05 on the first-passage CDF or an L-infinity RDC deviation above 0.1 on held-out data would falsify the claim of reliable modeling.
Figures
read the original abstract
Large language model (LLM) agents increasingly operate as sequential software systems, but their reliability is often summarized by scalar benchmark metrics. Metrics such as pass$@k$, pass$^k$, and the reliability decay curve (RDC) are useful summaries, but they do not identify the success-time distribution being estimated, test whether traces support that distribution, or quantify finite-trace uncertainty. We present \textsc{TraceToChain}, a reproducible pipeline that fits agent execution traces to an absorbing discrete-time Markov chain (DTMC), $\hat M=(\hat Q,\hat R_\oplus,\hat R_\ominus)$, with explicit diagnostics and uncertainty. The pipeline builds an automatic cluster taxonomy, estimates transitions with Laplace-smoothed maximum-likelihood estimation (MLE), checks fit with a composite Akaike information criterion (AIC) and Kolmogorov--Smirnov (KS) goodness-of-fit certificate, and reports Dirichlet-posterior credible intervals and non-parametric bootstrap intervals. We adapt classical reliability mathematics (Kemeny--Snell~\cite{kemenysnell}, Cheung~\cite{cheung1980}, Goel--Okumoto~\cite{goelokt}) to agent traces. The resulting first-passage view reconciles metrics usually reported separately: pass$@k$, pass$^k$, and the RDC are projections of one success-time distribution. On seven controlled MAST-style frameworks with a strict 50/50 fit/test protocol, held-out empirical RDCs overlay their analytic counterparts with max $L_\infty^{\mathrm{RDC}} = 0.053$ (median $0.048$). A two-sample KS test on the first-passage cumulative distribution function (CDF) accepts the fitted chain with $p>0.05$ on $7/7$ frameworks (min $p = 0.78$), and per-entry $95\%$ posterior and bootstrap intervals agree to $\approx\!0.01$ at the median.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces TraceToChain, a reproducible pipeline that automatically clusters LLM agent execution traces, fits them to an absorbing DTMC (Q̂, R̂⊕, R̂⊖) via Laplace-smoothed MLE, and adapts classical reliability results (Kemeny–Snell, Cheung, Goel–Okumoto) to unify pass@k, pass^k, and the reliability decay curve (RDC) as projections of a single success-time distribution. It supplies AIC/KS diagnostics, Dirichlet posterior credible intervals, and bootstrap intervals. On seven MAST-style frameworks under a strict 50/50 fit/test split, held-out empirical RDCs overlay analytic ones with max L∞RDC = 0.053 (median 0.048), and two-sample KS tests on the first-passage CDF accept the model (p > 0.05 on all 7, min p = 0.78).
Significance. If the post-clustering Markov property holds, the work supplies a statistically grounded, extrapolative model for agent reliability that replaces scalar benchmarks with a full distribution plus uncertainty quantification. The strict held-out protocol, adaptation of classical reliability mathematics, and explicit diagnostics are genuine strengths that would allow falsifiable predictions and reproducible analysis of sequential LLM systems.
major comments (2)
- [Abstract / validation procedure] Abstract and validation section: the reported two-sample KS test on the first-passage CDF (p > 0.05 on 7/7) together with composite AIC only confirm that the fitted chain reproduces the observed marginal success-time distribution. They do not test the Markov property itself (i.e., whether P(next state | current state) is independent of history after automatic clustering). Because the central claim is that the DTMC yields the correct success-time distribution and reliable analytic RDC, this gap is load-bearing; a direct test (conditional independence or order-2 Markov check on the clustered sequences) is required.
- [Methods / clustering and transition estimation] Methods / clustering step: the automatic cluster taxonomy is presented as producing states suitable for an absorbing DTMC, yet no quantitative check (e.g., memory test or silhouette analysis tied to transition independence) is described. If residual history dependence remains, the Laplace-smoothed MLE transitions and the subsequent first-passage formulas become a convenient projection rather than a verified model, undermining extrapolation claims even when L∞RDC ≤ 0.053.
minor comments (2)
- [Model definition] Notation: the symbols R̂⊕ and R̂⊖ are introduced without an explicit equation linking them to the standard Kemeny–Snell canonical form; a short display equation would remove ambiguity.
- [Results / figures] Figure clarity: the held-out RDC overlay plots would benefit from shaded 95 % posterior/bootstrap bands so readers can visually assess the reported median agreement of ≈0.01.
Simulated Author's Rebuttal
We thank the referee for the constructive and precise comments, which help clarify the scope of our validation. We address each major point below and outline the revisions we will make to strengthen the manuscript.
read point-by-point responses
-
Referee: [Abstract / validation procedure] Abstract and validation section: the reported two-sample KS test on the first-passage CDF (p > 0.05 on 7/7) together with composite AIC only confirm that the fitted chain reproduces the observed marginal success-time distribution. They do not test the Markov property itself (i.e., whether P(next state | current state) is independent of history after automatic clustering). Because the central claim is that the DTMC yields the correct success-time distribution and reliable analytic RDC, this gap is load-bearing; a direct test (conditional independence or order-2 Markov check on the clustered sequences) is required.
Authors: We agree that the KS test on the first-passage CDF and the composite AIC primarily validate agreement between the empirical and analytic marginal success-time distributions rather than directly confirming the Markov property after clustering. While the strong held-out RDC overlay (max L∞ = 0.053) and high p-values provide indirect support—non-Markovian residuals would be expected to degrade the first-passage match—we acknowledge that this is not a substitute for an explicit test. In the revised manuscript we will add an order-2 Markov test (likelihood-ratio comparison of order-1 vs. order-2 transitions on the clustered sequences) together with a conditional-independence diagnostic, reporting the results in the validation section. revision: yes
-
Referee: [Methods / clustering and transition estimation] Methods / clustering step: the automatic cluster taxonomy is presented as producing states suitable for an absorbing DTMC, yet no quantitative check (e.g., memory test or silhouette analysis tied to transition independence) is described. If residual history dependence remains, the Laplace-smoothed MLE transitions and the subsequent first-passage formulas become a convenient projection rather than a verified model, undermining extrapolation claims even when L∞RDC ≤ 0.053.
Authors: The referee is correct that the manuscript currently lacks an explicit quantitative check linking the clustering output to transition independence. The automatic taxonomy is intended to yield states for which the Markov assumption is approximately valid, but without a memory test this remains an unverified modeling choice. We will therefore add, in the methods section, a memory test (order-1 vs. order-2 likelihood-ratio test on the clustered traces) and report any associated silhouette scores or transition-stability metrics. These diagnostics will be used to qualify the reliability of the subsequent first-passage formulas and extrapolation claims. revision: yes
Circularity Check
No significant circularity; derivation self-contained via held-out validation
full rationale
The paper fits an absorbing DTMC to 50% of traces, derives first-passage success-time quantities (including unified pass@k, pass^k, and RDC as projections), and validates them directly against held-out empirical distributions on the remaining 50% using KS tests (p>0.05 on 7/7) and RDC L∞ overlays (max 0.053). This external test-set comparison prevents any reduction of predictions to training inputs by construction. Classical reliability results (Kemeny-Snell, Cheung, Goel-Okumoto) are adapted without self-citation chains, and no self-definitional, ansatz-smuggling, or renaming steps appear in the provided text. The central claim rests on empirical match rather than tautology.
Axiom & Free-Parameter Ledger
free parameters (1)
- Transition probabilities (Q, R)
axioms (2)
- domain assumption Agent traces can be represented as an absorbing discrete-time Markov chain after automatic clustering
- domain assumption The Markov property holds for the clustered states
Forward citations
Cited by 1 Pith paper
-
Why Retrying Fails: Context Contamination in LLM Agent Pipelines
A Context-Contaminated Restart Model derives exact success probabilities and an optimal pipeline depth T* = sqrt(B * log(1/(1-ε1)) / log(1/(1-ε0))) for fixed budget B, validated on SWE-bench where it fits data far bet...
Reference graph
Works this paper leans on
-
[1]
J. G. Kemeny and J. L. Snell,Finite Markov Chains. Princeton, NJ: Van Nostrand, 1960
1960
-
[2]
A user-oriented software reliability model,
R. C. Cheung, “A user-oriented software reliability model,”IEEE Transactions on Software Engineering, vol. SE-6, no. 2, pp. 118–125, 1980
1980
-
[3]
Time-dependent error-detection rate model for software reliability and other performance measures,
A. L. Goel and K. Okumoto, “Time-dependent error-detection rate model for software reliability and other performance measures,”IEEE Transactions on Reliability, vol. R-28, no. 3, pp. 206–211, 1979
1979
-
[4]
ReAct: Synergizing reasoning and acting in language models,
S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y . Cao, “ReAct: Synergizing reasoning and acting in language models,” in International Conference on Learning Representations, 2023
2023
-
[5]
Toolformer: Language Models Can Teach Themselves to Use Tools
T. Schick, J. Dwivedi-Yu, R. Dess `ı, R. Raileanu, M. Lomeli, L. Zettle- moyer, N. Cancedda, and T. Scialom, “Toolformer: Language models can teach themselves to use tools,”arXiv preprint arXiv:2302.04761, 2023
work page internal anchor Pith review arXiv 2023
-
[6]
A survey on large language model based autonomous agents,
L. Wang, C. Ma, X. Feng, Z. Zhang, H. Yang, J. Zhang, Z. Chen, J. Tang, X. Chen, Y . Lin, W. X. Zhao, Z. Wei, and J.-R. Wen, “A survey on large language model based autonomous agents,”Frontiers of Computer Science, vol. 18, no. 6, p. 186345, 2024
2024
-
[7]
Basic concepts and taxonomy of dependable and secure computing,
A. Aviˇzienis, J.-C. Laprie, B. Randell, and C. Landwehr, “Basic concepts and taxonomy of dependable and secure computing,”IEEE Transactions on Dependable and Secure Computing, vol. 1, no. 1, pp. 11–33, 2004
2004
-
[8]
Beyer, C
B. Beyer, C. Jones, J. Petoff, and N. R. Murphy,Site Reliability Engineering: How Google Runs Production Systems. Sebastopol, CA: O’Reilly Media, 2016
2016
-
[9]
τ-bench: A benchmark for tool-agent-user interaction in real-world domains,
S. Yao, N. Shinn, P. Razavi, and K. Narasimhan, “ τ-bench: A benchmark for tool-agent-user interaction in real-world domains,” inAdvances in Neural Information Processing Systems, 2024
2024
-
[10]
G. W. Stewart,Matrix Algorithms, Volume I: Basic Decompositions. Philadelphia, PA: SIAM, 1998, neumann-series perturbation bound on (I−Q) −1
1998
-
[11]
R. A. Horn and C. R. Johnson,Matrix Analysis, 2nd ed. Cambridge, UK: Cambridge University Press, 2013, jordan-form norm bounds on ∥Qd∥; see Thms. 3.1.12 and 5.6.12
2013
-
[12]
K. S. Trivedi and A. Bobbio,Reliability and Availability Engineering: Modeling, Analysis, and Applications. Cambridge, UK: Cambridge University Press, 2017
2017
-
[13]
J. D. Musa, A. Iannino, and K. Okumoto,Software Reliability: Measure- ment, Prediction, Application. New York, NY: McGraw-Hill, 1987
1987
-
[14]
Karlin,Total Positivity, Volume I
S. Karlin,Total Positivity, Volume I. Stanford, CA: Stanford University Press, 1968, proves that TP2 kernels are stochastically monotone
1968
-
[15]
Monotone matrices and monotone Markov processes,
J. Keilson and A. Kester, “Monotone matrices and monotone Markov processes,”Stochastic Processes and their Applications, vol. 5, no. 3, pp. 231–241, 1977, formal definition of stochastic monotonicity for transition kernels
1977
-
[16]
Evaluating Large Language Models Trained on Code
M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y . Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-V...
work page internal anchor Pith review arXiv 2021
-
[17]
A. Khanal, Y . Tao, and J. Zhou, “Beyond pass@1: A reliability science framework for long-horizon LLM agents,”arXiv preprint arXiv:2603.29231, 2026
-
[18]
PRISM 4.0: Verification of probabilistic real-time systems,
M. Kwiatkowska, G. Norman, and D. Parker, “PRISM 4.0: Verification of probabilistic real-time systems,” inComputer Aided Verification. Springer, 2011, pp. 585–591
2011
-
[19]
A storm is coming: A modern probabilistic model checker,
C. Dehnert, S. Junges, J.-P. Katoen, and M. V olk, “A storm is coming: A modern probabilistic model checker,” inComputer Aided Verification. Springer, 2017, pp. 592–600
2017
-
[20]
TriCEGAR: A trace-driven abstraction mechanism for agentic AI,
R. Koohestani, A. G ¨orpelio˘glu, E. Klimov, B. K. Ozkan, and M. Izadi, “TriCEGAR: A trace-driven abstraction mechanism for agentic AI,”arXiv preprint arXiv:2601.22997, 2026
-
[21]
Pro2guard: Proactive runtime enforcement of llm agent safety via probabilistic model checking
H. Wang, C. M. Poskitt, J. Wei, and J. Sun, “ProbGuard: Probabilistic run- time monitoring for LLM agent safety,”arXiv preprint arXiv:2508.00500, 2025
-
[22]
Reflexion: Language agents with verbal reinforcement learning,
N. Shinn, F. Cassano, E. Berman, A. Gopinath, K. Narasimhan, and S. Yao, “Reflexion: Language agents with verbal reinforcement learning,” inAdvances in Neural Information Processing Systems, 2023
2023
-
[23]
ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs
Y . Qin, S. Liang, Y . Ye, K. Zhu, L. Yan, Y . Lu, Y . Lin, X. Cong, X. Tang, B. Qian, S. Zhao, L. Hong, R. Tian, R. Xie, J. Zhou, M. Gerstein, D. Li, Z. Liu, and M. Sun, “ToolLLM: Facilitating large language models to master 16000+ real-world APIs,”arXiv preprint arXiv:2307.16789, 2023
work page internal anchor Pith review arXiv 2023
-
[24]
SWE-bench: Can language models resolve real-world GitHub issues?
C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan, “SWE-bench: Can language models resolve real-world GitHub issues?” inInternational Conference on Learning Representations, 2024
2024
-
[25]
WebArena: A realistic web environment for building autonomous agents,
S. Zhou, F. F. Xu, H. Zhu, X. Zhou, R. Lo, A. Sridhar, X. Cheng, T. Ou, Y . Bisk, D. Fried, U. Alon, and G. Neubig, “WebArena: A realistic web environment for building autonomous agents,” inInternational Conference on Learning Representations, 2024
2024
-
[26]
ALFWorld: Aligning text and embodied environments for interactive learning,
M. Shridhar, X. Yuan, M.-A. C ˆot´e, Y . Bisk, A. Trischler, and M. Hausknecht, “ALFWorld: Aligning text and embodied environments for interactive learning,” inInternational Conference on Learning Representations, 2021
2021
-
[27]
Voyager: An Open-Ended Embodied Agent with Large Language Models
G. Wang, Y . Xie, Y . Jiang, A. Mandlekar, C. Xiao, Y . Zhu, L. Fan, and A. Anandkumar, “V oyager: An open-ended embodied agent with large language models,”arXiv preprint arXiv:2305.16291, 2023
work page internal anchor Pith review arXiv 2023
-
[28]
Why Do Multi-Agent LLM Systems Fail?
M. Cemri, M. Z. Pan, S. Yang, L. A. Agrawal, B. Chopra, R. Tiwari, K. Keutzer, A. Parameswaran, D. Klein, K. Ramchandran, M. Zaharia, J. E. Gonzalez, and I. Stoica, “Why do multi-agent LLM systems fail?” arXiv preprint arXiv:2503.13657, 2025
work page internal anchor Pith review arXiv 2025
-
[29]
AgentBench: Evaluating LLMs as agents,
X. Liu, H. Yu, H. Zhang, Y . Xu, X. Lei, H. Lai, Y . Gu, H. Ding, K. Men, K. Yang, S. Zhang, X. Deng, A. Zeng, Z. Du, C. Zhang, S. Shen, T. Zhang, Y . Su, H. Sun, M. Huang, Y . Dong, and J. Tang, “AgentBench: Evaluating LLMs as agents,” inInternational Conference on Learning Representations, 2024
2024
-
[30]
A logic for reasoning about time and reliability,
H. Hansson and B. Jonsson, “A logic for reasoning about time and reliability,”Formal Aspects of Computing, vol. 6, no. 5, pp. 512–535, 1994
1994
-
[31]
Baier and J.-P
C. Baier and J.-P. Katoen,Principles of Model Checking. Cambridge, MA: MIT Press, 2008
2008
-
[32]
Hierarchical grouping to optimize an objective function,
J. H. Ward, “Hierarchical grouping to optimize an objective function,” Journal of the American Statistical Association, vol. 58, no. 301, pp. 236–244, 1963
1963
-
[33]
Silhouettes: A graphical aid to the interpretation and validation of cluster analysis,
P. J. Rousseeuw, “Silhouettes: A graphical aid to the interpretation and validation of cluster analysis,”Journal of Computational and Applied Mathematics, vol. 20, pp. 53–65, 1987
1987
-
[34]
A new look at the statistical model identification,
H. Akaike, “A new look at the statistical model identification,”IEEE Transactions on Automatic Control, vol. 19, no. 6, pp. 716–723, 1974
1974
-
[35]
Table for estimating the goodness of fit of empirical distributions,
N. Smirnov, “Table for estimating the goodness of fit of empirical distributions,”Annals of Mathematical Statistics, vol. 19, no. 2, pp. 279–281, 1948
1948
-
[36]
Stochastically monotone Markov chains,
D. J. Daley, “Stochastically monotone Markov chains,”Z. Wahrschein- lichkeitstheorie verw. Gebiete, vol. 10, pp. 305–317, 1968, foundational paper on the stochastic-monotonicity notion used in A4. [Online]. Available: https://link.springer.com/journal/440/volumes-and-issues/10-4
1968
-
[37]
Classes of orderings of measures and related correlation inequalities. I. multivariate totally positive distributions,
S. Karlin and Y . Rinott, “Classes of orderings of measures and related correlation inequalities. I. multivariate totally positive distributions,” Journal of Multivariate Analysis, vol. 10, no. 4, pp. 467–498, 1980, log-concavity of CDFs under TP2 kernels
1980
-
[38]
D. A. Levin and Y . Peres,Markov Chains and Mixing Times, 2nd ed. American Mathematical Society, 2017, pre-mixing constants in non- reversible chains (Thm. 12.3)
2017
-
[39]
ULTRA-MC: A unified approach to learning mixtures of markov chains via hitting times,
F. Spaeh, K. Sotiropoulos, and C. E. Tsourakakis, “ULTRA-MC: A unified approach to learning mixtures of markov chains via hitting times,” arXiv preprint arXiv:2405.15094, 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.