Why Retrying Fails: Context Contamination in LLM Agent Pipelines
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-12 01:53 UTC · model grok-4.3
The pith
Retrying LLM agents after failure contaminates context and raises per-step error rates from a base level ε0 to a higher fixed ε1, requiring new formulas for success probability and budget splits.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under the Context-Contaminated Restart Model, a task consists of T tool-call steps that each fail with probability ε0 on a clean attempt but with elevated probability ε1 > ε0 on every subsequent attempt, because prior failures remain in context. The model supplies a closed-form probability of succeeding within K attempts, a theorem for the additional attempts ΔK caused by contamination, an optimal depth T* = sqrt(B · log(1/(1−ε1)) / log(1/(1−ε0))) that maximizes success probability for a fixed budget B = K·T, an information-theoretic lower bound showing this K is tight up to a constant, and a theorem quantifying the gain from clearing context before each retry. Validation on SWE-bench Verified data shows the IID model overestimates pass@3 by 17.4 percentage points, while CCRM fits with error below 0.001.
What carries the argument
The Context-Contaminated Restart Model (CCRM): a sequence of T tool-call steps in which any failure leaves the context contaminated, raising the per-step failure probability from base rate ε0 to a constant higher rate ε1 for all later attempts.
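Under this model the probability of succeeding within K attempts has a simple closed form: 1 − (1 − p0)(1 − p1)^(K−1), with p0 = (1−ε0)^T for the clean first attempt and p1 = (1−ε1)^T for contaminated retries. A minimal sketch that checks the closed form against simulation (parameter values are illustrative, not the paper's):

```python
import random

def ccrm_success(eps0, eps1, T, K):
    """Closed-form P(success within K attempts) under CCRM: the first
    attempt runs in clean context (per-step error eps0), every later
    attempt in contaminated context (per-step error eps1)."""
    p0 = (1 - eps0) ** T          # clean attempt succeeds end-to-end
    p1 = (1 - eps1) ** T          # contaminated attempt succeeds end-to-end
    return 1 - (1 - p0) * (1 - p1) ** (K - 1)

def ccrm_mc(eps0, eps1, T, K, n=200_000, seed=0):
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n):
        for attempt in range(K):
            eps = eps0 if attempt == 0 else eps1
            if all(rng.random() >= eps for _ in range(T)):
                wins += 1
                break
    return wins / n

closed = ccrm_success(0.02, 0.14, 5, 3)
mc = ccrm_mc(0.02, 0.14, 5, 3)
print(closed, mc)  # the two estimates should agree closely
```

The Monte Carlo check mirrors the paper's stated validation strategy (simulation confirming the closed forms), not its actual experiments.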
If this is right
- Exact closed-form formula for the probability of succeeding within K attempts under contamination.
- Cascade-overhead theorem that states the precise number of extra attempts required compared with a clean-restart baseline.
- Optimal budget-allocation theorem that identifies the pipeline depth T* maximizing success for any fixed total budget B = K T.
- Information-theoretic lower bound via Le Cam's method showing the required number of attempts is within a constant factor of the best possible.
- Clean-restart dominance theorem that quantifies the exact improvement obtained by clearing context before each retry.
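Once ε0 and ε1 are estimated, the budget-allocation closed form is a one-liner. A sketch with illustrative values chosen so that ε1/ε0 = 7, close to the cascade ratio reported in the abstract:

```python
import math

def optimal_depth(B, eps0, eps1):
    """Closed form from the paper's budget-allocation theorem (R3):
    T* = sqrt(B * log(1/(1-eps1)) / log(1/(1-eps0))), with K* = B / T*."""
    t_star = math.sqrt(B * math.log(1 / (1 - eps1)) / math.log(1 / (1 - eps0)))
    return t_star, B / t_star

# Illustrative numbers, not from the paper: budget B = 60 total steps.
T_star, K_star = optimal_depth(B=60, eps0=0.02, eps1=0.14)
print(round(T_star, 2), round(K_star, 2))
```

Since ε1 > ε0 implies log(1/(1−ε1)) > log(1/(1−ε0)), the formula yields T* > sqrt(B): contamination pushes the optimum toward deeper pipelines with fewer retries.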
Where Pith is reading between the lines
- Agent implementations should add context-clearing or summarization steps before retries to avoid paying the contamination penalty.
- The optimal T* formula can be used directly once practitioners estimate ε0 and ε1 from their own logs or benchmarks.
- The same contamination pattern may appear in other sequential LLM workflows such as long chain-of-thought reasoning or multi-turn planning.
- Testing a variable ε1 that depends on failure type or history length would be a direct extension of the constant-rate assumption.
Load-bearing premise
The elevated error rate after contamination is a fixed constant ε1 that does not depend on the details of the failure or on how many prior attempts have occurred.
What would settle it
Measure the actual per-step success rate on the first attempt versus on immediate retries for the same agent and tasks on SWE-bench or a similar benchmark; the ratio of those rates should equal the model's ε1/ε0 if the central claim holds.
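A minimal sketch of that measurement, assuming step-level logs tagged with their attempt index (the log format and values below are invented for illustration, not drawn from SWE-bench):

```python
# Hypothetical step-level log: (attempt_index, step_failed) pairs.
log = [
    (1, False), (1, False), (1, True),   # first attempts (clean context)
    (2, True), (2, False), (2, True),    # retries (contaminated context)
    (2, True), (1, False), (2, False),
]

def estimate_rates(log):
    """Estimate eps0 (first-attempt per-step failure rate) and eps1
    (retry per-step failure rate) by stratifying on attempt index."""
    clean = [failed for attempt, failed in log if attempt == 1]
    dirty = [failed for attempt, failed in log if attempt > 1]
    return sum(clean) / len(clean), sum(dirty) / len(dirty)

eps0_hat, eps1_hat = estimate_rates(log)
print(eps0_hat, eps1_hat, eps1_hat / eps0_hat)
```

Under the central claim, the printed ratio on real logs should approximate the model's ε1/ε0; on these toy values it is just an arithmetic illustration.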
Original abstract
When an LLM agent fails a multi-step tool-augmented task and retries, the failed attempt typically remains in its context window -- contaminating the next attempt and elevating the per-step error rate beyond the base level. This context-contaminated restart phenomenon is widely observed in practice yet entirely lacks formal treatment. We introduce the Context-Contaminated Restart Model (CCRM): a chain of T tool-call steps, each failing with base rate epsilon_0; after any failed attempt, the subsequent attempt operates in contaminated context with elevated error rate epsilon_1 > epsilon_0. Under this model we derive five main results. (R1) An exact closed-form formula for P(succeed in at most K attempts). (R2) A cascade-overhead theorem giving the additional attempts Delta K incurred by contamination versus the clean-restart baseline. (R3) An optimal budget-allocation theorem identifying the pipeline depth T* that maximises success probability for a fixed total budget B=KT; we prove the closed form T* = sqrt(B * log(1/(1-epsilon_1)) / log(1/(1-epsilon_0))), with K*=B/T*. (R4) An information-theoretic lower bound via Le Cam's method showing K_CCRM is tight up to O(1). (R5) A clean-restart dominance theorem quantifying the exact benefit of context-clearing before retry. We validate CCRM on real SWE-bench Verified data: the IID model overestimates pass@3 by 17.4 percentage points (98.6% vs. 81.2%), while CCRM fits with error less than 0.001, implying a cascade ratio of epsilon_1/epsilon_0 = 7.1. Monte Carlo experiments confirm all theoretical predictions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the Context-Contaminated Restart Model (CCRM) for LLM agent pipelines. In this model, a chain of T tool-call steps each fail with base rate ε₀, but after failure the next attempt has elevated error rate ε₁ > ε₀ due to context contamination. The authors derive (R1) closed-form P(success in ≤K attempts), (R2) cascade-overhead theorem, (R3) optimal T* = sqrt(B log(1/(1-ε₁))/log(1/(1-ε₀))) for budget B=KT, (R4) Le Cam lower bound, (R5) clean-restart dominance. Validation on SWE-bench Verified shows CCRM error <0.001 vs IID overestimating pass@3 by 17.4pp, with cascade ratio 7.1.
Significance. The results provide a formal treatment of a common practical issue in LLM agents. The closed-form theorems and tight empirical validation (error <0.001) are notable strengths, as is the explicit optimal-allocation result. If the key assumptions hold, the framework could guide retry strategies in agent systems.
major comments (2)
- [Model Definition and (R3)] The assumption that the contaminated error rate ε₁ is fixed and does not depend on the specific failure details or history length is central to all derived results, including the optimal T* formula in (R3). The manuscript provides no empirical test of this constancy (e.g., by stratifying fits by attempt index or error type), relying instead on a single aggregate fit to SWE-bench data. This makes the optimality theorem's applicability conditional on an unverified assumption.
- [Empirical Validation] The parameters ε₀ and ε₁ are fitted to the SWE-bench Verified data, which is also used to compute the reported cascade ratio of 7.1 and the fit error <0.001. This circularity means the validation does not independently confirm the model's predictive power for the optimal allocation; a hold-out or cross-validation approach would strengthen the claims.
minor comments (1)
- [Abstract] Add explicit equation numbers when referring to the closed forms in the abstract and results summary.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below, acknowledging where the concerns are valid, and outline specific revisions that will be made to strengthen the manuscript.
Point-by-point responses
- Referee: [Model Definition and (R3)] The assumption that the contaminated error rate ε₁ is fixed and does not depend on the specific failure details or history length is central to all derived results, including the optimal T* formula in (R3). The manuscript provides no empirical test of this constancy (e.g., by stratifying fits by attempt index or error type), relying instead on a single aggregate fit to SWE-bench data. This makes the optimality theorem's applicability conditional on an unverified assumption.
Authors: We agree that the constancy of ε₁ is a modeling assumption required for the closed-form results, including the optimal T* derivation. The current validation relies on an aggregate fit to the full SWE-bench Verified trajectories. In the revised manuscript we will add a new subsection that stratifies the observed per-step error rates by retry index (first attempt, second attempt, etc.) and by error category where possible. This will provide a direct empirical check on whether ε₁ remains approximately constant. The results of this stratification will be reported, and if material deviations appear we will qualify the applicability of the optimality theorem accordingly. revision: yes
- Referee: [Empirical Validation] The parameters ε₀ and ε₁ are fitted to the SWE-bench Verified data, which is also used to compute the reported cascade ratio of 7.1 and the fit error <0.001. This circularity means the validation does not independently confirm the model's predictive power for the optimal allocation; a hold-out or cross-validation approach would strengthen the claims.
Authors: We acknowledge that fitting and evaluating on the identical dataset limits the strength of the predictive claims. While the primary empirical contribution is the demonstration that CCRM explains the observed data far better than the IID baseline, we agree that an independent test is needed to support the optimal-allocation result. In the revision we will perform a hold-out experiment: parameters will be fit on 70% of the SWE-bench tasks and the predicted success probabilities, cascade overhead, and optimal T* will be evaluated on the remaining 30%. The results of this out-of-sample check will be added to the empirical section. revision: yes
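A sketch of the proposed split, assuming a list of task identifiers (`holdout_split` is an illustrative helper, not the authors' code; the parameter-fitting and scoring steps are elided):

```python
import random

def holdout_split(tasks, frac=0.7, seed=0):
    """Shuffle tasks and partition them into a fitting split (frac of the
    data) and a held-out evaluation split, as in the proposed revision."""
    rng = random.Random(seed)
    shuffled = list(tasks)
    rng.shuffle(shuffled)
    cut = round(frac * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

# Fit eps0/eps1 on `train`, then score predicted success probabilities,
# cascade overhead, and T* on `test` (estimators omitted here).
train, test = holdout_split(range(100))
print(len(train), len(test))
```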
Circularity Check
No circularity: mathematical derivation of T* is independent of the data fit
full rationale
The paper defines the CCRM with fixed parameters ε0 and ε1, then derives the closed-form T* by optimizing the explicit success-probability expression P(success within K attempts) = 1 − [1 − (1−ε0)^T] · [1 − (1−ε1)^T]^(K−1) under the budget constraint B = K·T. This is a standard calculus exercise on the model equations and does not reduce to the SWE-bench fit. The fit is performed afterward solely for model validation and to obtain numerical ε values; it is not an input to the proof of the closed form. No self-citation, self-definition, or renaming of fitted quantities as predictions occurs in the derivation chain.
Axiom & Free-Parameter Ledger
free parameters (3)
- epsilon_0
- epsilon_1
- cascade_ratio
axioms (2)
- domain assumption Each tool-call step fails independently with a constant probability that depends only on whether the context is clean or contaminated.
- domain assumption Contamination persists uniformly for the entire retry attempt once a failure has occurred.
invented entities (1)
- Contaminated context state (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · match unclear · Theorem 3 (Optimal pipeline depth): T* = sqrt(B·log(1/(1−ε1))/log(1/(1−ε0)))
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat recovery · match unclear · CCRM success formula (Theorem 1): modified geometric distribution
Reference graph
Works this paper leans on
- [1] Cover, T.M., Thomas, J.A.: Elements of Information Theory, 2nd edn. Wiley (2006)
- [2] Jimenez, C.E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., Narasimhan, K.R.: SWE-bench: Can language models resolve real-world GitHub issues? In: Proc. ICLR (2024)
- [3] Le Cam, L.: Convergence of estimates under dimensionality restrictions. Ann. Stat. 1(1), 38–53 (1973)
- [4] Qin, Y., Liang, S., Ye, Y., Zhu, K., Yan, L., Lu, Y., et al.: ToolLLM: Facilitating large language models to master 16000+ real-world APIs. arXiv:2307.16789 (2023)
- [5] Rausand, M., Barros, A., Hoyland, A.: System Reliability Theory, 3rd edn. Wiley (2020)
- [6]
- [7] Trivedi, K.S.: Probability and Statistics with Reliability, Queuing and Computer Science Applications, 2nd edn. Wiley (2002)
- [8] Wang, L., Ma, C., Feng, X., Zhang, Z., et al.: A survey on large language model based autonomous agents. Front. Comput. Sci. 18(6), 186345 (2024)
- [9] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing reasoning and acting in language models. In: Proc. ICLR (2023)
- [10] Wang, F., Liu, H., Dai, Z., Zeng, J., et al.: AgentTTS: LLM agent for test-time compute-optimal scaling in complex tasks. arXiv:2508.00890 (2025)
- [11]
- [12]
- [13] Noël, V.: Catching contamination before generation: spectral kill switches for agents. arXiv:2511.05804 (2025)
- [14] Verdent AI: SWE-bench Verified Technical Report: 76.1% pass@1 and 81.2% pass@3. https://www.verdent.ai/blog/swe-bench-verified-technical-report (2025)
- [15] Liu, T., Wang, Z., Miao, J., Hsu, I., Yan, J., Chen, J., Han, R., Xu, F., Chen, Y., Jiang, K., Daruki, S., Liang, Y., Wang, W.Y., Pfister, T., Lee, C.Y.: Budget-aware tool-use enables effective agent scaling. arXiv:2511.17006 (2025)
- [16]
- [17] Zhu, K., Liu, Z., Li, B., Tian, M., Yang, Y., Zhang, J., Han, P., Xie, Q., Cui, F., Zhang, W., et al.: Where LLM agents fail and how they can learn from failures. arXiv:2509.25370 (2025)
- [18] Lee, N., Erdogan, L.E., John, C.J., Krishnapillai, S., Mahoney, M.W., Keutzer, K., Gholami, A.: Agentic test-time scaling for web agents. arXiv:2602.12276 (2026)
- [19]
- [20] Li, Y., Deng, W., Li, J., Li, X.: Spend less, reason better: budget-aware value tree search for LLM agents. arXiv:2603.12634 (2026)
- [21] Li, X., Ming, R., Setlur, P., Paladugu, A., Tang, A., Kang, H., Shao, S., Jin, R., Xiong, C.: Benchmark test-time scaling of general LLM agents. arXiv:2602.18998 (2026)
- [22] Datadog: State of AI Engineering 2026. https://www.datadoghq.com/state-of-ai-engineering/ (2026)
- [23] Khanal, A., Tao, Y., Zhou, J.: Beyond pass@1: a reliability science framework for long-horizon LLM agents. arXiv:2603.29231 (2026)
- [24] Wang, X.J., Bai, H., Sun, Y., Wang, H., Zhang, S., Hu, W., Schroder, M., Mutlu, B., Song, D., Nowak, R.D.: The long-horizon task mirage: diagnosing where and why agentic systems break. arXiv:2604.11978 (2026)
- [25] LogRocket Blog: The LLM context problem in 2026. https://blog.logrocket.com/llm-context-problem-strategies-2026/ (2026)
- [26] Fan, F.X., Tan, C., Wattenhofer, R., Ong, Y.S.: Information fidelity in tool-using LLM agents: a martingale analysis of the Model Context Protocol. In: Proc. AAMAS (2026). arXiv:2602.13320
- [27] Patel, K., Surendira, S., George, J., Kapale, S.: The Six Sigma agent: enterprise-grade reliability via consensus-driven decomposed execution. arXiv:2601.22290 (2026)
- [28] Tran-Truong, P.T., Le, X.B.: Measuring the unmeasurable: Markov chain reliability for LLM agents. arXiv:2604.24579 (2026)
discussion (0)