MASPrism: Lightweight Failure Attribution for Multi-Agent Systems Using Prefill-Stage Signals
Pith reviewed 2026-05-11 01:48 UTC · model grok-4.3
The pith
MASPrism identifies failure sources in multi-agent LLM traces using prefill-stage signals from a small language model without any decoding.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central discovery is that a two-pass prefill process on a small language model can attribute failures by first identifying candidate symptom steps and source actions via negative log-likelihood and attention weights extracted without decoding, then ranking the sources with a focused prompt in a second prefill pass.
What carries the argument
The two-stage prefill process on a small language model that extracts token-level negative log-likelihood and attention weights to detect symptoms and rank failure sources.
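The first-pass signal extraction can be sketched in a few lines. This is a minimal numpy illustration of the idea, not the paper's implementation: `token_nll` computes per-token negative log-likelihood from next-token logits (the kind of quantity available during a prefill pass without decoding), and `step_surprisal` aggregates it over assumed per-step token spans so that high-surprisal steps surface as symptom candidates. The step-span representation and mean aggregation are illustrative assumptions.

```python
import numpy as np

def token_nll(logits, token_ids):
    """Per-token negative log-likelihood from next-token logits.

    logits: (T, V) array where logits[t] is the distribution over the
            token observed at position t; token_ids: (T,) observed ids.
    """
    # Numerically stable log-softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(token_ids)), token_ids]

def step_surprisal(nll, step_spans):
    """Mean NLL per trace step; unusually high values flag symptom candidates."""
    return [float(nll[a:b].mean()) for a, b in step_spans]
```

In a real pipeline the logits would come from a single forward (prefill) pass of the SLM over the serialized trace; here they are just an array, which keeps the sketch self-contained.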
If this is right
- MASPrism achieves the best performance on three of four evaluated subsets on Who&When and TRAIL benchmarks.
- It improves Top-1 accuracy by 33.41% over the best baseline on Who&When-HC.
- On TRAIL it shows up to 89.50% relative improvement over strong proprietary LLMs such as Gemini-2.5-Pro.
- Each trace is processed in 2.66 seconds on average, a 6.69× speedup over single-pass prompting, with zero output tokens generated.
Where Pith is reading between the lines
- Such prefill-based signals might apply to debugging other sequential decision processes beyond multi-agent systems.
- The method suggests that small models can provide useful diagnostic information for large agent workflows without matching their scale.
- Integration into runtime monitoring could allow automatic flagging of failure points during live executions.
Load-bearing premise
Token-level negative log-likelihood and attention weights extracted during prefill passes on a small language model suffice to identify symptom steps and earlier failure sources without full decoding, replay, or task-specific training.
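The attention half of this premise can be made concrete with a small sketch. Assuming a row-stochastic attention matrix from a single prefill pass (e.g. averaged over heads and layers) and per-step token spans, earlier steps can be scored by how much attention mass the symptom step's tokens direct at them; both the averaging and the "attention mass received" scoring rule are illustrative assumptions, not the paper's stated procedure.

```python
import numpy as np

def source_scores(attn, step_spans, symptom_step):
    """Score earlier steps by attention mass flowing from symptom-step tokens.

    attn: (T, T) attention matrix (query row attends to key column),
          e.g. averaged over heads/layers from one prefill pass.
    step_spans: list of (start, end) token spans, one per trace step.
    """
    qa, qb = step_spans[symptom_step]
    scores = []
    for s, (a, b) in enumerate(step_spans):
        if s >= symptom_step:
            scores.append(0.0)  # only steps before the symptom can be sources
        else:
            scores.append(float(attn[qa:qb, a:b].mean()))
    return scores
```

The point of the premise is that a ranking like this, computed from prefill-stage activations alone, is informative enough to localize the source without replay or training.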
What would settle it
Observing no statistical link between the prefill-extracted signals and the locations of known injected failures in a set of multi-agent traces would falsify the central claim.
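One way to operationalize that falsification test is a permutation test: compare how often the prefill-signal peaks land on injected failure steps against the hit rate of uniformly random step predictions. This is a hedged sketch of such a check, not anything the paper specifies; the uniform null and hit-count statistic are assumptions.

```python
import numpy as np

def permutation_pvalue(signal_peaks, failure_steps, n_steps, n_perm=2000, seed=0):
    """One-sided permutation test for association between predicted peaks
    and injected failure locations. Small p rejects the no-link null."""
    rng = np.random.default_rng(seed)
    failures = set(failure_steps)
    observed = sum(p in failures for p in signal_peaks)
    count = 0
    for _ in range(n_perm):
        # Null: peaks placed uniformly at random over the trace steps.
        perm = rng.integers(0, n_steps, size=len(signal_peaks))
        if sum(int(p) in failures for p in perm) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)
```

A p-value near 1 across a corpus of traces with known injected failures would indicate the prefill signals carry no attribution information, falsifying the central claim.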
Original abstract
Failure attribution in LLM-based multi-agent systems aims to identify the steps that contribute to a failed execution. This task remains difficult because a single execution can contain many agent actions and tool calls, failure evidence can appear many steps after the original mistake, and existing methods often rely on costly agent workflows, replay, or training on synthetic failure logs. To address these challenges, we propose MASPrism, a lightweight framework that performs failure attribution using prefill-stage signals from a small language model (SLM). MASPrism first extracts token-level negative log-likelihood and attention weights during a prefill pass to identify symptom-like steps and earlier candidate sources, without decoding. It then reconstructs a focused diagnostic prompt and performs a second prefill pass to rank failure-source candidates. Using Qwen3-0.6B as the SLM, MASPrism achieves the best performance on three of the four evaluated subsets across Who&When and TRAIL, improving Top-1 accuracy on Who&When-HC by 33.41% over the best baseline. On TRAIL, MASPrism outperforms strong proprietary LLMs, including Gemini-2.5-Pro, with up to 89.50% relative improvement. MASPrism processes each trace in 2.66 seconds on average, achieving a 6.69$\times$ speedup over the single-pass prompting baseline, with zero output tokens. These results show that MASPrism provides an effective and practical framework for failure attribution in long multi-agent execution logs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes MASPrism, a lightweight failure attribution framework for LLM-based multi-agent systems. It extracts token-level negative log-likelihood and attention weights from a single prefill pass on Qwen3-0.6B to identify symptom steps and candidate failure sources without decoding or task-specific training, then reconstructs a focused diagnostic prompt for a second prefill pass to rank sources. Evaluations on Who&When (including HC subset) and TRAIL benchmarks report that MASPrism achieves best performance on three of four subsets, with 33.41% Top-1 accuracy gain over the best baseline on Who&When-HC, up to 89.50% relative improvement over Gemini-2.5-Pro on TRAIL, average 2.66s per trace, and 6.69× speedup versus single-pass prompting with zero output tokens.
Significance. If the empirical results hold under rigorous baseline re-implementation and statistical testing, MASPrism would represent a practical advance for debugging long-horizon multi-agent executions by avoiding the cost of full decoding, replay, or synthetic training data. The combination of competitive or superior accuracy with substantial latency reduction and no output tokens could influence reliability tooling in agentic workflows, particularly where proprietary LLM calls are expensive.
Minor Comments (3)
- [§3.2] The reconstruction of the diagnostic prompt from first-pass signals is described at a high level; providing the exact template or pseudocode would improve reproducibility of the two-pass procedure.
- [Tables 2-3] Report standard deviations or p-values for the Top-1 accuracy differences to substantiate the claimed improvements over baselines.
- [§4.1] Clarify the exact data splits and whether any hyperparameter tuning on the evaluation sets was performed, given the empirical nature of the comparisons.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of MASPrism and the recommendation for minor revision. We appreciate the recognition that the approach offers a practical advance for debugging long-horizon multi-agent executions through prefill-stage signals without requiring full decoding or task-specific training.
Circularity Check
No significant circularity
Full rationale
The paper presents an empirical framework for failure attribution that extracts token-level negative log-likelihood and attention weights from prefill passes on a small language model, then ranks candidates via a second prefill. All performance claims (Top-1 accuracy gains, relative improvements over Gemini-2.5-Pro, 6.69× speedup) are supported by direct experimental comparisons on the Who&When and TRAIL datasets against external baselines. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the method or results. The approach is non-circular because its central claims rest on observable signals and replicable benchmark outcomes rather than on conclusions guaranteed by the method's construction.