MASPrism: Lightweight Failure Attribution for Multi-Agent Systems Using Prefill-Stage Signals
Pith reviewed 2026-05-11 01:48 UTC · model grok-4.3
The pith
MASPrism identifies failure sources in multi-agent LLM traces using prefill-stage signals from a small language model without any decoding.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central discovery is that a two-pass prefill process on a small language model can attribute failures by first identifying candidate symptom steps and source actions via negative log-likelihood and attention weights extracted without decoding, then ranking the sources with a focused prompt in a second prefill pass.
What carries the argument
The two-stage prefill process on a small language model that extracts token-level negative log-likelihood and attention weights to detect symptoms and rank failure sources.
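The first-pass signal extraction can be sketched in a few lines. This is a minimal numpy illustration of the idea, not the paper's implementation: `token_nll` computes per-token negative log-likelihood from next-token logits (the kind of quantity available during a prefill pass without decoding), and `step_surprisal` aggregates it over assumed per-step token spans so that high-surprisal steps surface as symptom candidates. The step-span representation and mean aggregation are illustrative assumptions.

```python
import numpy as np

def token_nll(logits, token_ids):
    """Per-token negative log-likelihood from next-token logits.

    logits: (T, V) array where logits[t] is the distribution over the
            token observed at position t; token_ids: (T,) observed ids.
    """
    # Numerically stable log-softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(token_ids)), token_ids]

def step_surprisal(nll, step_spans):
    """Mean NLL per trace step; unusually high values flag symptom candidates."""
    return [float(nll[a:b].mean()) for a, b in step_spans]
```

In a real pipeline the logits would come from a single forward (prefill) pass of the SLM over the serialized trace; here they are just an array, which keeps the sketch self-contained.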
If this is right
- MASPrism achieves the best performance on three of four evaluated subsets on Who&When and TRAIL benchmarks.
- It improves Top-1 accuracy by 33.41% over the best baseline on Who&When-HC.
- On TRAIL it shows up to 89.50% relative improvement over strong proprietary LLMs such as Gemini-2.5-Pro.
- Each trace is processed in 2.66 seconds on average, a 6.69× speedup over single-pass prompting, with zero output tokens generated.
Where Pith is reading between the lines
- Such prefill-based signals might apply to debugging other sequential decision processes beyond multi-agent systems.
- The method suggests that small models can provide useful diagnostic information for large agent workflows without matching their scale.
- Integration into runtime monitoring could allow automatic flagging of failure points during live executions.
Load-bearing premise
Token-level negative log-likelihood and attention weights extracted during prefill passes on a small language model suffice to identify symptom steps and earlier failure sources without full decoding, replay, or task-specific training.
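The attention half of this premise can be made concrete with a small sketch. Assuming a row-stochastic attention matrix from a single prefill pass (e.g. averaged over heads and layers) and per-step token spans, earlier steps can be scored by how much attention mass the symptom step's tokens direct at them; both the averaging and the "attention mass received" scoring rule are illustrative assumptions, not the paper's stated procedure.

```python
import numpy as np

def source_scores(attn, step_spans, symptom_step):
    """Score earlier steps by attention mass flowing from symptom-step tokens.

    attn: (T, T) attention matrix (query row attends to key column),
          e.g. averaged over heads/layers from one prefill pass.
    step_spans: list of (start, end) token spans, one per trace step.
    """
    qa, qb = step_spans[symptom_step]
    scores = []
    for s, (a, b) in enumerate(step_spans):
        if s >= symptom_step:
            scores.append(0.0)  # only steps before the symptom can be sources
        else:
            scores.append(float(attn[qa:qb, a:b].mean()))
    return scores
```

The point of the premise is that a ranking like this, computed from prefill-stage activations alone, is informative enough to localize the source without replay or training.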
What would settle it
Observing no statistical link between the prefill-extracted signals and the locations of known injected failures in a set of multi-agent traces would falsify the central claim.
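One way to operationalize that falsification test is a permutation test: compare how often the prefill-signal peaks land on injected failure steps against the hit rate of uniformly random step predictions. This is a hedged sketch of such a check, not anything the paper specifies; the uniform null and hit-count statistic are assumptions.

```python
import numpy as np

def permutation_pvalue(signal_peaks, failure_steps, n_steps, n_perm=2000, seed=0):
    """One-sided permutation test for association between predicted peaks
    and injected failure locations. Small p rejects the no-link null."""
    rng = np.random.default_rng(seed)
    failures = set(failure_steps)
    observed = sum(p in failures for p in signal_peaks)
    count = 0
    for _ in range(n_perm):
        # Null: peaks placed uniformly at random over the trace steps.
        perm = rng.integers(0, n_steps, size=len(signal_peaks))
        if sum(int(p) in failures for p in perm) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)
```

A p-value near 1 across a corpus of traces with known injected failures would indicate the prefill signals carry no attribution information, falsifying the central claim.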
Original abstract
Failure attribution in LLM-based multi-agent systems aims to identify the steps that contribute to a failed execution. This task remains difficult because a single execution can contain many agent actions and tool calls, failure evidence can appear many steps after the original mistake, and existing methods often rely on costly agent workflows, replay, or training on synthetic failure logs. To address these challenges, we propose MASPrism, a lightweight framework that performs failure attribution using prefill-stage signals from a small language model (SLM). MASPrism first extracts token-level negative log-likelihood and attention weights during a prefill pass to identify symptom-like steps and earlier candidate sources, without decoding. It then reconstructs a focused diagnostic prompt and performs a second prefill pass to rank failure-source candidates. Using Qwen3-0.6B as the SLM, MASPrism achieves the best performance on three of the four evaluated subsets across Who&When and TRAIL, improving Top-1 accuracy on Who&When-HC by 33.41% over the best baseline. On TRAIL, MASPrism outperforms strong proprietary LLMs, including Gemini-2.5-Pro, with up to 89.50% relative improvement. MASPrism processes each trace in 2.66 seconds on average, achieving a 6.69$\times$ speedup over the single-pass prompting baseline, with zero output tokens. These results show that MASPrism provides an effective and practical framework for failure attribution in long multi-agent execution logs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes MASPrism, a lightweight failure attribution framework for LLM-based multi-agent systems. It extracts token-level negative log-likelihood and attention weights from a single prefill pass on Qwen3-0.6B to identify symptom steps and candidate failure sources without decoding or task-specific training, then reconstructs a focused diagnostic prompt for a second prefill pass to rank sources. Evaluations on Who&When (including HC subset) and TRAIL benchmarks report that MASPrism achieves best performance on three of four subsets, with 33.41% Top-1 accuracy gain over the best baseline on Who&When-HC, up to 89.50% relative improvement over Gemini-2.5-Pro on TRAIL, average 2.66s per trace, and 6.69× speedup versus single-pass prompting with zero output tokens.
Significance. If the empirical results hold under rigorous baseline re-implementation and statistical testing, MASPrism would represent a practical advance for debugging long-horizon multi-agent executions by avoiding the cost of full decoding, replay, or synthetic training data. The combination of competitive or superior accuracy with substantial latency reduction and no output tokens could influence reliability tooling in agentic workflows, particularly where proprietary LLM calls are expensive.
Minor Comments (3)
- [§3.2] The reconstruction of the diagnostic prompt from first-pass signals is described at a high level; providing the exact template or pseudocode would improve reproducibility of the two-pass procedure.
- [Tables 2-3] Report standard deviations or p-values for the Top-1 accuracy differences to substantiate the claimed improvements over baselines.
- [§4.1] Clarify the exact data splits and whether any hyperparameter tuning on the evaluation sets was performed, given the empirical nature of the comparisons.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of MASPrism and the recommendation for minor revision. We appreciate the recognition that the approach offers a practical advance for debugging long-horizon multi-agent executions through prefill-stage signals without requiring full decoding or task-specific training.
Circularity Check
No significant circularity
Full rationale
The paper presents an empirical framework for failure attribution that extracts token-level negative log-likelihood and attention weights from prefill passes on a small language model, then ranks candidates via a second prefill. All performance claims (Top-1 accuracy gains, relative improvements over Gemini-2.5-Pro, 6.69× speedup) are supported by direct experimental comparisons on the Who&When and TRAIL datasets against external baselines. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the method or results. The approach is non-circular because its central claims rest on observable signals and replicable benchmark outcomes rather than on conclusions guaranteed by the method's construction.