Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents

Antoine Bosselut; Ayush K Tarun; Maksym Andriushchenko; Murari Mandal; Nivya Talokar

arxiv: 2602.16346 · v4 · pith:55TDSEUFnew · submitted 2026-02-18 · 💻 cs.CL · cs.LG


Nivya Talokar, Ayush K Tarun, Murari Mandal, Maksym Andriushchenko, Antoine Bosselut

Pith reviewed 2026-05-21 13:16 UTC · model grok-4.3

classification: cs.CL, cs.LG

keywords: LLM agents, red-teaming, multi-turn interactions, illicit assistance, jailbreak, multilingual, AgentHarm


The pith

STING elicits higher illicit-task completion from LLM agents than single-turn baselines do

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops STING to test whether LLM agents help with illicit tasks over multiple turns of conversation. It builds a plan step by step under a benign cover persona and uses other agents as judges to track how far the target gets. This approach matters because agents in practice handle extended interactions with tools, unlike the single-prompt tests used before. Results indicate that STING elicits more completed harmful tasks on the AgentHarm set than single-turn prompting, and that attack success does not consistently rise in lower-resource languages, unlike what chatbot research reports.

Core claim

By treating red-teaming as a sequential process with adaptive follow-up questions based on a step-by-step illicit plan grounded in a benign persona, and using judge agents to monitor phase completion, the STING framework achieves substantially higher rates of illicit-task completion in AgentHarm scenarios than single-turn prompting or adapted chat baselines. The work also introduces analysis methods based on modeling the process as time-to-first-jailbreak and finds that multilingual attack success does not consistently increase in lower-resource languages.

What carries the argument

STING (Sequential Testing of Illicit N-step Goal execution), which iteratively probes target agents with adaptive follow-ups derived from an N-step illicit plan and employs judge agents to determine phase completion.
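The control flow described here can be sketched as a small loop. This is only an illustration under stated assumptions, not the authors' implementation: the function names (`run_strategy`, `toy_attacker`, `toy_target`), the feedback strings, and the per-phase turn limit are all hypothetical, and the attacker, target, and judges are trivial stand-ins for what would be LLM calls. The only detail taken from the source is the order of checks in the Figure 1 caption: refusal detection first, then the phase-completion check.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PhaseResult:
    phase: str
    completed: bool
    turns_used: int


def run_strategy(
    phases: List[str],
    attacker: Callable[[str, List[str], str], str],
    target: Callable[[List[str]], str],
    refused: Callable[[str], bool],
    phase_done: Callable[[str, str], bool],
    max_turns_per_phase: int = 3,  # hypothetical budget, not from the paper
) -> List[PhaseResult]:
    """Probe the target phase by phase; stop at the first phase that fails."""
    history: List[str] = []
    results: List[PhaseResult] = []
    for phase in phases:
        feedback = ""  # judge feedback handed to the attacker for its follow-up
        completed = False
        turns = 0
        for turns in range(1, max_turns_per_phase + 1):
            message = attacker(phase, history, feedback)
            reply = target(history + [message])
            history += [message, reply]
            if refused(reply):  # refusal check runs first
                feedback = "refused"
                continue
            if phase_done(phase, reply):  # then the phase-completion check
                completed = True
                break
            feedback = "incomplete"
        results.append(PhaseResult(phase, completed, turns))
        if not completed:
            break
    return results


# Toy stand-ins with placeholder phases; a real harness would call models here.
def toy_attacker(phase: str, history: List[str], feedback: str) -> str:
    suffix = f" [rephrased after: {feedback}]" if feedback else ""
    return f"request for {phase}{suffix}"


def toy_target(messages: List[str]) -> str:
    last = messages[-1]
    if "phase-2" in last and "rephrased" not in last:
        return "REFUSE"
    return f"DONE: {last}"


results = run_strategy(
    ["phase-1", "phase-2"],
    toy_attacker,
    toy_target,
    refused=lambda reply: reply == "REFUSE",
    phase_done=lambda phase, reply: reply.startswith("DONE") and phase in reply,
)
for r in results:
    print(r)
```

In this toy run the target refuses the first request for the second phase and complies after one adaptive follow-up, so that phase is recorded as completed in two turns.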

If this is right

STING produces higher illicit-task completion rates than single-turn prompting and chat-oriented multi-turn baselines on AgentHarm scenarios.
The time-to-first-jailbreak modeling enables tools such as discovery curves and hazard-ratio attribution by attack language.
Restricted Mean Jailbreak Discovery serves as a new metric for evaluating multi-turn red-teaming.
Multilingual evaluations show attack success and illicit-task completion do not consistently increase in lower-resource languages.
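The discovery-curve idea above can be made concrete with a small worked sketch. It uses the standard Kaplan–Meier product-limit estimator on (strategy index, jailbroken?) pairs, with behaviours that are never jailbroken recorded as right-censored at the budget limit, as the Figure 2 caption describes. The RMJD formula is an assumption: the page truncates the definition, so the sketch takes it to be the area under the discovery curve up to the budget, by analogy with restricted mean survival time. The function names and the data are made up for illustration.

```python
from collections import Counter
from typing import List, Tuple


def discovery_curve(observations: List[Tuple[int, bool]], s_max: int) -> List[float]:
    """Kaplan-Meier discovery curve F(s) = 1 - S(s) for s = 0..s_max.

    Each observation is (strategy index at exit, jailbroken?). A behaviour
    that was never jailbroken is entered as (s_max, False), i.e. censored.
    """
    events = Counter(t for t, jailbroken in observations if jailbroken)
    exits = Counter(t for t, _ in observations)
    at_risk = len(observations)
    survival = 1.0
    curve = [0.0]  # nothing is discovered before the first strategy
    for s in range(1, s_max + 1):
        if at_risk > 0:
            survival *= 1.0 - events[s] / at_risk  # product-limit update
        curve.append(1.0 - survival)
        at_risk -= exits[s]  # jailbroken and censored items leave the risk set
    return curve


def restricted_mean_discovery(curve: List[float]) -> float:
    """Area under the step function F on [0, s_max] (assumed RMJD form)."""
    return sum(curve[:-1])  # F holds the value curve[k] on [k, k + 1)


# Made-up data: five behaviours, budget of three strategies.
observations = [(1, True), (2, True), (2, True), (3, False), (3, False)]
curve = discovery_curve(observations, s_max=3)
rmjd = restricted_mean_discovery(curve)
print(curve, rmjd)
```

Under this assumed definition a larger value means jailbreaks are found earlier or more often, which matches the caption's "higher = earlier/more jailbreak successes". Whether the paper normalizes by the budget is not stated on this page.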

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The results suggest that safety measures for agents should focus on detecting gradual, multi-turn escalations rather than single harmful requests.
The method could be applied to evaluate agent behaviors in other high-stakes domains like financial or medical advice.
Future work might test whether similar sequential probing improves detection of other risks such as privacy leaks over conversations.

Load-bearing premise

Judge agents can reliably and without bias determine when each phase of an illicit plan has been completed by the target agent.

What would settle it

Finding a large number of cases where human evaluators disagree with the judge agents on whether a phase of the illicit plan was completed would falsify the reliability of the measured completion rates.
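One common way to quantify such disagreement is Cohen's kappa, the agreement statistic the simulated rebuttal further down also proposes to report. The sketch below computes it for binary "phase completed" labels from a human rater and a judge agent. The labels are invented for illustration and the function name is hypothetical; nothing here reflects data from the paper.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: (observed - chance agreement) / (1 - chance agreement)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Invented labels: 1 = phase judged complete, 0 = not complete.
human = [1, 1, 0, 0, 1, 0, 1, 1]
judge = [1, 0, 0, 0, 1, 1, 1, 1]
kappa = cohens_kappa(human, judge)
print(round(kappa, 3))
```

Here the raters agree on 6 of 8 items (0.75), but chance agreement is 0.53125, so kappa is about 0.467. Raw agreement alone would overstate how reliable the judge is.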

Figures

Figures reproduced from arXiv: 2602.16346 by Antoine Bosselut, Ayush K Tarun, Maksym Andriushchenko, Murari Mandal, Nivya Talokar.

**Figure 1.** STING: (a) A Strategist constructs a deceptive persona and decomposes the harmful intent into executable phases. (b) The Attacker embodies the persona and attempts each phase against the Target agent. After each target response, the (c) Refusal Detector checks for refusal; if none is detected, the (d) Phase-Completion Checker assesses whether the phase objective has been met. Both evaluators provide action…

**Figure 2.** Kaplan–Meier discovery curves (95% CI) showing the fraction of harmful behaviours for which at least one strategy succeeds (jailbreak) for a given strategy budget; RMJD summarizes each curve (higher = earlier/more jailbreak successes).

(Body text captured with the caption:) …ples that are not jailbroken within Smax strategies are treated as right-censored at Smax (no jailbreak observed by the budget limit). We additionally report the Restricted…

**Figure 3.** AgentHarm Score (%) comparison between single-turn prompting and STING across 7 languages for 3 models. Differences in misuse outcomes are less pronounced than those reported in prior chatbot-focused jailbreak studies (Yong et al., 2023).

**Figure 4.** AgentHarm Score (AHS) for Qwen3-Next and GPT-5.1 under varying reasoning settings across languages. No-thinking settings are consistently less safe. For GPT-5.1, medium reasoning is safer than high reasoning.

Original abstract

LLM-based agents execute real-world workflows via tools and memory. These affordances enable ill-intended adversaries to also use these agents to carry out complex misuse scenarios. Existing agent misuse benchmarks largely test single-prompt instructions, leaving a gap in measuring how agents end up helping with harmful or illegal tasks over multiple turns. We introduce STING (Sequential Testing of Illicit N-step Goal execution), an automated red-teaming framework that constructs a step-by-step illicit plan grounded in a benign persona and iteratively probes a target agent with adaptive follow-ups, using judge agents to track phase completion. We further introduce an analysis framework that models multi-turn red-teaming as a time-to-first-jailbreak random variable, enabling analysis tools like discovery curves, hazard-ratio attribution by attack language, and a new metric: Restricted Mean Jailbreak Discovery. Across AgentHarm scenarios, STING yields substantially higher illicit-task completion than single-turn prompting and chat-oriented multi-turn baselines adapted to tool-using agents. In multilingual evaluations across six non-English settings, we find that attack success and illicit-task completion do not consistently increase in lower-resource languages, diverging from common chatbot findings. Overall, STING provides a practical way to evaluate and stress-test agent misuse in realistic deployment settings, where interactions are inherently multi-turn and often multilingual.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

STING gives a workable multi-turn red-teaming setup for agents and reports higher illicit completion rates, but the unvalidated LLM judges are a real weak point in the measurement.


The main thing to know is that this paper introduces STING, a framework that builds step-by-step illicit plans and probes agents across multiple adaptive turns while tracking progress with judge agents. It claims substantially higher task completion than single-turn prompts or adapted chat baselines, plus multilingual results that do not show the usual boost in lower-resource languages.

The multi-turn focus and the time-to-first-jailbreak modeling with the Restricted Mean Jailbreak Discovery metric are the concrete additions here. They directly target how agents actually operate in workflows with tools and memory, which single-prompt benchmarks miss. The setup is presented as newly built rather than derived from prior work, and the empirical comparisons are the core evidence offered. The multilingual divergence is also worth noting because it pushes against patterns seen in simpler chatbot tests.

The clearest soft spot is the judge agents used to decide phase completion. The abstract describes feeding trajectories to these judges but gives no human validation, inter-judge agreement numbers, or error analysis. If the judges over-count or under-count completions, especially with follow-ups or non-English text, the reported deltas and the new metric lose grounding. That assumption sits at the center of the main claim, so it is not a minor issue.

The paper is aimed at researchers who evaluate or build tool-using agents and need practical red-teaming methods. Readers working on agent safety or misuse benchmarks will find the framework and analysis tools directly usable. It has enough new machinery, and addresses a clear enough gap, to deserve a serious referee even with the validation gap. I would send it to peer review and ask specifically for human checks on the judges and clearer details on how baselines were adapted.

Referee Report

2 major / 2 minor

Summary. The paper introduces the STING (Sequential Testing of Illicit N-step Goal execution) framework, an automated red-teaming approach that constructs step-by-step illicit plans grounded in benign personas, iteratively probes target LLM agents with adaptive follow-ups, and employs separate judge agents to track completion of each phase of the plan. It reports substantially higher illicit-task completion rates than single-turn prompting and adapted chat-oriented multi-turn baselines on AgentHarm scenarios, introduces analysis tools including discovery curves, hazard-ratio attribution, and the Restricted Mean Jailbreak Discovery metric, and presents multilingual results across six non-English languages indicating that attack success and completion do not consistently increase in lower-resource languages.

Significance. If the central empirical claims hold after addressing validation concerns, the work fills a clear gap in agent misuse evaluation by shifting focus from single-prompt tests to realistic multi-turn interactions. The new metric and time-to-first-jailbreak modeling provide useful analysis tools for the field, and the multilingual findings challenge assumptions carried over from chatbot literature. The framework is presented as practical for deployment stress-testing.

major comments (2)

[§3 and §4] §3 (STING Framework description) and §4 (Experimental Setup): The headline results on illicit-task completion rates and the Restricted Mean Jailbreak Discovery metric are computed by feeding target-agent trajectories to separate judge agents that decide phase completion. The manuscript provides no human validation, inter-judge agreement statistics, or error analysis for these judges, particularly under adaptive follow-ups or in non-English languages. This is load-bearing for the central claim of substantially higher completion versus baselines.
[§5] §5 (Results, baseline comparisons): The abstract and results section state that STING outperforms 'chat-oriented multi-turn baselines adapted to tool-using agents,' yet the manuscript supplies no detailed description of the adaptation procedure, no ablation of the adaptations, and no confirmation that the baselines received equivalent tool access and memory. Without this, the performance deltas cannot be confidently attributed to the STING framework itself.

minor comments (2)

[Abstract] The abstract could include the exact number of AgentHarm scenarios and languages tested to give readers immediate context for the scale of the evaluation.
[Figures] Discovery curves and hazard plots should include confidence bands or error bars so that the reported 'substantial' differences can be visually assessed for statistical separation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of validation and experimental clarity that we will address in revision. Below we respond point by point to the major comments.


Referee: [§3 and §4] §3 (STING Framework description) and §4 (Experimental Setup): The headline results on illicit-task completion rates and the Restricted Mean Jailbreak Discovery metric are computed by feeding target-agent trajectories to separate judge agents that decide phase completion. The manuscript provides no human validation, inter-judge agreement statistics, or error analysis for these judges, particularly under adaptive follow-ups or in non-English languages. This is load-bearing for the central claim of substantially higher completion versus baselines.

Authors: We agree that the absence of human validation and agreement statistics for the judge agents is a limitation that affects confidence in the headline results. The current implementation follows common LLM-as-judge practices but does not include the requested checks. In the revised manuscript we will add a human validation study on a stratified sample of trajectories (covering both English and non-English cases as well as adaptive follow-ups), report inter-judge agreement metrics such as Cohen’s kappa, and include a concise error analysis. These additions will be placed in §4.

Revision: yes

Referee: [§5] §5 (Results, baseline comparisons): The abstract and results section state that STING outperforms 'chat-oriented multi-turn baselines adapted to tool-using agents,' yet the manuscript supplies no detailed description of the adaptation procedure, no ablation of the adaptations, and no confirmation that the baselines received equivalent tool access and memory. Without this, the performance deltas cannot be confidently attributed to the STING framework itself.

Authors: We acknowledge that the description of how the chat-oriented baselines were adapted for tool-using agents is insufficiently detailed. The adaptations consisted of adding tool-calling interfaces and preserving full conversation history to match the agent memory setup, but these steps were not fully documented or ablated. In the revision we will expand §5 with (i) a precise description of the adaptation procedure, (ii) an ablation isolating the contribution of each adaptation, and (iii) explicit confirmation that all baselines received identical tool access and memory mechanisms. This will allow readers to attribute performance differences more confidently to the STING framework.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical framework and metric introduced without reduction to inputs or self-citation chains


The paper constructs STING as a new multi-turn red-teaming procedure and models completion via a time-to-first-jailbreak random variable to define the Restricted Mean Jailbreak Discovery metric. These are presented as methodological innovations rather than derived quantities. Central results consist of direct empirical comparisons of illicit-task completion rates against single-turn and adapted baselines across AgentHarm scenarios and languages. No equations or load-bearing steps reduce a claimed prediction or result to fitted parameters or prior self-citations by construction. Judge-agent phase tracking is a design choice whose accuracy is a validity question outside the circularity criteria.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the reliability of judge agents for phase tracking and on the realism of the constructed illicit plans; these are introduced by the paper rather than drawn from prior independent evidence.

axioms (1)

Domain assumption: Judge agents can accurately track completion of illicit plan phases without systematic bias or error.
The framework uses these judges to determine success; this assumption is invoked when reporting phase completion rates.

invented entities (1)

STING framework (no independent evidence)
Purpose: Constructs step-by-step illicit plans grounded in benign personas and iteratively probes agents with adaptive follow-ups.
Newly introduced automated red-teaming system; no independent evidence outside the paper is provided.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
Paper passage: We introduce STING (Sequential Testing of Illicit N-step Goal execution), an automated red-teaming framework that constructs a step-by-step illicit plan ... using judge agents to track phase completion.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
Paper passage: We formalize multi-turn red-teaming as a time-to-first-jailbreak random variable, enabling ... Kaplan–Meier discovery curves ... Restricted Mean Jailbreak Discovery (RMJD).

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.