DRAFT: Task Decoupled Latent Reasoning for Agent Safety
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-16 02:18 UTC · model grok-4.3
The pith
DRAFT decouples safety judgment into an Extractor that compresses agent trajectories into a compact latent draft and a Reasoner that attends to both the draft and the original sequence.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DRAFT establishes that safety judgment over long, noisy agent trajectories becomes more accurate when evidence is first aggregated into a trainable continuous latent draft and then jointly reasoned over with the original sequence, rather than passed through lossy explicit summarization.
What carries the argument
The two-stage DRAFT architecture: an Extractor that distills each trajectory into a compact continuous latent draft, followed by a Reasoner that jointly attends to the draft and the original trajectory to predict safety.
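To make the decoupling concrete, here is a minimal sketch of the two-stage layout in PyTorch. Everything beyond the Extractor-then-Reasoner split and the joint attention over draft plus trajectory is an assumption for illustration: the module names, draft length, learnable cross-attention queries, and readout position are not specified by the text above.

```python
import torch
import torch.nn as nn

class Extractor(nn.Module):
    """Compresses a trajectory of T embeddings into a short latent draft."""
    def __init__(self, d_model=768, draft_len=16, n_heads=8):
        super().__init__()
        # Learnable draft slots cross-attend to the full trajectory (assumed design).
        self.draft_queries = nn.Parameter(torch.randn(draft_len, d_model))
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, trajectory):                      # trajectory: (B, T, d_model)
        q = self.draft_queries.unsqueeze(0).expand(trajectory.size(0), -1, -1)
        draft, _ = self.cross_attn(q, trajectory, trajectory)
        return draft                                    # (B, draft_len, d_model)

class Reasoner(nn.Module):
    """Jointly attends to the draft and the original trajectory, then reads out safety."""
    def __init__(self, d_model=768, n_heads=8, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.readout = nn.Linear(d_model, 1)

    def forward(self, draft, trajectory):
        joint = torch.cat([draft, trajectory], dim=1)   # concatenate [draft; trajectory]
        h = self.encoder(joint)
        return self.readout(h[:, 0]).squeeze(-1)        # one safety logit per example
```

Because the draft is a continuous tensor rather than discrete text, gradients from the safety loss flow through the Reasoner into the Extractor, which is the end-to-end differentiability the abstract claims.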
If this is right
- Safety classifiers can be trained end-to-end on full trajectories without intermediate human-readable summaries.
- Representations from the joint Extractor-Reasoner become more separable than those from single-stage baselines (a minimal probe for this is sketched after this list).
- Ablations show that neither module alone reaches the reported accuracy on sparse-evidence benchmarks.
- The approach scales directly to long-context safety tasks such as ASSEBench and R-Judge.
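The separability prediction above is directly checkable given pooled representations from each model. A minimal probe, assuming hypothetical (N, d) arrays `reps_draft` and `reps_baseline` with binary `labels` (none of these names come from the paper):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import silhouette_score
from sklearn.model_selection import cross_val_score

def separability(reps: np.ndarray, labels: np.ndarray):
    """Two cheap views of class separability in a representation space."""
    sil = silhouette_score(reps, labels)                      # geometric separation
    probe = LogisticRegression(max_iter=1000)
    acc = cross_val_score(probe, reps, labels, cv=5).mean()   # linear decodability
    return sil, acc

# Hypothetical usage, with reps_* pooled from each trained model:
# print(separability(reps_draft, labels), separability(reps_baseline, labels))
```

If the joint Extractor-Reasoner representations are genuinely more separable, both the silhouette score and the linear-probe accuracy should favor them.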
Where Pith is reading between the lines
- The same latent-draft decoupling could be tested on other rare-event detection tasks in long sequences beyond safety.
- Integration with existing agent planners might enable real-time safety filtering during execution.
- Further experiments could check whether the pattern improves non-safety metrics such as task completion under uncertainty.
Load-bearing premise
Evidence aggregation performed in continuous latent space avoids lossy explicit summarize-then-judge pipelines and enables effective end-to-end training for sparse risk evidence in long trajectories.
What would settle it
Train an otherwise identical model that first writes an explicit natural-language summary of each trajectory and then classifies safety from that summary alone; if its accuracy and representation separability match or exceed DRAFT's on the same benchmarks, the advantage of latent-space aggregation is refuted.
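As a protocol, this settling experiment is small. A sketch, with `summarizer`, `classifier`, and the benchmark format all placeholders rather than components from the paper:

```python
def summarize_then_judge(trajectory, summarizer, classifier):
    """Explicit baseline: judge safety from a natural-language summary alone."""
    summary = summarizer(trajectory)   # discrete, human-readable bottleneck
    return classifier(summary)         # no access to the raw trajectory

def accuracy(predict, benchmark):
    """benchmark: a list of (trajectory, label) pairs."""
    return sum(predict(x) == y for x, y in benchmark) / len(benchmark)

# If the explicit baseline matches the DRAFT model here, the latent-space
# aggregation premise is refuted on that benchmark.
```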
Original abstract
The advent of tool-using LLM agents shifts safety monitoring from output moderation to auditing long, noisy interaction trajectories, where risk-critical evidence is sparse, making standard binary supervision poorly suited for credit assignment. To address this, we propose DRAFT (Task Decoupled Latent Reasoning for Agent Safety), a latent reasoning framework that decouples safety judgment into two trainable stages: an Extractor that distills the full trajectory into a compact continuous latent draft, and a Reasoner that jointly attends to the draft and the original trajectory to predict safety. DRAFT avoids lossy explicit summarize-then-judge pipelines by performing evidence aggregation in latent space, enabling end-to-end differentiable training. Across benchmarks including ASSEBench and R-Judge, DRAFT consistently outperforms strong baselines, improving accuracy from 63.27% (LoRA) to 91.18% averaged over benchmarks, and learns more separable representations. Ablations demonstrate a clear synergy between the Extractor and the Reasoner. Overall, DRAFT suggests that continuous latent reasoning prior to readout is a practical path to robust agent safety under long-context supervision with sparse evidence.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes DRAFT, a task-decoupled latent reasoning framework for agent safety monitoring. It decouples safety judgment into an Extractor that distills long noisy trajectories into a compact continuous latent draft and a Reasoner that jointly attends to the draft and original trajectory to predict safety. The approach claims to enable end-to-end differentiable training by performing evidence aggregation in latent space rather than via explicit summarize-then-judge pipelines. Across benchmarks including ASSEBench and R-Judge, DRAFT is reported to outperform strong baselines such as LoRA, raising average accuracy from 63.27% to 91.18%, while producing more separable representations; ablations are said to confirm synergy between the two stages.
Significance. If the empirical claims hold under full experimental scrutiny, the work would address a practically important credit-assignment problem in sparse-risk, long-context agent trajectories. Performing aggregation in continuous latent space could reduce information loss relative to discrete summarization pipelines and support more robust end-to-end training for safety classifiers.
Major comments (2)
- [Abstract] The central performance claim (accuracy rising from 63.27% for LoRA to 91.18% averaged over benchmarks) is presented without any description of experimental protocol, baseline implementations, number of runs, statistical tests, or error analysis. This absence prevents verification of the reported gains and of the assertion that the latent draft retains sparse risk evidence effectively.
- [Abstract] The claim of 'end-to-end differentiable training' is load-bearing for the method's solution to credit assignment, yet the text supplies no information on the training objective or auxiliary supervision (reconstruction, contrastive, or localization) provided to the Extractor. If the Extractor receives gradients only from the final binary safety loss, the credit-assignment problem the framework is designed to solve may remain unresolved.
Minor comments (1)
- [Abstract] The benchmarks ASSEBench and R-Judge are referenced without citations or brief characterizations of their trajectory lengths, risk sparsity, or label distributions.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the presentation of our experimental claims and training procedure. We will revise the manuscript to incorporate the requested details while preserving the core contributions of DRAFT.
Point-by-point responses
- Referee: [Abstract] The central performance claim (accuracy rising from 63.27% for LoRA to 91.18% averaged over benchmarks) is presented without any description of experimental protocol, baseline implementations, number of runs, statistical tests, or error analysis. This absence prevents verification of the reported gains and of the assertion that the latent draft retains sparse risk evidence effectively.
  Authors: We agree that the abstract should be expanded for self-containment. In the revision we will add a concise description of the protocol: evaluation on ASSEBench and R-Judge, a LoRA baseline trained with identical data and hyperparameters, results averaged over 5 random seeds with standard deviations, and paired t-test significance (p < 0.01). A brief error-analysis paragraph will also be added to the results section, showing that the latent draft preserves sparse risk tokens more reliably than explicit summarization baselines. revision: yes
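A minimal version of the significance check the authors describe, with per-seed accuracies as hypothetical placeholders (the five-seed paired protocol is theirs; the numbers below are illustrative only):

```python
import numpy as np
from scipy import stats

# Hypothetical per-seed accuracies; real values would come from the 5-seed runs.
draft_acc = np.array([0.910, 0.915, 0.908, 0.913, 0.912])
lora_acc = np.array([0.630, 0.641, 0.628, 0.635, 0.629])

t, p = stats.ttest_rel(draft_acc, lora_acc)   # paired t-test across seeds
print(f"mean gain = {(draft_acc - lora_acc).mean():.3f}, t = {t:.2f}, p = {p:.4g}")
```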
- Referee: [Abstract] The claim of 'end-to-end differentiable training' is load-bearing for the method's solution to credit assignment, yet the text supplies no information on the training objective or auxiliary supervision (reconstruction, contrastive, or localization) provided to the Extractor. If the Extractor receives gradients only from the final binary safety loss, the credit-assignment problem the framework is designed to solve may remain unresolved.
  Authors: We acknowledge that the current draft omits an explicit description of the Extractor objective. The revision will clarify that the Extractor is trained jointly with the Reasoner under the final binary cross-entropy safety loss plus an auxiliary reconstruction loss on the latent draft (Equation 3). This auxiliary term supplies a direct gradient signal to the Extractor, ensuring it retains sparse risk evidence rather than relying solely on the downstream loss. The entire pipeline remains end-to-end differentiable because the draft is a continuous latent vector. revision: yes
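A sketch of the objective as the rebuttal describes it: final binary cross-entropy plus an auxiliary reconstruction term on the latent draft. The decoder, the weight `lambda_rec`, and the use of mean-squared error are assumptions; Equation 3 itself is not reproduced in the text above.

```python
import torch
import torch.nn.functional as F

def draft_loss(safety_logit, label, draft, trajectory, decoder, lambda_rec=0.1):
    """BCE safety loss plus an auxiliary reconstruction loss (hypothetical form of Eq. 3).
    The reconstruction term feeds gradient directly to the Extractor."""
    l_safety = F.binary_cross_entropy_with_logits(safety_logit, label.float())
    recon = decoder(draft)                    # map the draft back toward (B, T, d)
    l_recon = F.mse_loss(recon, trajectory)
    return l_safety + lambda_rec * l_recon
```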
Circularity Check
No circularity: empirical claims independent of derivations
Full rationale
The manuscript describes DRAFT as a two-stage framework (Extractor producing continuous latent draft, Reasoner attending to draft plus trajectory) trained end-to-end on safety prediction. No equations, derivations, or parameter-fitting steps are shown that reduce any claimed prediction to its own inputs by construction. No self-citations appear in the provided text, and the performance numbers (e.g., accuracy lift from 63.27% to 91.18%) are benchmark comparisons rather than fitted quantities renamed as predictions. The end-to-end differentiability assertion is a design claim, not a mathematical reduction. This is a standard empirical ML contribution whose central results do not collapse to self-definition or self-citation chains.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Matched passage: S = φ_ΔB(X), S ∈ ℝ^{Ls×d}; Y = [P; S]; p(y|X) ≈ p(y|S, X) ≈ p(y|S).
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat embedding and orbit separation
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Matched passage: the excess risk decomposes as Extraction Error + Readout Error.
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Alemi, A. A., Fischer, I., Dillon, J. V., and Murphy, K. Deep variational information bottleneck. arXiv preprint arXiv:1612.00410, 2016.
- [2] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
- [3] Chen, Z.-Y., Shen, S., Shen, G., Zhi, G., Chen, X., and Lin, Y. Towards tool use alignment of large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 1382–1400, 2024.
- [4] Ghosh, S., Varshney, P., Galinkin, E., and Parisien, C. Aegis: Online adaptive AI content safety moderation with ensemble of LLM experts. arXiv preprint arXiv:2404.05993, 2024.
- [5] Hao, S., Sukhbaatar, S., Su, D., Li, X., Hu, Z., Weston, J., and Tian, Y. Training large language models to reason in a continuous latent space. arXiv preprint arXiv:2412.06769, 2024.
- [6] Hartvigsen, T., Gabriel, S., Palangi, H., Sap, M., Ray, D., and Kamar, E. ToxiGen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. arXiv preprint arXiv:2203.09509, 2022.
- [7] Huang, Y., Hua, H., Zhou, Y., Jing, P., Nagireddy, M., Padhi, I., Dolcetti, G., Xu, Z., Chaudhury, S., Rawat, A., et al. Building a foundational guardrail for general agentic systems via synthetic data. arXiv preprint arXiv:2510.09781, 2025.
- [8] Inan, H., et al. Llama Guard: LLM-based input-output safeguard for human-AI conversations. arXiv preprint arXiv:2312.06674, 2023. Lian, L., Wang, S., Juefei-Xu, F., Fu, T.-J., Li, X., Yala, A., Darrell, T., Suhr, A., Tian, Y., and Lin, X. V. Threadweaver: Adaptive threading for efficient parallel reasoning in language models. arXiv preprint arXiv:2512.07843, 2025.
- [9] Luo, H., Dai, S., Ni, C., Li, X., Zhang, G., Wang, K., Liu, T., and Salam, H. AgentAuditor: Human-level safety and security evaluation for LLM agents. arXiv preprint arXiv:2506.00641, 2025.
- [10] Nakano, R., Hilton, J., Balaji, S., Wu, J., Ouyang, L., Kim, C., Hesse, C., Jain, S., Kosaraju, V., Saunders, W., et al. WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021.
- [11] Perez, F. and Ribeiro, I. Ignore previous prompt: Attack techniques for language models. arXiv preprint arXiv:2211.09527, 2022.
- [12] Ruan, Y., Dong, H., Wang, A., Pitis, S., Zhou, Y., Ba, J., Dubois, Y., Maddison, C. J., and Hashimoto, T. Identifying the risks of LM agents with an LM-emulated sandbox. arXiv preprint arXiv:2309.15817, 2023.
- [13] Tian, Y., Yang, X., Zhang, J., Dong, Y., and Su, H. Evil geniuses: Delving into the safety of LLM-based agents. arXiv preprint arXiv:2311.11855, 2023.
- [14] Tishby, N., Pereira, F. C., and Bialek, W. The information bottleneck method. arXiv preprint physics/0004057, 2000.
- [15] Wang, G., Xie, Y., Jiang, Y., Mandlekar, A., Xiao, C., Zhu, Y., Fan, L., and Anandkumar, A. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023.
- [16] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., and Zhou, D. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.
- [17] Wang, X., Li, B., Song, Y., Xu, F. F., Tang, X., Zhuge, M., Pan, J., Song, Y., Li, B., Singh, J., et al. OpenHands: An open platform for AI software developers as generalist agents. arXiv preprint arXiv:2407.16741, 2024.
- [18] Xie, Y., Yuan, Y., Wang, W., Mo, F., Guo, J., and He, P. ToolSafety: A comprehensive dataset for enhancing safety in LLM-based agent tool invocations. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 14146–14167, 2025.
- [19] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K. R., and Cao, Y. ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2023.
- [20] Yuan, T., He, Z., Dong, L., Wang, Y., Zhao, R., Xia, T., Xu, L., Zhou, B., Li, F., Zhang, Z., et al. R-Judge: Benchmarking safety risk awareness for LLM agents. arXiv preprint arXiv:2401.10019, 2024.
- [21] Zelikman, E., Harik, G., Shao, Y., Jayasiri, V., Haber, N., and Goodman, N. D. Quiet-STaR: Language models can teach themselves to think before speaking. arXiv preprint arXiv:2403.09629, 2024.
- [22] Zhan, Q., Fang, R., Panchal, H. S., and Kang, D. Adaptive attacks break defenses against indirect prompt injection attacks on LLM agents. arXiv preprint arXiv:2503.00061, 2025.
- [23] Zhang, D., Zhoubian, S., Hu, Z., Yue, Y., Dong, Y., and Tang, J. ReST-MCTS*: LLM self-training via process reward guided tree search. In Advances in Neural Information Processing Systems, pp. 64735–64772, 2024a. Zhang, D., Zhoubian, S., Cai, M., Li, F., Yang, L., Wang, W., Dong, T., Hu, Z., Tang, J., and Yue, Y. DataSciBench: An LLM agent benchmark for ...
- [24] Zhang, H., Huang, J., Mei, K., Yao, Y., Wang, Z., Zhan, C., Wang, H., and Zhang, Y. Agent Security Bench (ASB): Formalizing and benchmarking attacks and defenses in LLM-based agents. arXiv preprint arXiv:2410.02644, 2024b. Zhang, Z., Cui, S., Lu, Y., Zhou, J., Yang, J., Wang, H., and Huang, M...
- [25] Zhu, H., Hao, S., Hu, Z., Jiao, J., Russell, S., and Tian, Y. Reasoning by superposition: A theoretical perspective on chain of continuous thought. arXiv preprint arXiv:2505.12514, 2025.
- [26] IB intuition (excerpt; via an Information Bottleneck perspective, Tishby et al., 2000; Alemi et al., 2016). Let S = φ_ΔB(X) be an intermediate latent variable used for prediction. A compact draft should (i) preserve information about the label y while (ii) discarding irrelevant details in X. This can be expressed by the IB objective: max_φ I(S; y) − β I(S; X) (Eq. 31). ...
- [27] R-Judge: a curated benchmark for evaluating risk awareness in tool-using agents by judging whether an interaction record is safe or unsafe. It comprises 569 annotated multi-turn interaction cases across 5 application categories and 27 scenarios, with 10 risk types. The dataset is approximately balanced (about half unsafe) and has moderate trajectory length (on average ∼...
- [28] ASSEBench: a benchmark for evaluating whether LLM-based evaluators can detect both safety risks and security threats in agent interaction trajectories. It consists of 2,293 meticulously annotated interaction records, covering 15 risk types across 29 application scenarios. A distinctive feature is its ambiguity-aware labeling, including Strict and Lenient judgment standards ...