9 From Reward-Hack Activations to Agentic Risk States Table 5. Feature groups used for next-step risk prediction

URL https: //arxiv · 2026 · arXiv 2603.02798

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

AUDITFLOW: Executable Symbolic Environments for Structured Financial Reporting Verification

cs.AI · 2026-06-02 · unverdicted · novelty 7.0

AuditFlow combines a graph-grounded symbolic environment with a multi-agent LLM setup to reach 82.09% joint audit accuracy on structured financial reports, 14.93 points above the strongest baseline.

From Reward-Hack Activations to Agentic Risk States: Context-Calibrated Mechanistic Monitoring in LLM Agents

cs.AI · 2026-06-04 · unverdicted · novelty 3.0

Reward-hack activations flag latent policy states in LLM agents but require added entropy and context features to better predict when those states lead to exploit actions.

citing papers explorer

Showing 2 of 2 citing papers.

AUDITFLOW: Executable Symbolic Environments for Structured Financial Reporting Verification cs.AI · 2026-06-02 · unverdicted · none · ref 35
From Reward-Hack Activations to Agentic Risk States: Context-Calibrated Mechanistic Monitoring in LLM Agents cs.AI · 2026-06-04 · unverdicted · none · ref 13
