pith. machine review for the scientific record.

cs.CR

Cryptography and Security

Covers all areas of cryptography and security including authentication, public key cryptosystems, proof-carrying code, etc. Roughly includes material in ACM Subject Classes D.4.6 and E.3.

cs.CR 2026-05-13 2 theorems

Watermark detects AI text in mixed documents and distilled models

TextSeal: A Localized LLM Watermark for Provenance & Distillation Protection

Dual-key sampling and localized scoring keep quality intact while flagging origins even after heavy blending or reuse.

We introduce TextSeal, a state-of-the-art watermark for large language models. Building on Gumbel-max sampling, TextSeal introduces dual-key generation to restore output diversity, along with entropy-weighted scoring and multi-region localization for improved detection. It supports serving optimizations such as speculative decoding and multi-token prediction, and adds no inference overhead. TextSeal strictly dominates baselines like SynthID-text in detection strength and is robust to dilution, maintaining confident localized detection even in heavily mixed human/AI documents. The scheme is theoretically distortion-free, and evaluation across reasoning benchmarks confirms that it preserves downstream performance, while a multilingual human evaluation (6,000 A/B comparisons, 5 languages) shows no perceptible quality difference. Beyond its use for provenance detection, TextSeal is also "radioactive": its watermark signal transfers through model distillation, enabling detection of unauthorized use.
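As a rough illustration of the Gumbel-max sampling that TextSeal builds on, here is a minimal single-key sketch; the paper's dual-key generation, entropy-weighted scoring, and multi-region localization are not reproduced, and all function names are assumptions rather than the paper's code.

import hashlib
import numpy as np

def keyed_uniforms(key: bytes, context: tuple, vocab_size: int) -> np.ndarray:
    # Pseudorandom U(0,1) value per vocabulary id, derived from the watermark key and recent tokens.
    digest = hashlib.sha256(key + repr(context).encode()).digest()
    rng = np.random.default_rng(int.from_bytes(digest[:8], "big"))
    return rng.random(vocab_size)

def gumbel_max_sample(probs: np.ndarray, key: bytes, context: tuple) -> int:
    # Distortion-free choice: argmax_i u_i**(1/p_i) picks token i with probability p_i.
    u = keyed_uniforms(key, context, len(probs))
    scores = np.log(u) / np.maximum(probs, 1e-12)   # same argmax as u**(1/p)
    return int(np.argmax(scores))

def detection_score(tokens: list, key: bytes, vocab_size: int, window: int = 4) -> float:
    # Watermarked text gives unusually large -log(1 - u) at the emitted token ids;
    # compare the total against its null expectation (about one per scored position).
    total = 0.0
    for t in range(window, len(tokens)):
        u = keyed_uniforms(key, tuple(tokens[t - window:t]), vocab_size)
        total += -float(np.log(1.0 - u[tokens[t]] + 1e-12))
    return total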
cs.CR 2026-05-13 2 theorems

Compromised AI provider enables data theft and policy bypass

Attacks and Mitigations for Distributed Governance of Agentic AI under Byzantine Adversaries

Monitoring and resilient upgrades protect agentic AI governance against most insider attacks while limiting slowdown.

Agentic AI governance is a critical component of agentic AI infrastructure, ensuring that agents follow their owner's communication and interaction policies and providing protection against attacks from malicious agents. The state-of-the-art solution, SAGA, assumes a logically centralized point of trust, the Provider, which serves as a repository for user and agent information and actively enforces policies. While SAGA provides protection against malicious agents, it remains vulnerable to a malicious Provider that deviates from the protocol, undermining the security of the identity and access control infrastructure. Deployment on both private and public clouds, each susceptible to insider threats, further increases the risk of Provider compromise. In this work, we analyze the attacks that can be mounted from a compromised Provider, taking into account the different system components and realistic deployments. We identify and execute several concrete attacks with devastating effects: undermining agent attributability, extracting private data, or bypassing access control. We then present three types of solutions for securing the Provider that offer different trade-offs between security and performance. We first present SAGA-BFT, a fully Byzantine-resilient architecture that provides the strongest protection but incurs significant performance degradation due to the high cost of Byzantine-resilient protocols. We then propose SAGA-MON and SAGA-AUD, two novel solutions that leverage lightweight server-side monitoring or client-side auditing to provide protection against most classes of attacks with minimal overhead. Finally, we propose SAGA-HYB, a hybrid architecture that combines Byzantine resilience with monitoring and auditing to trade off security against performance. We evaluate all the architectures, compare them with SAGA, and discuss which solution is best under what conditions.
cs.CR 2026-05-13 Recognition

New decoder recovers personal data from finetuned models

Reconstruction of Personally Identifiable Information from Supervised Finetuned Models

COVA outperforms prior methods when attackers hold partial knowledge of training examples in medical and legal domains.

Supervised Finetuning (SFT) has become one of the primary methods for adapting a large language model (LLM) with extensive pre-trained knowledge to domain-specific, instruction-following tasks. SFT datasets, composed of instruction-response pairs, often include user-provided information that may contain sensitive data such as personally identifiable information (PII), raising privacy concerns. This paper studies the problem of PII reconstruction from SFT models for the first time. We construct multi-turn, user-centric Q&A datasets in sensitive domains, specifically medical and legal settings, that incorporate PII to enable realistic evaluation of leakage. Using these datasets, we evaluate the extent to which an adversary, with varying levels of knowledge about the fine-tuning dataset, can infer sensitive information about individuals whose data was used during SFT. In the reconstruction setting, we propose COVA, a novel decoding algorithm to reconstruct PII under prefix-based attacks, consistently outperforming existing extraction methods. Our results show that even partial attacker knowledge can significantly improve reconstruction success, while leakage varies substantially across PII types.
cs.CR 2026-05-13 2 theorems

Phase oscillators reach 100% consensus up to 40% faults

ORCHID: Orchestrated Reduction Consensus for Hash-based Integrity in Distributed Ledgers

Binding threshold on Kuramoto order parameter yields linear message cost and full agreement in simulations of up to 150 nodes.

We present ORCHID (Orchestrated Reduction Consensus for Hash-based Integrity in Distributed Ledgers), a novel bio-inspired consensus protocol that maps the neuroscientific binding problem -- how the brain integrates distributed neural oscillations into a unified conscious percept -- onto the distributed-systems consensus problem: how blockchain nodes agree on a single ledger state under Byzantine faults. Grounded in the Penrose-Hameroff Orchestrated Objective Reduction (Orch OR) hypothesis and the Kuramoto synchronisation model, ORCHID equips each node with a quantum-noisy phase oscillator; consensus is triggered when the network's order parameter r(t) crosses a binding threshold θ_b, mirroring the gamma-band binding event in conscious perception. ORCHID is further strengthened by a coherence-weighted Quantum Secret Sharing (QSS) layer, extending the survey framework of Weinberg to a concrete consensus application. Simulation results on Watts-Strogatz small-world networks (n = 10-150) demonstrate: (i) the Kuramoto order parameter reaches r_max = 0.988 under coupling K = 3.0, well above the theoretical critical coupling K_c ≈ 1.41; (ii) a sharp QSS fidelity phase transition at coherence c* ≈ 0.82, confirming Theorem 2; (iii) 100% consensus rate at all tested Byzantine fractions (0%-40%), with median convergence under 4 s for n = 30; and (iv) ORCHID achieves O(n·k) message complexity, outperforming PBFT's O(n^2) at n ≥ 150. These results establish ORCHID as a scalable, biologically plausible, and quantum-augmented consensus mechanism for post-quantum distributed ledgers.
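To make the consensus trigger concrete, here is a toy mean-field sketch of the mechanism described above: noisy Kuramoto oscillators evolve until the order parameter r(t) crosses a binding threshold θ_b. It is illustrative only; the paper uses Watts-Strogatz topologies, a QSS layer, and its own parameter values, none of which are modelled here.

import numpy as np

def simulate_binding(n=30, K=3.0, theta_b=0.9, dt=0.01, steps=5000, noise=0.05, seed=0):
    rng = np.random.default_rng(seed)
    omega = rng.normal(0.0, 0.3, n)            # natural frequencies (illustrative spread)
    phase = rng.uniform(0.0, 2 * np.pi, n)     # initial oscillator phases
    for step in range(steps):
        # mean-field Kuramoto update plus additive phase noise ("quantum-noisy" oscillator)
        coupling = (K / n) * np.sin(phase[None, :] - phase[:, None]).sum(axis=1)
        phase += dt * (omega + coupling) + noise * np.sqrt(dt) * rng.normal(size=n)
        r = abs(np.mean(np.exp(1j * phase)))   # order parameter r(t) in [0, 1]
        if r >= theta_b:
            return step * dt, r                # binding event: trigger the consensus round
    return None, r                             # threshold never crossed within the horizon

print(simulate_binding())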
cs.CR 2026-05-13 2 theorems

New language lets cyber ranges share and automate complex defense exercises

ACTING: A Platform for Cyber Ranges Federation

ACTING platform's EDL-FG describes infrastructure and scenario logic so deployment and scoring run automatically across federated ranges.

Cyber Defence (CD) training requires interoperable cyber-range environments capable of supporting complex, multidomain exercises across distributed infrastructures. This paper presents three main contributions addressing this challenge. First, we introduce the Exercise Description Language - First Generation (EDL-FG), a structured language for formally describing cyber-range training services and exercises. EDL-FG captures both the technical infrastructure required to emulate ICT/OT environments and the scenario logic governing cyber events, injects, and participant interactions, enabling interoperable and automated scenario deployment across federated Cyber Ranges (CRs). Second, the ACTING platform introduces automated PE and scoring mechanisms that assess trainee actions during exercises through coordinated data collection and analysis across participating CRs. Third, the platform enables multi-domain cyber training scenarios that combine civilian and military operational contexts. Building upon federation capabilities established under the H2020 ECHO project, ACTING demonstrates how interoperable scenario description and automated evaluation support scalable and realistic CD training.
cs.CR 2026-05-13 2 theorems

Persona details raise LLM privacy match rate to 40.4 percent

PrivacySIM: Evaluating LLM Simulation of User Privacy Behavior

Conditioning on demographics, experience and attitudes improves results but still leaves most individual decisions unmatched across five user studies of 1,000 users.

Large language models (LLMs) are increasingly used to simulate human behavior, but their ability to simulate individual privacy decisions is not well understood. In this paper, we address the problem of evaluating whether a core set of user persona attributes can drive LLMs to simulate individual-level privacy behavior. We introduce PrivacySIM, an evaluation suite that benchmarks LLM simulation of user privacy behavior against the ground-truth responses of 1,000 users. These users are drawn from five published user studies on privacy spanning LLM healthcare consultations, conversational agents, and chatbots. Drawing on these user studies, we hypothesize three persona facets as plausible predictors of privacy decision-making: demographics, previous experiences, and stated privacy attitudes. We condition nine frontier LLMs on subsets of these three facets and measure how often each model's response to a data-sharing scenario matches the user's actual response. Our findings show that (1) privacy persona conditioning consistently improves simulation quality over no-persona conditioning, but even the strongest model (40.4% accuracy) remains far from faithfully simulating individual privacy decisions. (2) A user's stated privacy attitudes alone may not be the best predictor because they often diverge from the user's actual privacy behavior. (3) Users with high AI/chatbot experience but low stated privacy attitudes are the most challenging to simulate. PrivacySIM is a first step toward understanding and improving the capabilities of LLMs to simulate user privacy decisions. We release PrivacySIM to enable further evaluation of LLM privacy simulation.
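The benchmark protocol can be pictured as a simple loop, sketched below; the record layout, facet names, and simulate() callable are placeholders for whatever persona schema and model interface are actually used.

def match_rate(users: list, facets: tuple, simulate) -> float:
    # Fraction of users whose simulated data-sharing decision matches their recorded decision.
    hits = 0
    for user in users:
        persona = {k: user["persona"][k] for k in facets}   # e.g. ("demographics", "attitudes")
        predicted = simulate(persona, user["scenario"])      # the model's simulated decision
        hits += int(predicted == user["ground_truth_response"])
    return hits / len(users)

# Hypothetical usage: compare facet subsets for one model.
# for facets in [(), ("demographics",), ("demographics", "experience", "attitudes")]:
#     print(facets, match_rate(study_users, facets, my_model_simulate))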
cs.CR 2026-05-13 2 theorems

Deepfake detectors prepared for the wrong threat

The Deepfakes We Missed: We Built Detectors for a Threat That Didn't Arrive

Research stayed focused on public-figure video swaps while scams and intimate imagery became the real problems

Nearly a decade of Machine Learning (ML) research on deepfake detection has been organized around a threat model inherited from 2017--2019, revolving around face-swap and talking-head manipulation of public figures, motivated by concerns about large-scale misinformation and video-evidence fraud. This position paper argues that the threat the field prepared for did not arrive, and the threats that did arrive are substantially different. An accounting of deepfake incidents in 2022--2026 shows that the dominant observed harms are peer-generated Non-Consensual Intimate Imagery (NCII), voice-clone scam calls targeting families and finance workers, and emotional-manipulation fraud. The predicted large-scale public-figure deepfake catastrophe did not materialize during the 2024 global information environment despite extensive preparation. Meanwhile, research effort, benchmarks, and detection methods remain concentrated on the inherited threat model. The central claim of this paper is that this misalignment is now the dominant bottleneck on real-world deepfake defense, not model capability. We argue the ML research community should substantially rebalance its research agenda toward the harm categories that are actually growing. We support this position with empirical accounting of research effort and harm distribution, identify the structural reasons the misalignment persists, and outline three concrete technical research agendas for the under-defended harm categories.
cs.CR 2026-05-13 1 theorem

Benchmark finds skills expose agents to unsafe attacks

SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces

Even benign requests lead to harmful actions when agents use modular skills, with experiments revealing consistent failures across setups.

Reusable skills are becoming a common interface for extending large language model agents, packaging procedural guidance with access to files, tools, memory, and execution environments. However, this modularity introduces attack surfaces that are largely missed by existing safety evaluations: even when the user request is benign, task-relevant skill materials or local artifacts can steer an agent toward unsafe actions. We present SkillSafetyBench, a runnable benchmark for evaluating such skill-mediated safety failures. SkillSafetyBench includes 155 adversarial cases across 47 tasks, 6 risk domains, and 30 safety categories, each evaluated with a case-specific rule-based verifier. Experiments with multiple CLI agents and model backends show that localized non-user attacks can consistently induce unsafe behavior, with distinct failure patterns across domains, attack methods, and scaffold-model pairings. Our findings suggest that agent safety depends not only on model-level alignment, but also on how agents interpret skills, trust workflow context, and act through executable environments.
cs.CR 2026-05-13 2 theorems

Microservices platform spots data leaks and hate speech via NLP

A microservices-based endpoint monitoring platform with predictive NLP models for real-time security and hate-speech risk alerting

Endpoint telemetry and predictive text models are linked to centralize alerts and speed responses to security and compliance issues.

Organizations increasingly depend on endpoint devices and corporate communication channels, yet they still face critical risks such as sensitive data leakage, suspicious user behavior, and the circulation of hateful or harmful language in workplace contexts. Current solutions frequently address these issues in isolation (e.g., productivity tracking, data loss prevention, or hate-speech detection), limiting correlation across signals and delaying incident response. This work proposes a unified, microservices-based platform that collects endpoint telemetry and applies predictive natural language processing models to support real-time security and compliance alerting. The architecture is modular and scalable, relying on RabbitMQ for event ingestion and routing and Redis for low-latency data access and alert delivery. For text classification, transformer-based models such as BERT are evaluated for hate-speech risk detection, achieving an average accuracy of 87%. Experimental results indicate that the proposed platform can promptly surface indicators of data exfiltration and policy violations while centralizing alert management, providing an integrated framework that combines monitoring, security analytics, and predictive capabilities.
cs.CR 2026-05-13 2 theorems

Earphones can authenticate you using your heartbeat

AccLock: Unlocking Identity with Heartbeat Using In-Ear Accelerometers

Passive system extracts unique signals from in-ear accelerometers to achieve low false rates without user effort

The widespread use of earphones has enabled various sensing applications, including activity recognition, health monitoring, and context-aware computing. Among these, earphone-based user authentication has become a key technique by leveraging unique biometric features. However, existing earphone-based authentication systems face key limitations: they either require explicit user interaction or active speaker output, or suffer from poor accessibility and vulnerability to environmental noise, which hinders large-scale deployment. In this paper, we propose a passive authentication system, called AccLock, which leverages distinctive features extracted from in-ear BCG signals to enable secure and unobtrusive user verification. Our system offers several advantages over previous systems, including zero involvement for both the device and the user, ubiquity, and resilience to environmental noise. To realize this, we first design a two-stage denoising scheme to suppress both inherent and sporadic interference. To extract user-specific features, we then propose a disentanglement-based deep learning model, HIDNet, which explicitly separates user-specific features from shared nuisance components. Lastly, we develop a scalable authentication framework based on a Siamese network that eliminates the need for per-user classifier training. We conduct extensive experiments with 33 participants, achieving an average FAR of 3.13% and FRR of 2.99%, which demonstrates the practical feasibility of AccLock.
cs.CR 2026-05-13 3 theorems

Adaptive red team bypasses agent skill auditors at 40-90 percent

Proteus: A Self-Evolving Red Team for Agent Skill Ecosystems

Proteus shows feedback-driven revisions let attackers evade vetting and produce runtime harm across LLM skill ecosystems.

Agent skills extend LLM agents with reusable instructions, tool interfaces, and executable code, and users increasingly install third-party skills from marketplaces, repositories, and community channels. Because a skill exposes both executable behavior and context-setting documentation, its deployment risk cannot be measured by single-shot audits or prompt-level red teams alone: a realistic attacker can use audit and runtime feedback to repeatedly rewrite the skill. We frame this risk as adaptive leakage -- whether a budgeted attacker can iteratively revise a skill until it passes audit and produces verified runtime harm -- and present Proteus, a grey-box self-evolving red-team framework for measuring it. Proteus searches a formalized five-axis skill-attack space. Each candidate is evaluated through a unified audit-sandbox-oracle pipeline that returns structured audit findings and runtime evidence to guide cross-round mutation. Beyond initial evasion, Proteus performs path expansion, which finds alternative implementations of successful attacks, and surface expansion, which transfers learned implementation patterns to new attack objectives beyond the original seed catalogue. Across eight phase-1 cells, Proteus reaches a 40-90% Attack Success Rate at 5 rounds (ASR@5) with positive learning-curve slopes on both evaluated auditors. Phase-2 path/surface expansion produces 438 jointly bypassing and lethal variants, with SkillVetter bypassed at ≥ 93% in every cell and AI-Infra-Guard, the strongest public auditor we evaluate, still admitting up to 41.3% joint success. These results show that current skill vetting substantially underestimates residual risk when evaluated against adaptive, feedback-driven attackers.
cs.CR 2026-05-13 1 theorem

Proxy rewrites live pages to test AI agents for prompt injection

IPI-proxy: An Intercepting Proxy for Red-Teaming Web-Browsing AI Agents Against Indirect Prompt Injection

Enables red-teaming of web-browsing agents against indirect prompt injection on real whitelisted domains instead of mock pages.

Web-browsing AI agents are increasingly deployed in enterprise settings under strict whitelists of approved domains, yet adversaries can still influence them by embedding hidden instructions in the HTML pages those domains serve. Existing red-teaming resources fall short of this scenario: prompt-injection benchmarks ship pre-built adversarial pages that whitelisted agents cannot reach, and generic LLM scanners probe the model API rather than its retrieved content. We present IPI-proxy, an open-source toolkit for red-teaming web-browsing agents against indirect prompt injection (IPI). At its core is an intercepting proxy that rewrites real HTTP responses from whitelisted domains in flight, embedding payloads drawn from a unified library of 820 deduplicated attack strings extracted from six published benchmarks (BIPIA, InjecAgent, AgentDojo, Tensor Trust, WASP, and LLMail-Inject). A YAML-driven test harness independently parameterizes the payload set, the embedding technique (HTML comment, invisible CSS, or LLM-generated semantic prose), and the HTML insertion point (6 locations from head_meta to script_comment), enabling parameter-sweep evaluation without mock pages or sandboxed environments. A companion exfiltration tracker logs successful callbacks. This paper describes the threat model, situates IPI-proxy among contemporary IPI benchmarks and red-teaming tools, and details its architecture, design decisions, and configuration interface. By bridging static benchmarks and live deployment, IPI-proxy gives AI security teams a reproducible substrate for measuring and hardening web-browsing agents against indirect prompt injection on the same retrieval surface attackers exploit in production.
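The in-flight rewriting step reduces to a parameterized string transformation over the intercepted response; the sketch below shows two embedding techniques and two insertion points with a placeholder payload, and is illustrative only rather than the toolkit's actual code.

def embed_payload(html: str, payload: str, technique: str = "html_comment",
                  insertion_point: str = "head_meta") -> str:
    # Wrap the payload according to the chosen embedding technique.
    if technique == "html_comment":
        snippet = f"<!-- {payload} -->"
    elif technique == "invisible_css":
        snippet = f'<span style="display:none">{payload}</span>'
    else:
        raise ValueError(f"unknown technique: {technique}")
    # Insert the snippet just before a structural anchor in the page.
    anchors = {"head_meta": "</head>", "body_end": "</body>"}
    return html.replace(anchors[insertion_point], snippet + anchors[insertion_point], 1)

page = "<html><head><title>Docs</title></head><body><p>Hello</p></body></html>"
print(embed_payload(page, "PAYLOAD_FROM_ATTACK_LIBRARY"))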
cs.CR 2026-05-13 2 theorems

Five attacks break x402 micropayment protocol

Five Attacks on x402 Agentic Payment Protocol

The protocol's web and blockchain mix allows unpaid access or post-payment denial in tested environments.

The x402 protocol revives the HTTP 402 Payment Required status code to enable web-native micropayments across APIs, content, and agents. It combines synchronous HTTP authorization with asynchronous blockchain settlement and introduces a cross-layer attack surface absent from conventional web and on-chain payments. In this paper, we formally analyze x402 and empirically show that it is vulnerable in both design and implementation. We present five concrete attacks that reveal weaknesses in authorization, binding, replay protection, and web-layer handling, showing that x402 is vulnerable across multiple stages of the payment workflow. We validate these attacks through a reproducible testbed on local chains, Base Sepolia, and live endpoints and further audit three open-source SDKs and endpoints. Our results show that all five attacks are practical and can cause either unpaid service or paid-but-denied outcomes. We also propose practical mitigations.
cs.CR 2026-05-13 Recognition

80% of AI agent skills deviate from declared behavior

Behavioral Integrity Verification for AI Agent Skills

Verification framework attributes most gaps to oversight and detects malicious cases with 0.946 F1 score.

Agent skills extend LLM agents with privileged third-party capabilities such as filesystem access, credentials, network calls, and shell execution. Existing safety work catches malicious prompts and risky runtime actions, but the skill artifact itself goes unverified. We formalize this as the behavioral integrity verification (BIV) problem: a typed set comparison between declared and actual capabilities over a shared taxonomy that bridges code, instructions, and metadata. The BIV framework instantiates this comparison by pairing deterministic code analysis with LLM-assisted capability extraction. The resulting structured evidence supports three downstream analyses: deviation taxonomy, root-cause classification, and malicious-skill detection. On 49,943 skills from the OpenClaw registry, the deviation taxonomy reveals a pervasive description-implementation gap: 80.0% of skills deviate from declared behavior, with four novel compound-threat categories surfaced. Root-cause classification finds that deviations are mostly oversight, not malice: 81.1% trace to developer oversight and 18.9% to adversarial intent, with 5.0% of skills carrying predicted multi-stage attack chains. On a 906-skill malicious-skill detection benchmark, BIV reaches an F1 of 0.946, outperforming state-of-the-art rule-based and single-pass LLM baselines. These results demonstrate behavioral integrity auditing for agent skills at scale.
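At its core, the declared-versus-actual comparison is a typed set difference over a shared capability taxonomy, roughly as sketched below; the taxonomy labels and the upstream extraction (static analysis plus LLM-assisted parsing) are placeholders, not the paper's implementation.

from dataclasses import dataclass

@dataclass(frozen=True)
class Capability:
    kind: str     # e.g. "filesystem", "network", "shell", "credentials"
    action: str   # e.g. "read", "write", "exec", "send"

def behavioral_integrity(declared: set, actual: set) -> dict:
    # Typed set comparison between what the skill claims and what it actually does.
    return {
        "undeclared": actual - declared,   # implemented but never disclosed (potential threat)
        "unused": declared - actual,       # disclosed but never implemented (likely oversight)
        "consistent": declared & actual,
        "deviates": actual != declared,
    }

declared = {Capability("filesystem", "read")}
actual = {Capability("filesystem", "read"), Capability("network", "send")}
print(behavioral_integrity(declared, actual))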
cs.CR 2026-05-13 2 theorems

Signature scheme adds scoped linkability and threshold deanonymization

Deanonymizable Scoped Linkable Ring Signatures

Dynamic key images and embedded ElGamal components let signers stay anonymous within a context while a k-of-N network can recover identity.

Although ring signatures offer highly desirable privacy properties such as anonymity and ad-hoc group formation with signer autonomy, they partially lack trust properties such as linkability and accountability, which strict use cases such as consent management in healthcare require. Existing signature schemes fail to natively integrate scoped linkability with decentralized accountability (on-demand deanonymization) in a single scheme without relying on separate commitments or a centralized opener. We therefore introduce Deanonymizable Scoped Linkable Ring Signatures (DSLRS). DSLRS is novel in several respects. It uses scopes (context identifiers) and dynamic key images to provide scoped linkability and unlinkability across different scopes. Decentralized accountability is provided by two ElGamal components deeply embedded in the signature and a decentralized deanonymization network of k-of-N nodes that can collaboratively extract the signer's public key. The DSLRS scheme is defined and proven secure under the ECDLP and DDH hardness assumptions in the Random Oracle Model (ROM). Formal security definitions and formal reduction proofs are provided before introducing a blockchain-based instantiation of DSLRS for a consent management application.
cs.CR 2026-05-13 Recognition

Hybrid reasoning speeds CPS digital twin threat detection by 21.5%

HySecTwin: A Knowledge-Driven Digital Twin Framework Augmented with Hybrid Reasoning for Cyber-Physical Systems

Semantic models plus fuzzy logic deliver faster, transparent assessments with sub-millisecond synchronization in testbed trials.

Existing Digital Twin (DT) approaches often lack semantic reasoning capabilities for effective cybersecurity modelling in Cyber-Physical Systems (CPS). This paper presents HySecTwin, a knowledge-driven digital twin architecture that places automated reasoning at the core of real-time threat detection. HySecTwin incorporates semantic modelling to transform heterogeneous CPS telemetry, device attributes, and operational relationships into machine-interpretable representations, combined with an embedded reasoning engine operating over contextualized system states. Unlike opaque detection methods, the framework integrates deterministic rule-based inference with hybrid fuzzy reasoning to generate explicit, interpretable, and auditable security assessments from live device telemetry. This enables context-aware monitoring of complex CPS environments while preserving transparency and trust. Experimental evaluation using a representative CPS testbed and MITRE ATT&CK campaign-inspired attack scenarios demonstrates sub-millisecond twin synchronization latency and up to 21.5% faster threat detection compared with deterministic reasoning alone. The results show that semantic modelling, semantic enrichment, and hybrid reasoning improve explainability and resilience without extra system overhead. HySecTwin provides a lightweight, containerized, and extensible framework for secure-by-design digital twin deployments in mission-critical infrastructures.
cs.CR 2026-05-13 Recognition

597-line harness supports fair comparisons of LLM pen-testing agents

Cochise: A Reference Harness for Autonomous Penetration Testing

Planner-Executor split plus released logs and analysis tools let researchers isolate model and architecture effects on GOAD scenarios

Recent work on LLM-driven autonomous penetration testing reports promising results, but existing systems often combine many architectural, prompting, and tool-integration choices, making it difficult to tell what is gained over a simple agent scaffold. We present cochise, a 597 LOC Python reference harness for autonomous penetration-testing experiments. Cochise connects an LLM-driven agent to a Linux execution host over SSH and supports controlled target environments reachable from that jump host. The prototype implements a separated Planner--Executor architecture in which long-term state is maintained outside the LLM context, while a ReAct-style executor issues commands over SSH and self-corrects based on command outputs. The scenario prompt can be adapted to different target environments. To demonstrate the efficacy of our minimal harness, we evaluate it against a live third-party testbed called Game of Active Directory (GOAD). Alongside the harness, we release replay and analysis tools: (i) cochise-replay for offline visualization of captured runs, (ii) cochise-analyze-alogs and cochise-analyze-graphs for cost, token, duration, and compromise analysis, and (iii) a corpus of JSON trajectory logs from GOAD runs, allowing researchers to study agent behavior without provisioning the 48--64 GB RAM / 190 GB storage testbed themselves. Cochise is intended not as a state-of-the-art pen-testing agent, but as reusable experimental infrastructure for comparing models, agent architectures, and penetration-testing traces.
cs.CR 2026-05-13 2 theorems

Injected safety reports cut jailbreak success in reasoning models

Safety Context Injection: Inference-Time Safety Alignment via Static Filtering and Agentic Analysis

External risk assessment prepended before queries lowers attack rates and toxicity for black-box models on standard benchmarks.

Large Reasoning Models (LRMs) improve performance on complex tasks, but they also make safety control harder at deployment time. In black-box settings, defenders cannot modify model weights and must instead intervene at inference time. This setting creates three practical challenges: harmful intent may be hidden by educational or role-play framing, deep safety analysis can introduce non-trivial latency, and long adversarial contexts can dilute the local cues that simpler filters rely on. These challenges can expose an apparent thinking--output gap, where the model appears cautious during reasoning but still produces an unsafe final answer. To address this problem, we propose Safety Context Injection (SCI), an inference-time framework that separates safety assessment from task generation and prepends a structured external risk report as injected safety context for the protected model. The framework is instantiated in two complementary variants: Static Model Filtering (SMF), a lightweight one-pass guard for fast deployment, and Dynamic Agents Filtering (DAF), an agentic-loop-based analyzer that iteratively gathers and synthesizes evidence for ambiguous or long-context attacks. Across AdvBench and GPTFuzz, spanning base and reasoning models under five jailbreak families, both variants reduce attack success rate and toxicity in the evaluated settings. SMF offers an efficient low-latency option, while DAF is more effective when harmful intent is semantically disguised or dispersed across long contexts.
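The injection step itself is easy to picture: an external assessment runs first and its structured report is prepended to the protected model's input. The sketch below uses placeholder report fields and a stub assessor; the paper's SMF and DAF variants perform the actual analysis.

def build_safety_context(report: dict) -> str:
    # Serialize the external risk assessment into a block the protected model sees first.
    framings = ", ".join(report["framings"]) or "none"
    return (
        "[EXTERNAL SAFETY ASSESSMENT]\n"
        f"risk_level: {report['risk_level']}\n"
        f"detected_framings: {framings}\n"
        f"guidance: {report['guidance']}\n"
        "[END ASSESSMENT]\n\n"
    )

def protected_query(user_query: str, assess) -> str:
    report = assess(user_query)                       # SMF (one-pass) or DAF (agentic loop)
    return build_safety_context(report) + user_query  # injected safety context + original task

stub_assessor = lambda q: {"risk_level": "low", "framings": [], "guidance": "answer normally"}
print(protected_query("Summarize this paper for me.", stub_assessor))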
cs.CR 2026-05-13 Recognition

Binomial encoding embeds every payload bit at every LLM token

Every Bit, Everywhere, All at Once: A Binomial Multibit LLM Watermark

Stateful redirection keeps 64-bit messages recoverable with little change to the generated text

With LLM watermarking already being deployed commercially, practical applications increasingly require multibit watermarks that encode more complex payloads, such as user IDs or timestamps, into the generated text. In this work, we propose a fundamentally new approach for multibit watermarking: introducing binomial encoding to directly encode every bit of the payload at every token position. We complement our approach with a stateful encoder that during generation dynamically redirects encoding pressure toward underencoded bits. Our evaluation against 8 baselines on up to 64-bit payloads shows that our scheme achieves superior message accuracy and robustness, with the gap to baseline methods widening in more relevant settings (i.e., large payloads and low-distortion regimes). At the same time, we challenge prior works' evaluation metrics, highlighting their lack of practical insights, and introduce per-bit confidence scoring as a practically relevant metric for evaluating multibit LLM watermarks.
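To give intuition for the per-bit confidence scoring the abstract proposes, here is a toy decoder in which every token casts a keyed pseudorandom vote for every payload bit, so each bit's tally is binomial under the null and yields its own p-value. The vote extraction is a generic placeholder, not the paper's binomial encoder.

import hashlib
from math import comb

def token_vote(token_id: int, bit_index: int, key: bytes) -> int:
    # Pseudorandom 0/1 vote this token casts for this payload bit (keyed hash, placeholder).
    h = hashlib.sha256(key + bit_index.to_bytes(2, "big") + token_id.to_bytes(4, "big")).digest()
    return h[0] & 1

def per_bit_confidence(tokens: list, n_bits: int, key: bytes) -> list:
    results, n = [], len(tokens)
    for b in range(n_bits):
        ones = sum(token_vote(t, b, key) for t in tokens)
        bit = 1 if ones >= n - ones else 0
        k = max(ones, n - ones)
        # One-sided binomial tail P[X >= k] under the null that votes are fair coin flips.
        p_value = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
        results.append((bit, p_value))
    return results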
cs.CR 2026-05-13 2 theorems

Typed relations detect phishing despite word padding attacks

PhishSigma++: Malicious Email Detection with Typed Entity Relations

Graph of 40 entity types and 5 relations holds 0.96 F1 when text filters fall to 0.02 and 0.73.

With the rise of AI-generated content (AIGC), phishing actors now possess richer linguistic capabilities and evasion techniques. Most existing detectors over-rely on mutable textual features, achieving high accuracy on clean data but degrading severely under text-focused adversarial manipulation. This mirrors the lab-to-real performance gap. We investigate invariant signals in phishing emails: even when attackers modify surface text, functional intent constrains relations among typed entities. Threat-actor tradecraft is described via high-level TTPs, but rule-based systems like Sigma express invariants only through manually curated, field-specific patterns, limiting flexibility. We introduce PhishSigma++, an entity-relation-based malicious email detector for RFC822 messages that generalizes Sigma's design. It extracts 40 typed entity classes, computes 5 cross-type relations to build a typed email graph, and uses particle swarm optimization (PSO) to select a sparse discriminative mask, supporting classification and type-level evidence summaries. On 29,142 messages, PhishSigma++ achieves 0.9675 F1 on clean data and outperforms text-centric baselines under non-adaptive Good Word padding at ρ = 0.8: it maintains 0.9579 F1, while a token-based Bayesian filter collapses to 0.0243 and a DistilBERT phishing checkpoint falls to 0.7284. Compared with traditional Sigma rules, PhishSigma++ offers higher detection, broader relational invariance coverage, and data-driven feature selection. We also show that thresholded typed relation scores encode a useful fragment of Sigma-style field conditions, unifying hand-crafted rule logic and learned relation masks in a single-email framework.
cs.CR 2026-05-13 Recognition

CNNs cannot deanonymize I2P traffic in lab or real tests

Convolutional-Neural-Networks for Deanonymisation of I2P Traffic

Synthetic training data and Fano's inequality analysis show that encrypted patterns do not reveal service identities.

This study investigates the potential for deanonymizing services within the Invisible Internet Project (I2P) network through passive traffic analysis and machine learning techniques. The primary objective is to identify distinctive patterns in I2P traffic despite the encryption of its payload. To achieve this, a controlled laboratory environment was established to generate synthetic I2P traffic, providing a training dataset for machine learning models. Furthermore, Fano's inequality is employed to perform a theoretical analysis of anonymous data transmission in mix networks such as I2P, thereby supporting a data-driven approach to uncover causal relationships. In computer experiments, advanced deep learning methods - particularly Convolutional Neural Networks - are applied within the laboratory I2P network, and their effectiveness is further evaluated using real-world traffic data. The results indicate that the proposed methodologies do not compromise the anonymity guarantees of the I2P network.
cs.CR 2026-05-13 Recognition

Single prompt steers multi-agent LLM planning to 55% higher attack success

FlowSteer: Prompt-Only Workflow Steering Exposes Planning-Time Vulnerabilities in Multi-Agent LLM Systems

FlowSteer exploits position and framing weaknesses during workflow formation, showing that final-plan checks miss planning-time bias.

Multi-agent systems (MAS) powered by large language models (LLMs) increasingly adopt planner--executor architectures, where planners convert prompts into subtasks, roles, dependencies, and routing paths. This flexibility enables adaptive coordination, but exposes an attack surface in workflow formation: prompts can shape agent organization without modifying MAS infrastructure. We study this risk through social influence probing workflows to identify high-impact subtasks and malicious-signal propagation. The analysis reveals two vulnerabilities: workflow position can amplify or suppress a malicious signal, and sycophantic framing makes downstream agents more likely to relay it. We translate these findings into FlowSteer, a prompt-only workflow steering attack that converts vulnerability priors into one crafted prompt. FlowSteer aligns a malicious signal with influential task components and guides replanning toward dependencies that preserve propagation. Experiments show that FlowSteer increases malicious success by up to 55% over naive prompting, transfers across MAS setups, and remains effective with black-box topology inference. As FlowSteer biases the planning signals that generate the workflow, MAS defenses that inspect only the generated workflow provide limited protection. As such, we introduce FlowGuard, an input-side defense that reduces malicious success by up to 34% while preserving prompt utility. Our results position workflow formation as a new safety frontier for multi-agent LLM systems, opening a planning-time security perspective on how agent coordination itself can be attacked and defended.
cs.CR 2026-05-13 Recognition

Agents carry explicit permissions that stay consistent across systems

Digital Identity for Agentic Systems: Toward a Portable Authorization Standard for Autonomous Agents

A model using issuer-written payloads and typed constraints lets independent receivers agree on what an autonomous agent may do.

Enterprise AI is shifting from copilots to autonomous agents capable of executing workflows, negotiating outcomes, and making decisions with limited human oversight. As these systems extend across organizational boundaries, identity alone is insufficient: an agent's authority must also be explicit, constrained, auditable, revocable, and consistently interpretable by independent receivers. This paper analyzes representative enterprise use cases in insurance claims processing and supply chain integrity to surface structural gaps in existing identity and access models. It proposes a portable authorization model for autonomous agents based on issuer-authored authorization payloads, typed constraint algebra, decision-consistent evaluation semantics, delegation attenuation, governed semantic resolution, fail-closed processing, and pre-flight discovery. The model separates credential containers, authorization payload semantics, and enforcement engines, allowing profiles such as JWT/JWS, Verifiable Credentials, OAuth Rich Authorization Requests, or policy-engine bindings to preserve a common authorization meaning across trust boundaries.
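As a rough sketch of what issuer-authored payloads with typed constraints and fail-closed evaluation could look like at a receiver (the field names and constraint types here are illustrative assumptions, not part of any cited profile such as JWT, Verifiable Credentials, or OAuth RAR):

from datetime import datetime, timezone

def authorize(payload: dict, action: dict) -> bool:
    # Evaluate a requested action against the issuer's typed constraints; fail closed otherwise.
    for constraint in payload["constraints"]:
        ctype, value = constraint["type"], constraint["value"]
        if ctype == "max_amount":
            ok = action.get("amount", 0) <= value
        elif ctype == "allowed_operations":
            ok = action.get("operation") in value
        elif ctype == "not_after":
            ok = datetime.now(timezone.utc) <= datetime.fromisoformat(value)  # timezone-aware ISO-8601
        else:
            return False        # unknown constraint type: deny rather than guess
        if not ok:
            return False
    return True

payload = {"constraints": [{"type": "allowed_operations", "value": ["approve_claim"]},
                           {"type": "max_amount", "value": 5000}]}
print(authorize(payload, {"operation": "approve_claim", "amount": 1200}))   # True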
cs.CR 2026-05-13 2 theorems

One message turns LLM agents into DDoS amplifiers

Can a Single Message Paralyze the AI Infrastructure? The Rise of AbO-DDoS Attacks through Targeted Mobius Injection

Semantic closure sustains recursive calls that multiply node load up to 51x and network latency up to 229x.

Large Language Model (LLM) agents have emerged as key intermediaries, orchestrating complex interactions between human users and a wide range of digital services and LLM infrastructures. While prior research has extensively examined the security of LLMs and agents in isolation, the systemic risk of the agent acting as a disruptive hub within the user-agent-service chain remains largely overlooked. In this work, we expose a novel threat paradigm by introducing Mobius Injection, a sophisticated attack that weaponizes autonomous agents into zombie nodes to launch what we define as Agent-based and -Oriented DDoS (AbO-DDoS) attacks. By exploiting a structural vulnerability in agentic logic named Semantic Closure, an adversary can induce sustained recursive execution of agent components through a single textual injection. We demonstrate that this attack is exceptionally lightweight, stealthy against both traditional DDoS monitors and contemporary AI safety filters, and highly configurable, allowing for surgical targeting of specific environments or model providers. To evaluate the real-world impact, we conduct extensive experiments across three representative claw-style agents and three mainstream coding agents, integrated with 12 frontier proprietary or open-weight LLMs. Our results demonstrate that Mobius Injection achieves substantial attack success across diverse tasks, driving single-node call amplification up to 51.0x and multi-node p95 latency inflation up to 229.1x. The attack performance exhibits a superlinear increase with the number of poisoning nodes. To mitigate Mobius Injection, we propose a proactive defense mechanism using Agent Component Energy (ACE) Analysis, which detects malicious recursive triggers by measuring anomalous energy in the agent's component graph.
cs.CR 2026-05-13 2 theorems

Risk lattice turns consent clicks into reusable options

Options, Not Clicks: Lattice Refinement for Consent-Driven MCP Authorization

Conleash auto-permits safe MCP tool calls within boundaries, escalates risks, and achieves 98% accuracy with low overhead.

As Model Context Protocol adoption grows, securing tool invocations through meaningful user consent has become a critical challenge: existing methods, such as broad always-allow toggles or opaque LLM-based decisions, fail to account for dangerous call arguments and often lead to consent fatigue. In this work, we present Conleash, a client-side middleware that enforces boundary-scoped authorization by combining a risk lattice that auto-permits safe calls within known boundaries while escalating risky ones, a policy engine for user-defined invariants, and a refinement loop that converts user decisions into reusable rules. Evaluated on 984 real-world traces, Conleash achieved 98.2% accuracy, caught 99.4% of escalations, and added only 8.2 ms of overhead for policy verification; furthermore, in a user study (N=16), participants significantly preferred Conleash's scoped permissions over traditional methods, citing higher trust and reduced prompting.
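The decision loop described above can be sketched roughly as follows; the risk levels, the classify() step, and the rule shape are assumptions for illustration, not Conleash's actual lattice.

RISK_ORDER = ["read_public", "read_private", "write_local", "network_send", "destructive"]

def decide(call: dict, boundary: str, rules: dict, classify, ask_user) -> str:
    risk = classify(call)                                  # map tool + arguments onto the lattice
    key = (call["tool"], risk)
    if key in rules:                                       # refinement loop: reuse earlier consent
        return rules[key]
    if RISK_ORDER.index(risk) <= RISK_ORDER.index(boundary):
        return "auto_permit"                               # within the approved boundary
    verdict = ask_user(call, risk)                         # above the boundary: escalate
    rules[key] = verdict                                   # turn the click into a reusable rule
    return verdict

rules = {}
classify = lambda call: "read_private" if call["tool"] == "search_mail" else "network_send"
ask_user = lambda call, risk: "deny"
print(decide({"tool": "search_mail"}, "read_private", rules, classify, ask_user))   # auto_permit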
cs.CR 2026-05-12 Recognition

One InterUSS deployment yields security testing guide for UTM

A Systematic Security Testing Approach for InterUSS-based environments

Tests based on mTLS and OAuth 2.0 support component checks and interaction analysis in federated unmanned aircraft systems.

Unmanned Traffic Management (UTM) federated ecosystems, such as InterUSS, enable secure coordination among UAS Service Suppliers (USSs). However, they also introduce security challenges at the infrastructure level that have not been fully explored. This paper presents a security testing approach for InterUSS-based environments from the maintainer's perspective. By deploying and analyzing a working InterUSS infrastructure, we pinpoint key components and develop specific security tests aligned with established standards and protocols, such as mTLS and OAuth 2.0. We compile these tests into a Testing Guide that aids both component validation and interaction analysis across InterUSS-based ecosystems, filling a gap in current research.
cs.CR 2026-05-12 Recognition

AI turns public social posts into personalized phishing emails

Context-Aware Spear Phishing: Generative AI-Enabled Attacks Against Individuals via Public Social Media Data

Small public activity per target produces messages that beat real phishing on eight measures and raise less suspicion in tests.

We demonstrate how publicly available social-media data and generative AI (GenAI) can be misused to automate and scale highly personalized, context-aware spear-phishing campaigns. With minimal attacker effort, a small amount of public activity per target is sufficient for GenAI models to extract interests and contextual cues, producing persuasive messages that mirror a target's style while bypassing generic content-moderation safeguards. We introduce a modular framework that combines multimodal signal extraction, communication-style profiling, and attack-type instantiation across seven strategies (baiting, scareware, honey trap, tailgating, impersonation, quid pro quo, and personalized emotional exploitation). We conduct a large-scale, multi-model evaluation covering thousands of generated emails and eight security-relevant criteria, benchmarking against a corpus of real-world phishing messages. The GenAI-produced emails exhibit markedly higher personalization, contextual grounding, and persuasive leverage. Importantly, a complementary user study corroborates these results, revealing that LLM-generated attacks consistently outperform APWG eCrimeX emails across eight dimensions while eliciting lower suspicion among human recipients. Finally, we measure and analyze the behavior of existing proactive, prompt-level defenses that incorporate adaptive mechanisms, as well as two complementary defense approaches: policy-augmented state-of-the-art safeguard models and system-instruction chain-of-thought moderation. We document how these defenses respond to contextualized and adaptive attack prompts, underscoring the need for platform-level safeguards that explicitly account for contextualized abuse at scale.
cs.CR 2026-05-12 Recognition

4714 GitHub workflows hijackable via crafted comments

Comment and Control: Hijacking Agentic Workflows via Context-Grounded Evolution

Context-grounded evolution turns untrusted inputs into control over LLM agents in popular automation platforms.

Automation platforms such as GitHub Actions and n8n are increasingly adopting so-called agentic workflows, which integrate Large Language Model (LLM) agents for tasks such as code review and data synchronization. While bringing convenience for developers, this integration exposes a new risk: An adversary may control and craft certain inputs, such as GitHub issue comments, to manipulate the LLM agent for unwanted actions, such as credential exfiltration and arbitrary command execution. To our knowledge, no prior academic work has studied such a risk in agentic workflows. In this paper, we design the first detection and exploitation framework, called JAW, to hijack agentic workflows hosted on automation platforms via a novel approach called Context-Grounded Evolution. Our key idea is to evolve agentic workflow inputs under the contexts derived from hybrid program analysis for hijacking purposes. Specifically, JAW generates agentic workflow contexts through three analyses: (i) static path-feasibility analysis to identify feasible agent-invocation paths and the input constraints required to trigger them, (ii) dynamic prompt-provenance analysis to determine how that input is transformed and embedded into the LLM context, and (iii) capability analysis to identify the actions and restrictions available to the agent at runtime. Our evaluation of JAW on GitHub workflows and n8n templates showed that 4714 GitHub workflows and eight n8n templates can be successfully hijacked, for example, to leak user credentials. Our findings span 15 widely-used GitHub Actions, including official GitHub Actions for Claude Code, Gemini CLI, Qwen CLI, and Cursor CLI, and two official n8n nodes. We responsibly disclosed all findings to the affected vendors and received many acknowledgements, fixes, and bug bounties, notably from GitHub, Google, and Anthropic.
cs.CR 2026-05-12 Recognition

Fuzzer finds 15 vulnerabilities in LLM serving engines

Continuous Discovery of Vulnerabilities in LLM Serving Systems with Fuzzing

By testing timed concurrent request traces, it reveals cache isolation failures and performance interference missed by standard tests.

LLM inference and serving systems have become security-critical infrastructure; however, many of their most concerning failures arise from the serving layer rather than from model behavior alone. Modern inference engines combine KV cache, batching, prefix sharing, speculative decoding, adapters, and multi-tenant scheduling, creating shared-state behavior that only emerges under realistic concurrent workloads and is missed by standard model, safety, and API tests. We present GRIEF, a greybox fuzzer for LLM inference engines that treats timed multi-request traces as first-class inputs, uses lightweight oracles to detect crashes, hangs, performance pathologies, and silent output corruption, and applies controlled replay with log-probability checks to confirm reproducible serving-layer failures. Across early campaigns on vLLM and SGLang, GRIEF discovers 15 vulnerabilities, 10 confirmed by engine developers, including 2 CVEs, spanning KV-cache isolation failures, cross-request performance interference, and crash or liveness bugs. These results show that concurrency, caching, and state reuse can induce silent cross-request contamination, noisy-neighbor denial of service, and delayed crashes without malformed inputs or explicit server errors, making concurrent serving behavior a first-class security and reliability boundary for LLM infrastructure.
cs.CR 2026-05-12 Recognition

LLM generates SQLi that bypasses 22.73% of web firewalls

Adversarial SQL Injection Generation with LLM-Based Architectures

RADAGAS model outperforms baselines across ten firewalls, with stronger results against AI-based defenses than rule-based ones.

SQL injection (SQLi) remains one of the most serious attack classes ranked in the Open Worldwide Application Security Project (OWASP) Top 10. Today, advances in Artificial Intelligence (AI), and in Large Language Models (LLMs) in particular, create an opportunity to automate adversarial attack tests that measure defense mechanisms. In this paper, we provide a comprehensive evaluation of LLM-based approaches to adversarial SQL injection generation. We introduce two novel LLM-based systems, Retrieval Augmented Generation for Adversarial SQLi (RADAGAS) and Reflective Chain-of-Thought SQLi (RefleXQLi), and compare them with existing baselines against 10 Web Application Firewalls (WAFs) and one execution-based MySQL validator. To perform a comprehensive test, we use six rule-based open-source WAFs (ModSecurity PL1-3, Coraza PL1-3), two AI/ML-based WAFs (WAF Brain, CNN-WAF), and two commercial WAFs (AWS WAF and Cloudflare WAF). For the LLM models, we use GPT-4o, Claude 3.7 Sonnet, and DeepSeek R1. Our tests consist of 240 experiments that generate 240,000 payloads and perform 2.2 million tests against the WAFs. Our comprehensive evaluation reveals that RADAGAS-GPT4o outperforms the other baseline models with a 22.73% bypass rate. The proposed RADAGAS variants are highly successful against AI/ML-based WAFs (92.49% on WAF-Brain by RADAGAS-DeepSeek, 80.48% on CNN-WAF by RADAGAS-Claude), but struggle to bypass rule-based WAFs (0-5.70% on ModSecurity and Coraza). We also observe that generating less diverse payloads achieves more bypasses; however, such approaches perform poorly if the initially chosen payload is not successful. Overall, our findings provide a comprehensive view of using LLM-based approaches in security testing.
cs.CR 2026-05-12 2 theorems

LLMs flag too many false issues in smart contract code

Benchmarking LLM-Based Static Analysis for Secure Smart Contract Development: Reliability, Limitations, and Potential Hybrid Solutions

Benchmarking reveals reliance on variable names over program logic, limiting standalone use for blockchain security.

The irreversible nature of blockchain transactions makes the identification of smart contract vulnerabilities an essential requirement for secure system development. While Large Language Models (LLMs) are increasingly integrated into developer workflows, their reliability as autonomous security auditors remains unproven. We assess whether current generative models are a viable replacement for, or only a complement to, traditional static-analysis tools. Our findings indicate that LLM efficacy is undermined by both inherent lexical bias and a lack of rigorous validation of external data inputs. This reliance on non-semantic heuristics, such as identifier naming, leads to a high frequency of false positives. Furthermore, prompting techniques reveal a trade-off between precision and recall. These results were derived using our custom automated framework, which achieves 92% accuracy in classifying model outputs.
cs.CR 2026-05-12 Recognition

Surrogates from benign clients neutralize backdoors with false positives under 10%

FedSurrogate: Backdoor Defense in Federated Learning via Layer Criticality and Surrogate Replacement

By swapping malicious updates after identifying critical layers, the method preserves accuracy and limits attack success below 2.1 percent in non-IID settings.

Federated Learning remains highly susceptible to backdoor attacks--malicious clients inject targeted behaviours into the global model. Existing defenses suffer from substantial false-positive rates under realistic non-independent and identically distributed (non-IID) data, incorrectly flagging benign clients and degrading model accuracy even when adversaries are correctly identified. We present FedSurrogate, a novel backdoor defense that addresses this limitation by combining bidirectional gradient alignment filtering with layer-adaptive anomaly detection. FedSurrogate performs selective clustering on security-critical layers identified via directional divergence analysis, concentrating the detection signal on a low-dimensional subspace. A bidirectional soft-filtering stage screens trusted clients for residual contamination while rescuing false positives from suspects, substantially reducing misclassifications under heterogeneous conditions. Rather than removing confirmed malicious updates, FedSurrogate replaces them with downscaled surrogate updates from structurally similar benign clients, preserving gradient diversity while neutralising adversarial influence. Extensive evaluations demonstrate that FedSurrogate maintains false-positive rates below 10% across all datasets and attack types, compared to 31-32% for the nearest comparably effective baseline, while achieving superior main-task accuracy and maintaining attack success rates below 2.1% across all tested datasets and attack types under challenging non-IID settings.
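The surrogate-replacement step can be pictured as below: a flagged update is swapped for a downscaled update from the most similar benign client, measured on the security-critical layers. This is a simplified reading of the abstract, not the paper's implementation; the flagging stage and layer selection are assumed to have already run.

import numpy as np

def surrogate_replace(updates: dict, flagged: set, critical_idx: np.ndarray, scale: float = 0.5) -> dict:
    # updates: client_id -> flattened model update; flagged: ids judged malicious upstream.
    benign = [cid for cid in updates if cid not in flagged]      # assumes at least one benign client
    repaired = {cid: updates[cid] for cid in benign}
    for bad in flagged:
        v = updates[bad][critical_idx]
        def sim(cid):
            w = updates[cid][critical_idx]
            return float(np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w) + 1e-12))
        donor = max(benign, key=sim)              # most structurally similar benign client
        repaired[bad] = scale * updates[donor]    # downscaled surrogate keeps gradient diversity
    return repaired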
cs.CR 2026-05-12 2 theorems

Frontier models exploit 157 vulnerabilities in new benchmark

ExploitGym: Can AI Agents Turn Security Vulnerabilities into Real Attacks?

Agents turn supplied triggers into working attacks on real code, with success persisting when defenses are enabled.

Figure from the paper full image
abstract click to expand
AI agents are rapidly gaining capabilities that could significantly reshape cybersecurity, making rigorous evaluation urgent. A critical capability is exploitation: turning a vulnerability, which is not yet an attack, into a concrete security impact, such as unauthorized file access or code execution. Exploitation is a particularly challenging task because it requires low-level program reasoning (e.g., about memory layout), runtime adaptation, and sustained progress over long horizons. Meanwhile, it is inherently dual-use, supporting defensive workflows while lowering the barrier for offense. Despite its importance and diagnostic value, exploitation remains under-evaluated. To address this gap, we introduce ExploitGym, a large-scale, diverse, realistic benchmark on the exploitation capabilities of AI agents. Given a program input that triggers a vulnerability, ExploitGym tasks agents with progressively extending it into a working exploit. The benchmark comprises 898 instances sourced from real-world vulnerabilities across three domains, including userspace programs, Google's V8 JavaScript engine, and the Linux kernel. We vary the security protections applied to each instance, isolating their impact on agent performance. All configurations are packaged in reproducible containerized environments. Our evaluation shows that while exploitation remains challenging, frontier models can successfully exploit a non-trivial fraction of vulnerabilities. For example, the strongest configurations are Anthropic's latest model Claude Mythos Preview and OpenAI's GPT-5.5, which produce working exploits for 157 and 120 instances, respectively. Notably, even with widely used defenses enabled, models retain non-trivial success rates. These results establish ExploitGym as an effective testbed for exploitation and highlight the growing cybersecurity risks posed by increasingly capable AI agents.
0
0
cs.CR 2026-05-12 Recognition

Reusable hardened workflows could replace on-the-fly AI agent improvisation

Engineering Robustness into Personal Agents with the AI Workflow Store

Amortizing engineering costs across users allows rigorous testing and evaluation without sacrificing agent responsiveness.

Figure from the paper full image
abstract click to expand
The dominant paradigm for AI agents is an "on-the-fly" loop in which agents synthesize plans and execute actions within seconds or minutes in response to user prompts. We argue that this paradigm short-circuits disciplined software engineering (SE) processes -- iterative design, rigorous testing, adversarial evaluation, staged deployment, and more -- that have delivered the (relatively) reliable and secure systems we use today. By focusing on rapid, real-time synthesis, are AI agents effectively delivering users improvised prototypes rather than systems fit for high-stakes scenarios in which users may unwittingly apply them? This paper argues for the need to integrate rigorous SE processes into the agentic loop to produce production-grade, hardened, and deterministically-constrained agent *workflows* that substantially outperform the potentially brittle and vulnerable results of on-the-fly synthesis. Doing so may require extra compute and time, and if so, we must amortize the cost of rigor through reuse across a broad user community. We envision an *AI Workflow Store* that consists of hardened and reusable workflows that agents can invoke with far greater reliability and security than improvised tool chains. We outline the research challenges of this vision, which stem from a broader flexibility-robustness tension that we argue requires moving beyond the ``on-the-fly'' paradigm to navigate effectively.
0
0
cs.CR 2026-05-12 Recognition

Dataset releases 100 hours of Valorant play for biometric fingerprinting

BEACON: A Multimodal Dataset for Learning Behavioral Fingerprints from Gameplay Data

BEACON captures mouse, keyboard, video and network data from 28 players to enable continuous authentication research in competitive gaming.

Figure from the paper full image
abstract click to expand
Continuous authentication in high-stakes digital environments requires datasets with fine-grained behavioral signals under realistic cognitive and motor demands. However, current benchmarks are often limited by small scale, unimodal sensing, or a lack of synchronised environmental context. To address this gap, this paper introduces BEACON (Behavioral Engine for Authentication \& Continuous Monitoring), a large-scale multimodal dataset that captures diverse skill tiers in competitive \textit{Valorant} gameplay. BEACON contains approximately 430 GB of synchronised modality data (461 GB total on-disk including auxiliary \textit{Valorant} configuration captures) from 79 sessions across 28 distinct players, estimated at 102.51 hours of active gameplay, including high-frequency mouse dynamics, keystroke events, network packet captures, screen recordings, hardware metadata, and in-game configuration context. BEACON leverages the high-precision motor skills and high cognitive load inherent to tactical shooters, making it a rigorous stress test for the robustness of behavioral biometrics. The dataset allows for the study of continuous authentication, behavioral profiling, user drift, and multimodal representation learning in a high-fidelity esports setting. The authors release the dataset and code on Hugging Face and GitHub to create a reproducible benchmark for evaluating next-generation behavioral fingerprinting and security models.
0
0
cs.CR 2026-05-12 2 theorems

Domain-adapted LLMs no better than general ones at 5G threat modeling

Threat Modelling using Domain-Adapted Language Models: Empirical Evaluation and Insights

Tests on 52 configurations show adaptation and scale give inconsistent gains while decoding choices control output validity.

Figure from the paper full image
abstract click to expand
Large Language Models (LLMs) are increasingly explored for cybersecurity applications such as vulnerability detection. In the domain of threat modelling, prior work has primarily evaluated a number of general-purpose LLMs under limited prompting settings. In this study, we extend research on structured threat modelling by systematically comparing domain-adapted language models of different sizes against their general-purpose counterparts. We use both LLMs and Small Language Models (SLMs) that were domain-adapted to telecommunications and cybersecurity. For structured threat modelling, we selected the widely used STRIDE approach, and the application area is 5G security. We present a comprehensive empirical evaluation using 52 different configurations (on 8 different language models) to analyze the impact of 1) domain adaptation, 2) model scale, 3) decoding strategies (greedy vs. stochastic sampling), and 4) prompting technique on STRIDE threat classification. Our results show that domain-adapted models do not consistently outperform their general-purpose counterparts, and that decoding strategies significantly affect model behavior and output validity. They also show that while larger models generally achieve higher performance, these gains are neither consistent nor sufficient for reliable threat modelling. These findings highlight fundamental limitations of current LLMs for structured threat modelling tasks and suggest that improvements require more than additional training data or model scaling, motivating the need for more task-specific reasoning and stronger grounding in security concepts. We discuss the invalid outputs encountered and offer suggestions for prompting tailored specifically to STRIDE threat modelling.
0
0
cs.CR 2026-05-12 Recognition

LLMs generate hardware code but introduce security risks

LLMs for Secure Hardware Design and Related Problems: Opportunities and Challenges

Review maps opportunities in RTL automation and testbenches against vulnerabilities from memorization and attacks, outlining needed defenses

Figure from the paper full image
abstract click to expand
The integration of Large Language Models (LLMs) into Electronic Design Automation (EDA) and hardware security is rapidly reshaping the semiconductor industry. While LLMs offer unprecedented capabilities in generating Register Transfer Level (RTL) code, automating testbenches, and bridging the semantic gap between high-level specifications and silicon, they simultaneously introduce severe vulnerabilities. This comprehensive review provides an in-depth analysis of the state-of-the-art in LLM-driven hardware design, organized around key advancements in EDA synthesis, hardware trust, design for security, and education. We systematically expand on the methodologies of recent breakthroughs -- from reasoning-driven synthesis and multi-agent vulnerability extraction to data contamination and adversarial machine learning (ML) evasion. We integrate general discussions on critical countermeasures, such as dynamic benchmarking to combat data memorization and aggressive red-teaming for robust security assessment. Finally, we synthesize cross-cutting lessons learned to guide future research toward secure, trustworthy, and autonomous design ecosystems.
0
0
cs.CR 2026-05-12 2 theorems

Models leak secret words thematically in stories despite hiding instructions

Can You Keep a Secret? Involuntary Information Leakage in Language Model Writing

A second model recovers the hidden word from topic and imagery choices at rates up to 79 percent, and avoidance itself becomes detectable.

Figure from the paper full image
abstract click to expand
Language models are deployed in settings that require compartmentalization: system prompts should not be disclosed, chain-of-thought reasoning is hidden from users, and sensitive data passes through shared contexts. We test whether models can keep prompted information out of their writing. We give each model a secret word with instructions not to reveal it, then ask it to write a story. A second model tries to identify the secret from the story in a binary discrimination test. The secret word never appears literally in any output, but all five frontier models we test leak it thematically -- through topic choice, imagery, and setting -- at rates significantly different from chance, up to 79\%. When told to actively hide the secret, models write \emph{away from} it, and this avoidance is itself detectable. The leakage is cross-model readable, scales sharply with model size within two model families, and disappears entirely for short-form writing like jokes. Giving the model a decoy concept to ``focus on instead'' partially redirects the leakage from the real secret to the decoy. Attending to a secret appears to open up an information channel that frontier LLMs cannot close, even when instructed to.
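As a quick illustration of the experimental protocol, the sketch below runs the two-model binary discrimination test at chance-vs-leakage level. The `writer` and `judge` arguments are placeholder callables wrapping whichever chat API is in use; the prompts and trial count are assumptions, not the paper's exact setup.

```python
# Sketch of the two-model binary discrimination test described above.
# `writer(prompt) -> str` and `judge(prompt) -> str` are hypothetical chat helpers.
import random

def leakage_rate(writer, judge, secret_words, decoys, n_trials=100):
    hits = 0
    for _ in range(n_trials):
        secret = random.choice(secret_words)
        decoy = random.choice([w for w in decoys if w != secret])
        story = writer(f"You know a secret word: {secret}. Never reveal it. Write a short story.")
        # The judge only sees the story and two candidate words, in random order.
        a, b = random.sample([secret, decoy], 2)
        guess = judge(f"Story:\n{story}\n\nWhich word was the writer's secret, '{a}' or '{b}'? "
                      "Answer with one word.")
        hits += secret.lower() in guess.lower()
    return hits / n_trials  # 0.5 is chance; the abstract reports rates up to ~0.79
```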
0
0
cs.CR 2026-05-12 Recognition

Agents finish dangerous OS tasks before refusing them

LITMUS: Benchmarking Behavioral Jailbreaks of LLM Agents in Real OS Environments

New benchmark shows verbal refusals often come after the harmful operation has already run at the system level.

Figure from the paper full image
abstract click to expand
The rapid proliferation of LLM-based autonomous agents in real operating system environments introduces a new category of safety risk beyond content safety: behavior jailbreak, where an adversary induces an agent to execute dangerous OS-level operations with irreversible consequences. Existing benchmarks either evaluate safety at the semantic layer alone, missing physical-layer harms, or fail to isolate test cases, letting earlier runs contaminate later ones. We present LITMUS (LLM-agents In-OS Testing for Measuring Unsafe Subversion), a benchmark addressing both gaps via a semantic-physical dual verification mechanism and OS-level state rollback. LITMUS comprises 819 high-risk test cases organized into one harmful seed subset and six attack-extended subsets covering three adversarial paradigms (jailbreak speaking, skill injection, and entity wrapping), plus a fully automated multi-agent evaluation framework judging behavior at both conversational and OS-level physical layers. Evaluation across frontier agents reveals three findings: (1) current agents lack effective safety awareness, with strong models (e.g., Claude Sonnet 4.6) still executing 40.64% of high-risk operations; (2) agents exhibit pervasive Execution Hallucination (EH), verbally refusing a request while the dangerous operation has already completed at the system level, invisible to every prior semantic-only framework; and (3) skill injection and entity wrapping attacks achieve high success rates, exposing pronounced agent vulnerabilities. LITMUS provides the first standardized platform for reproducible, physically grounded behavioral safety evaluation of LLM agents in real OS environments.
0
0
cs.CR 2026-05-12 Recognition

Blotto models guide optimal allocation against social engineering

Cybercrime and Prevention: Colonel Blotto in Social Engineering

Models using criminology theories and attack data help nations and organizations target awareness efforts where they reduce incidents most.

Figure from the paper full image
abstract click to expand
Cybercriminals increasingly target the human factor rather than continuously advancing technological defense mechanisms. Consequently, institutions that allocate substantial resources to strengthening their cybersecurity infrastructure may remain vulnerable if a deceived employee voluntarily transmits sensitive information or financial assets to attackers. Therefore, alongside the implementation of technological defense mechanisms, particular emphasis must be placed on mitigating human vulnerabilities, which can be achieved through preventive awareness programs. However, such training activities can only be effective if they are organization- and context-specific. In this paper, we develop two Colonel Blotto game models to determine the optimal allocation of defensive resources across dominant social engineering attack vectors. We ground the models in Routine Activity Theory (RAT), borrowed from criminology, that describes crime as an event involving a motivated offender, a suitable target, and the absence of a capable guardian. Next, we quantify relevant factors via the VIVA (Value, Inertia, Visibility, Accessibility) framework, and operationalize the models by feeding real-world cybercrime data into them. The first model investigates optimal population-level prevention, focusing on nation-states as defenders; we present and compare use cases of three different countries. The second model focuses on the organization as a decision-maker; here, we analyze five use cases involving organizations of different characteristics. Our results demonstrate that theoretically grounded and data-driven models can provide decision support to policymakers and organizational leaders in allocating their efforts optimally to prevent social engineering attacks and improve their overall cyber resilience.
0
0
cs.CR 2026-05-12 2 theorems

Content embeddings detect LLM agent attacks at 0.975 AUROC

MCPShield: Content-Aware Attack Detection for LLM Agent Tool-Call Traffic

SBERT features on tool-call graphs outperform metadata baselines and graph models on task-stratified benchmarks

Figure from the paper full image
abstract click to expand
The Model Context Protocol (MCP) has become a widely adopted interface for LLM agents to invoke external tools, yet learned monitoring of MCP tool-call traffic remains underexplored. In this article, MCPShield is presented as an attack detection framework for MCP tool-call traffic that encodes each agent session as a graph (tool calls as nodes, sequential and data-flow links as edges), enriches nodes with sentence-embedding features over arguments and responses, and classifies sessions as benign or attacked. Three GNN architectures (GAT, GCN, GraphSAGE), a no-graph MLP, and classical baselines (XGBoost, random forest, logistic regression, linear SVM) are evaluated, with the full architecture comparison conducted on RAS-Eval (task-stratified splits) and GraphSAGE retained as the GNN baseline on ATBench and a combined-source variant (both label-stratified). Three findings emerge. First, content-level features are essential: metadata-only detection plateaus around an AUROC of 0.64 regardless of architecture, while content embeddings push the AUROC above 0.89. Second, naive random-split evaluation inflates AUROC by up to 26 percentage points relative to task-disjoint splits, a memorization confound that prior agent-detection work has not addressed. Third, the detection signal resides primarily in the SBERT content embeddings: an AUROC of 0.975 was reached by tree ensembles on pooled embeddings, performing, for the most part, better than the neural architectures in the primary RAS-Eval setting including GNNs (0.917) and the MLP (0.896), and self-supervised pre-training does not deliver a label-efficiency advantage on this task.
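Since the abstract says the detection signal lives mostly in pooled content embeddings, here is a minimal sketch of that baseline: embed each tool call's arguments and responses, mean-pool over the session, and train a tree ensemble. The session format, embedding model, and classifier settings are illustrative assumptions, not the paper's pipeline, and the graph construction is omitted.

```python
# Sketch of the strongest baseline reported above: pooled sentence embeddings of
# tool-call content fed to a tree ensemble.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for the paper's SBERT model

def session_vector(session):
    # session: list of {"tool": str, "arguments": str, "response": str}
    texts = [f'{c["tool"]} {c["arguments"]} {c["response"]}' for c in session]
    emb = encoder.encode(texts)   # one embedding per tool call
    return emb.mean(axis=0)       # mean-pool over the session

def fit_and_score(train_sessions, y_train, test_sessions, y_test):
    X_train = np.stack([session_vector(s) for s in train_sessions])
    X_test = np.stack([session_vector(s) for s in test_sessions])
    clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_train, y_train)
    return roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
```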
0
0
cs.CR 2026-05-12 1 theorem

Attacks cut Bluetooth ranging distances by up to 18 meters

Security Analysis of Time-of-Arrival Estimation via Cross-Correlation under Narrow-Band Conditions

Symbol-agnostic waveform multiplication and negative group delay filters produce earlier cross-correlation peaks in narrowband ToA systems.

Figure from the paper full image
abstract click to expand
Time-of-arrival (ToA) estimation via cross-correlation is an essential building block of time-of-flight ranging. However, in narrowband systems, it is notoriously difficult to protect against distance-decreasing attacks such as Early-Detect/Late-Commit (ED/LC). We present and analyze two new attacks that reshape ranging signals to compromise correlation-based ToA estimation. The first attack multiplies the signal by a symbol-periodic waveform in the time domain, while the second passes it through a negative group delay (NGD) filter. In contrast to ED/LC, our attacks do not require real-time symbol detection or adaptive compensation; they are completely symbol-agnostic. We describe implementation strategies for both attacks and discuss NGD filtering in the context of Bluetooth Channel Sounding (CS), a recent narrowband ranging system. To this end, we simulate an NGD circuit in LTspice and a ToA estimator in MATLAB, demonstrating that the attack can result in distance reductions of up to 18 m against Bluetooth CS RTT ranging. Finally, we verify the feasibility of the NGD approach by building a prototype using commercial off-the-shelf components.
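For context, the sketch below shows only the estimator these attacks target: ToA read off as the peak of the cross-correlation between the received signal and a known template. The pulse shape, sample rate, and noise level are illustrative, and neither the symbol-periodic multiplication nor the NGD filter is reproduced here.

```python
# Sketch of a cross-correlation ToA estimator; an attacker reshaping the received
# waveform aims to pull the argmax earlier, shortening the measured distance.
import numpy as np

fs = 1e6                                    # 1 MS/s sample rate (assumption)
t = np.arange(256) / fs
template = np.sin(2 * np.pi * 50e3 * t) * np.hanning(t.size)   # known ranging waveform

true_delay = 37                             # samples
rx = np.zeros(1024)
rx[true_delay:true_delay + template.size] = template
rx += 0.05 * np.random.randn(rx.size)       # additive noise

corr = np.correlate(rx, template, mode="valid")
toa_samples = int(np.argmax(corr))          # estimated arrival index
print(toa_samples, true_delay)
```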
0
0
cs.CR 2026-05-12 Recognition

Embedding disruptions re-trigger LLM safeguards to block jailbreaks

Re-Triggering Safeguards within LLMs for Jailbreak Detection

By tweaking prompt embeddings, the method activates built-in defenses against attacks that bypass safety features.

Figure from the paper full image
abstract click to expand
This paper proposes a jailbreaking prompt detection method for large language models (LLMs) to defend against jailbreak attacks. Although recent LLMs are equipped with built-in safeguards, it remains possible to craft jailbreaking prompts that bypass them. We argue that such jailbreaking prompts are inherently fragile, and thus introduce an embedding disruption method to re-activate the safeguards within LLMs. Unlike previous defense methods that aim to serve as standalone solutions, our approach instead cooperates with the LLM's internal defense mechanisms by re-triggering them. Moreover, through extensive analysis, we gain a comprehensive understanding of the disruption effects and develop an efficient search algorithm to identify appropriate disruptions for effective jailbreak detection. Extensive experiments demonstrate that our approach effectively defends against state-of-the-art jailbreak attacks in white-box and black-box settings, and remains robust even against adaptive attacks.
0
0
cs.CR 2026-05-12 Recognition

AI image editors auto-add hidden logos from inputs

Generate "Normal", Edit Poisoned: Branding Injection via Hint Embedding in Image Editing

Stealthy branding injection reaches 44 percent success in service-controlled workflows without user notice.

Figure from the paper full image
abstract click to expand
With the rapid advancement of generative AI, users increasingly rely on image-generation models for image design and creation. To achieve faithful outputs, users typically engage in multi-turn interactions during image refinement: a text-to-image generation phase followed by a text-guided image-to-image editing phase. In this paper, we investigate a novel security vulnerability associated with such a workflow. Our key insight is that a nearly invisible hint, like branding information (e.g., a logo), embedded in an input image can be recognized by downstream generative models and subsequently re-rendered onto semantically related objects, even when the user prompt does not explicitly mention it. This form of hidden payload injection makes the attack stealthy. We study two realistic attack scenarios. The first is a phishing-based setting, in which an attacker controls an online image generation service and injects hidden content into generated images before they are returned to users. The second is a poison-based setting, where an attacker distributes a compromised text-to-image diffusion model whose output contains hidden content. We evaluate both attacks using six injected payloads, including well-known logos and customized designs, and demonstrate that the two attacks can achieve success rates of 44.4% and 32.2% on average, respectively, while ensuring the injected logos are visually imperceptible. We also develop a mitigation solution that achieves an average success rate of 87.4% and 92.3% against the phishing-based and poison-based attacks, respectively.
0
0
cs.CR 2026-05-12 Recognition

Disrupt-and-rectify smoothing guarantees jailbreak defense

Guaranteed Jailbreaking Defense via Disrupt-and-Rectify Smoothing

The two-stage method supplies provable success bounds while keeping the model helpful on normal inputs.

Figure from the paper full image
abstract click to expand
This paper proposes a guaranteed defense method for large language models (LLMs) to safeguard against jailbreaking attacks. Drawing inspiration from the denoised-smoothing approach in the adversarial defense domain, we propose a novel smoothing-based defense method, termed Disrupt-and-Rectify Smoothing (DR-Smoothing). Specifically, we integrate a two-stage prompt processing scheme -- first disrupting the input prompt, then rectifying it -- into the conventional smoothing defense framework. This disrupt-and-rectify approach improves upon previous disrupt-only approaches by restoring out-of-distribution disrupted prompts to an in-distribution form, thereby reducing the risk of unpredictable LLM behavior. In addition, this two-stage scheme offers a distinct advantage in striking a balance between harmlessness and helpfulness in jailbreaking defense. Notably, we present a theoretical analysis for a generic smoothing framework, offering a tight bound for the defense success probability and the requirements on the disruption strength. Our approach can defend against both token-level and prompt-level jailbreaking attacks, under both established and adaptive attacking scenarios. Extensive experiments demonstrate that our approach surpasses current state-of-the-art defense methods in terms of both harmlessness and helpfulness.
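A minimal sketch of the smoothing loop, under stated assumptions: `disrupt`, `rectify`, `llm`, and `is_refused` are placeholder callables (for example, token masking, an LLM-based paraphraser, the protected model, and a refusal classifier), and the certified-bound machinery from the paper is not shown.

```python
# Sketch of a disrupt-and-rectify smoothing decision rule. If enough smoothed copies
# of the prompt trip the model's own safeguards, the prompt is treated as a jailbreak.
def dr_smoothing_flag(prompt, llm, disrupt, rectify, is_refused,
                      n_samples=20, threshold=0.5):
    refusals = 0
    for _ in range(n_samples):
        noisy = disrupt(prompt)     # stage 1: perturb the prompt
        clean = rectify(noisy)      # stage 2: pull it back toward the data distribution
        refusals += is_refused(llm(clean))
    return refusals / n_samples >= threshold
```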
0
0
cs.CR 2026-05-12 Recognition

Four diagnostics required to validate safe fine-tuning claims

Acceptance Cards: A Four-Diagnostic Standard for Safe Fine-Tuning Defense Claims

Gap reductions alone can come from noise or artifacts; Acceptance Cards demands reliability, fresh generalization, mechanism match, and cross-task transfer.

Figure from the paper full image
abstract click to expand
Safe fine-tuning defenses are often endorsed on the basis of a held-out gap reduction, but the same reduction can come from sampling noise, subject artifacts, capability loss, or a mechanism that does not transfer. We introduce Acceptance Cards: an evaluation protocol, a documentation object, an executable audit package, and a claim-specific evidential standard for safe fine-tuning defense claims. The protocol checks statistical reliability, fresh semantic generalization, mechanism alignment, and cross-task transfer before treating a gap reduction as a full-card pass. Re-scored under this installed-gap protocol, SafeLoRA fails the full-card pass on Gemma-2-2B-it: under strict mechanism-class coding it fails all four diagnostics, and under a permissive shrinkage relabel it still fails three of four. This is a narrow installed-gap audit on one model family, not a global judgment of SafeLoRA's effectiveness. In a 46-cell audit, no cell satisfies the strict conjunction. The closest family is a near miss that passes reliability and mechanism checks where the required data are available, but fails the fresh-subject threshold, lacks a strict transfer pass, and carries a measurable deployment-accuracy cost.
0
0
cs.CR 2026-05-12 Recognition

Context edits make agents unsafe yet still finish user tasks

Red-Teaming Agent Execution Contexts: Open-World Security Evaluation on OpenClaw

DeepTrap finds hidden context changes that bypass output-only safety checks across 42 test cases and nine models.

Figure from the paper full image
abstract click to expand
Agentic language-model systems increasingly rely on mutable execution contexts, including files, memory, tools, skills, and auxiliary artifacts, creating security risks beyond explicit user prompts. This paper presents DeepTrap, an automated framework for discovering contextual vulnerabilities in OpenClaw. DeepTrap formulates adversarial context manipulation as a black-box trajectory-level optimization problem that balances risk realization, benign-task preservation, and stealth. It combines risk-conditioned evaluation, multi-objective trajectory scoring, reward-guided beam search, and reflection-based deep probing to identify high-value compromised contexts. We construct a 42-case benchmark spanning six vulnerability classes and seven operational scenarios, and evaluate nine target models using attack and utility grading scores. Results show that contextual compromise can induce substantial unsafe behavior while preserving user-facing task completion, demonstrating that final-response evaluation is insufficient. The findings highlight the need for execution-centric security evaluation of agentic AI systems. Our code is released at: https://github.com/ZJUICSR/DeepTrap
0
0
cs.CR 2026-05-12 1 theorem

This paper performs a structured bidirectional review of peer-reviewed studies on AI and DLT convergence.

SoK: A Systematic Bidirectional Literature Review of AI & DLT Convergence

A systematic review of AI-DLT convergence finds research clustered on execution/consensus layers for AI-to-DLT and data/model layers for DLT-to-AI.

Figure from the paper full image
abstract click to expand
The integration of Artificial Intelligence (AI) with Distributed Ledger Technology (DLT) has become a growing research area, yet contributions tend to cluster around specific application domains or examine only one direction of the integration, leaving the broader architectural interplay between the two technologies poorly understood. This work addresses that gap through a structured, bidirectional review of peer-reviewed studies published between 2020 and 2025. We classify contributions along two directions: AI-enhanced DLT, and DLT-enhanced AI. In the first case, we examine how AI techniques improve DLT systems across five layers: data, network, consensus, execution, and application layers. In the second case, we analyse how DLT supports AI systems across five layers: infrastructure, data, model, inference, and application layers, with particular attention to federated learning, model evaluation, and multi-agent coordination. The analysis reveals that most works concentrate on a small subset of layers: execution and consensus for AI-enhanced DLT, data and model for DLT-enhanced AI. Other layers remain comparatively neglected. Despite reported improvements in controlled settings, no study demonstrates deployment at production scale, and the field has not yet offered satisfying answers to fundamental questions around scalability, interoperability, and verifiable execution. We argue that progress will require cross-layer co-design and empirical validation in real-world settings.
0
0
cs.CR 2026-05-12 1 theorem

Mild condition tightens Banaszczyk lattice Gaussian bound

A Note on Banaszczyk's Inequality

Adding a usable restriction on parameters improves the tail estimate used in LWE dual-attack analysis.

abstract click to expand
Banaszczyk's inequality establishes a tail estimate for the discrete Gaussian measure on a lattice in $\mathbb{R}^n$. This classic result has been influential and plays an important role in lattice-based cryptography. An improvement of the inequality with a transparent proof was given by Tian, Liu and Xu. In this note, we further improve this inequality by imposing an appropriate condition, obtaining a significantly better bound. This refined inequality can be used to investigate dual attacks against the Learning With Errors (LWE) problem.
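For reference, a commonly cited form of the classical tail bound is stated below, for the discrete Gaussian with unit parameter and in the notation standard in lattice cryptography; the note's refined bound and the extra condition it imposes are not reproduced here.

```latex
% Classical Banaszczyk-style tail estimate (commonly cited form, unit Gaussian parameter).
% For an $n$-dimensional lattice $\Lambda \subset \mathbb{R}^n$ and any $c \ge \tfrac{1}{\sqrt{2\pi}}$:
\[
  \Pr_{\vec x \sim D_{\Lambda}}\bigl[\,\lVert \vec x \rVert > c\sqrt{n}\,\bigr]
  \;\le\; \Bigl( c\sqrt{2\pi e}\; e^{-\pi c^{2}} \Bigr)^{n}.
\]
% The factor in parentheses drops below $1$ exactly when $c > \tfrac{1}{\sqrt{2\pi}}$,
% which is what makes the bound useful; tightening it under an additional parameter
% condition is the subject of the note above.
```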
0
0
cs.CR 2026-05-12 2 theorems

Hybrid tokenization curbs DGA drift over 9 years

DRIFT: Drift-Resilient Invariant-Feature Transformer for DGA Detection

Character and subword encodings plus self-supervised tasks keep detection rates stable as new domain generators appear.

Figure from the paper full image
abstract click to expand
Domain Generation Algorithms (DGAs) evolve continuously to evade botnet detection, posing a persistent challenge for dependable network defense. While deep learning-based detectors achieve strong performance under static conditions, they suffer severe degradation when facing temporal drift. Through a 9-year longitudinal study (2017-2025), we empirically show that state-of-the-art character- and word-based DGA classifiers rapidly lose effectiveness as new DGA variants emerge. To address this problem, we propose a drift-resilient Transformer-based framework that learns invariant representations through a hybrid tokenization strategy and multi-task self-supervised pre-training. The model integrates (i) character-level encoding to capture stochastic morphological patterns and (ii) subword-level encoding for word-based DGAs. Three pre-training tasks enable the model to learn robust structural and contextual features prior to supervised fine-tuning. Comprehensive evaluations demonstrate that our method significantly mitigates temporal degradation and consistently outperforms state-of-the-art baselines in forward-chaining experiments. The proposed approach offers a dependable foundation for long-term DGA defense in evolving threat landscapes. Our code is available at: https://github.com/snsec-net/2026-DSN-DRIFT.
0
0
cs.CR 2026-05-12 Recognition

DAO voting data flags future forks months early

Mapping Partisan Fault Lines Within DAOs

Nouns analysis shows 90% of fork addresses clustered in final 44 proposals versus 47% random.

Figure from the paper full image
abstract click to expand
Decentralised Autonomous Organisations (DAO) can fragment when partisan communities emerge within their governance structures, leading to organisational splits known as "forks". We present a method to detect these emerging communities by analysing on-chain voting behaviour before fragmentation occurs. Our approach extracts voting events from governance smart contracts, constructs voter matrices encoding participation patterns, and applies pairwise dissimilarity analysis to quantify ideological divergence between addresses. We visualise these relationships using multidimensional scaling and identify partisan communities through k-means clustering with silhouette score optimisation. Using Nouns DAO as a case study, a protocol that has experienced multiple documented forks, we demonstrate that addresses destined to fork cluster together months before actual fragmentation events. Our analysis of 330 proposals spanning from contract deployment to the first major fork shows that 90% of fork addresses cluster together in the final 44 proposals, compared to only 47% in randomised data. These results indicate that partisan communities can be detected and visualised through on-chain governance analysis, offering early warnings of emerging divisions before they cause organisational fragmentation.
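The pipeline the abstract outlines maps directly onto standard tooling; the sketch below is a hedged version of it, where the vote encoding (+1 for, -1 against, 0 abstain/absent) and the Hamming-style dissimilarity are illustrative choices rather than the paper's exact metric.

```python
# Sketch of the voter-matrix -> dissimilarity -> MDS -> k-means pipeline described above.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def partisan_clusters(vote_matrix, max_k=6):
    # vote_matrix: addresses x proposals, entries in {-1, 0, +1}
    D = squareform(pdist(vote_matrix, metric="hamming"))      # pairwise dissimilarity
    coords = MDS(n_components=2, dissimilarity="precomputed",
                 random_state=0).fit_transform(D)             # 2-D layout for visualisation
    best_k, best_labels, best_score = None, None, -1.0
    for k in range(2, max_k + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(coords)
        score = silhouette_score(coords, labels)
        if score > best_score:
            best_k, best_labels, best_score = k, labels, score
    return best_k, best_labels, coords
```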
0
0
cs.CR 2026-05-12 2 theorems

Tiny visual changes poison medical RAG outputs

Knowledge Poisoning Attacks on Medical Multi-Modal Retrieval-Augmented Generation

A query-agnostic attack injects ambiguous misinformation triggered by imperceptible image perturbations to force wrong but believable AI generations.

Figure from the paper full image
abstract click to expand
Retrieval-augmented generation (RAG) is a widely adopted paradigm for enhancing LLMs in medical applications by incorporating expert multimodal knowledge during generation. However, the underlying retrieval databases may naturally contain, or be intentionally injected with, adversarial knowledge, which can perturb model outputs and undermine system reliability. To investigate this risk, prior studies have explored knowledge poisoning attacks in medical RAG systems. Nevertheless, most of them rely on the strong assumption that adversaries possess prior knowledge of user queries, which is unrealistic in deployments and substantially limits their practical applicability. In this paper, we propose M\textsuperscript{3}Att, a knowledge-poisoning framework designed for medical multimodal RAG systems, assuming only limited distribution knowledge of the underlying database. Our core idea is to inject covert misinformation into textual data while using paired visual data as a query-agnostic trigger to promote retrieval. We first propose a unified framework that introduces imperceptible perturbations to visual inputs to manipulate retrieval probabilities. Besides, due to the prior medical knowledge in LLMs, naively poisoned medical content with explicit factual errors can be corrected during generation. Thus, we leverage the inherent ambiguity of medical diagnosis and design a covert misinformation injection strategy that degrades diagnostic accuracy while evading model self-correction. Experiments on five LLMs and datasets demonstrate that M\textsuperscript{3}Att consistently produces clinically plausible yet incorrect generations. Codes: https://github.com/ypr17/M3Att.
0
0
cs.CR 2026-05-12 2 theorems

Framework stops SQL attacks hidden in LLM prompts

When Prompts Become Payloads: A Framework for Mitigating SQL Injection Attacks in Large Language Model-Driven Applications

Multi-layer system sanitizes inputs and detects anomalies to keep natural language database queries safe.

Figure from the paper full image
abstract click to expand
Natural language interfaces to structured databases are becoming increasingly common, largely due to advances in large language models (LLMs) that enable users to query data using conversational input rather than formal query languages such as SQL. While this paradigm significantly improves usability and accessibility, it introduces new security risks, particularly the amplification of SQL injection vulnerabilities through the prompt-to-SQL translation process. Malicious users can exploit these mechanisms by crafting adversarial prompts that manipulate model behavior and generate unsafe queries. In this work, we propose a multi-layered security framework designed to detect and mitigate LLM-mediated SQL injection attacks. The framework integrates a front-end security shield for prompt sanitization, an advanced threat detection model for behavioral and semantic anomaly identification, and a signature-based control layer for known attack patterns. We evaluate the proposed framework under diverse and realistic attack scenarios, including prompt injection, obfuscated SQL payloads, and context-manipulation attacks. To ensure robustness, we generate and curate a comprehensive benchmark dataset of adversarial prompts and assess performance across a fine-tuned LLM configuration. Experimental results demonstrate that the proposed approach achieves high detection accuracy while maintaining low false-positive rates, significantly improving the secure deployment of LLM-powered database applications.
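To make the layering concrete, here is a minimal sketch of a cheap front-end check in the spirit of the framework: a signature layer for known SQLi patterns plus a pluggable anomaly scorer standing in for the paper's fine-tuned detection model. Patterns, threshold, and the interface are illustrative assumptions.

```python
# Minimal layered screen for natural-language prompts before text-to-SQL translation.
import re

SQLI_SIGNATURES = [
    r"(?i)\bunion\b.+\bselect\b",
    r"(?i)\bor\b\s+1\s*=\s*1",
    r"(?i);\s*drop\s+table\b",
    r"--\s*$",
]

def screen_prompt(prompt: str, anomaly_scorer=None, anomaly_threshold=0.8):
    # Layer 1: signature matching on the raw prompt (known attack patterns).
    for pattern in SQLI_SIGNATURES:
        if re.search(pattern, prompt):
            return "blocked", f"signature:{pattern}"
    # Layer 2: semantic/behavioral anomaly detection (placeholder callable).
    if anomaly_scorer is not None and anomaly_scorer(prompt) >= anomaly_threshold:
        return "blocked", "anomaly"
    # Layer 3: the sanitized prompt is forwarded to the text-to-SQL model downstream.
    return "allowed", prompt.strip()
```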
0
0
cs.CR 2026-05-12 2 theorems

KEM-IES upgrades ECIES with PQC KEM and Ascon

Key Encapsulation Mechanism-Based Integrated Encryption Scheme (KEM-IES)

The replacement resists quantum attacks and reduces computation for embedded devices such as Raspberry Pi.

abstract click to expand
The Elliptic Curve Integrated Encryption Scheme (ECIES) is widely regarded as a practical method and has been adopted by multiple standards. However, the advancement of quantum computing technologies poses potential security risks to ECIES. Therefore, this study proposes a Key Encapsulation Mechanism-Based Integrated Encryption Scheme (KEM-IES), which enhances resistance to quantum attacks by incorporating a Post-Quantum Cryptography (PQC)-based Key Encapsulation Mechanism (KEM). Furthermore, the study integrates the Ascon algorithm, released by the National Institute of Standards and Technology (NIST) in August 2025, to further improve computational efficiency and enable applicability to embedded systems. The proposed KEM-IES and its Ascon-based variant are implemented on a Raspberry Pi 4, and evaluations are conducted to compare the performance of ECIES and KEM-IES.
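The composition is the familiar KEM-then-AEAD pattern; the sketch below shows its shape under explicit assumptions. `kem_encapsulate`/`kem_decapsulate` are hypothetical placeholders for a PQC KEM binding (e.g., ML-KEM), and ChaCha20-Poly1305 stands in for the Ascon AEAD, which has no ubiquitous Python binding; key derivation and AAD binding are illustrative choices.

```python
# Sketch of a KEM-based integrated encryption scheme: encapsulate, derive an AEAD key,
# then encrypt with the KEM ciphertext bound as associated data.
import os
from cryptography.hazmat.primitives.ciphers.aead import ChaCha20Poly1305
from cryptography.hazmat.primitives.kdf.hkdf import HKDF
from cryptography.hazmat.primitives import hashes

def kem_encapsulate(pk):          # placeholder for a PQC KEM encapsulation (e.g., ML-KEM)
    raise NotImplementedError("plug in a real KEM binding here")

def kem_decapsulate(sk, ct):      # placeholder for the matching decapsulation
    raise NotImplementedError("plug in a real KEM binding here")

def derive_key(shared_secret: bytes) -> bytes:
    return HKDF(algorithm=hashes.SHA256(), length=32, salt=None,
                info=b"kem-ies").derive(shared_secret)

def encrypt(recipient_pk, plaintext: bytes):
    ct_kem, shared = kem_encapsulate(recipient_pk)
    nonce = os.urandom(12)
    aead_ct = ChaCha20Poly1305(derive_key(shared)).encrypt(nonce, plaintext, ct_kem)
    return ct_kem, nonce, aead_ct

def decrypt(recipient_sk, ct_kem, nonce, aead_ct):
    shared = kem_decapsulate(recipient_sk, ct_kem)
    return ChaCha20Poly1305(derive_key(shared)).decrypt(nonce, aead_ct, ct_kem)
```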
0
0
cs.CR 2026-05-12 2 theorems

Usability demands trick LLMs into insecure code

Usability as a Weapon: Attacking the Safety of LLM-Based Code Generation via Usability Requirements

Explicit goals for features and performance lead models to drop implicit security rules at high rates.

Figure from the paper full image
abstract click to expand
Large Language Models (LLMs) are increasingly used for automated software development, making their ability to preserve secure coding practices critical. In practice, however, many security requirements are implicit or underspecified, whereas usability requirements are explicit and high-signal. This asymmetry motivates our investigation of usability pressure as a practical attack surface: realistic usability-oriented requirements (e.g., new features, performance constraints, or simplicity demands) can cause coding LLMs to satisfy explicit usability goals while silently dropping implicit security constraints -- a form of reward hacking. We formalize this threat as UPAttack and propose U-SPLOIT, an automated framework to craft UPAttack that (i) selects tasks where a model is initially secure, (ii) synthesizes usability pressures by identifying usability rewards of insecure alternatives across three vectors (Functionality, Implementation, Trade-off), and (iii) verifies security regression via both existing test cases and dynamically generated exploit payloads. Across 75 seed scenarios (25 CWEs x 3 cases), spanning multiple languages (Python, C, and JavaScript), U-SPLOIT achieves attack success rates up to 98.1% on multiple state-of-the-art models (e.g., GPT-5.2-chat and Gemini-3-Flash-Preview).
0
0
cs.CR 2026-05-12 Recognition

LLM agents discover 40 bugs in V8 JavaScript engine

Agentic Fuzzing: Opportunities and Challenges

By reasoning from past bugs to new scenarios and testing them, the method uncovers logic issues across multiple engines.

Figure from the paper full image
abstract click to expand
Fuzzers and static analyzers find many bugs but struggle with logic bugs in mature codebases. Triggering such a bug often requires multi-step reasoning that produces no distinctive execution feedback, and variants can appear across implementations too different for a single pattern to match. Recent LLM-assisted approaches help, but they use LLMs as auxiliaries rather than as the reasoning engine. We propose agentic fuzzing, a bug-finding approach seeded by historical bugs in which deep agents perform the reasoning directly. Given a reference bug, the agent analyzes its root cause, hypothesizes new scenarios elsewhere in the codebase that may share that cause, and verifies each hypothesis by generating and running proof-of-concept code. This lets the agent find variants that differ completely in trigger path or code structure from the reference. We identify three practical challenges in implementing agentic fuzzing: harness engineering, redundant investigations across seeds with similar root causes, and scheduling seeds in a large corpus. We address these in AFuzz through a four-stage agent pipeline, scenario coverage that deduplicates previously explored scenarios, and a DPP-MAP scheduler that orders seeds by diversity. We ran AFuzz on the V8 JavaScript engine for about one month, finding 40 bugs (including three duplicates), receiving a total $35,000 bounty, and being assigned two CVEs. AFuzz also found 19 bugs (including one duplicate) in SpiderMonkey and JavaScriptCore using the seeds from V8. However, agentic fuzzing is in its early stages with several remaining open problems we discuss in the paper. Still, we think it points to a promising direction for finding logic bugs.
0
0
cs.CR 2026-05-12 1 theorem

LPN hardness amplifies from few instances to almost all

Hardness Amplification for (Sparse) LPN

Scaling the noise rate and dimension converts small-success solvers into ones succeeding on nearly every instance for LPN and its variants.

abstract click to expand
We prove new hardness amplification results for Learning Parity with Noise ($\mathsf{LPN}$) and its sparse variants. In $\mathsf{LPN}_{\eta,n,m}$, the goal is to recover a secret $\vec s\in\mathbb{F}_2^n$ from $m$ noisy linear samples $(\vec a,b)$, where $\vec a\leftarrow \mathbb{F}_2^n$ is uniform and $b=\langle \vec a,\vec s\rangle + e$ with $e\leftarrow \mathrm{Ber}(\eta)$. Building on the direct-product framework introduced by Hirahara and Shimizu [HS23], we show an 'instance-fraction amplification' theorem: for any $\varepsilon,\delta>0$, any algorithm that solves $\mathsf{LPN}_{\eta,n,m}$ with success probability $\varepsilon$ can be transformed into an algorithm that succeeds with probability $1-\delta$ on a related $\mathsf{LPN}$ distribution with scaled parameters $\mathsf{LPN}_{\eta/k,\;n/k,\;m}$, where $ k=\Theta\!\left(\frac{1}{\delta}\log\frac{1}{\varepsilon}\right). $ Equivalently, an algorithm that solves $\mathsf{LPN}$ on a 'small fraction of instances' can be converted into an algorithm that solves $\mathsf{LPN}$ on 'almost all instances', yielding a self-amplification for a wide range of parameters. We extend the same amplification approach to $\mathsf{LPN}$ over $\mathbb{F}_q$ and to Sparse-$\mathsf{LPN}$, where each query vector $\vec a$ has exactly $\sigma$ nonzero entries. Together, these results establish hardness self-amplification for a broad family of $\mathsf{LPN}$-type problems, strengthening the foundations for assuming the average-case hardness of $\mathsf{LPN}$ and its sparse variants.
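For quick reference, the abstract's parameter relation is restated below in its own notation; nothing beyond what the abstract states is added.

```latex
% Instance-fraction amplification: a solver with success probability $\varepsilon$ on
% $\mathsf{LPN}_{\eta,n,m}$ yields a solver with success probability $1-\delta$ on
\[
  \mathsf{LPN}_{\eta/k,\; n/k,\; m},
  \qquad
  k = \Theta\!\left(\tfrac{1}{\delta}\log\tfrac{1}{\varepsilon}\right),
\]
% i.e. the secret dimension and the noise rate both shrink by the factor $k$
% while the sample count $m$ is unchanged.
```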
0
0
cs.CR 2026-05-12 2 theorems

Compiler adds ARM PA and BTI checks to stop Spectre attacks

Janus: Compiler-Based Defense Against Transient Execution Attacks Using ARM Hardware Primitives

Janus instruments code at compile time to cover speculative paths with existing CFI hardware, adding 3.85 percent average overhead on SPEC.

Figure from the paper full image
abstract click to expand
We present Janus, a compiler-based security framework that mitigates transient execution attacks like Spectre and control-flow hijacking on ARM64 platforms. Janus integrates speculative execution and control flow dependencies with PA modifiers, using PA and BTI microarchitectural features to prevent control-flow speculation attacks and secure both control flow and speculative execution through existing control-flow integrity mechanisms. To optimize performance, Janus minimizes overhead by merging defense operations across different defense layers (modifier fusion) and reusing registers of protected variables (carrier reuse), while maintaining strong security guarantees. Evaluation on SPEC CPU2017 shows an average performance overhead of 3.85%, with real-world applications exhibiting overheads ranging from 2.97% to 7.80%. Janus offers effective speculative execution security and low performance and code size overhead, making it a robust solution for ARM-based systems.
0
0
cs.CR 2026-05-12 Recognition

Low-cost methods yield validated firmware from three Holy Stone drones

A Multi-Interface Firmware Acquisition and Validation Methodology for Low-Cost Consumer Drones: A Case Study on Three Holy Stone Platforms

Four acquisition interfaces and entropy-plus-signature checks create a reusable baseline for consumer UAV security analysis.

Figure from the paper full image
abstract click to expand
Consumer unmanned aerial vehicles (UAVs) have evolved into capable computing platforms, yet their embedded firmware remains largely inaccessible to the security community. Entry-level models, in particular those marketed to first-time and younger operators, commonly ship with limited protection mechanisms and no public documentation of their software internals. This paper presents a systematic study of firmware extraction and validation applied to three Holy Stone consumer drone models: the HS175D, HS720, and HS360S. Rather than pursuing reverse-engineering outcomes, the work focuses on obtaining reliable, ground-truth firmware images across heterogeneous hardware designs using only commercially available, low-cost tooling. Four acquisition methods are evaluated -- SPI flash in-circuit reading, SWD/JTAG debug-port access, UART boot-message capture, and a clip-based contact approach that avoids chip desoldering -- and each is assessed for success rate, image completeness, and operational practicality. Post-acquisition quality is evaluated through sliding-window Shannon entropy profiling and structural-signature analysis using binwalk, together forming a three-tier validation framework that distinguishes validated images from those that appear successful at the tool level but contain no meaningful firmware content. Static analysis via the EMBA framework confirms that validated images contain identifiable OS components, aging library stacks with known CVE exposure, and no binary-hardening mechanisms. The resulting corpus and methodology provide a reproducible baseline for firmware rehosting, vulnerability analysis, secure-boot assessment, and embedded-systems education within the consumer UAV domain. Index Terms: consumer UAV, drone firmware, embedded systems security, entropy analysis, firmware extraction, IoT security, SPI flash, SWD/JTAG, UART.
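The entropy-profiling step is easy to reproduce; the sketch below computes sliding-window Shannon entropy over a firmware image. The window size, stride, and the interpretation thresholds in the trailing comment are illustrative, not the paper's settings.

```python
# Sliding-window Shannon entropy profile of a binary image (bits per byte).
import math
from collections import Counter

def window_entropy(block: bytes) -> float:
    counts = Counter(block)
    n = len(block)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def entropy_profile(path: str, window: int = 4096, step: int = 4096):
    with open(path, "rb") as f:
        data = f.read()
    profile = []
    for off in range(0, len(data) - window + 1, step):
        profile.append((off, window_entropy(data[off:off + window])))
    # ~8 bits/byte suggests encrypted or compressed regions; near 0 suggests padding.
    return profile
```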
0
0
cs.CR 2026-05-12 Recognition

Automated labels turn obfuscated code into LLM-friendly chunks

Towards LLM-Based Analysis of Virtualization-Obfuscated Code through Automated Data Generation

A static analysis framework breaks down virtualization-obfuscated binaries into structurally labeled units, enabling large-scale dataset generation for LLM analysis.

Figure from the paper full image
abstract click to expand
Virtualization-based obfuscation produces extremely large and structurally complex binaries, posing challenges for LLM-based analysis due to input size limits and the need for large-scale labeled data. We address this by focusing on structural rather than full semantic analysis. Obfuscated binaries are decomposed into the largest semantically coherent units that fit within LLM constraints and are labeled according to their structural roles. We implement a static analysis framework to automate labeling and enable large-scale dataset generation. Our prototype shows strong performance on real-world virtualization obfuscators.
0
0
cs.CR 2026-05-12 Recognition

Argument provenance secures LLM tool calls without losing utility

The Granularity Mismatch in Agent Security: Argument-Level Provenance Solves Enforcement and Isolates the LLM Reasoning Bottleneck

PACT tracks each argument's origin and enforces role-specific trust contracts, reaching 100% security on strong models while recovering 38.1-46.4% utility.

Figure from the paper full image
abstract click to expand
Tool-using LLM agents must act on untrusted webpages, emails, files, and API outputs while issuing privileged tool calls. Existing defenses often mediate trust at the granularity of an entire tool invocation, forcing a brittle choice in mixed-trust workflows: allow external content to influence a call and risk hijacked destinations or commands, or quarantine the call and block benign retrieval-then-act behavior. The key observation behind this paper is that indirect prompt injection becomes dangerous not when untrusted content appears in context, but when it determines an authority-bearing argument. We present \textsc{PACT} (\emph{Provenance-Aware Capability Contracts}), a runtime monitor that assigns semantic roles to tool arguments, tracks value provenance across replanning steps, and checks whether each argument's origin satisfies its role-specific trust contract. Under oracle provenance, \textsc{PACT} achieves 100\% utility and 100\% security on mixed-trust diagnostic suites, while flat invocation-level monitors incur false positives or false negatives. In full AgentDojo deployments across five models, \textsc{PACT} reaches 100\% security on the three strongest models while recovering 38.1--46.4\% utility, 8--16 percentage points above CaMeL at the same security level. Ablations show that both semantic roles and cross-step provenance are necessary. \textsc{PACT} reframes agent security as authority binding, and isolates the remaining deployment bottleneck to provenance inference and contract synthesis.
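To illustrate what an argument-level contract check might look like, here is a hedged sketch: each argument carries a provenance label, and role-specific contracts list which origins may determine it. The role taxonomy, trust labels, and contract table are illustrative assumptions, not PACT's actual policy language.

```python
# Sketch of an argument-level provenance check in the spirit of the abstract above.
from dataclasses import dataclass

@dataclass
class Arg:
    role: str      # e.g. "destination", "command", "payload", "query"
    value: str
    origin: str    # e.g. "user", "system", "web", "email", "tool_output"

# Role-specific trust contracts: which origins may determine an argument of this role.
CONTRACTS = {
    "destination": {"user", "system"},   # authority-bearing: untrusted text may not set it
    "command":     {"user", "system"},
    "payload":     {"user", "system", "web", "email", "tool_output"},  # content flows freely
    "query":       {"user", "system", "tool_output"},
}

def check_call(tool: str, args: list[Arg]):
    violations = [a for a in args
                  if a.origin not in CONTRACTS.get(a.role, {"user", "system"})]
    if violations:
        return "block", [f"{tool}.{a.role} determined by '{a.origin}' content" for a in violations]
    return "allow", []

# Example: a send_email destination copied out of a retrieved webpage is blocked,
# while the same webpage text used as the message body is allowed.
```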
0
0
cs.CR 2026-05-12 3 theorems

Fractional memory in DP-SGD keeps sensitivity at most βC

Deep Learning under Fractional-Order Differential Privacy

The recursive query blends current and past private sums; the sensitivity analysis supports standard Rényi accounting, with accuracy gains on SVHN, CIFAR-10, and CIFAR-100.

Figure from the paper full image
abstract click to expand
Differentially private stochastic gradient descent (DP-SGD) is a standard approach to privacy-preserving learning based on per-example clipping, subsampling, Gaussian perturbation, and privacy accounting. Classical DP-SGD releases a noisy version of the current clipped subsampled gradient sum. We propose Fractional-Order Differentially Private Stochastic Gradient Descent (\textbf{FO-DP-SGD}), a mechanism-level extension that replaces this current-only query, before Gaussian noise is added, with a fractional recursive query combining the current clipped sum with a finite-window, power-law-weighted aggregation of previously released private sum-level outputs. This injects fractional memory into the release mechanism while preserving the standard \emph{sum-then-noise-then-divide} structure. Under add/remove adjacency with Poisson subsampling, the current-step sensitivity analysis shows that the only newly data-dependent term is the scaled current clipped sum. Hence, conditioned on the private history, the effective \(\ell_2\)-sensitivity is at most \(\beta C\), where \(C\) is the clipping threshold and \(\beta\in(0,1]\) controls the current-step contribution. Thus, FO-DP-SGD admits standard per-step R\'enyi differential privacy accounting via a Poisson-subsampled Gaussian mechanism with effective noise-to-sensitivity ratio \(\sigma/\beta\), and composes to yield overall \((\varepsilon,\delta)\)-differential privacy guarantees. FO-DP-SGD provides a framework for studying long-memory effects in private optimization. The fractional order, memory window, and mixing coefficient govern the trade-off among current-step sensitivity, signal retention, and private-history influence. Experiments on SVHN, CIFAR-10, and CIFAR-100 show improved test accuracy and privacy--utility performance over DP-SGD and private baselines including DP-Adam, DP-IS, SA-DP-SGD, ADP-AdamW, DP-SAT, and DP-Adam-AC.
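A minimal sketch of one release step follows, assuming the description above: the β-scaled current clipped sum is blended with a finite window of power-law-weighted previously released (already private) outputs, noise is calibrated to the βC sensitivity, and the result is divided by the batch size. The weight normalisation, the (1-β) mixing coefficient, and the fractional-order parameter are illustrative choices, not the paper's exact parameterization.

```python
# One FO-DP-SGD-style release step (sum -> noise -> divide). Only the beta-scaled current
# clipped sum is newly data-dependent, so per-step l2 sensitivity is at most beta * C.
import numpy as np

def clip(g, C):
    return g * min(1.0, C / (np.linalg.norm(g) + 1e-12))

def fo_dp_sgd_release(per_example_grads, history, C, sigma, beta=0.5, alpha=0.7, window=8):
    current_sum = np.sum([clip(g, C) for g in per_example_grads], axis=0)
    # Power-law weights over the last `window` previously released private outputs.
    past = history[-window:]
    weights = [(j + 1) ** (-alpha) for j in range(len(past))]
    memory = sum(w * h for w, h in zip(weights, reversed(past))) if past else 0.0
    query = beta * current_sum + (1.0 - beta) * memory        # fractional recursive query
    noisy = query + np.random.normal(0.0, sigma * C, size=current_sum.shape)
    history.append(noisy)                                     # released output joins the memory
    return noisy / max(len(per_example_grads), 1)
```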
0
0
cs.CR 2026-05-12 2 theorems

SeqWM watermarks LLM agent trajectories by embedding signals in history-conditioned transition patterns.

Sequential Behavioral Watermarking for LLM Agents

Signals embedded in history-conditioned transition patterns survive corruption and preserve utility, unlike earlier round-indexed methods.

Figure from the paper full image
abstract click to expand
LLM-based agents act through sequences of executable decisions, but their trajectories provide little evidence of which agent or policy produced them, making provenance, ownership, and unauthorized reuse difficult to establish from observed behavior alone. This motivates watermarking signals embedded directly into agent behavior rather than only into generated text, since text watermarking cannot capture the action-level decisions that define agent execution. Recent agent watermarking methods address this gap by moving the watermark from generated text to behavioral choices. However, by treating each action step as an independent trial, they overlook trajectory structure and become fragile when trajectories are perturbed, truncated, or observed without reliable alignment. We propose SeqWM, a sequential behavioral watermarking framework that embeds signals into history-conditioned transition patterns and verifies trajectories position-agnostically against random-key baselines. Experiments across diverse agent benchmarks and LLM backbones show that SeqWM consistently achieves reliable detection while preserving agent utility, and remains robust under trajectory corruption where round-indexed behavioral watermarks collapse.
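To convey the general idea of a keyed, history-conditioned behavioural watermark, the sketch below biases action choices toward a keyed subset and scores an observed trajectory against the chance rate of a random key. The hash construction, bias rule, and score are illustrative stand-ins, not SeqWM's embedding or verification statistic.

```python
# Sketch of a keyed, history-conditioned behavioural watermark and its detector.
import hashlib, random

def _keyed_bit(key: str, history: tuple, action: str) -> int:
    h = hashlib.sha256(f"{key}|{'>'.join(history)}|{action}".encode()).digest()
    return h[0] & 1                       # pseudo-random bit per (key, history, action)

def pick_action(key, history, candidates, rng=random):
    # Prefer candidates whose keyed bit is 1; fall back to all candidates otherwise.
    green = [a for a in candidates if _keyed_bit(key, tuple(history), a)]
    return rng.choice(green or list(candidates))

def watermark_score(key, trajectory):
    # Fraction of steps whose observed action lands in the keyed subset given its history.
    hits = sum(_keyed_bit(key, tuple(trajectory[:i]), a) for i, a in enumerate(trajectory))
    return hits / max(len(trajectory), 1)  # ~0.5 under a random key, higher for the true key
```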
0
0
cs.CR 2026-05-12 Recognition

Mamba-2 classifies network bursts directly from raw bytes

MambaNetBurst: Direct Byte-level Network Traffic Classification without Tokenization or Pretraining

A compact model skips tokenization and pretraining yet matches heavier baselines on app identification, VPN, malware, and IoT attack tasks.

Figure from the paper full image
abstract click to expand
We present MambaNetBurst, a compact tokenizer-free byte-level sequence classifier for network burst classification based on a Mamba-2 backbone. In contrast to most recent strong traffic-classification and intrusion-detection approaches, our method operates directly on raw packet bytes, avoids tokenization, patching, and heavy engineered multimodal representations, and does not require any self-supervised pre-training stage. Given a packet flow, we form a fixed-length burst from the first few packets, embed the resulting byte sequence, appending a learnable CLS token, and process it with a stack of residual pre-normalized Mamba-2 blocks for end-to-end supervised classification. Across six public benchmarks spanning encrypted mobile app identification, VPN/Tor traffic classification, malware traffic classification, and IoT attack traffic, MambaNetBurst achieves consistently strong results and is competitive with, or outperforms, substantially heavier and often pre-trained baselines. Our ablation study shows that preserving byte-level temporal resolution is critical, that early downsampling through striding is consistently harmful, and that moderate state sizes are sufficient for robust generalization. We further show that Mamba-2, despite its more constrained transition structure relative to Mamba-1, remains highly effective for packet-byte modeling while providing clear efficiency advantages, particularly in training speed. Overall, our results demonstrate that direct, undiluted byte-to-classification learning with compact selective state space models is a practical, effective and novel direction for efficient, deployable traffic analysis that bypasses the complexity of pre-training pipelines, even when compared against highly optimized linear attention architectures.
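The input pipeline the abstract describes can be sketched as follows; the Mamba-2 block stack itself is left as a placeholder, and the embedding dimension, burst length, and class names are illustrative rather than the paper's configuration.

```python
# Sketch of burst construction and byte embedding with a learnable CLS token.
import torch
import torch.nn as nn

class ByteBurstEmbedder(nn.Module):
    def __init__(self, d_model=256, burst_len=1024):
        super().__init__()
        self.burst_len = burst_len
        self.byte_emb = nn.Embedding(257, d_model, padding_idx=256)  # 256 byte values + pad id
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))

    def make_burst(self, packets):
        """Concatenate the first packets' raw bytes into one fixed-length burst."""
        raw = b"".join(packets)[: self.burst_len]
        ids = list(raw) + [256] * (self.burst_len - len(raw))        # pad to fixed length
        return torch.tensor(ids, dtype=torch.long)

    def forward(self, byte_ids):                        # (batch, burst_len)
        x = self.byte_emb(byte_ids)                     # (batch, burst_len, d_model)
        cls = self.cls.expand(x.size(0), -1, -1)
        return torch.cat([cls, x], dim=1)               # Mamba-2 blocks + classifier head go here
```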
0
0
cs.CR 2026-05-12 Recognition

Black-box method flags LLM agent drift at 0.83 AUC

Nautilus Compass: Black-box Persona Drift Detection for Production LLM Agents

Nautilus Compass compares prompt embeddings to behavioral anchors on closed APIs without model weights or fact extraction.

Figure from the paper full image
abstract click to expand
Production LLM coding agents drift over long sessions: they forget user-specified constraints, slip into mistakes the user already flagged, and confabulate prior agreements. White-box approaches such as persona vectors require model weights and so cannot be applied to closed APIs (Claude, GPT-4) that most users actually interact with. We present Nautilus Compass, a black-box persona drift detector and agent memory layer for production coding agents. The method operates entirely at the prompt-text layer: cosine similarity between user prompts and behavioral anchor texts, aggregated by a weighted top-k mean using BGE-m3 embeddings. Compass is, to our knowledge, the only public agent memory layer (among Mem0, Letta, Cognee, Zep, MemOS, smrti verified May 2026) that does not call an LLM at index time to extract facts or build a graph; raw conversation text is embedded directly. The system ships as a Claude Code plugin, an MCP 2024-11-05 A2A server (Cursor, Cline, Hermes), a CLI, and a REST API on one daemon, with a Merkle-chained audit log for tamper-evident anchor updates. On a held-out test set built from real Claude Code session traces and labeled by an independent LLM judge, Compass reaches ROC AUC 0.83 for drift detection. The embedded retrieval pipeline scores 56.6% on LongMemEval-S v0.8 and 44.4% on EverMemBench-Dynamic (n=500), topping the four published EverMemBench Table 4 baselines. LongMemEval-S 56.6% is ~30 points below recent white-box leaders (90+%); we treat that as the architectural ceiling of the no-extraction design. End-to-end reproduction cost is $3.50 (~14x cheaper than GPT-4o-judged stacks). A paired cross-vendor behavior A/B accompanies these numbers as preliminary system-level evidence. Code, anchors, frozen test data, and audit-log tooling are MIT-licensed at github.com/chunxiaoxx/nautilus-compass.
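The scoring rule is simple enough to sketch from the abstract: cosine similarity of a prompt embedding against the behavioral anchor embeddings, aggregated with a weighted top-k mean. The specific weights and the drift threshold below are illustrative assumptions, not the shipped defaults.

```python
# Sketch of weighted top-k cosine scoring against behavioral anchors.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def drift_score(prompt_emb, anchor_embs, k=3, weights=(0.5, 0.3, 0.2)):
    sims = sorted((cosine(prompt_emb, a) for a in anchor_embs), reverse=True)[:k]
    w = np.array(weights[: len(sims)])
    alignment = float(np.dot(w / w.sum(), sims))
    return 1.0 - alignment   # higher score = further from the behavioral anchors

# usage: embed prompts and anchors with any text embedder (the paper uses BGE-m3)
# and flag drift when drift_score(...) exceeds a calibrated threshold
```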
0
0
cs.CR 2026-05-11 Recognition

AI agents accept poisoned graph data at 100 percent trust

Oracle Poisoning: Corrupting Knowledge Graphs to Weaponise AI Agent Reasoning

In 269 of 270 directed queries, models reasoned over fabricated security claims from a corrupted 42-million-node graph.

Figure from the paper full image
abstract click to expand
We define Oracle Poisoning, an attack class in which an adversary corrupts a structured knowledge graph that AI agents query at runtime via tool-use protocols, causing incorrect conclusions through correct reasoning. Unlike prompt injection, Oracle Poisoning manipulates the data agents reason over, not their instructions. We demonstrate six attack scenarios against a production 42-million-node code knowledge graph, providing the first empirical demonstration of knowledge graph poisoning against a production-scale agentic system, distinct from CTI embedding poisoning. Primary evaluation uses real SDK tool-use across nine models from three providers (N=30 per model), where models autonomously invoke a graph query tool and reason from results. The result is unambiguous: every tested model trusts poisoned data at 100% at moderate attacker sophistication (L2), with 269 valid trials (of 270) accepting fabricated security claims under directed queries. Under open-ended prompts, trust drops to 3-55%, confirming prompt framing as a confound; we report both conditions. An attacker sophistication gradient reveals discrete break points, a minimum skill at which trust flips from 0% to 100%, reframing the attack as a question not of whether but of how much. A controlled delivery-mode comparison shows that inline evaluation produces false negatives: GPT-5.1 shows 0% trust inline but 100% under both simulated and real agentic tool-use, demonstrating that delivery mode is a first-order confound. We evaluate five defences; read-only access control eliminates the direct mutation vector, while the remaining four are partial and model-dependent. Analysis of four additional platforms suggests the attack may generalise across the knowledge-graph ecosystem.
0
0
cs.CR 2026-05-11 Recognition

Protocol moves verified memory between AI agents on different models

Portable Agent Memory: A Protocol for Cryptographically-Verified Memory Transfer Across Heterogeneous AI Agents

Merkle-DAG links provide tamper evidence and rehydration adapts content to GPT-4, Claude, Gemini or Llama while blocking indirect injection.

abstract click to expand
We present Portable Agent Memory, an open protocol and reference implementation for transferring persistent memory state across heterogeneous AI agents. Modern AI agents accumulate rich context -- episodic events, semantic knowledge, procedural skills, working state, and identity preferences -- but this context remains locked within vendor-specific runtimes. Portable Agent Memory addresses this through: (1) a five-component structured memory model with content-addressable entries linked by a Merkle-DAG provenance graph providing tamper-evidence; (2) capability-based access control enabling selective, scoped disclosure of memory segments; (3) an injection-resistant rehydration protocol that adapts recalled content to heterogeneous target models while mitigating indirect prompt injection; and (4) a JSON-first serialization format with optional CBOR compaction for efficient transport. We provide a Python SDK with 54 passing tests, agent skills for multiple platforms, and demonstrate cross-model memory transfer between GPT-4, Claude, Gemini, and Llama architectures. The protocol is open-source under Apache 2.0.
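A minimal sketch of the content-addressing and Merkle-DAG provenance idea, assuming a plain SHA-256 over canonical JSON; the field names and hashing details are illustrative, not the protocol's actual schema.

```python
# Content-addressed memory entries linked into a tamper-evident DAG (illustrative).
import hashlib, json

def entry_id(content, parent_ids):
    """Content address = hash over canonical JSON of the entry plus its parent ids."""
    blob = json.dumps({"content": content, "parents": sorted(parent_ids)},
                      sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

def add_entry(store, content, parent_ids=()):
    eid = entry_id(content, list(parent_ids))
    store[eid] = {"content": content, "parents": sorted(parent_ids)}
    return eid

def verify_dag(store):
    """Tamper evidence: every stored id must match a recomputed hash of its body."""
    return all(entry_id(e["content"], e["parents"]) == eid for eid, e in store.items())
```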
0
0
cs.CR 2026-05-11 2 theorems

Maturity scores drive reinforcement-learned defense plans under budget limits

Operationalizing Cybersecurity Governance for Mitigation Planning with Attack-Path Modeling and Reinforcement Learning

Linking governance data to attack sequences produces stable cost-risk trade-offs for choosing which mitigations to fund.

Figure from the paper full image
abstract click to expand
We address a fundamental challenge in cybersecurity operations of translating governance frameworks into actionable mitigation decisions under realistic resource constraints. Frameworks such as the NIST Cybersecurity Framework (CSF) provide widely adopted measures of organizational maturity, but do not directly support the selection and prioritization of defensive strategies against adversarial behavior. We present a system that operationalizes governance frameworks by mapping CSF maturity assessments into MITRE ATT\&CK mitigation capabilities, which enables direct integration of organizational security posture with adversary-informed defensive planning. To manage adversary complexity, we employ a Variable-Order Markov Model (VOMM) trained on observed ATT\&CK technique sequences to enable scalable adversary simulation within a Deep Reinforcement Learning (DRL) environment. We reconstruct likely attack paths and defensive responses using beam search, and then jointly optimize mitigation selection under explicit budget constraints. Our environment supports concurrent adversaries and realistic mitigation costs. Across multiple reward formulations and configurations, we show that the approach produces stable policies, meaningful cost-risk trade-offs, and interpretable mitigation plans aligned with organizational maturity. These results demonstrate that adversary-aware DRL can generate practical, resource-constrained defense strategies grounded in real-world frameworks and threat behavior.
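As a sketch of the adversary model, a variable-order Markov model over technique sequences can be written as a count-based predictor that backs off from the longest matching context; the exact training and smoothing the paper uses are not reproduced here, and the class interface is an assumption.

```python
# Illustrative variable-order Markov model over ATT&CK-style technique sequences.
from collections import Counter, defaultdict

class VOMM:
    def __init__(self, max_order=3):
        self.max_order = max_order
        self.counts = defaultdict(Counter)   # context tuple -> next-technique counts

    def fit(self, sequences):
        for seq in sequences:
            for i, nxt in enumerate(seq):
                for k in range(1, self.max_order + 1):
                    if i - k >= 0:
                        self.counts[tuple(seq[i - k:i])][nxt] += 1

    def next_distribution(self, history):
        """Back off from the longest observed context to shorter ones."""
        for k in range(min(self.max_order, len(history)), 0, -1):
            ctx = tuple(history[-k:])
            if ctx in self.counts:
                c = self.counts[ctx]
                total = sum(c.values())
                return {t: n / total for t, n in c.items()}
        return {}
```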
0
0
cs.CR 2026-05-11 Recognition

Graph models catch LLM attacks split across sessions

FragBench: Cross-Session Attacks Hidden in Benign-Looking Fragments

Single-turn judges miss fragments by construction, yet interaction graphs recover the malicious pattern at F1 0.88-0.96.

Figure from the paper full image
abstract click to expand
An attacker can split a malicious goal into sub-prompts that each look benign on their own and only become harmful in combination. Existing LLM safety benchmarks evaluate prompts one at a time, or across turns of a single chat, and so do not look for a malicious signal spread across separate sessions with no shared context. We build FragBench, a benchmark drawn from 24 real-world cyber-incident campaigns, which keeps the full attack trail: the multi-fragment kill chain, the per-fragment safety-judge verdicts, sandboxed execution traces, and a matched set of benign cover sessions. FragBench splits this trail into two paired tasks: an adversarial rewriter that hardens fragments against a single-turn safety judge (FragBench Attack), and a graph-based user-level detector trained on the resulting interactions (FragBench Defense). The single-turn judge is near chance on the released corpus by construction, but four GNN variants and three classical-ML baselines all recover the cross-session feature, reaching aggregate event-level F1 = 0.88-0.96. Defending against fragmented LLM misuse therefore requires modeling the cross-session interaction graph, rather than isolated prompts. Our generator, rewriter, sandbox harness, and detector are released at https://github.com/LidaSafety/fragbench.
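A toy sketch of the defense-side view: fold a user's separate sessions into one interaction structure and derive user-level features for a classical classifier. The features and graph construction below are purely illustrative; the paper's GNN variants are not reproduced.

```python
# Illustrative cross-session user graph and aggregate features for a classical detector.
def build_user_graph(sessions):
    """sessions: list of sessions for one user; each session is a list of
    (fragment_text, judge_score) pairs."""
    nodes, edges = [], []
    for s_idx, sess in enumerate(sessions):
        start = len(nodes)
        for i, (frag, score) in enumerate(sess):
            nodes.append({"session": s_idx, "text": frag, "score": score})
            if i > 0:
                edges.append((start + i - 1, start + i))   # within-session order
        if s_idx > 0 and sess and start > 0:
            edges.append((0, start))                       # cross-session link for the same user
    return nodes, edges

def user_features(sessions):
    nodes, edges = build_user_graph(sessions)
    scores = [n["score"] for n in nodes] or [0.0]
    return {
        "n_sessions": len(sessions),
        "n_fragments": len(nodes),
        "mean_judge_score": sum(scores) / len(scores),
        "max_judge_score": max(scores),
        "edge_count": len(edges),
    }
```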
0
0
cs.CR 2026-05-11 2 theorems

Deception traps catch 90-100% of successful LLM agent attacks

AgentShield: Deception-based Compromise Detection for Tool-using LLM Agents

AgentShield plants fake tools and credentials that trigger only on compromise, yielding zero false positives and labels for a transferable self-supervised detector.

Figure from the paper full image
abstract click to expand
Defenses against indirect prompt injection (IPI) in tool-using LLM agents share two structural weaknesses. First, they all attempt to prevent attacks rather than detect the compromises that slip through. Second, they have only been evaluated in English, leaving users of low-resource languages such as Kurdish and Arabic without tested protection. This paper addresses both gaps with AgentShield, a deception-based detection framework that places three layers of traps inside the agent's tool interface: fake tools, fake credentials, and allowlisted parameters. The same trap triggers serve as high-precision labels for a self-supervised classifier. An LLM agent that follows an attacker's hidden instruction almost always touches one of these traps, which gives both a real-time compromise signal and a zero-FP label for training a downstream detector without manual annotation. The evaluation spans 176 cross-lingual attack prompts and four LLMs from three providers; because modern LLMs already refuse most IPI attempts on their own (attack success rate <= 10%), AgentShield's job is to catch the attacks that do slip through. On commercial models, it catches 90.7%-100% of such successful attacks, with zero false alarms on 485 normal-use tests. It survives a systematic adaptive-attack evaluation with zero evasion on commercial models, and the self-supervised classifier transfers across models and languages without retraining.
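A minimal sketch of the trap mechanism under stated assumptions: decoy tools and parameter allowlists that no benign workflow should touch, so any trigger doubles as a compromise alert and a clean training label. The names and the event interface are hypothetical, not AgentShield's API.

```python
# Illustrative decoy tools and allowlisted-parameter checks for an agent's tool interface.
compromise_events = []

def make_decoy_tool(name):
    def tool(**kwargs):
        compromise_events.append({"trap": name, "args": kwargs})
        return "ok"                      # innocuous reply so the attacker is not tipped off
    tool.__name__ = name
    return tool

def guard_params(tool_name, params, allowlist):
    """Third trap layer: flag calls whose parameters leave the allowlisted set."""
    bad = {k: v for k, v in params.items() if k not in allowlist.get(tool_name, set())}
    if bad:
        compromise_events.append({"trap": "param", "tool": tool_name, "args": bad})
    return not bad

# register decoys alongside real tools, e.g. make_decoy_tool("export_all_user_data"),
# and plant fake credentials whose use is monitored the same way; logged events
# also serve as zero-false-positive labels for a downstream classifier
```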
0
0
cs.CR 2026-05-11 Recognition

Cloud AI agents risk over-privileged tools and authority leaks

Security Risks in Tool-Enabled AI Agents: A Systematic Analysis of Privileged Execution Environments

Analysis shows most dangers arise from capability-intent gaps and ambient permissions rather than new exploits, plus guidelines for safer deployment.

Figure from the paper full image
abstract click to expand
Tool-enabled AI agents are increasingly deployed in cloud-hosted environments and offered as services, where they perform side-effecting operations through privileged tools within execution environments. While such agents enable powerful automation, the security implications of hosting autonomous agents in privileged execution environments are not yet fully explored. This paper presents a structured analysis of security risks associated with cloud-hosted AI agents. We introduce a taxonomy of risk categories, illustrate these risks through three representative agent scenarios, and discuss mitigation strategies along with their tradeoffs. A small controlled experiment empirically illustrates risk manifestation and the effect of lightweight mitigations in this setup. Our analysis suggests that many risks in autonomous cloud agents arise not from novel vulnerabilities, but from over-privileged tools, capability-intent mismatches, and ambient authority leakage in execution environments. Based on these findings, we derive practical design guidelines for deploying AI agents in the cloud more securely.
0
0
cs.CR 2026-05-11 Recognition

Red-teaming pipeline drops monitor catch rates from 95% to 60%

MonitoringBench: Semi-Automated Red-Teaming for Agent Monitoring

Decomposing attacks into strategy, execution and refinement creates harder tests that expose weaknesses in current coding-agent monitors.

Figure from the paper full image
abstract click to expand
We introduce a red-teaming methodology that exposes harder-to-catch attacks for coding-agent monitors, suggesting that current practices may under-elicit attacks and overstate monitor performance. We identify three challenges with current red-teaming. First, mode collapse in attack generation, which we reduce with a novel attack taxonomy for broader coverage. Second, a conceive-execute gap: frontier LLMs can propose strong attack ideas or execute them, but not all at once. We mitigate this by decomposing attack construction into strategy generation, execution, and post-hoc trajectory refinement. Third, manual elicitation is costly to scale, which we address with our semi-automated red-teaming pipeline. Applied to BashArena, an AI control setting for tool-using coding agents, this pipeline produces MonitoringBench, a benchmark of 2,644 attack trajectories for evaluating monitor capabilities and failure modes. Our pipeline produces more diverse and stronger attacks: Opus-4.5 monitor's catch rate falls from 94.9\% on elicited-only Opus attacks to 60.3\% on our best refined attacks, with larger drops for several mid-tier monitors. Attacks optimized against three development monitors generalize to ten held-out monitors, with catch rates generally increasing with monitor capability. Using this benchmark, we provide a snapshot of the current monitor capabilities and find that frontier monitors often detect suspicious actions but fall for persuasion or fail to calibrate suspiciousness scores appropriately, suggesting tractable paths for improvement. MonitoringBench provides both a static benchmark for current tool-use monitors and a reusable methodology for refreshing these evaluations as agents and monitors improve.
0
0
cs.CR 2026-05-11 2 theorems

Interpolation cuts forgetting in malware detectors without replay

FreeMOCA: Memory-Free Continual Learning for Malicious Code Analysis

Layer-wise blending of task optima preserves accuracy on prior threats while learning new ones, outperforming 11 baselines on EMBER and AZ.

Figure from the paper full image
abstract click to expand
As over 200 million new malware samples are identified each year, antivirus systems must continuously adapt to the evolving threat landscape. However, retraining solely on new samples leads to catastrophic forgetting and exploitable blind spots, while retraining on the entire dataset incurs substantial computational cost. We propose FreeMOCA, a memory- and compute-efficient continual learning framework for malicious code analysis that preserves prior knowledge via adaptive layer-wise interpolation between consecutive task updates, leveraging the fact that warm-started task optima are connected by low-loss paths in parameter space. We evaluate FreeMOCA in both class-incremental (Class-IL) and domain-incremental (Domain-IL) settings on large-scale Windows (EMBER) and Android (AZ) malware benchmarks. FreeMOCA achieves substantial gains in Class-IL, outperforming 11 baselines on both EMBER and AZ benchmarks. It also significantly reduces forgetting, achieving the best retention across baselines, and improving accuracy by up to 42% and 37% on EMBER and AZ, respectively. These results demonstrate that warm-started interpolation in parameter space provides a scalable and effective alternative to replay for continual malware detection. Code is available at: https://github.com/IQSeC-Lab/FreeMOCA.
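The core update can be sketched as plain layer-wise interpolation between the previous task optimum and the newly warm-started optimum; the adaptive rule for choosing per-layer coefficients is the paper's contribution and is only stubbed here, and the function names are illustrative.

```python
# Illustrative layer-wise interpolation between consecutive task optima.
def interpolate_layerwise(prev_params, new_params, alphas):
    """
    prev_params / new_params: dicts mapping layer name -> weight array
    alphas: dict mapping layer name -> mixing coefficient in [0, 1]
            (the adaptive choice of these coefficients is not reproduced here)
    """
    merged = {}
    for name, w_new in new_params.items():
        a = alphas.get(name, 0.5)
        merged[name] = a * w_new + (1.0 - a) * prev_params[name]
    return merged

# after training task t (warm-started from task t-1), deploy
# interpolate_layerwise(theta_prev, theta_new, alphas) instead of theta_new alone,
# relying on the low-loss path connecting warm-started optima
```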
0
0
cs.CR 2026-05-11 Recognition

"Training robust watermarking model may hurt authentication!'' Exploring and Mitigating the Identity Leakage in Robust Watermarking

W-IR adds residual loss to cut leakage while certifying robustness to pixel and coordinate edits.

Figure from the paper full image
abstract click to expand
The rapid advancement of generative AI has underscored the critical need for identifying image ownership and protecting copyrights. This makes post-processing image watermarking an essential tool -- it involves embedding a specific watermark message into an image, with successful verification if a similar message can be decoded from the watermarked image. However, this method is susceptible to both adversarial attacks that manipulate the watermarked image to yield an unverified message upon decoding, and the proposed identity leakage-related attacks (e.g., forging watermarked images). The threat of identity leakage is particularly exacerbated in both empirical and certified robust watermarking methods. To defend against the aforementioned attacks, we propose W-IR, the first image watermarking framework that simultaneously incorporates identity protection and robustness. To enhance model robustness, we introduce a novel randomized smoothing technique as part of a robust watermarking scheme that offers certified robustness against perturbations across two distinct transformation spaces: pixel-level and coordinate-level. Moreover, to further mitigate identity leakage, we propose a new strategy based on residual information loss, aimed at minimizing the mutual information between the residual and watermarked images. Our work strikes a superior balance between robustness and identity leakage mitigation. Extensive experiments demonstrate that our W-IR framework achieves high certified accuracy for authenticity while effectively reducing identity leakage. The code is available at https://github.com/holdrain/W-I-R.
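A hedged sketch of a randomized-smoothing decode over the two transformation spaces the abstract names (pixel perturbations and coordinate shifts); the certification computation itself is omitted, and the noise parameters and decoder interface are illustrative assumptions.

```python
# Illustrative smoothed watermark decode: aggregate over randomly perturbed copies.
import numpy as np

def smoothed_decode(decode, image, n=100, pixel_sigma=0.05, max_shift=2, rng=None):
    """decode: function mapping an image array to a {0,1}^k message vector."""
    rng = rng or np.random.default_rng(0)
    votes = []
    for _ in range(n):
        noisy = image + rng.normal(0.0, pixel_sigma, image.shape)      # pixel-level space
        dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)       # coordinate-level space
        shifted = np.roll(noisy, (int(dy), int(dx)), axis=(0, 1))
        votes.append(decode(shifted))
    return (np.mean(votes, axis=0) > 0.5).astype(int)                  # majority-voted message
```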
0
0
cs.CR 2026-05-11 2 theorems

Image-to-3D models produce harmful shapes that bypass 99.7% of moderation

On the Generation and Mitigation of Harmful Geometry in Image-to-3D Models

Tests across unsafe categories show models succeed at reconstruction while commercial flags trigger on under 0.3 percent of outputs; a stacked defense cuts retention below 1 percent at an 11 percent false-positive cost.

Figure from the paper full image
abstract click to expand
Recent advances in image-to-3D models have significantly improved the fidelity and accessibility of 3D content creation. The same reconstruction capability that enables creative design can also be misused by adversaries to generate harmful geometries, which can be further fabricated via 3D printers and pose real-world risks. However, such risks are largely underexplored: it remains unclear how well current image-to-3D models can produce these harmful geometries, and whether existing safeguards can reliably prevent such generation. To fill this gap, we conduct a systematic measurement study of harmful geometry generation and mitigation. We first describe this risk through three kinds of unsafe categories: direct-use physical hazards, risky templates or components, and deceptive replicas. Each category is instantiated with representative objects. We evaluate both open-source and commercial image-to-3D models under original, degraded, viewpoint-shifted, and semantically camouflaged inputs. We consider different evaluation metrics, including geometric validity, multi-view VLM-based semantic scoring, targeted human validation, and controlled physical fabrication. The results reveal a concerning reality: current image-to-3D models can effectively reconstruct harmful geometries, while fewer than 0.3% of such geometries trigger commercial moderation flags. As a first step toward mitigation, we evaluate three representative safeguard families, including input moderation, model-level benign alignment, and output-level filtering. We find that existing safeguards have distinct weaknesses. We further develop a stacked defense that can reduce harmful retention to <1%, but still at an 11% overall false-positive cost. Taken together, our findings demonstrate the risk in current systems and underscore the need for better geometry-aware safeguards for moderation.
0
0
cs.CR 2026-05-11 2 theorems

Malicious skills steer agents to attacker packages

Trust Me, Import This: Dependency Steering Attacks via Malicious Agent Skills

Small semantic edits in persistent instructions cause LLMs to suggest specific malicious imports during normal coding work.

Figure from the paper full image
abstract click to expand
LLM-powered coding agents increasingly make software supply chain decisions. They generate imports, recommend packages, and write installation commands. Prior work showed that these systems can hallucinate non-existent package names, which attackers may register as malicious packages. In this paper, we show that this risk is not only a passive model failure. It can be actively induced through the persistent Skill artifact. We introduce Dependency Steering, an attack paradigm in which a malicious Skill biases a coding agent toward an attacker-controlled package during benign coding tasks. The attack does not require modifying model weights, training data, or user prompts. To construct realistic attacks, we design a Skill-level optimization method that searches for localized semantic edits that preserve the apparent purpose of the original Skill while increasing targeted package generation. Across multiple coding-oriented LLMs and programming benchmarks, Dependency Steering achieves high targeted hallucination rates, transfers across models and task domains, and remains difficult for evaluated Skill scanners and LLM-based auditors to detect. Our results show that persistent agent instructions form an underexplored software supply chain attack surface.
0
0
cs.CR 2026-05-11 2 theorems

Dual-channel fusion improves vulnerability detection and localization

DCVD: Dual-Channel Cross-Modal Fusion for Joint Vulnerability Detection and Localization

Control-dependency and semantic channels are aligned by contrastive learning and cross-attention, with direct supervision at both function and statement levels.

Figure from the paper full image
abstract click to expand
Software vulnerability detection plays a critical role in ensuring system security, where real-world auditing requires not only determining whether a function is vulnerable but also pinpointing the specific lines responsible. However, existing approaches either rely on a single information source -- sequential, structural, or semantic -- failing to jointly exploit the complementary strengths across modalities, or treat statement-level localization merely as a byproduct of function-level detection without explicit line-level supervision. To address these limitations, we propose DCVD (Dual-Channel Cross-Modal Vulnerability Detection), a unified framework that performs joint function-level detection and statement-level localization. DCVD extracts control-dependency and semantic features through two parallel branches and integrates them via contrastive alignment coupled with bidirectional cross-attention, effectively bridging the cross-modal representation gap. It further introduces explicit supervision signals at both the function and statement levels, enabling collaborative optimization across the two granularities. Extensive experiments on a large-scale real-world vulnerability benchmark demonstrate that DCVD consistently outperforms state-of-the-art methods on both function-level detection and statement-level localization. Our code is available at https://github.com/vinsontang1/DCVD.
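A rough sketch of the fusion and two-level supervision using standard PyTorch components; the channel encoders, the contrastive alignment term, and all hyperparameters are abstracted away, and the module layout is an assumption rather than the paper's architecture.

```python
# Illustrative dual-channel fusion with joint function- and statement-level heads.
import torch
import torch.nn as nn

class DualChannelFusion(nn.Module):
    def __init__(self, d=256, heads=4):
        super().__init__()
        self.a2b = nn.MultiheadAttention(d, heads, batch_first=True)
        self.b2a = nn.MultiheadAttention(d, heads, batch_first=True)
        self.stmt_head = nn.Linear(2 * d, 1)     # per-statement vulnerability logit
        self.func_head = nn.Linear(2 * d, 1)     # function-level logit

    def forward(self, dep_feats, sem_feats):     # each (batch, n_stmts, d)
        d2s, _ = self.a2b(dep_feats, sem_feats, sem_feats)   # dependency channel attends to semantics
        s2d, _ = self.b2a(sem_feats, dep_feats, dep_feats)   # and vice versa (bidirectional)
        fused = torch.cat([d2s, s2d], dim=-1)
        stmt_logits = self.stmt_head(fused).squeeze(-1)      # statement-level localization
        func_logits = self.func_head(fused.mean(dim=1)).squeeze(-1)
        return func_logits, stmt_logits

# joint objective with explicit supervision at both granularities, e.g.
# loss = bce(func_logits, func_labels) + bce(stmt_logits, stmt_labels) + contrastive_alignment_term
```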
0
0
cs.CR 2026-05-11 Recognition

Framework governs AI use in security operations centers

Governing AI-Assisted Security Operations: A Design Science Framework for Operational Decision Support

It separates AI planning from execution with templates, validation, and review gates to keep accountability and privacy intact.

Figure from the paper full image
abstract click to expand
Engineering managers increasingly must decide how to introduce generative artificial intelligence (AI), retrieval-augmented generation, and coding agents into high-risk operational functions without weakening accountability, privacy, cost discipline, or auditability. The central message of this study is that AI-assisted operational decision support should be managed as a governed engineering capability before it is scaled as automation. Security operations centers (SOCs) provide a suitable setting because they combine privileged telemetry, specialist expertise, software repositories, cloud services, and evidence-sensitive decisions. This study uses Kusto Query Language (KQL) and Microsoft Azure security capabilities as a bounded technical instantiation of that broader engineering management problem. KQL is read-only in ordinary query use, but read-only does not mean risk-free: AI-assisted queries can still create privacy, cost, performance, schema-validity, and decision-quality risks through broad scans, sensitive-field exposure, stale intelligence, and misleading interpretations. Using design science research, the study develops a governed AI query-broker artifact that separates AI planning from operational execution through schema-grounded retrieval, approved templates, policy validation, read-only adapters, normalized outputs, auditable agent traces, and engineering review board gates. The contribution is not a new KQL technique, security product, or detection algorithm. Rather, the study contributes a management framework for governing AI-assisted operational decision support in high-risk digital infrastructure by specifying design propositions, role accountability, maturity stages, quality gates, evaluation criteria, and evidence boundaries.
0

browse all of cs.CR → full archive · search · sub-categories