hub Mixed citations

”Tensor trust: Interpretable prompt injection attacks from an online game.” arXiv preprint arXiv:2311.01011 (2023)

Toyer, S · 2023 · arXiv 2311.01011

Mixed citation behavior. Most common role is background (60%).

14 Pith papers citing it

Background 60% of classified citations

read on arXiv browse 14 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 4 dataset 1

citation-polarity summary

background 3 unclear 1 use dataset 1

representative citing papers

AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

cs.CR · 2024-06-19 · unverdicted · novelty 8.0

AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.

IPI-proxy: An Intercepting Proxy for Red-Teaming Web-Browsing AI Agents Against Indirect Prompt Injection

cs.CR · 2026-05-12 · unverdicted · novelty 7.0

IPI-proxy is a toolkit using an intercepting proxy to inject indirect prompt injection attacks into live web pages for testing AI browsing agents against hidden instructions.

Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion

cs.CR · 2026-04-11 · unverdicted · novelty 7.0

HMNS is a new jailbreak method that uses causal head identification and nullspace-constrained injection to achieve higher attack success rates than prior techniques on aligned language models.

Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents

cs.CR · 2024-10-03 · unverdicted · novelty 7.0

ASB is a new benchmark that tests 10 prompt injection attacks, memory poisoning, a novel Plan-of-Thought backdoor attack, and 11 defenses on LLM agents across 13 models, finding attack success rates up to 84.3% and limited defense effectiveness.

Caught in the Act(ivation): Toward Pre-Output and Multi-Turn Detection of Credential Exfiltration by LLM Agents

cs.CR · 2026-06-02 · unverdicted · novelty 6.0

Activation probes, calibrated honeytokens, and multi-turn leakage accounting detect credential exfiltration attempts in LLM agents with high accuracy in controlled open-model tests.

Provably Secure Agent Guardrail

cs.AI · 2026-05-28 · unverdicted · novelty 6.0

Introduces ePCA framework using neural-symbolic isolation to force agents to formalize intentions as logical constraints, claiming zero attack success and false positive rates in tested scenarios.

Sleeper Channels and Provenance Gates: Persistent Prompt Injection in Always-on Autonomous AI Agents

cs.CR · 2026-05-13 · conditional · novelty 6.0

Sleeper channels enable persistent prompt injection in always-on AI agents via persistence substrate and firing separation, countered by provenance gates using action digests and owner attestations with a soundness theorem.

PAAC: Privacy-Aware Agentic Device-Cloud Collaboration

cs.LG · 2026-05-09 · unverdicted · novelty 6.0

PAAC aligns planner-executor decomposition with the device-cloud boundary via typed placeholders and on-device sanitization, delivering 15-36% higher accuracy and 2-6x lower leakage than prior device-cloud baselines on agentic benchmarks.

Representation-Guided Parameter-Efficient LLM Unlearning

cs.CL · 2026-04-19 · unverdicted · novelty 6.0

REGLU guides LoRA-based unlearning via representation subspaces and orthogonal regularization to outperform prior methods on forget-retain trade-off in LLM benchmarks.

Evaluating using Mock Tool Calls to Quarantine Untrusted Prompt Inputs

cs.CL · 2026-05-28 · unverdicted · novelty 5.0

Mock tool call wrapping does not broadly improve and sometimes reduces robustness to attacks on untrusted inputs across seven models and three LLM-as-a-judge tasks.

Evaluation of Prompt Injection Defenses in Large Language Models

cs.CR · 2026-04-26 · unverdicted · novelty 5.0 · 2 refs

Only output filtering with hardcoded rules in application code prevented prompt injection leaks in LLMs, as all model-based defenses were defeated by an adaptive attacker.

SALLIE: Safeguarding Against Latent Language & Image Exploits

cs.CR · 2026-04-06 · unverdicted · novelty 5.0

SALLIE detects jailbreaks in text and vision-language models by extracting residual stream activations, scoring maliciousness per layer with k-NN, and ensembling predictions, outperforming baselines on multiple datasets.

A Comparative Evaluation of AI Agent Security Guardrails

cs.CR · 2026-04-27 · unverdicted · novelty 3.0

DKnownAI Guard achieves 96.5% recall and 90.4% true negative rate, outperforming three competing guardrails in AI agent security evaluations.

Generative Models and Connected and Automated Vehicles: A Survey in Exploring the Intersection of Transportation and AI

cs.LG · 2024-03-14 · unverdicted · novelty 2.0

A survey reviewing the integration of generative models with connected and automated vehicles to enhance predictive modeling, simulation accuracy, and decision-making.

citing papers explorer

Showing 1 of 1 citing paper after filters.

Provably Secure Agent Guardrail cs.AI · 2026-05-28 · unverdicted · none · ref 50
Introduces ePCA framework using neural-symbolic isolation to force agents to formalize intentions as logical constraints, claiming zero attack success and false positive rates in tested scenarios.

”Tensor trust: Interpretable prompt injection attacks from an online game.” arXiv preprint arXiv:2311.01011 (2023)

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer