Building Guardrails for Large Language Models
7 Pith papers cite this work.
Citing papers
- Skill Drift Is Contract Violation: Proactive Maintenance for LLM Agent Skill Libraries. SkillGuard extracts executable environment contracts from LLM skill documents so that only drift relevant to each skill is detected; it reports zero false positives across 599 cases, 100% precision in known-drift tests, and raises one-round repair success from 10% to 78% (a minimal sketch of such a contract check appears after this list).
- Misrouter: Exploiting Routing Mechanisms for Input-Only Attacks on Mixture-of-Experts LLMs. Misrouter mounts input-only attacks on MoE LLMs by optimizing queries on open-source surrogate models so they route toward weakly aligned experts, then transferring those queries to public APIs.
- Playing Along: Learning a Double-Agent Defender for Belief Steering via Theory of Mind. Double-agent defenders trained with RL on combined theory-of-mind and fooling rewards outperform prompted frontier models on a new belief-steering task, and the two skills show bidirectional emergence.
- Understanding Annotator Safety Policy with Interpretability. Annotator Policy Models learn safety policies from labeling behavior alone, accurately predicting annotator responses and revealing sources of disagreement such as policy ambiguity and value pluralism.
- LocalAlign: Enabling Generalizable Prompt Injection Defense via Generation of Near-Target Adversarial Examples for Alignment Training. LocalAlign generates near-target adversarial examples via prompting and applies margin-aware alignment training to enforce tighter boundaries against prompt injection attacks.
- Agent-Sentry: Bounding LLM Agents via Execution Provenance. Agent-Sentry bounds LLM agent executions via structural provenance classification, sensitive-value allowlists, and selective LLM judgment, blocking 94.3% of injections while allowing 95.1% of benign actions on AgentDojo and AgentDyn.
- StarCoder 2 and The Stack v2: The Next Generation. StarCoder2-15B matches or beats CodeLlama-34B on code tasks despite being smaller, StarCoder2-3B outperforms prior 15B models, and both the model weights and the exact identifiers of the training data are released openly.
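
The SkillGuard entry above describes skills that ship with "executable environment contracts" whose violation signals drift. As a rough illustration only, here is a minimal Python sketch of that idea; the EnvContract type, the example git contracts, and the check logic are our own assumptions based on the one-line summary, not SkillGuard's actual interface.

    import shutil
    import subprocess
    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class EnvContract:
        """One executable assumption a skill makes about its environment."""
        description: str
        check: Callable[[], bool]  # returns True if the assumption still holds

    def detect_drift(contracts: List[EnvContract]) -> List[str]:
        """Run every contract and return descriptions of the violated ones."""
        violated = []
        for contract in contracts:
            try:
                holds = contract.check()
            except Exception:
                holds = False  # a check that cannot even run counts as a violation
            if not holds:
                violated.append(contract.description)
        return violated

    # Hypothetical contracts for a shell-based git skill.
    git_skill_contracts = [
        EnvContract(
            description="git is installed and on PATH",
            check=lambda: shutil.which("git") is not None,
        ),
        EnvContract(
            description="git accepts the -C <dir> flag the skill relies on",
            check=lambda: subprocess.run(
                ["git", "-C", ".", "status"], capture_output=True
            ).returncode == 0,
        ),
    ]

    if __name__ == "__main__":
        drift = detect_drift(git_skill_contracts)
        if drift:
            print("Skill drift detected:")
            for description in drift:
                print(" -", description)
        else:
            print("All environment contracts hold.")

Treating a crashing check as a violation keeps the detector conservative: a contract that can no longer even execute is itself evidence that the environment has drifted.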