pith. machine review for the scientific record. sign in

Building guardrails for large language models

7 Pith papers cite this work. Polarity classification is still indexing.

7 Pith papers citing it

years

2026 6 2024 1

representative citing papers

Understanding Annotator Safety Policy with Interpretability

cs.AI · 2026-05-06 · unverdicted · novelty 6.0

Annotator Policy Models learn safety policies from labeling behavior alone, accurately predicting responses and revealing sources of disagreement like policy ambiguity and value pluralism.

Agent-Sentry: Bounding LLM Agents via Execution Provenance

cs.CR · 2026-03-24 · unverdicted · novelty 6.0

Agent-Sentry bounds LLM agent executions via structural provenance classification, sensitive-value allowlists, and selective LLM judgment, blocking 94.3% of injections while allowing 95.1% of benign actions on AgentDojo and AgentDyn.

StarCoder 2 and The Stack v2: The Next Generation

cs.SE · 2024-02-29 · accept · novelty 6.0

StarCoder2-15B matches or beats CodeLlama-34B on code tasks despite being smaller, and StarCoder2-3B outperforms prior 15B models, with open weights and exact training data identifiers released.

citing papers explorer

Showing 7 of 7 citing papers.