Title resolution pending

Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, Neel Nanda · 2024

4 Pith papers cite this work. Polarity classification is still indexing.

4 Pith papers citing it

browse 4 citing papers

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

representative citing papers

MASCing: Configurable Mixture-of-Experts Behavior via Activation Steering Masks

cs.CR · 2026-04-30 · unverdicted · novelty 7.0

MASCing uses an LSTM surrogate and optimized steering masks to enable flexible, inference-time control over MoE expert routing for safety objectives, improving jailbreak defense and content generation success rates substantially across multiple models.

Revisiting JBShield: Breaking and Rebuilding Representation-Level Jailbreak Defenses

cs.CR · 2026-05-04 · accept · novelty 6.0

JBShield is vulnerable to adaptive JB-GCG attacks (up to 53% ASR) because jailbreak representations occupy a distinct region in refusal-direction space; the new RTV defense using Mahalanobis detection on multi-layer fingerprints reaches 0.99 AUROC and limits adaptive ASR to 7%.

Multi-Turn Neural Transparency: Surfacing Neural Activations Improves User Calibration to LLM Behavioral Drift

cs.HC · 2026-05-14 · unverdicted · novelty 5.0

Multi-turn neural transparency using behavioral vectors and dynamic visualizations improves user anticipation and evaluation of LLM trait expression while reducing overconfidence, per a randomized study with 246 participants.

Learning from AVA: Early Lessons from a Curated and Trustworthy Generative AI for Policy and Development Research

cs.HC · 2026-04-20 · unverdicted · novelty 5.0

AVA is a specialized GenAI platform for development policy research that provides verifiable syntheses from World Bank reports and is associated with 2.4-3.9 hours of weekly time savings in a large-scale user evaluation.

citing papers explorer

Showing 4 of 4 citing papers.

MASCing: Configurable Mixture-of-Experts Behavior via Activation Steering Masks cs.CR · 2026-04-30 · unverdicted · none · ref 2
MASCing uses an LSTM surrogate and optimized steering masks to enable flexible, inference-time control over MoE expert routing for safety objectives, improving jailbreak defense and content generation success rates substantially across multiple models.
Revisiting JBShield: Breaking and Rebuilding Representation-Level Jailbreak Defenses cs.CR · 2026-05-04 · accept · none · ref 5
JBShield is vulnerable to adaptive JB-GCG attacks (up to 53% ASR) because jailbreak representations occupy a distinct region in refusal-direction space; the new RTV defense using Mahalanobis detection on multi-layer fingerprints reaches 0.99 AUROC and limits adaptive ASR to 7%.
Multi-Turn Neural Transparency: Surfacing Neural Activations Improves User Calibration to LLM Behavioral Drift cs.HC · 2026-05-14 · unverdicted · none · ref 4
Multi-turn neural transparency using behavioral vectors and dynamic visualizations improves user anticipation and evaluation of LLM trait expression while reducing overconfidence, per a randomized study with 246 participants.
Learning from AVA: Early Lessons from a Curated and Trustworthy Generative AI for Policy and Development Research cs.HC · 2026-04-20 · unverdicted · none · ref 3
AVA is a specialized GenAI platform for development policy research that provides verifiable syntheses from World Bank reports and is associated with 2.4-3.9 hours of weekly time savings in a large-scale user evaluation.

Title resolution pending

fields

years

verdicts

representative citing papers

citing papers explorer