5 Pith papers (2026) cite this work; polarity classification is still indexing, so all verdicts are currently unverdicted. The focal paper and representative citing papers are listed below.
-
Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity
A new paired-prompt protocol reveals alignment-pipeline-specific heterogeneity in how open-weight LLMs respond to evaluation versus deployment framings.
-
AutoRISE: Agent-Driven Strategy Evolution for Red-Teaming Large Language Models
AutoRISE evolves red-teaming attack strategies as editable executable programs via an agent, yielding attack success rates averaging 17 points higher than baselines across 11 models.
-
When Agents Look the Same: Quantifying Distillation-Induced Similarity in Tool-Use Behaviors
New RPS and AGS metrics show within-family distilled LLM agents have tool-use graph similarity 5.9 percentage points higher than cross-family pairs, with some models exceeding their teachers.
-
MADE: A Living Benchmark for Multi-Label Text Classification with Uncertainty Quantification of Medical Device Adverse Events
MADE creates a contamination-resistant living benchmark for multi-label classification of medical device adverse events, with evaluations revealing model-specific trade-offs in accuracy and uncertainty quantification.
-
SPENCE: A Syntactic Probe for Detecting Contamination in NL2SQL Benchmarks
SPENCE shows that older NL2SQL benchmarks such as Spider are highly sensitive to syntactic changes in performance terms, indicating likely training-data contamination, whereas newer benchmarks such as BIRD show little sensitivity and appear largely clean.