A new paired-prompt protocol reveals alignment-pipeline-specific heterogeneity in how open-weight LLMs respond to evaluation versus deployment framings.
Title resolution pending
11 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
AutoRISE evolves red-teaming attack strategies as editable executable programs via an agent, yielding 17-point higher average attack success rates than baselines across 11 models.
New RPS and AGS metrics show within-family distilled LLM agents have 5.9 pp higher tool-use graph similarity than cross-family pairs, with some models exceeding their teachers.
MADE creates a contamination-resistant living benchmark for multi-label classification of medical device adverse events, with evaluations revealing model-specific trade-offs in accuracy and uncertainty quantification.
Large reasoning models exhibit multilingual latent reasoning that is uneven across languages but internally consistent and English-centered.
A mutual evaluation system for LLMs that uses game-theoretic aggregation of peer reviews and validates alignment with human voting on subjective outputs.
GeoLaux is a new benchmark of 2186 long-step geometry problems requiring auxiliary lines, used to evaluate 23 MLLMs and reveal major drops in performance on complex tasks.
Omni-MATH supplies 4428 human-verified Olympiad math problems that expose top LLMs achieving only 52.55% to 60.54% accuracy on the most difficult items.
VeriScale adversarially scales test suites for the Verina benchmark into VerinaPlus (83x larger) and VerinaLite (14x variant) that expose hidden LLM weaknesses on SpecGen and CodeGen tasks.
SPENCE shows older NL2SQL benchmarks like Spider have high performance sensitivity to syntactic changes, indicating likely training contamination, while newer ones like BIRD show little sensitivity and appear largely clean.
PITMuS automates source-level bug dataset generation by mapping PIT bytecode mutants back to Java source using debug information, producing structured pairs and metadata evaluated on eight open-source systems.
citing papers explorer
-
Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity
A new paired-prompt protocol reveals alignment-pipeline-specific heterogeneity in how open-weight LLMs respond to evaluation versus deployment framings.
-
AutoRISE: Agent-Driven Strategy Evolution for Red-Teaming Large Language Models
AutoRISE evolves red-teaming attack strategies as editable executable programs via an agent, yielding 17-point higher average attack success rates than baselines across 11 models.
-
When Agents Look the Same: Quantifying Distillation-Induced Similarity in Tool-Use Behaviors
New RPS and AGS metrics show within-family distilled LLM agents have 5.9 pp higher tool-use graph similarity than cross-family pairs, with some models exceeding their teachers.
-
MADE: A Living Benchmark for Multi-Label Text Classification with Uncertainty Quantification of Medical Device Adverse Events
MADE creates a contamination-resistant living benchmark for multi-label classification of medical device adverse events, with evaluations revealing model-specific trade-offs in accuracy and uncertainty quantification.
-
Large Reasoning Models Are (Not Yet) Multilingual Latent Reasoners
Large reasoning models exhibit multilingual latent reasoning that is uneven across languages but internally consistent and English-centered.
-
LLMs Judge Themselves: A Game-Theoretic Framework for Human-Aligned Evaluation
A mutual evaluation system for LLMs that uses game-theoretic aggregation of peer reviews and validates alignment with human voting on subjective outputs.
-
GeoLaux: A Benchmark for Evaluating MLLMs' Geometry Performance on Long-Step Problems Requiring Auxiliary Lines
GeoLaux is a new benchmark of 2186 long-step geometry problems requiring auxiliary lines, used to evaluate 23 MLLMs and reveal major drops in performance on complex tasks.
-
Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models
Omni-MATH supplies 4428 human-verified Olympiad math problems that expose top LLMs achieving only 52.55% to 60.54% accuracy on the most difficult items.
-
VeriScale: Adversarial Test-Suite Scaling for Verifiable Code Generation
VeriScale adversarially scales test suites for the Verina benchmark into VerinaPlus (83x larger) and VerinaLite (14x variant) that expose hidden LLM weaknesses on SpecGen and CodeGen tasks.
-
SPENCE: A Syntactic Probe for Detecting Contamination in NL2SQL Benchmarks
SPENCE shows older NL2SQL benchmarks like Spider have high performance sensitivity to syntactic changes, indicating likely training contamination, while newer ones like BIRD show little sensitivity and appear largely clean.
-
PITMuS: A Tool for Automated Bug Dataset Generation via Source-Level Mutant Reconstruction
PITMuS automates source-level bug dataset generation by mapping PIT bytecode mutants back to Java source using debug information, producing structured pairs and metadata evaluated on eight open-source systems.