SafetyRepro: Configuration-Conditional Rank Instability on Alignment Benchmarks

Yanhang Li; Zexin Zhuang; Zhichao Fan

arxiv: 2605.25492 · v1 · pith:CAINK6HYnew · submitted 2026-05-25 · 💻 cs.LG


classification 💻 cs.LG

keywords benchmarks, alignment, benchmark, pairwise, proposition, admits, alone, choice


Pairwise model comparisons drawn from foundation-model benchmarks ("A is safer than B") are read as quantitative verdicts but hinge on harness choices benchmark papers under-specify. We close one theory-benchmark loop on this primitive: a finite-envelope proposition tying a measurable pairwise-disagreement rate to whether the strict ordering admits a configuration-pair reversal, paired with a commit-stamped evaluation protocol that operationalises it on widely cited alignment benchmarks. On every benchmark we test, configuration choice alone can flip the pairwise verdict; the proposition isolates this strict-reversal failure mode.


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

When AUC 0.998 Is Not Enough: A Candidate Evaluation Protocol for Hidden-State Probes of Indirect Prompt Injection in Multimodal Computer-Use Agents
cs.LG 2026-06 unverdicted novelty 7.0

High AUC from linear probes on model activations for indirect prompt injection does not license an unqualified claim of malicious-content detection, per a Qwen2.5-VL-7B case study with text and visual controls.

Chains That See, Answers That Don't: A Multi-Aspect Evaluation Recipe for Forced Chain-of-Thought on Video-MME
cs.CV 2026-06 conditional novelty 6.0

Forced CoT produces video-dependent reasoning chains but does not improve MCQ accuracy on Qwen2.5-VL with Video-MME and causes a small drop on the 7B variant.