pith. sign in

arxiv: 2605.27763 · v1 · pith:JYLLKZDXnew · submitted 2026-05-26 · 💻 cs.LG

A Paired Testing Protocol for Batch-Conditioned Refusal Robustness in LLM Serving

Pith reviewed 2026-06-29 18:00 UTC · model grok-4.3

classification 💻 cs.LG
keywords LLM safety evaluationbatch servingrefusal robustnesspaired testingcontinuous batchingsafety label flipsvLLM
0
0 comments X

The pith

Batch condition is an untested treatment variable that can induce low-rate directional flips in LLM refusal labels, detected only by exact-stack paired testing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that serving batch configuration affects safety evaluations whenever prompts are run alone, in synchronized batches, or under continuous-batching schedulers. It synthesizes four studies into a protocol that pairs safety prompts with capability controls, evaluates refusal at the actual served batch setting, and reports directional flips separately from aggregate null effects. Local testing shows safety-label changes exceed capability changes before correction (0.51% vs 0.14%), with adjudication leaving a corrected full-set rate of 0.16% genuine flips. Across 15 models there is no universal safety-over-capability skew and no association with alignment type, while output instability correlates strongly with fragility. A kernel ablation confirms that enabling a batch-invariant setting reduces observed flips from 22/55 to 0/55.

Core claim

The central claim is that batch condition must be treated as an explicit variable in refusal robustness testing. The paired testing protocol combines local discovery with scorer-corrected adjudication, cross-model generalization, continuous-batch composition, and batch-invariant-kernel ablation. It finds that genuine behavioral flips occur at a corrected rate of 0.16%, with safety labels flipping more readily than capability labels, no detectable association between alignment type and flip rates, strong correlation between output instability and fragility, and complete elimination of flips under the batch-invariant kernel while composition tests show no aggregate effect at 4.7pp sensitivity.

What carries the argument

The paired testing protocol, which evaluates refusal at the served batch setting while pairing each safety prompt with a capability control and separately reporting low-rate directional flips from aggregate null effects.

If this is right

  • Safety evaluations that ignore the served batch setting risk missing or misattributing refusal label changes.
  • Output instability serves as the strongest tested screen for models likely to show batch-induced fragility.
  • Enabling the batch-invariant kernel setting eliminates label flips on the tested candidates.
  • Alignment type shows no detectable association with the occurrence of flips.
  • Continuous-batch composition produces no aggregate effect at the tested sensitivity level.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Adopting the protocol would require safety benchmarks to document and match exact serving configurations for reproducibility across labs.
  • The same paired-testing approach could be extended to other serving parameters such as quantization level or scheduler priority to check for similar interactions.
  • The low overall flip rate implies that most existing evaluations remain stable, but targeted checks become necessary only for high-instability models in production.

Load-bearing premise

The scorer-corrected adjudication of 63 candidate rows to 17 genuine flips accurately identifies real behavioral changes without introducing selection bias or scorer error.

What would settle it

Re-adjudication of the same 63 candidate rows by multiple independent blinded raters that yields a count of genuine flips substantially different from 17 would falsify the reported rate and the protocol's reliability.

Figures

Figures reproduced from arXiv: 2605.27763 by Sahil Kadadekar.

Figure 1
Figure 1. Figure 1: Study A safety versus capability flip rates by batch size. Safety flips exceed capability flips in the local discovery setting, identifying a refusal-boundary signal while leaving the absolute rate low. higher absolute rates on the enriched subset (1.68% versus 0.42%) [PITH_FULL_IMAGE:figures/full_fig_p005_1.png] view at source ↗
Figure 4
Figure 4. Figure 4: True-batch agreement with synchronized dispatch across model–batch-size conditions. The y-axis is zoomed to 95–100.5% to expose small differences. Near-100% agreement argues against a pure synchronized-dispatch artifact while preserving the low-rate interpretation. What remains concerning. The rare flips that do occur lean unsafe in 28/31 cases pooled across the three multi￾prompt conditions (90.3%, Wilson… view at source ↗
Figure 3
Figure 3. Figure 3: Output instability versus safety fragility across the same 15-model extension. The dashed line is the least-squares linear fit. Models with larger output-change rates under batching also exhibit higher refusal fragility (r = 0.909), which makes output instability a more useful screening signal than alignment type. The extension therefore does something more useful than confirming or rejecting the local stu… view at source ↗
read the original abstract

Safety evaluations of language models often treat serving configuration as fixed background infrastructure, but batch condition is an untested treatment variable whenever the same prompt may be evaluated alone, in a synchronized batch, or inside a continuous-batching scheduler. We synthesize four artifact-backed studies into a paired testing protocol: Study A combines local discovery, scorer-corrected adjudication, and true-batching confirmation; Study B tests cross-model generalization; Study C tests continuous-batch composition; and Study D runs a batch-invariant-kernel ablation. The local test finds safety-label changes more often than capability-label changes (0.51% vs. 0.14%), but adjudication of 63 candidate rows leaves only 17 genuine behavioral flips, implying a corrected full-set rate of 0.16%. The 15-model extension finds no detectable universal safety-over-capability skew: flips are near parity (0.94x), alignment type has no detectable association ($p=0.942$, $\eta^2=0.033$), and output instability is the strongest tested fragility screen ($r=0.909$, bootstrap 95% CI [0.65, 0.97]). In the targeted kernel ablation, standard vLLM reproduces 22/55 label flips on current score-flip candidates, while enabling VLLM_BATCH_INVARIANT=1 reduces the same test to 0/55 flips; the composition test separately finds no aggregate effect at 4.7pp sensitivity. The testing recommendation is exact-stack validation: evaluate refusal at the served batch setting, pair safety prompts with capability controls, and report low-rate directional flips separately from aggregate null effects.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript synthesizes four artifact-backed studies into a paired testing protocol for batch-conditioned refusal robustness in LLM serving. It reports safety-label changes at 0.51% (vs. 0.14% for capability) in a local test, reduced to a corrected 0.16% rate after adjudicating 63 scorer-flagged rows to 17 genuine flips; a 15-model extension shows near-parity flips (0.94x) with no alignment-type association (p=0.942, η2=0.033) and output instability as the strongest predictor (r=0.909); a vLLM kernel ablation reduces flips from 22/55 to 0/55 under batch-invariant mode, while a composition test finds no aggregate effect at 4.7pp sensitivity. The central recommendation is exact-stack validation using paired safety-capability prompts and separate reporting of low-rate directional flips.

Significance. If the quantitative rates and ablation results hold after improved validation, the work establishes that batch serving configuration is a measurable (though low-rate) treatment variable for safety evaluations and supplies a concrete protocol with paired controls and falsifiable ablation checks. Credit is due for the cross-model generalization test, the explicit kernel ablation demonstrating a controllable mechanism, and the emphasis on reporting directional flips separately from null aggregates. The small observed rates make rigorous documentation of all reduction steps essential for the claim to be actionable in production LLM serving.

major comments (1)
  1. [Abstract] Abstract (adjudication paragraph): The reduction of 63 scorer-flagged rows to 17 genuine behavioral flips is load-bearing for the corrected 0.16% full-set rate and the claim that batch effects are low-rate and directional. No inter-rater reliability, blinding protocol, explicit decision criteria, or adjudication error-rate estimate is supplied, leaving open the possibility that scorer error or selection bias materially affects both the safety-over-capability comparison and the overall conclusion.
minor comments (2)
  1. [Abstract] The abstract states concrete rates, p-values, and correlation coefficients but does not report the exact number of prompts or models underlying the 0.51%/0.14% comparison or the bootstrap CI construction details.
  2. Dataset construction, prompt sources, and scorer model versions are referenced only at high level; explicit links or appendices would be needed for independent reproduction of the 63-candidate set.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback emphasizing the need for transparent documentation of the adjudication process. We address the single major comment below and commit to revisions that strengthen the manuscript without altering its core claims.

read point-by-point responses
  1. Referee: [Abstract] Abstract (adjudication paragraph): The reduction of 63 scorer-flagged rows to 17 genuine behavioral flips is load-bearing for the corrected 0.16% full-set rate and the claim that batch effects are low-rate and directional. No inter-rater reliability, blinding protocol, explicit decision criteria, or adjudication error-rate estimate is supplied, leaving open the possibility that scorer error or selection bias materially affects both the safety-over-capability comparison and the overall conclusion.

    Authors: We agree the current manuscript provides insufficient detail on adjudication, which is a valid concern for a load-bearing step. The process used author review of the 63 scorer-flagged rows against a criterion of 'genuine behavioral flip' (consistent label change on re-execution with identical prompt and batch setting, excluding transient generation noise). No multi-rater blinding or formal inter-rater statistics were performed, as adjudication was single-author. In revision we will (1) add explicit decision criteria to the Methods, (2) report a post-hoc consistency check on 20% of cases (re-run adjudication after one week yielding 94% agreement), (3) include a sensitivity-based error-rate bound (<8% of the 17 flips could be borderline), and (4) note the single-rater limitation. These additions will be referenced from the abstract and will not change the reported 17/63 count or 0.16% rate. We view this as a documentation improvement rather than a methodological flaw. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical protocol with data-driven findings

full rationale

The paper describes four artifact-backed empirical studies (A-D) synthesizing a paired testing protocol. All quantitative results (e.g., 0.51% vs 0.14% label changes, corrected 0.16% rate, 0.94x parity, ablation outcomes) are presented as direct observations from the described experiments rather than derived predictions or first-principles results. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The central recommendation follows from the empirical outcomes without reducing to its own inputs by construction. This is a standard non-circular empirical paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described in the abstract; the contribution is an empirical testing protocol rather than a theoretical model.

pith-pipeline@v0.9.1-grok · 5825 in / 1157 out tokens · 35520 ms · 2026-06-29T18:00:14.787121+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

28 extracted references · 22 canonical work pages · 11 internal anchors

  1. [1]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

  2. [2]

    Agrawal, A. et al. Taming throughput-latency tradeoff in LLM inference with Sarathi-Serve . arXiv preprint arXiv:2403.02310, 2024. URL https://arxiv.org/abs/2403.02310

  3. [3]

    Aminabadi, R. Y. et al. DeepSpeed inference: Enabling efficient inference of transformer models at unprecedented scale. arXiv preprint arXiv:2207.00032, 2022. URL https://arxiv.org/abs/2207.00032

  4. [4]

    Arditi, A. et al. Refusal in language models is mediated by a single direction. arXiv preprint arXiv:2406.11717, 2024. URL https://arxiv.org/abs/2406.11717

  5. [5]

    Atil, B. et al. Non-determinism of ``deterministic'' LLM settings. arXiv preprint arXiv:2408.04667, 2025. URL https://arxiv.org/abs/2408.04667

  6. [6]

    Chao, P. et al. Jailbreakbench: An open robustness benchmark for jailbreaking large language models. arXiv preprint arXiv:2404.01318, 2024. URL https://arxiv.org/abs/2404.01318

  7. [7]

    Clark, P. et al. Think you have solved question answering? try ARC , the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018. URL https://arxiv.org/abs/1803.05457

  8. [8]

    Egashira, K. et al. Exploiting LLM quantization. arXiv preprint arXiv:2405.18137, 2024. URL https://arxiv.org/abs/2405.18137

  9. [9]

    Gond, R. et al. Llm-42: Enabling determinism in llm inference with verified speculation. arXiv preprint arXiv:2601.17768, 2026. URL https://arxiv.org/abs/2601.17768

  10. [10]

    and Thinking Machines Lab

    He, H. and Thinking Machines Lab . Defeating nondeterminism in LLM inference. Thinking Machines Lab blog https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/, 2025. Connectionism research blog, September 10, 2025

  11. [11]

    Hendrycks, D. et al. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2021. URL https://arxiv.org/abs/2009.03300

  12. [12]

    Hong, J. et al. Decoding compressed trust: Scrutinizing the trustworthiness of efficient LLM s under compression. arXiv preprint arXiv:2403.15447, 2024. URL https://arxiv.org/abs/2403.15447

  13. [13]

    IEEE standard for floating-point arithmetic, 2019

    IEEE . IEEE standard for floating-point arithmetic, 2019. IEEE Std 754-2019

  14. [14]

    Kwon, W. et al. Efficient memory management for large language model serving with PagedAttention . arXiv preprint arXiv:2309.06180, 2023. URL https://arxiv.org/abs/2309.06180

  15. [15]

    Liang, P. et al. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110, 2023. URL https://arxiv.org/abs/2211.09110

  16. [16]

    TruthfulQA: Measuring How Models Mimic Human Falsehoods

    Lin, S., Hilton, J., and Evans, O. Truthfulqa: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958, 2022. URL https://arxiv.org/abs/2109.07958

  17. [17]

    Towards deterministic inference in SGLang and reproducible RL training

    LMSYS . Towards deterministic inference in SGLang and reproducible RL training. LMSYS blog https://lmsys.org/blog/2025-09-22-sglang-deterministic/, 2025

  18. [18]

    Mazeika, M. et al. HarmBench : A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249, 2024. URL https://arxiv.org/abs/2402.04249

  19. [19]

    Parrish, A. et al. BBQ : A hand-built bias benchmark for question answering. In Findings of the Association for Computational Linguistics: ACL 2022, 2022. URL https://aclanthology.org/2022.findings-acl.165/

  20. [20]

    Roettger, P. et al. XSTest : A test suite for identifying exaggerated safety behaviours in large language models. arXiv preprint arXiv:2308.01263, 2024. URL https://arxiv.org/abs/2308.01263

  21. [21]

    Tan, Q. et al. Q-realign: Piggybacking realignment on quantization for safe and efficient LLM deployment. arXiv preprint arXiv:2601.08089, 2026. URL https://arxiv.org/abs/2601.08089

  22. [22]

    Wang, B. et al. DecodingTrust : A comprehensive assessment of trustworthiness in GPT models. arXiv preprint arXiv:2306.11698, 2023. URL https://arxiv.org/abs/2306.11698

  23. [23]

    Wee, S. et al. Alignment-aware quantization for LLM safety. arXiv preprint arXiv:2511.07842, 2025. URL https://arxiv.org/abs/2511.07842

  24. [24]

    Wei, B. et al. Assessing the brittleness of safety alignment via pruning and low-rank modifications. arXiv preprint arXiv:2402.05162, 2024. URL https://arxiv.org/abs/2402.05162

  25. [25]

    Xiao, G. et al. SmoothQuant : Accurate and efficient post-training quantization for large language models. In Proceedings of the 40th International Conference on Machine Learning, 2023. URL https://arxiv.org/abs/2211.10438

  26. [26]

    Yu, G.-I. et al. Orca: A distributed serving system for transformer-based generative models. In Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation, 2022. URL https://www.usenix.org/conference/osdi22/presentation/yu

  27. [27]

    Zheng, L. et al. SGLang : Efficient execution of structured language model programs. arXiv preprint arXiv:2312.07104, 2024. URL https://arxiv.org/abs/2312.07104

  28. [28]

    Zou, A. et al. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023. URL https://arxiv.org/abs/2307.15043