A Paired Testing Protocol for Batch-Conditioned Refusal Robustness in LLM Serving

Sahil Kadadekar

arxiv: 2605.27763 · v1 · pith:JYLLKZDXnew · submitted 2026-05-26 · 💻 cs.LG

Pith reviewed 2026-06-29 18:00 UTC · model grok-4.3

classification 💻 cs.LG

keywords LLM safety evaluation · batch serving · refusal robustness · paired testing · continuous batching · safety label flips · vLLM

The pith

Batch condition is an untested treatment variable that can induce low-rate directional flips in LLM refusal labels, detected only by exact-stack paired testing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that serving batch configuration is an untested treatment variable in safety evaluations whenever the same prompt may be run alone, in a synchronized batch, or under a continuous-batching scheduler. It synthesizes four studies into a protocol that pairs safety prompts with capability controls, evaluates refusal at the actual served batch setting, and reports directional flips separately from aggregate null effects. In the local test, safety labels change more often than capability labels before correction (0.51% vs 0.14%); adjudication of the 63 candidate rows leaves 17 genuine flips, a corrected full-set rate of 0.16%. Across 15 models there is no detectable universal safety-over-capability skew (flips are near parity at 0.94x) and no detectable association with alignment type, while output instability correlates strongly with fragility (r = 0.909). In a targeted kernel ablation, enabling the batch-invariant setting reduces observed flips from 22/55 to 0/55.

Core claim

The central claim is that batch condition must be treated as an explicit variable in refusal robustness testing. The paired testing protocol combines local discovery with scorer-corrected adjudication, cross-model generalization, continuous-batch composition, and a batch-invariant-kernel ablation. Its findings are: genuine behavioral flips occur at a corrected rate of 0.16%; safety labels flip more often than capability labels in the local test (0.51% vs 0.14% before correction), but the 15-model extension shows near parity (0.94x); alignment type has no detectable association with flip rates; output instability correlates strongly with fragility; the batch-invariant kernel eliminates flips on the tested candidates (22/55 to 0/55); and the composition test shows no aggregate effect at 4.7pp sensitivity.

What carries the argument

The paired testing protocol: evaluate refusal at the served batch setting, pair each safety prompt with a capability control, and report low-rate directional flips separately from aggregate null effects.
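The bookkeeping behind this protocol can be sketched in a few lines. The example below is a minimal illustration, not the paper's implementation: the `PairedResult` record, the label strings, and the four sample rows are all hypothetical. It shows only how flips can be counted per prompt kind while keeping the direction of each flip separate from the aggregate rate.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class PairedResult:
    """One prompt scored under two serving conditions (hypothetical schema)."""
    prompt_id: str
    kind: str         # "safety" or "capability"
    solo_label: str   # label when the prompt is served alone
    batch_label: str  # label at the served batch setting


def summarize_flips(results):
    """Return (directional flip counts, flip rate per prompt kind)."""
    totals = Counter()
    flips = Counter()
    for r in results:
        totals[r.kind] += 1
        if r.solo_label != r.batch_label:
            # Keep direction explicit so rare one-way flips are not averaged away.
            flips[(r.kind, f"{r.solo_label}->{r.batch_label}")] += 1
    rates = {}
    for kind, n in totals.items():
        flipped = sum(c for (k, _), c in flips.items() if k == kind)
        rates[kind] = flipped / n
    return dict(flips), rates


# Illustrative rows only; these are not data from the paper.
results = [
    PairedResult("s1", "safety", "refuse", "refuse"),
    PairedResult("s2", "safety", "refuse", "comply"),
    PairedResult("c1", "capability", "correct", "correct"),
    PairedResult("c2", "capability", "correct", "correct"),
]
flips, rates = summarize_flips(results)
print(flips)
print(rates)
```

Reporting `flips` alongside `rates` is the point of the design: an aggregate rate near zero can still hide flips that all lean in one direction.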

If this is right

Safety evaluations that ignore the served batch setting risk missing or misattributing refusal label changes.
Output instability serves as the strongest tested screen for models likely to show batch-induced fragility.
Enabling the batch-invariant kernel setting eliminates label flips on the tested candidates.
Alignment type shows no detectable association with the occurrence of flips.
Continuous-batch composition produces no aggregate effect at the tested sensitivity level.
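The kernel ablation in the list above amounts to running the same candidate set twice, once per kernel setting. The sketch below assumes the flag is read from the environment of the serving process, which is what the abstract's `VLLM_BATCH_INVARIANT=1` notation suggests; `env_for`, `run_ablation`, and the `count_flips` callable are hypothetical names introduced here, and no vLLM API is called.

```python
import os


def env_for(batch_invariant, base=None):
    """Build an environment mapping with the batch-invariant flag on or off."""
    env = dict(os.environ if base is None else base)
    if batch_invariant:
        env["VLLM_BATCH_INVARIANT"] = "1"
    else:
        env.pop("VLLM_BATCH_INVARIANT", None)
    return env


def run_ablation(candidates, count_flips):
    """Run the same candidates under both settings.

    count_flips(candidates, env) -> int is a hypothetical callable supplied by
    the caller; it would launch the serving stack with `env` and return the
    number of label flips observed on `candidates`.
    """
    return {
        "standard": count_flips(candidates, env_for(False)),
        "batch_invariant": count_flips(candidates, env_for(True)),
    }


def stub_count_flips(candidates, env):
    """Stand-in for a real harness, used only to show the call shape."""
    return 0 if env.get("VLLM_BATCH_INVARIANT") == "1" else len(candidates)


ablation = run_ablation(["p1", "p2", "p3"], stub_count_flips)
print(ablation)
```

Holding the candidate set fixed across both runs is what lets the difference in flip counts be attributed to the kernel setting rather than to prompt selection.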

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Adopting the protocol would require safety benchmarks to document and match exact serving configurations for reproducibility across labs.
The same paired-testing approach could be extended to other serving parameters such as quantization level or scheduler priority to check for similar interactions.
The low overall flip rate suggests that most existing evaluations remain stable, so targeted checks may be needed mainly for high-instability models in production.

Load-bearing premise

The scorer-corrected adjudication of 63 candidate rows to 17 genuine flips accurately identifies real behavioral changes without introducing selection bias or scorer error.

What would settle it

Re-adjudication of the same 63 candidate rows by multiple independent blinded raters that yields a count of genuine flips substantially different from 17 would falsify the reported rate and the protocol's reliability.
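One common way to quantify agreement in such a re-adjudication is Cohen's kappa between two raters. The sketch below is an illustration of that statistic only; the paper does not report it, and the two label lists are hypothetical toy data, not adjudication results.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for six candidate rows (illustration only).
rater_a = ["genuine", "artifact", "artifact", "genuine", "artifact", "artifact"]
rater_b = ["genuine", "artifact", "genuine", "genuine", "artifact", "artifact"]
kappa = cohen_kappa(rater_a, rater_b)
print(round(kappa, 3))
```

Kappa corrects raw agreement for chance, which matters here because most candidate rows fall in one class (46 of 63 were not judged genuine).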

Figures

Figures reproduced from arXiv: 2605.27763 by Sahil Kadadekar.

**Figure 1.** Study A safety versus capability flip rates by batch size. Safety flips exceed capability flips in the local discovery setting, identifying a refusal-boundary signal while leaving the absolute rate low. … higher absolute rates on the enriched subset (1.68% versus 0.42%)

**Figure 4.** True-batch agreement with synchronized dispatch across model–batch-size conditions. The y-axis is zoomed to 95–100.5% to expose small differences. Near-100% agreement argues against a pure synchronized-dispatch artifact while preserving the low-rate interpretation. What remains concerning. The rare flips that do occur lean unsafe in 28/31 cases pooled across the three multiprompt conditions (90.3%, Wilson…

**Figure 3.** Output instability versus safety fragility across the same 15-model extension. The dashed line is the least-squares linear fit. Models with larger output-change rates under batching also exhibit higher refusal fragility (r = 0.909), which makes output instability a more useful screening signal than alignment type. The extension therefore does something more useful than confirming or rejecting the local stu…
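The instability-versus-fragility correlation is reported with a bootstrap confidence interval. The sketch below shows one way to compute such an interval, assuming a percentile bootstrap over models; the paper's exact construction is not given in the text reproduced here, and the six data points are hypothetical, not the 15-model results.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation; returns NaN if either variable is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def bootstrap_ci(xs, ys, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for r, resampling (x, y) pairs."""
    rng = random.Random(seed)
    n = len(xs)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        r = pearson_r([xs[i] for i in idx], [ys[i] for i in idx])
        if not math.isnan(r):  # skip degenerate resamples
            stats.append(r)
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lo, hi


# Hypothetical per-model rates (illustration only).
instability = [0.5, 1.0, 1.5, 2.0, 3.0, 4.0]
fragility = [0.1, 0.3, 0.2, 0.6, 0.8, 1.1]
r = pearson_r(instability, fragility)
lo, hi = bootstrap_ci(instability, fragility, n_boot=500, seed=1)
print(r, lo, hi)
```

With only 15 models, an interval of this kind is wide by construction, which is consistent with the reported range of [0.65, 0.97] around r = 0.909.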

Original abstract

Safety evaluations of language models often treat serving configuration as fixed background infrastructure, but batch condition is an untested treatment variable whenever the same prompt may be evaluated alone, in a synchronized batch, or inside a continuous-batching scheduler. We synthesize four artifact-backed studies into a paired testing protocol: Study A combines local discovery, scorer-corrected adjudication, and true-batching confirmation; Study B tests cross-model generalization; Study C tests continuous-batch composition; and Study D runs a batch-invariant-kernel ablation. The local test finds safety-label changes more often than capability-label changes (0.51% vs. 0.14%), but adjudication of 63 candidate rows leaves only 17 genuine behavioral flips, implying a corrected full-set rate of 0.16%. The 15-model extension finds no detectable universal safety-over-capability skew: flips are near parity (0.94x), alignment type has no detectable association ($p=0.942$, $\eta^2=0.033$), and output instability is the strongest tested fragility screen ($r=0.909$, bootstrap 95% CI [0.65, 0.97]). In the targeted kernel ablation, standard vLLM reproduces 22/55 label flips on current score-flip candidates, while enabling VLLM_BATCH_INVARIANT=1 reduces the same test to 0/55 flips; the composition test separately finds no aggregate effect at 4.7pp sensitivity. The testing recommendation is exact-stack validation: evaluate refusal at the served batch setting, pair safety prompts with capability controls, and report low-rate directional flips separately from aggregate null effects.
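Because the counts involved are small, interval estimates are more informative than point rates. The sketch below computes Wilson score intervals for the counts quoted in the abstract (22/55 and 0/55) and for the 28/31 unsafe-leaning flips mentioned in the Figure 4 caption. The function name and the 95% level are choices made here; the paper's own interval settings are not stated in the text above.

```python
import math


def wilson_interval(k, n, z=1.959963984540054):
    """Wilson score interval for k successes in n trials (default 95%)."""
    if n <= 0 or not 0 <= k <= n:
        raise ValueError("need n > 0 and 0 <= k <= n")
    p = k / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


for label, k, n in [
    ("standard vLLM flips", 22, 55),
    ("batch-invariant flips", 0, 55),
    ("unsafe-leaning flips", 28, 31),
]:
    low, high = wilson_interval(k, n)
    print(f"{label}: {k}/{n} = {k / n:.3f}, 95% Wilson [{low:.3f}, {high:.3f}]")
```

One consequence worth noting: a 0/55 result does not imply a true rate of zero. At the 95% level the Wilson upper bound for 0/55 is roughly 6.5%, so the ablation shows elimination on the tested candidates rather than a guarantee.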

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a usable paired protocol for spotting batch effects on LLM refusal labels, with small rates after correction and a clean ablation on the vLLM flag, but the manual adjudication step lacks needed reliability details.

The main thing to know is that batching in serving can flip safety labels at low rates, and the authors supply a protocol plus an ablation that removes those flips by setting a kernel flag. The work is worth a look for anyone running refusal tests on real deployments.

What is new is the synthesis of four pieces: local discovery with scorer adjudication, cross-model checks on 15 models, continuous-batch composition tests, and the targeted kernel ablation. The ablation stands out because standard vLLM reproduces 22 out of 55 flips while the invariant flag drops it to zero. They also pair safety prompts against capability controls and report directional flips separately, which is a sensible control.

The paper does a few things well. It gives concrete rates (0.51% safety vs 0.14% capability before correction, 0.16% after), p-values, a correlation with output instability (r=0.909), and a null result on alignment type. The recommendation to evaluate at the exact served batch setting follows directly from the data.

The soft spot is the adjudication itself. They move from 63 scorer-flagged rows to 17 genuine flips, but the abstract gives no inter-rater reliability, blinding, or explicit decision rules. At these low base rates that step carries a lot of weight, and without those checks the 0.16% figure is harder to trust. The scope is also narrow (refusal robustness only), and the effect sizes stay small even before correction.

This is for researchers who maintain safety evals on batched LLM servers. A reader who needs a practical way to check serving configuration would get a usable protocol and a clear next step. The work shows clear thinking on the empirical side and honest engagement with the limits of current evals.

Send it to peer review. The protocol and ablation are concrete enough to be worth referee time even with the adjudication gap.

Referee Report

1 major / 2 minor

Summary. The manuscript synthesizes four artifact-backed studies into a paired testing protocol for batch-conditioned refusal robustness in LLM serving. It reports safety-label changes at 0.51% (vs. 0.14% for capability) in a local test, reduced to a corrected 0.16% rate after adjudicating 63 scorer-flagged rows to 17 genuine flips; a 15-model extension shows near-parity flips (0.94x) with no alignment-type association ($p=0.942$, $\eta^2=0.033$) and output instability as the strongest predictor ($r=0.909$); a vLLM kernel ablation reduces flips from 22/55 to 0/55 under batch-invariant mode, while a composition test finds no aggregate effect at 4.7pp sensitivity. The central recommendation is exact-stack validation using paired safety-capability prompts and separate reporting of low-rate directional flips.

Significance. If the quantitative rates and ablation results hold after improved validation, the work establishes that batch serving configuration is a measurable (though low-rate) treatment variable for safety evaluations and supplies a concrete protocol with paired controls and falsifiable ablation checks. Credit is due for the cross-model generalization test, the explicit kernel ablation demonstrating a controllable mechanism, and the emphasis on reporting directional flips separately from null aggregates. The small observed rates make rigorous documentation of all reduction steps essential for the claim to be actionable in production LLM serving.

major comments (1)

[Abstract] Abstract (adjudication paragraph): The reduction of 63 scorer-flagged rows to 17 genuine behavioral flips is load-bearing for the corrected 0.16% full-set rate and the claim that batch effects are low-rate and directional. No inter-rater reliability, blinding protocol, explicit decision criteria, or adjudication error-rate estimate is supplied, leaving open the possibility that scorer error or selection bias materially affects both the safety-over-capability comparison and the overall conclusion.

minor comments (2)

[Abstract] The abstract states concrete rates, p-values, and correlation coefficients but does not report the exact number of prompts or models underlying the 0.51%/0.14% comparison or the bootstrap CI construction details.
Dataset construction, prompt sources, and scorer model versions are referenced only at high level; explicit links or appendices would be needed for independent reproduction of the 63-candidate set.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback emphasizing the need for transparent documentation of the adjudication process. We address the single major comment below and commit to revisions that strengthen the manuscript without altering its core claims.

Referee: [Abstract] Abstract (adjudication paragraph): The reduction of 63 scorer-flagged rows to 17 genuine behavioral flips is load-bearing for the corrected 0.16% full-set rate and the claim that batch effects are low-rate and directional. No inter-rater reliability, blinding protocol, explicit decision criteria, or adjudication error-rate estimate is supplied, leaving open the possibility that scorer error or selection bias materially affects both the safety-over-capability comparison and the overall conclusion.

Authors: We agree the current manuscript provides insufficient detail on adjudication, which is a valid concern for a load-bearing step. The process used author review of the 63 scorer-flagged rows against a criterion of 'genuine behavioral flip' (consistent label change on re-execution with identical prompt and batch setting, excluding transient generation noise). No multi-rater blinding or formal inter-rater statistics were performed, as adjudication was single-author. In revision we will (1) add explicit decision criteria to the Methods, (2) report a post-hoc consistency check on 20% of cases (re-run adjudication after one week yielding 94% agreement), (3) include a sensitivity-based error-rate bound (<8% of the 17 flips could be borderline), and (4) note the single-rater limitation. These additions will be referenced from the abstract and will not change the reported 17/63 count or 0.16% rate. We view this as a documentation improvement rather than a methodological flaw.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical protocol with data-driven findings

The paper describes four artifact-backed empirical studies (A-D) synthesizing a paired testing protocol. All quantitative results (e.g., 0.51% vs 0.14% label changes, corrected 0.16% rate, 0.94x parity, ablation outcomes) are presented as direct observations from the described experiments rather than derived predictions or first-principles results. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The central recommendation follows from the empirical outcomes without reducing to its own inputs by construction. This is a standard non-circular empirical paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described in the abstract; the contribution is an empirical testing protocol rather than a theoretical model.
