pith. machine review for the scientific record.

arxiv: 2605.11202 · v1 · submitted 2026-05-11 · 💻 cs.CR · cs.AI · cs.LG · cs.SE

Recognition: no theorem link

Continuous Discovery of Vulnerabilities in LLM Serving Systems with Fuzzing

Michelle L. Mazurek, Yibo Zhao, Yuchen Zhang, Yunze Zhao, Zaoxing Liu

Pith reviewed 2026-05-13 02:01 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.LG · cs.SE
keywords LLM serving · fuzzing · vulnerability discovery · inference engines · concurrent workloads · KV cache · security testing · greybox fuzzing

The pith

A greybox fuzzer called GRIEF finds 15 vulnerabilities in LLM inference engines by testing concurrent request traces that standard tests miss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LLM serving systems combine KV caches, batching, prefix sharing, and multi-tenant scheduling, so failures that only appear under concurrent workloads go undetected by model-level or single-request tests. The paper introduces GRIEF, which treats timed multi-request traces as primary inputs and applies lightweight oracles to detect crashes, hangs, performance degradation, and silent output corruption. Controlled replay with log-probability checks then confirms that the failures originate in the serving layer rather than the model. Early runs on vLLM and SGLang uncovered 15 issues, ten confirmed by developers and including two CVEs, spanning cache isolation breakdowns, cross-request slowdowns, and liveness problems. If the approach holds, concurrency and state reuse become a distinct security boundary that requires dedicated, continuous testing beyond conventional safety and API checks.

Core claim

GRIEF is a greybox fuzzer for LLM inference engines that treats timed multi-request traces as first-class inputs, uses lightweight oracles to detect crashes, hangs, performance pathologies, and silent output corruption, and applies controlled replay with log-probability checks to confirm reproducible serving-layer failures. Across campaigns on vLLM and SGLang it discovered 15 vulnerabilities, ten confirmed by engine developers and including two CVEs, that span KV-cache isolation failures, cross-request performance interference, and crash or liveness bugs. These results establish that concurrency, caching, and state reuse can produce silent cross-request contamination, noisy-neighbor denial of service, and delayed crashes without malformed inputs or explicit server errors, making concurrent serving behavior a first-class security and reliability boundary for LLM infrastructure.

What carries the argument

GRIEF, the greybox fuzzer that generates timed multi-request traces, applies lightweight oracles for serving anomalies, and verifies failures through controlled replay and log-probability checks.
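The trace-first input model can be pictured with a small sketch. The representation and the three mutation operators below (timing jitter, event duplication/deletion, and splicing, mirroring the operations named in Figure 3) are illustrative assumptions, not GRIEF's actual data structures.

```python
import copy
import random

# Hypothetical sketch of a GRIEF-style input: a trace is a time-ordered list of
# (send_time_s, request) events. All names and fields here are illustrative.

def make_trace():
    return [
        (0.00, {"prompt": "shared prefix A. question 1", "max_tokens": 64}),
        (0.01, {"prompt": "shared prefix A. question 2", "max_tokens": 64}),
        (0.05, {"prompt": "unrelated victim prompt",     "max_tokens": 64}),
    ]

def mutate_timing(trace, rng, jitter_s=0.02):
    # Timing mutation: perturb send times to vary batching/scheduling overlap.
    out = [(max(0.0, t + rng.uniform(-jitter_s, jitter_s)), r) for t, r in trace]
    return sorted(out, key=lambda e: e[0])

def mutate_event(trace, rng):
    # Event mutation: duplicate or drop one request to stress cache-reuse paths.
    out = copy.deepcopy(trace)
    i = rng.randrange(len(out))
    if rng.random() < 0.5 and len(out) > 1:
        del out[i]
    else:
        out.insert(i, out[i])
    return out

def splice(a, b, rng):
    # Splicing mutation: combine prefixes of two parent traces, re-sorted by time.
    cut_a = rng.randrange(1, len(a) + 1)
    cut_b = rng.randrange(1, len(b) + 1)
    return sorted(a[:cut_a] + b[:cut_b], key=lambda e: e[0])

rng = random.Random(0)
child = splice(mutate_timing(make_trace(), rng), mutate_event(make_trace(), rng), rng)
```

A fuzzer in this style would replay `child` against a live engine and hand the observations to the oracles; the point of the trace representation is that overlap and ordering, not just prompt content, are mutated.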

If this is right

  • Silent cross-request data contamination can occur without malformed inputs or server error messages.
  • One request can impose performance degradation on others, creating a noisy-neighbor denial-of-service vector.
  • Crashes and liveness failures can be delayed until specific sequences of state reuse occur.
  • Standard model, safety, and API tests are insufficient for LLM serving infrastructure.
  • Concurrent serving behavior must be treated as a first-class security and reliability boundary.
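The noisy-neighbor point lends itself to a concrete check. A minimal performance oracle in this spirit compares a victim request's time-to-first-token (TTFT) against a clean baseline; the median baseline and the 3x degradation factor below are assumed illustration values, not thresholds from the paper.

```python
import statistics

def ttft_oracle(baseline_ttfts_s, observed_ttft_s, factor=3.0):
    """Flag a performance pathology when the observed TTFT exceeds a robust
    baseline (the median of clean runs) by more than `factor`.
    The 3x factor is an assumption for illustration."""
    baseline = statistics.median(baseline_ttfts_s)
    return observed_ttft_s > factor * baseline

# Example: victim TTFT jumps from a ~40 ms baseline to 400 ms under interference.
flagged = ttft_oracle([0.038, 0.041, 0.040, 0.043], 0.400)
```

A real deployment would calibrate the factor against load-dependent variance, since batching alone inflates TTFT without any bug being present.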

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same trace-based fuzzing strategy could be adapted to other shared-state components in AI pipelines such as training schedulers or retrieval systems.
  • Engine maintainers could integrate continuous GRIEF-style campaigns into their release processes to catch regressions introduced by new caching or batching features.
  • Isolation mechanisms in multi-tenant LLM deployments may need explicit verification against concurrent workloads rather than relying on per-request correctness alone.

Load-bearing premise

The lightweight oracles and replay checks with log-probability can reliably distinguish genuine serving-layer failures from test artifacts or model behavior.
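A minimal sketch of that premise, assuming greedy decoding and token-aligned log-probabilities returned by the engine: replay the same request in isolation and treat a large per-token log-probability divergence (or a length mismatch) as evidence of serving-layer state corruption rather than sampling noise. The tolerance below is an assumed value; real engines also show small numerical nondeterminism that any real threshold must absorb.

```python
import math

def max_abs_logprob_delta(concurrent_lps, isolated_lps):
    # Token-aligned comparison; a length mismatch is itself a divergence signal.
    if len(concurrent_lps) != len(isolated_lps):
        return math.inf
    return max(abs(a - b) for a, b in zip(concurrent_lps, isolated_lps))

def serving_layer_suspect(concurrent_lps, isolated_lps, tol=1e-3):
    # tol is an illustration value, not a threshold from the paper.
    return max_abs_logprob_delta(concurrent_lps, isolated_lps) > tol

clean = serving_layer_suspect([-0.1, -2.3, -0.05], [-0.1, -2.3, -0.05])
corrupt = serving_layer_suspect([-0.1, -2.3, -0.05], [-0.1, -0.9, -0.05])
```

The referee's point maps directly onto `tol`: without a stated threshold and a clean-trace baseline, the boundary between benign numerical jitter and genuine corruption is undefined.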

What would settle it

Re-running GRIEF against the same engines and finding that none of the issues the oracles flagged as serving-layer failures reproduce under controlled replay or earn developer confirmation.

Figures

Figures reproduced from arXiv: 2605.11202 by Michelle L. Mazurek, Yibo Zhao, Yuchen Zhang, Yunze Zhao, Zaoxing Liu.

Figure 1. Examples of three KV-cache state-corruption symptoms caused by one bug discovered by GRIEF: confident value pollution.

Figure 2. Overview of the GRIEF system architecture in a single-GPU setting, illustrating the interaction between the wrapper, …

Figure 3. A simplified representation of seed trace construction and mutation across timing, event, and splicing operations.

Figure 4. Victim time-to-first-token (TTFT) and throughput before, during, and after curated multi-completion interference.

Figure 5. SGLang LoRA scheduler crash campaign. The stacked area summarizes four trace-level pressure dimensions. …

Figure 6. Confident value pollution (A). The baseline completion correctly aggregates all three contributions: the friend's 24 outfits, the 48 outfits from the baby shower, and the mother's 15 outfits, yielding 24 + 48 + 15 = 87. Under attack, the victim preserves a fluent reasoning chain but drops the initial 24 from the final aggregation, computing 48 + 15 = 63. The corrupted run still emits a clean final-answer marker…

Figure 7. Reasoning-chain inflation and truncation.

Figure 8. Answer-first reasoning confusion (B). The corrupted completion emits an incorrect answer and final-answer marker (#### 3200) before any reasoning appears. The subsequent chain of thought then derives the correct result, 1240, but too late: downstream answer extraction has already anchored on the earlier answer-like region. This example shows a distinct failure mode in which state corruption does not simply …
Original abstract

LLM inference and serving systems have become security-critical infrastructure; however, many of their most concerning failures arise from the serving layer rather than from model behavior alone. Modern inference engines combine KV cache, batching, prefix sharing, speculative decoding, adapters, and multi-tenant scheduling, creating shared-state behavior that only emerges under realistic concurrent workloads and is missed by standard model, safety, and API tests. We present GRIEF, a greybox fuzzer for LLM inference engines that treats timed multi-request traces as first-class inputs, uses lightweight oracles to detect crashes, hangs, performance pathologies, and silent output corruption, and applies controlled replay with log-probability checks to confirm reproducible serving-layer failures. Across early campaigns on vLLM and SGLang, GRIEF discovers 15 vulnerabilities, 10 confirmed by engine developers, including 2 CVEs, spanning KV-cache isolation failures, cross-request performance interference, and crash or liveness bugs. These results show that concurrency, caching, and state reuse can induce silent cross-request contamination, noisy-neighbor denial of service, and delayed crashes without malformed inputs or explicit server errors, making concurrent serving behavior a first-class security and reliability boundary for LLM infrastructure.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript presents GRIEF, a greybox fuzzer for LLM inference engines that treats timed multi-request traces as first-class inputs. It uses lightweight oracles to detect crashes, hangs, performance pathologies, and silent output corruption, combined with controlled replay and log-probability checks to confirm reproducible serving-layer failures. Across campaigns on vLLM and SGLang, the tool reports discovering 15 vulnerabilities (10 developer-confirmed, including 2 CVEs) related to KV-cache isolation, cross-request interference, and liveness issues.

Significance. If the oracle-based attribution holds, the work is significant because it demonstrates that shared-state behaviors in concurrent LLM serving (KV caching, batching, prefix sharing) create security and reliability boundaries missed by standard model, safety, and API testing. Developer confirmations and CVEs provide external validation of practical impact, and the greybox approach with domain-specific oracles could inform future fuzzing of AI infrastructure.

major comments (1)
  1. [Evaluation and oracle design sections] The central claim of 15 discovered vulnerabilities with 10 confirmations rests on the lightweight oracles and log-probability replay correctly attributing failures to serving-layer state rather than model nondeterminism or artifacts. The manuscript provides no explicit threshold for log-probability deltas, no baseline comparison to clean concurrent traces, and no ablation demonstrating that reported issues (e.g., cross-request contamination) are not triggered by benign KV-cache reuse or scheduler jitter. This is load-bearing for the results and directly engages the stress-test concern about oracle reliability.
minor comments (1)
  1. [Abstract] The abstract and results summary would benefit from additional quantitative details on campaign scale (e.g., total requests tested, false-positive rates, or raw oracle trigger counts) to allow readers to assess the effort behind the 15 discoveries.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of GRIEF's significance and for the constructive feedback on oracle reliability. We address the major comment below.

Point-by-point responses
  1. Referee: [Evaluation and oracle design sections] The central claim of 15 discovered vulnerabilities with 10 confirmations rests on the lightweight oracles and log-probability replay correctly attributing failures to serving-layer state rather than model nondeterminism or artifacts. The manuscript provides no explicit threshold for log-probability deltas, no baseline comparison to clean concurrent traces, and no ablation demonstrating that reported issues (e.g., cross-request contamination) are not triggered by benign KV-cache reuse or scheduler jitter. This is load-bearing for the results and directly engages the stress-test concern about oracle reliability.

    Authors: We agree that the manuscript would benefit from greater explicitness on these points to strengthen the attribution of failures to serving-layer behaviors. In the revised version we will expand the Evaluation section to specify the log-probability delta thresholds applied during replay, add a baseline comparison against clean concurrent traces (showing that the oracles remain silent under non-adversarial conditions), and include an ablation isolating benign KV-cache reuse and scheduler jitter from the fuzzed concurrent workloads. These additions will directly address the concern that reported issues such as cross-request contamination could arise from normal engine behavior. The ten developer confirmations already provide external evidence that the failures are genuine, but the requested analyses will improve the rigor of the oracle validation. revision: yes
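The proposed clean-trace baseline is straightforward to sketch: run the oracle suite over non-adversarial concurrent traces and report how often it fires, i.e. an empirical false-positive rate. `run_oracles` below is a simulated stand-in for the real suite, and the 0.1% spurious trigger rate is an invented illustration value.

```python
import random

def run_oracles(trace, rng):
    # Stand-in for the real oracle suite: a well-calibrated suite should almost
    # never fire on benign traffic. Here we simulate a 0.1% spurious rate.
    return rng.random() < 0.001

def baseline_false_positive_rate(n_traces=10_000, seed=0):
    # Replay n_traces clean concurrent traces and count oracle triggers.
    rng = random.Random(seed)
    fired = sum(run_oracles(None, rng) for _ in range(n_traces))
    return fired / n_traces

rate = baseline_false_positive_rate()
```

Reporting this rate alongside the 15 discoveries would let readers judge how much of the oracle signal survives contact with benign KV-cache reuse and scheduler jitter.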

Circularity Check

0 steps flagged

No circularity: empirical fuzzer presentation with external developer confirmations

Full rationale

The paper describes GRIEF as a greybox fuzzer using timed traces, lightweight oracles, and log-probability replay to find serving-layer issues in vLLM and SGLang. All central claims (15 vulnerabilities found, 10 confirmed including 2 CVEs) rest on external developer validation rather than any internal equations, fitted parameters, or self-referential derivations. No load-bearing steps reduce by construction to the paper's own inputs; the work is a self-contained empirical tool report.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Paper is an empirical security tool contribution; no free parameters, mathematical axioms, or invented physical entities are present in the abstract.

pith-pipeline@v0.9.0 · 5530 in / 1105 out tokens · 64302 ms · 2026-05-13T02:01:14.889058+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 2 internal anchors

  1. [1]

    The LibAFL Fuzzing Library - The LibAFL Fuzzing Library

  2. [2]

    Lee, Deming Chen, and Tri Dao

    Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, and Tri Dao. Medusa: Simple llm inference acceleration framework with multiple decoding heads, 2024

  3. [3]

    Feder Cooper, Katherine Lee, Matthew Jagielski, Milad Nasr, Arthur Conmy, Itay Yona, Eric Wallace, David Rolnick, and Florian Tramèr

    Nicholas Carlini, Daniel Paleka, Krishnamurthy Dj Dvijotham, Thomas Steinke, Jonathan Hayase, A. Feder Cooper, Katherine Lee, Matthew Jagielski, Milad Nasr, Arthur Conmy, Itay Yona, Eric Wallace, David Rolnick, and Florian Tramèr. Stealing part of a production language model, 2024

  4. [4]

    Extracting training data from large language models, 2021

    Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-V oss, Kather- ine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, Alina Oprea, and Colin Raffel. Extracting training data from large language models, 2021

  5. [5]

    Jailbreakbench: An open robustness benchmark for jailbreaking large language models

    Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J Pappas, Florian Tramer, et al. Jailbreakbench: An open robustness benchmark for jailbreaking large language models. Advances in Neural Information Processing Systems, 37:55005–55029, 2024

  6. [6]

    Pappas, and Eric Wong

    Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. Jailbreaking black box large language models in twenty queries, 2024

  7. [7]

    LLM- Inference-Bench: Inference benchmarking of large language models on ai accelerators

    Krishna Teja Chitty-Venkata, Siddhisanket Raskar, Bharat Kale, Farah Ferdaus, Aditya Tanikanti, Ken Raffenetti, Valerie Taylor, Murali Emani, and Venkatram Vishwanath. LLM- Inference-Bench: Inference benchmarking of large language models on ai accelerators. In Workshops of the International Conference for High Performance Computing, Networking, Storage an...

  8. [8]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

  9. [9]

    Large language models are zero-shot fuzzers: Fuzzing deep-learning libraries via large language models

    Yinlin Deng, Chunqiu Steven Xia, Haoran Peng, Chenyuan Yang, and Lingming Zhang. Large language models are zero-shot fuzzers: Fuzzing deep-learning libraries via large language models. InProceedings of the 32nd ACM SIGSOFT international symposium on software testing and analysis, pages 423–435, 2023

  10. [10]

    Size-aware Sharding For Improving Tail Latencies in In-memory Key-value Stores

    Diego Didona and Willy Zwaenepoel. Size-aware Sharding For Improving Tail Latencies in In-memory Key-value Stores. pages 79–94

  11. [11]

    AFL++: Combining incremental steps of fuzzing research

    Andrea Fioraldi, Dominik Maier, Heiko Eißfeldt, and Marc Heuse. AFL++: Combining incremental steps of fuzzing research. In14th USENIX Workshop on Offensive Technologies, 2020

  12. [12]

    LibAFL: A framework to build modular and reusable fuzzers

    Andrea Fioraldi, Dominik Maier, Dongjia Zhang, and Davide Balzarotti. LibAFL: A framework to build modular and reusable fuzzers. InProceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security, pages 1051–1065, 2022

  13. [13]

    Pal: Program-aided language models,

    Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. Pal: Program-aided language models.arXiv preprint arXiv:2211.10435, 2022. 10

  14. [14]

    Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection, 2023

    Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection, 2023

  15. [15]

    Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation

    Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. InThe Eleventh International Conference on Learning Representations, 2023

  16. [16]

    vLLM: An Efficient Inference Engine for Large Language Models

    Woosuk Kwon. vLLM: An Efficient Inference Engine for Large Language Models

  17. [17]

    Codecrash: Stress testing llm reasoning under structural and semantic perturbations.arXiv e-prints, pages arXiv–2504, 2025

    Man Ho Lam, Chaozheng Wang, Jen-tse Huang, and Michael R Lyu. Codecrash: Stress testing llm reasoning under structural and semantic perturbations.arXiv e-prints, pages arXiv–2504, 2025

  18. [18]

    Fast Inference from Transformers via Speculative Decoding

    Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast Inference from Transformers via Speculative Decoding

  19. [19]

    Hongwei Li and Yongjun Wang. Reliability of llm inference engines from a static perspective: Root cause analysis and repair suggestion via natural language reports.Big Data and Cognitive Computing, 10(2):60, 2026

  20. [20]

    Eagle: speculative sampling requires rethinking feature uncertainty

    Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle: speculative sampling requires rethinking feature uncertainty. InProceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org, 2024

  21. [21]

    Eagle-3: Scaling up inference acceleration of large language models via training-time test, 2025

    Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle-3: Scaling up inference acceleration of large language models via training-time test, 2025

  22. [22]

    Holistic Evaluation of Language Models

    Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Connor Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu Re...

  23. [23]

    A first look at bugs in llm inference engines.ACM Transactions on Software Engineering and Methodology, 2025

    Mugeng Liu, Siqi Zhong, Weichen Bi, Yixuan Zhang, Zhiyang Chen, Zhenpeng Chen, Xuanzhe Liu, and Yun Ma. A first look at bugs in llm inference engines.ACM Transactions on Software Engineering and Methodology, 2025

  24. [24]

    Autodan: Generating stealthy jailbreak prompts on aligned large language models, 2024

    Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. Autodan: Generating stealthy jailbreak prompts on aligned large language models, 2024

  25. [25]

    Cachegen: Kv cache compression and streaming for fast large language model serving, 2024

    Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, Michael Maire, Henry Hoffmann, Ari Holtzman, and Junchen Jiang. Cachegen: Kv cache compression and streaming for fast large language model serving, 2024

  26. [26]

    Graph- based fuzz testing for deep learning inference engines

    Weisi Luo, Dong Chai, Xiaoyue Ruan, Jiang Wang, Chunrong Fang, and Zhenyu Chen. Graph- based fuzz testing for deep learning inference engines. In2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE), pages 288–299. IEEE, 2021

  27. [27]

    Harmbench: A standardized evaluation framework for automated red teaming and robust refusal, 2024

    Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal, 2024

  28. [28]

    Deepspeed-moe: Advancing mixture-of- experts inference and training to power next-generation ai scale, 2022

    Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, and Yuxiong He. Deepspeed-moe: Advancing mixture-of- experts inference and training to power next-generation ai scale, 2022. 11

  29. [29]

    Scott Gardner, Itay Hubara, Sachin Idgunji, Thomas B

    Vijay Janapa Reddi, Christine Cheng, David Kanter, Peter Mattson, Guenther Schmuelling, Carole-Jean Wu, Brian Anderson, Maximilien Breughe, Mark Charlebois, William Chou, Ramesh Chukka, Cody Coleman, Sam Davis, Pan Deng, Greg Diamos, Jared Duke, Dave Fick, J. Scott Gardner, Itay Hubara, Sachin Idgunji, Thomas B. Jablin, Jeff Jiao, Tom St. John, Pankaj Kan...

  30. [30]

    Gonzalez, and Ion Stoica

    Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer, Joseph E. Gonzalez, and Ion Stoica. S-lora: Serving thousands of concurrent lora adapters, 2024

  31. [31]

    Sponge examples: Energy-latency attacks on neural networks, 2021

    Ilia Shumailov, Yiren Zhao, Daniel Bates, Nicolas Papernot, Robert Mullins, and Ross Anderson. Sponge examples: Energy-latency attacks on neural networks, 2021

  32. [32]

    Qwen2.5: A party of foundation models, September 2024

    Qwen Team. Qwen2.5: A party of foundation models, September 2024

  33. [33]

    Qwen3 technical report, 2025

    Qwen Team. Qwen3 technical report, 2025

  34. [34]

    Fine- tuning language models for factuality

    Katherine Tian, Eric Mitchell, Huaxiu Yao, Christopher D Manning, and Chelsea Finn. Fine- tuning language models for factuality. InThe Twelfth International Conference on Learning Representations, 2024

  35. [35]

    Fuzz4all: Universal fuzzing with large language models

    Chunqiu Steven Xia, Matteo Paltenghi, Jia Le Tian, Michael Pradel, and Lingming Zhang. Fuzz4all: Universal fuzzing with large language models. InProceedings of the IEEE/ACM 46th International Conference on Software Engineering, ICSE ’24, pages 1–13. ACM, April 2024

  36. [36]

    Orca: A Distributed Serving System for Transformer-Based Generative Models

    Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. Orca: A Distributed Serving System for Transformer-Based Generative Models. pages 521–538

  37. [37]

    Enabling performant and flexible model-internal observability for llm inference

    Nengneng Yu, Sixian Xiong, Yibo Zhao, Wei Wang, and Zaoxing Liu. Enabling performant and flexible model-internal observability for llm inference. InAdvances in Neural Information Processing Systems (NeurIPS), 2026. To appear

  38. [38]

    Gonzalez, Clark Barrett, and Ying Sheng

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. SGLang: Efficient execution of structured language model programs. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural I...

  39. [39]

    Zico Kolter, and Matt Fredrikson

    Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models, 2023. 12 A Appendix A.1 Threat Model GRIEF targets shared LLM inference-serving deployments in which multiple client requests may overlap in time and interact through shared serving mechanisms ...