pith. machine review for the scientific record.

arxiv: 2605.11232 · v1 · submitted 2026-05-11 · 💻 cs.AI · cs.LG

Recognition: no theorem link

Rethinking LLMOps for Fraud and AML: Building a Compliance-Grade LLM Serving Stack

Authors on Pith · no claims yet

Pith reviewed 2026-05-13 02:06 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords LLMOps · AML compliance · fraud detection · LLM serving · prefix caching · synthetic datasets · workload optimization · quality gating

The pith

Workload-aware optimizations turn AML compliance prompts into high-throughput LLM workloads.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Fraud and anti-money-laundering tasks require LLMs to handle prefix-heavy prompts that mix reusable policy text, transaction evidence, and structured outputs like JSON risk labels. The paper builds a serving stack around open-weight models that applies automatic prefix caching, length-aware batching, and speculative decoding to these specific workloads. On synthetic AML datasets converted into compliance prompts, the tuned stack delivers roughly fivefold higher throughput, up to fivefold lower tail latency, and more than sixfold higher GPU utilization compared with baseline serving. This positions regulated LLM use as a problem of prompt engineering, serving configuration, and output validation rather than raw model capability alone.
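
To make the workload shape concrete, here is a minimal sketch of how a synthetic transaction row might be converted into a prefix-heavy compliance prompt; the policy text, field names, and output schema are illustrative assumptions, not the paper's actual templates.

```python
import json

# Shared policy prefix: identical across requests, so a prefix-aware
# server can compute its KV state once and reuse it for every prompt.
POLICY_PREFIX = (
    "You are an AML compliance analyst. Apply the institution's risk "
    "taxonomy: structuring, layering, rapid movement of funds.\n"
    "Respond ONLY with JSON matching "
    '{"risk_label": "low|medium|high", "risk_factors": [string]}.\n'
)

def build_compliance_prompt(txn: dict) -> str:
    """Append per-request transaction evidence to the reusable policy prefix."""
    evidence = json.dumps(txn, sort_keys=True)
    return f"{POLICY_PREFIX}\nTransaction evidence:\n{evidence}\n\nAssessment:"

# Hypothetical record in the spirit of IBM AML / SAML-D rows.
print(build_compliance_prompt(
    {"amount": 9850.0, "currency": "USD", "sender": "acct_112",
     "receiver": "acct_907", "channel": "wire", "country_pair": "US->KY"}
))
```

Because only the evidence block varies, the expensive prefill work for the policy prefix is shared across requests, which is exactly the structure the stack's optimizations target.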

Core claim

A workload-aware LLMOps stack that combines vLLM-style runtime tuning, PagedAttention, Automatic Prefix Caching, multi-adapter serving, adapter and prompt-length-aware batching, sleep/wake lifecycle management, speculative decoding, and optional prefill/decode disaggregation enables efficient serving of compliance prompts. When applied to synthetic AML datasets such as IBM AML and SAML-D reformulated as prefix-heavy prompts with policy instructions and schema-constrained outputs, it improves throughput from 612-650 to 3,600 requests per hour, reduces P99 latency from 31-38 seconds to 6.4-8.7 seconds, and raises GPU utilization from 12% to 78%. An LLM-as-judge layer with deterministic compliance checks, reference metrics, and multi-judge rubric scoring gates the quality of the structured outputs.
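
The deterministic half of that gate is straightforward to picture: a hard schema check that rejects malformed outputs before any LLM judge is consulted. A minimal sketch, assuming an illustrative label set and schema rather than the paper's actual rubric:

```python
import json

ALLOWED_LABELS = {"low", "medium", "high"}  # illustrative taxonomy

def deterministic_gate(raw_output: str) -> tuple[bool, str]:
    """Hard schema check: output must be valid JSON with the expected fields."""
    try:
        obj = json.loads(raw_output)
    except json.JSONDecodeError:
        return False, "not valid JSON"
    if obj.get("risk_label") not in ALLOWED_LABELS:
        return False, f"unknown risk_label: {obj.get('risk_label')!r}"
    factors = obj.get("risk_factors")
    if not isinstance(factors, list) or not all(isinstance(f, str) for f in factors):
        return False, "risk_factors must be a list of strings"
    return True, "ok"

ok, reason = deterministic_gate('{"risk_label": "high", "risk_factors": ["structuring"]}')
assert ok, reason
```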

What carries the argument

The workload-aware serving stack that exploits prefix reuse and KV-cache efficiency through Automatic Prefix Caching and adapter-aware batching for schema-constrained, evidence-rich compliance prompts.
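
For orientation, here is a minimal sketch of how those levers surface in vLLM's offline API; the model name and parameter values are illustrative assumptions, not the paper's reported configuration.

```python
from vllm import LLM, SamplingParams

# Illustrative settings only; the paper's tuned configuration is not public.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # any self-hosted open-weight model
    enable_prefix_caching=True,    # reuse cached KV blocks for shared prefixes
    max_num_seqs=256,              # cap on concurrently batched sequences
    gpu_memory_utilization=0.90,   # VRAM fraction handed to PagedAttention blocks
    tensor_parallel_size=1,
)

shared_prefix = "You are an AML analyst. Apply the risk taxonomy...\n"
prompts = [shared_prefix + f"Transaction evidence: {row}" for row in ("row A", "row B")]
outputs = llm.generate(prompts, SamplingParams(temperature=0.0, max_tokens=128))
```

With enable_prefix_caching=True, requests sharing the policy prefix skip recomputing its prefill, which is where the prefix-heavy workload shape pays off.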

If this is right

  • Open-weight models can meet regulated-domain performance targets when serving is tuned to prompt structure.
  • Prefix caching and batching optimizations yield large efficiency gains without model retraining or larger hardware.
  • Quality gates combining LLM judges and deterministic schema checks maintain reliability for structured outputs.
  • Synthetic dataset conversion provides a reproducible way to benchmark compliance LLM systems.
  • Regulated performance depends on systems-level design choices as much as on the base model selected.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar prefix-heavy prompt patterns in legal or medical compliance could benefit from the same caching and batching techniques.
  • Real production AML data may introduce prompt variability that requires additional dynamic tuning beyond the synthetic benchmarks.
  • The stack's self-hosted design supports data-sovereignty requirements common in financial institutions.
  • Further disaggregation of prefill and decode phases could yield additional latency reductions for long-context AML queries.

Load-bearing premise

That the synthetic AML prompts capture the essential structure and variability of real institutional compliance queries.

What would settle it

Deploying the stack on a live production AML system and checking whether it sustains 3600 requests per hour with P99 latency under 9 seconds on actual transaction data.

Figures

Figures reproduced from arXiv: 2605.11232 by Naresh Dintakurthi, Prathamesh Vasudeo Naik, Yue Wang.

Figure 1: Workload-aware serving architecture for prefix-dominated fraud and compliance inference. The stack separates workload construction, control, serving, and assurance planes so that prefix reuse, model tenancy, structured decoding, and output validation can be optimized jointly. Preserve reusable work: shared policy prefixes should not be recomputed, repeated context should be retained across turns or workers…
Original abstract

Fraud detection and anti-money-laundering (AML) compliance are high-value domains for large language models (LLMs), but their serving requirements differ sharply from generic chat workloads. Compliance prompts are often prefix-heavy, schema-constrained, and evidence-rich, combining reusable policy instructions, risk taxonomies, transaction or document context, and short structured outputs such as JSON labels or risk factors. These properties make prefix reuse, KV-cache efficiency, runtime tuning, model orchestration, and output validation first-order systems concerns. This paper introduces a workload-aware LLMOps stack for fraud and AML workloads using self-hosted open-weight models such as Meta Llama and Alibaba Qwen. The stack combines vLLM-style runtime tuning, PagedAttention, Automatic Prefix Caching, multi-adapter serving, adapter and prompt-length-aware batching, sleep/wake lifecycle management, speculative decoding, and optional prefill/decode disaggregation. To avoid exposing institution-specific data, the reproducibility track converts public synthetic AML datasets, including IBM AML and SAML-D, into prefix-heavy compliance prompts with reusable policy text, transaction evidence, typology definitions, and schema-constrained outputs. We also incorporate an LLM-as-judge quality gate using deterministic compliance checks, reference metrics, expert-adjudicated calibration data where available, and multi-judge rubric scoring. Across public-synthetic AML workloads and controlled serving benchmarks, workload-aware tuning improved throughput from 612-650 to 3,600 requests/hour, reduced P99 latency from 31-38 seconds to 6.4-8.7 seconds, and increased GPU utilization from 12% to 78%. These results show that regulated LLM performance is a workload-design, serving-optimization, and quality-gating problem, not only a model-selection problem.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes a workload-aware LLMOps stack for fraud detection and AML compliance tasks using self-hosted open-weight models (e.g., Llama, Qwen). The stack integrates vLLM-style runtime tuning, PagedAttention, Automatic Prefix Caching, multi-adapter serving, adapter/prompt-length-aware batching, speculative decoding, and optional prefill/decode disaggregation, together with an LLM-as-judge quality gate. Evaluations on synthetic compliance prompts derived from public AML datasets (IBM AML, SAML-D) report throughput gains from 612-650 to 3,600 requests/hour, P99 latency reductions from 31-38 s to 6.4-8.7 s, and GPU utilization increases from 12% to 78%.

Significance. If the reported gains hold under broader conditions, the work would be significant for regulated financial domains by showing that compliance-specific prompt structures (prefix-heavy, schema-constrained) can be exploited for large efficiency improvements without model retraining. The privacy-preserving synthetic-data approach and explicit quality-gating mechanism are practical strengths that could inform production LLM serving in other high-stakes settings.

major comments (3)
  1. [Abstract / Evaluation] The headline performance claims (throughput 612-650 → 3,600 req/h, P99 latency 31-38 s → 6.4-8.7 s, GPU utilization 12% → 78%) rest exclusively on synthetic AML prompts converted from public datasets. Because the optimizations (prefix caching, adapter-aware batching, speculative decoding) are workload-sensitive, these numbers may be artifacts of the synthetic prompt distribution rather than a general property of the stack; no analysis of statistical fidelity to real transaction streams or regulatory edge cases is provided.
  2. [Evaluation] No error bars, run-to-run variance, or statistical significance tests accompany the reported metrics, and no ablation isolating the contribution of each optimization (e.g., prefix caching alone vs. the full stack) is shown. This makes it impossible to determine whether the gains are robust or driven by a single component.
  3. [Methods / Implementation] The precise vLLM configuration parameters, batch-size heuristics, and adapter/prompt-length-aware scheduling logic are described at a high level only. Without these details or accompanying code, the central claim that the stack delivers “compliance-grade” serving cannot be independently verified or reproduced.
minor comments (2)
  1. [Abstract] The abstract states “we also incorporate an LLM-as-judge quality gate” but provides no rubric details, inter-judge agreement statistics, or calibration procedure; these should be expanded in the main text for clarity.
  2. [Figures / Tables] Figure captions and table headers should explicitly state the number of runs, hardware configuration (GPU model, count), and prompt-length distribution to aid interpretation.

Simulated Authors' Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive feedback highlighting needs for greater evaluation rigor and reproducibility. We address each major comment below, indicating planned revisions and any inherent limitations.

Point-by-point responses
  1. Referee: [Abstract / Evaluation] The headline performance claims (throughput 612-650 → 3,600 req/h, P99 latency 31-38 s → 6.4-8.7 s, GPU utilization 12% → 78%) rest exclusively on synthetic AML prompts converted from public datasets. Because the optimizations (prefix caching, adapter-aware batching, speculative decoding) are workload-sensitive, these numbers may be artifacts of the synthetic prompt distribution rather than a general property of the stack; no analysis of statistical fidelity to real transaction streams or regulatory edge cases is provided.

    Authors: We agree the reported gains are measured on synthetic prompts derived from public datasets (IBM AML, SAML-D), as stated in the manuscript to preserve privacy and enable reproducibility. These prompts are deliberately constructed to replicate compliance workload traits: long reusable policy prefixes, schema-constrained outputs, and evidence-rich contexts. We will revise the Evaluation and Discussion sections to include additional justification of prompt design fidelity based on public AML typology literature and dataset documentation, plus explicit limitations on regulatory edge cases. However, direct statistical comparison to proprietary real transaction streams is not feasible under data-protection regulations. revision: partial

  2. Referee: [Evaluation] No error bars, run-to-run variance, or statistical significance tests accompany the reported metrics, and no ablation isolating the contribution of each optimization (e.g., prefix caching alone vs. the full stack) is shown. This makes it impossible to determine whether the gains are robust or driven by a single component.

    Authors: We acknowledge the absence of variance measures and ablations in the current draft. The revised manuscript will report results from multiple runs (minimum five independent trials with varied request arrival seeds), include error bars and standard deviations for all metrics, and add statistical significance testing. We will also insert a dedicated ablation subsection and accompanying figure that isolates the contribution of Automatic Prefix Caching, adapter-aware batching, speculative decoding, and the full stack combination. revision: yes
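
For context, the kind of run-to-run reporting promised here can be lightweight; a minimal sketch, using placeholder throughput values rather than any measured results:

```python
import random
import statistics

# Placeholder per-run throughput values (req/h); NOT measured results.
runs = [3580.0, 3615.0, 3642.0, 3571.0, 3608.0]

mean, stdev = statistics.mean(runs), statistics.stdev(runs)

# Simple bootstrap 95% confidence interval over resampled run means.
random.seed(0)
boot = sorted(statistics.mean(random.choices(runs, k=len(runs))) for _ in range(10_000))
ci_low, ci_high = boot[249], boot[9749]
print(f"throughput: {mean:.0f} ± {stdev:.0f} req/h (95% CI {ci_low:.0f}-{ci_high:.0f})")
```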

  3. Referee: [Methods / Implementation] The precise vLLM configuration parameters, batch-size heuristics, and adapter/prompt-length-aware scheduling logic are described at a high level only. Without these details or accompanying code, the central claim that the stack delivers “compliance-grade” serving cannot be independently verified or reproduced.

    Authors: We accept that the current description is insufficient for full reproducibility. The revision will expand the Methods section with concrete vLLM parameters (tensor_parallel_size, max_num_seqs, block_size, gpu_memory_utilization), explicit batch-size heuristics conditioned on prompt length and active adapters, and pseudocode for the length- and adapter-aware scheduler. We will also add a reproducibility appendix linking to public prompt-generation scripts and core serving configuration files used for the reported experiments, while noting that certain production orchestration components remain institutionally restricted. revision: partial
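
Pending that pseudocode, one plausible shape for a length- and adapter-aware batching heuristic is sketched below; the grouping key, token budget, and request fields are assumptions for illustration, not the authors' scheduler.

```python
from collections import defaultdict

def form_batches(requests, max_batch_tokens=8192):
    """Group requests by active adapter, then pack length-sorted requests
    into batches under a token budget (illustrative heuristic only)."""
    by_adapter = defaultdict(list)
    for req in requests:  # req: dict with "adapter" and "prompt_len" keys
        by_adapter[req["adapter"]].append(req)

    batches = []
    for adapter, reqs in by_adapter.items():
        reqs.sort(key=lambda r: r["prompt_len"])  # limit padding waste
        batch, budget = [], 0
        for req in reqs:
            if batch and budget + req["prompt_len"] > max_batch_tokens:
                batches.append((adapter, batch))
                batch, budget = [], 0
            batch.append(req)
            budget += req["prompt_len"]
        if batch:
            batches.append((adapter, batch))
    return batches

demo = [{"adapter": "sar-narrative", "prompt_len": 3000},
        {"adapter": "risk-label", "prompt_len": 900},
        {"adapter": "risk-label", "prompt_len": 7500}]
print(form_batches(demo))
```

Grouping by adapter limits adapter switching within a batch, and length-sorting keeps similarly sized sequences together so short requests are not stalled behind long prefills.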

standing simulated objections (1 unresolved)
  • Direct statistical fidelity analysis or evaluation against real (non-synthetic) production transaction streams or regulatory edge cases, which would require access to proprietary institutional data prohibited by privacy and compliance regulations.

Circularity Check

0 steps flagged

No circularity; empirical benchmarks on synthetic data with no derivations or self-referential fits

full rationale

The paper describes a workload-aware LLM serving stack and reports measured performance gains (throughput, latency, GPU utilization) exclusively from controlled experiments on converted public synthetic AML datasets. No equations, parameter fits, or derivations are present that could reduce to inputs by construction. Claims rest on direct instrumentation of vLLM-style optimizations rather than any self-definitional loop, fitted-input prediction, or load-bearing self-citation chain. The explicit choice of synthetic data for privacy reasons is noted but does not create circularity in the reported results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper introduces no new mathematical entities or free parameters; it applies and tunes existing systems to a new workload.

axioms (1)
  • [standard math] The listed serving optimizations (vLLM, PagedAttention, Automatic Prefix Caching, etc.) function as described in their original papers.
    Relies on prior work for the base systems.

pith-pipeline@v0.9.0 · 5633 in / 1361 out tokens · 64179 ms · 2026-05-13T02:06:31.413337+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 3 internal anchors

  1. B. Li, Y. Jiang, V. Gadepally, and D. Tiwari. LLM inference serving: Survey of recent advances and opportunities. arXiv:2407.12391, 2024.
  2. W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica. Efficient memory management for large language model serving with PagedAttention. SOSP, 2023.
  3. I. Gim, G. Chen, S. Lee, N. Sarda, A. Khandelwal, and L. Zhong. Prompt Cache: Modular attention reuse for low-latency inference. arXiv:2311.04934, 2023.
  4. B. Gao, Z. He, and Y. Liu. CachedAttention: Efficient attention-state reuse for LLM generation. arXiv:2403.19708, 2024.
  5. vLLM Project. Automatic prefix caching. https://docs.vllm.ai/en/latest/features/automatic_prefix_caching/
  6. LMCache Project. LMCache documentation. https://docs.lmcache.ai/
  7. Y. Huang et al. LMCache: Efficient KV cache reuse for LLM serving. arXiv:2510.09665, 2025.
  8. P. Patel, E. Choukse, C. Zhang, A. Shah, I. Goiri, S. Maleki, and R. Bianchini. Splitwise: Efficient generative LLM inference using phase splitting. arXiv:2311.18677, 2023.
  9. Y. Zhong, S. Liu, J. Chen, J. Hu, Y. Zhu, X. Liu, X. Jin, and H. Zhang. DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving. OSDI, 2024.
  10. L. Zheng, L. Yin, Z. Xie, C. Huang, J. Sun, C. H. Yu, S. Cao, C. Kozyrakis, I. Stoica, J. E. Gonzalez, C. Barrett, and Y. Sheng. SGLang: Efficient execution of structured language model programs. NeurIPS, 2024.
  11. Y. Leviathan, M. Kalman, and Y. Matias. Fast inference from transformers via speculative decoding. ICML, 2023.
  12. Y. Li et al. EAGLE-3: Scaling up inference acceleration of large language models. arXiv:2503.01840, 2025.
  13. AIConfigurator Authors. AIConfigurator: Lightning-fast configuration optimization for multi-framework LLM serving. arXiv:2601.06288, 2026.
  14. E. R. Altman, J. Blanusa, L. von Niederhausern, B. Egressy, A. S. Anghel, and K. Atasu. Realistic synthetic financial transactions for anti-money laundering models. NeurIPS Datasets and Benchmarks, 2023.
  15. IBM Research. IBM Transactions for Anti Money Laundering dataset. GitHub repository, 2023. https://github.com/IBM/AML-Data
  16. Linux Foundation. Community Data License Agreement – Sharing, Version 1.0. https://cdla.dev/sharing-1-0/
  17. B. Oztas, D. Cetinkaya, F. F. Adedoyin, M. Budka, H. Dogan, and G. Aksu. Enhancing anti-money laundering: Development of a synthetic transaction monitoring dataset. IEEE International Conference on e-Business Engineering (ICEBE), pp. 47–54, 2023.
  18. B. Oztas. Anti Money Laundering Transaction Data (SAML-D). Kaggle dataset, 2023. https://www.kaggle.com/datasets/berkanoztas/synthetic-transaction-monitoring-dataset-aml
  19. Creative Commons. Attribution-NonCommercial-ShareAlike 4.0 International. https://creativecommons.org/licenses/by-nc-sa/4.0/
  20. R. I. T. Jensen, J. Ferwerda, K. S. Jorgensen, E. R. Jensen, M. Borg, M. P. Krogh, J. B. Jensen, and A. Iosifidis. A synthetic data set to benchmark anti-money laundering methods. Scientific Data, 10:661, 2023.
  21. A. Grattafiori et al. The Llama 3 herd of models. arXiv:2407.21783, 2024.
  22. A. Yang et al. Qwen2.5 technical report. arXiv:2412.15115, 2024.
  23. L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. NeurIPS, 2023.
  24. Y. Liu, D. Iter, Y. Xu, S. Wang, R. Xu, and C. Zhu. G-Eval: NLG evaluation using GPT-4 with better human alignment. EMNLP, 2023.
  25. P. Wang, L. Li, L. Chen, Z. Zhu, B. Lin, Y. Cao, Q. Liu, T. Liu, and Z. Sui. Large language models are not fair evaluators. ACL, 2024.
  26. A. Panickssery, S. R. Bowman, and S. Feng. LLM evaluators recognize and favor their own generations. NeurIPS, 2024.
  27. P. Verga, S. Hofstatter, S. Althammer, Y. Su, A. Piktus, A. Arkhangorodsky, M. Xu, N. White, and P. Lewis. Replacing judges with juries: Evaluating LLM generations with a panel of diverse models. arXiv:2404.18796, 2024.
  28. K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: A method for automatic evaluation of machine translation. ACL, 2002.
  29. C.-Y. Lin. ROUGE: A package for automatic evaluation of summaries. Workshop on Text Summarization Branches Out, 2004.
  30. N. Reimers and I. Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. EMNLP-IJCNLP, 2019.
  31. P. V. Naik, N. K. Dintakurthi, Z. Hu, Y. Wang, and R. Qiu. Co-Investigator AI: The rise of agentic AI for smarter, trustworthy AML compliance narratives. CoRR, abs/2509.08380, 2025. doi: 10.48550/arXiv.2509.08380.