An Efficient vLLM-Based Inference Pipeline for Unified Audio Understanding and Generation

2026 · eess.AS · arXiv 2607.02119

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

While Large Multimodal Models excel in comprehension, high-throughput inference engines lack native support for multimodal generation. This is severe in Speech Language Models, where generating multi-layered audio tokens via decoupled AR+NAR or synchronous Multi-Token Prediction (MTP) with delay-pattern interleaving conflicts with standard single-stream loops. We present a vLLM-based inference pipeline for unified speech understanding and generation. We extend autoregressive decoding to natively execute delay-pattern de-interleaving and coordinated multi-stream sampling, integrating an on-GPU acoustic decoder for end-to-end waveform synthesis. Crucially, we overcome the shared intuition that Classifier-Free Guidance (CFG) halves throughput. By co-scheduling paired conditional and unconditional requests within a continuous batch, our CFG implementation sustains 80% of non-CFG throughput, absorbing dual-request and logit merging overheads. We open-source our framework.
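The delay pattern mentioned in the abstract can be illustrated with a small sketch. It assumes the commonly used layout in which codebook k is shifted right by k decoding steps, so that de-interleaving simply removes that shift; the abstract does not spell out the exact scheme. The function names and the `PAD` value are hypothetical and are not taken from the paper's code.

```python
from typing import List

PAD = -1  # hypothetical padding id for positions that hold no token


def apply_delay_pattern(codes: List[List[int]], pad: int = PAD) -> List[List[int]]:
    """Shift codebook k right by k steps (assumed delay layout).

    codes[k][t] is the token of codebook k at audio frame t.
    """
    num_codebooks = len(codes)
    num_frames = len(codes[0])
    total_steps = num_frames + num_codebooks - 1
    delayed = [[pad] * total_steps for _ in range(num_codebooks)]
    for k, stream in enumerate(codes):
        for t, token in enumerate(stream):
            delayed[k][t + k] = token
    return delayed


def undo_delay_pattern(delayed: List[List[int]]) -> List[List[int]]:
    """De-interleave: drop the per-codebook shift to recover aligned frames."""
    num_codebooks = len(delayed)
    num_frames = len(delayed[0]) - num_codebooks + 1
    return [delayed[k][k:k + num_frames] for k in range(num_codebooks)]


# Three codebooks, three audio frames.
codes = [[11, 12, 13], [21, 22, 23], [31, 32, 33]]
delayed = apply_delay_pattern(codes)
restored = undo_delay_pattern(delayed)
```

With K codebooks and T frames, the delayed grid spans T + K - 1 decoding steps, which is why a single-stream decoding loop needs extra bookkeeping to hand aligned frames to the acoustic decoder.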

representative citing papers

An Efficient vLLM-Based Inference Pipeline for Unified Audio Understanding and Generation

eess.AS · 2026-07-02 · unverdicted · novelty 6.0

Extends vLLM with delay-pattern de-interleaving, multi-stream sampling, and co-scheduled CFG to achieve 80% of non-CFG throughput for unified audio tasks while open-sourcing the pipeline.
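The logit-merging step of co-scheduled CFG can be sketched as follows. The sketch assumes the widely used guidance formula, merged = uncond + scale * (cond - uncond), and a hypothetical batch layout in which the conditional and unconditional rows of each request sit next to each other; neither detail is confirmed by the summary above, and the function names are illustrative only.

```python
from typing import List


def merge_cfg_logits(cond: List[float], uncond: List[float], scale: float) -> List[float]:
    """Combine one conditional/unconditional logit pair (assumed CFG formula)."""
    if len(cond) != len(uncond):
        raise ValueError("conditional and unconditional logits must have equal length")
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def merge_paired_batch(batch_logits: List[List[float]], scale: float) -> List[List[float]]:
    """Merge a batch where rows 2i and 2i+1 are the conditional and
    unconditional logits of request i (hypothetical layout)."""
    if len(batch_logits) % 2 != 0:
        raise ValueError("paired batch must contain an even number of rows")
    return [
        merge_cfg_logits(batch_logits[i], batch_logits[i + 1], scale)
        for i in range(0, len(batch_logits), 2)
    ]


# Two requests, each contributing a conditional and an unconditional row.
batch = [
    [2.0, 0.0, -1.0],   # request 0, conditional
    [1.0, 1.0, -1.0],   # request 0, unconditional
    [0.5, 0.5, 0.0],    # request 1, conditional
    [0.5, -0.5, 0.0],   # request 1, unconditional
]
merged = merge_paired_batch(batch, scale=3.0)
```

Keeping both rows of a pair in the same batch means a single forward pass produces everything the merge needs, so sampling runs once per request rather than once per row.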
