DuplexSLA: A Full-Duplex Spoken Language Model with Synchronized Speech, Language, and Action
Pith reviewed 2026-05-21 02:37 UTC · model grok-4.3
The pith
DuplexSLA decodes user audio, assistant speech, and structured actions jointly on one shared 160 ms timeline.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DuplexSLA is a native full-duplex Speech-Language-Action foundation model that decodes assistant audio together with a structured action stream on a shared 160 ms chunk timeline. It is built on a dual-stream three-channel formulation: a continuous user audio channel, a discrete assistant audio channel, and a rate-limited textual action channel, all decoded jointly by a single backbone so that listening, speaking, planning, and tool calling unfold on one shared clock.
What carries the argument
Dual-stream three-channel formulation that keeps continuous user audio, discrete assistant audio, and rate-limited action text aligned on a common 160 ms timeline inside one backbone.
If this is right
- Semantic turn-taking control for interruption, pause, and backchannel occurs inside the backbone instead of relying on an external semantic VAD.
- Planning text and structured tool calls emit on the action channel without halting assistant audio output.
- Multi-action sequences and backchannel-triggered tool use interleave directly with ongoing speech.
- In-conversation agentic behavior becomes native rather than tied to turn boundaries or external cascades.
- End-to-end performance on combined turn-taking and tool-calling scenarios can be measured with DuplexSLA-Bench.
Where Pith is reading between the lines
- The shared-timeline design may reduce overall system latency by removing separate modules for voice activity detection and planning.
- Direct interleaving of actions with speech could support more fluid real-time interactions where the model responds to its own output.
- If the three-channel alignment holds across longer sessions, the approach might extend to multi-turn tasks that mix speech with external tool results.
- Applying the same joint-decoding structure to other backbone sizes would test whether the synchronization benefit scales independently of model capacity.
Load-bearing premise
A single backbone can produce high-quality audio output while simultaneously delivering accurate semantic turn-taking and in-conversation action emission without external components.
What would settle it
A clear drop in audio quality or rise in turn-taking errors on DuplexSLA-Bench when the action channel is active during speech would show the joint decoding cannot sustain all three tasks at once.
Figures
read the original abstract
Recent advances in spoken dialogue language models have shifted from turn-based to full-duplex designs, where the model continuously listens to the user while generating responses. However, existing duplex backbones still lack a native channel for in-conversation planning and tool calling, leaving real-time agentic behaviour either tied to turn boundaries or relegated to an external cascade. We propose DuplexSLA, a native full-duplex Speech-Language-Action foundation model that decodes assistant audio together with a structured action stream on a shared 160 ms chunk timeline. DuplexSLA is built on a dual-stream three-channel formulation: a continuous user audio channel, a discrete assistant audio channel, and a rate-limited textual action channel, all decoded jointly by a single backbone, so that listening, speaking, planning, and tool calling unfold on one shared clock. Two capabilities define the model: (1) semantic-driven turn-taking control, where interruption, pause, and backchannel are handled inside the same backbone instead of by an external semantic VAD; and (2) in-conversation planning and tool calling, where planning text and structured tool calls are emitted on the action channel without halting assistant audio, so that multi-action and backchannel-triggered tool use are interleaved with ongoing speech. To evaluate these capabilities together, we further construct DuplexSLA-Bench, a duplex benchmark covering pause, interrupt, and backchannel turn-taking together with three styles of in-conversation tool calling. Our project page, interactive demos, and the DuplexSLA-Bench evaluation suite are publicly available at https://github.com/hyzhang24/DuplexSLA.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes DuplexSLA, a native full-duplex Speech-Language-Action foundation model built on a dual-stream three-channel formulation. It jointly decodes a continuous user audio channel, a discrete assistant audio channel, and a rate-limited textual action channel on a shared 160 ms chunk timeline using a single backbone. This design aims to enable semantic-driven turn-taking (interruptions, pauses, backchannels) and in-conversation planning/tool calling without external VAD or cascades. The authors also introduce DuplexSLA-Bench, a benchmark covering turn-taking scenarios and three styles of in-conversation tool use, with public demos and evaluation suite.
Significance. If the joint three-channel decoding can be shown to maintain high audio fidelity while delivering accurate turn-taking and action emission, the work would address a clear gap in existing duplex spoken dialogue models by natively integrating agentic capabilities. The public release of DuplexSLA-Bench and interactive demos is a concrete strength that supports reproducibility and follow-on research in real-time spoken agents.
major comments (2)
- [Abstract and Model Description] Abstract and architecture description: the central claim that a single backbone jointly decoding the three channels produces high-quality audio output alongside accurate semantic turn-taking and action emission without external components or hidden trade-offs is not supported by any quantitative results, ablations, loss curves, or baseline comparisons. No performance numbers appear for audio quality, turn-taking accuracy, or action emission success.
- [Model Architecture] Model formulation section: the dual-stream three-channel approach is presented at a high level but supplies no equations or diagrams specifying the joint decoding objective, the weighting or balancing of modality-specific losses between continuous audio and discrete action tokens, or the exact alignment mechanism that prevents rate-mismatch artifacts between the rate-limited action channel and the 160 ms audio chunks.
minor comments (2)
- [Abstract] The abstract states that 'planning text and structured tool calls are emitted on the action channel without halting assistant audio' but does not clarify whether any post-processing or buffering is applied to the action stream; a short clarifying sentence would improve precision.
- [Benchmark Description] DuplexSLA-Bench is introduced to evaluate the combined capabilities, yet the main text provides only a high-level description of the covered scenarios; adding one or two concrete example dialogues or task definitions would aid reader understanding.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing clarifications and indicating where revisions have been made to strengthen the presentation of our work.
read point-by-point responses
-
Referee: [Abstract and Model Description] Abstract and architecture description: the central claim that a single backbone jointly decoding the three channels produces high-quality audio output alongside accurate semantic turn-taking and action emission without external components or hidden trade-offs is not supported by any quantitative results, ablations, loss curves, or baseline comparisons. No performance numbers appear for audio quality, turn-taking accuracy, or action emission success.
Authors: We acknowledge that the initial submission emphasizes the novel dual-stream three-channel formulation, the shared 160 ms timeline, and the introduction of DuplexSLA-Bench without including quantitative metrics. This focus was chosen to highlight the architectural departure from cascaded systems. We agree that empirical support strengthens the claims and have incorporated preliminary quantitative results in the revised manuscript. These include audio fidelity metrics (PESQ, STOI, and subjective MOS), turn-taking accuracy for interruptions/pauses/backchannels, and action emission success rates across the three tool-use styles in DuplexSLA-Bench. We also added ablations on joint versus separate decoding and comparisons against cascaded baselines. Loss curves for the combined objective are now shown in the appendix. The public demos remain as qualitative evidence, but the new numbers directly address the central claim. revision: yes
-
Referee: [Model Architecture] Model formulation section: the dual-stream three-channel approach is presented at a high level but supplies no equations or diagrams specifying the joint decoding objective, the weighting or balancing of modality-specific losses between continuous audio and discrete action tokens, or the exact alignment mechanism that prevents rate-mismatch artifacts between the rate-limited action channel and the 160 ms audio chunks.
Authors: We agree that a more formal specification improves rigor. In the revised manuscript we have added the joint decoding objective as a weighted sum of the continuous audio reconstruction loss and the discrete action token prediction loss, with explicit weighting coefficients chosen via validation. A new diagram illustrates the chunk-wise alignment: action tokens are emitted at a lower rate and padded or repeated to align with the 160 ms audio chunks, with a synchronization mask that prevents rate-mismatch artifacts. The exact cross-attention and causal masking scheme between the dual streams is now formalized in equations. revision: yes
Circularity Check
No circularity: new dual-stream three-channel formulation and benchmark introduced without self-referential reductions
full rationale
The paper proposes DuplexSLA as a new architecture using a dual-stream three-channel formulation (continuous user audio, discrete assistant audio, rate-limited textual action) decoded jointly on a 160 ms timeline. Core claims about semantic turn-taking and in-conversation tool calling are presented as direct consequences of this joint decoding backbone rather than derived from fitted parameters, prior self-citations, or renamed empirical patterns. No equations, uniqueness theorems, or ansatzes are shown reducing to self-definitions or self-citations; the DuplexSLA-Bench is a newly constructed evaluation suite. The derivation remains self-contained as an architectural proposal with independent content.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
DuplexSLA is built on a dual-stream three-channel formulation: a continuous user audio channel, a discrete assistant audio channel, and a rate-limited textual action channel, all decoded jointly by a single backbone... shared 160 ms chunk timeline... TA4 layout... up to ten action text tokens
-
IndisputableMonolith/Foundation/DimensionForcing.lean (and headline theorem)8-tick period forcing unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Each chunk contributes 2 causal features at an 80 ms stride... four 40 ms discrete assistant audio tokens... cap the action channel at 10 text tokens per chunk
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
-
Wan-Streamer v0.1: End-to-end Real-time Interactive Foundation Models
Wan-Streamer is a unified Transformer model for low-latency streaming audio-visual interaction that jointly handles perception, reasoning, generation, and timing without external modules.
-
Wan-Streamer v0.1: End-to-end Real-time Interactive Foundation Models
Wan-Streamer presents a unified end-to-end Transformer for low-latency multimodal streaming interaction without external modules.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.