WavAlign: Enhancing Intelligence and Expressiveness in Spoken Dialogue Models via Adaptive Hybrid Post-Training
Pith reviewed 2026-05-10 11:02 UTC · model grok-4.3
The pith
A modality-aware adaptive post-training approach makes reinforcement learning practical for spoken dialogue models by separating semantic preference updates from acoustic anchoring.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the modality-aware adaptive post-training recipe makes RL practical for spoken dialogue models by constraining preference updates to the semantic channel, improving acoustic behavior via explicit anchoring, and dynamically regulating their mixture from rollout statistics to avoid unreliable preference gradients, yielding consistent gains in semantic quality and speech expressiveness.
What carries the argument
The modality-aware adaptive post-training recipe, which separates and dynamically balances semantic preference optimization and acoustic anchoring based on rollout statistics.
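The mechanism described here — a semantic preference loss and an acoustic anchoring loss whose gradients are computed in separate passes and then mixed by a weight — can be pictured with a toy sketch. Everything below is illustrative: the quadratic stand-in losses, the fixed weight `w_sem`, and the plain gradient-descent update are assumptions, not the paper's actual formulation; only the idea of keeping the two modality gradients in separate passes and mixing them afterwards follows the recipe as the review describes it.

```python
# Toy sketch of a modality-separated hybrid update (illustrative only).
# The semantic and acoustic gradients are computed in two separate passes
# and only then mixed, rather than backpropagating one summed loss.

def grad(f, theta, eps=1e-6):
    """Central-difference gradient of scalar function f at parameter list theta."""
    g = []
    for i in range(len(theta)):
        up = theta[:]; up[i] += eps
        dn = theta[:]; dn[i] -= eps
        g.append((f(up) - f(dn)) / (2 * eps))
    return g

def semantic_loss(theta):
    # Stand-in for the preference (DPO-style) loss on the semantic channel.
    return (theta[0] - 1.0) ** 2

def acoustic_anchor(theta):
    # Stand-in for the explicit acoustic anchoring loss.
    return (theta[1] + 0.5) ** 2

def hybrid_update(theta, w_sem, lr=0.1):
    # Two separate "backward passes": the gradients never mix inside one pass.
    g_sem = grad(semantic_loss, theta)
    g_ac = grad(acoustic_anchor, theta)
    return [t - lr * (w_sem * gs + (1.0 - w_sem) * ga)
            for t, gs, ga in zip(theta, g_sem, g_ac)]

theta = [0.0, 0.0]
for _ in range(200):
    theta = hybrid_update(theta, w_sem=0.7)
```

With `w_sem = 0.7` the iterate settles near the semantic optimum on the first coordinate and near the acoustic anchor on the second; the point of the separation is that neither loss can contaminate the other's pass.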
If this is right
- Preference optimization can be applied without compromising acoustic generation quality.
- Both semantic intelligence and acoustic expressiveness improve across tested models and benchmarks.
- Dynamic regulation from rollout statistics prevents issues from unreliable gradients.
- The method applies successfully to representative architectures in spoken dialogue.
Where Pith is reading between the lines
- This separation of concerns could be generalized to other multimodal generation tasks where different modalities have different optimization needs.
- Anchoring techniques might stabilize training in other settings with dense outputs and sparse rewards.
- Analyzing rollout statistics for adaptation could inspire similar adaptive strategies in other RL applications.
Load-bearing premise
The analysis accurately pinpoints the main problems in reward modeling and rollout sampling, and the new constraints and anchoring do not introduce instabilities in the training process.
What would settle it
Testing the proposed recipe on a spoken dialogue model and finding no gains in semantic quality or speech expressiveness metrics compared to direct preference optimization would falsify the claim.
Original abstract
End-to-end spoken dialogue models have garnered significant attention because they offer a higher potential ceiling in expressiveness and perceptual ability than cascaded systems. However, the intelligence and expressiveness of current open-source spoken dialogue models often remain below expectations. Motivated by the success of online reinforcement learning (RL) in other domains, one might attempt to directly apply preference optimization to spoken dialogue models, yet this transfer is non-trivial. We analyze these obstacles from the perspectives of reward modeling and rollout sampling, focusing on how sparse preference supervision interacts with dense speech generation under shared-parameter updates. Based on the analysis, we propose a modality-aware adaptive post-training recipe that makes RL practical for spoken dialogue: it constrains preference updates to the semantic channel and improves acoustic behavior via explicit anchoring, while dynamically regulating their mixture from rollout statistics to avoid unreliable preference gradients. We evaluate the method across multiple spoken dialogue benchmarks and representative architectures, and observe consistent improvements in semantic quality and speech expressiveness.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that direct application of preference optimization to spoken dialogue models is hindered by challenges in reward modeling and rollout sampling, particularly sparse supervision with dense generation under shared parameters. It introduces WavAlign, a modality-aware adaptive post-training recipe that restricts preference updates to the semantic channel, employs explicit anchoring for acoustics, and dynamically mixes them based on rollout statistics. Evaluations on spoken dialogue benchmarks and various architectures show consistent improvements in semantic quality and expressiveness.
Significance. If the results hold, the work offers a practical solution to applying RL in spoken dialogue, potentially elevating open-source models' performance in intelligence and expressiveness. The focus on modality-specific adaptations addresses a gap in current post-training methods for multimodal systems.
major comments (2)
- [§3] The proposed constraints on semantic-channel updates and acoustic anchoring are motivated by the analysis, but the manuscript does not provide a formal derivation or proof that they prevent the identified instabilities, relying instead on empirical validation.
- [§4.3] The dynamic regulation from rollout statistics is described only at a high level; without the exact formula or pseudocode for computing the mixture weights, the adaptive component is difficult to reproduce.
minor comments (2)
- The title uses 'WavAlign' but the abstract and method section do not explain the name or its relation to waveform alignment.
- [References] Several recent works on RL for dialogue (e.g., from 2023-2024) appear to be missing from the related work section.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the major comments point by point below, providing clarifications and committing to revisions that strengthen the presentation without altering the core contributions.
Point-by-point responses
-
Referee: [§3] The proposed constraints on semantic-channel updates and acoustic anchoring are motivated by the analysis but the manuscript does not provide a formal derivation or proof that these prevent the identified instabilities, relying instead on empirical validation.
Authors: We acknowledge that §3 presents a motivation grounded in the analysis of sparse preference supervision interacting with dense speech generation under shared parameters, rather than a formal derivation or proof of stability. The constraints on semantic-channel updates and acoustic anchoring follow directly from this analysis to isolate unreliable gradients. While a rigorous theoretical proof of instability prevention is not provided (and may lie outside the paper's scope), the empirical validation across multiple benchmarks and architectures consistently shows improved semantic quality and expressiveness without observed instabilities. In revision, we will expand §3 with additional intuitive explanations of the mechanism and a brief discussion of the empirical evidence supporting stability. revision: partial
-
Referee: [§4.3] The dynamic regulation from rollout statistics is described at a high level; without the exact formula or pseudocode for computing the mixture weights, reproducibility of the adaptive component is limited.
Authors: We agree that the description of the dynamic mixture regulation in §4.3 is at a high level and that including the exact formula and pseudocode would enhance reproducibility. The mixture weights are computed from rollout statistics to balance semantic preference updates and acoustic anchoring. In the revised manuscript, we will add the precise mathematical formulation for the mixture weights and include pseudocode for the adaptive regulation procedure, either in §4.3 or an appendix. revision: yes
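The exact weighting formula is not in the manuscript (the rebuttal promises it in revision), so any concrete form is necessarily a guess. One hypothetical instantiation, for illustration only: shrink the semantic-loss weight when rewards within a rollout group are nearly tied, since near-tied rewards yield unreliable preference gradients. The function name, the bounds `w_min`/`w_max`, and the saturation threshold `tau` are all assumptions, not values from the paper.

```python
# Hypothetical sketch of deriving the semantic/acoustic mixture weight
# from rollout statistics. Not the paper's formula: the idea is only that
# near-tied rollout rewards should shift weight toward acoustic anchoring.

import statistics

def mixture_weight(rollout_rewards, w_max=0.9, w_min=0.1, tau=0.5):
    """Map the reward spread within one rollout group to a semantic-loss weight."""
    spread = statistics.pstdev(rollout_rewards)
    # Saturating ramp: low spread (uninformative preferences) -> small
    # semantic weight; spread beyond tau -> full semantic weight.
    return w_min + (w_max - w_min) * min(spread / tau, 1.0)

print(mixture_weight([3.1, 3.0, 3.1, 3.0]))  # near-tied rewards -> small weight
print(mixture_weight([1.0, 4.5, 2.0, 5.0]))  # well-separated rewards -> large weight
```

A design choice worth noting: using the population standard deviation over one rollout group keeps the signal local to the current batch, so the weight adapts step by step rather than smoothing over training history.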
Circularity Check
No significant circularity detected
full rationale
The paper's core contribution is an empirical recipe for post-training spoken dialogue models, derived from an analysis of reward modeling and rollout obstacles. It constrains updates to the semantic channel, adds explicit acoustic anchoring, and regulates mixtures from rollout statistics. No equations, fitted parameters renamed as predictions, or self-referential definitions appear in the abstract or described claims. The method is justified by benchmark evaluations showing consistent gains, not by any reduction of outputs to inputs by construction. Self-citations, if present in the full text, are not load-bearing for the central claim.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Sparse preference supervision interacts with dense speech generation under shared-parameter updates in a way that creates unreliable gradients.
Reference graph
Works this paper leans on
-
[1]
Training Verifiers to Solve Math Word Problems
arXiv preprint arXiv:2110.14168
-
[2]
Step-Audio-AQAA: a fully end-to-end expressive large audio language model
arXiv preprint arXiv:2506.08967
-
[3]
UFT: Unifying supervised and reinforcement fine-tuning
arXiv preprint arXiv:2505.16984
-
[4]
Paras2S: benchmarking and aligning spoken language models for paralinguistic-aware speech-to-speech interaction
arXiv preprint arXiv:2511.08723
discussion (0)