F3-Tokenizer: Taming Audio Autoencoder Latents for Understanding and Generation

Dinghao Zhou; Di Wu; Pengyu Cheng; Shengfan Shen; Sixiang Lv; Xingchen Song

arxiv: 2606.06357 · v1 · pith:XAQ7EFX7new · submitted 2026-06-04 · 💻 cs.SD · cs.AI· eess.AS

F3-Tokenizer: Taming Audio Autoencoder Latents for Understanding and Generation

Dinghao Zhou , Xingchen Song , Di Wu , Pengyu Cheng , Shengfan Shen , Sixiang Lv This is my paper

Pith reviewed 2026-06-27 23:34 UTC · model grok-4.3

classification 💻 cs.SD cs.AIeess.AS

keywords audio tokenizerautoencoder latentsnoise regularizationaudio understandingaudio generationrepresentation encodercontinuous latents

0 comments

The pith

A noise-regularized bottleneck plus latent-side representation encoder lets continuous audio autoencoder latents serve both semantic understanding and autoregressive generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Continuous audio autoencoders produce latents that reconstruct waveforms effectively yet lack the structure needed for understanding tasks, while semantic encoders capture meaning but cannot decode back to audio. The paper introduces channel normalization and stochastic perturbation in the bottleneck to control latent scale without variational regularization, then trains a representation encoder on the frozen latents using RQ-MTP and frozen-LLM supervision. If successful, the same system yields high-dimensional vectors usable for understanding while leaving normalized continuous latents intact as targets for generation models.

Core claim

By replacing KL-based variational training with a noise-regularized bottleneck of channel normalization and stochastic perturbation, and by training a representation encoder on frozen autoencoder latents via RQ-MTP and frozen-LLM supervision, the tokenizer supplies high-dimensional representations for understanding while preserving normalized continuous latents as generation targets.

What carries the argument

Noise-regularized autoencoder bottleneck (channel normalization and stochastic perturbation) together with a latent-side representation encoder trained via RQ-MTP and frozen-LLM supervision.

If this is right

Normalized continuous latents remain directly usable as targets for autoregressive waveform generation.
High-dimensional outputs from the representation encoder support downstream understanding tasks.
A single tokenizer handles both understanding pipelines and generation pipelines without separate encoders.
Reconstruction fidelity is retained without KL-based variational regularization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Unified audio models could train on one latent space for both semantic tasks and synthesis.
The same regularization pattern might apply to continuous autoencoders in other modalities.
Downstream tests on speech recognition or music tasks would measure whether the representations transfer effectively.

Load-bearing premise

The noise-regularized bottleneck combined with training a representation encoder on frozen latents via RQ-MTP and frozen-LLM supervision will simultaneously deliver effective semantic representations and usable continuous latents without degrading reconstruction quality.

What would settle it

An experiment in which either waveform reconstruction quality falls below standard autoencoder levels or the extracted representations underperform on understanding benchmarks such as audio classification or retrieval.

Figures

Figures reproduced from arXiv: 2606.06357 by Dinghao Zhou, Di Wu, Pengyu Cheng, Shengfan Shen, Sixiang Lv, Xingchen Song.

**Figure 2.** Figure 2: TTS convergence with reconstruction-only and representation-based tokenizer outputs. [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

read the original abstract

Continuous audio autoencoders reconstruct waveforms well but often produce latents with weak structure for understanding, while self-supervised audio encoders capture semantics but are not directly decodable. This mismatch complicates a single audio tokenizer that must support both understanding and generation. We adapt continuous autoencoder latents to this setting with two components: a noise-regularized autoencoder bottleneck and a latent-side representation encoder. The bottleneck uses channel normalization and stochastic perturbation instead of KL-based variational training, yielding scale-controlled continuous latents for reconstruction and autoregressive generation. The representation encoder is trained on frozen autoencoder latents with RQ-MTP and frozen-LLM supervision. The resulting tokenizer provides high-dimensional representations for understanding while preserving normalized continuous latents as generation targets

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a clean way to keep continuous audio latents usable for generation while adding semantic representations on top via a separate encoder.

read the letter

The main contribution is the noise-regularized bottleneck (channel normalization plus stochastic perturbation) that replaces KL training, paired with a frozen-latent representation encoder trained by RQ-MTP and frozen-LLM supervision.

This setup directly targets the mismatch the abstract describes: autoencoders reconstruct well but lack structure for understanding, while self-supervised encoders are not decodable. The design keeps the autoencoder untouched during the second stage, which is a sensible separation.

What the paper does well is spell out the concrete choices and why they matter for scale control and autoregressive use. The stress-test note is right that no obvious internal contradiction appears in the construction.

The soft spot is the lack of any reported metrics or ablations in the abstract. Without numbers on reconstruction quality, semantic performance, or comparisons to prior tokenizers, it is hard to judge whether the two goals are actually met at the same time. If the full paper shows solid results and controls, that gap closes.

This is for audio researchers building unified models for speech or music who need one tokenizer for both tasks. It is worth sending to peer review because the problem is real and the technical moves are specific enough to evaluate.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes F3-Tokenizer to reconcile continuous audio autoencoder latents (strong for reconstruction/generation but weak for semantics) with self-supervised encoders (strong for semantics but not decodable). It introduces a noise-regularized bottleneck using channel normalization and stochastic perturbation (instead of KL variational training) to produce scale-controlled continuous latents, plus a separate representation encoder trained on frozen autoencoder latents via RQ-MTP and frozen-LLM supervision. The central claim is that the resulting tokenizer simultaneously yields high-dimensional representations suitable for understanding tasks while preserving usable normalized continuous latents as generation targets.

Significance. If the empirical claims hold, the work would be significant for audio modeling by enabling a single tokenizer that supports both semantic understanding and autoregressive generation without the usual trade-off between discrete semantic tokens and continuous reconstruction targets. The design choices (normalization+perturbation for the bottleneck, frozen components for the representation encoder) are pragmatic and avoid common pitfalls of variational methods or joint training instability.

major comments (2)

[Abstract and §1] Abstract and §1: No empirical results, metrics (e.g., reconstruction SNR, semantic downstream accuracy, generation quality), ablation studies, or comparisons are presented to support the claim that the noise-regularized bottleneck plus RQ-MTP/frozen-LLM representation encoder simultaneously deliver effective semantic representations and usable continuous latents without degrading reconstruction. This absence is load-bearing for the central claim.
The weakest assumption (that channel normalization + stochastic perturbation combined with training on frozen latents will preserve reconstruction quality while adding semantic structure) is stated as an empirical outcome but is not tested or quantified anywhere in the manuscript.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that the absence of empirical results in the abstract and §1, and the lack of direct quantification for the bottleneck assumptions, are substantive issues that must be addressed.

read point-by-point responses

Referee: [Abstract and §1] Abstract and §1: No empirical results, metrics (e.g., reconstruction SNR, semantic downstream accuracy, generation quality), ablation studies, or comparisons are presented to support the claim that the noise-regularized bottleneck plus RQ-MTP/frozen-LLM representation encoder simultaneously deliver effective semantic representations and usable continuous latents without degrading reconstruction. This absence is load-bearing for the central claim.

Authors: We agree that the abstract and §1 contain no quantitative results, metrics, ablations or comparisons. The submitted manuscript presents the method description without supporting experiments in those sections. We will revise by adding a concise summary of key metrics (reconstruction SNR, downstream accuracies, generation quality) and comparisons to the abstract and §1, while expanding the experimental section with the full ablations and baselines. revision: yes
Referee: The weakest assumption (that channel normalization + stochastic perturbation combined with training on frozen latents will preserve reconstruction quality while adding semantic structure) is stated as an empirical outcome but is not tested or quantified anywhere in the manuscript.

Authors: We acknowledge that the manuscript asserts the outcome of the noise-regularized bottleneck without providing direct empirical tests or quantification of reconstruction preservation versus semantic gains. We will add dedicated ablation experiments in the revision that isolate the effect of channel normalization and stochastic perturbation on reconstruction quality and on downstream semantic performance. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The provided abstract and description outline an empirical architectural proposal: a noise-regularized bottleneck (channel normalization + stochastic perturbation) on a continuous autoencoder, plus a separate representation encoder trained via RQ-MTP and frozen-LLM supervision on frozen latents. No equations, derivation steps, fitted parameters renamed as predictions, or self-citation chains are visible that would reduce any claimed result to its inputs by construction. The outcome (simultaneous semantic representations and usable continuous latents) is presented as an empirical consequence of these design choices rather than a logical necessity or self-referential fit. This matches the default expectation of a non-circular paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no equations, training details, or experimental sections are provided to identify free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5667 in / 1111 out tokens · 20297 ms · 2026-06-27T23:34:45.721988+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

23 extracted references · 6 linked inside Pith

[1]

wav2vec 2.0: A frame- work for self-supervised learning of speech representations.arXiv preprint arXiv:2006.11477, 2020

Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A frame- work for self-supervised learning of speech representations.arXiv preprint arXiv:2006.11477, 2020

arXiv 2006
[2]

Audiolm: A language modeling approach to audio generation.arXiv preprint arXiv:2209.03143, 2022

Zalán Borsos, Raphaël Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matt Sharifi, Olivier Teboul, David Grangier, Marco Tagliasacchi, and Neil Zeghidour. Audiolm: A language modeling approach to audio generation.arXiv preprint arXiv:2209.03143, 2022

arXiv 2022
[3]

High fidelity neural audio compression.arXiv preprint arXiv:2210.13438, 2022

Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi. High fidelity neural audio compression.arXiv preprint arXiv:2210.13438, 2022

Pith/arXiv arXiv 2022
[4]

Nest-rq: Next token prediction for speech self-supervised pre-training

Minglun Han, Ye Bai, Chen Shen, Youjia Huang, Mingkun Huang, Zehua Lin, Linhao Dong, Lu Lu, and Yuxuan Wang. Nest-rq: Next token prediction for speech self-supervised pre-training. arXiv preprint arXiv:2409.08680, 2024

arXiv 2024
[5]

Hubert: Self-supervised speech representation learning by masked prediction of hidden units.arXiv preprint arXiv:2106.07447, 2021

Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. Hubert: Self-supervised speech representation learning by masked prediction of hidden units.arXiv preprint arXiv:2106.07447, 2021

arXiv 2021
[6]

DiTAR: Diffusion transformer autoregressive modeling for speech generation

Dongya Jia, Zhuo Chen, Jiawei Chen, Chenpeng Du, Jian Wu, Jian Cong, Xiaobin Zhuang, Chumin Li, Zhen Wei, Yuping Wang, and Yuxuan Wang. DiTAR: Diffusion transformer autoregressive modeling for speech generation. InProceedings of the 42nd International Conference on Machine Learning, volume 267 ofProceedings of Machine Learning Research, pages 27255–27270....

2025
[7]

High-fidelity audio compression with improved RVQGAN

Rithesh Kumar, Prem Seetharaman, Alejandro Luebs, Ishaan Kumar, and Kundan Kumar. High-fidelity audio compression with improved RVQGAN. InAdvances in Neural Information Processing Systems, volume 36, 2023

2023
[8]

V oicebox: Text-guided multilingual universal speech generation at scale.arXiv preprint arXiv:2306.15687, 2023

Matthew Le, Apoorv Vyas, Bowen Shi, Brian Karrer, Leda Sari, Rashel Moritz, Mary Williamson, Vimal Manohar, Yossi Adi, Jay Mahadeokar, and Wei-Ning Hsu. V oicebox: Text-guided multilingual universal speech generation at scale.arXiv preprint arXiv:2306.15687, 2023

arXiv 2023
[9]

Spec- trostream: A versatile neural codec for general audio.arXiv preprint arXiv:2508.05207, 2025

Yunpeng Li, Kehang Han, Brian McWilliams, Zalan Borsos, and Marco Tagliasacchi. Spec- trostream: A versatile neural codec for general audio.arXiv preprint arXiv:2508.05207, 2025

arXiv 2025
[10]

Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022

Pith/arXiv arXiv 2022
[11]

Nextstep-1: Toward autoregressive image generation with continuous tokens at scale.arXiv preprint arXiv:2508.10711, 2025

NextStep Team. Nextstep-1: Toward autoregressive image generation with continuous tokens at scale.arXiv preprint arXiv:2508.10711, 2025

arXiv 2025
[12]

Vibevoice technical report.arXiv preprint arXiv:2508.19205, 2025

Zhiliang Peng, Jianwei Yu, Wenhui Wang, Yaoyao Chang, Yutao Sun, Li Dong, Yi Zhu, Weijiang Xu, Hangbo Bao, Zehua Wang, Shaohan Huang, Yan Xia, and Furu Wei. Vibevoice technical report.arXiv preprint arXiv:2508.19205, 2025

arXiv 2025
[13]

RAE-ar: Replacing variational autoencoders with representation autoencoders for autoregressive image generation.arXiv preprint, 2026

RAE-AR Team. RAE-ar: Replacing variational autoencoders with representation autoencoders for autoregressive image generation.arXiv preprint, 2026

2026
[14]

Any2Speech: Borderless long speech synthesis.arXiv preprint arXiv:2603.19798, 2026

Xingchen Song, Di Wu, Dinghao Zhou, Pengyu Cheng, Shengfan Shen, Sixiang Lv, et al. Any2Speech: Borderless long speech synthesis.arXiv preprint arXiv:2603.19798, 2026. 9

Pith/arXiv arXiv 2026
[15]

Multimodal latent language modeling with next-token diffusion.arXiv preprint arXiv:2412.08635, 2024

Yutao Sun, Hangbo Bao, Wenhui Wang, Zhiliang Peng, Li Dong, Shaohan Huang, Jianyong Wang, and Furu Wei. Multimodal latent language modeling with next-token diffusion.arXiv preprint arXiv:2412.08635, 2024

arXiv 2024
[16]

Neural codec language models are zero-shot text to speech synthesizers.arXiv preprint arXiv:2301.02111, 2023

Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, Lei He, Sheng Zhao, and Furu Wei. Neural codec language models are zero-shot text to speech synthesizers.arXiv preprint arXiv:2301.02111, 2023

Pith/arXiv arXiv 2023
[17]

Semavoice: Semantic-aware continuous autoregressive speech synthesis.arXiv preprint arXiv:2605.16964, 2026

Huimeng Wang, Hui Lu, Jiajun Deng, Haoning Xu, Youjun Chen, Xueyuan Chen, Zhaoqing Li, Shuhai Peng, Shiyin Kang, and Xunying Liu. Semavoice: Semantic-aware continuous autoregressive speech synthesis.arXiv preprint arXiv:2605.16964, 2026

Pith/arXiv arXiv 2026
[18]

MiMo-Audio-Tokenizer

Xiaomi MiMo Team. MiMo-Audio-Tokenizer. https://huggingface.co/XiaomiMiMo/ MiMo-Audio-Tokenizer, 2025

2025
[19]

Ming- UniAudio: Speech LLM for joint understanding, generation and editing with unified representa- tion.arXiv preprint arXiv:2511.05516, 2025

Canxiang Yan, Chunxiang Jin, Dawei Huang, Haibing Yu, Han Peng, Si Chen, et al. Ming- UniAudio: Speech LLM for joint understanding, generation and editing with unified representa- tion.arXiv preprint arXiv:2511.05516, 2025

arXiv 2025
[20]

X-Codec-2.0: An open source semantic-acoustic codec for speech language modeling

Zhen Ye. X-Codec-2.0: An open source semantic-acoustic codec for speech language modeling. https://github.com/zhenye234/X-Codec-2.0, 2025

2025
[21]

Codec does matter: Exploring the semantic shortcoming of codec for audio language model.arXiv preprint arXiv:2408.17175, 2024

Zhen Ye, Peiwen Sun, Jiahe Lei, Hongzhan Lin, Xu Tan, Zheqi Dai, Qiuqiang Kong, Jianyi Chen, Jiahao Pan, Qifeng Liu, Yike Guo, and Wei Xue. Codec does matter: Exploring the semantic shortcoming of codec for audio language model.arXiv preprint arXiv:2408.17175, 2024

arXiv 2024
[22]

Soundstream: An end-to-end neural audio codec.arXiv preprint arXiv:2107.03312, 2021

Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, and Marco Tagliasacchi. Soundstream: An end-to-end neural audio codec.arXiv preprint arXiv:2107.03312, 2021

arXiv 2021
[23]

Diffusion transformers with representation autoencoders

Boyang Zheng, Guanhua Ma, Haotian Li, Xiaoliang Wang, Jing Yang, Jun Zhang, Yuxin Liu, Bin Lin, Quan Liu, and Lei Zhang. Diffusion transformers with representation autoencoders. arXiv preprint arXiv:2510.11690, 2025. 10

Pith/arXiv arXiv 2025

[1] [1]

wav2vec 2.0: A frame- work for self-supervised learning of speech representations.arXiv preprint arXiv:2006.11477, 2020

Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A frame- work for self-supervised learning of speech representations.arXiv preprint arXiv:2006.11477, 2020

arXiv 2006

[2] [2]

Audiolm: A language modeling approach to audio generation.arXiv preprint arXiv:2209.03143, 2022

Zalán Borsos, Raphaël Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matt Sharifi, Olivier Teboul, David Grangier, Marco Tagliasacchi, and Neil Zeghidour. Audiolm: A language modeling approach to audio generation.arXiv preprint arXiv:2209.03143, 2022

arXiv 2022

[3] [3]

High fidelity neural audio compression.arXiv preprint arXiv:2210.13438, 2022

Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi. High fidelity neural audio compression.arXiv preprint arXiv:2210.13438, 2022

Pith/arXiv arXiv 2022

[4] [4]

Nest-rq: Next token prediction for speech self-supervised pre-training

Minglun Han, Ye Bai, Chen Shen, Youjia Huang, Mingkun Huang, Zehua Lin, Linhao Dong, Lu Lu, and Yuxuan Wang. Nest-rq: Next token prediction for speech self-supervised pre-training. arXiv preprint arXiv:2409.08680, 2024

arXiv 2024

[5] [5]

Hubert: Self-supervised speech representation learning by masked prediction of hidden units.arXiv preprint arXiv:2106.07447, 2021

Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. Hubert: Self-supervised speech representation learning by masked prediction of hidden units.arXiv preprint arXiv:2106.07447, 2021

arXiv 2021

[6] [6]

DiTAR: Diffusion transformer autoregressive modeling for speech generation

Dongya Jia, Zhuo Chen, Jiawei Chen, Chenpeng Du, Jian Wu, Jian Cong, Xiaobin Zhuang, Chumin Li, Zhen Wei, Yuping Wang, and Yuxuan Wang. DiTAR: Diffusion transformer autoregressive modeling for speech generation. InProceedings of the 42nd International Conference on Machine Learning, volume 267 ofProceedings of Machine Learning Research, pages 27255–27270....

2025

[7] [7]

High-fidelity audio compression with improved RVQGAN

Rithesh Kumar, Prem Seetharaman, Alejandro Luebs, Ishaan Kumar, and Kundan Kumar. High-fidelity audio compression with improved RVQGAN. InAdvances in Neural Information Processing Systems, volume 36, 2023

2023

[8] [8]

V oicebox: Text-guided multilingual universal speech generation at scale.arXiv preprint arXiv:2306.15687, 2023

Matthew Le, Apoorv Vyas, Bowen Shi, Brian Karrer, Leda Sari, Rashel Moritz, Mary Williamson, Vimal Manohar, Yossi Adi, Jay Mahadeokar, and Wei-Ning Hsu. V oicebox: Text-guided multilingual universal speech generation at scale.arXiv preprint arXiv:2306.15687, 2023

arXiv 2023

[9] [9]

Spec- trostream: A versatile neural codec for general audio.arXiv preprint arXiv:2508.05207, 2025

Yunpeng Li, Kehang Han, Brian McWilliams, Zalan Borsos, and Marco Tagliasacchi. Spec- trostream: A versatile neural codec for general audio.arXiv preprint arXiv:2508.05207, 2025

arXiv 2025

[10] [10]

Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022

Pith/arXiv arXiv 2022

[11] [11]

Nextstep-1: Toward autoregressive image generation with continuous tokens at scale.arXiv preprint arXiv:2508.10711, 2025

NextStep Team. Nextstep-1: Toward autoregressive image generation with continuous tokens at scale.arXiv preprint arXiv:2508.10711, 2025

arXiv 2025

[12] [12]

Vibevoice technical report.arXiv preprint arXiv:2508.19205, 2025

Zhiliang Peng, Jianwei Yu, Wenhui Wang, Yaoyao Chang, Yutao Sun, Li Dong, Yi Zhu, Weijiang Xu, Hangbo Bao, Zehua Wang, Shaohan Huang, Yan Xia, and Furu Wei. Vibevoice technical report.arXiv preprint arXiv:2508.19205, 2025

arXiv 2025

[13] [13]

RAE-ar: Replacing variational autoencoders with representation autoencoders for autoregressive image generation.arXiv preprint, 2026

RAE-AR Team. RAE-ar: Replacing variational autoencoders with representation autoencoders for autoregressive image generation.arXiv preprint, 2026

2026

[14] [14]

Any2Speech: Borderless long speech synthesis.arXiv preprint arXiv:2603.19798, 2026

Xingchen Song, Di Wu, Dinghao Zhou, Pengyu Cheng, Shengfan Shen, Sixiang Lv, et al. Any2Speech: Borderless long speech synthesis.arXiv preprint arXiv:2603.19798, 2026. 9

Pith/arXiv arXiv 2026

[15] [15]

Multimodal latent language modeling with next-token diffusion.arXiv preprint arXiv:2412.08635, 2024

Yutao Sun, Hangbo Bao, Wenhui Wang, Zhiliang Peng, Li Dong, Shaohan Huang, Jianyong Wang, and Furu Wei. Multimodal latent language modeling with next-token diffusion.arXiv preprint arXiv:2412.08635, 2024

arXiv 2024

[16] [16]

Neural codec language models are zero-shot text to speech synthesizers.arXiv preprint arXiv:2301.02111, 2023

Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, Lei He, Sheng Zhao, and Furu Wei. Neural codec language models are zero-shot text to speech synthesizers.arXiv preprint arXiv:2301.02111, 2023

Pith/arXiv arXiv 2023

[17] [17]

Semavoice: Semantic-aware continuous autoregressive speech synthesis.arXiv preprint arXiv:2605.16964, 2026

Huimeng Wang, Hui Lu, Jiajun Deng, Haoning Xu, Youjun Chen, Xueyuan Chen, Zhaoqing Li, Shuhai Peng, Shiyin Kang, and Xunying Liu. Semavoice: Semantic-aware continuous autoregressive speech synthesis.arXiv preprint arXiv:2605.16964, 2026

Pith/arXiv arXiv 2026

[18] [18]

MiMo-Audio-Tokenizer

Xiaomi MiMo Team. MiMo-Audio-Tokenizer. https://huggingface.co/XiaomiMiMo/ MiMo-Audio-Tokenizer, 2025

2025

[19] [19]

Ming- UniAudio: Speech LLM for joint understanding, generation and editing with unified representa- tion.arXiv preprint arXiv:2511.05516, 2025

Canxiang Yan, Chunxiang Jin, Dawei Huang, Haibing Yu, Han Peng, Si Chen, et al. Ming- UniAudio: Speech LLM for joint understanding, generation and editing with unified representa- tion.arXiv preprint arXiv:2511.05516, 2025

arXiv 2025

[20] [20]

X-Codec-2.0: An open source semantic-acoustic codec for speech language modeling

Zhen Ye. X-Codec-2.0: An open source semantic-acoustic codec for speech language modeling. https://github.com/zhenye234/X-Codec-2.0, 2025

2025

[21] [21]

Codec does matter: Exploring the semantic shortcoming of codec for audio language model.arXiv preprint arXiv:2408.17175, 2024

Zhen Ye, Peiwen Sun, Jiahe Lei, Hongzhan Lin, Xu Tan, Zheqi Dai, Qiuqiang Kong, Jianyi Chen, Jiahao Pan, Qifeng Liu, Yike Guo, and Wei Xue. Codec does matter: Exploring the semantic shortcoming of codec for audio language model.arXiv preprint arXiv:2408.17175, 2024

arXiv 2024

[22] [22]

Soundstream: An end-to-end neural audio codec.arXiv preprint arXiv:2107.03312, 2021

Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, and Marco Tagliasacchi. Soundstream: An end-to-end neural audio codec.arXiv preprint arXiv:2107.03312, 2021

arXiv 2021

[23] [23]

Diffusion transformers with representation autoencoders

Boyang Zheng, Guanhua Ma, Haotian Li, Xiaoliang Wang, Jing Yang, Jun Zhang, Yuxin Liu, Bin Lin, Quan Liu, and Lei Zhang. Diffusion transformers with representation autoencoders. arXiv preprint arXiv:2510.11690, 2025. 10

Pith/arXiv arXiv 2025