Decoupled Residual Quantization for Robust Semantic IDs in Recommendation

Guanxing Zhang; Junjie Wang; Weijie Bian; Xuesi Wang; Ziliang Wang

arxiv: 2606.01844 · v1 · pith:NU7J5HPFnew · submitted 2026-06-01 · 💻 cs.IR


Pith reviewed 2026-06-28 12:50 UTC · model grok-4.3


keywords: semantic ids · recommendation systems · vector quantization · residual quantization · codebook utilization · discrete representations · retrieval robustness


The pith

Expected codeword overlap and effective codebook capacity diagnose semantic ID tokenizer failures and motivate decoupled residual quantization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces two metrics, expected codeword overlap and effective codebook capacity, to quantify why semantic ID tokenizers produce poor discrete representations of items. Overlap captures expected codeword confusion under small perturbations at retrieval time, while capacity converts that confusion into an effective count of usable, separated codes. These metrics connect boundary instability to both code usage imbalance and the geometry of the embedding space. As a concrete response, the authors propose decoupled residual quantization that treats continuous reconstruction and discrete distribution matching as separate objectives. Experiments on one industrial dataset indicate that symbolic robustness, reconstruction fidelity, and behavior-aware soft matching pull in different directions, so no single tokenizer optimizes all three at once.
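To make the overlap idea concrete, here is a minimal Python sketch. It is not the paper's definition: it estimates how often an item's nearest codeword changes when isotropic Gaussian noise is added to its embedding. The function names `nearest_code` and `flip_rate`, the Gaussian noise model, and the parameters `sigma` and `n_draws` are assumptions made for illustration; the paper's actual perturbation model and overlap formula are not reproduced on this page.

```python
import numpy as np

def nearest_code(x, codebook):
    # Squared Euclidean distance from every row of x to every codeword,
    # then the index of the closest codeword per row.
    d2 = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def flip_rate(x, codebook, sigma, n_draws=32, seed=0):
    """Fraction of (item, noise draw) pairs whose nearest codeword changes.

    Assumption: retrieval-time perturbation is modeled as isotropic
    Gaussian noise with standard deviation `sigma`.
    """
    rng = np.random.default_rng(seed)
    clean = nearest_code(x, codebook)
    flips = 0
    for _ in range(n_draws):
        noisy = x + sigma * rng.standard_normal(x.shape)
        flips += int((nearest_code(noisy, codebook) != clean).sum())
    return flips / (n_draws * len(x))
```

Under these assumptions, a flip rate near zero at a realistic `sigma` would suggest stable assignment boundaries, while a high value would suggest that many items sit close to a boundary between codewords.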

Core claim

Semantic ID quality is diagnosed by computing expected codeword overlap under retrieval-time perturbation and converting it into effective codebook capacity; this framework shows that codebook underutilization and Euclidean distortion are the main drivers of tokenizer failure. Decoupled residual quantization addresses the problem by performing continuous geometry reconstruction independently of discrete distribution matching, yielding token sequences that maintain better separation while still fitting observed item behaviors.

What carries the argument

Decoupled Residual Quantization (DRQ), which separates continuous geometry reconstruction from discrete distribution matching.
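For orientation, the sketch below shows plain residual quantization, the baseline that DRQ builds on. It is not DRQ itself, whose training objectives are not specified on this page. Each level picks the nearest codeword to the current residual and subtracts it, so an item ends up with one code index per level. The function name `residual_quantize` and the assumption that the codebooks are already trained are illustrative.

```python
import numpy as np

def residual_quantize(x, codebooks):
    """Encode each row of x as one code index per level.

    x         : array of shape (n_items, dim)
    codebooks : list of arrays, each of shape (codebook_size, dim),
                assumed to be trained already (hypothetical inputs)
    Returns (ids, residual) with ids of shape (n_items, n_levels).
    """
    residual = x.astype(float).copy()
    ids = []
    for cb in codebooks:
        # Nearest codeword to the current residual, per item.
        d2 = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(axis=-1)
        idx = d2.argmin(axis=1)
        ids.append(idx)
        # Pass what is still unexplained on to the next level.
        residual = residual - cb[idx]
    return np.stack(ids, axis=1), residual
```

The returned `ids` rows play the role of semantic IDs, and the norm of the final `residual` is the reconstruction error that a fidelity-oriented objective would try to minimize.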

If this is right

Tokenizers can be ranked by their expected overlap under realistic perturbation rather than by reconstruction loss alone.
Effective capacity below the nominal codebook size signals that many codes are effectively unused or overlapping.
Improving one aspect of semantic ID quality (robustness, fidelity, or soft matching) does not automatically improve the others.
Design choices in quantization should be evaluated against all three objectives rather than a single aggregate score.
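The gap between nominal and effective codebook size can be illustrated with a common usage-only proxy: the perplexity (exponential of the Shannon entropy) of the code assignment distribution. This is an assumption-laden stand-in and not the paper's effective codebook capacity, which is derived from overlap and also accounts for geometric separation. The function name `usage_perplexity` is hypothetical.

```python
import numpy as np

def usage_perplexity(assignments, codebook_size):
    """Effective number of codes in use, from assignment frequencies only.

    Equals codebook_size when usage is uniform and 1.0 when every item
    maps to the same code. Ignores how well separated the codes are.
    """
    counts = np.bincount(assignments, minlength=codebook_size)
    p = counts / counts.sum()
    p = p[p > 0]  # unused codes contribute nothing to the entropy
    entropy = -(p * np.log(p)).sum()
    return float(np.exp(entropy))
```

A value well below `codebook_size` indicates skewed or collapsed usage. Because this proxy sees only frequencies, two tokenizers with the same perplexity can still differ in how easily their codes are confused under perturbation, which is the gap the paper's overlap-based capacity is meant to close.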

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same overlap-capacity lens could be applied to other discrete representation tasks such as image or audio tokenization to diagnose similar underutilization problems.
If the metrics generalize, they offer a parameter-free way to set codebook sizes before training rather than by post-hoc inspection.
The multi-objective finding suggests that future work may need explicit Pareto optimization or task-specific weighting of the three criteria.

Load-bearing premise

That the two proposed metrics correctly locate the root causes of tokenizer failure and that the observed trade-offs among robustness, fidelity, and soft matching will hold outside the single proprietary dataset.

What would settle it

A controlled experiment on a public recommendation dataset in which the measured overlap and capacity values fail to predict downstream retrieval degradation or in which DRQ shows no improvement over standard residual quantization on any of the three quality axes.

Original abstract

Semantic IDs represent items as shared discrete token sequences and have become a practical tool for recommendation and retrieval. Yet it remains difficult to tell why a tokenizer fails: poor quality may come from codebook underutilization, unstable decision boundaries, or geometric distortion of the embedding space. This paper develops a quantitative framework for diagnosing these failures through expected codeword overlap and effective codebook capacity. The former measures expected codeword confusion under retrieval-time perturbation, while the latter converts that confusion into an effective number of usable, well-separated codes. The framework links semantic boundary confusion to both code usage imbalance and Euclidean geometric constraints. As a proof of concept, we present Decoupled Residual Quantization (DRQ), which separates continuous geometry reconstruction from discrete distribution matching. Experiments on a large-scale industrial dataset show that Semantic ID quality is multi-objective: symbolic robustness, reconstruction fidelity, and behavior-aware soft matching each stress different aspects of a tokenizer. These downstream observations are based on one proprietary industrial dataset, so they should be read as a case study rather than a universal benchmark claim.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

New diagnostic metrics for semantic ID tokenizers plus a DRQ method, but everything rests on one proprietary dataset labeled as a case study.


The paper introduces expected codeword overlap and effective codebook capacity as ways to quantify tokenizer failures from codebook underuse, unstable boundaries, or embedding distortion. It also presents DRQ to separate continuous geometry reconstruction from discrete distribution matching. These ideas are presented without reducing to prior published results in the abstract, so they count as new for this subfield.

The framework makes sense on its own terms and the authors are explicit that semantic ID quality involves trade-offs across symbolic robustness, reconstruction fidelity, and behavior-aware soft matching. The case study on their industrial data shows these aspects can pull in different directions, which is a useful observation for anyone building these systems.

The main limitation is that all supporting evidence comes from a single proprietary dataset. The abstract calls the results a case study rather than a general claim, which is honest, but it leaves open whether the metrics actually identify root causes reliably or whether the multi-objective pattern holds on public benchmarks or other domains. The concern about generalization beyond this dataset is well founded.

This is for engineers and researchers working on semantic IDs inside industrial recommendation pipelines. A reader already dealing with tokenizer robustness would find the metrics worth trying. It deserves peer review so the authors can add public data or the community can check generalization.

Referee Report

1 major / 1 minor

Summary. The manuscript develops a quantitative diagnostic framework for semantic ID tokenizers in recommendation systems, centered on two metrics—expected codeword overlap (measuring confusion under retrieval-time perturbation) and effective codebook capacity (converting overlap into an effective count of usable codes). It links these to root causes including codebook underutilization, unstable boundaries, and geometric distortion. The paper proposes Decoupled Residual Quantization (DRQ), which separates continuous geometry reconstruction from discrete distribution matching. Experiments on a single large-scale proprietary industrial dataset are presented explicitly as a case study, showing that semantic ID quality is multi-objective (symbolic robustness, reconstruction fidelity, and behavior-aware soft matching each stress different tokenizer aspects).

Significance. If the metrics prove to correctly identify the stated failure modes and the multi-objective characterization generalizes, the framework could offer a principled way to evaluate and improve tokenizers beyond ad-hoc metrics. The paper's explicit framing as a case study and its focus on first-principles metrics (overlap and capacity) are strengths that support targeted follow-up work, though the single-dataset scope constrains broader claims about tokenizer design.

major comments (1)

[Abstract] The multi-objective claim (that the three aspects 'each stress different aspects of a tokenizer') and the assertion that the metrics diagnose root causes rest entirely on experiments from one proprietary industrial dataset labeled a case study. No public benchmarks, cross-domain ablations, or additional datasets are referenced, so it is unclear whether expected codeword overlap and effective codebook capacity reliably link to underutilization/unstable boundaries/distortion outside this data distribution.

minor comments (1)

The abstract supplies no equations or pseudocode for the two core metrics or for DRQ; adding a brief formal definition or high-level equation in the main text would improve accessibility without altering the case-study framing.

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for highlighting the scope limitations of our empirical claims. Our work is explicitly positioned as a case study on proprietary industrial data, with the metrics derived from first principles. We address the concern point-by-point below and note the constraints on additional validation.


Referee: [Abstract] The multi-objective claim (that the three aspects 'each stress different aspects of a tokenizer') and the assertion that the metrics diagnose root causes rest entirely on experiments from one proprietary industrial dataset labeled a case study. No public benchmarks, cross-domain ablations, or additional datasets are referenced, so it is unclear whether expected codeword overlap and effective codebook capacity reliably link to underutilization/unstable boundaries/distortion outside this data distribution.

Authors: We agree that the multi-objective characterization and the empirical linkage of the metrics to specific failure modes (underutilization, unstable boundaries, distortion) are demonstrated solely on the single proprietary dataset, which the manuscript already labels a case study. The metrics are derived from first-principles considerations of perturbation-induced overlap and effective capacity, independent of any particular data distribution; the industrial experiments serve only to illustrate their diagnostic utility in a realistic retrieval setting. We will revise the abstract to more explicitly separate the first-principles metric definitions from the case-study observations and to further qualify the multi-objective claim as dataset-specific. We cannot add public benchmarks or cross-domain ablations because the data is proprietary and no equivalent public recommendation datasets with the required scale and retrieval-time perturbation characteristics are available to us.

revision: partial

standing simulated objections (not resolved)

Inability to provide public benchmarks, cross-domain ablations, or additional datasets due to the proprietary nature of the industrial data used in the case study.

Circularity Check

0 steps flagged

No circularity: metrics defined from first principles with no self-referential reductions


The provided abstract and text contain no equations, derivations, or self-citations. Expected codeword overlap is explicitly defined as measuring expected codeword confusion under retrieval-time perturbation, and effective codebook capacity is defined as converting that confusion into an effective number of usable codes; these are direct first-principles constructions rather than reductions to fitted inputs or prior results. The multi-objective claim is presented as an empirical observation from a single proprietary dataset explicitly labeled a case study, without any load-bearing derivation chain that reduces to its own inputs by construction. No patterns of self-definition, fitted predictions, or ansatz smuggling appear.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient technical detail to enumerate specific free parameters, axioms, or invented entities; DRQ is described as a method rather than a new postulated entity.



