pith. machine review for the scientific record.

arxiv: 2605.09120 · v1 · submitted 2026-05-09 · 💻 cs.IR · cs.SD

Recognition: 1 theorem link

· Lean Theorem

Reddit2Deezer: A Scalable Dataset for Real-World Grounded Conversational Music Recommendation


Pith reviewed 2026-05-12 03:26 UTC · model grok-4.3

classification 💻 cs.IR cs.SD
keywords conversational music recommendation · dataset · Reddit · Deezer · grounded dialogues · music metadata · dialogue corpus · paraphrased data

The pith

A dataset of 190,000 real Reddit music conversations is linked to Deezer for scalable grounded recommendation research.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Conversational music recommendation research faces a tradeoff between small authentic dialogue sets and large but artificial ones. The paper builds Reddit2Deezer by extracting 190k {thread, leaf-comment} pairs from Reddit music discussions and linking each mentioned musical entity to a Deezer identifier. This supplies natural conversations together with audio previews, genre tags, popularity scores, and BPM values. The release includes both a raw version preserving original wording and a paraphrased version for long-term reproducibility. Human validation checks confirm the quality of the dialogues, the accuracy of item grounding, and the fidelity of the paraphrases.

Core claim

We introduce Reddit2Deezer, a reality-grounded CMR resource derived from 190k unique {thread, leaf-comment} pairs. We release the resource in two versions: a raw version that preserves authenticity, and a paraphrased version that maximizes long-term reproducibility. Each musical entity is linked to a Deezer identifier, which provides straightforward access to audio previews and rich metadata (e.g., genre tags, popularity, BPM), opening the door to future research on content-grounded conversational recommendation. A human validation confirms the quality of the dialogues, item grounding, and paraphrases.

What carries the argument

Extraction of 190k {thread, leaf-comment} pairs from Reddit music discussions, each grounded by linkage to a Deezer music identifier.
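To make the shape of one such grounded pair concrete, here is a minimal sketch; the field names, example text, and the Deezer identifier are illustrative assumptions, not the dataset's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class GroundedPair:
    """One {thread, leaf-comment} pair with Deezer grounding.

    All field names here are hypothetical, chosen for illustration only.
    """
    thread_id: str                 # Reddit thread identifier
    thread_text: str               # original post from a music subreddit
    leaf_comment: str              # terminal comment in the thread's reply tree
    deezer_track_ids: list[int] = field(default_factory=list)  # grounded entities

pair = GroundedPair(
    thread_id="t3_example",
    thread_text="Looking for upbeat indie tracks for a long drive",
    leaf_comment="Try 'Electric Feel' by MGMT",
    deezer_track_ids=[123456],     # hypothetical Deezer identifier
)
assert pair.deezer_track_ids == [123456]
```

The essential property is the last field: every musical mention in the leaf comment resolves to a Deezer id, which is what separates this resource from raw Reddit dumps.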

If this is right

  • Training and evaluation of conversational music recommenders can now use naturally occurring dialogues at a scale previously unavailable.
  • Content features including audio previews, genres, and BPM become directly usable inside conversational recommendation pipelines.
  • Research on content-grounded conversational recommendation can draw on real user-generated discussions instead of constructed ones.
  • The paraphrased release supports reproducible experiments while aiming to keep core dialogue and grounding characteristics intact.
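The second point can be exercised against Deezer's public track endpoint. The sketch below builds the request URL and selects the metadata fields the paper highlights; the network call is defined but deliberately not executed, and the exact response schema should be checked against Deezer's API documentation:

```python
import json
import urllib.request

DEEZER_TRACK_URL = "https://api.deezer.com/track/{id}"  # Deezer's public API

def track_url(deezer_id: int) -> str:
    """Build the metadata URL for a grounded Deezer track id."""
    return DEEZER_TRACK_URL.format(id=deezer_id)

def fetch_track_metadata(deezer_id: int) -> dict:
    """Fetch a track's title, BPM, popularity rank, and 30s preview URL.

    Performs a live HTTP request, so it is defined but not called here.
    """
    with urllib.request.urlopen(track_url(deezer_id)) as resp:
        data = json.load(resp)
    return {key: data.get(key) for key in ("title", "bpm", "rank", "preview")}

assert track_url(123456) == "https://api.deezer.com/track/123456"
```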

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same Reddit extraction and grounding approach could generate comparable resources for conversational recommendation in domains such as books or films.
  • Direct comparisons between models trained on this real data versus synthetic corpora could quantify the benefit of authenticity for downstream performance.
  • Access to audio previews opens the possibility of studying how acoustic properties interact with conversational context in recommendation decisions.

Load-bearing premise

That Reddit threads and leaf comments constitute authentic high-quality conversational music discussions, and that Deezer linkage plus paraphrasing preserves the necessary conversational and grounding properties.

What would settle it

A live user study in which models trained on Reddit2Deezer show no gain in recommendation accuracy or user satisfaction over models trained on existing synthetic conversational datasets.

Figures

Figures reproduced from arXiv: 2605.09120 by Haven Kim, Julian McAuley.

Figure 1. nDCG@5 by recommender turn position (per-turn view). [image not reproduced]
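For readers unfamiliar with the metric plotted in Figure 1, nDCG@k rewards ground-truth items ranked near the top and discounts hits logarithmically by position. A minimal binary-relevance sketch:

```python
import math

def ndcg_at_k(ranked_ids, relevant_ids, k=5):
    """nDCG@k with binary relevance: discounted gain over the ideal ordering."""
    dcg = sum(
        1.0 / math.log2(rank + 2)              # rank 0 -> discount log2(2)
        for rank, item in enumerate(ranked_ids[:k])
        if item in relevant_ids
    )
    ideal = sum(1.0 / math.log2(rank + 2)
                for rank in range(min(k, len(relevant_ids))))
    return dcg / ideal if ideal > 0 else 0.0

# Ground-truth track ranked first gives a perfect score at k=5
assert ndcg_at_k([42, 7, 9], {42}) == 1.0
```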
Original abstract

Conversational music recommendation (CMR) research currently faces a tradeoff between authentic dialogue corpora that are limited in scale and synthesized corpora that scale up but whose conversations are artificially constructed rather than naturally observed. In this paper, we introduce Reddit2Deezer, a reality-grounded CMR resource derived from 190k unique {thread, leaf-comment} pairs. We release the resource in two versions: a raw version that preserves authenticity, and a paraphrased version that maximizes long-term reproducibility. Each musical entity is linked to a Deezer identifier, which provides straightforward access to audio previews and rich metadata (e.g., genre tags, popularity, BPM), opening the door to future research on content-grounded conversational recommendation. A human validation confirms the quality of the dialogues, item grounding, and paraphrases. The dataset is available at https://huggingface.co/datasets/McAuley-Lab/Reddit2Deezer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Reddit2Deezer, a large-scale dataset for conversational music recommendation (CMR) derived from 190k unique {thread, leaf-comment} pairs sourced from Reddit. Each musical entity is linked to a Deezer identifier for metadata and audio access. The resource is released in raw (authenticity-preserving) and paraphrased (reproducibility-focused) versions, with human validation confirming dialogue quality, item grounding, and paraphrase fidelity. The dataset is hosted on Hugging Face.

Significance. If the construction and validation confirm that the pairs constitute authentic, multi-turn conversational music discussions with reliable grounding, the dataset would meaningfully advance CMR research by providing a scalable, real-world alternative to limited authentic corpora or artificial syntheses. The Deezer linkage adds practical value for content-based and audio-aware modeling. The contribution is primarily as a data resource rather than a modeling advance.

major comments (2)
  1. [Abstract and Dataset Construction] The resource is described as consisting of 'dialogues' from {thread, leaf-comment} pairs, but leaf comments are terminal nodes in the comment tree. Pairing each only with the original post omits all intermediate parent comments. This risks reducing the data to post-plus-isolated-response pairs rather than full multi-turn histories, which would undermine utility for training or evaluating context-tracking CMR models that rely on accumulating preferences across turns. The human validation protocol should explicitly state whether annotators assessed multi-turn coherence or only topical relevance and grounding.
  2. [Human Validation] The abstract states that 'a human validation confirms the quality of the dialogues, item grounding, and paraphrases,' but provides no details on annotator count, inter-annotator agreement metrics (e.g., Fleiss' kappa or percentage agreement), sample size, or the precise instructions given to annotators. These omissions make it difficult to assess the reliability and reproducibility of the quality claims, which are central to the dataset's value proposition.
minor comments (2)
  1. [Introduction] Introduction: Adding a comparison table of Reddit2Deezer against prior CMR datasets (scale, authenticity, grounding method, multi-turn support) would better position the contribution.
  2. [Dataset Release] Dataset Release: The Hugging Face link is provided, but supplementary statistics (e.g., distribution of thread lengths, unique users, music genres, or paraphrase edit distances) would help readers evaluate data characteristics without downloading the full resource.
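The paraphrase edit distances suggested in the second minor comment are typically reported as Levenshtein distance between raw and paraphrased text. A character-level sketch (a token-level variant is the obvious alternative for longer comments):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum edits (insert, delete, substitute) to turn string a into b."""
    prev = list(range(len(b) + 1))             # distances for empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,        # delete ca
                           cur[j - 1] + 1,     # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]

assert levenshtein("kitten", "sitting") == 3
```

Reporting the distribution of such distances would let readers judge how aggressively the paraphrased release departs from the raw wording.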

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address the two major comments point by point below. We agree with both observations and will revise the manuscript to improve clarity, accuracy, and completeness.

Point-by-point responses
  1. Referee: [Abstract and Dataset Construction] The resource is described as consisting of 'dialogues' from {thread, leaf-comment} pairs, but leaf comments are terminal nodes in the comment tree. Pairing each only with the original post omits all intermediate parent comments. This risks reducing the data to post-plus-isolated-response pairs rather than full multi-turn histories, which would undermine utility for training or evaluating context-tracking CMR models that rely on accumulating preferences across turns. The human validation protocol should explicitly state whether annotators assessed multi-turn coherence or only topical relevance and grounding.

    Authors: We appreciate the referee for identifying this structural limitation. The dataset is deliberately constructed as {original post, leaf comment} pairs, where the leaf comment is the terminal node in a Reddit comment thread. This choice enables scalable extraction of authentic, user-generated responses grounded in music entities while preserving the original text. However, it does not retain the full chain of intermediate comments, resulting in two-turn (post-response) pairs rather than complete multi-turn dialogue histories. We will revise the manuscript to (1) explicitly describe the data as post-response pairs, (2) remove or qualify the term 'dialogues' where it implies full multi-turn context, (3) discuss the implications and limitations for context-tracking CMR models, and (4) clarify that human annotators evaluated topical relevance, item grounding, and coherence between the post and leaf comment only (not multi-turn coherence across omitted intermediates). revision: yes

  2. Referee: [Human Validation] The abstract states that 'a human validation confirms the quality of the dialogues, item grounding, and paraphrases,' but provides no details on annotator count, inter-annotator agreement metrics (e.g., Fleiss' kappa or percentage agreement), sample size, or the precise instructions given to annotators. These omissions make it difficult to assess the reliability and reproducibility of the quality claims, which are central to the dataset's value proposition.

    Authors: We agree that the current manuscript provides insufficient detail on the human validation protocol. We will add a dedicated subsection (or expand the existing validation description) that reports the number of annotators, inter-annotator agreement metrics (e.g., percentage agreement and/or Cohen's kappa), the size of the annotated sample, and the exact annotation guidelines and questions presented to annotators. These additions will directly address reproducibility concerns and strengthen the evidential basis for the quality claims. revision: yes
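The agreement statistics the authors commit to reporting can be computed directly from two annotators' labels over the same sample. A minimal Cohen's kappa sketch (the observed term is the percentage agreement mentioned in the rebuttal; the example labels are invented):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each annotator's marginal label rates
    expected = sum(counts_a[c] * counts_b[c]
                   for c in set(labels_a) | set(labels_b)) / n ** 2
    return (observed - expected) / (1 - expected) if expected < 1 else 1.0

a = ["ok", "ok", "bad", "ok", "bad", "ok"]
b = ["ok", "ok", "bad", "bad", "bad", "ok"]
assert abs(cohens_kappa(a, b) - 2 / 3) < 1e-9   # 5/6 observed, 0.5 expected
```

Note that kappa can be low despite high raw agreement when one label dominates (the "two paradoxes" of Feinstein and Cicchetti, cited by the paper), so reporting both numbers, as the rebuttal proposes, is the right call.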

Circularity Check

0 steps flagged

Dataset release paper exhibits no circular derivation

full rationale

The paper introduces Reddit2Deezer by extracting {thread, leaf-comment} pairs from Reddit, linking musical entities to Deezer IDs, offering raw and paraphrased versions, and reporting human validation of quality. No equations, fitted parameters, predictions, or derivations are present. The contribution is the data resource itself; construction steps are described procedurally without reducing to self-definition, fitted inputs renamed as predictions, or load-bearing self-citations. The derivation chain is self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on domain assumptions about Reddit data quality and Deezer utility; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption Reddit threads and leaf comments form natural, high-quality conversational music discussions
    Basis for claiming the resource is 'reality-grounded' and suitable for CMR.
  • domain assumption Linking musical mentions to Deezer identifiers provides useful content grounding via metadata and audio
    Stated as enabling future content-grounded research.

pith-pipeline@v0.9.0 · 5456 in / 1294 out tokens · 33395 ms · 2026-05-12T03:26:13.659357+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 4 internal anchors

  1. [1]

    Sebastiano Antenucci, Simone Boglio, Emanuele Chioso, Ervin Dervishaj, Shuwen Kang, Tommaso Scarlatti, and Maurizio Ferrari Dacrema. 2018. Artist-driven layering and user’s behaviour impact on recommendations in a playlist continuation scenario. In Proceedings of the ACM Recommender Systems Challenge

  2. [2]

    Thierry Bertin-Mahieux, Daniel PW Ellis, Brian Whitman, and Paul Lamere

  3. [3]

    The million song dataset. (2011)

  4. [4]

    Sebastian Böck, Filip Korzeniowski, Jan Schlüter, Florian Krebs, and Gerhard Widmer. 2016. Madmom: A new python audio and music signal processing library. In Proceedings of the 24th ACM International Conference on Multimedia. 1174–1178

  5. [5]

    Arun Tejasvi Chaganty, Megan Leszczynski, Shu Zhang, Ravi Ganti, Krisztian Balog, and Filip Radlinski. 2023. Beyond single items: Exploring user preferences in item sets with the conversational playlist curation dataset. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2754–2764

  6. [6]

    Ching-Wei Chen, Paul Lamere, Markus Schedl, and Hamed Zamani. 2018. RecSys challenge 2018: Automatic music playlist continuation. In Proceedings of the 12th ACM Conference on Recommender Systems. 527–528

  7. [7]

    Keunwoo Choi, Seungheon Doh, and Juhan Nam. 2025. Talkplaydata 2: An agentic synthetic data pipeline for multimodal conversational music recommendation. arXiv preprint arXiv:2509.09685 (2025)

  8. [8]

    Konstantina Christakopoulou, Filip Radlinski, and Katja Hofmann. 2016. Towards conversational recommender systems. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 815–824

  9. [9]

    W.G. Cochran. 1963. Sampling Techniques. John Wiley & Sons. https://books.google.com/books?id=Y-SxXwAACAAJ

  10. [10]

    SeungHeon Doh, Keunwoo Choi, Daeyong Kwon, Taesu Kim, and Juhan Nam

  11. [11]

    Music Discovery Dialogue Generation Using Human Intent Analysis and Large Language Models. arXiv:2411.07439 [cs.SD] https://arxiv.org/abs/2411.07439

  12. [12]

    SeungHeon Doh, Keunwoo Choi, Jongpil Lee, and Juhan Nam. 2023. LP-MusicCaps: LLM-based pseudo music captioning. arXiv preprint arXiv:2307.16372 (2023)

  13. [13]

    Seungheon Doh, Keunwoo Choi, and Juhan Nam. 2025. TALKPLAY: Multimodal Music Recommendation with Large Language Models. arXiv:2502.13713 [cs.IR] https://arxiv.org/abs/2502.13713

  14. [14]

    Seungheon Doh, Keunwoo Choi, and Juhan Nam. 2025. Talkplay-tools: Conversational music recommendation with LLM tool calling. arXiv preprint arXiv:2510.01698 (2025)

  15. [15]

    Alvan R Feinstein and Domenic V Cicchetti. 1990. High agreement but low kappa: I. The problems of two paradoxes. Journal of Clinical Epidemiology 43, 6 (1990), 543–549

  16. [16]

    M Goker and Cynthia Thompson. 2000. The adaptive place advisor: A conversational recommendation system. In Proceedings of the 8th German Workshop on Case-Based Reasoning. Citeseer, 187–198

  17. [17]

    Zhankui He, Zhouhang Xie, Rahul Jha, Harald Steck, Dawen Liang, Yesu Feng, Bodhisattwa Prasad Majumder, Nathan Kallus, and Julian McAuley. 2023. Large Language Models as Zero-Shot Conversational Recommenders. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (CIKM ’23). ACM, 720–730. doi:10.1145/3583780.3614949

  18. [18]

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685 [cs.CL] https://arxiv.org/abs/2106.09685

  19. [19]

    Hyunsik Jeon, Satoshi Koide, Yu Wang, Zhankui He, and Julian McAuley. 2025. LaViC: Adapting Large Vision-Language Models to Visually-Aware Conversational Recommendation. arXiv:2503.23312 [cs.AI] https://arxiv.org/abs/2503.23312

  20. [20]

    Clark Mingxuan Ju, Liam Collins, Leonardo Neves, Bhuvesh Kumar, Louis Yufeng Wang, Tong Zhao, and Neil Shah. 2025. Generative Recommendation with Semantic IDs: A Practitioner’s Handbook. In Proceedings of the 34th ACM International Conference on Information and Knowledge Management (Seoul, Republic of Korea) (CIKM ’25). Association for Computing Machinery...

  21. [21]

    Haven Kim, Yupeng Hou, and Julian McAuley. 2026. FusID: Modality-Fused Semantic IDs for Generative Music Recommendation. arXiv preprint arXiv:2601.08764 (2026)

  22. [22]

    Ilya Loshchilov and Frank Hutter. 2019. Decoupled Weight Decay Regularization. arXiv:1711.05101 [cs.LG] https://arxiv.org/abs/1711.05101

  23. [23]

    Alessandro B Melchiorre, Elena V Epure, Shahed Masoudian, Gustavo Escobedo, Anna Hausberger, Manuel Moussallam, and Markus Schedl. 2025. Just ask for music (jam): Multimodal and personalized natural language music recommendation. In Proceedings of the Nineteenth ACM Conference on Recommender Systems. 615–620

  24. [24]

    Enrico Palumbo, Gustavo Penha, Andreas Damianou, José Luis Redondo García, Timothy Christopher Heath, Alice Wang, Hugues Bouchard, and Mounia Lalmas. 2025. Text2Tracks: Prompt-based Music Recommendation via Generative Retrieval. arXiv:2503.24193 [cs.IR] https://arxiv.org/abs/2503.24193

  25. [25]

    Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2023. Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning. PMLR, 28492–28518

  26. [26]

    Markus Schedl, Stefan Brandl, Oleg Lesota, Emilia Parada-Cabaleiro, David Penz, and Navid Rekabsaz. 2022. LFM-2b: A dataset of enriched music listening events for recommender systems research and fairness analysis. In Proceedings of the 2022 Conference on Human Information Interaction and Retrieval. 337–341

  27. [27]

    Abhinav Kumar Singh, Harsha Vardhan Khurdula, Yoeven D Khemlani, and Vineet Agarwal. 2026. The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models. arXiv:2604.25359 [cs.CL] https://arxiv.org/abs/2604.25359

  28. [28]

    Rohan Surana, Amit Namburi, Gagan Mundada, Abhay Lal, Zachary Novack, Julian McAuley, and Junda Wu. 2025. Musicrs: Benchmarking audio-centric conversational recommendation. arXiv preprint arXiv:2509.19469 (2025)

  29. [29]

    Maksims Volkovs, Himanshu Rai, Zhaoyue Cheng, Ga Wu, Yichao Lu, and Scott Sanner. 2018. Two-stage model for automatic playlist continuation at scale. In Proceedings of the ACM Recommender Systems Challenge 2018. 1–6

  30. [30]

    Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, and Shlomo Dubnov. 2023. Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 1–5

  31. [31]

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  32. [32]

    Yongfeng Zhang, Xu Chen, Qingyao Ai, Liu Yang, and W Bruce Croft. 2018. Towards conversational search and recommendation: System ask, user respond. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management. 177–186