pith. machine review for the scientific record.

arxiv: 2605.09120 · v1 · submitted 2026-05-09 · 💻 cs.IR · cs.SD

Recognition: 1 theorem link

· Lean Theorem

Reddit2Deezer: A Scalable Dataset for Real-World Grounded Conversational Music Recommendation


Pith reviewed 2026-05-12 03:26 UTC · model grok-4.3

classification 💻 cs.IR cs.SD
keywords conversational music recommendation · dataset · Reddit · Deezer · grounded dialogues · music metadata · dialogue corpus · paraphrased data

The pith

A dataset of 190,000 real Reddit music conversations is linked to Deezer for scalable grounded recommendation research.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Conversational music recommendation research faces a tradeoff between small authentic dialogue sets and large but artificial ones. The paper builds Reddit2Deezer by extracting 190k {thread, leaf-comment} pairs from Reddit music discussions and linking each mentioned musical entity to a Deezer identifier. This supplies natural conversations together with audio previews, genre tags, popularity scores, and BPM values. The release includes both a raw version preserving original wording and a paraphrased version for long-term reproducibility. Human validation checks confirm the quality of the dialogues, the accuracy of item grounding, and the fidelity of the paraphrases.

Core claim

We introduce Reddit2Deezer, a reality-grounded CMR resource derived from 190k unique {thread, leaf-comment} pairs. We release the resource in two versions: a raw version that preserves authenticity, and a paraphrased version that maximizes long-term reproducibility. Each musical entity is linked to a Deezer identifier, which provides straightforward access to audio previews and rich metadata (e.g., genre tags, popularity, BPM), opening the door to future research on content-grounded conversational recommendation. A human validation confirms the quality of the dialogues, item grounding, and paraphrases.

What carries the argument

Extraction of 190k {thread, leaf-comment} pairs from Reddit music discussions, each grounded by linkage to a Deezer music identifier.
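To make the shape of one such grounded pair concrete, here is a minimal sketch; the field names, example text, and the Deezer identifier are illustrative assumptions, not the dataset's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class GroundedPair:
    """One {thread, leaf-comment} pair with Deezer grounding.

    All field names here are hypothetical, chosen for illustration only.
    """
    thread_id: str                 # Reddit thread identifier
    thread_text: str               # original post from a music subreddit
    leaf_comment: str              # terminal comment in the thread's reply tree
    deezer_track_ids: list[int] = field(default_factory=list)  # grounded entities

pair = GroundedPair(
    thread_id="t3_example",
    thread_text="Looking for upbeat indie tracks for a long drive",
    leaf_comment="Try 'Electric Feel' by MGMT",
    deezer_track_ids=[123456],     # hypothetical Deezer identifier
)
assert pair.deezer_track_ids == [123456]
```

The essential property is the last field: every musical mention in the leaf comment resolves to a Deezer id, which is what separates this resource from raw Reddit dumps.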

If this is right

  • Training and evaluation of conversational music recommenders can now use naturally occurring dialogues at a scale previously unavailable.
  • Content features including audio previews, genres, and BPM become directly usable inside conversational recommendation pipelines.
  • Research on content-grounded conversational recommendation can draw on real user-generated discussions instead of constructed ones.
  • The paraphrased release supports reproducible experiments while aiming to keep core dialogue and grounding characteristics intact.
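The second point can be exercised against Deezer's public track endpoint. The sketch below builds the request URL and selects the metadata fields the paper highlights; the network call is defined but deliberately not executed, and the exact response schema should be checked against Deezer's API documentation:

```python
import json
import urllib.request

DEEZER_TRACK_URL = "https://api.deezer.com/track/{id}"  # Deezer's public API

def track_url(deezer_id: int) -> str:
    """Build the metadata URL for a grounded Deezer track id."""
    return DEEZER_TRACK_URL.format(id=deezer_id)

def fetch_track_metadata(deezer_id: int) -> dict:
    """Fetch a track's title, BPM, popularity rank, and 30s preview URL.

    Performs a live HTTP request, so it is defined but not called here.
    """
    with urllib.request.urlopen(track_url(deezer_id)) as resp:
        data = json.load(resp)
    return {key: data.get(key) for key in ("title", "bpm", "rank", "preview")}

assert track_url(123456) == "https://api.deezer.com/track/123456"
```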

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same Reddit extraction and grounding approach could generate comparable resources for conversational recommendation in domains such as books or films.
  • Direct comparisons between models trained on this real data versus synthetic corpora could quantify the benefit of authenticity for downstream performance.
  • Access to audio previews opens the possibility of studying how acoustic properties interact with conversational context in recommendation decisions.

Load-bearing premise

That Reddit threads and leaf comments constitute authentic high-quality conversational music discussions, and that Deezer linkage plus paraphrasing preserves the necessary conversational and grounding properties.

What would settle it

A live user study in which models trained on Reddit2Deezer show no gain in recommendation accuracy or user satisfaction over models trained on existing synthetic conversational datasets.

Figures

Figures reproduced from arXiv: 2605.09120 by Haven Kim, Julian McAuley.

Figure 1. nDCG@5 by recommender turn position (per-turn view). [image not reproduced]
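For readers unfamiliar with the metric plotted in Figure 1, nDCG@k rewards ground-truth items ranked near the top and discounts hits logarithmically by position. A minimal binary-relevance sketch:

```python
import math

def ndcg_at_k(ranked_ids, relevant_ids, k=5):
    """nDCG@k with binary relevance: discounted gain over the ideal ordering."""
    dcg = sum(
        1.0 / math.log2(rank + 2)              # rank 0 -> discount log2(2)
        for rank, item in enumerate(ranked_ids[:k])
        if item in relevant_ids
    )
    ideal = sum(1.0 / math.log2(rank + 2)
                for rank in range(min(k, len(relevant_ids))))
    return dcg / ideal if ideal > 0 else 0.0

# Ground-truth track ranked first gives a perfect score at k=5
assert ndcg_at_k([42, 7, 9], {42}) == 1.0
```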
Original abstract

Conversational music recommendation (CMR) research currently faces a tradeoff between authentic dialogue corpora that are limited in scale and synthesized corpora that scale up but whose conversations are artificially constructed rather than naturally observed. In this paper, we introduce Reddit2Deezer, a reality-grounded CMR resource derived from 190k unique {thread, leaf-comment} pairs. We release the resource in two versions: a raw version that preserves authenticity, and a paraphrased version that maximizes long-term reproducibility. Each musical entity is linked to a Deezer identifier, which provides straightforward access to audio previews and rich metadata (e.g., genre tags, popularity, BPM), opening the door to future research on content-grounded conversational recommendation. A human validation confirms the quality of the dialogues, item grounding, and paraphrases. The dataset is available at https://huggingface.co/datasets/McAuley-Lab/Reddit2Deezer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Reddit2Deezer, a large-scale dataset for conversational music recommendation (CMR) derived from 190k unique {thread, leaf-comment} pairs sourced from Reddit. Each musical entity is linked to a Deezer identifier for metadata and audio access. The resource is released in raw (authenticity-preserving) and paraphrased (reproducibility-focused) versions, with human validation confirming dialogue quality, item grounding, and paraphrase fidelity. The dataset is hosted on Hugging Face.

Significance. If the construction and validation confirm that the pairs constitute authentic, multi-turn conversational music discussions with reliable grounding, the dataset would meaningfully advance CMR research by providing a scalable, real-world alternative to limited authentic corpora or artificial syntheses. The Deezer linkage adds practical value for content-based and audio-aware modeling. The contribution is primarily as a data resource rather than a modeling advance.

major comments (2)
  1. [Abstract and Dataset Construction] The resource is described as consisting of 'dialogues' from {thread, leaf-comment} pairs, but leaf comments are terminal nodes in the comment tree. Pairing each only with the original post omits all intermediate parent comments. This risks reducing the data to post-plus-isolated-response pairs rather than full multi-turn histories, which would undermine utility for training or evaluating context-tracking CMR models that rely on accumulating preferences across turns. The human validation protocol should explicitly state whether annotators assessed multi-turn coherence or only topical relevance and grounding.
  2. [Human Validation] The abstract states that 'a human validation confirms the quality of the dialogues, item grounding, and paraphrases,' but provides no details on annotator count, inter-annotator agreement metrics (e.g., Fleiss' kappa or percentage agreement), sample size, or the precise instructions given to annotators. These omissions make it difficult to assess the reliability and reproducibility of the quality claims, which are central to the dataset's value proposition.
minor comments (2)
  1. [Introduction] Introduction: Adding a comparison table of Reddit2Deezer against prior CMR datasets (scale, authenticity, grounding method, multi-turn support) would better position the contribution.
  2. [Dataset Release] Dataset Release: The Hugging Face link is provided, but supplementary statistics (e.g., distribution of thread lengths, unique users, music genres, or paraphrase edit distances) would help readers evaluate data characteristics without downloading the full resource.
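The paraphrase edit distances suggested in the second minor comment are typically reported as Levenshtein distance between raw and paraphrased text. A character-level sketch (a token-level variant is the obvious alternative for longer comments):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum edits (insert, delete, substitute) to turn string a into b."""
    prev = list(range(len(b) + 1))             # distances for empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,        # delete ca
                           cur[j - 1] + 1,     # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]

assert levenshtein("kitten", "sitting") == 3
```

Reporting the distribution of such distances would let readers judge how aggressively the paraphrased release departs from the raw wording.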

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address the two major comments point by point below. We agree with both observations and will revise the manuscript to improve clarity, accuracy, and completeness.

Point-by-point responses
  1. Referee: [Abstract and Dataset Construction] The resource is described as consisting of 'dialogues' from {thread, leaf-comment} pairs, but leaf comments are terminal nodes in the comment tree. Pairing each only with the original post omits all intermediate parent comments. This risks reducing the data to post-plus-isolated-response pairs rather than full multi-turn histories, which would undermine utility for training or evaluating context-tracking CMR models that rely on accumulating preferences across turns. The human validation protocol should explicitly state whether annotators assessed multi-turn coherence or only topical relevance and grounding.

    Authors: We appreciate the referee for identifying this structural limitation. The dataset is deliberately constructed as {original post, leaf comment} pairs, where the leaf comment is the terminal node in a Reddit comment thread. This choice enables scalable extraction of authentic, user-generated responses grounded in music entities while preserving the original text. However, it does not retain the full chain of intermediate comments, resulting in two-turn (post-response) pairs rather than complete multi-turn dialogue histories. We will revise the manuscript to (1) explicitly describe the data as post-response pairs, (2) remove or qualify the term 'dialogues' where it implies full multi-turn context, (3) discuss the implications and limitations for context-tracking CMR models, and (4) clarify that human annotators evaluated topical relevance, item grounding, and coherence between the post and leaf comment only (not multi-turn coherence across omitted intermediates). revision: yes

  2. Referee: [Human Validation] The abstract states that 'a human validation confirms the quality of the dialogues, item grounding, and paraphrases,' but provides no details on annotator count, inter-annotator agreement metrics (e.g., Fleiss' kappa or percentage agreement), sample size, or the precise instructions given to annotators. These omissions make it difficult to assess the reliability and reproducibility of the quality claims, which are central to the dataset's value proposition.

    Authors: We agree that the current manuscript provides insufficient detail on the human validation protocol. We will add a dedicated subsection (or expand the existing validation description) that reports the number of annotators, inter-annotator agreement metrics (e.g., percentage agreement and/or Cohen's kappa), the size of the annotated sample, and the exact annotation guidelines and questions presented to annotators. These additions will directly address reproducibility concerns and strengthen the evidential basis for the quality claims. revision: yes
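The agreement statistics the authors commit to reporting can be computed directly from two annotators' labels over the same sample. A minimal Cohen's kappa sketch (the observed term is the percentage agreement mentioned in the rebuttal; the example labels are invented):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each annotator's marginal label rates
    expected = sum(counts_a[c] * counts_b[c]
                   for c in set(labels_a) | set(labels_b)) / n ** 2
    return (observed - expected) / (1 - expected) if expected < 1 else 1.0

a = ["ok", "ok", "bad", "ok", "bad", "ok"]
b = ["ok", "ok", "bad", "bad", "bad", "ok"]
assert abs(cohens_kappa(a, b) - 2 / 3) < 1e-9   # 5/6 observed, 0.5 expected
```

Note that kappa can be low despite high raw agreement when one label dominates (the "two paradoxes" of Feinstein and Cicchetti, cited by the paper), so reporting both numbers, as the rebuttal proposes, is the right call.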

Circularity Check

0 steps flagged

Dataset release paper exhibits no circular derivation

full rationale

The paper introduces Reddit2Deezer by extracting {thread, leaf-comment} pairs from Reddit, linking musical entities to Deezer IDs, offering raw and paraphrased versions, and reporting human validation of quality. No equations, fitted parameters, predictions, or derivations are present. The contribution is the data resource itself; construction steps are described procedurally without reducing to self-definition, fitted inputs renamed as predictions, or load-bearing self-citations. The derivation chain is self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on domain assumptions about Reddit data quality and Deezer utility; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption Reddit threads and leaf comments form natural, high-quality conversational music discussions
    Basis for claiming the resource is 'reality-grounded' and suitable for CMR.
  • domain assumption Linking musical mentions to Deezer identifiers provides useful content grounding via metadata and audio
    Stated as enabling future content-grounded research.

pith-pipeline@v0.9.0 · 5456 in / 1294 out tokens · 33395 ms · 2026-05-12T03:26:13.659357+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 4 internal anchors

  1. [1]

    Sebastiano Antenucci, Simone Boglio, Emanuele Chioso, Ervin Dervishaj, Shuwen Kang, Tommaso Scarlatti, and Maurizio Ferrari Dacrema. 2018. Artist-driven layering and user’s behaviour impact on recommendations in a playlist continuation scenario. In Proceedings of the ACM Recommender Systems Challenge

  2. [2]

    Thierry Bertin-Mahieux, Daniel PW Ellis, Brian Whitman, and Paul Lamere

  3. [3]

    The million song dataset. (2011)

  4. [4]

    Sebastian Böck, Filip Korzeniowski, Jan Schlüter, Florian Krebs, and Gerhard Widmer. 2016. Madmom: A new python audio and music signal processing library. In Proceedings of the 24th ACM International Conference on Multimedia. 1174–1178

  5. [5]

    Arun Tejasvi Chaganty, Megan Leszczynski, Shu Zhang, Ravi Ganti, Krisztian Balog, and Filip Radlinski. 2023. Beyond single items: Exploring user preferences in item sets with the conversational playlist curation dataset. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2754–2764

  6. [6]

    Ching-Wei Chen, Paul Lamere, Markus Schedl, and Hamed Zamani. 2018. RecSys challenge 2018: Automatic music playlist continuation. In Proceedings of the 12th ACM Conference on Recommender Systems. 527–528

  7. [7]

    Keunwoo Choi, Seungheon Doh, and Juhan Nam. 2025. Talkplaydata 2: An agentic synthetic data pipeline for multimodal conversational music recommendation. arXiv preprint arXiv:2509.09685 (2025)

  8. [8]

    Konstantina Christakopoulou, Filip Radlinski, and Katja Hofmann. 2016. Towards conversational recommender systems. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 815–824

  9. [9]

    W.G. Cochran. 1963. Sampling Techniques. John Wiley & Sons. https://books.google.com/books?id=Y-SxXwAACAAJ

  10. [10]

    SeungHeon Doh, Keunwoo Choi, Daeyong Kwon, Taesu Kim, and Juhan Nam

  11. [11]

    Music Discovery Dialogue Generation Using Human Intent Analysis and Large Language Models. arXiv:2411.07439 [cs.SD] https://arxiv.org/abs/2411.07439

  12. [12]

    SeungHeon Doh, Keunwoo Choi, Jongpil Lee, and Juhan Nam. 2023. LP-MusicCaps: LLM-based pseudo music captioning. arXiv preprint arXiv:2307.16372 (2023)

  13. [13]

    Seungheon Doh, Keunwoo Choi, and Juhan Nam. 2025. TALKPLAY: Multimodal Music Recommendation with Large Language Models. arXiv:2502.13713 [cs.IR] https://arxiv.org/abs/2502.13713

  14. [14]

    Seungheon Doh, Keunwoo Choi, and Juhan Nam. 2025. Talkplay-tools: Conversational music recommendation with LLM tool calling. arXiv preprint arXiv:2510.01698 (2025)

  15. [15]

    Alvan R Feinstein and Domenic V Cicchetti. 1990. High agreement but low kappa: I. The problems of two paradoxes. Journal of Clinical Epidemiology 43, 6 (1990), 543–549

  16. [16]

    M Goker and Cynthia Thompson. 2000. The adaptive place advisor: A conversational recommendation system. In Proceedings of the 8th German Workshop on Case-Based Reasoning. Citeseer, 187–198

  17. [17]

    Zhankui He, Zhouhang Xie, Rahul Jha, Harald Steck, Dawen Liang, Yesu Feng, Bodhisattwa Prasad Majumder, Nathan Kallus, and Julian McAuley. 2023. Large Language Models as Zero-Shot Conversational Recommenders. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (CIKM ’23). ACM, 720–730. doi:10.1145/3583780.3614949

  18. [18]

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685 [cs.CL] https://arxiv.org/abs/2106.09685

  19. [19]

    Hyunsik Jeon, Satoshi Koide, Yu Wang, Zhankui He, and Julian McAuley. 2025. LaViC: Adapting Large Vision-Language Models to Visually-Aware Conversational Recommendation. arXiv:2503.23312 [cs.AI] https://arxiv.org/abs/2503.23312

  20. [20]

    Clark Mingxuan Ju, Liam Collins, Leonardo Neves, Bhuvesh Kumar, Louis Yufeng Wang, Tong Zhao, and Neil Shah. 2025. Generative Recommendation with Semantic IDs: A Practitioner’s Handbook. In Proceedings of the 34th ACM International Conference on Information and Knowledge Management (Seoul, Republic of Korea) (CIKM ’25). Association for Computing Machinery...

  21. [21]

    Haven Kim, Yupeng Hou, and Julian McAuley. 2026. FusID: Modality-Fused Semantic IDs for Generative Music Recommendation. arXiv preprint arXiv:2601.08764 (2026)

  22. [22]

    Ilya Loshchilov and Frank Hutter. 2019. Decoupled Weight Decay Regularization. arXiv:1711.05101 [cs.LG] https://arxiv.org/abs/1711.05101

  23. [23]

    Alessandro B Melchiorre, Elena V Epure, Shahed Masoudian, Gustavo Escobedo, Anna Hausberger, Manuel Moussallam, and Markus Schedl. 2025. Just ask for music (jam): Multimodal and personalized natural language music recommendation. In Proceedings of the Nineteenth ACM Conference on Recommender Systems. 615–620

  24. [24]

    Enrico Palumbo, Gustavo Penha, Andreas Damianou, José Luis Redondo García, Timothy Christopher Heath, Alice Wang, Hugues Bouchard, and Mounia Lalmas. 2025. Text2Tracks: Prompt-based Music Recommendation via Generative Retrieval. arXiv:2503.24193 [cs.IR] https://arxiv.org/abs/2503.24193

  25. [25]

    Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2023. Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning. PMLR, 28492–28518

  26. [26]

    Markus Schedl, Stefan Brandl, Oleg Lesota, Emilia Parada-Cabaleiro, David Penz, and Navid Rekabsaz. 2022. LFM-2b: A dataset of enriched music listening events for recommender systems research and fairness analysis. In Proceedings of the 2022 Conference on Human Information Interaction and Retrieval. 337–341

  27. [27]

    Abhinav Kumar Singh, Harsha Vardhan Khurdula, Yoeven D Khemlani, and Vineet Agarwal. 2026. The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models. arXiv:2604.25359 [cs.CL] https://arxiv.org/abs/2604.25359

  28. [28]

    Rohan Surana, Amit Namburi, Gagan Mundada, Abhay Lal, Zachary Novack, Julian McAuley, and Junda Wu. 2025. Musicrs: Benchmarking audio-centric conversational recommendation. arXiv preprint arXiv:2509.19469 (2025)

  29. [29]

    Maksims Volkovs, Himanshu Rai, Zhaoyue Cheng, Ga Wu, Yichao Lu, and Scott Sanner. 2018. Two-stage model for automatic playlist continuation at scale. In Proceedings of the ACM Recommender Systems Challenge 2018. 1–6

  30. [30]

    Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, and Shlomo Dubnov. 2023. Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 1–5

  31. [31]

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  32. [32]

    Yongfeng Zhang, Xu Chen, Qingyao Ai, Liu Yang, and W Bruce Croft. 2018. Towards conversational search and recommendation: System ask, user respond. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management. 177–186