pith. machine review for the scientific record.

arxiv: 1612.01840 · v3 · submitted 2016-12-06 · 💻 cs.SD · cs.IR

Recognition: unknown

FMA: A Dataset For Music Analysis

Authors on Pith: no claims yet
classification: 💻 cs.SD · cs.IR
keywords: audio, dataset, music, large, some, suitable, tasks, accessible
Original abstract

We introduce the Free Music Archive (FMA), an open and easily accessible dataset suitable for evaluating several tasks in MIR, a field concerned with browsing, searching, and organizing large music collections. The community's growing interest in feature and end-to-end learning is, however, restrained by the limited availability of large audio datasets. The FMA aims to overcome this hurdle by providing 917 GiB and 343 days of Creative Commons-licensed audio from 106,574 tracks by 16,341 artists across 14,854 albums, arranged in a hierarchical taxonomy of 161 genres. It provides full-length, high-quality audio and pre-computed features, together with track- and user-level metadata, tags, and free-form text such as biographies. We describe the dataset and how it was created, propose a train/validation/test split and three subsets, discuss some suitable MIR tasks, and evaluate baselines for genre recognition. Code, data, and usage examples are available at https://github.com/mdeff/fma.
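The proposed train/validation/test split can be consumed directly from the track metadata. A minimal sketch with pandas using a toy metadata table — note the real tracks.csv in the linked repository uses a two-level column header, so the flat column names below (`track_id`, `genre_top`, `split`) are illustrative, not the repository's exact schema:

```python
import pandas as pd

# Toy stand-in for FMA track metadata. Column names are illustrative;
# the actual tracks.csv at https://github.com/mdeff/fma nests them
# under a two-level header (e.g. ('set', 'split'), ('track', 'genre_top')).
tracks = pd.DataFrame({
    "track_id": [2, 3, 5, 10, 20, 26],
    "genre_top": ["Hip-Hop", "Hip-Hop", "Rock", "Pop", "Rock", "Pop"],
    "split": ["training", "training", "validation",
              "training", "test", "validation"],
})

# Select each partition of the proposed split for genre recognition.
train = tracks[tracks["split"] == "training"]
val = tracks[tracks["split"] == "validation"]
test = tracks[tracks["split"] == "test"]

print(len(train), len(val), len(test))  # 3 2 1
```

Using a split column baked into the metadata (rather than re-sampling per experiment) keeps genre-recognition results comparable across papers that evaluate on the dataset.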

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 9 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models

    cs.SD 2025-07 unverdicted novelty 7.0

    Audio Flamingo 3 introduces an open large audio-language model achieving new state-of-the-art results on over 20 audio understanding and reasoning benchmarks using a unified encoder and curriculum training on open data.

  2. Reducing Linguistic Hallucination in LM-Based Speech Enhancement via Noise-Invariant Acoustic-Semantic Distillation

    eess.AS 2026-05 unverdicted novelty 6.0

    L3-SE reduces linguistic hallucination in LM-based speech enhancement by distilling noise-invariant acoustic-semantic representations from noisy inputs to condition an autoregressive decoder-only language model.

  3. Fast Text-to-Audio Generation with One-Step Sampling via Energy-Scoring and Auxiliary Contextual Representation Distillation

    cs.SD 2026-05 unverdicted novelty 5.0

    A one-step text-to-audio model using energy-distance training and contextual distillation outperforms prior fast baselines on AudioCaps and achieves up to 8.5x faster inference than the multi-step IMPACT system with c...

  4. UniPASE: A Generative Model for Universal Speech Enhancement with High Fidelity and Low Hallucinations

    eess.AS 2026-04 unverdicted novelty 5.0

    UniPASE extends the PASE framework with DeWavLM-Omni to convert degraded speech into high-fidelity, low-hallucination audio across sampling rates via phonetic enhancement, acoustic adaptation, and multi-rate vocoding.

  5. Audio-Cogito: Towards Deep Audio Reasoning in Large Audio Language Models

    eess.AS 2026-04 unverdicted novelty 5.0

    Audio-Cogito is an open-source LALM using Cogito-pipe data curation and self-distillation to achieve leading open-source performance on audio reasoning benchmarks.

  6. Multimodal Dataset Normalization and Perceptual Validation for Music-Taste Correspondences

    cs.SD 2026-04 unverdicted novelty 5.0

    Music-flavor correspondences transfer from small human-annotated collections to large synthetic FMA datasets, with computational targets showing significant alignment to human listener ratings.

  7. Cross-Cultural Bias in Mel-Scale Representations: Evidence and Alternatives from Speech and Music

    cs.SD 2026-04 unverdicted novelty 5.0

    Mel-scale features exhibit measurable cultural bias with 12.5% higher WER on tonal languages and 15.7% F1 drop on non-Western music, while adaptive alternatives reduce these gaps substantially.

  8. HAFM: Hierarchical Autoregressive Foundation Model for Music Accompaniment Generation

    cs.SD 2026-04 unverdicted novelty 5.0

    HAFM uses a hierarchical autoregressive model with dual-rate HuBERT and EnCodec tokens to generate coherent instrumental music from vocals, achieving FAD 2.08 on MUSDB18 while matching prior systems with fewer parameters.

  9. A Knowledge-Driven Approach to Target Speech Extraction in the Presence of Background Sound Effects for Cinematic Audio Source Separation (CASS)

    eess.AS 2026-04 unverdicted novelty 4.0

    Detecting manners of articulation and adding them as knowledge features improves target speech extraction in cinematic audio with background sounds.