Piano Transcription by Hierarchical Language Modeling with Pretrained Roll-based Encoders

📅 2025-01-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address key limitations in automatic music transcription (AMT)—namely, frame-level systems’ reliance on hand-crafted thresholds, the difficulty of autoregressive language models (LMs) in modeling long sequences, and their high computational cost—this paper proposes an end-to-end hierarchical language modeling paradigm. It couples a pretrained roll encoder (e.g., TRF or CNN) with an autoregressive LM decoder, decomposing note prediction into three conditional generation stages: onset/pitch → velocity → offset. Crucially, this is the first work to jointly optimize the roll encoder and LM decoder, eliminating post-hoc thresholding. Evaluated with two benchmark roll-based encoders, the method achieves absolute improvements of +0.01 and +0.022 in onset-offset-velocity F1 score over prior roll-based approaches, demonstrating consistent gains across encoders. The implementation is publicly available.

📝 Abstract
Automatic Music Transcription (AMT), which aims to extract musical notes from raw audio, typically uses frame-level systems with piano-roll outputs or language model (LM)-based systems with note-level predictions. However, frame-level systems require manual thresholding, while LM-based systems struggle with long sequences. In this paper, we propose a hybrid method combining pre-trained roll-based encoders with an LM decoder to leverage the strengths of both approaches. In addition, our approach employs a hierarchical prediction strategy, first predicting onset and pitch, then velocity, and finally offset. This hierarchical strategy reduces computational costs by breaking long sequences into separate hierarchies. Evaluated on two benchmark roll-based encoders, our method outperforms traditional piano-roll outputs by 0.01 and 0.022 in onset-offset-velocity F1 score, demonstrating its potential as a performance-enhancing plug-in for arbitrary roll-based music transcription encoders. We release the code of this work at https://github.com/yongyizang/AMT_train.
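The hierarchical decoding described above can be sketched in a few lines. The sketch below is a minimal, illustrative toy: the shapes, dummy "models", and helper names are assumptions for illustration, not the paper's actual architecture (which jointly trains a pretrained roll encoder with an autoregressive LM decoder).

```python
import numpy as np

# Toy sketch of three-stage hierarchical decoding:
# onset/pitch -> velocity -> offset, each stage conditioned on the
# encoder features and the previous stage's tokens.
rng = np.random.default_rng(0)

def encode_roll(audio_frames):
    """Stand-in for a pretrained roll-based encoder (e.g. a CNN):
    maps audio frames to per-frame latent features."""
    return rng.standard_normal((audio_frames.shape[0], 16))

def decode_onset_pitch(features):
    """Stage 1: emit (onset_frame, pitch) tokens. A real system decodes
    these autoregressively; a dummy projection is used here purely for
    illustration."""
    scores = features @ rng.standard_normal((16, 88))  # 88 piano keys
    onsets, pitches = np.nonzero(scores > 4.0)
    return list(zip(onsets.tolist(), pitches.tolist()))

def decode_velocity(features, onset_pitch):
    """Stage 2: predict a MIDI velocity for each (onset, pitch) token."""
    return [int(abs(features[t].sum()) * 10) % 128 for t, _ in onset_pitch]

def decode_offset(features, onset_pitch):
    """Stage 3: predict an offset frame strictly after each note's onset."""
    return [t + 1 + int(abs(features[t, 0]) * 4) for t, _ in onset_pitch]

audio = rng.standard_normal((100, 229))  # 100 frames of dummy input
feats = encode_roll(audio)
onset_pitch = decode_onset_pitch(feats)
notes = [
    (t, p, v, off)
    for (t, p), v, off in zip(onset_pitch,
                              decode_velocity(feats, onset_pitch),
                              decode_offset(feats, onset_pitch))
]
# Each note is a complete (onset, pitch, velocity, offset) event,
# produced stage by stage rather than by thresholding a full piano roll.
```

The key design point the sketch mirrors is the factorization: each later stage conditions on the shorter token sequence produced by the earlier one, so no single autoregressive pass has to model the full note sequence at once.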
Problem

Research questions and friction points this paper is trying to address.

Automatic Music Transcription
Accuracy Improvement
Resource Efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automatic Music Transcription
Hierarchical Language Model
Pre-trained Systems