Automatic Piano Transcription
Although the piano is the most widely studied instrument in polyphonic music transcription, the task remains far from solved. This is where a combination of deep learning techniques and handcrafted features can make a significant contribution to the quality of piano transcription. In this paper, we present two such systems that combine a deep convolutional neural network (CNN) with a set of handcrafted features, and we show that the combination yields a 10% improvement in F1 score over either method used individually. The resulting model is evaluated on the MAESTRO dataset, which contains piano performances recorded on Yamaha Disklavier instruments.
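The F1 comparison can be made concrete. The snippet below is an illustrative sketch, not the paper's evaluation code: it shows how frame-level F1 is derived from precision and recall given counts of correctly detected, spurious, and missed note frames.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Frame-level F1 from true-positive, false-positive, and
    false-negative counts (harmonic mean of precision and recall)."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Example: 80 correctly detected note frames, 20 spurious, 20 missed
print(f1_score(80, 20, 20))  # 0.8
```

A 10% relative improvement in this score means, roughly, that the combined system recovers the same notes with fewer spurious detections and fewer misses than either component alone.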
AMT stands for Automatic Music Transcription: the task of translating a musical recording into a score. A common approach is to predict the onsets and frames of each note, which can be done with deep convolutional neural networks, recurrent neural networks, or non-negative matrix factorization (NMF). In the NMF approach, the input waveform is first converted into a time-frequency representation, which is then decomposed into a set of basis spectra and their time-varying activations. Extensions of the basic model allow the spectra to account for frequency modulation and tuning changes.
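The NMF decomposition can be sketched with the classic multiplicative-update rules. The function below is an illustrative implementation, not any specific library's, and the tiny "spectrogram" is a hypothetical two-note mixture built from made-up spectra:

```python
import numpy as np

def nmf(V, k, n_iter=500, eps=1e-10, seed=0):
    """Decompose a nonnegative matrix V (freq x time) into basis spectra W
    (freq x k) and activations H (k x time) using multiplicative updates
    that reduce the Frobenius reconstruction error."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, k)) + eps
    H = rng.random((k, T)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy "spectrogram": two notes with distinct (hypothetical) spectra,
# overlapping in time during the second column
note_a = np.array([1.0, 0.0, 0.5, 0.0])
note_b = np.array([0.0, 1.0, 0.0, 0.5])
V = np.outer(note_a, [1, 1, 0, 0]) + np.outer(note_b, [0, 1, 1, 1])
W, H = nmf(V, k=2)
err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(round(err, 4))  # relative reconstruction error, near zero
```

The columns of W play the role of the basis spectra mentioned above, and the rows of H indicate when each note is active.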
The Onsets and Frames (OaF) model is one of the best-known models in this field. Originally developed for polyphonic piano transcription, it has since been adapted to drums. It is composed of two parts, an onset detector and a frame detector, both trained on input spectrograms: the onset head localizes the start of each note, while the frame head tracks whether the note is still sounding. Because it predicts multiple pitches per frame, it handles polyphonic material naturally.
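The interaction of the two detectors can be illustrated with a simplified decoding rule for a single pitch: a note starts where the onset head fires and lasts while the frame head stays active. This is a sketch of the idea under those assumptions, not the model's exact inference code:

```python
def decode_notes(onset_probs, frame_probs, threshold=0.5):
    """Turn per-frame onset and frame (sustain) probabilities for one
    pitch into (start, end) segments in frame indices. A note begins
    where both heads exceed the threshold and ends when the frame
    head drops below it."""
    notes, start = [], None
    for t, (on, fr) in enumerate(zip(onset_probs, frame_probs)):
        if start is None:
            if on >= threshold and fr >= threshold:
                start = t
        elif fr < threshold:
            notes.append((start, t))
            start = None
    if start is not None:
        notes.append((start, len(frame_probs)))
    return notes

onsets = [0.9, 0.1, 0.1, 0.8, 0.1, 0.1]
frames = [0.9, 0.8, 0.2, 0.9, 0.9, 0.1]
print(decode_notes(onsets, frames))  # [(0, 2), (3, 5)]
```

Requiring an explicit onset before opening a note is what suppresses the spurious re-attacks that a frame-only detector tends to produce during sustained notes.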
Deep Learning and Automatic Piano Transcription
In the music transcription domain, deep learning techniques are used to identify the onsets of notes, as in the Onsets and Frames (OaF) model. Several methods have been proposed to accomplish this task. Most of them include a preprocessing step that converts the waveform into a time-frequency representation, such as a spectrogram or another signal-based representation, while others operate on the raw waveform directly.
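The waveform-to-spectrogram preprocessing step can be sketched in a few lines. This is a minimal short-time Fourier transform assuming a Hann window and a fixed hop size; real systems typically go further, e.g. mapping the result onto a mel or log-frequency axis:

```python
import numpy as np

def magnitude_spectrogram(x, n_fft=1024, hop=256):
    """Convert a waveform into a time-frequency representation:
    overlapping windowed frames -> FFT -> magnitude spectrogram
    of shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

# 1 second of a 440 Hz sine at 16 kHz: energy concentrates in the
# frequency bin nearest 440 Hz
sr = 16000
t = np.arange(sr) / sr
spec = magnitude_spectrogram(np.sin(2 * np.pi * 440 * t))
peak_bin = spec.mean(axis=0).argmax()
print(peak_bin * sr / 1024)  # within one bin width of 440 Hz
```

Each row of this matrix is what the onset and frame detectors described above consume, one prediction per time step.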
Another technique is to use a graph representation. Although more complex, a graph can represent a broad range of audio recordings and can handle a wide range of timbres. It is also useful for stream-level transcription and source-separation tasks.
One of the most popular approaches is to use a neural network (NN); a slew of such models take spectrograms as input. The CREPE network is a well-known example of a deep neural network, though it estimates a single monophonic pitch track and therefore does not perform as well as the OaF model on polyphonic piano. NNs can also struggle when confronted with timbres unseen during training.
The Multiple Sequence Resolution Network (MSRN) is another example. This model processes a given input at multiple temporal resolutions, which improves results on longer tracks and makes it particularly useful for genre classification. Beyond stream-level transcription, it has also been applied to speech emotion recognition and speaker identification; its tree of resolutions serves both tasks.