Low Resource audio-to-lyrics alignment from polyphonic music recordings
The state-of-the-art in lyrics alignment retrieves word timings in a single pass, which can exhaust available memory due to the length of the input audio signals. To circumvent this issue, we present a system that first searches for words and their positions in the audio signal via decoding with a biased language model, then segments the music recording with respect to the recognized portions of the input lyrics. Evaluation on a public benchmark dataset shows that our system performs competitively with the state-of-the-art in lyrics alignment while requiring far less data and fewer computational resources than existing approaches. In addition, we report the best scores in lyrics transcription using vocal extraction. Our experiments highlight the importance of source separation for the transcription and alignment tasks, and we demonstrate that our system can be leveraged to generate new training data. We publicly share our code for open science.
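The two-stage idea above (first anchor recognized words in time, then cut the recording into short segments for independent alignment) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `segment_by_anchors` function, its `(word, start, end)` anchor format, and the 30-second budget are all assumptions chosen to make the memory-bounding strategy concrete.

```python
# Hypothetical sketch: group recognized anchor words, given as
# (word, start_sec, end_sec) tuples in temporal order, into segments
# whose span stays under a fixed length, so each segment can be aligned
# independently within a bounded memory budget.

def segment_by_anchors(anchors, max_segment_len=30.0):
    """Split a list of (word, start, end) anchors into segments,
    starting a new segment whenever the current one would exceed
    max_segment_len seconds from its first anchor's start time."""
    segments = []
    current = []
    for word, start, end in anchors:
        # Cut before this anchor if adding it would overrun the budget.
        if current and end - current[0][1] > max_segment_len:
            segments.append(current)
            current = []
        current.append((word, start, end))
    if current:
        segments.append(current)
    return segments


anchors = [("hello", 0.0, 1.0), ("from", 10.0, 11.0),
           ("the", 25.0, 26.0), ("other", 40.0, 41.0)]
print(segment_by_anchors(anchors))
```

Each resulting segment covers only a short stretch of audio, so a finer-grained aligner can then be run on every segment without loading the full recording's decoding lattice at once.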