changeset 43:52d237639e16 abstract
rearranged segmentino's subsection
author    luisf <luis.figueira@eecs.qmul.ac.uk>
date      Fri, 06 Sep 2013 19:09:00 +0100
parents   c4c2b7f297a4
children  6a075bfd3e7d
files     vamp-plugins_abstract/qmvamp-mirex2013.tex
diffstat  1 files changed, 25 insertions(+), 24 deletions(-)
--- a/vamp-plugins_abstract/qmvamp-mirex2013.tex	Fri Sep 06 19:03:30 2013 +0100
+++ b/vamp-plugins_abstract/qmvamp-mirex2013.tex	Fri Sep 06 19:09:00 2013 +0100
@@ -101,7 +101,31 @@
 chord similarities. A standard HMM/Viterbi approach is used to smooth
 these to provide a chord transcription.
 
-\section{Structural Segmentation}
+\section{Audio Onset Detection}
+
+The Note Onset Detector Vamp plugin analyses a single channel of audio and estimates the onset times of notes within the music -- that is, the times at which notes and other audible events begin.
+
+It calculates an onset likelihood function for each spectral frame, and picks peaks in a smoothed version of this function. The plugin is non-causal, returning all results at the end of processing.
+
+The basic detection methods are described in~\cite{chris2003a}, the Adaptive Whitening technique in~\cite{dan2007a}, and the Percussion Onset detector in~\cite{dan2005a}.
+
+\section{Audio Structural Segmentation}
+
+\subsection{QM Segmenter Plugin}
+
+The Segmenter Vamp plugin divides a single channel of music into structurally consistent segments. It returns a numeric value (the segment type) for each moment at which a new segment starts.
+
+For music with clearly tonally distinguishable sections such as verse, chorus, etc., segments with the same type may be expected to be similar to one another in some structural sense. For example, repetitions of the chorus are likely to share a segment type.
+
+The method, described in~\cite{mark2008a}, relies upon structural/timbral similarity to obtain the high-level song structure. It is based on the assumption that the distributions of timbre features are similar over corresponding structural elements of the music.
+
+The algorithm works by obtaining a frequency-domain representation of the audio signal using a Constant-Q transform, a Chromagram, or Mel-Frequency Cepstral Coefficients (MFCCs) as the underlying features (the particular feature is selectable as a parameter). The extracted features are normalised in accordance with the MPEG-7 standard (NASE descriptor): the spectrum is converted to a decibel scale and each spectral vector is normalised by the RMS energy envelope, whose value is stored for each processing block of audio. This is followed by the extraction of 20 principal components per block using PCA, yielding a sequence of 21-dimensional feature vectors in which the last element corresponds to the energy envelope.
+
+A 40-state Hidden Markov Model is then trained on the whole sequence of features, with each state of the HMM corresponding to a specific timbre type. This process partitions the timbre space of a given track into 40 possible types. The important assumption of the model is that the distribution of these features remains consistent over a structural segment. After training and decoding the HMM, the song is assigned a sequence of timbre features according to specific timbre-type distributions for each possible structural segment.
+
+The segmentation itself is computed by clustering timbre-type histograms. A series of histograms is created over a sliding window; these are grouped into M clusters by an adapted soft k-means algorithm. Each of these clusters corresponds to a specific segment type of the analysed song. Reference histograms, iteratively updated during clustering, describe the timbre distribution for each segment. The segmentation arises from the final cluster assignments.
+
+\subsection{Segmentino}
 
 A beat-quantised chroma representation is used to calculate pair-wise
 similarities between beats (really: beat ``shingles'', i.e. multi-beat
@@ -122,29 +146,6 @@
 certain length; corresponding segments have the same length in beats.
 
-
-\section{Audio Onset Detection}
-
-The Note Onset Detector Vamp plugin analyses a single channel of audio and estimates the onset times of notes within the music -- that is, the times at which notes and other audible events begin.
-
-It calculates an onset likelihood function for each spectral frame, and picks peaks in a smoothed version of this function. The plugin is non-causal, returning all results at the end of processing.
-
-The basic detection methods are described in~\cite{chris2003a}, the Adaptive Whitening technique in~\cite{dan2007a}, and the Percussion Onset detector in~\cite{dan2005a}.
-
-\section{Audio Structural Segmentation}
-
-The Segmenter Vamp plugin divides a single channel of music into structurally consistent segments. It returns a numeric value (the segment type) for each moment at which a new segment starts.
-
-For music with clearly tonally distinguishable sections such as verse, chorus, etc., segments with the same type may be expected to be similar to one another in some structural sense. For example, repetitions of the chorus are likely to share a segment type.
-
-The method, described in~\cite{mark2008a}, relies upon structural/timbral similarity to obtain the high-level song structure. It is based on the assumption that the distributions of timbre features are similar over corresponding structural elements of the music.
-
-The algorithm works by obtaining a frequency-domain representation of the audio signal using a Constant-Q transform, a Chromagram, or Mel-Frequency Cepstral Coefficients (MFCCs) as the underlying features (the particular feature is selectable as a parameter). The extracted features are normalised in accordance with the MPEG-7 standard (NASE descriptor): the spectrum is converted to a decibel scale and each spectral vector is normalised by the RMS energy envelope, whose value is stored for each processing block of audio. This is followed by the extraction of 20 principal components per block using PCA, yielding a sequence of 21-dimensional feature vectors in which the last element corresponds to the energy envelope.
-
-A 40-state Hidden Markov Model is then trained on the whole sequence of features, with each state of the HMM corresponding to a specific timbre type. This process partitions the timbre space of a given track into 40 possible types. The important assumption of the model is that the distribution of these features remains consistent over a structural segment. After training and decoding the HMM, the song is assigned a sequence of timbre features according to specific timbre-type distributions for each possible structural segment.
-
-The segmentation itself is computed by clustering timbre-type histograms. A series of histograms is created over a sliding window; these are grouped into M clusters by an adapted soft k-means algorithm. Each of these clusters corresponds to a specific segment type of the analysed song. Reference histograms, iteratively updated during clustering, describe the timbre distribution for each segment. The segmentation arises from the final cluster assignments.
-
 \bibliography{qmvamp-mirex2013}
 
 \end{document}
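As a reading aid for the text added in this changeset, the sketches below illustrate the algorithmic steps it describes. First, the onset-detection pipeline: a per-frame onset likelihood function, smoothing, and peak picking. This is a minimal Python sketch only; the spectral-flux likelihood, frame sizes, and threshold are assumptions of the example, not the plugin's actual detection functions (which are those of the cited publications).

import numpy as np

def onset_times(audio, sr, frame=1024, hop=512, smooth=5):
    # Frame-wise magnitude spectra of the (mono) input signal.
    n = 1 + (len(audio) - frame) // hop
    windowed = np.stack([audio[i * hop:i * hop + frame] for i in range(n)])
    mags = np.abs(np.fft.rfft(windowed * np.hanning(frame), axis=1))

    # Onset likelihood per frame: positive spectral flux (a stand-in
    # for the plugin's selectable detection functions).
    flux = np.maximum(mags[1:] - mags[:-1], 0.0).sum(axis=1)

    # Smooth the detection function, then pick local maxima above a
    # global threshold; the plugin's peak picker differs in detail.
    df = np.convolve(flux, np.ones(smooth) / smooth, mode="same")
    thresh = df.mean() + df.std()
    peaks = [i for i in range(1, len(df) - 1)
             if df[i - 1] < df[i] >= df[i + 1] and df[i] > thresh]
    return [(p + 1) * hop / sr for p in peaks]

Because the whole detection function is available before peaks are picked, the sketch is non-causal in the same sense as the plugin: results can only be returned once processing ends.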
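The QM Segmenter's feature-normalisation step (decibel scale, RMS normalisation per the MPEG-7 NASE descriptor, 20 principal components plus the stored energy envelope, giving 21 dimensions) can be sketched as below. The dB floor, the SVD-based PCA, and the function name are assumptions for illustration, not the plugin's implementation.

import numpy as np

def nase_pca(spectra, n_components=20):
    # spectra: (blocks x bins) magnitude spectra, one row per block.
    # Decibel scale, floored to avoid log(0).
    db = 20.0 * np.log10(np.maximum(spectra, 1e-10))

    # RMS energy envelope per block; each spectral vector is
    # normalised by it, and its value is kept for later.
    rms = np.sqrt(np.mean(db ** 2, axis=1, keepdims=True))
    nase = db / rms

    # 20 principal components per block via SVD of the centred features.
    centred = nase - nase.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    pcs = centred @ vt[:n_components].T

    # Append the energy envelope: 21-dimensional output vectors.
    return np.hstack([pcs, rms])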
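The timbre-modelling step, a 40-state HMM trained on the whole feature sequence and then decoded so that each block receives one of 40 timbre types, can be sketched with the third-party hmmlearn package standing in for the plugin's own HMM code (an assumption; the plugin does not use hmmlearn):

from hmmlearn import hmm  # stand-in for the plugin's own HMM code

def timbre_types(features, n_states=40):
    # Fit a 40-state Gaussian HMM to the whole (blocks x 21) feature
    # sequence, one state per timbre type, then Viterbi-decode the
    # same sequence to label each block with a timbre type.
    model = hmm.GaussianHMM(n_components=n_states,
                            covariance_type="diag", n_iter=20)
    model.fit(features)
    return model.predict(features)

Decoding the very sequence the model was trained on mirrors the "after training and decoding the HMM" step in the text: the labels partition the track's timbre space into at most 40 types.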
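The clustering step builds timbre-type histograms over a sliding window and groups them into M clusters with a soft k-means whose reference histograms are updated from soft memberships. The window length, M, the softness parameter beta, and the iteration count below are illustrative assumptions; the plugin's "adapted" soft k-means may differ in detail.

import numpy as np

def segment_types(types, n_types=40, window=15, M=10,
                  beta=10.0, iters=50, seed=0):
    # One timbre-type histogram per sliding-window position.
    hists = np.stack([np.bincount(types[i:i + window], minlength=n_types)
                      for i in range(len(types) - window + 1)]).astype(float)
    hists /= hists.sum(axis=1, keepdims=True)

    # Soft k-means: each histogram pulls every reference histogram
    # towards it, weighted by its soft cluster membership.
    rng = np.random.default_rng(seed)
    refs = hists[rng.choice(len(hists), M, replace=False)]
    for _ in range(iters):
        d = ((hists[:, None, :] - refs[None, :, :]) ** 2).sum(axis=2)
        w = np.exp(-beta * d)
        w /= w.sum(axis=1, keepdims=True)
        refs = (w.T @ hists) / w.sum(axis=0)[:, None]

    # Final hard assignment to the nearest reference histogram gives
    # the segment type for each window position.
    d = ((hists[:, None, :] - refs[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)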
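Finally, for the Segmentino subsection: the pair-wise similarity between beat "shingles" can be sketched as cosine similarity between stacked beat-synchronous chroma vectors. The shingle length and the similarity measure are assumptions of the example; Segmentino's own similarity computation may differ.

import numpy as np

def shingle_similarity(beat_chroma, shingle_len=4):
    # beat_chroma: (beats x 12) beat-quantised chroma matrix. Each
    # shingle stacks shingle_len consecutive beat vectors.
    n = len(beat_chroma) - shingle_len + 1
    shingles = np.stack([beat_chroma[i:i + shingle_len].ravel()
                         for i in range(n)])

    # Pair-wise cosine similarity between all multi-beat shingles.
    unit = shingles / np.maximum(
        np.linalg.norm(shingles, axis=1, keepdims=True), 1e-10)
    return unit @ unit.T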