% Report/chapter3/ch3.tex, Paulo Chiliguano <p.e.chiilguano@se14.qmul.ac.uk>, 31 Aug 2015
\chapter{Methodology}
\label{ch:methodology}
The methodology used to develop our hybrid music recommender consists of three main stages. First, we collect real-world user-item data, namely play counts of specific songs, and fetch audio clips for the unique songs identified in the dataset. Second, we implement a deep learning algorithm that represents each audio clip as an n-dimensional vector of music genre probabilities. Finally, we investigate estimation of distribution algorithms (EDAs) to model user profiles based on the songs rated above a threshold. Every stage of our hybrid recommender is implemented in Python 2.7\footnote{https://www.python.org/download/releases/2.7/}.

\section{Data collection}
The Million Song Dataset \parencite{Bertin-Mahieux2011} is a collection of audio features and metadata for a million contemporary popular music tracks, which provides ground truth for evaluation research in MIR. This collection is complemented by the Taste Profile subset, which provides 48,373,586 triplets, each consisting of an anonymised user ID, an Echo Nest song ID and a play count. We chose this dataset because it is publicly available and contains enough data for user modelling and recommender evaluation.

\subsection{Taste Profile subset cleaning}
Due to potential mismatches\footnote{http://labrosa.ee.columbia.edu/millionsong/blog/12-2-12-fixing-matching-errors} between song IDs and track IDs in the Echo Nest database, the wrong matches must be filtered out of the Taste Profile subset. The cleaning process is illustrated in Figure~\ref{fig:taste_profile}.
\begin{figure}[ht!]
	\centering
	\includegraphics[width=0.8\textwidth]{chapter3/taste_profile.png}
	\caption{Cleaning of the Taste Profile subset}
	\label{fig:taste_profile}
\end{figure}

A script is implemented to discard the triplets whose song identifiers appear in the mismatches text file. First, we read the file line by line to extract each song identifier, storing the identifiers in a set object to build a collection of unique elements. Next, because of the size of the Taste Profile subset (about 3 GB uncompressed), we load the dataset in chunks of 20,000 triplets into a \textit{pandas}\footnote{http://pandas.pydata.org/} dataframe, cleaning each chunk by discarding the triplets whose song identifiers belong to the set built in the previous step. The cleaning process takes around 2.47 minutes and yields 45,795,100 triplets.
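The chunked cleaning step can be sketched as follows. This is a minimal illustration, not the report's exact script: the toy triplets, the two-row chunk size (20,000 in the real pipeline) and the hard-coded mismatch set stand in for the real files.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the ~3 GB triplets file
# (the real data is tab-separated: user ID, song ID, play count).
triplets_txt = (
    "u1\tSOAAA\t5\n"
    "u1\tSOBBB\t2\n"
    "u2\tSOAAA\t1\n"
    "u2\tSOCCC\t9\n"
)

# Song identifiers flagged as mismatches, stored as a set of
# unique elements (normally parsed from the mismatches text file).
mismatched_ids = {"SOBBB"}

cols = ["user_id", "song_id", "play_count"]
chunks = pd.read_csv(io.StringIO(triplets_txt), sep="\t",
                     names=cols, chunksize=2)  # 20,000 in the real pipeline

# Discard, chunk by chunk, the triplets whose song ID is a mismatch.
clean = pd.concat(chunk[~chunk["song_id"].isin(mismatched_ids)]
                  for chunk in chunks)
# 3 of the 4 toy triplets survive
```

Reading with \texttt{chunksize} keeps memory usage bounded regardless of the size of the input file.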

In addition to the cleaning process, we significantly reduce the size of the dataset for experimental purposes: we only keep users with more than 1,000 played songs and select the identifiers of the 1,500 most played songs. This additional process takes around 3.23 minutes and yields 65,327 triplets.
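The additional reduction can be sketched in \textit{pandas} as well. The data and thresholds below are toys (scaled down from the report's 1,000 songs per user and 1,500 songs overall); the grouping logic is the point.

```python
import pandas as pd

# Toy triplets frame standing in for the cleaned dataset.
df = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2", "u2", "u3"],
    "song_id": ["s1", "s2", "s3", "s1", "s2", "s3"],
    "play_count": [5, 1, 2, 3, 4, 10],
})

# Keep users that played more than N songs (N = 1,000 in the report).
N = 2
songs_per_user = df.groupby("user_id")["song_id"].nunique()
df = df[df["user_id"].isin(songs_per_user[songs_per_user > N].index)]

# Keep the M most played songs overall (M = 1,500 in the report).
M = 2
top_songs = df.groupby("song_id")["play_count"].sum().nlargest(M).index
df = df[df["song_id"].isin(top_songs)]
```

Only user \texttt{u1} survives the first filter here, and only that user's two most played songs survive the second.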

%count resulting number of triplets
%At this stage, similarities between users is calculated to form a neighbourhood and predict user rating based on combination of the ratings of selected users in the neighbourhood.

\subsection{Fetching audio data}
First, each element of the list of 1,500 song identifiers obtained in the previous step is used to retrieve the corresponding Echo Nest track ID, through a script that calls the \emph{get\_tracks} method from the \textit{Pyechonest}\footnote{http://echonest.github.io/pyechonest/} package; this allows us to acquire the track ID and preview URL for each song ID through the Echo Nest API. This step is necessary because the 7digital API uses Echo Nest track IDs, rather than song IDs, to retrieve data from its catalogue. If the track information for a song is not available, we skip it and retrieve the Echo Nest information for the next song ID.

Next, for each preview URL obtained in the previous step, we fetch an audio clip of 30 to 60 seconds through an OAuth request to the 7digital API. For these API requests, we use the GET method of the request class from the \textit{python-oauth2}\footnote{https://github.com/jasonrubenstein/python\_oauth2} package, because every request requires a nonce, a timestamp and a signature method, as well as the country parameter, e.g., 'GB' to access the UK catalogue. Before running the script, it is useful to check that the provided 7digital API keys and the country parameter are accepted by the \textit{OAuth 1.0 Signature Reference Implementation}\footnote{http://7digital.github.io/oauth-reference-page/} for 7digital.
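What the \textit{python-oauth2} request construction does internally can be sketched with the standard library alone. This is a Python 3 illustration of OAuth 1.0 HMAC-SHA1 signing, not the report's script; the URL and parameter names below are placeholders, not the exact 7digital endpoint.

```python
import base64
import hashlib
import hmac
import time
import urllib.parse
import uuid


def sign_oauth1_get(url, params, consumer_key, consumer_secret):
    """Build the query parameters for an OAuth 1.0 signed GET request
    (nonce, timestamp, HMAC-SHA1 signature), as 7digital requires."""
    oauth_params = {
        "oauth_consumer_key": consumer_key,
        "oauth_nonce": uuid.uuid4().hex,
        "oauth_timestamp": str(int(time.time())),
        "oauth_signature_method": "HMAC-SHA1",
        "oauth_version": "1.0",
    }
    all_params = {**params, **oauth_params}
    enc = lambda s: urllib.parse.quote(str(s), safe="~")
    # Signature base string: METHOD & encoded URL & sorted encoded params.
    norm = "&".join("%s=%s" % (enc(k), enc(v))
                    for k, v in sorted(all_params.items()))
    base = "&".join(["GET", enc(url), enc(norm)])
    key = enc(consumer_secret) + "&"  # no token secret for 2-legged requests
    digest = hmac.new(key.encode(), base.encode(), hashlib.sha1).digest()
    all_params["oauth_signature"] = base64.b64encode(digest).decode()
    return all_params


# Hypothetical endpoint and credentials, for illustration only.
signed = sign_oauth1_get("http://api.7digital.com/1.2/track/preview",
                         {"trackid": "12345", "country": "GB"},
                         "my_key", "my_secret")
```

In practice the packaged request class handles all of this; the sketch only shows why a nonce, a timestamp and a signature method are mandatory parts of every request.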

Additionally, the script accumulates the Echo Nest song identifier, track ID, artist name, song title and 7digital preview audio URL for each downloaded track in a text file. If a preview audio clip is not available, the script skips to the next song ID. The generated text file is used to further reduce the triplets dataset from the previous section.

%include number of tracks available

%Classifier creates a model for each user based on the acoustic features of the tracks that user has liked.
\subsection{Intermediate time-frequency representation for audio signals}
For representing the audio waveforms of the song collection obtained through the 7digital API, a procedure similar to the one suggested by \textcite{NIPS2013_5004} is followed:
\begin{itemize}
	\item Read 3 seconds of each song at a sampling rate of 22050 Hz and mono channel.
	\item Compute log-mel spectrograms with 128 components from windows of 1024 frames and a hop size of 512 samples.
\end{itemize}

The Python script for feature extraction implemented by \textcite{Sigtia20146959} is modified to return log-mel spectrograms using the LibROSA\footnote{https://bmcfee.github.io/librosa/index.html} package.

This representation is motivated by the observation that ``representations of music directly from the temporal or spectral domain can be very sensitive to small time and frequency deformations'' \parencite{zhang2014deep}.

\section{Data preprocessing}
\begin{itemize}
	\item Derive implicit ratings from the complementary cumulative distribution of play counts.
	\item Flatten each spectrogram into a one-dimensional input vector.
\end{itemize}
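A minimal sketch of the first preprocessing step, assuming implicit ratings are obtained by mapping each play count through the complementary cumulative distribution (the 1-to-5 scale and the toy counts below are illustrative):

```python
import numpy as np

# Toy play counts for one user's listened songs.
play_counts = np.array([1, 1, 2, 3, 5, 8, 13, 21])

# Complementary cumulative distribution: fraction of observations
# with a play count at least as large as each value.
ccdf = np.array([(play_counts >= c).mean() for c in play_counts])

# A high play count is rarely reached (low CCDF value), so it maps
# to a high rating on an illustrative 1-to-5 scale.
ratings = np.ceil((1.0 - ccdf) * 5).clip(1, 5).astype(int)
```

This makes the rating depend on how unusual a play count is within the distribution, rather than on its raw magnitude.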
\section{Algorithms}
\subsection{Music genre classifier}
\label{subsec:genre}
The input of the CNN consists of the 128-component spectrograms obtained during feature extraction. The batch size considered is 20 frames.
Each convolutional layer consists of 10 kernels with ReLU activation units. The pooling size is 4 in the first convolutional layer and 2 in the second. The filters analyse the frames along the frequency axis, covering all mel components, with a hop size of 4 frames along the time axis. Additionally, there is a hidden multilayer perceptron layer with 513 units.
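The layer dimensions described above can be checked with a naive NumPy forward pass. This is a shape-level sketch only, not the report's implementation; the 130-frame input length and random weights are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Batch of 20 spectrogram excerpts, 128 mel components, 130 time frames.
batch, mel, frames = 20, 128, 130
x = rng.standard_normal((batch, mel, frames))


def conv_time(x, kernels, width=4):
    """Naive convolution along the time axis: each kernel spans all
    mel components and a window of `width` frames, followed by ReLU."""
    b, m, t = x.shape
    k = kernels.shape[0]
    out = np.empty((b, k, t - width + 1))
    for i in range(t - width + 1):
        patch = x[:, :, i:i + width].reshape(b, -1)      # (b, m*width)
        out[:, :, i] = patch @ kernels.reshape(k, -1).T  # (b, k)
    return np.maximum(out, 0.0)  # ReLU activation


def max_pool_time(x, size):
    """Non-overlapping max pooling along the time axis."""
    b, k, t = x.shape
    t = t - t % size
    return x[:, :, :t].reshape(b, k, t // size, size).max(axis=3)


kernels = rng.standard_normal((10, mel, 4))  # 10 kernels, as in the text
h = max_pool_time(conv_time(x, kernels), 4)  # pooling size 4, first layer
```

After the first layer the 130 time frames collapse to 31 pooled positions per kernel, which is the kind of dimensionality the second layer (pooling size 2) then receives.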
%Deep belief network is a probabilistic model that has one observed layer and several hidden layers.
\subsubsection{CNN network architecture}
The genre classification for each frame is produced by a logistic regression output layer, trained by stochastic gradient descent (SGD) to minimise the negative log-likelihood.
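The output layer's computation can be illustrated with NumPy: a softmax over genre logits and the negative log-likelihood loss that SGD minimises. The random logits stand in for the activations of the previous layer; 10 classes correspond to the GTZAN genres.

```python
import numpy as np

rng = np.random.default_rng(1)
logits = rng.standard_normal((20, 10))   # batch of 20 frames, 10 genres
labels = rng.integers(0, 10, size=20)    # toy ground-truth genre indices

# Softmax with the usual max-subtraction for numerical stability.
z = logits - logits.max(axis=1, keepdims=True)
p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)

# Negative log-likelihood of the true class, averaged over the batch.
nll = -np.log(p[np.arange(20), labels]).mean()

# Per-frame genre decision: the most probable class.
pred = p.argmax(axis=1)
```

Training consists of following the gradient of \texttt{nll} with respect to the layer's parameters, which is what the SGD updates in the report's classifier do.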

In our tests, we obtained a classification error of 38.8\% after 9 trials using the GTZAN dataset. Detailed classification results are shown in Table~\ref{table:genre}.


\subsection{User profile modelling}
\label{subsec:profile}
\subsubsection{Permutation EDA}
\subsubsection{Continuous Univariate Marginal Distribution Algorithm}

\subsection{Song recommendation}