\chapter{Methodology}
The methodology used to develop the hybrid music recommender consists of three main stages. First, the collection of real users' listening data, namely the play counts of specific songs, together with the retrieval of audio samples for the songs identified in that data. Second, the implementation of a deep learning algorithm to represent the songs as vectors. Third, the implementation of an EDA to model the user profiles.
\section{Data collection}
The Million Song Dataset \citep{Bertin-Mahieux2011} is a collection of audio features and metadata for a million contemporary popular music tracks, whose purpose in MIR is to provide a ground truth for evaluation research. This collection is complemented by the Taste Profile subset\footnote{http://labrosa.ee.columbia.edu/millionsong/tasteprofile}, which provides 48,373,586 triplets, each consisting of a Last.fm user ID, an Echo Nest song ID and the play count of that song.

\subsection{Taste Profile subset cleaning}
Due to potential mismatches between song IDs and track IDs in the Echo Nest database, the wrong matches must be filtered out of the Taste Profile subset. A Python script is implemented to discard the triplets whose song ID appears in the mismatches list, also available on the Million Song Dataset webpage. The resulting triplets are stored in a new CSV file.
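A minimal sketch of this cleaning step is shown below, assuming tab-separated triplets and a plain-text file containing the mismatched song IDs, one per line; all file names are placeholders.
\begin{verbatim}
import csv

# Song IDs flagged as mismatches, one per line (assumed format)
with open("mismatched_song_ids.txt") as f:
    mismatched = {line.strip() for line in f if line.strip()}

# Keep only the triplets whose song ID is not in the mismatches list
with open("train_triplets.txt") as triplets, \
     open("clean_triplets.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for line in triplets:
        user_id, song_id, play_count = line.rstrip("\n").split("\t")
        if song_id not in mismatched:
            writer.writerow([user_id, song_id, play_count])
\end{verbatim}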
%count resulting number of triplets


%At this stage, similarities between users is calculated to form a neighbourhood and predict user rating based on combination of the ratings of selected users in the neighbourhood.

\subsection{Audio clips retrieval}
The song IDs from the triplets obtained in the previous step are used to retrieve the corresponding track IDs through a Python script based on the Pyechonest\footnote{http://echonest.github.io/pyechonest/} package, whose \emph{get\_tracks} method acquires track IDs via Echo Nest API\footnote{http://developer.echonest.com} requests. Track IDs are needed because, for each one, a 30--60 second preview audio clip can be retrieved through the 7digital API\footnote{http://developer.7digital.com}.

Additionally, the Python script accumulates the song ID, the preview URL, and the artist and song metadata of each available track in a text file. If no track is available for a song ID, the script skips to the next song ID. The generated text file can be used to further reduce the triplet dataset from the previous section.
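A minimal sketch of this retrieval script is shown below; the catalogue identifier, file names and API key are placeholders, and error handling is simplified.
\begin{verbatim}
from pyechonest import config, song

config.ECHO_NEST_API_KEY = "YOUR_API_KEY"  # placeholder

with open("song_ids.txt") as ids, open("tracks.txt", "w") as out:
    for song_id in (line.strip() for line in ids):
        try:
            s = song.Song(song_id)
            tracks = s.get_tracks("7digital-US")  # assumed catalogue name
            if not tracks:
                continue  # no 7digital track available: skip this song ID
            url = tracks[0].get("preview_url", "")
            out.write("\t".join([song_id, url, s.artist_name, s.title]) + "\n")
        except Exception:
            continue  # song lookup failed: move on to the next song ID
\end{verbatim}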

%include number of tracks available

%Classifier creates a model for each user based on the acoustic features of the tracks that user has liked.
\subsection{Intermediate time-frequency representation for audio signals}
To represent the audio waveforms of the song collection obtained through the 7digital API, a procedure similar to the one suggested by \citet{NIPS2013_5004} is followed:
\begin{itemize}
	\item Read 3 seconds of each song at a sampling rate of 22050 Hz, as a single (mono) channel.
	\item Compute log-mel spectrograms with 128 components from windows of 1024 samples and a hop size of 512 samples.
\end{itemize}

The Python script for feature extraction implemented by \citet{Sigtia20146959} is modified to return the log-mel spectrograms by using the LibROSA\footnote{https://bmcfee.github.io/librosa/index.html} package.
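The core of the modified extraction step can be sketched as follows, assuming the LibROSA API; the clip path is a placeholder.
\begin{verbatim}
import numpy as np
import librosa

# Read 3 seconds of the clip at 22050 Hz, as a mono signal
y, sr = librosa.load("clip.mp3", sr=22050, mono=True, duration=3.0)

# 128-component mel spectrogram from 1024-sample windows, hop size 512
S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                   hop_length=512, n_mels=128)

# Log compression of the mel spectrogram
log_S = np.log(S + 1e-10)
\end{verbatim}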

As \citet{zhang2014deep} note, ``representations of music directly from the temporal or spectral domain can be very sensitive to small time and frequency deformations''; the log-mel spectrogram is therefore used as a more robust intermediate representation.

\section{Algorithms}
\subsection{CNN architecture}
The input of the CNN consists of the 128-component log-mel spectrograms obtained in the feature extraction stage, and the batch size is 20 frames.
Each convolutional layer consists of 10 kernels with ReLU activation units; the pooling size is 4 in the first convolutional layer and 2 in the second. The filters analyse the frames along the frequency axis, so that every mel component is considered, with a hop size of 4 frames along the time axis. Additionally, there is a hidden multilayer perceptron layer with 513 units.
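This architecture can be sketched as follows. The sketch is an illustration in Keras rather than the actual implementation; the kernel shapes, input length and padding are assumptions, while the number of kernels, the pooling sizes and the 513-unit hidden layer follow the description above.
\begin{verbatim}
from tensorflow.keras import layers, models

n_mels, n_frames = 128, 130   # assumed input shape: mel bands x frames

model = models.Sequential([
    # First convolutional layer: 10 kernels, ReLU, pooling size 4
    layers.Conv2D(10, kernel_size=(4, 4), activation="relu",
                  padding="same", input_shape=(n_mels, n_frames, 1)),
    layers.MaxPooling2D(pool_size=(4, 1)),
    # Second convolutional layer: 10 kernels, ReLU, pooling size 2
    layers.Conv2D(10, kernel_size=(4, 4), activation="relu",
                  padding="same"),
    layers.MaxPooling2D(pool_size=(2, 1)),
    layers.Flatten(),
    # Hidden multilayer perceptron layer with 513 units
    layers.Dense(513, activation="relu"),
])
\end{verbatim}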
%Deep belief network is a probabilistic model that has one observed layer and several hidden layers.
\subsubsection{Genre classification}
The genre classification for each frame is obtained from a logistic regression output layer trained by minimising the negative log likelihood with stochastic gradient descent (SGD).
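Continuing the sketch above, the output layer and training criterion can be written as follows; the number of genre classes and the learning rate are assumptions.
\begin{verbatim}
from tensorflow.keras import layers, optimizers

n_genres = 10   # assumed number of genre classes

# Softmax (multinomial logistic regression) output layer
model.add(layers.Dense(n_genres, activation="softmax"))

# Negative log likelihood (categorical cross-entropy) minimised with SGD
model.compile(optimizer=optimizers.SGD(learning_rate=0.01),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
\end{verbatim}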
\subsection{Continuous Bayesian EDA}
\subsection{EDA-based hybrid recommender}