
--- a/musicweb.tex	Sun May 01 05:18:47 2016 +0100
+++ b/musicweb.tex	Sun May 01 05:43:48 2016 +0100
@@ -183,24 +183,23 @@


 \section{Background}\label{sec:background}
-Researchers have realised the usefulness of musical metadata and have for some time tried to collect and exploit music metadata for knowledge representation. Several ontologies have been and continue to be developed to link music metadata, such as the music ontology\cite{DBLP:conf/ismir/RaimondASG07}, which defines all objects in the process of creation, interpretation and distribution of music; the similarity ontology\cite{jacobson2011}, which allows for associations based on similarity of all musical elements contained in the music ontology; the studio ontology, which can be used to describe all elements in music studio environments\cite{fazekas2011studio}; or the audio effects ontology\cite{wilmering2013}, permitting the description of audio effects employed in music production processes. Linked music metadata is full of promise. However, most attempts to make use of linked metadata to guide music discovery have stressed some aspects of metadata while ignoring others. Pachet identifies three types of musical metadata \cite{Pachet2005}:
+Researchers have realised the usefulness of musical metadata and have for some time tried to collect and exploit music metadata for knowledge representation. Several ontologies have been and continue to be developed to link music metadata, such as the music ontology\cite{DBLP:conf/ismir/RaimondASG07}, which defines all objects in the process of creation, interpretation and distribution of music; the similarity ontology\cite{jacobson2011}, which allows for associations based on similarity of all musical elements contained in the music ontology; the studio ontology, which can be used to describe all elements in music studio environments\cite{fazekas2011studio}; or the audio effects ontology\cite{wilmering2013}, permitting the description of audio effects employed in music production processes. Linked music metadata is full of promise. However, most attempts to make use of linked metadata to guide music discovery have stressed some aspects of metadata while ignoring others. Pachet\cite{Pachet2005} identifies three types of musical metadata:
 \begin{enumerate}
   \item Editorial metadata: information that is provided manually by authoritative experts. There is a wide range of potential producers of this kind of data, from record labels to collaborative schemes, as well as different kinds of data, from which musician played in which song to tour info, to artists' biography.
   \item Cultural metadata: information which is produced by the environment or culture. This is data that is not explicitly entered into some information system, but rather is contained, and must be extracted from, other information sources, such as user trends, google searches, articles and magazines, word associations in blogs, etc.
   \item Acoustic metadata: data extracted from audio files using music information retrieval methods.
 \end{enumerate}
-Of these, only the first has been exploited to a significant degree. Web resources for music discovery which employ liked data such as musicbrainz or lastfm rely mostly on editorial metadata to link. Commercial recommendation systems make use of cultural metadata, mainly through collaborative filtering.
-To our knowledge the first recommedation system based on linked data was proposed in \cite{celma2008foafing}, which used web crawling to gather data which could then be offered to the user. Recommendation was based on profiling the user's listening habits and foaf connections. A further step was taken in \cite{heitmann2010}, in which the author addresses common problems in recommender system such the new item problem or the new user problem. dbrec, a recommender system presented in \cite{passant2010dbrec}, recommended music obtained from dbpedia by computing a measure of semantic distance as the number of indirect and distinct links between resources in a graph. The system offered the user an explanation for each recommendation, listing the resources shared by the artists recommended.
-Nguyen \emph{et al}\cite{nguyen2015} explore the effectiveness of recommendation systems based on knowledge encyclopedias such as dbpedia and freenet. The authors compute several different similarity measures of linked data extracted from both datasets which they then feed to a recommender system.
-There are several web resources offering services similar to MusicWeb. One of them is musikipedia\footnote{http://musikipedia.org/}. The user can visit a page for an artist and listen to music or watch videos. The user can also link to other artists that are connected to the current one, and an explanation of the connection is offered. Links are extracted from dbpedia and offer all common categories between artists.
-
-Dbpedia and freebase are two of the most common sources of linked data available. There are several other sources of music metadata. Acousticbrainz\footnote{https://acousticbrainz.org/} is an crowd source information resource which contains low and high level music metadata, including audio and editorial features. Acousticbrainz is participated by musicbrainz, which is also a major container of linked editorial metadata.
+Of these, only the first has been exploited to a significant degree. Web resources for music discovery which employ linked data, such as MusicBrainz or Last.fm, rely mostly on editorial metadata. Commercial recommendation systems make use of limited cultural metadata, mainly through collaborative filtering.
+To our knowledge the first recommendation system based on linked data was proposed in \cite{celma2008foafing}, which used web crawling to gather data which could then be offered to the user. Recommendation was based on profiling the user's listening habits and \emph{foaf} connections. A further step was taken in \cite{heitmann2010}, in which the author addresses common problems in recommender systems such as the new item problem or the new user problem. \emph{dbrec}, a recommender system introduced in \cite{passant2010dbrec}, suggested music obtained from DBpedia by computing a measure of semantic distance as the number of indirect and distinct links between resources in a graph. The system offered the user an explanation for each recommendation, listing the resources shared by the artists recommended.
+Nguyen \emph{et al.}\cite{nguyen2015} explore the effectiveness of recommendation systems based on knowledge encyclopedias such as DBpedia and Freebase. The authors compute several different similarity measures of linked data extracted from both datasets, which they then feed to a recommender system.\\
+There are several web resources offering services similar to MusicWeb. One of them is musikipedia\footnote{http://musikipedia.org/}. The user can visit a page for an artist and listen to music or watch videos. The user can also follow links to other artists that are connected to the current one, and an explanation of the connection is offered. Links are extracted from DBpedia and cover all categories shared between artists.\\
+DBpedia and Freebase are two of the most common sources of linked data available. There are several other sources of music metadata. AcousticBrainz\footnote{https://acousticbrainz.org/} is a crowd-sourced information resource which contains low- and high-level music metadata, including audio and editorial features. AcousticBrainz is run in partnership with MusicBrainz, which is also a major repository of linked editorial metadata.


 \section{MusicWeb architecture}

-MusicWeb provides a browsing experience using connections that are either extra-musical or tangential to music, such as the artists' political affiliation or social influence, or intra-musical, such as the artists' main instrument or most favoured musical keys. It does this by pulling data from several different web knowledge content resources and presenting them for the user to navigate in a faceted manner\cite{Marchionini2006}. The listener can begin his journey by choosing or searching for an artist. The application offers youtube videos, audio streams, photographs and album covers, as well as the artist's biography. The page also includes many box widgets with links to artists who are related to the current artist in different, and sometimes unexpected and surprising ways (figure \ref{fig:ella_page}). The user can then click on any of these artists and the search commences again, exploring a web of artists further and further.
+MusicWeb provides a browsing experience using connections that are either extra-musical or tangential to music, such as the artists' political affiliation or social influence, or intra-musical, such as the artists' main instrument or most favoured musical keys. It does this by pulling data from several different web knowledge content resources and presenting them for the user to navigate in a faceted manner\cite{Marchionini2006}. The listener can begin their journey by choosing or searching for an artist. The application offers YouTube videos, audio streams, photographs and album covers, as well as the artist's biography. The page also includes many box widgets with links to artists who are related to the current artist in different, and sometimes unexpected, ways (figure \ref{fig:ella_page}). The user can then click on any of these artists and the search commences again, exploring a web of artists further and further.


 \begin{figure}[!ht]
@@ -256,19 +255,19 @@
 Music does not lend itself easily to categorisation. There are many ways in which artists can be, and in fact are, considered to be related. Similarity may refer to whether artists' songs sound similar, or are perceived to be in the same style or genre. But it may also mean that they are followed by people from similar social backgrounds or political inclinations, or similar ages; or perhaps they are similar because they have played together, or participated in the same event, or their songs touch on similar themes. Linked data facilitates faceted searching and displaying of information\cite{Oren2006}: an artist may be similar to many other artists in one of the ways just mentioned, and to a completely different plethora of artists in other senses, all of which might contribute to music discovery. Semantic web technologies can help us gather different facets of data and shape them into representations of knowledge. MusicWeb does this by searching for similarities in three different domains: socio-cultural, research and journalistic literature, and content-based linkage.
 \subsection{Socio-cultural linkage}

-Socio-cultural connections between artists in MusicWeb are primarily derived from YAGO categories that are incorporated into entities in DBpedia. Many categories, in particular those that can be considered extra-musical or tangential to music, that emerge as users browse MusicWeb, stem from the particular methodology used to derive YAGO information from Wikipedia. While DBpedia extracts knowledge from the same source, YAGO leverages Wikipedia category pages to link entities without adapting the Wikipedia taxonomy of these categories. The hierarchy is created by adapting the Wikipedia categories to the WordNet concept structure. This enables linking each artist to other similar artists by various commonalities such as style, geographical location, instrumentation, record label as well as more obscure categories, for example, artists who have received the same award, have shared the same fate, or belonged to the same organisation or religion. YAGO categories can reveal connections between artists that traditional isolated music datasets would not be able to establish. For example, MusicWeb links Alice Coltrane to John McLaughlin as both artists are converts to Hinuism, or Ella Fitzgerald to Jerry Garcia because both belong to the category of American amputees, or further still, the guitarist of heavy metal band Pantera to South African singer and civil rights activist Miriam Makeba as both died on stage.
+Socio-cultural connections between artists in MusicWeb are primarily derived from YAGO categories that are incorporated into entities in DBpedia. Many categories, in particular those that can be considered extra-musical or tangential to music, that emerge as users browse MusicWeb, stem from the particular methodology used to derive YAGO information from Wikipedia. While DBpedia extracts knowledge from the same source, YAGO leverages Wikipedia category pages to link entities without adapting the Wikipedia taxonomy of these categories. The hierarchy is created by adapting the Wikipedia categories to the WordNet concept structure. This enables linking each artist to other similar artists by various commonalities such as style, geographical location, instrumentation, record label as well as more obscure categories, for example, artists who have received the same award, have shared the same fate, or belonged to the same organisation or religion. YAGO categories can reveal connections between artists that traditional isolated music datasets would not be able to establish. For example, MusicWeb links Alice Coltrane to John McLaughlin as both artists are converts to Hinduism, or Ella Fitzgerald to Jerry Garcia because both belong to the category of American amputees, or further still, the guitarist of heavy metal band Pantera to South African singer and civil rights activist Miriam Makeba as both died on stage.
 % NOTE: not sure about this. Do we consider the dbpedia queries to be socio-cultural? or the collaborates-with in musicbrainz? Or do you (George) mean something like the introduction just above?

 \subsection{Literature-based linking}
 Often artists share a connection through literature topics. For example, a psychologist interested in self-image during adolescence might want to research the impact of artists like Miley Cyrus or Rihanna on young teenagers\cite{Lamb2013}. Or a historian researching class politics in the UK might write about The Sex Pistols and John Lennon\cite{Moliterno2012}. In order to extract these relations one must mine the data from texts using natural language processing. Our starting point is a large database of 100,000 artists. MusicWeb searches several sources and collects texts that mention each artist. It then carries out semantic analysis to identify connections between artists and higher-level topics. There are two main sources of texts:
 \begin{enumerate}
-\item Research articles. There are various web resources that allow querying their research literature databases. MusicWeb uses mendeley\footnote{\url{http://dev.mendeley.com/}} and elsevier\footnote{\url{http://dev.elsevier.com/}}. Both resources offer managed and largely curated data and search possibilities include keywords, authors and disciplines. Data comprehension varies, but most often it features an array of keywords, an abstract, readership categorised according to discipline and sometimes the article itself.
+\item Research articles. There are various web resources that allow querying their research literature databases. MusicWeb uses Mendeley\footnote{\url{http://dev.mendeley.com/}} and Elsevier\footnote{\url{http://dev.elsevier.com/}}. Both resources offer managed and largely curated data, and search possibilities include keywords, authors and disciplines. The completeness of the data varies, but most often it features an array of keywords, an abstract, readership categorised according to discipline and sometimes the article itself.
   \item Online publications, such as newspapers, music magazines and blogs focused on music. This is non-managed, non-curated data that must be extracted from the body of the text. Websites are crawled in search of keywords or tags in the title, and matching pages are then scraped. External links contained in the page are also followed.
 \end{enumerate}
 Many texts contain references to an artist name without actually being relevant to MusicWeb. A search for Madonna, for example, can yield many results from the fields of sculpture, art history or religion studies. The first step is to model the relevance of the text, and discard texts which are of no interest to music discovery. This is done through a two stage process:
 \begin{enumerate}
 \item Managed articles contain information about the number of readers per discipline. This data is analysed and texts are discarded if the readers belong mainly to disciplines not related to humanities.
-\item The text is projected onto a tfd/idf vector space model\cite{Manning1999} constructed from words appearing in a relatively small collection of already accepted articles. Cosine similarity between this corpus and the text of every potential is computed, and texts which exceed a threshold are rejected.
+\item The text is projected onto a \emph{tf-idf} vector space model\cite{Manning1999} constructed from words appearing in a relatively small collection of already accepted articles. Cosine similarity between this corpus and the text of every potential article is computed, and texts whose similarity falls below a threshold are rejected.
 \end{enumerate}
 All items that pass these tests are stored as potential articles in the shape of the graph depicted in figure \ref{fig:article_graph}. Potential articles can always be reviewed and discarded as the corpus of articles grows and the similarity is recomputed.

@@ -279,9 +278,9 @@
   \label{fig:article_graph}

 \end{figure}
-Texts (or abstracts, in the case of research publications where the body is not available) are subjected to semantic analysis. It is first tokenised and a bag of words is extracted from it. This bag of words is used to query the alchemy\footnote{AlchemyAPI is used under license from IBM Watson.} language analysis service for:
+Texts (or abstracts, in the case of research publications where the body is not available) are subjected to semantic analysis. The text, represented as a bag of words, is used to query the AlchemyAPI\footnote{AlchemyAPI is used under license from IBM Watson.} language analysis service for:
 \begin{itemize}
-\item Named entity recognition. The entity recogniser provides a list of names that appear mentioned in the text together with a measure of relevance. They can include toponyms, institutions, publications and persons. MusicWeb is interested in identifying artists, so every person mentioned is checked against the database. If the person is not included in MusicWeb's database then three resources are checked: dbpedia, musicbrainz and freebase. All three resources identify musicians using the yago ontology. It is important to align the artist properly, since the modeling process is largely unsupervised, and wrong identifications can skew the model. Musicians identified in texts are stored and linked to the artist that originated the query. MusicWeb then offers a link to either of them as ``appearing together in article''.
+\item Named entity recognition. The entity recogniser provides a list of names that are mentioned in the text together with a measure of relevance. They can include toponyms, institutions, publications and persons. MusicWeb is interested in identifying artists, so every person mentioned is checked against the database. If the person is not included in MusicWeb's database then three resources are checked: DBpedia, MusicBrainz and Freebase. All three resources identify musicians using YAGO. It is important to align the artist properly, since the modeling process is largely unsupervised, and wrong identifications can skew the model. Musicians identified in texts are stored and linked to the article as \textbf{props:in\_article}.
 \item Keyword extraction. Keywords are extracted from non-managed texts and from research papers that do not include tags or keywords. Keywords are checked against WordNet for hypernyms and stored. Artists that share keywords or hypernyms are considered to be relevant to the same topic in the literature.
 \end{itemize}
 MusicWeb also offers links between artists who appear in different articles by the same author, as well as in the same journal.
@@ -354,15 +353,24 @@
 \section{Conclusions}\label{sec:conclusions}

 \subsection{Future work}
-
-\subsubsection{Literature-based linking}
-Our next steps in this direction will be:
+MusicWeb is an emerging application being developed at a research center to explore the possibilities of linked data-based music discovery. As such, its user base is at the moment limited to members of the research group (roughly 30 people, students and staff). Hence there are no major technical infrastructure requirements which cannot be supported by the university. This does not mean that we do not intend to utilise very large datasets. It was conceived and is being developed mainly as a research tool. Our aim is to gather in one application various different approaches to music discovery and to show how they can benefit from linked music metadata. Our next steps are directed toward evaluating its potential acceptance by end users. It would be of great value to us to find out which linking methods listeners find most appealing or interesting, and which they would use more often.
+As to the different methods of linking music metadata, our next steps will be:
+\begin{itemize}
+\item In literature-based linking:
 \begin{enumerate}
-\item To reinforcing the model to filter texts. We want to explore different possibilities to make the selection of texts robuster and more reliable.
+\item To reinforce the model to filter texts. We want to explore different possibilities to make the selection of texts more robust and reliable.
 \item To investigate methods for reliable abstract concept identification. The use of valuable metadata such as discipline, journal or keywords offers the possibility of clustering topics under a hierarchy of abstract concepts.
 \item To research the application of graph distance in artist similarity.
 \end{enumerate}

+\item In content-based linking:
+\begin{enumerate}
+\item To include more feature types.
+\item To investigate the correlation between linking categories (e.g. is content-based similarity correlated with cultural similarity?).
+\item To model albums separately (e.g. many artists cross genres over a long career; does this affect content-based linking?).
+\end{enumerate}
+\end{itemize}
+
 	%
 	% ---- Bibliography ----
 	%