changeset 19:3100eb38e180

NLP explanation
author mariano
date Sat, 30 Apr 2016 14:39:29 +0100
parents 596164f18966
children e512ee8c8e63
files musicweb.tex
diffstat 1 files changed, 3 insertions(+), 2 deletions(-)
--- a/musicweb.tex	Sat Apr 30 14:17:23 2016 +0100
+++ b/musicweb.tex	Sat Apr 30 14:39:29 2016 +0100
@@ -272,12 +272,13 @@
 NOTE: not sure about this. Do we consider the dbpedia queries to be socio-cultural? or the collaborates-with in musicbrainz?
 
 \subsection{Similarity in the literature}
-Artists tend to be regarded as similar when writing about certain topics. For example: a psychologist interested in self-image during adolescence might want to research the impact of artists like Miley Cyrus or Rihanna on young teenagers\cite{Lamb2013}. Or a historian researching class politics might write about The Sex Pistols and John Lennon\cite{Moliterno2012}. MusicWeb searches and collects texts from several sources and carries out semantic analysis to identify such connections between artists and higher-level topics. There are two main sources of texts:
+Artists tend to be regarded as similar when writing about certain topics. For example: a psychologist interested in self-image during adolescence might want to research the impact of artists like Miley Cyrus or Rihanna on young teenagers\cite{Lamb2013}. Or a historian researching class politics might write about The Sex Pistols and John Lennon\cite{Moliterno2012}. The starting point is a large database of 100,000 artists. MusicWeb searches and collects texts which mention each artist from several sources and carries out semantic analysis to identify such connections between artists and higher-level topics. There are two main sources of texts:
 \begin{enumerate}
 \item Research articles. There are various web resources that allow querying their research literature databases. MusicWeb uses mendeley\footnote{http://dev.mendeley.com/} and elsevier\footnote{http://dev.elsevier.com/}. Both resources offer managed and largely curated data and search possibilities include keywords, authors and disciplines. Data comprehension varies, but most often it features an array of keywords, an abstract, readership categorised according to discipline and sometimes the article itself.
   \item Online publications, such as newspapers, music magazines and blogs focused on music. This is non-managed, non-curated data, it must be extracted from the body of the text. The data is accessed after having crawled websites searching for keywords or tags in the title, and then scraped. External links contained in the page are also followed. 
 \end{enumerate}
-The text (or the abstact, in the case of research articles) are subjected to semantic analysis. It is first tokenised and a bag of words is extracted from it. This bag of words is used to query the alchemy\footnote{http://www.alchemyapi.com/} language analysis service for entity recognition, keyword extraction and topic modelling. The entity recogniser provides a list of names that are mentioned in the text and which are identified by a model trained with a large dataset of names. It can include toponyms, institutions, publications and persons. MusicWeb is interested in identifying artists, so every person mentioned is checked against three resources: dbpedia, musicbrainz and freebase.
+Many texts mention an artist's name without actually being relevant to MusicWeb. A search for Madonna, for example, can yield many results from the fields of sculpture, art history or religious studies. The first step is therefore to model the relevance of the text. Texts associated with disciplines which clearly do not belong to the humanities are discarded. All other articles are stored in the shape of the graph depicted in figure FIGURE!!!!!!.
+The text (or the abstract, in the case of research articles) is subjected to semantic analysis. It is first tokenised and a bag of words is extracted from it. This bag of words is used to query the alchemy\footnote{http://www.alchemyapi.com/} language analysis service for entity recognition, keyword extraction and topic modelling. The entity recogniser provides a list of names that are mentioned in the text, together with a measure of relevance. Entities are normally identified by a model trained with a large dataset of names. They can include toponyms, institutions, publications and persons. MusicWeb is interested in identifying artists, so every person mentioned is checked against the database. If the person is not included in MusicWeb's database then three resources are checked: dbpedia, musicbrainz and freebase. All three resources identify musicians using the yago ontology. It is important to align the artist properly, since the modelling process is largely unsupervised, and wrong identifications can confuse the model. Musicians identified in texts are stored and linked to the artist that originated the query.
 \begin{itemize}
 \item Semantic analysis\cite{Landauer1998}
 \item Topic modeling\cite{Blei2012}