Mercurial > hg > musicweb-iswc2016
changeset 23:6741f7163739
NLP, only figure missing
author | mariano |
---|---|
date | Sat, 30 Apr 2016 17:21:05 +0100 |
parents | caa2091de9af |
children | 7964cd686c66 |
files | musicweb.tex |
diffstat | 1 files changed, 24 insertions(+), 10 deletions(-) |
--- a/musicweb.tex Sat Apr 30 16:48:49 2016 +0100
+++ b/musicweb.tex Sat Apr 30 17:21:05 2016 +0100
@@ -282,18 +282,16 @@
 Many texts contain references to an artist name without actually being relevant to MusicWeb. A search for Madonna, for example, can yield many results from the fields of sculpture, art history or religion studies. The first step is to model the relevance of the text, and discard texts which are of no interest to music discovery. This is done through a two stage process:
 \begin{enumerate}
 \item Managed articles contain information about the number of readers per discipline. This data is analysed and texts are discarded if the readers belong mainly to disciplines not related to humanities.
-\item We store a tfd/idf space made up of words appearing in a relatively small collection of already accepted articles. Cosine similarity between this corpus and every potential article is then computed, and texts which exceed a threshold are rejected.
+\item The text is projected onto a tfd/idf vector space model\cite{Manning1999} constructed from words appearing in a relatively small collection of already accepted articles. Cosine similarity between this corpus and the text of every potential is computed, and texts which exceed a threshold are rejected.
 \end{enumerate}
-All items that pass these tests are stored as potential articles in the shape of the graph depicted in figure FIGURE!!!!!!.
-The text (or the abstact, in the case of research articles) are subjected to semantic analysis. It is first tokenised and a bag of words is extracted from it. This bag of words is used to query the alchemy\footnote{http://www.alchemyapi.com/} language analysis service for entity recognition, keyword extraction and topic modelling. The entity recogniser provides a list of names, that are mentioned in the text together with a measure of relevanec. Entities are normally identified by a model trained with a large dataset of names. They can include toponyms, institutions, publications and persons. MusicWeb is interested in identifying artists, so every person mentioned is checked against the database. If the person is not included in MusicWeb's database then three resources are checked: dbpedia, musicbrainz and freebase. All three resources identify musicians using the yago ontology. It is important to align the artist properly, since the modeling process is largely unsupervised, and wrong identifications can confuse the model. Musicians identified in texts are stored and linked to the artist that originated the query.
+All items that pass these tests are stored as potential articles in the shape of the graph depicted in figure FIGURE!!!!!!. Potential articles can always be reviewed and discarded as the corpus of articles grows and the similarity is recomputed.
+Texts (or abstracts, in the case of research publications where the body is not available) are subjected to semantic analysis. It is first tokenised and a bag of words is extracted from it. This bag of words is used to query the alchemy\footnote{AlchemyAPI is used under license from IBM Watson.} language analysis service for:
 \begin{itemize}
-\item Semantic analysis\cite{Landauer1998}
-\item Topic modeling\cite{Blei2012}
-\item Entity recognition
-\item Hierarchical bayesian modeling
-\item Authors, journals, keywords, tags
+\item Named entity recognition. The entity recogniser provides a list of names that appear mentioned in the text together with a measure of relevance. They can include toponyms, institutions, publications and persons. MusicWeb is interested in identifying artists, so every person mentioned is checked against the database. If the person is not included in MusicWeb's database then three resources are checked: dbpedia, musicbrainz and freebase. All three resources identify musicians using the yago ontology. It is important to align the artist properly, since the modeling process is largely unsupervised, and wrong identifications can skew the model. Musicians identified in texts are stored and linked to the artist that originated the query. MusicWeb then offers a link to either of them as ``appearing together in article''.
+\item Keyword extraction. Non-managed texts and research that don't include tags or keywords. Keywords are checked against wordnet for hypernyms and stored\cite{Agirre2004}. Artists that share keywords or hypernyms are considered to be relevant to the same topic in the literature.
+\end{itemize}
+MusicWeb also offers links between artists who appear in different articles by the same author, as well as in the same journal.
-\end{itemize}
 \subsection{Content-based information retrieval}\label{sec:mir}
@@ -398,7 +396,23 @@
  A. G.~Moliterno
  \newblock What Riot? Punk Rock Politics, Fascism, and Rock Against Racism.
  \newblock Published online: \url{http://www.studentpulse.com/articles/612/what-riot-punk-rock-politics-fascism-and-rock-against-racism}, 2012
-
+
+ \bibitem{Manning1999}
+ C.~Manning and H.~Sch\"utze
+ \newblock Foundations of Statistical Natural Language Processing.
+ \newblock MIT Press, Cambridge, MA., 1999
+
+ \bibitem{Agirre2004}
+ E.~Agirre, E.~Alfonseca and O.~Lopez de Lacalle
+ \newblock Approximating Hierarchy-Based Similarity for WordNet Nominal Synsets using Topic Signatures.
+ \newblock In {\em Proceedings of the Second Global WordNet Conference, pp. 15-22}, 2004.
+
+ \bibitem{Wong2012}
+ W.~Wong, W.~Liu and M.~Bennamoun
+ \newblock Ontology Learning from Text: A Look Back and into the Future
+ \newblock In {\em ACM Comput. Surv. 44, 4}, 2012
+
+
  \bibitem{Landauer1998}
  T.~Landauer, P.~Folt, and D.~Laham.
  \newblock An introduction to latent semantic analysis
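
Note on the relevance filter in the first hunk: the second \item describes projecting a candidate text onto a tf-idf space built from a small set of already accepted articles and computing a cosine similarity against that corpus. The sketch below is one possible reading of that step, not the authors' implementation. Everything in it is an assumption made for illustration: the function names, the crude regex tokeniser, the smoothed idf formula, and the choice to represent the corpus by the sum of its document vectors. It only returns the score; how the score is compared against the threshold is left to the caller, since the changeset does not give the threshold value.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Crude tokeniser (assumption): lowercase runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())


def build_idf(corpus_tokens):
    """Smoothed inverse document frequency over the accepted articles."""
    n_docs = len(corpus_tokens)
    doc_freq = Counter()
    for tokens in corpus_tokens:
        doc_freq.update(set(tokens))  # count each term once per document
    return {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """Sparse tf-idf vector; terms outside the corpus vocabulary are dropped."""
    term_freq = Counter(t for t in tokens if t in idf)
    return {t: count * idf[t] for t, count in term_freq.items()}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def corpus_similarity(text, accepted_texts):
    """Similarity between one candidate text and the accepted corpus.

    The corpus is summarised by the sum of its tf-idf vectors (assumption).
    """
    corpus_tokens = [tokenize(t) for t in accepted_texts]
    idf = build_idf(corpus_tokens)
    centroid = Counter()
    for tokens in corpus_tokens:
        for term, weight in tfidf_vector(tokens, idf).items():
            centroid[term] += weight
    return cosine(tfidf_vector(tokenize(text), idf), centroid)
```

With a toy corpus of two music-related sentences, a candidate about a pop album scores higher than one about a marble sculpture of the Madonna, which is the kind of separation the paragraph is after. A real deployment would more likely use an established vectoriser and a tuned threshold.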