Mercurial > hg > musicweb-iswc2016
changeset 20:e512ee8c8e63
progress in NLP
author | mariano |
---|---|
date | Sat, 30 Apr 2016 16:01:47 +0100 |
parents | 3100eb38e180 |
children | af84677d385b |
files | musicweb.tex |
diffstat | 1 files changed, 6 insertions(+), 1 deletions(-) |
--- a/musicweb.tex Sat Apr 30 14:39:29 2016 +0100
+++ b/musicweb.tex Sat Apr 30 16:01:47 2016 +0100
@@ -277,7 +277,12 @@
 \item Research articles. There are various web resources that allow querying their research literature databases. MusicWeb uses Mendeley\footnote{http://dev.mendeley.com/} and Elsevier\footnote{http://dev.elsevier.com/}. Both resources offer managed and largely curated data, and search possibilities include keywords, authors and disciplines. The comprehensiveness of the data varies, but most often a record features an array of keywords, an abstract, readership categorised according to discipline and sometimes the article itself.
 \item Online publications, such as newspapers, music magazines and blogs focused on music. This is non-managed, non-curated data, so it must be extracted from the body of the text. The data is obtained by crawling websites for keywords or tags in the title and then scraping the matching pages. External links contained in the page are also followed.
 \end{enumerate}
-Many texts contain references to an artist name without actually being relevant to MusicWeb. A search for Madonna, for example, can yield many results from the fields of sculpture, art history or religious studies. The first step is to model the relevance of the text. Texts associated with disciplines which clearly do not belong to the humanities are discarded. All other articles are stored in the shape of the graph depicted in figure FIGURE!!!!!!.
+Many texts contain references to an artist name without actually being relevant to MusicWeb. A search for Madonna, for example, can yield many results from the fields of sculpture, art history or religious studies. The first step is to model the relevance of the text, and to discard texts which are of no interest to music discovery. This is done through a two-stage process:
+\begin{enumerate}
+\item Managed articles contain information about the number of readers per discipline. This data is analysed, and texts are discarded if the readers belong mainly to disciplines not related to the humanities.
+\item We store a tf-idf space built from the words appearing in a relatively small collection of already accepted articles. The cosine similarity between this corpus and every candidate article is then computed, and texts whose similarity falls below a threshold are rejected.
+\end{enumerate}
+All items that pass these tests are stored as potential articles in the shape of the graph depicted in figure FIGURE!!!!!!.
 The text (or the abstract, in the case of research articles) is subjected to semantic analysis. It is first tokenised and a bag of words is extracted from it. This bag of words is used to query the Alchemy\footnote{http://www.alchemyapi.com/} language analysis service for entity recognition, keyword extraction and topic modelling. The entity recogniser provides a list of names that are mentioned in the text, together with a measure of relevance. Entities are normally identified by a model trained with a large dataset of names. They can include toponyms, institutions, publications and persons. MusicWeb is interested in identifying artists, so every person mentioned is checked against the database. If the person is not included in MusicWeb's database then three resources are checked: DBpedia, MusicBrainz and Freebase. All three resources identify musicians using the YAGO ontology. It is important to align the artist properly, since the modelling process is largely unsupervised, and wrong identifications can confuse the model. Musicians identified in texts are stored and linked to the artist that originated the query.
 \begin{itemize}
 \item Semantic analysis\cite{Landauer1998}
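
The two-stage relevance filter added in this changeset could be sketched roughly as follows. This is a minimal illustration in Python, not the MusicWeb implementation. The discipline labels in `HUMANITIES`, the `min_share` and `threshold` values, the smoothed idf formula and the use of a corpus centroid as "the corpus" are all assumptions made for the example. The sketch also assumes that a candidate is rejected when its similarity to the accepted corpus is too low.

```python
import math
import re
from collections import Counter

# Hypothetical discipline labels; the paper only says "humanities".
HUMANITIES = {"Arts and Humanities", "Music", "Social Sciences"}


def passes_readership(readers_by_discipline, min_share=0.5):
    """Stage 1: keep a managed article if enough readers are in the humanities."""
    total = sum(readers_by_discipline.values())
    if total == 0:
        return True  # no readership data, so defer the decision to stage 2
    relevant = sum(n for d, n in readers_by_discipline.items() if d in HUMANITIES)
    return relevant / total >= min_share


def tokenise(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TfidfSpace:
    """Stage 2: tf-idf space built from a small set of accepted articles."""

    def __init__(self, accepted_texts):
        docs = [Counter(tokenise(t)) for t in accepted_texts]
        n = len(docs)
        # Document frequency: number of accepted articles containing each term.
        doc_freq = Counter(term for doc in docs for term in doc)
        # Smoothed idf (an assumed variant); unseen terms get the largest weight.
        self.idf = {t: math.log((1 + n) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
        self.unseen_idf = math.log(1 + n) + 1.0
        # Represent the accepted corpus by the mean of its tf-idf vectors.
        self.centroid = {}
        for doc in docs:
            for term, weight in self._weigh(doc).items():
                self.centroid[term] = self.centroid.get(term, 0.0) + weight / n

    def _weigh(self, counts):
        return {t: c * self.idf.get(t, self.unseen_idf) for t, c in counts.items()}

    def similarity(self, text):
        return cosine(self._weigh(Counter(tokenise(text))), self.centroid)


def is_relevant(text, space, threshold=0.4):
    """Accept a candidate whose similarity to the accepted corpus reaches the threshold."""
    return space.similarity(text) >= threshold
```

A candidate would first go through `passes_readership` when readership counts are available, and then through `is_relevant`. Terms that never occur in the accepted articles are kept in the candidate vector, so a text that merely shares a few common words with the corpus scores lower than one whose vocabulary mostly overlaps with it.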