Mercurial > hg > musicweb-iswc2016
changeset 21:af84677d385b
typo
author | mariano |
---|---|
date | Sat, 30 Apr 2016 16:02:35 +0100 |
parents | e512ee8c8e63 |
children | caa2091de9af |
files | musicweb.tex |
diffstat | 1 files changed, 1 insertions(+), 1 deletions(-) |
--- a/musicweb.tex Sat Apr 30 16:01:47 2016 +0100
+++ b/musicweb.tex Sat Apr 30 16:02:35 2016 +0100
@@ -278,7 +278,7 @@
 \item Online publications, such as newspapers, music magazines and blogs focused on music. This is non-managed, non-curated data, it must be extracted from the body of the text. The data is accessed after having crawled websites searching for keywords or tags in the title, and then scraped. External links contained in the page are also followed.
 \end{enumerate}
 Many texts contain references to an artist name without actually being relevant to MusicWeb. A search for Madonna, for example, can yield many results from the fields of sculpture, art history or religion studies. The first step is to model the relevance of the text, and discard texts which are of no interest to music discovery. This is done through a two stage process:
-\being{enumerate}
+\begin{enumerate}
 \item Managed articles contain information about the number of readers per discipline. This data is analysed and texts are discarded if the readers belong mainly to disciplines not related to humanities.
 \item We store a tfd/idf space made up of words appearing in a relatively small collection of already accepted articles. Cosine similarity between this corpus and every potential article is then computed, and texts which exceed a threshold are rejected.
 \end{enumerate}
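
Note on the second filtering stage in the hunk context: the text describes scoring each candidate article by cosine similarity against a tf-idf space built from already accepted articles. The following is a minimal sketch of that idea, not the authors' implementation. The function names, the whitespace tokenizer, the smoothed idf formula, and the choice to represent the accepted corpus as a single aggregate vector are all assumptions made for illustration. The text does not say how the threshold is set or whether the thresholded quantity is a similarity or a distance, so the sketch only returns the score and leaves the accept/reject decision to the caller.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace, keeping alphabetic tokens only (assumed tokenizer)."""
    return [w for w in text.lower().split() if w.isalpha()]


def build_idf(corpus_tokens):
    """Smoothed inverse document frequency for every term in the accepted corpus."""
    n_docs = len(corpus_tokens)
    doc_freq = Counter()
    for tokens in corpus_tokens:
        doc_freq.update(set(tokens))  # count each term once per document
    return {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """Sparse tf-idf vector restricted to the vocabulary of the accepted corpus."""
    counts = Counter(t for t in tokens if t in idf)
    return {t: c * idf[t] for t, c in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector shares nothing with the corpus
    return dot / (norm_u * norm_v)


def corpus_similarity(text, accepted_texts):
    """Score a candidate text against the accepted corpus (hypothetical helper)."""
    corpus_tokens = [tokenize(t) for t in accepted_texts]
    idf = build_idf(corpus_tokens)
    # Assumption: the corpus is summarised by one vector over all its tokens.
    all_tokens = [tok for doc in corpus_tokens for tok in doc]
    return cosine(tfidf_vector(tokenize(text), idf), tfidf_vector(all_tokens, idf))


accepted = [
    "madonna released a new pop album",
    "the singer toured with her band",
]
music_score = corpus_similarity("the pop singer released an album", accepted)
art_score = corpus_similarity("renaissance sculpture of the madonna and child", accepted)
```

With these toy inputs the music-related candidate scores higher than the art-history one, which is the separation the Madonna example in the text is after. Where the cut-off sits would have to be tuned on real data.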