# HG changeset patch
# User mariano
# Date 1462028555 -3600
# Node ID af84677d385b52943eed40c95fa0c5f9e508fb7c
# Parent e512ee8c8e630e0a8aa9efb5a2963553569ee029
typo
diff -r e512ee8c8e63 -r af84677d385b musicweb.tex
--- a/musicweb.tex Sat Apr 30 16:01:47 2016 +0100
+++ b/musicweb.tex Sat Apr 30 16:02:35 2016 +0100
@@ -278,7 +278,7 @@
 \item Online publications, such as newspapers, music magazines and blogs focused on music. This is non-managed, non-curated data, it must be extracted from the body of the text. The data is accessed after having crawled websites searching for keywords or tags in the title, and then scraped. External links contained in the page are also followed.
 \end{enumerate}
 Many texts contain references to an artist name without actually being relevant to MusicWeb. A search for Madonna, for example, can yield many results from the fields of sculpture, art history or religion studies. The first step is to model the relevance of the text, and discard texts which are of no interest to music discovery. This is done through a two stage process:
-\being{enumerate}
+\begin{enumerate}
 \item Managed articles contain information about the number of readers per discipline. This data is analysed and texts are discarded if the readers belong mainly to disciplines not related to humanities.
 \item We store a tfd/idf space made up of words appearing in a relatively small collection of already accepted articles. Cosine similarity between this corpus and every potential article is then computed, and texts which exceed a threshold are rejected.
 \end{enumerate}