changeset 25:f601fefa6660
article graph
author | mariano |
---|---|
date | Sat, 30 Apr 2016 18:15:14 +0100 |
parents | 7964cd686c66 |
children | cfd668b44641 25aa1c823930 |
files | graphics/article_graph.graffle graphics/article_graph.pdf musicweb.tex |
diffstat | 3 files changed, 9 insertions(+), 1 deletions(-) |
--- a/musicweb.tex	Sat Apr 30 18:07:01 2016 +0100
+++ b/musicweb.tex	Sat Apr 30 18:15:14 2016 +0100
@@ -279,7 +279,15 @@
 \item Managed articles contain information about the number of readers per discipline. This data is analysed and texts are discarded if the readers belong mainly to disciplines not related to humanities.
 \item The text is projected onto a tfd/idf vector space model\cite{Manning1999} constructed from words appearing in a relatively small collection of already accepted articles. Cosine similarity between this corpus and the text of every potential is computed, and texts which exceed a threshold are rejected.
 \end{enumerate}
-All items that pass these tests are stored as potential articles in the shape of the graph depicted in figure FIGURE!!!!!!. Potential articles can always be reviewed and discarded as the corpus of articles grows and the similarity is recomputed.
+All items that pass these tests are stored as potential articles in the shape of the graph depicted in figure \ref{fig:article_graph}. Potential articles can always be reviewed and discarded as the corpus of articles grows and the similarity is recomputed.
+
+\begin{figure}[!ht]
+    \centering
+    \includegraphics[scale=0.5]{graphics/article_graph.pdf}
+    \caption{Article graph}
+    \label{fig:article_graph}
+
+\end{figure}
 Texts (or abstracts, in the case of research publications where the body is not available) are subjected to semantic analysis. It is first tokenised and a bag of words is extracted from it. This bag of words is used to query the alchemy\footnote{AlchemyAPI is used under license from IBM Watson.} language analysis service for:
 \begin{itemize}
 \item Named entity recognition. The entity recogniser provides a list of names that appear mentioned in the text together with a measure of relevance. They can include toponyms, institutions, publications and persons. MusicWeb is interested in identifying artists, so every person mentioned is checked against the database. If the person is not included in MusicWeb's database then three resources are checked: dbpedia, musicbrainz and freebase. All three resources identify musicians using the yago ontology. It is important to align the artist properly, since the modeling process is largely unsupervised, and wrong identifications can skew the model. Musicians identified in texts are stored and linked to the artist that originated the query. MusicWeb then offers a link to either of them as ``appearing together in article''.
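The second filter in the context lines above (projecting a candidate text onto a tf-idf vector space built from already accepted articles, then computing cosine similarity) could be sketched roughly as follows. This is a minimal illustration and not the MusicWeb implementation: every function name is hypothetical, the regex tokeniser and the smoothed idf formula are assumptions, and so is the choice to represent the accepted corpus as one aggregated vector. The sketch stops at the similarity score; the threshold value and the accept/reject rule are left to the caller.

```python
import math
import re
from collections import Counter


def tokenise(text):
    # Assumption: lowercase alphabetic tokens stand in for a real tokeniser.
    return re.findall(r"[a-z]+", text.lower())


def build_idf(corpus_tokens):
    # Document frequency: in how many accepted articles each term occurs.
    n_docs = len(corpus_tokens)
    df = Counter()
    for tokens in corpus_tokens:
        df.update(set(tokens))
    # Assumption: smoothed idf, log((1 + N) / (1 + df)) + 1.
    return {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf_vector(tokens, idf):
    # Only terms from the accepted-article vocabulary get a dimension.
    tf = Counter(t for t in tokens if t in idf)
    return {t: count * idf[t] for t, count in tf.items()}


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # no shared vocabulary at all
    return dot / (norm_u * norm_v)


def similarity_to_corpus(text, accepted_texts):
    corpus_tokens = [tokenise(t) for t in accepted_texts]
    idf = build_idf(corpus_tokens)
    # Assumption: the corpus is summarised by the tf-idf vector of all
    # accepted articles concatenated together.
    all_tokens = [tok for tokens in corpus_tokens for tok in tokens]
    corpus_vec = tfidf_vector(all_tokens, idf)
    return cosine(tfidf_vector(tokenise(text), idf), corpus_vec)


if __name__ == "__main__":
    accepted = [
        "The composer wrote a symphony for orchestra",
        "Jazz musicians improvise over chord changes",
    ]
    for candidate in (
        "A symphony orchestra played with jazz musicians",
        "Protein folding in yeast cells",
    ):
        print(candidate, "->", round(similarity_to_corpus(candidate, accepted), 3))
```

Because the vector space is built only from words in the accepted collection, a candidate that shares no vocabulary with it scores exactly zero, and the scores change whenever the accepted corpus grows, which matches the remark that similarity is recomputed over time.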