\documentclass[conference]{IEEEtran}
\usepackage{cite}
\usepackage[cmex10]{amsmath}
\usepackage{graphicx}
\usepackage{amssymb}
\usepackage{epstopdf}
\usepackage{url}
\usepackage{listings}
%\usepackage[expectangle]{tools}
\usepackage{tools}
\usepackage{tikz}
\usetikzlibrary{calc}
\usetikzlibrary{matrix}
\usetikzlibrary{patterns}
\usetikzlibrary{arrows}
\let\citep=\cite
\newcommand{\colfig}[2][1]{\includegraphics[width=#1\linewidth]{figs/#2}}%
\newcommand\preals{\reals_+}
\newcommand\X{\mathcal{X}}
\newcommand\Y{\mathcal{Y}}
\newcommand\domS{\mathcal{S}}
\newcommand\A{\mathcal{A}}
\newcommand\Data{\mathcal{D}}
\newcommand\rvm[1]{\mathrm{#1}}
\newcommand\sps{\,.\,}
\newcommand\Ipred{\mathcal{I}_{\mathrm{pred}}}
\newcommand\Ix{\mathcal{I}}
\newcommand\IXZ{\overline{\underline{\mathcal{I}}}}
\newcommand\x{\vec{x}}
\newcommand\Ham[1]{\mathcal{H}_{#1}}
\newcommand\subsets[2]{[#1]^{(k)}}
\def\bet(#1,#2){#1..#2}
\def\ev(#1=#2){#1\!\!=\!#2}
\newcommand\rv[1]{\Omega \to #1}
\newcommand\ceq{\!\!=\!}
\newcommand\cmin{\!-\!}
\newcommand\modulo[2]{#1\!\!\!\!\!\mod#2}
\newcommand\sumitoN{\sum_{i=1}^N}
\newcommand\sumktoK{\sum_{k=1}^K}
\newcommand\sumjtoK{\sum_{j=1}^K}
\newcommand\sumalpha{\sum_{\alpha\in\A}}
\newcommand\prodktoK{\prod_{k=1}^K}
\newcommand\prodjtoK{\prod_{j=1}^K}
\newcommand\past[1]{\overset{\rule{0pt}{0.2em}\smash{\leftarrow}}{#1}}
\newcommand\fut[1]{\overset{\rule{0pt}{0.1em}\smash{\rightarrow}}{#1}}
\newcommand\parity[2]{P^{#1}_{2,#2}}
%\usepackage[parfill]{parskip}

\begin{document}
\title{Cognitive Music Modelling: an\\Information Dynamics Approach}

\author{
\IEEEauthorblockN{Samer A. Abdallah, Henrik Ekeus, Peter Foster}
\IEEEauthorblockN{Andrew Robertson and Mark D. Plumbley}
\IEEEauthorblockA{Centre for Digital Music\\
Queen Mary University of London\\
Mile End Road, London E1 4NS}}

\maketitle

\begin{abstract}
When perceiving music, people take in information and use it to continually
build predictive models of what is going to happen next. These predictions,
and how they change as the music unfolds, can be quantified with
information-theoretic measures that bear a close relationship to how we
perceive music; an information-theoretic approach to music cognition is thus
a fruitful avenue of research. In this paper, we review the theoretical
foundations of information dynamics and discuss a few emerging areas of
application.
\end{abstract}

\section{Introduction}
\label{s:Intro}

The relationship between Shannon's \cite{Shannon48} information theory and
music and art in general has been the subject of some interest since the 1950s
\cite{Youngblood58,CoonsKraehenbuehl1958,HillerBean66,Moles66,Meyer67,Cohen1962}.
The general thesis is that perceptible qualities and subjective states like
uncertainty, surprise, complexity, tension, and interestingness are closely
related to information-theoretic quantities like entropy, relative entropy,
and mutual information.
Music is also an inherently dynamic process, where listeners build up
expectations of what is to happen next, which are either satisfied or modified
as the music unfolds. In this paper, we explore this ``Information Dynamics''
view of music, discussing the theory behind it and some emerging applications.

\subsection{Expectation and surprise in music}
One of the effects of listening to music is to create expectations of what is
to come next, which may be fulfilled immediately, after some delay, or not at
all as the case may be. This is the thesis put forward by, amongst others,
music theorists L. B.
Meyer \cite{Meyer67} and Narmour \citep{Narmour77}, but was recognised much
earlier; for example, it was elegantly put by Hanslick \cite{Hanslick1854} in
the nineteenth century:
\begin{quote}
`The most important factor in the mental process which accompanies the act of
listening to music, and which converts it to a source of pleasure, is \ldots
the intellectual satisfaction which the listener derives from continually
following and anticipating the composer's intentions---now, to see his
expectations fulfilled, and now, to find himself agreeably mistaken.'
%It is a matter of course that
%this intellectual flux and reflux, this perpetual giving and receiving
%takes place unconsciously, and with the rapidity of lightning-flashes.
\end{quote}
An essential aspect of this is that music is experienced as a phenomenon that
`unfolds' in time, rather than being apprehended as a static object presented
in its entirety. Meyer argued that musical experience depends on how we change
and revise our conceptions \emph{as events happen}, on how expectation and
prediction interact with occurrence, and that, to a large degree, the way to
understand the effect of music is to focus on this `kinetics' of expectation
and surprise.

Prediction and expectation are essentially probabilistic concepts and can be
treated mathematically using probability theory. We suppose that when we
listen to music, expectations are created on the basis of our familiarity with
various styles of music and our ability to detect and learn statistical
regularities in the music as they emerge. There is experimental evidence that
human listeners are able to internalise statistical knowledge about musical
structure, \eg \citep{SaffranJohnsonAslin1999,EerolaToiviainenKrumhansl2002},
and also that statistical models can form an effective basis for computational
analysis of music, \eg
\cite{ConklinWitten95,PonsfordWigginsMellish1999,Pearce2005}.

\comment{
The business of making predictions and assessing surprise is essentially one
of reasoning under conditions of uncertainty and manipulating degrees of
belief about the various propositions which may or may not hold, and, as has
been argued elsewhere \cite{Cox1946,Jaynes27}, best quantified in terms of
Bayesian probability theory. Thus, we suppose that when we listen to music,
expectations are created on the basis of our familiarity with various
stylistic norms that apply to music in general, the particular style (or
styles) of music that seem best to fit the piece we are listening to, and the
emerging structures peculiar to the current piece. There is experimental
evidence that human listeners are able to internalise statistical knowledge
about musical structure, \eg
\citep{SaffranJohnsonAslin1999,EerolaToiviainenKrumhansl2002}, and also that
statistical models can form an effective basis for computational analysis of
music, \eg \cite{ConklinWitten95,PonsfordWigginsMellish1999,Pearce2005}.
}

% \subsection{Music and information theory}
With a probabilistic framework for music modelling and prediction in hand, we
are in a position to compute various
\comment{
which provides us with a number of measures, such as entropy and mutual
information, which are suitable for quantifying states of uncertainty and
surprise, and thus could potentially enable us to build quantitative models of
the listening process described above. They are what Berlyne \cite{Berlyne71}
called `collative variables' since they are to do with patterns of occurrence
rather than medium-specific details.
Berlyne sought to show that the collative variables are closely related to
perceptual qualities like complexity, tension, interestingness, and even
aesthetic value, not just in music, but in other temporal or visual media.
The relevance of information theory to music and art has also been addressed
by researchers from the 1950s onwards
\cite{Youngblood58,CoonsKraehenbuehl1958,Cohen1962,HillerBean66,Moles66,Meyer67}.
}
information-theoretic quantities like entropy, relative entropy, and mutual
information.
% and are major determinants of the overall experience.
Berlyne \cite{Berlyne71} called such quantities `collative variables', since
they are to do with patterns of occurrence rather than medium-specific
details, and developed the ideas of `information aesthetics' in an
experimental setting.
% Berlyne's `new experimental aesthetics', the `information-aestheticians'.
% Listeners then experience greater or lesser levels of surprise
% in response to departures from these norms.
% By careful manipulation
% of the material, the composer can thus define, and induce within the
% listener, a temporal programme of varying
% levels of uncertainty, ambiguity and surprise.

\subsection{Information dynamic approach}

Bringing the various strands together, our working hypothesis is that as a
listener (to which we will refer as `it') listens to a piece of music, it
maintains a dynamically evolving probabilistic model that enables it to make
predictions about how the piece will continue, relying on both its previous
experience of music and the immediate context of the piece. As events unfold,
it revises its probabilistic belief state, which includes predictive
distributions over possible future events. These
% distributions and
changes in distributions can be characterised in terms of a handful of
information-theoretic measures such as entropy and relative entropy. By
tracing the evolution of these measures, we obtain a representation which
captures much of the significant structure of the music.

One of the consequences of this approach is that regardless of the details of
the sensory input or even which sensory modality is being processed, the
resulting analysis is in terms of the same units: quantities of information
(bits) and rates of information flow (bits per second). The probabilistic and
information-theoretic concepts in terms of which the analysis is framed are
universal to all sorts of data. In addition, when adaptive probabilistic
models are used, expectations are created mainly in response to \emph{patterns}
of occurrence, rather than the details of which specific things occur.
Together, these suggest that an information dynamic analysis captures a high
level of \emph{abstraction}, and could be used to make structural comparisons
between different temporal media, such as music, film, animation, and dance.
% analyse and compare information
% flow in different temporal media regardless of whether they are auditory,
% visual or otherwise.

Another consequence is that the information dynamic approach gives us a
principled way to address the notion of \emph{subjectivity}, since the
analysis is dependent on the probability model the observer starts off with,
which may depend on prior experience or other factors, and which may change
over time. Thus, inter-subject variability and variation in subjects'
responses over time are fundamental to the theory.
%modelling the creative process, which often alternates between generative
%and selective or evaluative phases \cite{Boden1990}, and would have
%applications in tools for computer aided composition.

\section{Theoretical review}

\subsection{Entropy and information}
\label{s:entro-info}
Let $X$ denote some variable whose value is initially unknown to our
hypothetical observer. We will treat $X$ mathematically as a random variable,
with a value to be drawn from some set $\X$ and a probability distribution
representing the observer's beliefs about the true value of $X$.
In this case, the observer's uncertainty about $X$ can be quantified as the
entropy $H(X)$ of the random variable. For a discrete variable with
probability mass function $p:\X \to [0,1]$, this is
\begin{equation}
	H(X) = \sum_{x\in\X} -p(x) \log p(x),
%	 = \expect{-\log p(X)},
\end{equation}
% where $\expect{}$ is the expectation operator.
The negative-log-probability $\ell(x) = -\log p(x)$ of a particular value $x$
can usefully be thought of as the \emph{surprisingness} of the value $x$
should it be observed, and hence the entropy is the expectation of the
surprisingness $\expect \ell(X)$.

Now suppose that the observer receives some new data $\Data$ that causes a
revision of its beliefs about $X$. The \emph{information} in this new data
\emph{about} $X$ can be quantified as the Kullback-Leibler (KL) divergence
between the prior and posterior distributions $p(x)$ and $p(x|\Data)$
respectively:
\begin{equation}
	\mathcal{I}_{\Data\to X} = D(p_{X|\Data} || p_{X})
		= \sum_{x\in\X} p(x|\Data) \log \frac{p(x|\Data)}{p(x)}.
	\label{eq:info}
\end{equation}
When there are multiple variables $X_1, X_2$ \etc which the observer believes
to be dependent, then the observation of one may change its beliefs and hence
yield information about the others. The joint and conditional entropies as
described in any textbook on information theory (\eg \cite{CoverThomas}) then
quantify the observer's expected uncertainty about groups of variables given
the values of others. In particular, the \emph{mutual information}
$I(X_1;X_2)$ is both the expected information in an observation of $X_2$ about
$X_1$ and the expected reduction in uncertainty about $X_1$ after observing
$X_2$:
\begin{equation}
	I(X_1;X_2) = H(X_1) - H(X_1|X_2),
\end{equation}
where $H(X_1|X_2) = H(X_1,X_2) - H(X_2)$ is the conditional entropy of $X_1$
given $X_2$. A little algebra shows that $I(X_1;X_2)=I(X_2;X_1)$ and so the
mutual information is symmetric in its arguments. A conditional form of the
mutual information can be formulated analogously:
\begin{equation}
	I(X_1;X_2|X_3) = H(X_1|X_3) - H(X_1|X_2,X_3).
\end{equation}
These relationships between the various entropies and mutual informations are
conveniently visualised in Venn diagram-like \emph{information diagrams} or
I-diagrams \cite{Yeung1991} such as the one in \figrf{venn-example}.
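As a minimal concrete illustration of these definitions (a sketch for
exposition only, not code used in any of the experiments reported here), the
following Python fragment computes the entropy of a discrete distribution, the
information in some new data about $X$ as the KL divergence of \eqrf{info},
and the mutual information obtained from a joint probability table; the
numerical distributions are arbitrary examples.

\begin{lstlisting}[language=Python]
import numpy as np

def entropy(p):
    # H(X) in bits for a discrete distribution p (an array summing to 1)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def information(posterior, prior):
    # information in the data about X: KL divergence D(posterior || prior), in bits
    m = posterior > 0
    return float(np.sum(posterior[m] * np.log2(posterior[m] / prior[m])))

def mutual_information(pxy):
    # I(X1;X2) in bits from a joint probability table pxy[i, j]
    px = pxy.sum(axis=1)   # marginal distribution of X1
    py = pxy.sum(axis=0)   # marginal distribution of X2
    return entropy(px) + entropy(py) - entropy(pxy.ravel())

# illustrative numbers only: prior belief about X, posterior after observing data D
prior = np.array([0.5, 0.25, 0.25])
posterior = np.array([0.1, 0.7, 0.2])
print(entropy(prior))                 # uncertainty about X before the data
print(information(posterior, prior))  # information in D about X
\end{lstlisting}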
\begin{fig}{venn-example}
	\newcommand\rad{2.2em}%
	\newcommand\circo{circle (3.4em)}%
	\newcommand\labrad{4.3em}
	\newcommand\bound{(-6em,-5em) rectangle (6em,6em)}
	\newcommand\colsep{\ }
	\newcommand\clipin[1]{\clip (#1) \circo;}%
	\newcommand\clipout[1]{\clip \bound (#1) \circo;}%
	\newcommand\cliptwo[3]{%
		\begin{scope}
			\clipin{#1};
			\clipin{#2};
			\clipout{#3};
			\fill[black!30] \bound;
		\end{scope}
	}%
	\newcommand\clipone[3]{%
		\begin{scope}
			\clipin{#1};
			\clipout{#2};
			\clipout{#3};
			\fill[black!15] \bound;
		\end{scope}
	}%
	\begin{tabular}{c@{\colsep}c}
		\begin{tikzpicture}[baseline=0pt]
			\coordinate (p1) at (90:\rad);
			\coordinate (p2) at (210:\rad);
			\coordinate (p3) at (-30:\rad);
			\clipone{p1}{p2}{p3};
			\clipone{p2}{p3}{p1};
			\clipone{p3}{p1}{p2};
			\cliptwo{p1}{p2}{p3};
			\cliptwo{p2}{p3}{p1};
			\cliptwo{p3}{p1}{p2};
			\begin{scope}
				\clip (p1) \circo;
				\clip (p2) \circo;
				\clip (p3) \circo;
				\fill[black!45] \bound;
			\end{scope}
			\draw (p1) \circo;
			\draw (p2) \circo;
			\draw (p3) \circo;
			\path
				(barycentric cs:p3=1,p1=-0.2,p2=-0.1) +(0ex,0) node {$I_{3|12}$}
				(barycentric cs:p1=1,p2=-0.2,p3=-0.1) +(0ex,0) node {$I_{1|23}$}
				(barycentric cs:p2=1,p3=-0.2,p1=-0.1) +(0ex,0) node {$I_{2|13}$}
				(barycentric cs:p3=1,p2=1,p1=-0.55) +(0ex,0) node {$I_{23|1}$}
				(barycentric cs:p1=1,p3=1,p2=-0.55) +(0ex,0) node {$I_{13|2}$}
				(barycentric cs:p2=1,p1=1,p3=-0.55) +(0ex,0) node {$I_{12|3}$}
				(barycentric cs:p3=1,p2=1,p1=1) node {$I_{123}$}
				;
			\path
				(p1) +(140:\labrad) node {$X_1$}
				(p2) +(-140:\labrad) node {$X_2$}
				(p3) +(-40:\labrad) node {$X_3$};
		\end{tikzpicture}
		&
		\parbox{0.5\linewidth}{
			\small
			\begin{align*}
				I_{1|23} &= H(X_1|X_2,X_3) \\
				I_{13|2} &= I(X_1;X_3|X_2) \\
				I_{1|23} + I_{13|2} &= H(X_1|X_2) \\
				I_{12|3} + I_{123} &= I(X_1;X_2)
			\end{align*}
		}
	\end{tabular}
	\caption{
	I-diagram visualisation of entropies and mutual informations for three
	random variables $X_1$, $X_2$ and $X_3$. The areas of the three circles
	represent $H(X_1)$, $H(X_2)$ and $H(X_3)$ respectively. The total shaded
	area is the joint entropy $H(X_1,X_2,X_3)$. The central area $I_{123}$ is
	the co-information \cite{McGill1954}. Some other information measures are
	indicated in the legend.
	}
\end{fig}

\subsection{Surprise and information in sequences}
\label{s:surprise-info-seq}

Suppose that $(\ldots,X_{-1},X_0,X_1,\ldots)$ is a sequence of random
variables, infinite in both directions, and that $\mu$ is the associated
probability measure over all realisations of the sequence---in the following,
$\mu$ will simply serve as a label for the process. We can identify a number
of information-theoretic measures meaningful in the context of a sequential
observation of the sequence, during which, at any time $t$, the sequence of
variables can be divided into a `present' $X_t$, a `past'
$\past{X}_t \equiv (\ldots, X_{t-2}, X_{t-1})$, and a `future'
$\fut{X}_t \equiv (X_{t+1},X_{t+2},\ldots)$. We will write the actually
observed value of $X_t$ as $x_t$, and the sequence of observations up to but
not including $x_t$ as $\past{x}_t$.
% Since the sequence is assumed stationary, we can without loss of generality,
% assume that $t=0$ in the following definitions.
The in-context surprisingness of the observation $X_t=x_t$ depends on both
$x_t$ and the context $\past{x}_t$:
\begin{equation}
	\ell_t = - \log p(x_t|\past{x}_t).
\end{equation}
However, before $X_t$ is observed to be $x_t$, the observer can compute the
\emph{expected} surprisingness as a measure of its uncertainty about the very
next event; this may be written as an entropy
$H(X_t|\ev(\past{X}_t = \past{x}_t))$, but note that this is conditional on
the \emph{event} $\ev(\past{X}_t=\past{x}_t)$, not on the \emph{variables}
$\past{X}_t$ as in the conventional conditional entropy.

The surprisingness $\ell_t$ and expected surprisingness
$H(X_t|\ev(\past{X}_t=\past{x}_t))$ can be understood as \emph{subjective}
information dynamic measures, since they are based on the observer's
probability model in the context of the actually observed sequence
$\past{x}_t$---they characterise what it is like to `be in the observer's
shoes'. If we view the observer as a purely passive or reactive agent, this
would probably be sufficient, but for active agents such as humans or animals,
it is often necessary to \emph{anticipate} future events in order, for
example, to plan the most effective course of action. It makes sense for such
observers to be concerned about the predictive probability distribution over
future events, $p(\fut{x}_t|\past{x}_t)$.

When an observation $\ev(X_t=x_t)$ is made in this context, the
\emph{instantaneous predictive information} (IPI) $\mathcal{I}_t$ at time $t$
is the information in the event $\ev(X_t=x_t)$ about the entire future of the
sequence $\fut{X}_t$, \emph{given} the observed past $\past{X}_t=\past{x}_t$.
Referring to the definition of information \eqrf{info}, this is the KL
divergence between prior and posterior distributions over possible futures,
which, written out in full, is
\begin{equation}
	\mathcal{I}_t = \sum_{\fut{x}_t \in \X^*}
		p(\fut{x}_t|x_t,\past{x}_t)
		\log \frac{ p(\fut{x}_t|x_t,\past{x}_t) }{ p(\fut{x}_t|\past{x}_t) },
\end{equation}
where the sum is to be taken over the set of infinite sequences $\X^*$. Note
that it is quite possible for an event to be surprising but not informative in
a predictive sense. As with the surprisingness, the observer can compute its
\emph{expected} IPI at time $t$, which reduces to a mutual information
$I(X_t;\fut{X}_t|\ev(\past{X}_t=\past{x}_t))$ conditioned on the observed
past. This could be used, for example, as an estimate of attentional resources
which should be directed at this stream of data, which may be in competition
with other sensory streams.
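As a concrete illustration of these subjective measures, the following Python
sketch computes the surprisingness $\ell_t$ and the expected surprisingness at
each step of an observed sequence, assuming (purely for simplicity of
exposition) a fixed first-order Markov model, so that the predictive
distribution depends only on the previous symbol; the transition matrix is an
arbitrary example, and the IPI itself, which involves a sum over infinite
futures, is not computed here.

\begin{lstlisting}[language=Python]
import numpy as np

# Illustrative column-stochastic transition matrix: a[i, j] = Pr(X_t = i | X_{t-1} = j)
a = np.array([[0.80, 0.10, 0.30],
              [0.10, 0.80, 0.30],
              [0.10, 0.10, 0.40]])

def surprisal_trace(seq, a):
    # In-context surprisingness l_t and expected surprisingness (entropy of the
    # one-step predictive distribution), in bits, for each step of the sequence.
    ell, expected = [], []
    for prev, cur in zip(seq[:-1], seq[1:]):
        pred = a[:, prev]                               # p(x_t | past): depends only on prev here
        ell.append(-np.log2(pred[cur]))                 # surprisingness of the observed symbol
        expected.append(-np.sum(pred * np.log2(pred)))  # uncertainty about X_t before observing it
    return np.array(ell), np.array(expected)

ell, expected = surprisal_trace([0, 0, 1, 2, 0], a)
\end{lstlisting}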
\subsection{Information measures for stationary random processes}
\label{s:process-info}

\begin{fig}{predinfo-bg}
	\newcommand\subfig[2]{\shortstack{#2\\[0.75em]#1}}
	\newcommand\rad{1.8em}%
	\newcommand\ovoid[1]{%
		++(-#1,\rad)
		-- ++(2 * #1,0em) arc (90:-90:\rad)
		-- ++(-2 * #1,0em) arc (270:90:\rad)
	}%
	\newcommand\axis{2.75em}%
	\newcommand\olap{0.85em}%
	\newcommand\offs{3.6em}
	\newcommand\colsep{\hspace{5em}}
	\newcommand\longblob{\ovoid{\axis}}
	\newcommand\shortblob{\ovoid{1.75em}}
	\begin{tabular}{c}
		\subfig{(a) multi-information and entropy rates}{%
			\begin{tikzpicture}%[baseline=-1em]
				\newcommand\rc{1.75em}
				\newcommand\throw{2.5em}
				\coordinate (p1) at (180:1.5em);
				\coordinate (p2) at (0:0.3em);
				\newcommand\bound{(-7em,-2.6em) rectangle (7em,3.0em)}
				\newcommand\present{(p2) circle (\rc)}
				\newcommand\thepast{(p1) ++(-\throw,0) \ovoid{\throw}}
				\newcommand\fillclipped[2]{%
					\begin{scope}[even odd rule]
						\foreach \thing in {#2} {\clip \thing;}
						\fill[black!#1] \bound;
					\end{scope}%
				}%
				\fillclipped{30}{\present,\bound \thepast}
				\fillclipped{15}{\present,\bound \thepast}
				\fillclipped{45}{\present,\thepast}
				\draw \thepast;
				\draw \present;
				\node at (barycentric cs:p2=1,p1=-0.3) {$h_\mu$};
				\node at (barycentric cs:p2=1,p1=1) [shape=rectangle,fill=black!45,inner sep=1pt]{$\rho_\mu$};
				\path (p2) +(90:3em) node {$X_0$};
				\path (p1) +(-3em,0em) node {\shortstack{infinite\\past}};
				\path (p1) +(-4em,\rad) node [anchor=south] {$\ldots,X_{-1}$};
			\end{tikzpicture}}%
		\\[1.25em]
		\subfig{(b) excess entropy}{%
			\newcommand\blob{\longblob}
			\begin{tikzpicture}
				\coordinate (p1) at (-\offs,0em);
				\coordinate (p2) at (\offs,0em);
				\begin{scope}
					\clip (p1) \blob;
					\clip (p2) \blob;
					\fill[lightgray] (-1,-1) rectangle (1,1);
				\end{scope}
				\draw (p1) +(-0.5em,0em) node{\shortstack{infinite\\past}} \blob;
				\draw (p2) +(0.5em,0em) node{\shortstack{infinite\\future}} \blob;
				\path (0,0) node (future) {$E$};
				\path (p1) +(-2em,\rad) node [anchor=south] {$\ldots,X_{-1}$};
				\path (p2) +(2em,\rad) node [anchor=south] {$X_0,\ldots$};
			\end{tikzpicture}%
		}%
		\\[1.25em]
		\subfig{(c) predictive information rate $b_\mu$}{%
			\begin{tikzpicture}%[baseline=-1em]
				\newcommand\rc{2.1em}
				\newcommand\throw{2.5em}
				\coordinate (p1) at (210:1.5em);
				\coordinate (p2) at (90:0.7em);
				\coordinate (p3) at (-30:1.5em);
				\newcommand\bound{(-7em,-2.6em) rectangle (7em,3.0em)}
				\newcommand\present{(p2) circle (\rc)}
				\newcommand\thepast{(p1) ++(-\throw,0) \ovoid{\throw}}
				\newcommand\future{(p3) ++(\throw,0) \ovoid{\throw}}
				\newcommand\fillclipped[2]{%
					\begin{scope}[even odd rule]
						\foreach \thing in {#2} {\clip \thing;}
						\fill[black!#1] \bound;
					\end{scope}%
				}%
				\fillclipped{80}{\future,\thepast}
				\fillclipped{30}{\present,\future,\bound \thepast}
				\fillclipped{15}{\present,\bound \future,\bound \thepast}
				\draw \future;
				\fillclipped{45}{\present,\thepast}
				\draw \thepast;
				\draw \present;
				\node at (barycentric cs:p2=1,p1=-0.17,p3=-0.17) {$r_\mu$};
				\node at (barycentric cs:p1=-0.4,p2=1.0,p3=1) {$b_\mu$};
				\node at (barycentric cs:p3=0,p2=1,p1=1.2) [shape=rectangle,fill=black!45,inner sep=1pt]{$\rho_\mu$};
				\path (p2) +(140:3em) node {$X_0$};
%				\node at (barycentric cs:p3=0,p2=1,p1=1) {$\rho_\mu$};
				\path (p3) +(3em,0em) node {\shortstack{infinite\\future}};
				\path (p1) +(-3em,0em) node {\shortstack{infinite\\past}};
				\path (p1) +(-4em,\rad) node [anchor=south] {$\ldots,X_{-1}$};
				\path (p3) +(4em,\rad) node [anchor=south] {$X_1,\ldots$};
			\end{tikzpicture}}%
		\\[0.5em]
	\end{tabular}
	\caption{
	I-diagrams for several information measures in stationary random processes.
	Each circle or oval represents a random variable or sequence of random
	variables relative to time $t=0$. Overlapped areas correspond to various
	mutual informations as in \Figrf{venn-example}. In (c), the circle
	represents the `present'. Its total area is
	$H(X_0)=\rho_\mu+r_\mu+b_\mu$, where $\rho_\mu$ is the multi-information
	rate, $r_\mu$ is the residual entropy rate, and $b_\mu$ is the predictive
	information rate. The entropy rate is $h_\mu = r_\mu+b_\mu$. The small
	dark region below $X_0$ in (c) is $\sigma_\mu = E-\rho_\mu$.
	}
\end{fig}

If we step back, out of the observer's shoes as it were, and consider the
random process $(\ldots,X_{-1},X_0,X_1,\dots)$ as a statistical ensemble of
possible realisations, and furthermore assume that it is stationary, then it
becomes possible to define a number of information-theoretic measures, closely
related to those described above, but which characterise the process as a
whole, rather than on a moment-by-moment basis. Some of these, such as the
entropy rate, are well known, but others have only recently been investigated.
(In the following, the assumption of stationarity means that the measures
defined below are independent of $t$.)

The \emph{entropy rate} of the process is the entropy of the next variable
$X_t$ given all the previous ones:
\begin{equation}
	\label{eq:entro-rate}
	h_\mu = H(X_t|\past{X}_t).
\end{equation}
The entropy rate is a measure of the overall surprisingness or
unpredictability of the process, and gives an indication of the average level
of surprise and uncertainty that would be experienced by an observer
processing a sequence sampled from the process using the methods of
\secrf{surprise-info-seq}.

The \emph{multi-information rate} $\rho_\mu$ (following Dubnov's
\cite{Dubnov2006} notation for what he called the `information rate') is the
mutual information between the `past' and the `present':
\begin{equation}
	\label{eq:multi-info}
	\rho_\mu = I(\past{X}_t;X_t) = H(X_t) - h_\mu.
\end{equation}
It is a measure of how much the context of an observation (that is, the
observation of previous elements of the sequence) helps in predicting or
reducing the surprisingness of the current observation.

The \emph{excess entropy} \cite{CrutchfieldPackard1983} is the mutual
information between the entire `past' and the entire `future':
\begin{equation}
	E = I(\past{X}_t; X_t,\fut{X}_t).
\end{equation}
Both the excess entropy and the multi-information rate can be thought of as
measures of \emph{redundancy}, quantifying the extent to which the same
information is to be found in all parts of the sequence.

The \emph{predictive information rate} (or PIR) \cite{AbdallahPlumbley2009} is
the mutual information between the present and the infinite future given the
infinite past:
\begin{equation}
	\label{eq:PIR}
	b_\mu = I(X_t;\fut{X}_t|\past{X}_t)
		= H(\fut{X}_t|\past{X}_t) - H(\fut{X}_t|X_t,\past{X}_t).
\end{equation}
Equation \eqrf{PIR} can be read as the average reduction in uncertainty about
the future on learning $X_t$, given the past. Due to the symmetry of the
mutual information, it can also be written as
\begin{equation}
%	\IXZ_t
	b_\mu = H(X_t|\past{X}_t) - H(X_t|\past{X}_t,\fut{X}_t) = h_\mu - r_\mu,
%	\label{<++>}
\end{equation}
% If $X$ is stationary, then
where $r_\mu = H(X_t|\fut{X}_t,\past{X}_t)$ is the \emph{residual}
\cite{AbdallahPlumbley2010}, or \emph{erasure} \cite{VerduWeissman2006},
entropy rate. These relationships are illustrated in \Figrf{predinfo-bg},
along with several of the information measures we have discussed so far.
The PIR gives an indication of the average IPI that would be experienced by an
observer processing a sequence sampled from this process.

James et al.\ \cite{JamesEllisonCrutchfield2011} review several of these
information measures and introduce some new related ones. In particular, they
identify $\sigma_\mu = I(\past{X}_t;\fut{X}_t|X_t)$, the mutual information
between the past and the future given the present, as an interesting quantity
that measures the predictive benefit of model-building (that is, maintaining
an internal state summarising past observations in order to make better
predictions). It is shown as the small dark region below the circle in
\figrf{predinfo-bg}(c). By comparing with \figrf{predinfo-bg}(b), we can see
that $\sigma_\mu = E - \rho_\mu$.
% They also identify
% $w_\mu = \rho_\mu + b_{\mu}$, which they call the \emph{local exogenous
% information} rate.

\subsection{First and higher order Markov chains}
\label{s:markov}

First order Markov chains are the simplest non-trivial models to which
information dynamics methods can be applied. In \cite{AbdallahPlumbley2009} we
derived expressions for all the information measures described in
\secrf{surprise-info-seq} for irreducible stationary Markov chains (\ie those
that have a unique stationary distribution). The derivation is greatly
simplified by the dependency structure of the Markov chain: for the purpose of
the analysis, the `past' and `future' segments $\past{X}_t$ and $\fut{X}_t$
can be collapsed to just the previous and next variables $X_{t-1}$ and
$X_{t+1}$ respectively. We also showed that the predictive information rate
can be expressed simply in terms of entropy rates: if we let $a$ denote the
$K\times K$ transition matrix of a Markov chain over an alphabet of
$\{1,\ldots,K\}$, such that $a_{ij} = \Pr(\ev(X_t=i|X_{t-1}=j))$, and let
$h:\reals^{K\times K}\to \reals$ be the entropy rate function such that $h(a)$
is the entropy rate of a Markov chain with transition matrix $a$, then the
predictive information rate $b(a)$ is
\begin{equation}
	b(a) = h(a^2) - h(a),
\end{equation}
where $a^2$, the transition matrix squared, is the transition matrix of the
`skip one' Markov chain obtained by jumping two steps at a time along the
original chain.

Second and higher order Markov chains can be treated in a similar way by
transforming to a first order representation of the high order Markov chain.
If we are dealing with an $N$th order model, this is done by forming a new
alphabet of size $K^N$ consisting of all possible $N$-tuples of symbols from
the base alphabet. An observation $\hat{x}_t$ in this new model encodes a
block of $N$ observations $(x_{t+1},\ldots,x_{t+N})$ from the base model. The
next observation $\hat{x}_{t+1}$ encodes the block of $N$ obtained by shifting
the previous block along by one step. The new Markov chain is parameterised by
a sparse $K^N\times K^N$ transition matrix $\hat{a}$. Adopting the label $\mu$
for the order $N$ system, we obtain:
\begin{equation}
	h_\mu = h(\hat{a}), \qquad
	b_\mu = h({\hat{a}^{N+1}}) - N h({\hat{a}}),
\end{equation}
where $\hat{a}^{N+1}$ is the $(N+1)$th power of the first order transition
matrix. Other information measures can also be computed for the high-order
Markov chain, including the multi-information rate $\rho_\mu$ and the excess
entropy $E$. These are identical for first order Markov chains, but for order
$N$ chains, $E$ can be up to $N$ times larger than $\rho_\mu$.
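Returning to the first-order case, the computation behind $b(a) = h(a^2) - h(a)$
can be sketched in a few lines of Python. This is an illustrative
reimplementation under the conventions above (a column-stochastic, irreducible
$a$), not the code used in \cite{AbdallahPlumbley2009}: the entropy rate
averages the entropy of each column of $a$ over the stationary distribution,
and the PIR follows by applying the same function to $a^2$.

\begin{lstlisting}[language=Python]
import numpy as np

def stationary(a):
    # Stationary distribution pi of a column-stochastic transition matrix a,
    # where a[i, j] = Pr(X_t = i | X_{t-1} = j), assuming the chain is irreducible.
    vals, vecs = np.linalg.eig(a)
    pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])  # eigenvector for eigenvalue 1
    return pi / pi.sum()

def entropy_rate(a, pi=None):
    # h(a): expected entropy (in bits) of the next symbol given the previous one.
    if pi is None:
        pi = stationary(a)
    plogp = a * np.log2(a, where=a > 0, out=np.zeros_like(a))
    return float(-np.sum(pi * plogp.sum(axis=0)))

def predictive_information_rate(a):
    # b(a) = h(a^2) - h(a); both entropy rates are taken under the stationary
    # distribution of the original chain.
    pi = stationary(a)
    return entropy_rate(a @ a, pi) - entropy_rate(a, pi)

a = np.array([[0.90, 0.05, 0.30],
              [0.05, 0.90, 0.30],
              [0.05, 0.05, 0.40]])
print(entropy_rate(a), predictive_information_rate(a))
\end{lstlisting}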
Different kinds of Markov chain maximise the different information measures:
$h_\mu$ is highest for uncorrelated `white' sequences with no temporal
structure, $\rho_\mu$ and $E$ are highest for periodic chains, while high
values of $b_\mu$ are found between these extremes. We return to this in
\secrf{composition}.

\section{Information Dynamics in Analysis}

\begin{fig}{twopages}
	\colfig[0.96]{matbase/fig9471} % update from mbc paper
%	\colfig[0.97]{matbase/fig72663}\\ % later update from mbc paper (Keith's new picks)
	\vspace*{1em}
	\colfig[0.97]{matbase/fig13377} % rule based analysis
	\caption{Analysis of \emph{Two Pages}.
	The thick vertical lines are the part boundaries as indicated in the score
	by the composer. The thin grey lines indicate changes in the melodic
	`figures' of which the piece is constructed. In the `model information
	rate' panel, the black asterisks mark the six most surprising moments
	selected by Keith Potter. The bottom panel shows a rule-based boundary
	strength analysis computed using Cambouropoulos' LBDM. All information
	measures are in nats and time is in notes.
	}
\end{fig}

\subsection{Musicological Analysis}
In \cite{AbdallahPlumbley2009}, methods based on the theory described above
were used to analyse two pieces of music in the minimalist style by Philip
Glass: \emph{Two Pages} (1969) and \emph{Gradus} (1968). The analysis was done
using a first-order Markov chain model, with the enhancement that the
transition matrix of the model was allowed to evolve dynamically as the notes
were processed, and was tracked (in a Bayesian way) as a \emph{distribution}
over possible transition matrices, rather than a point estimate. The results
are summarised in \figrf{twopages}: the upper four plots show the dynamically
evolving subjective information measures as described in
\secrf{surprise-info-seq} computed using a point estimate of the current
transition matrix, but the fifth plot (the `model information rate') measures
the information in each observation about the transition matrix. In
\cite{AbdallahPlumbley2010b}, we showed that this `model information rate' is
actually a component of the true IPI in a time-varying Markov chain, which was
neglected when we computed the IPI from point estimates of the transition
matrix as if the transition probabilities were constant.

The peaks of the surprisingness and both components of the predictive
information show good correspondence with the structure of the piece both as
marked in the score and as analysed by musicologist Keith Potter, who was
asked to mark the six `most surprising moments' of the piece (shown as
asterisks in the fifth plot)%
\footnote{%
Note that the boundary marked in the score at around note 5,400 is known to be
anomalous; on the basis of a listening analysis, some musicologists [ref] have
placed the boundary a few bars later, in agreement with our analysis.}.
In contrast, the analyses shown in the lower two plots of \figrf{twopages},
obtained using two rule-based music segmentation algorithms, clearly
\emph{reflect} the structure of the piece but do not \emph{segment} it: the
boundary strength functions show no tendency to peak at the boundaries of the
piece.

The complete analysis of \emph{Gradus} can be found in
\cite{AbdallahPlumbley2009}, but \figrf{metre} illustrates the result of a
metrical analysis: the piece was divided into bars of 32, 64 and 128 notes. In
each case, the average surprisingness and IPI for the first, second, third
\etc notes in each bar were computed.
The plots show that the first note of each bar is, on average, significantly
more surprising and informative than the others, up to the 64-note level,
whereas at the 128-note level, the dominant periodicity appears to remain at
64 notes.

\begin{fig}{metre}
%	\scalebox{1}[1]{%
	\begin{tabular}{cc}
		\colfig[0.45]{matbase/fig36859} & \colfig[0.48]{matbase/fig88658} \\
		\colfig[0.45]{matbase/fig48061} & \colfig[0.48]{matbase/fig46367} \\
		\colfig[0.45]{matbase/fig99042} & \colfig[0.47]{matbase/fig87490}
%		\colfig[0.46]{matbase/fig56807} & \colfig[0.48]{matbase/fig27144} \\
%		\colfig[0.46]{matbase/fig87574} & \colfig[0.48]{matbase/fig13651} \\
%		\colfig[0.44]{matbase/fig19913} & \colfig[0.46]{matbase/fig66144} \\
%		\colfig[0.48]{matbase/fig73098} & \colfig[0.48]{matbase/fig57141} \\
%		\colfig[0.48]{matbase/fig25703} & \colfig[0.48]{matbase/fig72080} \\
%		\colfig[0.48]{matbase/fig9142} & \colfig[0.48]{matbase/fig27751}
	\end{tabular}%
%	}
	\caption{Metrical analysis by computing average surprisingness and
	informativeness of notes at different periodicities (\ie hypothetical bar
	lengths) and phases (\ie positions within a bar).
	}
\end{fig}

\subsection{Content analysis/Sound Categorisation}
Using analogous definitions based on differential entropy, the methods
outlined in the previous section are equally applicable to continuous random
variables. In the case of music, where expressive properties such as dynamics,
tempo, timing and timbre are readily quantified on a continuous scale, the
information dynamic framework may thus also be applied.

In \cite{Dubnov2006}, Dubnov considers the class of stationary Gaussian
processes. For such processes, the entropy rate may be obtained analytically
from the power spectral density of the signal, allowing the multi-information
rate to be subsequently obtained. Local stationarity is assumed, which may be
achieved by windowing or change point detection \cite{Dubnov2008}.
%TODO mention non-gaussian processes extension
Similarly, the predictive information rate may be computed using a Gaussian
linear formulation CITE. In this view, the PIR is a function of the
correlation between random innovations supplied to the stochastic process.
%Dubnov, MacAdams, Reynolds (2006)
%Bailes and Dean (2009)
[ Continuous domain information ] [Audio based music expectation modelling] [ Gaussian processes]

\subsection{Beat Tracking}
A probabilistic method for drum tracking was presented by Robertson
\cite{Robertson11c}. The algorithm is used to synchronise a music sequencer to
a live drummer. The expected beat time of the sequencer is represented by a
click track, and the algorithm takes as input event times for discrete kick
and snare drum events relative to this click track. These are obtained using
dedicated microphones for each drum and a percussive onset detector
(Puckette 1998). The drum tracker continually updates distributions for tempo
and phase on receiving a new event time. We can thus quantify the information
contributed by an event by measuring the difference between the system's prior
distribution and the posterior distribution using the Kullback-Leibler
divergence. Here, we have calculated the KL divergence and entropy for kick
and snare events in sixteen files.

The analysis of information rates can be considered \emph{subjective}, in that
it measures how the drum tracker's probability distributions change, and these
are contingent upon the model used as well as external properties in the
signal.
We expect, however, that following periods of increased uncertainty, such as
fills or expressive timing, the information contained in an individual event
increases. We also examine whether the information is dependent upon metrical
position.

\section{Information dynamics as compositional aid}
\label{s:composition}

The use of stochastic processes in music composition has been widespread for
decades---for instance Iannis Xenakis applied probabilistic mathematical
models to the creation of musical materials \cite{Xenakis:1992ul}. While such
processes can drive the \emph{generative} phase of the creative process,
information dynamics can serve as a novel framework for a \emph{selective}
phase, by providing a set of criteria to be used in judging which of the
generated materials are of value. This alternation of generative and selective
phases has been noted by art theorist Margaret Boden \cite{Boden1990}.

Information-dynamic criteria can also be used as \emph{constraints} on the
generative processes, for example, by specifying a certain temporal profile of
surprisingness and uncertainty the composer wishes to induce in the listener
as the piece unfolds.
%stochastic and algorithmic processes: ; outputs can be filtered to match a set of
%criteria defined in terms of information-dynamical characteristics, such as
%predictability vs unpredictability
%s model, this criteria thus becoming a means of interfacing with the generative processes.
The tools of information dynamics provide a way to constrain and select
musical materials at the level of patterns of expectation, implication,
uncertainty, and predictability.

In particular, the behaviour of the predictive information rate (PIR) defined
in \secrf{process-info} makes it interesting from a compositional point of
view. The definition of the PIR is such that it is low both for extremely
regular processes, such as constant or periodic sequences, \emph{and} for
extremely random processes, where each symbol is chosen independently of the
others, in a kind of `white noise'. In the former case, the pattern, once
established, is completely predictable and therefore there is no \emph{new}
information in subsequent observations. In the latter case, the randomness and
independence of all elements of the sequence means that, though potentially
surprising, each observation carries no information about the ones to come.
Processes with high PIR maintain a certain kind of balance between
predictability and unpredictability in such a way that the observer must
continually pay attention to each new observation as it occurs in order to
make the best possible predictions about the evolution of the sequence.

This balance between predictability and unpredictability is reminiscent of the
inverted `U' shape of the Wundt curve (see \figrf{wundt}), which summarises
Wundt's observation that the greatest aesthetic value in art is to be found at
intermediate levels of disorder, where there is a balance between `order' and
`chaos'. Using the methods of \secrf{markov}, we found
\cite{AbdallahPlumbley2009} a similar shape when plotting entropy rate against
PIR---this is visible in the upper envelope of the scatter plot in
\figrf{mtriscat}, which is a 3-D scatter plot of three of the information
measures discussed in \secrf{process-info} for several thousand first-order
Markov chain transition matrices generated by a random sampling method.
The coordinates of the `information space' are entropy rate ($h_\mu$),
redundancy ($\rho_\mu$), and predictive information rate ($b_\mu$). The points
along the `redundancy' axis correspond to periodic Markov chains. Those along
the `entropy' axis produce uncorrelated sequences with no temporal structure.
Processes with high PIR are to be found at intermediate levels of entropy and
redundancy. These observations led us to construct the `Melody Triangle' as a
graphical interface for exploring the melodic patterns generated by each of
the Markov chains represented as points in \figrf{mtriscat}.

\begin{fig}{wundt}
	\raisebox{-4em}{\colfig[0.43]{wundt}}
%	{\ \shortstack{{\Large$\longrightarrow$}\\ {\scriptsize\emph{exposure}}}\ }
	{\ {\large$\longrightarrow$}\ }
	\raisebox{-4em}{\colfig[0.43]{wundt2}}
	\caption{
	The Wundt curve relating randomness/complexity with perceived value.
	Repeated exposure sometimes results in a move to the left along the curve
	\cite{Berlyne71}.
	}
\end{fig}

%It is possible to apply information dynamics to the generation of content, such as to the composition of musical materials.
%For instance a stochastic music generating process could be controlled by modifying
%constraints on its output in terms of predictive information rate or entropy
%rate.

\subsection{The Melody Triangle}
The Melody Triangle is an exploratory interface for the discovery of melodic
content, where the input---positions within a triangle---directly maps to
information-theoretic measures of the output. The measures---entropy rate,
redundancy and predictive information rate---form criteria with which to
filter the output of the stochastic processes used to generate sequences of
notes. These measures address notions of expectation and surprise in music,
and as such the Melody Triangle is a means of interfacing with a generative
process in terms of the predictability of its output. The triangle is
`populated' with first order Markov chain transition matrices as illustrated
in \figrf{mtriscat}.

\begin{fig}{mtriscat}
	\colfig{mtriscat}
	\caption{The population of transition matrices distributed along three
	axes of redundancy, entropy rate and predictive information rate (all
	measured in bits). The concentrations of points along the redundancy axis
	correspond to Markov chains which are roughly periodic with periods of 2
	(redundancy 1 bit), 3, 4, \etc all the way to period 8 (redundancy 3
	bits). The colour of each point represents its PIR---note that the highest
	values are found at intermediate entropy and redundancy, and that the
	distribution as a whole makes a curved triangle. Although not visible in
	this plot, it is largely hollow in the middle.}
\end{fig}

The distribution of transition matrices plotted in this space forms an arch
shape that is fairly thin. It thus becomes a reasonable approximation to
pretend that it is just a sheet in two dimensions, and so we stretch out this
curved arc into a flat triangle. It is this triangular sheet that is our
`Melody Triangle' and forms the interface by which the system is controlled.
Using this interface thus involves a mapping to information space: a user
selects a position within the triangle, and a corresponding transition matrix
is returned. \Figrf{TheTriangle} shows how the triangle maps to different
measures of redundancy, entropy rate and predictive information rate. Each
corner corresponds to one of three extremes of predictability and
unpredictability, which could be loosely characterised as `periodicity',
`noise' and `repetition'.
Melodies from the `noise' corner have no discernible pattern; they have high
entropy rate, low predictive information rate and low redundancy. These
melodies are essentially totally random. Melodies along the `periodicity' to
`repetition' edge are all deterministic loops that get shorter as we approach
the `repetition' corner, until each becomes just one repeating note. It is the
areas in between the extremes that provide the more `interesting' melodies.
These melodies have some level of unpredictability but are not completely
random; or, conversely, they are predictable, but not entirely so.

\begin{fig}{TheTriangle}
	\colfig[0.9]{TheTriangle.pdf}
	\caption{The Melody Triangle}
\end{fig}

%PERHAPS WE SHOULD FOREGO TALKING ABOUT THE
%INSTALLATION VERSION OF THE TRIANGLE?
%feels a bit like a tangent, and could do with the space..
The Melody Triangle exists in two incarnations: a standard screen-based
interface where a user moves tokens in and around a triangle on screen, and a
multi-user interactive installation where a Kinect camera tracks individuals
in a space and maps their positions in physical space to the triangle. In the
latter, each visitor that enters the installation generates a melody and can
collaborate with their co-visitors to generate musical textures---a playful
yet informative way to explore expectation and surprise in music.
Additionally, visitors can change the tempo, register, instrumentation and
periodicity of their melody with body gestures.

As a screen-based interface, the Melody Triangle can serve as a composition
tool. A triangle is drawn on the screen, with screen space thus mapped to the
statistical space of the Melody Triangle. A number of tokens, each
representing a melody, can be dragged in and around the triangle. For each
token, a sequence of symbols with statistical properties that correspond to
the token's position is generated. These symbols are then mapped to notes of a
scale%
\footnote{However, they could just as well be mapped to any other property,
such as intervals, chords, dynamics and timbres. It is even possible to map
the symbols to non-sonic outputs, such as colours. The possibilities afforded
by the Melody Triangle in these other domains remain to be investigated.}.
Additionally, keyboard commands give control over other musical parameters.

The Melody Triangle can generate intricate musical textures when multiple
tokens are in the triangle. Unlike other computer-aided composition tools or
programming environments, here the composer engages with music at a high level
of abstraction, with the interface relating to subjective expectation and
predictability.

\subsection{Information Dynamics as Evaluative Feedback Mechanism}
%NOT SURE THIS SHOULD BE HERE AT ALL..?

\begin{fig}{mtri-results}
	\def\scat#1{\colfig[0.42]{mtri/#1}}
	\def\subj#1{\scat{scat_dwells_subj_#1} & \scat{scat_marks_subj_#1}}
	\begin{tabular}{cc}
		\subj{a} \\
		\subj{b} \\
		\subj{c} \\
		\subj{d}
	\end{tabular}
	\caption{Dwell times and mark positions from user trials with the
	on-screen Melody Triangle interface. The left-hand column shows the
	positions in a 2D information space (entropy rate vs multi-information
	rate in bits) where subjects spent their time; the area of each circle is
	proportional to the time spent there. The right-hand column shows the
	points at which subjects pressed a key to indicate that they `liked' what
	they were hearing.}
\end{fig}

Information measures on a stream of symbols can form a feedback mechanism; a
rudimentary `critic' of sorts.
For instance, symbol-by-symbol measures of predictive information rate,
entropy rate and redundancy could tell us if a stream of symbols is currently
`boring', either because it is too repetitive, or because it is too chaotic.
Such feedback would be oblivious to long-term and large-scale structures and
to any cultural norms (such as style conventions), but nonetheless could
provide a composer with valuable insight into the short-term properties of a
work. This could not only be used for the evaluation of pre-composed streams
of symbols, but could also provide real-time feedback in an improvisatory
setup.

\section{Musical Preference and Information Dynamics}
We are carrying out a study to investigate the relationship between musical
preference and the information dynamics models, the experimental interface
being a simplified version of the screen-based Melody Triangle. Participants
are asked to use this music pattern generator under various experimental
conditions in a composition task. The data collected includes usage statistics
of the system: where in the triangle they place the tokens, how long they
leave them there, and the state of the system when users, by pressing a key,
indicate that they like what they are hearing. As such, the experiments will
help us identify any correlation between the information-theoretic properties
of a stream and its perceived aesthetic worth.

Some initial results for four subjects are shown in \figrf{mtri-results}.
Though subjects seem to exhibit distinct kinds of exploratory behaviour, we
have not been able to show any systematic across-subjects preference for any
particular region of the triangle.

Subjects' comments: several noticed the main organisation of the triangle:
repetitive notes at the top, cyclic patterns along the right edge, and
unpredictable notes towards the bottom left (a,c,f). Some did systematic
exploration. Some felt that the right side was more `controllable' than the
left (a,f)---a direct consequence of their ability to return to a particular
periodic pattern and recognise it as one heard previously. Some (a,e) felt the
trial was too long and became bored towards the end. One subject (f) felt
there wasn't enough time to hear out the patterns properly. One subject (b)
didn't enjoy the lower region, whereas another (d) said the lower regions were
more `melodic' and `interesting'.

%\emph{comparable system} Gordon Pask's Musicolor (1953) applied a similar notion
%of boredom in its design. The Musicolour would react to audio input through a
%microphone by flashing coloured lights. Rather than a direct mapping of sound
%to light, Pask designed the device to be a partner to a performing musician. It
%would adapt its lighting pattern based on the rhythms and frequencies it would
%hear, quickly `learning' to flash in time with the music. However Pask endowed
%the device with the ability to `be bored'; if the rhythmic and frequency content
%of the input remained the same for too long it would listen for other rhythms
%and frequencies, only lighting when it heard these. As the Musicolour would
%`get bored', the musician would have to change and vary their playing, eliciting
%new and unexpected outputs in trying to keep the Musicolour interested.

\section{Conclusion}
We outlined our information dynamics approach to the modelling of the
perception of music. This approach models the subjective assessments of an
observer that updates its probabilistic model of a process dynamically as
events unfold.
We outlined `time-varying' information measures, including a novel `predictive
information rate' that characterises the surprisingness and predictability of
musical patterns. We then showed how information dynamics can serve in three
different forms of analysis: musicological analysis, sound categorisation and
beat tracking. We have described the `Melody Triangle', a novel system that
enables a user/composer to discover musical content in terms of the
information-theoretic properties of the output, and considered how information
dynamics could be used to provide evaluative feedback on a composition or
improvisation. Finally, we outlined a pilot study that used the Melody
Triangle as an experimental interface to help determine if there are any
correlations between aesthetic preference and information dynamics measures.

\section{Acknowledgments}
This work is supported by EPSRC Doctoral Training Centre EP/G03723X/1 (HE),
GR/S82213/01 and EP/E045235/1 (SA), an EPSRC DTA Studentship (PF), an
RAEng/EPSRC Research Fellowship 10216/88 (AR), an EPSRC Leadership Fellowship,
EP/G007144/1 (MDP) and EPSRC IDyOM2 EP/H013059/1. This work is partly funded
by the CoSound project, funded by the Danish Agency for Science, Technology
and Innovation.

\bibliographystyle{abbrv}
{\bibliography{all,c4dm,nime,andrew}}

\end{document}