changeset 61:8d0763474065

Finished up to and including Sec. IIIA.
author samer
date Fri, 16 Mar 2012 22:50:48 +0000
parents df1e90609de3
children 2cd533f149b7
files draft.pdf draft.tex
diffstat 2 files changed, 88 insertions(+), 85 deletions(-)
Binary file draft.pdf has changed
--- a/draft.tex	Fri Mar 16 21:11:31 2012 +0000
+++ b/draft.tex	Fri Mar 16 22:50:48 2012 +0000
@@ -65,12 +65,17 @@
 
 \maketitle
 \begin{abstract}
-	People take in information when perceiving music.  With it they continually
-	build predictive models of what is going to happen.  There is a relationship
-	between information measures and how we perceive music.  An information
-	theoretic approach to music cognition is thus a fruitful avenue of research.
-	In this paper, we review the theoretical foundations of information dynamics
-	and discuss a few emerging areas of application.
+	We describe an information-theoretic approach to the analysis
+	of music and other sequential data, which emphasises the predictive aspects
+	of perception and the dynamic process
+	of forming and modifying expectations about an unfolding stream of data,
+	and which characterises these using the tools of information theory: entropies,
+	mutual informations, and related quantities.
+	After reviewing the theoretical foundations, 
+%	we present a new result on predictive information rates in high-order Markov chains, and 
+	we discuss a few emerging areas of application, including
+	musicological analysis, real-time beat-tracking analysis, and the generation
+	of musical materials as a cognitively informed compositional aid.
 \end{abstract}
 
 
@@ -86,17 +91,16 @@
 	entropy, relative entropy, and mutual information.
 
 	Music is also an inherently dynamic process, 
-	where listeners build up expectations on what is to happen next,
-	which are either satisfied or modified as the music unfolds.
+	where listeners build up expectations about what is to happen next,
+	which may be fulfilled
+	immediately or after some delay, or modified as the music unfolds.
 	In this paper, we explore this ``Information Dynamics'' view of music,
-	discussing the theory behind it and some emerging appliations
+	discussing the theory behind it and some emerging applications.
 
 	\subsection{Expectation and surprise in music}
-	One of the effects of listening to music is to create 
-	expectations of what is to come next, which may be fulfilled
-	immediately, after some delay, or not at all as the case may be.
-	This is the thesis put forward by, amongst others, music theorists 
-	L. B. Meyer \cite{Meyer67} and Narmour \citep{Narmour77}, but was
+	The thesis that musical experience is strongly shaped by the generation
+	and playing out of expectations, both strong and weak, was put forward by, amongst others, 
+	music theorists L. B. Meyer \cite{Meyer67} and Narmour \citep{Narmour77}, but was
 	recognised much earlier; for example, 
 	it was elegantly put by Hanslick \cite{Hanslick1854} in the
 	nineteenth century:
@@ -112,8 +116,8 @@
 			%takes place unconsciously, and with the rapidity of lightning-flashes.'
 	\end{quote}
 	An essential aspect of this is that music is experienced as a phenomenon
-	that `unfolds' in time, rather than being apprehended as a static object
-	presented in its entirety. Meyer argued that musical experience depends
+	that unfolds in time, rather than being apprehended as a static object
+	presented in its entirety. Meyer argued that the experience depends
 	on how we change and revise our conceptions \emph{as events happen}, on
 	how expectation and prediction interact with occurrence, and that, to a
 	large degree, the way to understand the effect of music is to focus on
@@ -131,27 +135,6 @@
 	analysis of music, \eg
 	\cite{ConklinWitten95,PonsfordWigginsMellish1999,Pearce2005}.
 
-
-\comment{
-	The business of making predictions and assessing surprise is essentially
-	one of reasoning under conditions of uncertainty and manipulating
-	degrees of belief about the various proposition which may or may not
-	hold, and, as has been argued elsewhere \cite{Cox1946,Jaynes27}, best
-	quantified in terms of Bayesian probability theory.
-   Thus, we suppose that
-	when we listen to music, expectations are created on the basis of our
-	familiarity with various stylistic norms that apply to music in general,
-	the particular style (or styles) of music that seem best to fit the piece 
-	we are listening to, and
-	the emerging structures peculiar to the current piece.  There is
-	experimental evidence that human listeners are able to internalise
-	statistical knowledge about musical structure, \eg
-	\citep{SaffranJohnsonAslin1999,EerolaToiviainenKrumhansl2002}, and also
-	that statistical models can form an effective basis for computational
-	analysis of music, \eg
-	\cite{ConklinWitten95,PonsfordWigginsMellish1999,Pearce2005}.
-}
-
 %	\subsection{Music and information theory}
 	With a probabilistic framework for music modelling and prediction in hand,
 	we are in a position to compute various
@@ -187,11 +170,11 @@
 
 
 \subsection{Information dynamic approach}
-	Bringing the various strands together, our working hypothesis is that as a
+	Our working hypothesis is that, as a
 	listener (to which we will refer as `it') listens to a piece of music, it maintains
 	a dynamically evolving probabilistic model that enables it to make predictions
 	about how the piece will continue, relying on both its previous experience
-	of music and the immediate context of the piece.  As events unfold, it revises
+	of music and the emerging themes of the piece.  As events unfold, it revises
 	its probabilistic belief state, which includes predictive
 	distributions over possible future events.  These 
 %	distributions and changes in distributions 
@@ -203,11 +186,11 @@
 	One of the consequences of this approach is that regardless of the details of
 	the sensory input or even which sensory modality is being processed, the resulting 
 	analysis is in terms of the same units: quantities of information (bits) and
-	rates of information flow (bits per second). The probabilistic and information
+	rates of information flow (bits per second). The information
 	theoretic concepts in terms of which the analysis is framed are universal to all sorts 
 	of data.
 	In addition, when adaptive probabilistic models are used, expectations are
-	created mainly in response to to \emph{patterns} of occurence, 
+	created mainly in response to \emph{patterns} of occurrence, 
 	rather than to the details of which specific things occur.
 	Together, these suggest that an information dynamic analysis captures a
 	high level of \emph{abstraction}, and could be used to 
@@ -249,11 +232,12 @@
 	The negative-log-probability
 	$\ell(x) = -\log p(x)$ of a particular value $x$ can usefully be thought of as
 	the \emph{surprisingness} of the value $x$ should it be observed, and
-	hence the entropy is the expectation of the surprisingness $\expect \ell(X)$.
+	hence the entropy is the expectation of the surprisingness, $\expect \ell(X)$.
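+%	Aside, not for the paper: the quantities above, and the KL divergence defined
+%	just below, are easy to sanity-check numerically. A minimal Python sketch,
+%	assuming discrete distributions held as numpy arrays (the example numbers are
+%	illustrative only):
+%
+%	  import numpy as np
+%
+%	  def entropy(p):
+%	      # entropy = expected surprisingness, E[-log2 p(X)], in bits (0 log 0 := 0)
+%	      p = np.asarray(p, dtype=float)
+%	      p = p[p > 0]
+%	      return float(-np.sum(p * np.log2(p)))
+%
+%	  def kl_divergence(posterior, prior):
+%	      # information in the data about X: KL(posterior || prior), in bits
+%	      m = posterior > 0
+%	      return float(np.sum(posterior[m] * np.log2(posterior[m] / prior[m])))
+%
+%	  prior     = np.array([0.25, 0.25, 0.25, 0.25])
+%	  posterior = np.array([0.5, 0.5, 0.0, 0.0])
+%	  print(-np.log2(prior[0]))               # surprisingness of any one outcome: 2 bits
+%	  print(entropy(prior))                   # 2.0 bits
+%	  print(kl_divergence(posterior, prior))  # 1.0 bit
+%
+%	(Ruling out two of four equally likely outcomes is worth exactly one bit.)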
 
 	Now suppose that the observer receives some new data $\Data$ that
 	causes a revision of its beliefs about $X$. The \emph{information}
 	in this new data \emph{about} $X$ can be quantified as the 
+	relative entropy or
 	Kullback-Leibler (KL) divergence between the prior and posterior
 	distributions $p(x)$ and $p(x|\Data)$ respectively:
 	\begin{equation}
@@ -282,7 +266,7 @@
 		I(X_1;X_2|X_3) = H(X_1|X_3) - H(X_1|X_2,X_3).
 	\end{equation}
 	These relationships between the various entropies and mutual
-	informations are conveniently visualised in Venn diagram-like \emph{information diagrams}
+	informations are conveniently visualised in \emph{information diagrams}
 	or I-diagrams \cite{Yeung1991} such as the one in \figrf{venn-example}.
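+%	Aside, not for the paper: the identity above can be checked numerically by
+%	writing $I(X_1;X_2|X_3)$ in terms of joint entropies. A Python sketch, assuming
+%	the joint distribution is supplied as a table P[x1, x2, x3]; the XOR example is
+%	illustrative only:
+%
+%	  import numpy as np
+%
+%	  def H(p):
+%	      # joint entropy in bits of an array of probabilities
+%	      p = p[p > 0]
+%	      return float(-np.sum(p * np.log2(p)))
+%
+%	  def cond_mutual_info(P):
+%	      # I(X1;X2|X3) = H(X1|X3) - H(X1|X2,X3)
+%	      #             = H(X1,X3) + H(X2,X3) - H(X3) - H(X1,X2,X3)
+%	      return (H(P.sum(axis=1)) + H(P.sum(axis=0))
+%	              - H(P.sum(axis=(0, 1))) - H(P))
+%
+%	  # X1, X3 independent fair coins, X2 = X1 xor X3
+%	  P = np.zeros((2, 2, 2))
+%	  for x1 in (0, 1):
+%	      for x3 in (0, 1):
+%	          P[x1, x1 ^ x3, x3] = 0.25
+%	  print(cond_mutual_info(P))   # 1.0 bit, even though I(X1;X2) = 0
+%
+%	(The example also shows that conditioning can increase mutual information.)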
 
 	\begin{fig}{venn-example}
@@ -355,7 +339,7 @@
 			}
 		\end{tabular}
 		\caption{
-		I-diagram visualisation of entropies and mutual informations
+		I-diagram of entropies and mutual informations
 		for three random variables $X_1$, $X_2$ and $X_3$. The areas of 
 		the three circles represent $H(X_1)$, $H(X_2)$ and $H(X_3)$ respectively.
 		The total shaded area is the joint entropy $H(X_1,X_2,X_3)$.
@@ -371,10 +355,10 @@
 	Suppose that  $(\ldots,X_{-1},X_0,X_1,\ldots)$ is a sequence of
 	random variables, infinite in both directions, 
 	and that $\mu$ is the associated probability measure over all 
-	realisations of the sequence---in the following, $\mu$ will simply serve
+	realisations of the sequence. In the following, $\mu$ will simply serve
 	as a label for the process. We can identify a number of information-theoretic
 	measures meaningful in the context of a sequential observation of the sequence, during
-	which, at any time $t$, the sequence of variables can be divided into a `present' $X_t$, a `past' 
+	which, at any time $t$, the sequence can be divided into a `present' $X_t$, a `past' 
 	$\past{X}_t \equiv (\ldots, X_{t-2}, X_{t-1})$, and a `future' 
 	$\fut{X}_t \equiv (X_{t+1},X_{t+2},\ldots)$.
 	We will write the actually observed value of $X_t$ as $x_t$, and
@@ -388,18 +372,18 @@
 	\begin{equation}
 		\ell_t = - \log p(x_t|\past{x}_t).
 	\end{equation}
-	However, before $X_t$ is observed to be $x_t$, the observer can compute
+	However, before $X_t$ is observed, the observer can compute
 	the \emph{expected} surprisingness as a measure of its uncertainty about
-	the very next event; this may be written as an entropy 
+	$X_t$; this may be written as an entropy 
 	$H(X_t|\ev(\past{X}_t = \past{x}_t))$, but note that this is
-	conditional on the \emph{event} $\ev(\past{X}_t=\past{x}_t)$, not
+	conditional on the \emph{event} $\ev(\past{X}_t=\past{x}_t)$, not the
 	\emph{variables} $\past{X}_t$ as in the conventional conditional entropy.
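+%	Aside, not for the paper: a minimal sketch of these two running measures.
+%	The observer model here is, purely for illustration, a fixed first-order
+%	Markov chain with column-stochastic transition matrix a[i, j] = Pr(X_t = i | X_{t-1} = j)
+%	and initial distribution p0; the framework itself makes no such assumption.
+%
+%	  import numpy as np
+%
+%	  def entropy(p):
+%	      p = p[p > 0]
+%	      return float(-np.sum(p * np.log2(p)))
+%
+%	  def subjective_measures(seq, a, p0):
+%	      # surprisingness l_t = -log2 p(x_t | past) and the expected surprisingness
+%	      # (entropy of the predictive distribution) just before each observation
+%	      surprise, expected = [], []
+%	      pred = p0
+%	      for x in seq:
+%	          expected.append(entropy(pred))      # uncertainty about X_t before seeing x_t
+%	          surprise.append(-np.log2(pred[x]))  # surprisingness of the observed x_t
+%	          pred = a[:, x]                      # predictive distribution for the next step
+%	      return np.array(surprise), np.array(expected)
+%
+%	  a = np.array([[0.9, 0.2],
+%	                [0.1, 0.8]])
+%	  l, h = subjective_measures([0, 0, 1, 1, 0], a, p0=np.array([0.5, 0.5]))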
 
 	The surprisingness $\ell_t$ and expected surprisingness
 	$H(X_t|\ev(\past{X}_t=\past{x}_t))$
 	can be understood as \emph{subjective} information dynamic measures, since they are 
 	based on the observer's probability model in the context of the actually observed sequence
-	$\past{x}_t$---they characterise what it is like to `be in the observer's shoes'.
+	$\past{x}_t$. They characterise what it is like to be `in the observer's shoes'.
 	If we view the observer as a purely passive or reactive agent, this would
 	probably be sufficient, but for active agents such as humans or animals, it is
 	often necessary to \emph{anticipate} future events in order, for example, to plan the
@@ -528,8 +512,8 @@
 		I-diagrams for several information measures in
 		stationary random processes. Each circle or oval represents a random
 		variable or sequence of random variables relative to time $t=0$. Overlapped areas
-		correspond to various mutual information as in \Figrf{venn-example}.
-		In (b), the circle represents the `present'. Its total area is
+		correspond to various mutual informations.
+		In (a) and (c), the circle represents the `present'. Its total area is
 		$H(X_0)=\rho_\mu+r_\mu+b_\mu$, where $\rho_\mu$ is the multi-information
 		rate, $r_\mu$ is the residual entropy rate, and $b_\mu$ is the predictive
 		information rate. The entropy rate is $h_\mu = r_\mu+b_\mu$. The small dark
@@ -547,8 +531,8 @@
 	investigated. (In the following, the assumption of stationarity means that
 	the measures defined below are independent of $t$.)
 
-	The \emph{entropy rate} of the process is the entropy of the next variable
-	$X_t$ given all the previous ones.
+	The \emph{entropy rate} of the process is the entropy of the `present'
+	$X_t$ given the `past':
 	\begin{equation}
 		\label{eq:entro-rate}
 		h_\mu = H(X_t|\past{X}_t).
@@ -556,8 +540,8 @@
 	The entropy rate is a measure of the overall surprisingness
 	or unpredictability of the process, and gives an indication of the average
 	level of surprise and uncertainty that would be experienced by an observer
-	processing a sequence sampled from the process using the methods of
-	\secrf{surprise-info-seq}.
+	computing the measures of \secrf{surprise-info-seq} on a sequence sampled 
+	from the process.
 
 	The \emph{multi-information rate} $\rho_\mu$ (following Dubnov's \cite{Dubnov2006}
 	notation for what he called the `information rate') is the mutual
@@ -566,9 +550,8 @@
 		\label{eq:multi-info}
 			\rho_\mu = I(\past{X}_t;X_t) = H(X_t) - h_\mu.
 	\end{equation}
-	It is a measure of how much the context of an observation (that is,
-	the observation of previous elements of the sequence) helps in predicting
-	or reducing the suprisingness of the current observation.
+	It is a measure of how much the preceding context of an observation 
+	helps in predicting or reducing the surprisingness of the current observation.
 
 	The \emph{excess entropy} \cite{CrutchfieldPackard1983} 
 	is the mutual information between 
@@ -582,13 +565,13 @@
 	
 
 	The \emph{predictive information rate} (or PIR) \cite{AbdallahPlumbley2009}  
-	is the mutual information between the present and the infinite future given the infinite
-	past:
+	is the mutual information between the `present' and the `future' given the 
+	`past':
 	\begin{equation}
 		\label{eq:PIR}
-		b_\mu = I(X_t;\fut{X}_t|\past{X}_t) = H(\fut{X}_t|\past{X}_t) - H(\fut{X}_t|X_t,\past{X}_t).
+		b_\mu = I(X_t;\fut{X}_t|\past{X}_t) = H(\fut{X}_t|\past{X}_t) - H(\fut{X}_t|X_t,\past{X}_t),
 	\end{equation}
-	Equation \eqrf{PIR} can be read as the average reduction
+	which can be read as the average reduction
 	in uncertainty about the future on learning $X_t$, given the past. 
 	Due to the symmetry of the mutual information, it can also be written
 	as 
@@ -612,8 +595,8 @@
 	In particular, they identify $\sigma_\mu = I(\past{X}_t;\fut{X}_t|X_t)$, 
 	the mutual information between the past and the future given the present,
 	as an interesting quantity that measures the predictive benefit of
-	model-building (that is, maintaining an internal state summarising past 
-	observations in order to make better predictions). It is shown as the
+	model-building, that is, maintaining an internal state summarising past 
+	observations in order to make better predictions. It is shown as the
 	small dark region below the circle in \figrf{predinfo-bg}(c).
 	By comparing with \figrf{predinfo-bg}(b), we can see that 
 	$\sigma_\mu = E - \rho_\mu$.
@@ -627,19 +610,21 @@
 	First order Markov chains are the simplest non-trivial models to which information 
 	dynamics methods can be applied. In \cite{AbdallahPlumbley2009} we derived
 	expressions for all the information measures described in \secrf{surprise-info-seq} for
-	irreducible stationary Markov chains (\ie that have a unique stationary
-	distribution). The derivation is greatly simplified by the dependency structure
-	of the Markov chain: for the purpose of the analysis, the `past' and `future'
-	segments $\past{X}_t$ and $\fut{X}_t$ can be collapsed to just the previous
-	and next variables $X_{t-1}$ and $X_{t+1}$ respectively. We also showed that 
+	ergodic Markov chains (\ie those with a unique stationary
+	distribution). 
+%	The derivation is greatly simplified by the dependency structure
+%	of the Markov chain: for the purpose of the analysis, the `past' and `future'
+%	segments $\past{X}_t$ and $\fut{X}_t$ can be collapsed to just the previous
+%	and next variables $X_{t-1}$ and $X_{t+1}$ respectively. 
+	We also showed that 
 	the predictive information rate can be expressed simply in terms of entropy rates:
 	if we let $a$ denote the $K\times K$ transition matrix of a Markov chain over
 	an alphabet of $\{1,\ldots,K\}$, such that
-	$a_{ij} = \Pr(\ev(X_t=i|X_{t-1}=j))$, and let $h:\reals^{K\times K}\to \reals$ be
+	$a_{ij} = \Pr(\ev(X_t=i)|\ev(X_{t-1}=j))$, and let $h:\reals^{K\times K}\to \reals$ be
 	the entropy rate function such that $h(a)$ is the entropy rate of a Markov chain
-	with transition matrix $a$, then the predictive information rate $b(a)$ is
+	with transition matrix $a$, then the predictive information rate is
 	\begin{equation}
-		b(a) = h(a^2) - h(a),
+		b_\mu = h(a^2) - h(a),
 	\end{equation}
 	where $a^2$, the transition matrix squared, is the transition matrix
 	of the `skip one' Markov chain obtained by jumping two steps at a time
@@ -664,9 +649,19 @@
 	for first order Markov chains, but for order $N$ chains, $E$ can be up to $N$ times larger 
 	than $\rho_\mu$.
 
-	[Something about what kinds of Markov chain maximise $h_\mu$ (uncorrelated `white'
-	sequences, no temporal structure), $\rho_\mu$ and $E$ (periodic) and $b_\mu$. We return
-	this in \secrf{composition}.]
+	In our early experiments with visualising and sonifying sequences sampled from
+	first order Markov chains \cite{AbdallahPlumbley2009}, we found that
+	the measures $h_\mu$, $\rho_\mu$ and $b_\mu$ are related to perceptible
+	characteristics, and that the kinds of transition matrices maximising or minimising
+	each of these quantities are quite distinct. High entropy rates are associated
+	with completely uncorrelated sequences with no recognisable temporal structure,
+	along with low $\rho_\mu$ and $b_\mu$. 
+	High values of $\rho_\mu$ are associated with long periodic cycles, low $h_\mu$
+	and low $b_\mu$. High values of $b_\mu$ are associated with intermediate values
+	of $\rho_\mu$ and $h_\mu$, and recognisable, but not completely predictable,
+	temporal structures. These relationships are visible in \figrf{mtriscat} in
+	\secrf{composition}, where we pick up the thread with an application of 
+	information dynamics as a compositional aid.
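+%	Aside, not for the paper: the measures in this section can be reproduced in a
+%	few lines for a given transition matrix. A Python sketch, using the
+%	column-stochastic convention a[i, j] = Pr(X_t = i | X_{t-1} = j); the example
+%	matrix is arbitrary and the quoted values are approximate:
+%
+%	  import numpy as np
+%
+%	  def entropy(p):
+%	      p = p[p > 0]
+%	      return float(-np.sum(p * np.log2(p)))
+%
+%	  def stationary(a):
+%	      # unit-eigenvalue eigenvector of the column-stochastic matrix a
+%	      w, v = np.linalg.eig(a)
+%	      pi = np.real(v[:, np.argmin(np.abs(w - 1))])
+%	      return pi / pi.sum()
+%
+%	  def entropy_rate(a, pi):
+%	      # h(a) = sum_j pi_j * H(a[:, j])
+%	      return float(sum(pi[j] * entropy(a[:, j]) for j in range(a.shape[1])))
+%
+%	  a = np.array([[0.8, 0.1, 0.1],
+%	                [0.1, 0.8, 0.1],
+%	                [0.1, 0.1, 0.8]])
+%	  pi  = stationary(a)
+%	  h   = entropy_rate(a, pi)            # entropy rate h_mu          (~0.92 bits)
+%	  rho = entropy(pi) - h                # multi-information rate     (~0.66 bits)
+%	  b   = entropy_rate(a @ a, pi) - h    # PIR  b_mu = h(a^2) - h(a)  (~0.34 bits)
+%
+%	(The squared matrix a @ a is the two-step `skip one' chain, so b_mu = h(a^2) - h(a)
+%	follows directly from the expression above; a^2 has the same stationary distribution as a.)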
 		
 
 \section{Information Dynamics in Analysis}
@@ -698,15 +693,15 @@
 	enhancement that the transition matrix of the model was allowed to
 	evolve dynamically as the notes were processed, and was tracked (in
 	a Bayesian way) as a \emph{distribution} over possible transition matrices,
-	rather than a point estimate. The results are summarised in \figrf{twopages}:
+	rather than a point estimate. Some results are summarised in \figrf{twopages}:
 	the  upper four plots show the dynamically evolving subjective information
 	measures as described in \secrf{surprise-info-seq} computed using a point
-	estimate of the current transition matrix, but the fifth plot (the `model information rate')
+	estimate of the current transition matrix; the fifth plot (the `model information rate')
 	measures the information in each observation about the transition matrix.
 	In \cite{AbdallahPlumbley2010b}, we showed that this `model information rate'
-	is actually a component of the true IPI in
-	a time-varying Markov chain, which was neglected when we computed the IPI from
-	point estimates of the transition matrix as if the transition probabilities
+	is actually a component of the true IPI when the transition
+	matrix is being learned online, one which was neglected when we computed the IPI from
+	the transition matrix as if the transition probabilities
 	were constant.
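+%	Aside, not for the paper: one simple way to realise a `model information rate'
+%	of this kind, not necessarily the exact construction of \cite{AbdallahPlumbley2010b},
+%	is to track each column of the transition matrix as a Dirichlet posterior over
+%	transition counts and measure the divergence from the belief before each
+%	observation to the belief after it. Python sketch (requires scipy); the prior
+%	strength and example sequence are illustrative:
+%
+%	  import numpy as np
+%	  from scipy.special import gammaln, digamma
+%
+%	  def dirichlet_kl(beta, alpha):
+%	      # KL( Dir(beta) || Dir(alpha) ), in nats
+%	      b0, a0 = beta.sum(), alpha.sum()
+%	      return (gammaln(b0) - gammaln(a0)
+%	              - np.sum(gammaln(beta) - gammaln(alpha))
+%	              + np.sum((beta - alpha) * (digamma(beta) - digamma(b0))))
+%
+%	  def model_information(seq, K, prior=1.0):
+%	      # information carried by each observation about the transition matrix
+%	      counts = np.full((K, K), prior)   # counts[i, j]: pseudo-counts for j -> i
+%	      info = []
+%	      for prev, x in zip(seq[:-1], seq[1:]):
+%	          before = counts[:, prev].copy()
+%	          counts[x, prev] += 1.0
+%	          info.append(dirichlet_kl(counts[:, prev], before) / np.log(2))  # bits
+%	      return np.array(info)
+%
+%	  info = model_information([0, 1, 1, 0, 1, 1, 1, 0], K=2)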
 
 	The peaks of the surprisingness and both components of the predictive information 
@@ -715,8 +710,9 @@
 	`most surprising moments' of the piece (shown as asterisks in the fifth plot)%
 	\footnote{%
 	Note that the boundary marked in the score at around note 5,400 is known to be
-	anomalous; on the basis of a listening analysis, some musicologists [ref] have
-	placed the boundary a few bars later, in agreement with our analysis.}.
+	anomalous; on the basis of a listening analysis, some musicologists have
+	placed the boundary a few bars later, in agreement with our analysis
+	\cite{PotterEtAl2007}.}.
 
 	In contrast, the analyses shown in the lower two plots of \figrf{twopages},
 	obtained using two rule-based music segmentation algorithms, while clearly
@@ -763,14 +759,17 @@
 	 In \cite{Dubnov2006}, Dubnov considers the class of stationary Gaussian
 	 processes. For such processes, the entropy rate may be obtained analytically
 	 from the power spectral density of the signal, allowing the multi-information
-	 rate to be subsequently obtained. Local stationarity is assumed, which may
-	 be achieved by windowing or change point detection \cite{Dubnov2008}. %TODO
+	 rate to be subsequently obtained. 
+%	 Local stationarity is assumed, which may be achieved by windowing or 
+%	 change point detection \cite{Dubnov2008}. 
+	 %TODO: mention non-Gaussian process extensions
 	 Similarly, the predictive information
 	 rate may be computed using a Gaussian linear formulation CITE. In this view,
 	 the PIR is a function of the correlation between random innovations supplied
 	 to the stochastic process.
 	 %Dubnov, McAdams, Reynolds (2006); Bailes and Dean (2009)
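+%	Aside, not for the paper: one way to make the Gaussian route concrete. For a
+%	stationary Gaussian process the entropy rate follows from the log-spectral-mean
+%	(Kolmogorov-Szego) formula, and the multi-information rate is then minus half
+%	the log spectral flatness of the power spectral density. A Python sketch with a
+%	numerical check against an AR(1) process; the process and its parameter are
+%	illustrative only:
+%
+%	  import numpy as np
+%
+%	  def multi_info_rate_from_psd(S, omega):
+%	      # rho = 0.5 * ln( marginal variance / one-step prediction error variance ), in nats
+%	      dw = omega[1] - omega[0]
+%	      var_x = np.sum(S) * dw / (2 * np.pi)                  # marginal variance
+%	      var_e = np.exp(np.sum(np.log(S)) * dw / (2 * np.pi))  # Kolmogorov-Szego
+%	      return 0.5 * np.log(var_x / var_e)
+%
+%	  # AR(1): x_t = a x_{t-1} + e_t with unit innovation variance,
+%	  # for which rho = -0.5 * ln(1 - a^2) in closed form
+%	  a = 0.9
+%	  omega = np.linspace(-np.pi, np.pi, 8192, endpoint=False)
+%	  S = 1.0 / np.abs(1.0 - a * np.exp(-1j * omega))**2
+%	  print(multi_info_rate_from_psd(S, omega), -0.5 * np.log(1 - a**2))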
 
+	% !!! FIXME
 		[ Continuous domain information ]
 		[Audio based music expectation modelling]
 		[ Gaussian processes]
@@ -800,6 +799,7 @@
 increases. We also examine whether the information is dependent upon 
 metrical position.
 
+	% !!! FIXME
 
 \section{Information dynamics as compositional aid}
 \label{s:composition}
@@ -1037,6 +1037,8 @@
 	
 
 \section{Conclusion}
+
+	% !!! FIXME
 We outlined our information dynamics approach to the modelling of the perception
 of music.  This approach models the subjective assessments of an observer that
 updates its probabilistic model of a process dynamically as events unfold.  We
@@ -1062,6 +1064,7 @@
 GR/S82213/01 and EP/E045235/1(SA), an EPSRC DTA Studentship (PF), an RAEng/EPSRC Research Fellowship 10216/88 (AR), an EPSRC Leadership Fellowship, EP/G007144/1
 (MDP) and EPSRC IDyOM2 EP/H013059/1.
 This work is partly funded by the CoSound project, funded by the Danish Agency for Science, Technology and Innovation.
+Thanks also to Marcus Pearce for providing the two rule-based analyses of \emph{Two Pages}.
 
 
 \bibliographystyle{IEEEtran}