changeset 36:ec7d64c0ae44

Work on section 2 and 3A.
author samer
date Wed, 14 Mar 2012 23:04:12 +0000
parents 194c7ec7e35d
children f31433225faa
files c4dm.bib draft.pdf draft.tex matlab/indplots.m
diffstat 4 files changed, 127 insertions(+), 52 deletions(-) [+]
line wrap: on
line diff
--- a/c4dm.bib	Wed Mar 14 18:21:16 2012 +0000
+++ b/c4dm.bib	Wed Mar 14 23:04:12 2012 +0000
@@ -2,7 +2,7 @@
 %% http://bibdesk.sourceforge.net/
 
 
-%% Created for Samer Abdallah at 2011-12-28 11:58:20 +0000 
+%% Created for Samer Abdallah at 2012-03-14 22:20:51 +0000 
 
 
 %% Saved with string encoding Unicode (UTF-8) 
@@ -65,6 +65,15 @@
 @string{sigproc = {Signal Processing}}
 
 
+@techreport{AbdallahPlumbley2010b,
+	Author = {Samer A. Abdallah and Mark D. Plumbley},
+	Date-Added = {2012-03-14 22:17:02 +0000},
+	Date-Modified = {2012-03-14 22:20:51 +0000},
+	Institution = {Queen Mary University of London},
+	Number = {C4DM-TR10-09},
+	Title = {Predictive information and Bayesian surprise in exchangeable random processes},
+	Year = {2010}}
+
 @article{AbdallahPlumbley2012,
 	Author = {Samer A. Abdallah and Mark D. Plumbley},
 	Date-Added = {2011-12-28 11:58:01 +0000},
Binary file draft.pdf has changed
--- a/draft.tex	Wed Mar 14 18:21:16 2012 +0000
+++ b/draft.tex	Wed Mar 14 23:04:12 2012 +0000
@@ -228,14 +228,14 @@
 	\subsection{Entropy and information}
 	Let $X$ denote some variable whose value is initially unknown to our 
 	hypothetical observer. We will treat $X$ mathematically as a random variable,
-	with a value to be drawn from some set $\A$ and a 
+	with a value to be drawn from some set $\X$ and a 
 	probability distribution representing the observer's beliefs about the 
 	true value of $X$.
 	In this case, the observer's uncertainty about $X$ can be quantified
 	as the entropy of the random variable $H(X)$. For a discrete variable
-	with probability mass function $p:\A \to [0,1]$, this is
+	with probability mass function $p:\X \to [0,1]$, this is
 	\begin{equation}
-		H(X) = \sum_{x\in\A} -p(x) \log p(x) = \expect{-\log p(X)},
+		H(X) = \sum_{x\in\X} -p(x) \log p(x) = \expect{-\log p(X)},
 	\end{equation}
 	where $\expect{}$ is the expectation operator. The negative-log-probability
 	$\ell(x) = -\log p(x)$ of a particular value $x$ can usefully be thought of as
@@ -249,7 +249,7 @@
 	distributions $p(x)$ and $p(x|\Data)$ respectively:
 	\begin{equation}
 		\mathcal{I}_{\Data\to X} = D(p_{X|\Data} || p_{X})
-			= \sum_{x\in\A} p(x|\Data) \log \frac{p(x|\Data)}{p(x)}.
+			= \sum_{x\in\X} p(x|\Data) \log \frac{p(x|\Data)}{p(x)}.
 	\end{equation}
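+	As a concrete illustration (a hypothetical sketch only, not part of the code
+	accompanying this draft), both quantities are easily computed for discrete
+	distributions; the fragment below computes, in MATLAB, the entropy of a prior
+	$p$ and the information gained when it is replaced by a posterior $q$, with
+	the example numbers chosen arbitrarily:
+\begin{verbatim}
+function demo_info
+% Illustrative only: entropy of a discrete pmf p, and the
+% information gained about X when p is replaced by the
+% posterior q = p(.|Data).
+  p = [0.5; 0.25; 0.25];  % prior beliefs about X
+  q = [0.1; 0.80; 0.10];  % posterior beliefs given Data
+  fprintf('H(X)       = %.3f bits\n', disc_entropy(p));
+  fprintf('I(Data->X) = %.3f bits\n', kl_div(q, p));
+end
+
+function h = disc_entropy(p)
+  nz = p > 0;             % terms with p(x)=0 contribute 0
+  h = -sum(p(nz) .* log2(p(nz)));
+end
+
+function d = kl_div(q, p)
+  nz = q > 0;             % assumes p(x)>0 wherever q(x)>0
+  d = sum(q(nz) .* log2(q(nz) ./ p(nz)));
+end
+\end{verbatim}
+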
 	When there are multiple variables $X_1, X_2$ 
 	\etc which the observer believes to be dependent, then the observation of 
@@ -355,19 +355,51 @@
 	\end{fig}
 
 
-	\subsection{Entropy and information in sequences}
+	\subsection{Surprise and information in sequences}
+	\label{s:surprise-info-seq}
 
-	Suppose that  $(\ldots,X_{-1},X_0,X_1,\ldots)$ is a stationary sequence of
+	Suppose that  $(\ldots,X_{-1},X_0,X_1,\ldots)$ is a sequence of
 	random variables, infinite in both directions, 
-	and that $\mu$ is the associated shift-invariant probability measure over all 
-	configurations of the sequence---in the following, $\mu$ will simply serve
+	and that $\mu$ is the associated probability measure over all 
+	realisations of the sequence---in the following, $\mu$ will simply serve
 	as a label for the process. We can identify a number of information-theoretic
 	measures meaningful in the context of a sequential observation of the sequence, during
-	which, at any time $t$, there is `present' $X_t$, a `past' 
+	which, at any time $t$, the sequence of variables can be divided into a `present' $X_t$, a `past' 
 	$\past{X}_t \equiv (\ldots, X_{t-2}, X_{t-1})$, and a `future' 
 	$\fut{X}_t \equiv (X_{t+1},X_{t+2},\ldots)$.
-	Since the sequence is assumed stationary, we can without loss of generality,
-	assume that $t=0$ in the following definitions.
+	The actually observed value of $X_t$ will be written as $x_t$, and
+	the sequence of observations up to but not including $x_t$ as 
+	$\past{x}_t$.
+%	Since the sequence is assumed stationary, we can without loss of generality,
+%	assume that $t=0$ in the following definitions.
+
+	The in-context surprisingness of the observation $X_t=x_t$ is a function
+	of both $x_t$ and the context $\past{x}_t$:
+	\begin{equation}
+		\ell(x_t|\past{x}_t) = - \log p(x_t|\past{x}_t).
+	\end{equation}
+	However, before $X_t$ is observed to be $x_t$, the observer can compute
+	the \emph{expected} surprisingness of $X_t$ as a measure of its uncertainty about
+	the very next event; this may be written as an entropy 
+	$H(X_t|\ev(\past{X}_t = \past{x}_t))$, but note that this is
+	conditional on the \emph{event} $\ev(\past{X}_t=\past{x}_t)$, not on the
+	\emph{variables} $\past{X}_t$ as in the conventional conditional entropy.
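+	In terms of the predictive distribution $p(\cdot|\past{x}_t)$, this expected
+	surprisingness is simply the entropy of the observer's current beliefs about
+	the next symbol:
+	\begin{equation}
+		H(X_t|\ev(\past{X}_t = \past{x}_t)) = \sum_{x\in\X} -p(x|\past{x}_t) \log p(x|\past{x}_t).
+	\end{equation}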
+
+	The surprisingness $\ell(x_t|\past{x}_t)$ and expected surprisingness
+	$H(X_t|\ev(\past{X}_t=\past{x}_t))$
+	are subjective information dynamics measures, since they are based on the
+	observer's subjective probability model evaluated in the context of the actually
+	observed sequence $\past{x}_t$---they characterise what it is like to `be in
+	the observer's shoes'.
+	If we view the observer as a purely passive or reactive agent, this would
+	probably be sufficient, but for active agents such as humans or animals, it is
+	often necessary to \emph{anticipate} future events in order, for example, to plan the
+	most effective course of action. It makes sense for such observers to be
+	concerned about the predictive probability distribution over future events,
+	$p(\fut{x}_t|\past{x}_t)$. When an observation $\ev(X_t=x_t)$ is made in this context,
+	the \emph{instantaneous predictive information} (IPI) is the information in the
+	event $\ev(X_t=x_t)$ about the entire future of the sequence $\fut{X}_t$.
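+	Writing $\mathcal{I}_t$ for the IPI at time $t$ and following the definition of
+	the information in an observation given earlier, this is the divergence between
+	the observer's beliefs about the future before and after the observation, both
+	in the context of the observed past:
+	\begin{equation}
+		\mathcal{I}_t = D(p_{\fut{X}_t|\ev(X_t=x_t),\ev(\past{X}_t=\past{x}_t)}
+			|| p_{\fut{X}_t|\ev(\past{X}_t=\past{x}_t)}).
+	\end{equation}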
+
+	\subsection{Information measures for stationary random processes}
 
 	The \emph{entropy rate} of the process is the entropy of the next variable
 	$X_t$ given all the previous ones.
@@ -502,8 +534,6 @@
 	several of the information measures we have discussed so far.
 
 
-	\subsection{Other sequential information measures}
-
 	James et al.\ \cite{JamesEllisonCrutchfield2011} study the predictive information
 	rate and also examine some related measures. In particular they identify
 	$\sigma_\mu$, the difference between the multi-information rate and the excess
@@ -513,33 +543,6 @@
 	$w_\mu = \rho_\mu + b_{\mu}$, which they call the \emph{local exogenous
 	information} rate.
 
-	\subsection{First order Markov chains}
-	These are the simplest non-trivial models to which information dynamics methods
-	can be applied. In \cite{AbdallahPlumbley2009} we, showed that the predictive information
-	rate can be expressed simply in terms of the entropy rate of the Markov chain.
-	If we let $a$ denote the transition matrix of the Markov chain, and $h_a$ it's
-	entropy rate, then its predictive information rate $b_a$ is
-	\begin{equation}
-		b_a = h_{a^2} - h_a,
-	\end{equation}
-	where $a^2 = aa$, the transition matrix squared, is the transition matrix
-	of the `skip one' Markov chain obtained by leaving out every other observation.
-
-	\subsection{Higher order Markov chains}
-	Second and higher order Markov chains can be treated in a similar way by transforming
-	to a first order representation of the high order Markov chain. If we are dealing
-	with an $N$th order model, this is done forming a new alphabet of possible observations 
-	consisting of all possible $N$-tuples of symbols from the base alphabet. An observation
-	in this new model represents a block of $N$ observations from the base model. The next
-	observation represents the block of $N$ obtained by shift the previous block along
-	by one step. The new Markov of chain is parameterised by a sparse $K^N\times K^N$
-	transition matrix $\hat{a}$.
-	\begin{equation}
-		b_{\hat{a}} = h_{\hat{a}^{N+1}} - N h_{\hat{a}},
-	\end{equation}
-	where $\hat{a}^{N+1}$ is the $N+1$th power of the transition matrix.  
-		
-
   \begin{fig}{wundt}
     \raisebox{-4em}{\colfig[0.43]{wundt}}
  %  {\ \shortstack{{\Large$\longrightarrow$}\\ {\scriptsize\emph{exposure}}}\ }
@@ -553,18 +556,46 @@
   \end{fig}
 
 
+	\subsection{First and higher order Markov chains}
+	First order Markov chains are the simplest non-trivial models to which information 
+	dynamics methods can be applied. In \cite{AbdallahPlumbley2009} we derived
+	expressions for all the information measures introduced above for
+	irreducible stationary Markov chains (\ie those with a unique stationary
+	distribution). The derivation is greatly simplified by the dependency structure
+	of the Markov chain: for the purpose of the analysis, the `past' and `future'
+	segments $\past{X}_t$ and $\fut{X}_t$ can be reduced to just the previous
+	and next variables $X_{t-1}$ and $X_{t+1}$ respectively. We also showed that
+	the predictive information rate can be expressed simply in terms of entropy rates:
+	if we let $a$ denote the $K\times K$ transition matrix of a Markov chain over
+	the alphabet $\{1,\ldots,K\}$, such that
+	$a_{ij} = \Pr(\ev(X_t=i)|\ev(X_{t-1}=j))$, and let $h:\reals^{K\times K}\to \reals$ be
+	the entropy rate function such that $h(a)$ is the entropy rate of a Markov chain
+	with transition matrix $a$, then the predictive information rate $b(a)$ is
+	\begin{equation}
+		b(a) = h(a^2) - h(a),
+	\end{equation}
+	where $a^2$, the transition matrix squared, is the transition matrix
+	of the `skip one' Markov chain obtained by jumping two steps at a time
+	along the original chain.
+
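+	As a minimal illustration (a sketch only, not the code used to produce the
+	analyses discussed later), the formula can be evaluated directly from a
+	column-stochastic transition matrix:
+\begin{verbatim}
+function b = markov_pir(a)
+% Sketch: predictive information rate b(a) = h(a^2) - h(a)
+% of a first order Markov chain with column-stochastic
+% transition matrix a(i,j) = Pr(X_t = i | X_{t-1} = j).
+  b = entrate(a*a) - entrate(a);
+end
+
+function h = entrate(a)
+% Entropy rate: entropy of the next-state distribution,
+% averaged over the stationary distribution of the chain.
+  [V,D] = eig(a);
+  [~,k] = min(abs(diag(D) - 1));  % eigenvalue closest to 1
+  p = abs(V(:,k)); p = p/sum(p);  % stationary distribution
+  L = log2(a + (a==0));           % so that 0*log(0) = 0 below
+  h = -sum(a .* L, 1) * p;        % column entropies, averaged
+end
+\end{verbatim}
+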
+	Second and higher order Markov chains can be treated in a similar way by transforming
+	to a first order representation of the high order Markov chain. If we are dealing
+	with an $N$th order model, this is done by forming a new alphabet of size $K^N$
+	consisting of all possible $N$-tuples of symbols from the base alphabet of size $K$. 
+	An observation $\hat{x}_t$ in this new model represents a block of $N$ observations 
+	$(x_{t+1},\ldots,x_{t+N})$ from the base model. The next
+	observation $\hat{x}_{t+1}$ represents the block of $N$ observations obtained by
+	shifting the previous block along by one step. The new Markov chain is parameterised by a sparse $K^N\times K^N$
+	transition matrix $\hat{a}$. The entropy rate of the first order system is the same
+	as the entropy rate of the original order $N$ system, and its PIR is
+	\begin{equation}
+		b({\hat{a}}) = h({\hat{a}^{N+1}}) - N h({\hat{a}}),
+	\end{equation}
+	where $\hat{a}^{N+1}$ is the $(N+1)$th power of the first order transition matrix.  
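+
+	For illustration, the construction for a second order chain over $K$ symbols
+	might look as follows (a hypothetical sketch; the indexing convention for the
+	$K^2$ block states is arbitrary), after which the PIR is
+	$h(\hat{a}^3) - 2h(\hat{a})$:
+\begin{verbatim}
+function ahat = embed2(a2)
+% Sketch: first order representation of a 2nd order chain,
+% where a2(k,j,i) = Pr(X_t=k | X_{t-1}=j, X_{t-2}=i).
+% States of the new chain are the pairs (X_{t-1}, X_t).
+  K = size(a2,1);
+  ahat = sparse(K^2, K^2);
+  for i = 1:K
+    for j = 1:K
+      for k = 1:K
+        s = sub2ind([K K], i, j);  % old state (i,j)
+        t = sub2ind([K K], j, k);  % new state (j,k)
+        ahat(t,s) = a2(k,j,i);
+      end
+    end
+  end
+end
+% e.g. b = entrate(full(ahat^3)) - 2*entrate(full(ahat))
+\end{verbatim}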
+		
+
 \section{Information Dynamics in Analysis}
 
- 	\subsection{Musicological Analysis}
-	In \cite{AbdallahPlumbley2009}, methods based on the theory described above
-	were used to analysis two pieces of music in the minimalist style 
-	by Philip Glass: \emph{Two Pages} (1969) and \emph{Gradus} (1968).
-	The analysis was done using a first-order Markov chain model, with the
-	enhancement that the transition matrix of the model was allowed to
-	evolve dynamically as the notes were processed, and was estimated (in
-	a Bayesian way) as a \emph{distribution} over possible transition matrices,
-	rather than a point estimate. [Bayesian surprise, other component of IPI].
-	
     \begin{fig}{twopages}
       \colfig[0.96]{matbase/fig9471}  % update from mbc paper
 %      \colfig[0.97]{matbase/fig72663}\\  % later update from mbc paper (Keith's new picks)
@@ -584,6 +615,41 @@
       }
     \end{fig}
 
+ 	\subsection{Musicological Analysis}
+	In \cite{AbdallahPlumbley2009}, methods based on the theory described above
+	were used to analyse two pieces of music in the minimalist style 
+	by Philip Glass: \emph{Two Pages} (1969) and \emph{Gradus} (1968).
+	The analysis was done using a first-order Markov chain model, with the
+	enhancement that the transition matrix of the model was allowed to
+	evolve dynamically as the notes were processed, and was tracked (in
+	a Bayesian way) as a \emph{distribution} over possible transition matrices,
+	rather than a point estimate. The results are summarised in \figrf{twopages}:
+	the upper four plots show the dynamically evolving subjective information
+	measures described in \secrf{surprise-info-seq}, computed using a point
+	estimate of the current transition matrix, while the fifth plot (the `model
+	information rate') measures the information in each observation about the
+	transition matrix itself.
+	In \cite{AbdallahPlumbley2010b}, we showed that this `model information rate'
+	is actually a component of the true IPI in
+	a time-varying Markov chain, which was neglected when we computed the IPI from
+	point estimates of the transition matrix as if the transition probabilities
+	were constant.
+
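+	For illustration only, and assuming a conjugate Dirichlet prior on each column
+	of the transition matrix (the hypothetical sketch below ignores the dynamic
+	evolution of the matrix in the actual model), the information in a single
+	observed transition about the transition matrix can be computed as the
+	divergence of the posterior Dirichlet from the prior:
+\begin{verbatim}
+function info = model_info(alpha, j, i)
+% Hypothetical sketch: information (in nats) in an observed
+% transition j -> i about the j-th column of the transition
+% matrix, under a Dirichlet prior with parameters alpha(:,j).
+  a  = alpha(:,j);
+  ap = a; ap(i) = ap(i) + 1;  % posterior counts
+  info = dir_kl(ap, a);       % D(posterior || prior)
+end
+
+function d = dir_kl(q, p)
+% KL divergence D(Dir(q) || Dir(p)), standard closed form.
+  d = gammaln(sum(q)) - sum(gammaln(q)) ...
+    - gammaln(sum(p)) + sum(gammaln(p)) ...
+    + (q - p)' * (psi(q) - psi(sum(q)));
+end
+\end{verbatim}
+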
+	The peaks of the surprisingness and both components of the predictive information 
+	show good correspondence with the structure of the piece, both as marked in the score
+	and as analysed by musicologist Keith Potter, who was asked to mark the six
+	`most surprising moments' of the piece (shown as asterisks in the fifth plot)%
+	\footnote{%
+	Note that the boundary marked in the score at around note 5,400 is known to be
+	anomalous; on the basis of a listening analysis, some musicologists [ref] have
+	placed the boundary a few bars later, in agreement with our analysis.}.
+
+	In contrast, the analyses shown in the lower two plots of \figrf{twopages},
+	obtained using two rule-based music segmentation algorithms, while clearly
+	\emph{reflecting} the structure of the piece, do not \emph{segment} it
+	clearly, showing only a weak tendency for the boundary strength function
+	to peak at the boundaries in the piece.
+	
+
     \begin{fig}{metre}
 %      \scalebox{1}[1]{%
         \begin{tabular}{cc}
--- a/matlab/indplots.m	Wed Mar 14 18:21:16 2012 +0000
+++ b/matlab/indplots.m	Wed Mar 14 23:04:12 2012 +0000
@@ -1,7 +1,7 @@
 function indplots(files_shifts,varargin)
 	opts=prefs('print',0,varargin{:});
 	for i=1:length(files_shifts)
-		if finite_shift(files_shifts{i})
+		if exist(files_shifts{i}{1},'file') && finite_shift(files_shifts{i})
 			fileplot(files_shifts{i},'basename',sprintf('plots/file%d',i),opts);
 		end
 	end