changeset 36:ec7d64c0ae44
Work on section 2 and 3A.
| author | samer |
|---|---|
| date | Wed, 14 Mar 2012 23:04:12 +0000 |
| parents | 194c7ec7e35d |
| children | f31433225faa |
| files | c4dm.bib draft.pdf draft.tex matlab/indplots.m |
| diffstat | 4 files changed, 127 insertions(+), 52 deletions(-) |
line diff
--- a/c4dm.bib	Wed Mar 14 18:21:16 2012 +0000
+++ b/c4dm.bib	Wed Mar 14 23:04:12 2012 +0000
@@ -2,7 +2,7 @@
  %% http://bibdesk.sourceforge.net/
- %% Created for Samer Abdallah at 2011-12-28 11:58:20 +0000
+ %% Created for Samer Abdallah at 2012-03-14 22:20:51 +0000
  %% Saved with string encoding Unicode (UTF-8)
@@ -65,6 +65,15 @@
  @string{sigproc = {Signal Processing}}
+ @techreport{AbdallahPlumbley2010b,
+    Author = {Samer A. Abdallah and Mark D. Plumbley},
+    Date-Added = {2012-03-14 22:17:02 +0000},
+    Date-Modified = {2012-03-14 22:20:51 +0000},
+    Institution = {Queen Mary University of London},
+    Number = {C4DM-TR10-09},
+    Title = {Predictive information and Bayesian surprise in exchangeable random processes},
+    Year = {2010}}
+
  @article{AbdallahPlumbley2012,
     Author = {Samer A. Abdallah and Mark D. Plumbley},
     Date-Added = {2011-12-28 11:58:01 +0000},
--- a/draft.tex	Wed Mar 14 18:21:16 2012 +0000
+++ b/draft.tex	Wed Mar 14 23:04:12 2012 +0000
@@ -228,14 +228,14 @@
  \subsection{Entropy and information}
  Let $X$ denote some variable whose value is initially unknown to our
  hypothetical observer. We will treat $X$ mathematically as a random variable,
- with a value to be drawn from some set $\A$ and a
+ with a value to be drawn from some set $\X$ and a
  probability distribution representing the observer's beliefs about the
  true value of $X$. In this case, the observer's uncertainty about $X$
  can be quantified as the entropy of the random variable $H(X)$. For a discrete variable
- with probability mass function $p:\A \to [0,1]$, this is
+ with probability mass function $p:\X \to [0,1]$, this is
  \begin{equation}
-   H(X) = \sum_{x\in\A} -p(x) \log p(x) = \expect{-\log p(X)},
+   H(X) = \sum_{x\in\X} -p(x) \log p(x) = \expect{-\log p(X)},
  \end{equation}
  where $\expect{}$ is the expectation operator. The negative-log-probability
  $\ell(x) = -\log p(x)$ of a particular value $x$ can usefully be thought of as
@@ -249,7 +249,7 @@
  distributions $p(x)$ and $p(x|\Data)$ respectively:
  \begin{equation}
    \mathcal{I}_{\Data\to X} = D(p_{X|\Data} || p_{X})
-     = \sum_{x\in\A} p(x|\Data) \log \frac{p(x|\Data)}{p(x)}.
+     = \sum_{x\in\X} p(x|\Data) \log \frac{p(x|\Data)}{p(x)}.
  \end{equation}
  When there are multiple variables $X_1, X_2$ \etc which the observer
  believes to be dependent, then the observation of
@@ -355,19 +355,51 @@
  \end{fig}
- \subsection{Entropy and information in sequences}
+ \subsection{Surprise and information in sequences}
+ \label{s:surprise-info-seq}
- Suppose that $(\ldots,X_{-1},X_0,X_1,\ldots)$ is a stationary sequence of
+ Suppose that $(\ldots,X_{-1},X_0,X_1,\ldots)$ is a sequence of
  random variables, infinite in both directions,
- and that $\mu$ is the associated shift-invariant probability measure over all
- configurations of the sequence---in the following, $\mu$ will simply serve
+ and that $\mu$ is the associated probability measure over all
+ realisations of the sequence---in the following, $\mu$ will simply serve
  as a label for the process. We can identify
  a number of information-theoretic measures meaningful in the context of a
  sequential observation of the sequence, during
- which, at any time $t$, there is `present' $X_t$, a `past'
+ which, at any time $t$, the sequence of variables can be divided into a `present' $X_t$, a `past'
  $\past{X}_t \equiv (\ldots, X_{t-2}, X_{t-1})$, and a `future'
  $\fut{X}_t \equiv (X_{t+1},X_{t+2},\ldots)$.
- Since the sequence is assumed stationary, we can without loss of generality,
- assume that $t=0$ in the following definitions.
+ The actually observed value of $X_t$ will be written as $x_t$, and
+ the sequence of observations up to but not including $x_t$ as
+ $\past{x}_t$.
+ % Since the sequence is assumed stationary, we can without loss of generality,
+ % assume that $t=0$ in the following definitions.
+
+ The in-context surprisingness of the observation $X_t=x_t$ is a function
+ of both $x_t$ and the context $\past{x}_t$:
+ \begin{equation}
+   \ell(x_t|\past{x}_t) = - \log p(x_t|\past{x}_t).
+ \end{equation}
+ However, before $X_t$ is observed to be $x_t$, the observer can compute
+ its \emph{expected} surprisingness as a measure of its uncertainty about
+ the very next event; this may be written as an entropy
+ $H(X_t|\ev(\past{X}_t = \past{x}_t))$, but note that this is
+ conditional on the \emph{event} $\ev(\past{X}_t=\past{x}_t)$, not
+ the \emph{variables} $\past{X}_t$ as in the conventional conditional entropy.
+
+ The surprisingness $\ell(x_t|\past{x}_t)$ and expected surprisingness
+ $H(X_t|\ev(\past{X}_t=\past{x}_t))$
+ are subjective information dynamic measures since they are based on the observer's
+ subjective probability model in the context of the actually observed sequence
+ $\past{x}_t$---they characterise what it is like to `be in the observer's shoes'.
+ If we view the observer as a purely passive or reactive agent, this would
+ probably be sufficient, but for active agents such as humans or animals, it is
+ often necessary to \emph{anticipate} future events in order, for example, to plan the
+ most effective course of action. It makes sense for such observers to be
+ concerned about the predictive probability distribution over future events,
+ $p(\fut{x}_t|\past{x}_t)$. When an observation $\ev(X_t=x_t)$ is made in this context,
+ the \emph{instantaneous predictive information} (IPI) is the information in the
+ event $\ev(X_t=x_t)$ about the entire future of the sequence $\fut{X}_t$.
+
+ \subsection{Information measures for stationary random processes}
  The \emph{entropy rate} of the process is the entropy of the next variable
  $X_t$ given all the previous ones.
@@ -502,8 +534,6 @@
  several of the information measures we have discussed so far.
- \subsection{Other sequential information measures}
-
  James et al \cite{JamesEllisonCrutchfield2011} study the predictive information
  rate and also examine some related measures. In particular they identify the
  $\sigma_\mu$, the difference between the multi-information rate and the excess
@@ -513,33 +543,6 @@
  $w_\mu = \rho_\mu + b_{\mu}$, which they call the \emph{local exogenous
  information} rate.
- \subsection{First order Markov chains}
- These are the simplest non-trivial models to which information dynamics methods
- can be applied. In \cite{AbdallahPlumbley2009} we, showed that the predictive information
- rate can be expressed simply in terms of the entropy rate of the Markov chain.
- If we let $a$ denote the transition matrix of the Markov chain, and $h_a$ it's
- entropy rate, then its predictive information rate $b_a$ is
- \begin{equation}
-   b_a = h_{a^2} - h_a,
- \end{equation}
- where $a^2 = aa$, the transition matrix squared, is the transition matrix
- of the `skip one' Markov chain obtained by leaving out every other observation.
-
- \subsection{Higher order Markov chains}
- Second and higher order Markov chains can be treated in a similar way by transforming
- to a first order representation of the high order Markov chain. If we are dealing
- with an $N$th order model, this is done forming a new alphabet of possible observations
- consisting of all possible $N$-tuples of symbols from the base alphabet. An observation
- in this new model represents a block of $N$ observations from the base model. The next
- observation represents the block of $N$ obtained by shift the previous block along
- by one step. The new Markov of chain is parameterised by a sparse $K^N\times K^N$
- transition matrix $\hat{a}$.
- \begin{equation}
-   b_{\hat{a}} = h_{\hat{a}^{N+1}} - N h_{\hat{a}},
- \end{equation}
- where $\hat{a}^{N+1}$ is the $N+1$th power of the transition matrix.
-
-
  \begin{fig}{wundt}
  \raisebox{-4em}{\colfig[0.43]{wundt}}
  % {\ \shortstack{{\Large$\longrightarrow$}\\ {\scriptsize\emph{exposure}}}\ }
@@ -553,18 +556,46 @@
  \end{fig}
+ \subsection{First and higher order Markov chains}
+ First order Markov chains are the simplest non-trivial models to which information
+ dynamics methods can be applied. In \cite{AbdallahPlumbley2009} we derived
+ expressions for all the information measures introduced [above] for
+ irreducible stationary Markov chains (\ie that have a unique stationary
+ distribution). The derivation is greatly simplified by the dependency structure
+ of the Markov chain: for the purpose of the analysis, the `past' and `future'
+ segments $\past{X}_t$ and $\fut{X}_t$ can be reduced to just the previous
+ and next variables $X_{t-1}$ and $X_{t+1}$ respectively. We also showed that
+ the predictive information rate can be expressed simply in terms of entropy rates:
+ if we let $a$ denote the $K\times K$ transition matrix of a Markov chain over
+ an alphabet of $\{1,\ldots,K\}$, such that
+ $a_{ij} = \Pr(\ev(X_t=i)|\ev(X_{t-1}=j))$, and let $h:\reals^{K\times K}\to \reals$ be
+ the entropy rate function such that $h(a)$ is the entropy rate of a Markov chain
+ with transition matrix $a$, then the predictive information rate $b(a)$ is
+ \begin{equation}
+   b(a) = h(a^2) - h(a),
+ \end{equation}
+ where $a^2$, the transition matrix squared, is the transition matrix
+ of the `skip one' Markov chain obtained by jumping two steps at a time
+ along the original chain.
+
+ Second and higher order Markov chains can be treated in a similar way by transforming
+ to a first order representation of the high order Markov chain. If we are dealing
+ with an $N$th order model, this is done by forming a new alphabet of size $K^N$
+ consisting of all possible $N$-tuples of symbols from the base alphabet of size $K$.
+ An observation $\hat{x}_t$ in this new model represents a block of $N$ observations
+ $(x_{t+1},\ldots,x_{t+N})$ from the base model. The next
+ observation $\hat{x}_{t+1}$ represents the block of $N$ obtained by shifting the previous
+ block along by one step. The new Markov chain is parameterised by a sparse $K^N\times K^N$
+ transition matrix $\hat{a}$. The entropy rate of the first order system is the same
+ as the entropy rate of the original order $N$ system, and its PIR is
+ \begin{equation}
+   b({\hat{a}}) = h({\hat{a}^{N+1}}) - N h({\hat{a}}),
+ \end{equation}
+ where $\hat{a}^{N+1}$ is the $(N+1)$th power of the first order transition matrix.
+
+ \section{Information Dynamics in Analysis}
- \subsection{Musicological Analysis}
- In \cite{AbdallahPlumbley2009}, methods based on the theory described above
- were used to analysis two pieces of music in the minimalist style
- by Philip Glass: \emph{Two Pages} (1969) and \emph{Gradus} (1968).
- The analysis was done using a first-order Markov chain model, with the
- enhancement that the transition matrix of the model was allowed to
- evolve dynamically as the notes were processed, and was estimated (in
- a Bayesian way) as a \emph{distribution} over possible transition matrices,
- rather than a point estimate. [Bayesian surprise, other component of IPI].
-
  \begin{fig}{twopages}
  \colfig[0.96]{matbase/fig9471} % update from mbc paper
  % \colfig[0.97]{matbase/fig72663}\\ % later update from mbc paper (Keith's new picks)
@@ -584,6 +615,41 @@
  }
  \end{fig}
+ \subsection{Musicological Analysis}
+ In \cite{AbdallahPlumbley2009}, methods based on the theory described above
+ were used to analyse two pieces of music in the minimalist style
+ by Philip Glass: \emph{Two Pages} (1969) and \emph{Gradus} (1968).
+ The analysis was done using a first-order Markov chain model, with the
+ enhancement that the transition matrix of the model was allowed to
+ evolve dynamically as the notes were processed, and was tracked (in
+ a Bayesian way) as a \emph{distribution} over possible transition matrices,
+ rather than a point estimate. The results are summarised in \figrf{twopages}:
+ the upper four plots show the dynamically evolving subjective information
+ measures as described in \secrf{surprise-info-seq} computed using a point
+ estimate of the current transition matrix, but the fifth plot (the `model information rate')
+ measures the information in each observation about the transition matrix.
+ In \cite{AbdallahPlumbley2010b}, we showed that this `model information rate'
+ is actually a component of the true IPI in
+ a time-varying Markov chain, which was neglected when we computed the IPI from
+ point estimates of the transition matrix as if the transition probabilities
+ were constant.
+
+ The peaks of the surprisingness and both components of the predictive information
+ show good correspondence with the structure of the piece both as marked in the score
+ and as analysed by musicologist Keith Potter, who was asked to mark the six
+ `most surprising moments' of the piece (shown as asterisks in the fifth plot)%
+ \footnote{%
+   Note that the boundary marked in the score at around note 5,400 is known to be
+   anomalous; on the basis of a listening analysis, some musicologists [ref] have
+   placed the boundary a few bars later, in agreement with our analysis.}.
+
+ In contrast, the analyses shown in the lower two plots of \figrf{twopages},
+ obtained using two rule-based music segmentation algorithms, while clearly
+ \emph{reflecting} the structure of the piece, do not \emph{segment} the piece
+ clearly, despite a tendency for the boundary strength function to peak at
+ the boundaries in the piece.
+
+ \begin{fig}{metre}
  % \scalebox{1}[1]{%
  \begin{tabular}{cc}
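The Markov-chain relation added above, b(a) = h(a^2) - h(a), is easy to check numerically. The following MATLAB sketch is illustrative only and is not part of this changeset: the names pir_check and entropy_rate, the 3-state example matrix, and the column-stochastic convention a_ij = Pr(X_t = i | X_{t-1} = j) are assumptions made for the example.

    % pir_check.m -- illustrative sketch (not from this repository): numerically
    % check b(a) = h(a^2) - h(a) for a first-order Markov chain.
    % Assumed convention: a is column-stochastic, a_ij = Pr(X_t = i | X_{t-1} = j).
    function pir_check
        a = [ 0.8 0.1 0.2;            % arbitrary 3-state example transition matrix
              0.1 0.8 0.2;
              0.1 0.1 0.6 ];
        h1 = entropy_rate(a);         % entropy rate h(a)
        h2 = entropy_rate(a^2);       % entropy rate of the 'skip one' chain, h(a^2)
        fprintf('h(a)=%.4f, h(a^2)=%.4f, b(a)=%.4f nats\n', h1, h2, h2-h1);
    end

    function h = entropy_rate(a)
        % h(a) = sum_j pi_j * sum_i -a_ij*log(a_ij), where pi is the stationary
        % distribution, i.e. the eigenvector of a with eigenvalue 1.
        [V,D] = eig(a);
        [~,k] = min(abs(diag(D)-1));  % pick the eigenvalue closest to 1
        p = abs(V(:,k)); p = p/sum(p);
        al = a; al(al==0) = 1;        % convention: 0*log(0) = 0
        h = -sum(a.*log(al),1) * p;
    end

In principle the same entropy_rate helper could also be applied to the sparse K^N-by-K^N block transition matrix of a higher-order chain to evaluate the second expression above, b(a_hat) = h(a_hat^(N+1)) - N*h(a_hat).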
--- a/matlab/indplots.m	Wed Mar 14 18:21:16 2012 +0000
+++ b/matlab/indplots.m	Wed Mar 14 23:04:12 2012 +0000
@@ -1,7 +1,7 @@
  function indplots(files_shifts,varargin)
    opts=prefs('print',0,varargin{:});
    for i=1:length(files_shifts)
-     if finite_shift(files_shifts{i})
+     if exist(files_shifts{i}{1},'file') && finite_shift(files_shifts{i})
        fileplot(files_shifts{i},'basename',sprintf('plots/file%d',i),opts);
      end
    end
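For context, a hypothetical call to indplots showing the effect of the new guard; the cell-of-cells layout of files_shifts (file name first, shift second) is inferred from the indexing files_shifts{i}{1} above, and the file names are placeholders, not paths from this repository.

    % Hypothetical usage: entries whose data file is missing on disk are now
    % skipped instead of reaching fileplot and failing.
    files_shifts = { {'data/twopages.mat', 0}, ...
                     {'data/missing.mat',  1} };  % assume the second file does not exist
    indplots(files_shifts);                       % only entries with existing files are plotted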