draft.tex @ 36:ec7d64c0ae44

Work on section 2 and 3A.
author samer
date Wed, 14 Mar 2012 23:04:12 +0000
parents 194c7ec7e35d
children f31433225faa
\section{Theoretical review}

\subsection{Entropy and information}
Let $X$ denote some variable whose value is initially unknown to our
hypothetical observer. We will treat $X$ mathematically as a random variable,
with a value to be drawn from some set $\X$ and a
probability distribution representing the observer's beliefs about the
true value of $X$.
In this case, the observer's uncertainty about $X$ can be quantified
as the entropy $H(X)$ of the random variable. For a discrete variable
with probability mass function $p:\X \to [0,1]$, this is
\begin{equation}
	H(X) = \sum_{x\in\X} -p(x) \log p(x) = \expect{-\log p(X)},
\end{equation}
where $\expect{}$ is the expectation operator. The negative log-probability
$\ell(x) = -\log p(x)$ of a particular value $x$ can usefully be thought of as
the \emph{surprisingness} of the value $x$ should it be observed, and
hence the entropy is the expected surprisingness.
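For example, if the observer's beliefs about $X$ are uniform over an alphabet
of $K$ symbols, every outcome is equally surprising and the entropy takes its
maximal value,
\begin{equation}
	\ell(x) = -\log \frac{1}{K} = \log K, \qquad
	H(X) = \sum_{x\in\X} \frac{1}{K} \log K = \log K,
\end{equation}
whereas a distribution concentrated entirely on one value yields zero entropy.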

If the observer then acquires some new data $\Data$, the information
in this new data \emph{about} $X$ can be quantified as the
Kullback-Leibler (KL) divergence between the prior and posterior
distributions $p(x)$ and $p(x|\Data)$ respectively:
\begin{equation}
	\mathcal{I}_{\Data\to X} = D(p_{X|\Data} || p_{X})
		= \sum_{x\in\X} p(x|\Data) \log \frac{p(x|\Data)}{p(x)}.
\end{equation}
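For example, if the prior over $X$ is uniform on an alphabet of $K$ symbols and
the data $\Data$ reveal the value of $X$ exactly, then the posterior puts all
its mass on the observed value and
\begin{equation}
	\mathcal{I}_{\Data\to X} = 1 \cdot \log \frac{1}{1/K} = \log K,
\end{equation}
that is, the data deliver all of the prior uncertainty $H(X) = \log K$.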
When there are multiple variables $X_1, X_2$
\etc which the observer believes to be dependent, then the observation of
one may change its beliefs and hence yield information about the
others. The joint and conditional entropies as described in any
Some other information measures are indicated in the legend.
}
\end{fig}


\subsection{Surprise and information in sequences}
\label{s:surprise-info-seq}

Suppose that $(\ldots,X_{-1},X_0,X_1,\ldots)$ is a sequence of
random variables, infinite in both directions,
and that $\mu$ is the associated probability measure over all
realisations of the sequence---in the following, $\mu$ will simply serve
as a label for the process. We can identify a number of information-theoretic
measures that are meaningful in the context of a sequential observation of the sequence, during
which, at any time $t$, the sequence of variables can be divided into a `present' $X_t$, a `past'
$\past{X}_t \equiv (\ldots, X_{t-2}, X_{t-1})$, and a `future'
$\fut{X}_t \equiv (X_{t+1},X_{t+2},\ldots)$.
The actually observed value of $X_t$ will be written as $x_t$, and
the sequence of observations up to but not including $x_t$ as
$\past{x}_t$.
% Since the sequence is assumed stationary, we can without loss of generality,
% assume that $t=0$ in the following definitions.

The in-context surprisingness of the observation $X_t=x_t$ is a function
of both $x_t$ and the context $\past{x}_t$:
\begin{equation}
	\ell(x_t|\past{x}_t) = - \log p(x_t|\past{x}_t).
\end{equation}
However, before $X_t$ is observed to be $x_t$, the observer can compute
its \emph{expected} surprisingness as a measure of its uncertainty about
the very next event; this may be written as an entropy
$H(X_t|\ev(\past{X}_t = \past{x}_t))$, but note that this is
conditional on the \emph{event} $\ev(\past{X}_t=\past{x}_t)$, not on the
\emph{variables} $\past{X}_t$ as in the conventional conditional entropy.
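Written out in terms of the observer's predictive distribution, this is simply
the expected surprisingness of the next symbol:
\begin{equation}
	H(X_t|\ev(\past{X}_t=\past{x}_t)) = \sum_{x\in\X} - p(x|\past{x}_t) \log p(x|\past{x}_t).
\end{equation}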

The surprisingness $\ell(x_t|\past{x}_t)$ and expected surprisingness
$H(X_t|\ev(\past{X}_t=\past{x}_t))$
are subjective information dynamic measures, since they are based on the observer's
subjective probability model in the context of the actually observed sequence
$\past{x}_t$---they characterise what it is like to `be in the observer's shoes'.
If we view the observer as a purely passive or reactive agent, these would
probably be sufficient, but for active agents such as humans or animals, it is
often necessary to \emph{anticipate} future events in order, for example, to plan the
most effective course of action. It makes sense for such observers to be
concerned with the predictive probability distribution over future events,
$p(\fut{x}_t|\past{x}_t)$. When an observation $\ev(X_t=x_t)$ is made in this context,
the \emph{instantaneous predictive information} (IPI) is the information in the
event $\ev(X_t=x_t)$ about the entire future of the sequence $\fut{X}_t$.
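Following the same pattern as the information $\mathcal{I}_{\Data\to X}$ defined
above, the IPI can be written as a KL divergence between the observer's beliefs
about the future after and before the observation:
\begin{equation}
	\mathcal{I}_t = D(p_{\fut{X}_t|\ev(X_t=x_t,\past{X}_t=\past{x}_t)} || p_{\fut{X}_t|\ev(\past{X}_t=\past{x}_t)}).
\end{equation}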

\subsection{Information measures for stationary random processes}

The \emph{entropy rate} of the process is the entropy of the next variable
$X_t$ given all the previous ones.
\begin{equation}
	\label{eq:entro-rate}
or \emph{erasure} \cite{VerduWeissman2006} entropy rate.
These relationships are illustrated in \Figrf{predinfo-bg}, along with
several of the information measures we have discussed so far.


James et al \cite{JamesEllisonCrutchfield2011} study the predictive information
rate and also examine some related measures. In particular they identify
$\sigma_\mu$, the difference between the multi-information rate and the excess
entropy, as an interesting quantity that measures the predictive benefit of
model-building (that is, maintaining an internal state summarising past
observations in order to make better predictions). They also identify
$w_\mu = \rho_\mu + b_{\mu}$, which they call the \emph{local exogenous
information} rate.

\begin{fig}{wundt}
\raisebox{-4em}{\colfig[0.43]{wundt}}
% {\ \shortstack{{\Large$\longrightarrow$}\\ {\scriptsize\emph{exposure}}}\ }
{\ {\large$\longrightarrow$}\ }
in a move to the left along the curve \cite{Berlyne71}.
}
\end{fig}


\subsection{First and higher order Markov chains}
First order Markov chains are the simplest non-trivial models to which information
dynamics methods can be applied. In \cite{AbdallahPlumbley2009} we derived
expressions for all the information measures introduced above for
irreducible stationary Markov chains (\ie those that have a unique stationary
distribution). The derivation is greatly simplified by the dependency structure
of the Markov chain: for the purpose of the analysis, the `past' and `future'
segments $\past{X}_t$ and $\fut{X}_t$ can be reduced to just the previous
and next variables $X_{t-1}$ and $X_{t+1}$ respectively. We also showed that
the predictive information rate can be expressed simply in terms of entropy rates:
if we let $a$ denote the $K\times K$ transition matrix of a Markov chain over
the alphabet $\{1,\ldots,K\}$, such that
$a_{ij} = \Pr(\ev(X_t=i)|\ev(X_{t-1}=j))$, and let $h:\reals^{K\times K}\to \reals$ be
the entropy rate function such that $h(a)$ is the entropy rate of a Markov chain
with transition matrix $a$, then the predictive information rate $b(a)$ is
\begin{equation}
	b(a) = h(a^2) - h(a),
\end{equation}
where $a^2$, the transition matrix squared, is the transition matrix
of the `skip one' Markov chain obtained by jumping two steps at a time
along the original chain.
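As a concrete illustration of this formula, the following NumPy sketch computes
the entropy rate and the predictive information rate of a small Markov chain;
the example transition matrix is arbitrary, and the columns follow the
convention $a_{ij} = \Pr(\ev(X_t=i)|\ev(X_{t-1}=j))$ used above.
\begin{verbatim}
import numpy as np

def stationary(a):
    """Stationary distribution of a column-stochastic matrix a,
    where a[i, j] = Pr(X_t = i | X_{t-1} = j)."""
    vals, vecs = np.linalg.eig(a)
    pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return pi / pi.sum()

def entropy_rate(a):
    """Entropy rate h(a) in nats: the entropy of each column,
    averaged under the stationary distribution."""
    pi = stationary(a)
    with np.errstate(divide='ignore', invalid='ignore'):
        plogp = np.where(a > 0, -a * np.log(a), 0.0)
    return float(plogp.sum(axis=0) @ pi)

def predictive_information_rate(a):
    """PIR b(a) = h(a^2) - h(a): entropy rate of the `skip one'
    chain minus that of the original chain."""
    return entropy_rate(a @ a) - entropy_rate(a)

# Arbitrary 3-state example (columns sum to one).
a = np.array([[0.8, 0.1, 0.3],
              [0.1, 0.8, 0.3],
              [0.1, 0.1, 0.4]])
print(entropy_rate(a), predictive_information_rate(a))
\end{verbatim}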

Second and higher order Markov chains can be treated in a similar way by transforming
to a first order representation of the high order Markov chain. If we are dealing
with an $N$th order model, this is done by forming a new alphabet of size $K^N$
consisting of all possible $N$-tuples of symbols from the base alphabet of size $K$.
An observation $\hat{x}_t$ in this new model represents a block of $N$ observations
$(x_{t+1},\ldots,x_{t+N})$ from the base model. The next
observation $\hat{x}_{t+1}$ represents the block of $N$ obtained by shifting the previous
block along by one step. The new Markov chain is parameterised by a sparse $K^N\times K^N$
transition matrix $\hat{a}$. The entropy rate of the first order system is the same
as the entropy rate of the original order $N$ system, and its PIR is
\begin{equation}
	b({\hat{a}}) = h({\hat{a}^{N+1}}) - N h({\hat{a}}),
\end{equation}
where $\hat{a}^{N+1}$ is the $(N+1)$th power of the first order transition matrix.
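Once the block transition matrix $\hat{a}$ has been built, the same machinery
applies. The sketch below does this for a made-up order-2 binary model; the
layout of the conditional probability table $p$ and the helper names are
illustrative only.
\begin{verbatim}
import numpy as np
from itertools import product

def entropy_rate(a):
    """Entropy rate (nats) of a column-stochastic transition matrix."""
    vals, vecs = np.linalg.eig(a)
    pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    pi = pi / pi.sum()
    with np.errstate(divide='ignore', invalid='ignore'):
        plogp = np.where(a > 0, -a * np.log(a), 0.0)
    return float(plogp.sum(axis=0) @ pi)

def block_transition_matrix(p, K, N):
    """First order representation of an order-N chain over {0..K-1}.
    p[y, s] = Pr(next symbol = y | preceding N-tuple s, oldest first);
    a block s can only move to a block overlapping it in N-1 symbols."""
    blocks = list(product(range(K), repeat=N))
    index = {s: i for i, s in enumerate(blocks)}
    a_hat = np.zeros((K**N, K**N))
    for s in blocks:
        for y in range(K):
            s_next = s[1:] + (y,)          # shift the block along by one
            a_hat[index[s_next], index[s]] = p[(y,) + s]
    return a_hat

# Made-up order-2 binary model: p[y, x1, x2] = Pr(X_t=y | X_{t-2}=x1, X_{t-1}=x2).
p = np.empty((2, 2, 2))
p[1] = [[0.9, 0.2], [0.5, 0.1]]
p[0] = 1.0 - p[1]

N, K = 2, 2
a_hat = block_transition_matrix(p, K, N)
pir = entropy_rate(np.linalg.matrix_power(a_hat, N + 1)) - N * entropy_rate(a_hat)
print(pir)
\end{verbatim}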


\section{Information Dynamics in Analysis}

\begin{fig}{twopages}
\colfig[0.96]{matbase/fig9471} % update from mbc paper
% \colfig[0.97]{matbase/fig72663}\\ % later update from mbc paper (Keith's new picks)
\vspace*{1em}
\colfig[0.97]{matbase/fig13377} % rule based analysis
The bottom panel shows a rule-based boundary strength analysis computed
using Cambouropoulos' LBDM.
All information measures are in nats and time is in notes.
}
\end{fig}

\subsection{Musicological Analysis}
In \cite{AbdallahPlumbley2009}, methods based on the theory described above
were used to analyse two pieces of music in the minimalist style
by Philip Glass: \emph{Two Pages} (1969) and \emph{Gradus} (1968).
The analysis was done using a first-order Markov chain model, with the
enhancement that the transition matrix of the model was allowed to
evolve dynamically as the notes were processed, and was tracked (in
a Bayesian way) as a \emph{distribution} over possible transition matrices,
rather than a point estimate. The results are summarised in \figrf{twopages}:
the upper four plots show the dynamically evolving subjective information
measures as described in \secrf{surprise-info-seq}, computed using a point
estimate of the current transition matrix, while the fifth plot (the `model
information rate') measures the information in each observation about the
transition matrix itself.
In \cite{AbdallahPlumbley2010b}, we showed that this `model information rate'
is actually a component of the true IPI in
a time-varying Markov chain, which was neglected when we computed the IPI from
point estimates of the transition matrix as if the transition probabilities
were constant.
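One simple way to compute such a `model information rate' is to keep an
independent Dirichlet distribution over each column of the transition matrix
(the conjugate choice for multinomial observations) and, after each observed
transition, take the KL divergence between the updated and previous beliefs
about that column, in line with the definition of information given earlier.
The sketch below illustrates this; the symmetric prior and the toy note
sequence are arbitrary, and the prior and update scheme actually used in
\cite{AbdallahPlumbley2009} may differ.
\begin{verbatim}
import numpy as np
from scipy.special import gammaln, digamma

def dirichlet_kl(alpha_post, alpha_prior):
    """KL divergence D( Dir(alpha_post) || Dir(alpha_prior) ) in nats."""
    a0_post, a0_prior = alpha_post.sum(), alpha_prior.sum()
    log_beta_prior = gammaln(alpha_prior).sum() - gammaln(a0_prior)
    log_beta_post = gammaln(alpha_post).sum() - gammaln(a0_post)
    return float(log_beta_prior - log_beta_post
                 + ((alpha_post - alpha_prior)
                    * (digamma(alpha_post) - digamma(a0_post))).sum())

def observe(alpha, prev, cur):
    """Add the transition prev -> cur to the pseudo-counts and return
    the information (nats) this observation carries about that column
    of the transition matrix."""
    before = alpha[:, prev].copy()
    alpha[cur, prev] += 1.0
    return dirichlet_kl(alpha[:, prev], before)

K = 4                                  # size of the pitch alphabet
alpha = np.ones((K, K))                # symmetric prior pseudo-counts per column
notes = [0, 1, 2, 1, 0, 1, 2, 1, 3]    # toy note sequence
model_info = [observe(alpha, p, c) for p, c in zip(notes, notes[1:])]
print(np.round(model_info, 3))
\end{verbatim}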

The peaks of the surprisingness and both components of the predictive information
show good correspondence with the structure of the piece, both as marked in the score
and as analysed by musicologist Keith Potter, who was asked to mark the six
`most surprising moments' of the piece (shown as asterisks in the fifth plot)%
\footnote{%
	Note that the boundary marked in the score at around note 5,400 is known to be
	anomalous; on the basis of a listening analysis, some musicologists [ref] have
	placed the boundary a few bars later, in agreement with our analysis.}.

In contrast, the analyses shown in the lower two plots of \figrf{twopages},
obtained using two rule-based music segmentation algorithms, while clearly
\emph{reflecting} the structure of the piece, do not \emph{segment} the piece
clearly, although the boundary strength functions do show a tendency to peak
at the boundaries in the piece.


\begin{fig}{metre}
% \scalebox{1}[1]{%
\begin{tabular}{cc}
\colfig[0.45]{matbase/fig36859} & \colfig[0.48]{matbase/fig88658} \\