changeset 30:08c6df6dcfe0

Updates to section 2.
author samer
date Tue, 13 Mar 2012 19:39:35 +0000
parents 12deb9b1e630
children 2913a58c6c9d
files draft.pdf draft.tex
diffstat 2 files changed, 88 insertions(+), 77 deletions(-)
Binary file draft.pdf has changed
--- a/draft.tex	Tue Mar 13 19:30:38 2012 +0000
+++ b/draft.tex	Tue Mar 13 19:39:35 2012 +0000
@@ -225,43 +225,6 @@
 
 \section{Theoretical review}
 
-	\subsection{Entropy and information in sequences}
-	Let $X$ denote some variable whos value is initially unknown to our 
-	hypothetical observer. We will treat $X$ mathematically as a random variable,
-	with a value to be drawn from some set (or \emph{alphabet}) $\A$ and a 
-	probability distribution representing the observer's beliefs about the 
-	true value of $X$.
-	In this case, the observer's uncertainty about $X$ can be quantified
-	as the entropy of the random variable $H(X)$. For a discrete variable
-	with probability mass function $p:\A \to [0,1]$, this is
-	\begin{equation}
-		H(X) = \sum_{x\in\A} -p(x) \log p(x) = \expect{-\log p(X)},
-	\end{equation}
-	where $\expect{}$ is the expectation operator. The negative-log-probability
-	$\ell(x) = -\log p(x)$ of a particular value $x$ can usefully be thought of as
-	the \emph{surprisingness} of the value $x$ should it be observed, and
-	hence the entropy is the expected surprisingness.
-
-	Now suppose that the observer receives some new data $\Data$ that
-	causes a revision of its beliefs about $X$. The \emph{information}
-	in this new data \emph{about} $X$ can be quantified as the 
-	Kullback-Leibler (KL) divergence between the prior and posterior
-	distributions $p(x)$ and $p(x|\Data)$ respectively:
-	\begin{equation}
-		\mathcal{I}_{\Data\to X} = D(p_{X|\Data} || p_{X})
-			= \sum_{x\in\A} p(x|\Data) \log \frac{p(x|\Data)}{p(x)}.
-	\end{equation}
-	If there are multiple variables $X_1, X_2$ 
-	\etc which the observer believes to be dependent, then the observation of 
-	one may change its beliefs and hence yield information about the
-	others.
-	The relationships between the various joint entropies, conditional
-	entropies, mutual informations and conditional mutual informations
-	can be visualised in Venn diagram-like \emph{information diagrams}
-	or I-diagrams \cite{Yeung1991}, for example, the three-variable
-	I-diagram in \figrf{venn-example}.
-
-
 	\begin{fig}{venn-example}
 		\newcommand\rad{2.2em}%
 		\newcommand\circo{circle (3.4em)}%
@@ -332,7 +295,7 @@
 			}
 		\end{tabular}
 		\caption{
-		Information diagram visualisation of entropies and mutual informations
+		I-diagram visualisation of entropies and mutual informations
 		for three random variables $X_1$, $X_2$ and $X_3$. The areas of 
 		the three circles represent $H(X_1)$, $H(X_2)$ and $H(X_3)$ respectively.
 		The total shaded area is the joint entropy $H(X_1,X_2,X_3)$.
@@ -340,10 +303,86 @@
 		Some other information measures are indicated in the legend.
 		}
 	\end{fig}
-	[Adopting notation of recent Binding information paper.]
-	\subsection{'Anatomy of a bit' stuff}
-	Entropy rates, redundancy, predictive information etc.
-	Information diagrams.
+
+	\subsection{Entropy and information}
+	Let $X$ denote some variable whose value is initially unknown to our 
+	hypothetical observer. We will treat $X$ mathematically as a random variable,
+	with a value to be drawn from some set (or \emph{alphabet}) $\A$ and a 
+	probability distribution representing the observer's beliefs about the 
+	true value of $X$.
+	In this case, the observer's uncertainty about $X$ can be quantified
+	as the entropy $H(X)$ of the random variable. For a discrete variable
+	with probability mass function $p:\A \to [0,1]$, this is
+	\begin{equation}
+		H(X) = \sum_{x\in\A} -p(x) \log p(x) = \expect{-\log p(X)},
+	\end{equation}
+	where $\expect{}$ is the expectation operator. The negative-log-probability
+	$\ell(x) = -\log p(x)$ of a particular value $x$ can usefully be thought of as
+	the \emph{surprisingness} of the value $x$ should it be observed, and
+	hence the entropy is the expected surprisingness.
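+
+	As a simple illustration, suppose that $\A = \{0,1\}$ and $p(1) = 1/4$.
+	Taking logarithms to base 2, so that entropy is measured in bits, we obtain
+	\begin{equation}
+		H(X) = - \frac{1}{4} \log_2 \frac{1}{4} - \frac{3}{4} \log_2 \frac{3}{4} \approx 0.81,
+	\end{equation}
+	somewhat less than the 1 bit obtained for a uniform distribution over $\A$:
+	the more probable value is less surprising when it is observed, and so the
+	expected surprisingness is lower.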
+
+	Now suppose that the observer receives some new data $\Data$ that
+	causes a revision of its beliefs about $X$. The \emph{information}
+	in this new data \emph{about} $X$ can be quantified as the 
+	Kullback-Leibler (KL) divergence between the prior and posterior
+	distributions $p(x)$ and $p(x|\Data)$ respectively:
+	\begin{equation}
+		\mathcal{I}_{\Data\to X} = D(p_{X|\Data} || p_{X})
+			= \sum_{x\in\A} p(x|\Data) \log \frac{p(x|\Data)}{p(x)}.
+	\end{equation}
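+	Continuing the binary illustration, if the prior is uniform, $p(x) = 1/2$,
+	and the new data shift the observer's beliefs to $p(1|\Data) = 3/4$, then
+	\begin{equation}
+		\mathcal{I}_{\Data\to X}
+			= \frac{3}{4} \log_2 \frac{3/4}{1/2} + \frac{1}{4} \log_2 \frac{1/4}{1/2} \approx 0.19
+	\end{equation}
+	bits of information about $X$; the divergence is zero only if the data
+	leave the observer's distribution over $\A$ unchanged.
+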
+	If there are multiple variables $X_1, X_2$ 
+	\etc which the observer believes to be dependent, then the observation of 
+	one may change its beliefs and hence yield information about the
+	others.
+	The relationships between the various joint entropies, conditional
+	entropies, mutual informations and conditional mutual informations
+	can be visualised in Venn diagram-like \emph{information diagrams}
+	or I-diagrams \cite{Yeung1991}; see, for example, the three-variable
+	I-diagram in \figrf{venn-example}.
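+	In the two-variable case, for instance, the overlap of the regions representing
+	$H(X_1)$ and $H(X_2)$ is the mutual information
+	\begin{equation}
+		I(X_1;X_2) = H(X_1) + H(X_2) - H(X_1,X_2) = H(X_1) - H(X_1|X_2),
+	\end{equation}
+	the expected amount of information that an observation of one of the
+	variables yields about the other.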
+
+
+	\subsection{Entropy and information in sequences}
+
+	Suppose that  $(\ldots,X_{-1},X_0,X_1,\ldots)$ is a stationary sequence of
+	random variables, infinite in both directions, 
+	and that $\mu$ is the associated shift-invariant probability measure over all 
+	configurations of the sequence---in the following, $\mu$ will simply serve
+	as a label for the process. We can identify a number of information-theoretic
+	measures that are meaningful in the context of a sequential observation of the
+	sequence, during which, at any time $t$, there is a `present' $X_t$, a `past'
+	$\past{X}_t \equiv (\ldots, X_{t-2}, X_{t-1})$, and a `future' 
+	$\fut{X}_t \equiv (X_{t+1},X_{t+2},\ldots)$.
+	Since the sequence is assumed stationary, we can, without loss of generality,
+	assume that $t=0$ in the following definitions.
+
+	The \emph{entropy rate} of the process is the entropy of the current variable
+	$X_0$ given all the previous ones:
+	\begin{equation}
+		\label{eq:entro-rate}
+		h_\mu = H(X_0|\past{X}_0).
+	\end{equation}
+	The entropy rate gives a measure of the overall randomness
+	or unpredictability of the process.
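+	For a stationary process this conditional entropy coincides with the usual
+	block-entropy limit,
+	\begin{equation}
+		h_\mu = \lim_{N\to\infty} \frac{H(X_1,\ldots,X_N)}{N}
+			  = \lim_{N\to\infty} H(X_N|X_1,\ldots,X_{N-1}),
+	\end{equation}
+	so it can equally be read as the asymptotic growth rate of the joint entropy
+	of increasingly long blocks of consecutive variables.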
+
+	The \emph{multi-information rate} $\rho_\mu$ (following Dubnov's \cite{Dubnov2006}
+	notation for what he called the `information rate') is the mutual
+	information between the `past' and the `present':
+	\begin{equation}
+		\label{eq:multi-info}
+			\rho_\mu = I(\past{X}_0;X_0) = H(X_0) - h_\mu.
+	\end{equation}
+	It is a measure of how much the context of an observation (that is,
+	the observation of previous elements of the sequence) helps in predicting
+	or reducing the surprisingness of the current observation.
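+	Two limiting cases may help to fix ideas. For a sequence of independent and
+	identically distributed variables, the past is of no help in predicting the
+	present, so that
+	\begin{equation}
+		\rho_\mu = H(X_0) - h_\mu = H(X_0) - H(X_0) = 0,
+	\end{equation}
+	whereas for a deterministic periodic process $h_\mu = 0$ and hence
+	$\rho_\mu = H(X_0)$: given the past, the present is completely predictable.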
+
+	The \emph{excess entropy} \cite{CrutchfieldPackard1983} 
+	is the mutual information between 
+	the entire `past' and the `present' together with the entire `future':
+	\begin{equation}
+		E = I(\past{X}_0; X_0,\fut{X}_0).
+	\end{equation}
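+	When this mutual information is finite it can equivalently be written in
+	terms of block entropies as
+	\begin{equation}
+		E = \lim_{N\to\infty} \left( H(X_1,\ldots,X_N) - N h_\mu \right),
+	\end{equation}
+	that is, the asymptotic amount by which the joint entropy of a block of $N$
+	consecutive variables exceeds $N h_\mu$, which motivates the name.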
+	
+
 
  	\begin{fig}{predinfo-bg}
 		\newcommand\subfig[2]{\shortstack{#2\\[0.75em]#1}}
@@ -414,7 +453,7 @@
 				\\[0.5em]
 		\end{tabular}
 		\caption{
-		Venn diagram representation of several information measures for
+		I-diagrams for several information measures in
 		stationary random processes. Each circle or oval represents a random
 		variable or sequence of random variables relative to time $t=0$. Overlapped areas
 		correspond to various mutual information as in \Figrf{venn-example}.
@@ -425,29 +464,13 @@
 		}
 	\end{fig}
 
-	\subsection{Predictive information rate}
-	In previous work \cite{AbdallahPlumbley2009}, we introduced 
-%	examined several
-%	information-theoretic measures that could be used to characterise
-%	not only random processes (\ie, an ensemble of possible sequences),
-%	but also the dynamic progress of specific realisations of such processes.
-%	One of these measures was 
-%
-	the \emph{predictive information rate}
-	(PIR), which is the average information 
-	in one observation about the infinite future given the infinite past.
-	If $\past{X}_t=(\ldots,X_{t-2},X_{t-1})$ denotes the variables
-	before time $t$, 
-	and $\fut{X}_t = (X_{t+1},X_{t+2},\ldots)$ denotes
-	those after $t$,
-	the PIR at time $t$ is defined as a conditional mutual information: 
+	The \emph{predictive information rate} (or PIR) \cite{AbdallahPlumbley2009}  
+	is the average information in one observation about the infinite future given the infinite past,
+	and is defined as a conditional mutual information: 
 	\begin{equation}
 		\label{eq:PIR}
-		\IXZ_t \define I(X_t;\fut{X}_t|\past{X}_t) = H(\fut{X}_t|\past{X}_t) - H(\fut{X}_t|X_t,\past{X}_t).
+		b_\mu = I(X_0;\fut{X}_0|\past{X}_0) = H(\fut{X}_0|\past{X}_0) - H(\fut{X}_0|X_0,\past{X}_0).
 	\end{equation}
-%	(The underline/overline notation follows that of \cite[\S 3]{AbdallahPlumbley2009}.)
-%	Hence, $\Ix_t$ quantifies the \emph{new}
-%	information gained about the future from the observation at time $t$. 
 	Equation \eqrf{PIR} can be read as the average reduction
 	in uncertainty about the future on learning $X_t$, given the past. 
 	Due to the symmetry of the mutual information, it can also be written
@@ -465,19 +488,7 @@
 	we called the \emph{residual entropy rate} $r_\mu$ in \cite{AbdallahPlumbley2010},
 	but was previously identified by Verd{\'u} and Weissman \cite{VerduWeissman2006} as the
 	\emph{erasure entropy rate}.
-%	It is not expressible in terms of the block entropy function $H(\cdot)$.
-	It can be defined as the limit
-	\begin{equation}
-		\label{eq:residual-entropy-rate}
-		r_\mu \define \lim_{N\tends\infty} H(X_{\bet(-N,N)}) - H(X_{\bet(-N,-1)},X_{\bet(1,N)}).
-	\end{equation}
-	The second term, $H(X_{\bet(1,N)},X_{\bet(-N,-1)})$, 
-	is the joint entropy of two non-adjacent blocks each of length $N$ with a 
-	gap between them, 
-	and cannot be expressed as a function of block entropies alone.
-%	In order to associate it with the concept of \emph{binding information} which
-%	we will define in \secrf{binding-info}, we 
-	Thus, the shift-invariant PIR (which we will write as $b_\mu$) is the difference between 
+	Thus, the PIR is the difference between 
 	the entropy rate and the erasure entropy rate: $b_\mu = h_\mu - r_\mu$.
 	These relationships are illustrated in \Figrf{predinfo-bg}, along with
 	several of the information measures we have discussed so far.
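+	In particular, since $h_\mu = H(X_0|\past{X}_0)$ splits into the predictive
+	information rate and the erasure entropy rate, the entropy of a single
+	observation decomposes as
+	\begin{equation}
+		H(X_0) = \rho_\mu + b_\mu + r_\mu,
+	\end{equation}
+	separating the surprisingness of each observation into a component that is
+	predictable from the past, a component that is informative about the future,
+	and a residual component that is neither.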
@@ -504,7 +515,7 @@
 	model-building (that is, maintaining an internal state summarising past 
 	observations in order to make better predictions). They also identify
 	$w_\mu = \rho_\mu + b_{\mu}$, which they call the \emph{local exogenous
-	information}.
+	information} rate.
 
 	\subsection{First order Markov chains}
 	These are the simplest non-trivial models to which information dynamics methods