# HG changeset patch
# User samer
# Date 1331667575 0
# Node ID 08c6df6dcfe0916808845cbffedf19d6618588c0
# Parent 12deb9b1e6302e7b7d1cf376161733c1e84e48d8
Updates to section 2.

diff -r 12deb9b1e630 -r 08c6df6dcfe0 draft.pdf
Binary file draft.pdf has changed
diff -r 12deb9b1e630 -r 08c6df6dcfe0 draft.tex
--- a/draft.tex	Tue Mar 13 19:30:38 2012 +0000
+++ b/draft.tex	Tue Mar 13 19:39:35 2012 +0000
@@ -225,43 +225,6 @@
 \section{Theoretical review}
 
- \subsection{Entropy and information in sequences}
- Let $X$ denote some variable whos value is initially unknown to our
- hypothetical observer. We will treat $X$ mathematically as a random variable,
- with a value to be drawn from some set (or \emph{alphabet}) $\A$ and a
- probability distribution representing the observer's beliefs about the
- true value of $X$.
- In this case, the observer's uncertainty about $X$ can be quantified
- as the entropy of the random variable $H(X)$. For a discrete variable
- with probability mass function $p:\A \to [0,1]$, this is
- \begin{equation}
- H(X) = \sum_{x\in\A} -p(x) \log p(x) = \expect{-\log p(X)},
- \end{equation}
- where $\expect{}$ is the expectation operator. The negative-log-probability
- $\ell(x) = -\log p(x)$ of a particular value $x$ can usefully be thought of as
- the \emph{surprisingness} of the value $x$ should it be observed, and
- hence the entropy is the expected surprisingness.
-
- Now suppose that the observer receives some new data $\Data$ that
- causes a revision of its beliefs about $X$. The \emph{information}
- in this new data \emph{about} $X$ can be quantified as the
- Kullback-Leibler (KL) divergence between the prior and posterior
- distributions $p(x)$ and $p(x|\Data)$ respectively:
- \begin{equation}
- \mathcal{I}_{\Data\to X} = D(p_{X|\Data} || p_{X})
- = \sum_{x\in\A} p(x|\Data) \log \frac{p(x|\Data)}{p(x)}.
- \end{equation}
- If there are multiple variables $X_1, X_2$
- \etc which the observer believes to be dependent, then the observation of
- one may change its beliefs and hence yield information about the
- others.
- The relationships between the various joint entropies, conditional
- entropies, mutual informations and conditional mutual informations
- can be visualised in Venn diagram-like \emph{information diagrams}
- or I-diagrams \cite{Yeung1991}, for example, the three-variable
- I-diagram in \figrf{venn-example}.
-
 
 \begin{fig}{venn-example}
 \newcommand\rad{2.2em}%
 \newcommand\circo{circle (3.4em)}%
@@ -332,7 +295,7 @@
 }
 \end{tabular}
 \caption{
- Information diagram visualisation of entropies and mutual informations
+ I-diagram visualisation of entropies and mutual informations
 for three random variables $X_1$, $X_2$ and $X_3$. The areas of the
 three circles represent $H(X_1)$, $H(X_2)$ and $H(X_3)$ respectively.
 The total shaded area is the joint entropy $H(X_1,X_2,X_3)$.
@@ -340,10 +303,86 @@
 Some other information measures are indicated in the legend.
 }
 \end{fig}
- [Adopting notation of recent Binding information paper.]
- \subsection{'Anatomy of a bit' stuff}
- Entropy rates, redundancy, predictive information etc.
- Information diagrams.
+
+ \subsection{Entropy and information}
+ Let $X$ denote some variable whose value is initially unknown to our
+ hypothetical observer. We will treat $X$ mathematically as a random variable,
+ with a value to be drawn from some set (or \emph{alphabet}) $\A$ and a
+ probability distribution representing the observer's beliefs about the
+ true value of $X$.
+ In this case, the observer's uncertainty about $X$ can be quantified
+ as the entropy of the random variable $H(X)$. For a discrete variable
+ with probability mass function $p:\A \to [0,1]$, this is
+ \begin{equation}
+ H(X) = \sum_{x\in\A} -p(x) \log p(x) = \expect{-\log p(X)},
+ \end{equation}
+ where $\expect{}$ is the expectation operator. The negative-log-probability
+ $\ell(x) = -\log p(x)$ of a particular value $x$ can usefully be thought of as
+ the \emph{surprisingness} of the value $x$ should it be observed, and
+ hence the entropy is the expected surprisingness.
+
+ Now suppose that the observer receives some new data $\Data$ that
+ causes a revision of its beliefs about $X$. The \emph{information}
+ in this new data \emph{about} $X$ can be quantified as the
+ Kullback-Leibler (KL) divergence between the prior and posterior
+ distributions $p(x)$ and $p(x|\Data)$ respectively:
+ \begin{equation}
+ \mathcal{I}_{\Data\to X} = D(p_{X|\Data} || p_{X})
+ = \sum_{x\in\A} p(x|\Data) \log \frac{p(x|\Data)}{p(x)}.
+ \end{equation}
+ If there are multiple variables $X_1, X_2$
+ \etc which the observer believes to be dependent, then the observation of
+ one may change its beliefs and hence yield information about the
+ others.
+ The relationships between the various joint entropies, conditional
+ entropies, mutual informations and conditional mutual informations
+ can be visualised in Venn diagram-like \emph{information diagrams}
+ or I-diagrams \cite{Yeung1991}, for example, the three-variable
+ I-diagram in \figrf{venn-example}.
+
+
+ \subsection{Entropy and information in sequences}
+
+ Suppose that $(\ldots,X_{-1},X_0,X_1,\ldots)$ is a stationary sequence of
+ random variables, infinite in both directions,
+ and that $\mu$ is the associated shift-invariant probability measure over all
+ configurations of the sequence---in the following, $\mu$ will simply serve
+ as a label for the process. We can identify a number of information-theoretic
+ measures that are meaningful in the context of a sequential observation of the
+ sequence, during which, at any time $t$, there is a `present' $X_t$, a `past'
+ $\past{X}_t \equiv (\ldots, X_{t-2}, X_{t-1})$, and a `future'
+ $\fut{X}_t \equiv (X_{t+1},X_{t+2},\ldots)$.
+ Since the sequence is assumed stationary, we can, without loss of generality,
+ assume that $t=0$ in the following definitions.
+
+ The \emph{entropy rate} of the process is the entropy of the next variable
+ $X_0$ given all the previous ones:
+ \begin{equation}
+ \label{eq:entro-rate}
+ h_\mu = H(X_0|\past{X}_0).
+ \end{equation}
+ The entropy rate gives a measure of the overall randomness
+ or unpredictability of the process.
+
+ The \emph{multi-information rate} $\rho_\mu$ (following Dubnov's \cite{Dubnov2006}
+ notation for what he called the `information rate') is the mutual
+ information between the `past' and the `present':
+ \begin{equation}
+ \label{eq:multi-info}
+ \rho_\mu = I(\past{X}_0;X_0) = H(X_0) - h_\mu.
+ \end{equation}
+ It is a measure of how much the context of an observation (that is,
+ the observation of previous elements of the sequence) helps in predicting
+ or reducing the surprisingness of the current observation.
+
+ The \emph{excess entropy} \cite{CrutchfieldPackard1983}
+ is the mutual information between
+ the entire `past' and the `present' together with the entire `future':
+ \begin{equation}
+ E = I(\past{X}_0; X_0,\fut{X}_0).
+ \end{equation}
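+
+% Illustrative sketch (editorial addition, not part of the original changeset):
+% two standard toy processes are assumed here purely to make the three rates
+% above concrete.
+ For example, for an i.i.d.\ sequence of fair coin flips the past carries no
+ information about the present or the future, so
+ \begin{equation}
+ h_\mu = H(X_0) = \log 2, \qquad \rho_\mu = 0, \qquad E = 0,
+ \end{equation}
+ whereas for the deterministic period-2 sequence $(\ldots,0,1,0,1,\ldots)$ with
+ uniformly random phase, every observation is fully determined by the past, so
+ \begin{equation}
+ h_\mu = 0, \qquad \rho_\mu = E = \log 2,
+ \end{equation}
+ the $\log 2$ being the one bit of phase that the past conveys about the
+ present and the entire future.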
+
 
 \begin{fig}{predinfo-bg}
 \newcommand\subfig[2]{\shortstack{#2\\[0.75em]#1}}
@@ -414,7 +453,7 @@
 \\[0.5em]
 \end{tabular}
 \caption{
- Venn diagram representation of several information measures for
+ I-diagrams for several information measures in
 stationary random processes. Each circle or oval represents a random
 variable or sequence of random variables relative to time $t=0$. Overlapped
 areas correspond to various mutual information as in \Figrf{venn-example}.
@@ -425,29 +464,13 @@
 }
 \end{fig}
 
- \subsection{Predictive information rate}
- In previous work \cite{AbdallahPlumbley2009}, we introduced
-% examined several
-% information-theoretic measures that could be used to characterise
-% not only random processes (\ie, an ensemble of possible sequences),
-% but also the dynamic progress of specific realisations of such processes.
-% One of these measures was
-%
- the \emph{predictive information rate}
- (PIR), which is the average information
- in one observation about the infinite future given the infinite past.
- If $\past{X}_t=(\ldots,X_{t-2},X_{t-1})$ denotes the variables
- before time $t$,
- and $\fut{X}_t = (X_{t+1},X_{t+2},\ldots)$ denotes
- those after $t$,
- the PIR at time $t$ is defined as a conditional mutual information:
+ The \emph{predictive information rate} (or PIR) \cite{AbdallahPlumbley2009}
+ is the average information in one observation about the infinite future given the infinite past,
+ and is defined as a conditional mutual information:
 \begin{equation}
 \label{eq:PIR}
- \IXZ_t \define I(X_t;\fut{X}_t|\past{X}_t) = H(\fut{X}_t|\past{X}_t) - H(\fut{X}_t|X_t,\past{X}_t).
+ b_\mu = I(X_0;\fut{X}_0|\past{X}_0) = H(\fut{X}_0|\past{X}_0) - H(\fut{X}_0|X_0,\past{X}_0).
 \end{equation}
-% (The underline/overline notation follows that of \cite[\S 3]{AbdallahPlumbley2009}.)
-% Hence, $\Ix_t$ quantifies the \emph{new}
-% information gained about the future from the observation at time $t$.
 Equation \eqrf{PIR} can be read as the average reduction
 in uncertainty about the future on learning $X_t$, given the past.
 Due to the symmetry of the mutual information, it can also be written
@@ -465,19 +488,7 @@
 we called the \emph{residual entropy rate} $r_\mu$ in \cite{AbdallahPlumbley2010},
 but was previously identified by Verd{\'u} and Weissman
 \cite{VerduWeissman2006} as the \emph{erasure entropy rate}.
-% It is not expressible in terms of the block entropy function $H(\cdot)$.
- It can be defined as the limit
- \begin{equation}
- \label{eq:residual-entropy-rate}
- r_\mu \define \lim_{N\tends\infty} H(X_{\bet(-N,N)}) - H(X_{\bet(-N,-1)},X_{\bet(1,N)}).
- \end{equation}
- The second term, $H(X_{\bet(1,N)},X_{\bet(-N,-1)})$,
- is the joint entropy of two non-adjacent blocks each of length $N$ with a
- gap between them,
- and cannot be expressed as a function of block entropies alone.
-% In order to associate it with the concept of \emph{binding information} which
-% we will define in \secrf{binding-info}, we
- Thus, the shift-invariant PIR (which we will write as $b_\mu$) is the difference between
+ Thus, the PIR is the difference between
 the entropy rate and the erasure entropy rate: $b_\mu = h_\mu - r_\mu$.
 These relationships are illustrated in \Figrf{predinfo-bg}, along with
 several of the information measures we have discussed so far.
@@ -504,7 +515,7 @@
 model-building (that is, maintaining an internal state summarising past
 observations in order to make better predictions).
 They also identify $w_\mu = \rho_\mu + b_{\mu}$, which they call the \emph{local exogenous
- information}.
+ information} rate.
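+
+% Illustrative sketch (editorial addition, not part of the original changeset):
+% the same two toy processes as above, showing that the PIR vanishes at both
+% extremes of predictability.
+ Returning to the two binary examples above: for the i.i.d.\ fair coin flips
+ $r_\mu = h_\mu = \log 2$, while for the random-phase period-2 process
+ $r_\mu = h_\mu = 0$, so that in both cases
+ \begin{equation}
+ b_\mu = h_\mu - r_\mu = 0.
+ \end{equation}
+ Thus the PIR, unlike $h_\mu$ and $\rho_\mu$, is small both for completely
+ unpredictable and for completely predictable processes.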
 
 \subsection{First order Markov chains}
 These are the simplest non-trivial models to which information dynamics methods