comparison draft.tex @ 30:08c6df6dcfe0

Updates to section 2.
author samer
date Tue, 13 Mar 2012 19:39:35 +0000
parents fb1bfe785c05
children a9c8580cb8ca
%and selective or evaluative phases \cite{Boden1990}, and would have
%applications in tools for computer aided composition.


\section{Theoretical review}

\begin{fig}{venn-example}
	\newcommand\rad{2.2em}%
	\newcommand\circo{circle (3.4em)}%
	\newcommand\labrad{4.3em}
	% [tikzpicture and legend omitted from this comparison view]
	\caption{
		I-diagram visualisation of entropies and mutual informations
		for three random variables $X_1$, $X_2$ and $X_3$. The areas of
		the three circles represent $H(X_1)$, $H(X_2)$ and $H(X_3)$ respectively.
		The total shaded area is the joint entropy $H(X_1,X_2,X_3)$.
		The central area $I_{123}$ is the co-information \cite{McGill1954}.
		Some other information measures are indicated in the legend.
	}
\end{fig}

\subsection{Entropy and information}
Let $X$ denote some variable whose value is initially unknown to our
hypothetical observer. We will treat $X$ mathematically as a random variable,
with a value to be drawn from some set (or \emph{alphabet}) $\A$ and a
probability distribution representing the observer's beliefs about the
true value of $X$.
In this case, the observer's uncertainty about $X$ can be quantified
as the entropy of the random variable, $H(X)$. For a discrete variable
with probability mass function $p:\A \to [0,1]$, this is
\begin{equation}
	H(X) = \sum_{x\in\A} -p(x) \log p(x) = \expect{-\log p(X)},
\end{equation}
where $\expect{}$ is the expectation operator. The negative log-probability
$\ell(x) = -\log p(x)$ of a particular value $x$ can usefully be thought of as
the \emph{surprisingness} of the value $x$ should it be observed, and
hence the entropy is the expected surprisingness.

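As a simple illustration, take a binary alphabet $\A=\{a,b\}$ with $p(a)=3/4$
and $p(b)=1/4$, with logarithms taken to base 2 so that all quantities are in
bits. The surprisingness values are $\ell(a)=\log(4/3)\approx 0.42$ and
$\ell(b)=\log 4 = 2$, giving
\begin{equation*}
	H(X) = \tfrac{3}{4}\log\tfrac{4}{3} + \tfrac{1}{4}\log 4 \approx 0.81,
\end{equation*}
somewhat less than the 1 bit of uncertainty of a uniform distribution over two values.
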
Now suppose that the observer receives some new data $\Data$ that
causes a revision of its beliefs about $X$. The \emph{information}
in this new data \emph{about} $X$ can be quantified as the
Kullback-Leibler (KL) divergence between the prior and posterior
distributions $p(x)$ and $p(x|\Data)$ respectively:
\begin{equation}
	\mathcal{I}_{\Data\to X} = D(p_{X|\Data} || p_{X})
		= \sum_{x\in\A} p(x|\Data) \log \frac{p(x|\Data)}{p(x)}.
\end{equation}
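Continuing the illustrative binary example above, suppose the data $\Data$ shifts
the observer's beliefs from the prior $p(a)=3/4$, $p(b)=1/4$ to the posterior
$p(a|\Data)=p(b|\Data)=1/2$. Then
\begin{equation*}
	\mathcal{I}_{\Data\to X}
		= \tfrac{1}{2}\log\tfrac{1/2}{3/4} + \tfrac{1}{2}\log\tfrac{1/2}{1/4}
		\approx 0.21 \text{ bits},
\end{equation*}
even though the posterior is \emph{more} uncertain than the prior: the KL divergence
measures the degree of belief revision, not a reduction in entropy.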
If there are multiple variables $X_1, X_2$
\etc which the observer believes to be dependent, then the observation of
one may change its beliefs and hence yield information about the
others.
The relationships between the various joint entropies, conditional
entropies, mutual informations and conditional mutual informations
can be visualised in Venn diagram-like \emph{information diagrams}
or I-diagrams \cite{Yeung1991}; see, for example, the three-variable
I-diagram in \figrf{venn-example}.

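For instance, the overlap of the $X_1$ and $X_2$ circles in \figrf{venn-example}
decomposes as
\begin{equation*}
	I(X_1;X_2) = I(X_1;X_2|X_3) + I_{123},
\end{equation*}
corresponding to the regions $I_{12|3}$ and $I_{123}$ of the diagram. As a standard
illustrative case (not specific to the present setting), if $X_1$ and $X_2$ are
independent fair bits and $X_3 = X_1 \oplus X_2$, then $I(X_1;X_2)=0$ but
$I(X_1;X_2|X_3)=1$ bit, so the co-information is $I_{123}=-1$ bit: unlike the areas
of a true Venn diagram, the central region of an I-diagram may be negative.
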
\subsection{Entropy and information in sequences}

Suppose that $(\ldots,X_{-1},X_0,X_1,\ldots)$ is a stationary sequence of
random variables, infinite in both directions,
and that $\mu$ is the associated shift-invariant probability measure over all
configurations of the sequence---in the following, $\mu$ will simply serve
as a label for the process. We can identify a number of information-theoretic
measures that are meaningful in the context of a sequential observation of the sequence, during
which, at any time $t$, there is a `present' $X_t$, a `past'
$\past{X}_t \equiv (\ldots, X_{t-2}, X_{t-1})$, and a `future'
$\fut{X}_t \equiv (X_{t+1},X_{t+2},\ldots)$.
Since the sequence is assumed stationary, we can, without loss of generality,
assume that $t=0$ in the following definitions.

The \emph{entropy rate} of the process is the entropy of the next variable
$X_t$ given all the previous ones:
\begin{equation}
	\label{eq:entro-rate}
	h_\mu = H(X_0|\past{X}_0).
\end{equation}
The entropy rate gives a measure of the overall randomness
or unpredictability of the process.

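Two extreme cases, stated here purely for orientation, help to fix ideas. For an
i.i.d.\ process uniform over an alphabet of $K$ symbols, the past is of no help and
\begin{equation*}
	h_\mu = H(X_0) = \log K,
\end{equation*}
the maximum possible value, whereas for a deterministic periodic sequence each symbol
is completely determined by its predecessors, so $h_\mu = 0$.
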
The \emph{multi-information rate} $\rho_\mu$ (following Dubnov's \cite{Dubnov2006}
notation for what he called the `information rate') is the mutual
information between the `past' and the `present':
\begin{equation}
	\label{eq:multi-info}
	\rho_\mu = I(\past{X}_0;X_0) = H(X_0) - h_\mu.
\end{equation}
It is a measure of how much the context of an observation (that is,
the observation of previous elements of the sequence) helps in predicting
or reducing the surprisingness of the current observation.

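The same two illustrative extremes sit at opposite ends of this measure: for an
i.i.d.\ process, $\rho_\mu = H(X_0) - H(X_0) = 0$, since the past carries no
information about the present, while for a periodic sequence cycling through $K$
distinct, equiprobable symbols,
\begin{equation*}
	\rho_\mu = H(X_0) - h_\mu = \log K - 0 = \log K,
\end{equation*}
so that the context supplies everything there is to know about the current observation.
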
The \emph{excess entropy} \cite{CrutchfieldPackard1983}
is the mutual information between
the entire `past' and the `present' together with the entire `future':
\begin{equation}
	E = I(\past{X}_0; X_0,\fut{X}_0).
\end{equation}

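To give a sense of how $E$ relates to the quantities above (a standard observation,
included here only as an illustration), consider a stationary first-order Markov chain:
given $X_{-1}$, the remainder of the past is independent of the present and future, so
\begin{equation*}
	E = I(X_{-1}; X_0,\fut{X}_0) = I(X_{-1}; X_0) = H(X_0) - h_\mu = \rho_\mu,
\end{equation*}
although for more general processes $E$ and $\rho_\mu$ need not coincide.
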
\begin{fig}{predinfo-bg}
	\newcommand\subfig[2]{\shortstack{#2\\[0.75em]#1}}
	\newcommand\rad{1.8em}%
	% [tikzpicture panels omitted from this comparison view]
	\caption{
		I-diagrams for several information measures in
		stationary random processes. Each circle or oval represents a random
		variable or sequence of random variables relative to time $t=0$. Overlapped areas
		correspond to various mutual informations as in \Figrf{venn-example}.
		In (c), the circle represents the `present'. Its total area is
		$H(X_0)=H(1)=\rho_\mu+r_\mu+b_\mu$, where $\rho_\mu$ is the multi-information
		rate, $r_\mu$ is the residual entropy rate, and $b_\mu$ is the predictive
		information rate. The entropy rate is $h_\mu = r_\mu+b_\mu$.
	}
\end{fig}

The \emph{predictive information rate} (or PIR) \cite{AbdallahPlumbley2009}
is the average information in one observation about the infinite future given the infinite past,
and is defined as a conditional mutual information:
\begin{equation}
	\label{eq:PIR}
	b_\mu = I(X_0;\fut{X}_0|\past{X}_0) = H(\fut{X}_0|\past{X}_0) - H(\fut{X}_0|X_0,\past{X}_0).
\end{equation}
Equation \eqrf{PIR} can be read as the average reduction
in uncertainty about the future on learning $X_0$, given the past.
Due to the symmetry of the mutual information, it can also be written
as
\begin{equation}
	b_\mu = H(X_0|\past{X}_0) - H(X_0|\past{X}_0,\fut{X}_0).
\end{equation}
The first term is simply the entropy rate $h_\mu$, while the second term,
the conditional entropy of one variable given \emph{all} the others
in the sequence, future as well as past, is what
we called the \emph{residual entropy rate} $r_\mu$ in \cite{AbdallahPlumbley2010},
but was previously identified by Verd{\'u} and Weissman \cite{VerduWeissman2006} as the
\emph{erasure entropy rate}.
Thus, the PIR is the difference between
the entropy rate and the erasure entropy rate: $b_\mu = h_\mu - r_\mu$.
These relationships are illustrated in \Figrf{predinfo-bg}, along with
several of the information measures we have discussed so far.

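As a purely illustrative aside, note how the PIR behaves at the two extremes considered
earlier: for an i.i.d.\ process the observations are mutually independent, so
\begin{equation*}
	b_\mu = I(X_0;\fut{X}_0|\past{X}_0) = 0
\end{equation*}
even though $h_\mu$ is maximal, while for a deterministic periodic process
$b_\mu \le h_\mu = 0$. A large PIR is therefore only possible for processes that are
neither completely random nor completely predictable.
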
% [intervening text elided in this comparison view]
$\sigma_\mu$, the difference between the multi-information rate and the excess
entropy, as an interesting quantity that measures the predictive benefit of
model-building (that is, maintaining an internal state summarising past
observations in order to make better predictions). They also identify
$w_\mu = \rho_\mu + b_{\mu}$, which they call the \emph{local exogenous
information} rate.

\subsection{First order Markov chains}
These are the simplest non-trivial models to which information dynamics methods
can be applied. In \cite{AbdallahPlumbley2009}, we showed that the predictive information
rate can be expressed simply in terms of the entropy rate of the Markov chain.
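To sketch how such an expression arises (only an outline here, under an assumed
notation in which $a$ is the transition matrix of the chain, $a_{ij}$ the probability
of moving from state $j$ to state $i$, and $\pi$ its stationary distribution), note
that by the Markov property the entire past can be replaced by $X_{-1}$ and the
entire future by $X_1$, so that
\begin{equation*}
	b_\mu = I(X_0;\fut{X}_0|\past{X}_0) = I(X_0;X_1|X_{-1})
	      = H(X_1|X_{-1}) - H(X_1|X_0).
\end{equation*}
The second term is the usual entropy rate $h_\mu$ computed from $a$, while the first
is the entropy rate of the two-step chain, computed in the same way from $a^2$; the
PIR is the difference of the two.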