comparison draft.tex @ 30:08c6df6dcfe0
Updates to section 2.
author | samer |
---|---|
date | Tue, 13 Mar 2012 19:39:35 +0000 |
parents | fb1bfe785c05 |
children | a9c8580cb8ca |
29:12deb9b1e630 | 30:08c6df6dcfe0 |
---|---|
222 %and selective or evaluative phases \cite{Boden1990}, and would have | 222 %and selective or evaluative phases \cite{Boden1990}, and would have |
223 %applications in tools for computer aided composition. | 223 %applications in tools for computer aided composition. |
224 | 224 |
225 | 225 |
226 \section{Theoretical review} | 226 \section{Theoretical review} |
227 | |
228 \subsection{Entropy and information in sequences} | |
229 Let $X$ denote some variable whos value is initially unknown to our | |
230 hypothetical observer. We will treat $X$ mathematically as a random variable, | |
231 with a value to be drawn from some set (or \emph{alphabet}) $\A$ and a | |
232 probability distribution representing the observer's beliefs about the | |
233 true value of $X$. | |
234 In this case, the observer's uncertainty about $X$ can be quantified | |
235 as the entropy of the random variable $H(X)$. For a discrete variable | |
236 with probability mass function $p:\A \to [0,1]$, this is | |
237 \begin{equation} | |
238 H(X) = \sum_{x\in\A} -p(x) \log p(x) = \expect{-\log p(X)}, | |
239 \end{equation} | |
240 where $\expect{}$ is the expectation operator. The negative-log-probability | |
241 $\ell(x) = -\log p(x)$ of a particular value $x$ can usefully be thought of as | |
242 the \emph{surprisingness} of the value $x$ should it be observed, and | |
243 hence the entropy is the expected surprisingness. | |
244 | |
245 Now suppose that the observer receives some new data $\Data$ that | |
246 causes a revision of its beliefs about $X$. The \emph{information} | |
247 in this new data \emph{about} $X$ can be quantified as the | |
248 Kullback-Leibler (KL) divergence between the prior and posterior | |
249 distributions $p(x)$ and $p(x|\Data)$ respectively: | |
250 \begin{equation} | |
251 \mathcal{I}_{\Data\to X} = D(p_{X|\Data} || p_{X}) | |
252 = \sum_{x\in\A} p(x|\Data) \log \frac{p(x|\Data)}{p(x)}. | |
253 \end{equation} | |
254 If there are multiple variables $X_1, X_2$ | |
255 \etc which the observer believes to be dependent, then the observation of | |
256 one may change its beliefs and hence yield information about the | |
257 others. | |
258 The relationships between the various joint entropies, conditional | |
259 entropies, mutual informations and conditional mutual informations | |
260 can be visualised in Venn diagram-like \emph{information diagrams} | |
261 or I-diagrams \cite{Yeung1991}, for example, the three-variable | |
262 I-diagram in \figrf{venn-example}. | |
263 | |
264 | 227 |
265 \begin{fig}{venn-example} | 228 \begin{fig}{venn-example} |
266 \newcommand\rad{2.2em}% | 229 \newcommand\rad{2.2em}% |
267 \newcommand\circo{circle (3.4em)}% | 230 \newcommand\circo{circle (3.4em)}% |
268 \newcommand\labrad{4.3em} | 231 \newcommand\labrad{4.3em} |
330 I_{12|3} + I_{123} &= I(X_1;X_2) | 293 I_{12|3} + I_{123} &= I(X_1;X_2) |
331 \end{align*} | 294 \end{align*} |
332 } | 295 } |
333 \end{tabular} | 296 \end{tabular} |
334 \caption{ | 297 \caption{ |
335 Information diagram visualisation of entropies and mutual informations | 298 I-diagram visualisation of entropies and mutual informations |
336 for three random variables $X_1$, $X_2$ and $X_3$. The areas of | 299 for three random variables $X_1$, $X_2$ and $X_3$. The areas of |
337 the three circles represent $H(X_1)$, $H(X_2)$ and $H(X_3)$ respectively. | 300 the three circles represent $H(X_1)$, $H(X_2)$ and $H(X_3)$ respectively. |
338 The total shaded area is the joint entropy $H(X_1,X_2,X_3)$. | 301 The total shaded area is the joint entropy $H(X_1,X_2,X_3)$. |
339 The central area $I_{123}$ is the co-information \cite{McGill1954}. | 302 The central area $I_{123}$ is the co-information \cite{McGill1954}. |
340 Some other information measures are indicated in the legend. | 303 Some other information measures are indicated in the legend. |
341 } | 304 } |
342 \end{fig} | 305 \end{fig} |
343 [Adopting notation of recent Binding information paper.] | 306 |
344 \subsection{'Anatomy of a bit' stuff} | 307 \subsection{Entropy and information} |
345 Entropy rates, redundancy, predictive information etc. | 308 Let $X$ denote some variable whose value is initially unknown to our |
346 Information diagrams. | 309 hypothetical observer. We will treat $X$ mathematically as a random variable, |
310 with a value to be drawn from some set (or \emph{alphabet}) $\A$ and a | |
311 probability distribution representing the observer's beliefs about the | |
312 true value of $X$. | |
313 In this case, the observer's uncertainty about $X$ can be quantified | |
314 as the entropy of the random variable $H(X)$. For a discrete variable | |
315 with probability mass function $p:\A \to [0,1]$, this is | |
316 \begin{equation} | |
317 H(X) = \sum_{x\in\A} -p(x) \log p(x) = \expect{-\log p(X)}, | |
318 \end{equation} | |
319 where $\expect{}$ is the expectation operator. The negative-log-probability | |
320 $\ell(x) = -\log p(x)$ of a particular value $x$ can usefully be thought of as | |
321 the \emph{surprisingness} of the value $x$ should it be observed, and | |
322 hence the entropy is the expected surprisingness. | |
323 | |
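
A minimal numerical sketch of this definition (not part of the draft: the four-symbol pmf and the helper name `entropy` are invented for illustration; logarithms are base 2, so results are in bits):

```python
import numpy as np

def entropy(p):
    """H(X) = sum_x -p(x) log2 p(x); zero-probability terms are dropped (0 log 0 = 0)."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-np.sum(nz * np.log2(nz)))

# Illustrative pmf over a four-symbol alphabet.
p = [0.5, 0.25, 0.125, 0.125]
print(entropy(p))           # 1.75 bits: the expected surprisingness -log2 p(x)
print(entropy([0.25] * 4))  # 2.0 bits: a uniform pmf maximises the observer's uncertainty
```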
324 Now suppose that the observer receives some new data $\Data$ that | |
325 causes a revision of its beliefs about $X$. The \emph{information} | |
326 in this new data \emph{about} $X$ can be quantified as the | |
327 Kullback-Leibler (KL) divergence between the prior and posterior | |
328 distributions $p(x)$ and $p(x|\Data)$ respectively: | |
329 \begin{equation} | |
330 \mathcal{I}_{\Data\to X} = D(p_{X|\Data} || p_{X}) | |
331 = \sum_{x\in\A} p(x|\Data) \log \frac{p(x|\Data)}{p(x)}. | |
332 \end{equation} | |
333 If there are multiple variables $X_1, X_2$ | |
334 \etc which the observer believes to be dependent, then the observation of | |
335 one may change its beliefs and hence yield information about the | |
336 others. | |
337 The relationships between the various joint entropies, conditional | |
338 entropies, mutual informations and conditional mutual informations | |
339 can be visualised in Venn diagram-like \emph{information diagrams} | |
340 or I-diagrams \cite{Yeung1991}, for example, the three-variable | |
341 I-diagram in \figrf{venn-example}. | |
342 | |
343 | |
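
A similarly hedged sketch of this quantity: the prior and posterior below are invented stand-ins for an observer's beliefs before and after seeing some data $\Data$, and the function name is ours, not the paper's.

```python
import numpy as np

def information_about_X(posterior, prior):
    """KL divergence D(posterior || prior) in bits: the information in the data about X.
    Assumes prior(x) > 0 wherever posterior(x) > 0."""
    posterior = np.asarray(posterior, dtype=float)
    prior = np.asarray(prior, dtype=float)
    nz = posterior > 0
    return float(np.sum(posterior[nz] * np.log2(posterior[nz] / prior[nz])))

# Invented beliefs about X over a three-symbol alphabet, before and after seeing data D.
prior     = [1/3, 1/3, 1/3]
posterior = [0.8, 0.1, 0.1]
print(information_about_X(posterior, prior))  # about 0.66 bits of information in D about X
```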
344 \subsection{Entropy and information in sequences} | |
345 | |
346 Suppose that $(\ldots,X_{-1},X_0,X_1,\ldots)$ is a stationary sequence of | |
347 random variables, infinite in both directions, | |
348 and that $\mu$ is the associated shift-invariant probability measure over all | |
349 configurations of the sequence---in the following, $\mu$ will simply serve | |
350 as a label for the process. We can identify a number of information-theoretic |
351 measures that are meaningful in the context of a sequential observation of the sequence, during |
352 which, at any time $t$, there is a `present' $X_t$, a `past' |
353 $\past{X}_t \equiv (\ldots, X_{t-2}, X_{t-1})$, and a `future' | |
354 $\fut{X}_t \equiv (X_{t+1},X_{t+2},\ldots)$. | |
355 Since the sequence is assumed stationary, we can, without loss of generality, |
356 assume that $t=0$ in the following definitions. | |
357 | |
358 The \emph{entropy rate} of the process is the entropy of the present variable |
359 $X_0$ given all the previous ones: |
360 \begin{equation} | |
361 \label{eq:entro-rate} | |
362 h_\mu = H(X_0|\past{X}_0). | |
363 \end{equation} | |
364 The entropy rate gives a measure of the overall randomness | |
365 or unpredictability of the process. | |
366 | |
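
As a hedged illustration (anticipating the first-order Markov chains discussed at the end of this section): when the present depends on the past only through the previous symbol, conditioning on the infinite past collapses to conditioning on $X_{-1}$, and $h_\mu$ follows from the transition matrix and its stationary distribution. The two-state matrix and helper names below are invented.

```python
import numpy as np

def stationary(A):
    """Stationary distribution pi of a row-stochastic transition matrix A,
    taken as the left eigenvector of A with eigenvalue 1."""
    w, v = np.linalg.eig(A.T)
    pi = np.real(v[:, np.argmin(np.abs(w - 1.0))])
    return pi / pi.sum()

def row_entropy(p):
    nz = p[p > 0]
    return float(-np.sum(nz * np.log2(nz)))

def entropy_rate(A):
    """h_mu = H(X_0 | X_-1) = sum_i pi_i H(A[i, :]) for a first-order Markov chain."""
    pi = stationary(A)
    return float(sum(pi[i] * row_entropy(A[i]) for i in range(len(pi))))

# Invented 2-state chain: fairly persistent, hence fairly predictable.
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])
print(entropy_rate(A))  # roughly 0.55 bits per symbol
```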
367 The \emph{multi-information rate} $\rho_\mu$ (following Dubnov's \cite{Dubnov2006} | |
368 notation for what he called the `information rate') is the mutual | |
369 information between the `past' and the `present': | |
370 \begin{equation} | |
371 \label{eq:multi-info} | |
372 \rho_\mu = I(\past{X}_0;X_0) = H(X_0) - h_\mu. |
373 \end{equation} | |
374 It is a measure of how much the context of an observation (that is, | |
375 the observation of previous elements of the sequence) helps in predicting | |
376 or reducing the surprisingness of the current observation. |
377 | |
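
Continuing the same invented two-state example, $\rho_\mu$ is simply the gap between the marginal entropy $H(X_0)$ and the entropy rate; the numbers below are for that made-up chain only.

```python
import numpy as np

def H(p):
    """Entropy in bits of a pmf given as an array (zeros ignored)."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-np.sum(nz * np.log2(nz)))

# Same invented 2-state chain as in the previous sketch.
A  = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([2/3, 1/3])                        # its stationary distribution
h_mu   = sum(pi[i] * H(A[i]) for i in range(2))  # entropy rate H(X_0 | X_-1)
rho_mu = H(pi) - h_mu                            # multi-information rate
print(h_mu, rho_mu)  # roughly 0.55 and 0.37 bits respectively
```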
378 The \emph{excess entropy} \cite{CrutchfieldPackard1983} | |
379 is the mutual information between |
380 the entire `past' and the `present' together with the entire `future': |
381 \begin{equation} | |
382 E = I(\past{X}_0; X_0,\fut{X}_0). | |
383 \end{equation} | |
384 | |
385 | |
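
Estimating $E$ for a general process needs the whole joint distribution, but for the invented first-order chain used above the Markov property makes the future conditionally independent of the past given $X_0$, so in that special case $E$ reduces to the one-step mutual information $I(X_{-1};X_0)$ (and so coincides with $\rho_\mu$). A rough check, not a general method:

```python
import numpy as np

def H(p):
    """Entropy in bits of a (possibly joint) pmf given as an array of any shape."""
    p = np.asarray(p, dtype=float).ravel()
    nz = p[p > 0]
    return float(-np.sum(nz * np.log2(nz)))

A  = np.array([[0.9, 0.1], [0.2, 0.8]])   # same invented chain as above
pi = np.array([2/3, 1/3])                 # its stationary distribution

# Joint pmf of (X_-1, X_0): p(i, j) = pi_i * A_ij.
joint = pi[:, None] * A
E = H(pi) + H(joint.sum(axis=0)) - H(joint)   # I(X_-1;X_0) = H(X_-1)+H(X_0)-H(X_-1,X_0)
print(E)  # about 0.37 bits, equal to rho_mu for this first-order chain
```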
347 | 386 |
348 \begin{fig}{predinfo-bg} | 387 \begin{fig}{predinfo-bg} |
349 \newcommand\subfig[2]{\shortstack{#2\\[0.75em]#1}} | 388 \newcommand\subfig[2]{\shortstack{#2\\[0.75em]#1}} |
350 \newcommand\rad{1.8em}% | 389 \newcommand\rad{1.8em}% |
351 \newcommand\ovoid[1]{% | 390 \newcommand\ovoid[1]{% |
412 \path (p3) +(4em,\rad) node [anchor=south] {$X_1,\ldots$}; | 451 \path (p3) +(4em,\rad) node [anchor=south] {$X_1,\ldots$}; |
413 \end{tikzpicture}}% | 452 \end{tikzpicture}}% |
414 \\[0.5em] | 453 \\[0.5em] |
415 \end{tabular} | 454 \end{tabular} |
416 \caption{ | 455 \caption{ |
417 Venn diagram representation of several information measures for | 456 I-diagrams for several information measures in |
418 stationary random processes. Each circle or oval represents a random | 457 stationary random processes. Each circle or oval represents a random |
419 variable or sequence of random variables relative to time $t=0$. Overlapped areas | 458 variable or sequence of random variables relative to time $t=0$. Overlapped areas |
420 correspond to various mutual informations as in \Figrf{venn-example}. | 459 correspond to various mutual informations as in \Figrf{venn-example}. |
421 In (c), the circle represents the `present'. Its total area is | 460 In (c), the circle represents the `present'. Its total area is |
422 $H(X_0)=H(1)=\rho_\mu+r_\mu+b_\mu$, where $\rho_\mu$ is the multi-information | 461 $H(X_0)=H(1)=\rho_\mu+r_\mu+b_\mu$, where $\rho_\mu$ is the multi-information |
423 rate, $r_\mu$ is the residual entropy rate, and $b_\mu$ is the predictive | 462 rate, $r_\mu$ is the residual entropy rate, and $b_\mu$ is the predictive |
424 information rate. The entropy rate is $h_\mu = r_\mu+b_\mu$. | 463 information rate. The entropy rate is $h_\mu = r_\mu+b_\mu$. |
425 } | 464 } |
426 \end{fig} | 465 \end{fig} |
427 | 466 |
428 \subsection{Predictive information rate} | 467 The \emph{predictive information rate} (or PIR) \cite{AbdallahPlumbley2009} |
429 In previous work \cite{AbdallahPlumbley2009}, we introduced | 468 is the average information in one observation about the infinite future given the infinite past, |
430 % examined several | 469 and is defined as a conditional mutual information: |
431 % information-theoretic measures that could be used to characterise | |
432 % not only random processes (\ie, an ensemble of possible sequences), | |
433 % but also the dynamic progress of specific realisations of such processes. | |
434 % One of these measures was | |
435 % | |
436 the \emph{predictive information rate} | |
437 (PIR), which is the average information | |
438 in one observation about the infinite future given the infinite past. | |
439 If $\past{X}_t=(\ldots,X_{t-2},X_{t-1})$ denotes the variables | |
440 before time $t$, | |
441 and $\fut{X}_t = (X_{t+1},X_{t+2},\ldots)$ denotes | |
442 those after $t$, | |
443 the PIR at time $t$ is defined as a conditional mutual information: | |
444 \begin{equation} | 470 \begin{equation} |
445 \label{eq:PIR} | 471 \label{eq:PIR} |
446 \IXZ_t \define I(X_t;\fut{X}_t|\past{X}_t) = H(\fut{X}_t|\past{X}_t) - H(\fut{X}_t|X_t,\past{X}_t). | 472 b_\mu = I(X_0;\fut{X}_0|\past{X}_0) = H(\fut{X}_0|\past{X}_0) - H(\fut{X}_0|X_0,\past{X}_0). |
447 \end{equation} | 473 \end{equation} |
448 % (The underline/overline notation follows that of \cite[\S 3]{AbdallahPlumbley2009}.) | |
449 % Hence, $\Ix_t$ quantifies the \emph{new} | |
450 % information gained about the future from the observation at time $t$. | |
451 Equation \eqrf{PIR} can be read as the average reduction | 474 Equation \eqrf{PIR} can be read as the average reduction |
452 in uncertainty about the future on learning $X_t$, given the past. | 475 in uncertainty about the future on learning $X_t$, given the past. |
453 Due to the symmetry of the mutual information, it can also be written | 476 Due to the symmetry of the mutual information, it can also be written |
454 as | 477 as |
455 \begin{equation} | 478 \begin{equation} |
463 the conditional entropy of one variable given \emph{all} the others | 486 the conditional entropy of one variable given \emph{all} the others |
464 in the sequence, future as well as past, is what | 487 in the sequence, future as well as past, is what |
465 we called the \emph{residual entropy rate} $r_\mu$ in \cite{AbdallahPlumbley2010}, | 488 we called the \emph{residual entropy rate} $r_\mu$ in \cite{AbdallahPlumbley2010}, |
466 but was previously identified by Verd{\'u} and Weissman \cite{VerduWeissman2006} as the | 489 but was previously identified by Verd{\'u} and Weissman \cite{VerduWeissman2006} as the |
467 \emph{erasure entropy rate}. | 490 \emph{erasure entropy rate}. |
468 % It is not expressible in terms of the block entropy function $H(\cdot)$. | 491 Thus, the PIR is the difference between |
469 It can be defined as the limit | |
470 \begin{equation} | |
471 \label{eq:residual-entropy-rate} | |
472 r_\mu \define \lim_{N\tends\infty} H(X_{\bet(-N,N)}) - H(X_{\bet(-N,-1)},X_{\bet(1,N)}). | |
473 \end{equation} | |
474 The second term, $H(X_{\bet(1,N)},X_{\bet(-N,-1)})$, | |
475 is the joint entropy of two non-adjacent blocks each of length $N$ with a | |
476 gap between them, | |
477 and cannot be expressed as a function of block entropies alone. | |
478 % In order to associate it with the concept of \emph{binding information} which | |
479 % we will define in \secrf{binding-info}, we | |
480 Thus, the shift-invariant PIR (which we will write as $b_\mu$) is the difference between | |
481 the entropy rate and the erasure entropy rate: $b_\mu = h_\mu - r_\mu$. | 492 the entropy rate and the erasure entropy rate: $b_\mu = h_\mu - r_\mu$. |
482 These relationships are illustrated in \Figrf{predinfo-bg}, along with | 493 These relationships are illustrated in \Figrf{predinfo-bg}, along with |
483 several of the information measures we have discussed so far. | 494 several of the information measures we have discussed so far. |
484 | 495 |
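
A rough numerical check of $b_\mu = h_\mu - r_\mu$ on the same invented two-state chain used in the earlier sketches: by the Markov property, conditioning on the infinite past and future collapses to conditioning on the neighbours $X_{-1}$ and $X_1$, so everything follows from the three-variable joint pmf.

```python
import numpy as np

def H(p):
    """Entropy in bits of a joint pmf given as an array of any shape (zeros ignored)."""
    p = np.asarray(p, dtype=float).ravel()
    nz = p[p > 0]
    return float(-np.sum(nz * np.log2(nz)))

A  = np.array([[0.9, 0.1], [0.2, 0.8]])   # same invented 2-state chain as above
pi = np.array([2/3, 1/3])                 # its stationary distribution

# Joint pmf of (X_-1, X_0, X_1): p(i, j, k) = pi_i * A_ij * A_jk.
p3 = pi[:, None, None] * A[:, :, None] * A[None, :, :]

h_mu = H(p3.sum(axis=2)) - H(pi)       # H(X_0 | X_-1): entropy rate
r_mu = H(p3) - H(p3.sum(axis=1))       # H(X_0 | X_-1, X_1): erasure (residual) entropy rate
b_mu = h_mu - r_mu                     # predictive information rate
print(h_mu, r_mu, b_mu)  # roughly 0.55, 0.36 and 0.19 bits
```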
485 | 496 |
502 $\sigma_\mu$, the difference between the multi-information rate and the excess | 513 $\sigma_\mu$, the difference between the multi-information rate and the excess |
503 entropy, as an interesting quantity that measures the predictive benefit of | 514 entropy, as an interesting quantity that measures the predictive benefit of |
504 model-building (that is, maintaining an internal state summarising past | 515 model-building (that is, maintaining an internal state summarising past |
505 observations in order to make better predictions). They also identify | 516 observations in order to make better predictions). They also identify |
506 $w_\mu = \rho_\mu + b_{\mu}$, which they call the \emph{local exogenous | 517 $w_\mu = \rho_\mu + b_{\mu}$, which they call the \emph{local exogenous |
507 information}. | 518 information} rate. |
508 | 519 |
509 \subsection{First order Markov chains} | 520 \subsection{First order Markov chains} |
510 These are the simplest non-trivial models to which information dynamics methods | 521 These are the simplest non-trivial models to which information dynamics methods |
511 can be applied. In \cite{AbdallahPlumbley2009} we showed that the predictive information | 522 can be applied. In \cite{AbdallahPlumbley2009} we showed that the predictive information |
512 rate can be expressed simply in terms of the entropy rate of the Markov chain. | 523 rate can be expressed simply in terms of the entropy rate of the Markov chain. |
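
One way to make this concrete (a sketch under the Markov assumption, with an invented transition matrix, not a restatement of the formula in \cite{AbdallahPlumbley2009}): the Markov property gives $b_\mu = I(X_0;X_1|X_{-1}) = H(X_1|X_{-1}) - H(X_1|X_0)$, i.e. the entropy rate computed from the two-step transition matrix minus the entropy rate of the one-step matrix.

```python
import numpy as np

def stationary(A):
    """Stationary distribution of a row-stochastic transition matrix A."""
    w, v = np.linalg.eig(A.T)
    pi = np.real(v[:, np.argmin(np.abs(w - 1.0))])
    return pi / pi.sum()

def entropy_rate(A, pi):
    """sum_i pi_i H(A[i, :]) in bits: entropy of the next state given the current one."""
    def H(p):
        nz = p[p > 0]
        return float(-np.sum(nz * np.log2(nz)))
    return float(sum(pi[i] * H(A[i]) for i in range(len(pi))))

# Same invented 2-state chain as in the earlier sketches.
A  = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = stationary(A)

# b_mu = H(X_1 | X_-1) - H(X_1 | X_0): entropy rate of the two-step chain A @ A
# minus that of A itself (both share the stationary distribution pi).
b_mu = entropy_rate(A @ A, pi) - entropy_rate(A, pi)
print(b_mu)  # roughly 0.19 bits, matching h_mu - r_mu from the previous sketch
```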