comparison draft.tex @ 36:ec7d64c0ae44
Work on section 2 and 3A.
author:   samer
date:     Wed, 14 Mar 2012 23:04:12 +0000
parents:  194c7ec7e35d
children: f31433225faa
\section{Theoretical review}

\subsection{Entropy and information}
Let $X$ denote some variable whose value is initially unknown to our
hypothetical observer. We will treat $X$ mathematically as a random variable,
with a value to be drawn from some set $\X$ and a
probability distribution representing the observer's beliefs about the
true value of $X$.
In this case, the observer's uncertainty about $X$ can be quantified
as the entropy $H(X)$ of the random variable. For a discrete variable
with probability mass function $p:\X \to [0,1]$, this is
\begin{equation}
	H(X) = \sum_{x\in\X} -p(x) \log p(x) = \expect{-\log p(X)},
\end{equation}
where $\expect{}$ is the expectation operator. The negative log-probability
$\ell(x) = -\log p(x)$ of a particular value $x$ can usefully be thought of as
the \emph{surprisingness} of the value $x$ should it be observed, and
hence the entropy is the expected surprisingness.
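
For concreteness, both quantities are straightforward to compute numerically;
the following is a minimal Python/NumPy sketch (an illustration only, using
natural logarithms so that all values are in nats):
\begin{verbatim}
import numpy as np

def entropy(p):
    # H(X) = sum_x -p(x) log p(x): the expected surprisingness, in nats.
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]                    # 0 log 0 is taken as 0
    return float(-(nz * np.log(nz)).sum())

def surprisingness(p, x):
    # l(x) = -log p(x): how surprising the particular value x is.
    return float(-np.log(p[x]))

p = np.array([0.5, 0.25, 0.25])      # a small example distribution
entropy(p)                           # ~1.04 nats
surprisingness(p, 1)                 # ~1.39 nats
\end{verbatim}
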
Suppose the observer now receives some new data $\Data$; the information
in this new data \emph{about} $X$ can be quantified as the
Kullback-Leibler (KL) divergence between the prior and posterior
distributions $p(x)$ and $p(x|\Data)$ respectively:
\begin{equation}
	\mathcal{I}_{\Data\to X} = D(p_{X|\Data} || p_{X})
	= \sum_{x\in\X} p(x|\Data) \log \frac{p(x|\Data)}{p(x)}.
\end{equation}
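
For concreteness, a minimal Python/NumPy sketch of this quantity (the
function name is illustrative only):
\begin{verbatim}
import numpy as np

def information_about_X(prior, posterior):
    # I_{D -> X} = D(p_{X|D} || p_X)
    #            = sum_x p(x|D) log( p(x|D) / p(x) ), in nats.
    prior = np.asarray(prior, dtype=float)
    posterior = np.asarray(posterior, dtype=float)
    nz = posterior > 0
    return float((posterior[nz] * np.log(posterior[nz] / prior[nz])).sum())

prior     = np.array([0.5, 0.25, 0.25])  # beliefs about X before the data
posterior = np.array([0.1, 0.7, 0.2])    # beliefs after observing the data
information_about_X(prior, posterior)    # ~0.52 nats
\end{verbatim}
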
When there are multiple variables $X_1, X_2$
\etc which the observer believes to be dependent, the observation of
one may change the observer's beliefs and hence yield information about the
others. The joint and conditional entropies as described in any
Some other information measures are indicated in the legend.
}
\end{fig}


\subsection{Surprise and information in sequences}
\label{s:surprise-info-seq}

Suppose that $(\ldots,X_{-1},X_0,X_1,\ldots)$ is a sequence of
random variables, infinite in both directions,
and that $\mu$ is the associated probability measure over all
realisations of the sequence---in the following, $\mu$ will simply serve
as a label for the process. We can identify a number of information-theoretic
measures that are meaningful in the context of a sequential observation of the
sequence, during which, at any time $t$, the sequence of variables can be divided
into a `present' $X_t$, a `past'
$\past{X}_t \equiv (\ldots, X_{t-2}, X_{t-1})$, and a `future'
$\fut{X}_t \equiv (X_{t+1},X_{t+2},\ldots)$.
The actually observed value of $X_t$ will be written as $x_t$, and
the sequence of observations up to but not including $x_t$ as
$\past{x}_t$.
% Since the sequence is assumed stationary, we can without loss of generality,
% assume that $t=0$ in the following definitions.

The in-context surprisingness of the observation $X_t=x_t$ is a function
of both $x_t$ and the context $\past{x}_t$:
\begin{equation}
	\ell(x_t|\past{x}_t) = - \log p(x_t|\past{x}_t).
\end{equation}
However, before $X_t$ is observed to be $x_t$, the observer can compute
its \emph{expected} surprisingness as a measure of its uncertainty about
the very next event; this may be written as an entropy
$H(X_t|\ev(\past{X}_t = \past{x}_t))$, but note that this is
conditional on the \emph{event} $\ev(\past{X}_t=\past{x}_t)$, not on the
\emph{variables} $\past{X}_t$ as in the conventional conditional entropy.

The surprisingness $\ell(x_t|\past{x}_t)$ and expected surprisingness
$H(X_t|\ev(\past{X}_t=\past{x}_t))$
are subjective information dynamic measures, since they are based on the observer's
subjective probability model in the context of the actually observed sequence
$\past{x}_t$---they characterise what it is like to `be in the observer's shoes'.
If we view the observer as a purely passive or reactive agent, this would
probably be sufficient, but for active agents such as humans or animals, it is
often necessary to \emph{anticipate} future events in order, for example, to plan the
most effective course of action. It makes sense for such observers to be
concerned about the predictive probability distribution over future events,
$p(\fut{x}_t|\past{x}_t)$. When an observation $\ev(X_t=x_t)$ is made in this context,
the \emph{instantaneous predictive information} (IPI) is the information in the
event $\ev(X_t=x_t)$ about the entire future of the sequence $\fut{X}_t$.

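As a concrete illustration of the surprisingness and expected surprisingness
defined above, the following Python/NumPy sketch computes both along an
observed sequence (\texttt{surprise\_profile} and the toy first-order
predictor are illustrative stand-ins for whatever model the observer
actually holds):
\begin{verbatim}
import numpy as np

def surprise_profile(seq, predict):
    # For each t: the in-context surprisingness -log p(x_t | past) of the
    # observed symbol, and the expected surprisingness, i.e. the entropy of
    # the predictive distribution over the next symbol given the observed
    # past. predict(past) must return a probability vector over the alphabet.
    surprisal, expected = [], []
    for t, x in enumerate(seq):
        p = np.asarray(predict(seq[:t]), dtype=float)
        surprisal.append(float(-np.log(p[x])))
        nz = p[p > 0]
        expected.append(float(-(nz * np.log(nz)).sum()))
    return np.array(surprisal), np.array(expected)

# Toy stand-in predictor: a fixed first-order Markov model over {0, 1}.
a = np.array([[0.9, 0.2],
              [0.1, 0.8]])  # a[i, j] = Pr(next = i | previous = j)
predict = lambda past: a[:, past[-1]] if past else np.array([0.5, 0.5])
ell, H_next = surprise_profile([0, 0, 1, 1, 0], predict)
\end{verbatim}
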
\subsection{Information measures for stationary random processes}

The \emph{entropy rate} of the process is the entropy of the next variable
$X_t$ given all the previous ones.
\begin{equation}
	\label{eq:entro-rate}
or \emph{erasure} \cite{VerduWeissman2006} entropy rate.
These relationships are illustrated in \Figrf{predinfo-bg}, along with
several of the information measures we have discussed so far.


James et al.\ \cite{JamesEllisonCrutchfield2011} study the predictive information
rate and also examine some related measures. In particular, they identify
$\sigma_\mu$, the difference between the multi-information rate and the excess
entropy, as an interesting quantity that measures the predictive benefit of
model-building (that is, maintaining an internal state summarising past
observations in order to make better predictions). They also identify
$w_\mu = \rho_\mu + b_{\mu}$, which they call the \emph{local exogenous
information} rate.

\begin{fig}{wundt}
  \raisebox{-4em}{\colfig[0.43]{wundt}}
% {\ \shortstack{{\Large$\longrightarrow$}\\ {\scriptsize\emph{exposure}}}\ }
  {\ {\large$\longrightarrow$}\ }
  in a move to the left along the curve \cite{Berlyne71}.
  }
\end{fig}


\subsection{First and higher order Markov chains}
First order Markov chains are the simplest non-trivial models to which information
dynamics methods can be applied. In \cite{AbdallahPlumbley2009} we derived
expressions for all the information measures introduced [above] for
irreducible stationary Markov chains (\ie those with a unique stationary
distribution). The derivation is greatly simplified by the dependency structure
of the Markov chain: for the purpose of the analysis, the `past' and `future'
segments $\past{X}_t$ and $\fut{X}_t$ can be reduced to just the previous
and next variables $X_{t-1}$ and $X_{t+1}$ respectively. We also showed that
the predictive information rate can be expressed simply in terms of entropy rates:
if we let $a$ denote the $K\times K$ transition matrix of a Markov chain over
the alphabet $\{1,\ldots,K\}$, such that
$a_{ij} = \Pr(\ev(X_t=i)|\ev(X_{t-1}=j))$, and let $h:\reals^{K\times K}\to \reals$ be
the entropy rate function such that $h(a)$ is the entropy rate of a Markov chain
with transition matrix $a$, then the predictive information rate $b(a)$ is
\begin{equation}
	b(a) = h(a^2) - h(a),
\end{equation}
where $a^2$, the transition matrix squared, is the transition matrix
of the `skip one' Markov chain obtained by jumping two steps at a time
along the original chain.

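As a numerical illustration of this result (a sketch only, assuming NumPy
and the column-stochastic convention above, so that column $j$ of $a$ is the
distribution of the next state given current state $j$):
\begin{verbatim}
import numpy as np

def stationary(a):
    # Stationary distribution of a column-stochastic transition matrix a:
    # the eigenvector of a with eigenvalue 1, normalised to sum to 1.
    vals, vecs = np.linalg.eig(a)
    pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return pi / pi.sum()

def h(a):
    # Entropy rate (nats) of a chain with a[i, j] = Pr(X_t = i | X_{t-1} = j).
    pi = stationary(a)
    with np.errstate(divide='ignore', invalid='ignore'):
        plogp = np.where(a > 0, -a * np.log(a), 0.0)
    return float(plogp.sum(axis=0) @ pi)

def b(a):
    # Predictive information rate: b(a) = h(a^2) - h(a), where a @ a is the
    # transition matrix of the 'skip one' chain.
    return h(a @ a) - h(a)

# Example: a 3-state chain (columns sum to 1).
a = np.array([[0.8, 0.1, 0.3],
              [0.1, 0.8, 0.3],
              [0.1, 0.1, 0.4]])
print(h(a), b(a))
\end{verbatim}
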
Second and higher order Markov chains can be treated in a similar way by transforming
to a first order representation of the higher order Markov chain. If we are dealing
with an $N$th order model, this is done by forming a new alphabet of size $K^N$
consisting of all possible $N$-tuples of symbols from the base alphabet of size $K$.
An observation $\hat{x}_t$ in this new model represents a block of $N$ observations
$(x_{t+1},\ldots,x_{t+N})$ from the base model. The next
observation $\hat{x}_{t+1}$ represents the block of $N$ observations obtained by
shifting the previous block along by one step. The new Markov chain is parameterised
by a sparse $K^N\times K^N$ transition matrix $\hat{a}$. The entropy rate of the
first order system is the same as the entropy rate of the original order $N$ system,
and its PIR is
\begin{equation}
	b({\hat{a}}) = h({\hat{a}^{N+1}}) - N h({\hat{a}}),
\end{equation}
where $\hat{a}^{N+1}$ is the $(N+1)$th power of the first order transition matrix.

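The construction of the first order representation can be sketched as
follows (illustrative code, assuming NumPy; here \texttt{p} maps each
length-$N$ context to a distribution over the next symbol):
\begin{verbatim}
import itertools
import numpy as np

def embed(p, K, N):
    # First-order representation of an N-th order chain over {0, ..., K-1}.
    # p[context] is the length-K distribution over the next symbol given the
    # previous N symbols `context` (an N-tuple). Returns the column-stochastic
    # K^N x K^N transition matrix a_hat over block states.
    states = list(itertools.product(range(K), repeat=N))
    index = {s: i for i, s in enumerate(states)}
    a_hat = np.zeros((K**N, K**N))
    for s in states:
        for y in range(K):
            s_next = s[1:] + (y,)    # shift the block along by one step
            a_hat[index[s_next], index[s]] = p[s][y]
    return a_hat

# PIR of the order-N chain, reusing h() from the previous sketch:
#   b_hat = h(np.linalg.matrix_power(a_hat, N + 1)) - N * h(a_hat)
\end{verbatim}
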
\section{Information Dynamics in Analysis}

\begin{fig}{twopages}
  \colfig[0.96]{matbase/fig9471} % update from mbc paper
% \colfig[0.97]{matbase/fig72663}\\ % later update from mbc paper (Keith's new picks)
  \vspace*{1em}
  \colfig[0.97]{matbase/fig13377} % rule based analysis
  The bottom panel shows a rule-based boundary strength analysis computed
  using Cambouropoulos' LBDM.
  All information measures are in nats and time is in notes.
  }
\end{fig}

\subsection{Musicological Analysis}
In \cite{AbdallahPlumbley2009}, methods based on the theory described above
were used to analyse two pieces of music in the minimalist style
by Philip Glass: \emph{Two Pages} (1969) and \emph{Gradus} (1968).
The analysis was done using a first-order Markov chain model, with the
enhancement that the transition matrix of the model was allowed to
evolve dynamically as the notes were processed, and was tracked (in
a Bayesian way) as a \emph{distribution} over possible transition matrices,
rather than as a point estimate. The results are summarised in \figrf{twopages}:
the upper four plots show the dynamically evolving subjective information
measures described in \secrf{surprise-info-seq}, computed using a point
estimate of the current transition matrix, but the fifth plot (the `model information rate')
measures the information in each observation about the transition matrix itself.
In \cite{AbdallahPlumbley2010b}, we showed that this `model information rate'
is actually a component of the true IPI in
a time-varying Markov chain, which was neglected when we computed the IPI from
point estimates of the transition matrix as if the transition probabilities
were constant.

The peaks of the surprisingness and of both components of the predictive information
show good correspondence with the structure of the piece, both as marked in the score
and as analysed by the musicologist Keith Potter, who was asked to mark the six
`most surprising moments' of the piece (shown as asterisks in the fifth plot)%
\footnote{%
	Note that the boundary marked in the score at around note 5,400 is known to be
	anomalous; on the basis of a listening analysis, some musicologists [ref] have
	placed the boundary a few bars later, in agreement with our analysis.}.

In contrast, the analyses shown in the lower two plots of \figrf{twopages},
obtained using two rule-based music segmentation algorithms, while clearly
\emph{reflecting} the structure of the piece, do not \emph{segment} it
clearly, showing only a tendency for the boundary strength function to peak
at the boundaries in the piece.

\begin{fig}{metre}
% \scalebox{1}[1]{%
  \begin{tabular}{cc}
  \colfig[0.45]{matbase/fig36859} & \colfig[0.48]{matbase/fig88658} \\