draft.tex @ 36:ec7d64c0ae44

Work on section 2 and 3A.
author samer
date Wed, 14 Mar 2012 23:04:12 +0000
parents 194c7ec7e35d
children f31433225faa
\section{Theoretical review}

\subsection{Entropy and information}
Let $X$ denote some variable whose value is initially unknown to our
hypothetical observer. We will treat $X$ mathematically as a random variable,
with a value to be drawn from some set $\X$ and a
probability distribution representing the observer's beliefs about the
true value of $X$.
In this case, the observer's uncertainty about $X$ can be quantified
as the entropy $H(X)$ of the random variable. For a discrete variable
with probability mass function $p:\X \to [0,1]$, this is
\begin{equation}
	H(X) = \sum_{x\in\X} -p(x) \log p(x) = \expect{-\log p(X)},
\end{equation}
where $\expect{}$ is the expectation operator. The negative log-probability
$\ell(x) = -\log p(x)$ of a particular value $x$ can usefully be thought of as
the \emph{surprisingness} of the value $x$ should it be observed, and
hence the entropy is the expected surprisingness.
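For example, if the observer's beliefs about $X$ are uniform over an alphabet
of $K$ symbols, every outcome is equally surprising and the entropy takes its
maximal value,
\begin{equation}
	\ell(x) = -\log \frac{1}{K} = \log K, \qquad
	H(X) = \sum_{x\in\X} \frac{1}{K} \log K = \log K,
\end{equation}
whereas a distribution concentrated entirely on one value yields zero entropy.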

If the observer then acquires some new data $\Data$, the information
in this new data \emph{about} $X$ can be quantified as the
Kullback-Leibler (KL) divergence between the prior and posterior
distributions $p(x)$ and $p(x|\Data)$ respectively:
\begin{equation}
	\mathcal{I}_{\Data\to X} = D(p_{X|\Data} || p_{X})
		= \sum_{x\in\X} p(x|\Data) \log \frac{p(x|\Data)}{p(x)}.
\end{equation}
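For example, if the prior over $X$ is uniform on an alphabet of $K$ symbols and
the data $\Data$ reveal the value of $X$ exactly, then the posterior puts all
its mass on the observed value and
\begin{equation}
	\mathcal{I}_{\Data\to X} = 1 \cdot \log \frac{1}{1/K} = \log K,
\end{equation}
that is, the data deliver all of the prior uncertainty $H(X) = \log K$.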
When there are multiple variables $X_1, X_2$
\etc which the observer believes to be dependent, then the observation of
one may change its beliefs and hence yield information about the
others. The joint and conditional entropies as described in any
Some other information measures are indicated in the legend.
}
\end{fig}


\subsection{Surprise and information in sequences}
\label{s:surprise-info-seq}

Suppose that $(\ldots,X_{-1},X_0,X_1,\ldots)$ is a sequence of
random variables, infinite in both directions,
and that $\mu$ is the associated probability measure over all
realisations of the sequence---in the following, $\mu$ will simply serve
as a label for the process. We can identify a number of information-theoretic
measures that are meaningful in the context of a sequential observation of the sequence, during
which, at any time $t$, the sequence of variables can be divided into a `present' $X_t$, a `past'
$\past{X}_t \equiv (\ldots, X_{t-2}, X_{t-1})$, and a `future'
$\fut{X}_t \equiv (X_{t+1},X_{t+2},\ldots)$.
The actually observed value of $X_t$ will be written as $x_t$, and
the sequence of observations up to but not including $x_t$ as
$\past{x}_t$.
% Since the sequence is assumed stationary, we can without loss of generality,
% assume that $t=0$ in the following definitions.

The in-context surprisingness of the observation $X_t=x_t$ is a function
of both $x_t$ and the context $\past{x}_t$:
\begin{equation}
	\ell(x_t|\past{x}_t) = - \log p(x_t|\past{x}_t).
\end{equation}
However, before $X_t$ is observed to be $x_t$, the observer can compute
its \emph{expected} surprisingness as a measure of its uncertainty about
the very next event; this may be written as an entropy
$H(X_t|\ev(\past{X}_t = \past{x}_t))$, but note that this is
conditional on the \emph{event} $\ev(\past{X}_t=\past{x}_t)$, not on the
\emph{variables} $\past{X}_t$ as in the conventional conditional entropy.
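Written out in terms of the observer's predictive distribution, this is simply
the expected surprisingness of the next symbol:
\begin{equation}
	H(X_t|\ev(\past{X}_t=\past{x}_t)) = \sum_{x\in\X} - p(x|\past{x}_t) \log p(x|\past{x}_t).
\end{equation}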

The surprisingness $\ell(x_t|\past{x}_t)$ and expected surprisingness
$H(X_t|\ev(\past{X}_t=\past{x}_t))$
are subjective information dynamic measures, since they are based on the observer's
subjective probability model in the context of the actually observed sequence
$\past{x}_t$---they characterise what it is like to `be in the observer's shoes'.
If we view the observer as a purely passive or reactive agent, these would
probably be sufficient, but for active agents such as humans or animals, it is
often necessary to \emph{anticipate} future events in order, for example, to plan the
most effective course of action. It makes sense for such observers to be
concerned with the predictive probability distribution over future events,
$p(\fut{x}_t|\past{x}_t)$. When an observation $\ev(X_t=x_t)$ is made in this context,
the \emph{instantaneous predictive information} (IPI) is the information in the
event $\ev(X_t=x_t)$ about the entire future of the sequence $\fut{X}_t$.
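Following the same pattern as the information $\mathcal{I}_{\Data\to X}$ defined
above, the IPI can be written as a KL divergence between the observer's beliefs
about the future after and before the observation:
\begin{equation}
	\mathcal{I}_t = D(p_{\fut{X}_t|\ev(X_t=x_t,\past{X}_t=\past{x}_t)} || p_{\fut{X}_t|\ev(\past{X}_t=\past{x}_t)}).
\end{equation}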

\subsection{Information measures for stationary random processes}

The \emph{entropy rate} of the process is the entropy of the next variable
$X_t$ given all the previous ones.
\begin{equation}
	\label{eq:entro-rate}
or \emph{erasure} \cite{VerduWeissman2006} entropy rate.
These relationships are illustrated in \Figrf{predinfo-bg}, along with
several of the information measures we have discussed so far.


James et al \cite{JamesEllisonCrutchfield2011} study the predictive information
rate and also examine some related measures. In particular they identify
$\sigma_\mu$, the difference between the multi-information rate and the excess
entropy, as an interesting quantity that measures the predictive benefit of
model-building (that is, maintaining an internal state summarising past
observations in order to make better predictions). They also identify
$w_\mu = \rho_\mu + b_{\mu}$, which they call the \emph{local exogenous
information} rate.

\begin{fig}{wundt}
\raisebox{-4em}{\colfig[0.43]{wundt}}
% {\ \shortstack{{\Large$\longrightarrow$}\\ {\scriptsize\emph{exposure}}}\ }
{\ {\large$\longrightarrow$}\ }
in a move to the left along the curve \cite{Berlyne71}.
}
\end{fig}


\subsection{First and higher order Markov chains}
First order Markov chains are the simplest non-trivial models to which information
dynamics methods can be applied. In \cite{AbdallahPlumbley2009} we derived
expressions for all the information measures introduced above for
irreducible stationary Markov chains (\ie those that have a unique stationary
distribution). The derivation is greatly simplified by the dependency structure
of the Markov chain: for the purpose of the analysis, the `past' and `future'
segments $\past{X}_t$ and $\fut{X}_t$ can be reduced to just the previous
and next variables $X_{t-1}$ and $X_{t+1}$ respectively. We also showed that
the predictive information rate can be expressed simply in terms of entropy rates:
if we let $a$ denote the $K\times K$ transition matrix of a Markov chain over
the alphabet $\{1,\ldots,K\}$, such that
$a_{ij} = \Pr(\ev(X_t=i)|\ev(X_{t-1}=j))$, and let $h:\reals^{K\times K}\to \reals$ be
the entropy rate function such that $h(a)$ is the entropy rate of a Markov chain
with transition matrix $a$, then the predictive information rate $b(a)$ is
\begin{equation}
	b(a) = h(a^2) - h(a),
\end{equation}
where $a^2$, the transition matrix squared, is the transition matrix
of the `skip one' Markov chain obtained by jumping two steps at a time
along the original chain.
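As a concrete illustration of this formula, the following NumPy sketch computes
the entropy rate and the predictive information rate of a small Markov chain;
the example transition matrix is arbitrary, and the columns follow the
convention $a_{ij} = \Pr(\ev(X_t=i)|\ev(X_{t-1}=j))$ used above.
\begin{verbatim}
import numpy as np

def stationary(a):
    """Stationary distribution of a column-stochastic matrix a,
    where a[i, j] = Pr(X_t = i | X_{t-1} = j)."""
    vals, vecs = np.linalg.eig(a)
    pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return pi / pi.sum()

def entropy_rate(a):
    """Entropy rate h(a) in nats: the entropy of each column,
    averaged under the stationary distribution."""
    pi = stationary(a)
    with np.errstate(divide='ignore', invalid='ignore'):
        plogp = np.where(a > 0, -a * np.log(a), 0.0)
    return float(plogp.sum(axis=0) @ pi)

def predictive_information_rate(a):
    """PIR b(a) = h(a^2) - h(a): entropy rate of the `skip one'
    chain minus that of the original chain."""
    return entropy_rate(a @ a) - entropy_rate(a)

# Arbitrary 3-state example (columns sum to one).
a = np.array([[0.8, 0.1, 0.3],
              [0.1, 0.8, 0.3],
              [0.1, 0.1, 0.4]])
print(entropy_rate(a), predictive_information_rate(a))
\end{verbatim}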

Second and higher order Markov chains can be treated in a similar way by transforming
to a first order representation of the high order Markov chain. If we are dealing
with an $N$th order model, this is done by forming a new alphabet of size $K^N$
consisting of all possible $N$-tuples of symbols from the base alphabet of size $K$.
An observation $\hat{x}_t$ in this new model represents a block of $N$ observations
$(x_{t+1},\ldots,x_{t+N})$ from the base model. The next
observation $\hat{x}_{t+1}$ represents the block of $N$ obtained by shifting the previous
block along by one step. The new Markov chain is parameterised by a sparse $K^N\times K^N$
transition matrix $\hat{a}$. The entropy rate of the first order system is the same
as the entropy rate of the original order $N$ system, and its PIR is
\begin{equation}
	b({\hat{a}}) = h({\hat{a}^{N+1}}) - N h({\hat{a}}),
\end{equation}
where $\hat{a}^{N+1}$ is the $(N+1)$th power of the first order transition matrix.
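Once the block transition matrix $\hat{a}$ has been built, the same machinery
applies. The sketch below does this for a made-up order-2 binary model; the
layout of the conditional probability table $p$ and the helper names are
illustrative only.
\begin{verbatim}
import numpy as np
from itertools import product

def entropy_rate(a):
    """Entropy rate (nats) of a column-stochastic transition matrix."""
    vals, vecs = np.linalg.eig(a)
    pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    pi = pi / pi.sum()
    with np.errstate(divide='ignore', invalid='ignore'):
        plogp = np.where(a > 0, -a * np.log(a), 0.0)
    return float(plogp.sum(axis=0) @ pi)

def block_transition_matrix(p, K, N):
    """First order representation of an order-N chain over {0..K-1}.
    p[y, s] = Pr(next symbol = y | preceding N-tuple s, oldest first);
    a block s can only move to a block overlapping it in N-1 symbols."""
    blocks = list(product(range(K), repeat=N))
    index = {s: i for i, s in enumerate(blocks)}
    a_hat = np.zeros((K**N, K**N))
    for s in blocks:
        for y in range(K):
            s_next = s[1:] + (y,)          # shift the block along by one
            a_hat[index[s_next], index[s]] = p[(y,) + s]
    return a_hat

# Made-up order-2 binary model: p[y, x1, x2] = Pr(X_t=y | X_{t-2}=x1, X_{t-1}=x2).
p = np.empty((2, 2, 2))
p[1] = [[0.9, 0.2], [0.5, 0.1]]
p[0] = 1.0 - p[1]

N, K = 2, 2
a_hat = block_transition_matrix(p, K, N)
pir = entropy_rate(np.linalg.matrix_power(a_hat, N + 1)) - N * entropy_rate(a_hat)
print(pir)
\end{verbatim}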


\section{Information Dynamics in Analysis}

\begin{fig}{twopages}
\colfig[0.96]{matbase/fig9471} % update from mbc paper
% \colfig[0.97]{matbase/fig72663}\\ % later update from mbc paper (Keith's new picks)
\vspace*{1em}
\colfig[0.97]{matbase/fig13377} % rule based analysis
The bottom panel shows a rule-based boundary strength analysis computed
using Cambouropoulos' LBDM.
All information measures are in nats and time is in notes.
}
\end{fig}

\subsection{Musicological Analysis}
In \cite{AbdallahPlumbley2009}, methods based on the theory described above
were used to analyse two pieces of music in the minimalist style
by Philip Glass: \emph{Two Pages} (1969) and \emph{Gradus} (1968).
The analysis was done using a first-order Markov chain model, with the
enhancement that the transition matrix of the model was allowed to
evolve dynamically as the notes were processed, and was tracked (in
a Bayesian way) as a \emph{distribution} over possible transition matrices,
rather than a point estimate. The results are summarised in \figrf{twopages}:
the upper four plots show the dynamically evolving subjective information
measures as described in \secrf{surprise-info-seq}, computed using a point
estimate of the current transition matrix, while the fifth plot (the `model
information rate') measures the information in each observation about the
transition matrix itself.
In \cite{AbdallahPlumbley2010b}, we showed that this `model information rate'
is actually a component of the true IPI in
a time-varying Markov chain, which was neglected when we computed the IPI from
point estimates of the transition matrix as if the transition probabilities
were constant.
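One simple way to compute such a `model information rate' is to keep an
independent Dirichlet distribution over each column of the transition matrix
(the conjugate choice for multinomial observations) and, after each observed
transition, take the KL divergence between the updated and previous beliefs
about that column, in line with the definition of information given earlier.
The sketch below illustrates this; the symmetric prior and the toy note
sequence are arbitrary, and the prior and update scheme actually used in
\cite{AbdallahPlumbley2009} may differ.
\begin{verbatim}
import numpy as np
from scipy.special import gammaln, digamma

def dirichlet_kl(alpha_post, alpha_prior):
    """KL divergence D( Dir(alpha_post) || Dir(alpha_prior) ) in nats."""
    a0_post, a0_prior = alpha_post.sum(), alpha_prior.sum()
    log_beta_prior = gammaln(alpha_prior).sum() - gammaln(a0_prior)
    log_beta_post = gammaln(alpha_post).sum() - gammaln(a0_post)
    return float(log_beta_prior - log_beta_post
                 + ((alpha_post - alpha_prior)
                    * (digamma(alpha_post) - digamma(a0_post))).sum())

def observe(alpha, prev, cur):
    """Add the transition prev -> cur to the pseudo-counts and return
    the information (nats) this observation carries about that column
    of the transition matrix."""
    before = alpha[:, prev].copy()
    alpha[cur, prev] += 1.0
    return dirichlet_kl(alpha[:, prev], before)

K = 4                                  # size of the pitch alphabet
alpha = np.ones((K, K))                # symmetric prior pseudo-counts per column
notes = [0, 1, 2, 1, 0, 1, 2, 1, 3]    # toy note sequence
model_info = [observe(alpha, p, c) for p, c in zip(notes, notes[1:])]
print(np.round(model_info, 3))
\end{verbatim}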

The peaks of the surprisingness and both components of the predictive information
show good correspondence with the structure of the piece, both as marked in the score
and as analysed by musicologist Keith Potter, who was asked to mark the six
`most surprising moments' of the piece (shown as asterisks in the fifth plot)%
\footnote{%
	Note that the boundary marked in the score at around note 5,400 is known to be
	anomalous; on the basis of a listening analysis, some musicologists [ref] have
	placed the boundary a few bars later, in agreement with our analysis.}.

In contrast, the analyses shown in the lower two plots of \figrf{twopages},
obtained using two rule-based music segmentation algorithms, while clearly
\emph{reflecting} the structure of the piece, do not \emph{segment} the piece
clearly, although the boundary strength functions do show a tendency to peak
at the boundaries in the piece.


\begin{fig}{metre}
% \scalebox{1}[1]{%
\begin{tabular}{cc}
\colfig[0.45]{matbase/fig36859} & \colfig[0.48]{matbase/fig88658} \\