comparison draft.tex @ 34:25846c37a08a

New bits, rearranged figure placement.
author samer
date Wed, 14 Mar 2012 13:17:05 +0000
parents a9c8580cb8ca
children 194c7ec7e35d
%and selective or evaluative phases \cite{Boden1990}, and would have
%applications in tools for computer aided composition.

\section{Theoretical review}

\subsection{Entropy and information}
Let $X$ denote some variable whose value is initially unknown to our
hypothetical observer. We will treat $X$ mathematically as a random variable,
with a value to be drawn from some set (or \emph{alphabet}) $\A$ and a
probability distribution representing the observer's beliefs about the
true value of $X$.
Given this representation, the observer's uncertainty about $X$ can be quantified
as the entropy of the random variable, $H(X)$. For a discrete variable
with probability mass function $p:\A \to [0,1]$, this is
\begin{equation}
  H(X) = \sum_{x\in\A} -p(x) \log p(x) = \expect{-\log p(X)},
\end{equation}
where $\expect{}$ is the expectation operator. The negative log-probability
$\ell(x) = -\log p(x)$ of a particular value $x$ can usefully be thought of as
the \emph{surprisingness} of the value $x$ should it be observed, and
hence the entropy is the expected surprisingness.
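As a quick numerical illustration (ours, not part of the draft), the following
Python sketch computes the entropy of a small discrete distribution as the
expectation of the surprisingness $-\log_2 p(x)$; the distribution and the
helper name \texttt{entropy} are invented for the example.
\begin{verbatim}
# Minimal sketch: entropy as expected surprisingness, in bits.
import numpy as np

def entropy(p):
    """Entropy in bits of a probability mass function given as an array."""
    p = np.asarray(p, dtype=float)
    nz = p > 0                      # convention: 0 log 0 = 0
    return float(np.sum(-p[nz] * np.log2(p[nz])))

p = np.array([0.5, 0.25, 0.125, 0.125])
surprisingness = -np.log2(p)        # l(x) = -log p(x) for each value
print(entropy(p))                   # 1.75 bits = sum(p * surprisingness)
\end{verbatim}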

Now suppose that the observer receives some new data $\Data$ that
causes a revision of its beliefs about $X$. The \emph{information}
in this new data \emph{about} $X$ can be quantified as the
Kullback-Leibler (KL) divergence between the prior and posterior
distributions $p(x)$ and $p(x|\Data)$ respectively:
\begin{equation}
  \mathcal{I}_{\Data\to X} = D(p_{X|\Data} || p_{X})
    = \sum_{x\in\A} p(x|\Data) \log \frac{p(x|\Data)}{p(x)}.
\end{equation}
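A small numerical sketch of this quantity (ours; the prior and posterior are
invented for illustration) follows.
\begin{verbatim}
# Sketch: information in data D about X as the KL divergence between
# the posterior p(x|D) and the prior p(x), in bits.
import numpy as np

def kl_divergence(posterior, prior):
    posterior = np.asarray(posterior, dtype=float)
    prior = np.asarray(prior, dtype=float)
    nz = posterior > 0
    return float(np.sum(posterior[nz] * np.log2(posterior[nz] / prior[nz])))

prior     = np.array([0.25, 0.25, 0.25, 0.25])   # beliefs about X before D
posterior = np.array([0.70, 0.10, 0.10, 0.10])   # beliefs about X after D
print(kl_divergence(posterior, prior))           # information gained, in bits
\end{verbatim}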
When there are multiple variables $X_1, X_2$,
\etc, which the observer believes to be dependent, the observation of
one may change its beliefs about, and hence yield information about, the
others. The joint and conditional entropies, as described in any
textbook on information theory (\eg \cite{CoverThomas}), quantify
the observer's expected uncertainty about groups of variables given the
values of others. In particular, the \emph{mutual information}
$I(X_1;X_2)$ is both the expected information
in an observation of $X_2$ about $X_1$ and the expected reduction
in uncertainty about $X_1$ after observing $X_2$:
\begin{equation}
  I(X_1;X_2) = H(X_1) - H(X_1|X_2),
\end{equation}
where $H(X_1|X_2) = H(X_1,X_2) - H(X_2)$ is the conditional entropy
of $X_1$ given $X_2$. A little algebra shows that $I(X_1;X_2)=I(X_2;X_1)$,
so the mutual information is symmetric in its arguments. A conditional
form of the mutual information can be formulated analogously:
\begin{equation}
  I(X_1;X_2|X_3) = H(X_1|X_3) - H(X_1|X_2,X_3).
\end{equation}
These relationships between the various entropies and mutual
informations are conveniently visualised in Venn diagram-like \emph{information diagrams},
or I-diagrams \cite{Yeung1991}, such as the one in \figrf{venn-example}.
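The following sketch (ours, with an invented joint probability table) computes
$I(X_1;X_2)$ as $H(X_1) - H(X_1|X_2)$ and checks the symmetry numerically.
\begin{verbatim}
# Sketch: mutual information from a joint probability table, in bits.
import numpy as np

def H(p):
    p = p[p > 0]
    return float(np.sum(-p * np.log2(p)))

# joint distribution p(x1, x2): rows index x1, columns index x2
pxy = np.array([[0.30, 0.10],
                [0.10, 0.50]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)

H1, H2, H12 = H(px), H(py), H(pxy.ravel())
I_12 = H1 - (H12 - H2)      # H(X1) - H(X1|X2)
I_21 = H2 - (H12 - H1)      # H(X2) - H(X2|X1)
print(I_12, I_21)           # equal, illustrating the symmetry
\end{verbatim}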

\begin{fig}{venn-example}
\newcommand\rad{2.2em}%
\newcommand\circo{circle (3.4em)}%
\newcommand\labrad{4.3em}
The total shaded area is the joint entropy $H(X_1,X_2,X_3)$.
The central area $I_{123}$ is the co-information \cite{McGill1954}.
Some other information measures are indicated in the legend.
}
\end{fig}


\subsection{Entropy and information in sequences}

Suppose that $(\ldots,X_{-1},X_0,X_1,\ldots)$ is a stationary sequence of
in uncertainty about the future on learning $X_t$, given the past.
Due to the symmetry of the mutual information, it can also be written
as
\begin{equation}
% \IXZ_t
  I(X_0;\fut{X}_0|\past{X}_0) = h_\mu - r_\mu,
% \label{<++>}
\end{equation}
% If $X$ is stationary, then
where $r_\mu = H(X_0|\fut{X}_0,\past{X}_0)$ is the \emph{residual}
\cite{AbdallahPlumbley2010}, or \emph{erasure} \cite{VerduWeissman2006},
entropy rate; the PIR is thus the difference between the entropy rate
and the erasure entropy rate, $b_\mu = h_\mu - r_\mu$.
These relationships are illustrated in \Figrf{predinfo-bg}, along with
several of the information measures we have discussed so far.
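As a concrete check of this identity (our sketch, not taken from the draft),
consider a two-state first-order Markov chain, for which both $h_\mu$ and
$r_\mu = H(X_0|X_{-1},X_1)$ can be computed exactly from the transition matrix;
the matrix below is invented for the example, and first-order chains are
treated more generally in the subsection that follows.
\begin{verbatim}
# Sketch: b_mu = h_mu - r_mu for a two-state first-order Markov chain.
import numpy as np

def H(p):
    p = p[p > 0]
    return float(np.sum(-p * np.log2(p)))

A = np.array([[0.9, 0.1],              # A[i, j] = P(X_{t+1} = j | X_t = i)
              [0.4, 0.6]])
w, v = np.linalg.eig(A.T)              # stationary distribution of the chain
pi = np.real(v[:, np.argmin(np.abs(w - 1))]); pi /= pi.sum()

h = float(-np.sum(pi[:, None] * A * np.log2(A)))     # entropy rate h_mu
                                                     # (all entries positive here)
joint = np.einsum('i,ij,jk->ijk', pi, A, A)          # p(x_{-1}, x_0, x_1)
r = H(joint.ravel()) - H(joint.sum(axis=1).ravel())  # H(X_0 | X_{-1}, X_1)
print(h - r)                                         # the PIR b_mu
\end{verbatim}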


\subsection{Other sequential information measures}

James et al.\ \cite{JamesEllisonCrutchfield2011} study the predictive information
rate and also examine some related measures. In particular, they identify
$\sigma_\mu$, the difference between the multi-information rate and the excess
entropy, as an interesting quantity that measures the predictive benefit of
model-building (that is, maintaining an internal state summarising past
observations in order to make better predictions). They also identify
$w_\mu = \rho_\mu + b_{\mu}$, which they call the \emph{local exogenous
information} rate.

\subsection{First order Markov chains}
These are the simplest non-trivial models to which information dynamics methods
can be applied. In \cite{AbdallahPlumbley2009}, we showed that the predictive information
rate can be expressed simply in terms of the entropy rate of the Markov chain.
If we let $a$ denote the transition matrix of the Markov chain and $h_a$ its
entropy rate, then its predictive information rate $b_a$ is
\begin{equation}
  b_a = h_{a^2} - h_a,
\end{equation}
where $a^2 = aa$, the transition matrix squared, is the transition matrix
of the `skip one' Markov chain obtained by leaving out every other observation.
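The formula is straightforward to evaluate numerically; the sketch below (ours,
with the helper functions \texttt{stationary} and \texttt{entropy\_rate} and the
example matrix invented for illustration) computes $b_a = h_{a^2} - h_a$.
\begin{verbatim}
# Sketch: PIR of a first-order Markov chain via b_a = h_{a^2} - h_a.
import numpy as np

def stationary(a):
    """Stationary distribution of a row-stochastic transition matrix a."""
    w, v = np.linalg.eig(a.T)
    pi = np.real(v[:, np.argmin(np.abs(w - 1))])
    return pi / pi.sum()

def entropy_rate(a, pi):
    """h = -sum_i pi_i sum_j a_ij log2 a_ij, with 0 log 0 taken as 0."""
    safe_log = np.log2(np.where(a > 0, a, 1.0))
    return float(-np.sum(pi[:, None] * a * safe_log))

a = np.array([[0.9, 0.1],          # a[i, j] = P(next = j | current = i)
              [0.4, 0.6]])
pi = stationary(a)                 # pi is also stationary for a @ a
b_a = entropy_rate(a @ a, pi) - entropy_rate(a, pi)
print(b_a)                         # agrees with the direct h_mu - r_mu check above
\end{verbatim}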

\subsection{Higher order Markov chains}
Second and higher order Markov chains can be treated in a similar way by transforming
to a first order representation of the higher order Markov chain. If we are dealing
with an $N$th order model, this is done by forming a new alphabet of possible observations
consisting of all possible $N$-tuples of symbols from the base alphabet. An observation
in this new model represents a block of $N$ observations from the base model. The next
observation represents the block of $N$ obtained by shifting the previous block along
by one step. The new Markov chain is parameterised by a sparse $K^N\times K^N$
transition matrix $\hat{a}$, where $K$ is the size of the base alphabet. Its
predictive information rate is then
\begin{equation}
  b_{\hat{a}} = h_{\hat{a}^{N+1}} - N h_{\hat{a}},
\end{equation}
where $\hat{a}^{N+1}$ is the $(N+1)$th power of the transition matrix.
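The block construction and the formula can be illustrated in a few lines; the
sketch below (ours) builds $\hat{a}$ for a randomly generated second-order model
with $N=2$ and $K=3$ (both values chosen only for illustration) and evaluates
the expression above.
\begin{verbatim}
# Sketch: first-order (block) representation of a second-order Markov chain
# and its PIR via b = h_{a_hat^(N+1)} - N * h_{a_hat}.
import numpy as np
from itertools import product

K, N = 3, 2
rng = np.random.default_rng(0)

# second-order model: p2[i, j, k] = P(X_t = k | X_{t-2} = i, X_{t-1} = j)
p2 = rng.dirichlet(np.ones(K), size=(K, K))

# block states are N-tuples (x_{t-1}, x_t); transitions must agree on the overlap
states = list(product(range(K), repeat=N))
a_hat = np.zeros((K**N, K**N))
for s, (i, j) in enumerate(states):
    for t, (j2, k) in enumerate(states):
        if j2 == j:                      # overlapping symbol must match
            a_hat[s, t] = p2[i, j, k]

def stationary(a):
    w, v = np.linalg.eig(a.T)
    pi = np.real(v[:, np.argmin(np.abs(w - 1))])
    return pi / pi.sum()

def entropy_rate(a, pi):
    safe_log = np.log2(np.where(a > 0, a, 1.0))
    return float(-np.sum(pi[:, None] * a * safe_log))

pi = stationary(a_hat)
b = entropy_rate(np.linalg.matrix_power(a_hat, N + 1), pi) - N * entropy_rate(a_hat, pi)
print(b)
\end{verbatim}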
\begin{fig}{wundt}
\raisebox{-4em}{\colfig[0.43]{wundt}}
% {\ \shortstack{{\Large$\longrightarrow$}\\ {\scriptsize\emph{exposure}}}\ }
{\ {\large$\longrightarrow$}\ }
perceived value. Repeated exposure sometimes results
in a move to the left along the curve \cite{Berlyne71}.
}
\end{fig}
\section{Information Dynamics in Analysis}

\subsection{Musicological Analysis}
In \cite{AbdallahPlumbley2009}, methods based on the theory described above
were used to analyse two pieces of music in the minimalist style
by Philip Glass: \emph{Two Pages} (1969) and \emph{Gradus} (1968).
The analysis was done using a first-order Markov chain model, with the
enhancement that the transition matrix of the model was allowed to
evolve dynamically as the notes were processed, and was estimated (in
a Bayesian way) as a \emph{distribution} over possible transition matrices,
rather than a point estimate. [Bayesian surprise, other component of IPI].
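The draft does not spell out the model's details at this point; purely as a
loose illustration (not the scheme of \cite{AbdallahPlumbley2009}, which in
addition lets the matrix evolve over time), the following sketch maintains one
Dirichlet count vector per previous note, a simple kind of distribution over
transition matrices, and reports the surprisingness of each incoming note under
the current predictive distribution. The alphabet size and note sequence are
invented.
\begin{verbatim}
# Illustrative sketch only: Bayesian (Dirichlet) beliefs about a first-order
# transition matrix, updated note by note.
import numpy as np

K = 4                                   # size of the note alphabet (illustrative)
counts = np.ones((K, K))                # Dirichlet pseudo-counts, one row per context

def observe(prev, cur):
    """Update beliefs after the transition prev -> cur and return the
    surprisingness -log2 p(cur | prev) under the current estimate."""
    p = counts[prev] / counts[prev].sum()   # predictive distribution for this context
    counts[prev, cur] += 1                  # posterior update
    return -np.log2(p[cur])

notes = [0, 1, 0, 1, 2, 0, 1, 0, 1, 3]
surprise = [observe(a, b) for a, b in zip(notes, notes[1:])]
print(np.round(surprise, 2))
\end{verbatim}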

\begin{fig}{twopages}
\colfig[0.96]{matbase/fig9471} % update from mbc paper
% \colfig[0.97]{matbase/fig72663}\\ % later update from mbc paper (Keith's new picks)
\vspace*{1em}
Markov chains, by a random sampling method. In figure \ref{InfoDynEngine} we see
a representation of how these matrices are distributed in the three-dimensional statistical
space; each of these points corresponds to a transition
matrix.\emph{self-plagiarised}
When we look at the distribution of transition matrices plotted in this space,
we see that it forms an arch shape that is fairly thin. It thus becomes a
reasonable approximation to pretend that it is just a sheet in two dimensions;
and so we stretch out this curved arc into a flat triangle. It is this triangular
system, or as an interactive installation, it involves a mapping to this statistical
space. When the user, through the interface, selects a position within the
triangle, the corresponding transition matrix is returned. Figure \ref{TheTriangle}
shows how the triangle maps to different measures of redundancy, entropy rate
and predictive information rate.\emph{self-plagiarised}
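A rough sketch of how such a population of points can be generated (ours, not
the system's actual code) follows: random transition matrices are sampled and
each is given three coordinates. Here the redundancy coordinate is taken as the
entropy of the stationary distribution minus the entropy rate, and the PIR uses
$b_a = h_{a^2} - h_a$ from above; the sampling scheme and alphabet size are
illustrative assumptions.
\begin{verbatim}
# Sketch: sample transition matrices and compute (redundancy, entropy rate, PIR).
import numpy as np

def stationary(a):
    w, v = np.linalg.eig(a.T)
    pi = np.real(v[:, np.argmin(np.abs(w - 1))])
    return pi / pi.sum()

def entropy(p):
    p = p[p > 0]
    return float(np.sum(-p * np.log2(p)))

def entropy_rate(a, pi):
    safe_log = np.log2(np.where(a > 0, a, 1.0))
    return float(-np.sum(pi[:, None] * a * safe_log))

def coordinates(a):
    pi = stationary(a)
    h = entropy_rate(a, pi)
    redundancy = entropy(pi) - h
    pir = entropy_rate(a @ a, pi) - h
    return redundancy, h, pir

rng = np.random.default_rng(1)
K = 8
points = [coordinates(rng.dirichlet(0.1 * np.ones(K), size=K)) for _ in range(1000)]
# `points` can then be plotted in 3-D, or a user's selection in the triangle
# mapped back to the nearest matrix by distance in this space.
\end{verbatim}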
The corners correspond to three different extremes of predictability and
unpredictability, which could be loosely characterised as `periodicity', `noise'
and `repetition'. Melodies from the `noise' corner have no discernible pattern;
they have high entropy rate, low predictive information rate and low redundancy.
These melodies are essentially totally random. A melody along the `periodicity'
dom. Or, conversely, that are predictable, but not entirely so. This triangular
space allows for an intuitive exploration of expectation and surprise in temporal
sequences based on a simple model of how one might guess the next event given
the previous one.\emph{self-plagiarised}

\begin{figure}
\centering
\includegraphics[width=\linewidth]{figs/mtriscat}
\caption{The population of transition matrices distributed along three axes of
redundancy, entropy rate and predictive information rate (all measured in bits).
The concentrations of points along the redundancy axis correspond
to Markov chains which are roughly periodic with periods of 2 (redundancy 1 bit),
3, 4, \etc all the way to period 8 (redundancy 3 bits). The colour of each point
represents its PIR---note that the highest values are found at intermediate entropy
and redundancy, and that the distribution as a whole makes a curved triangle. Although
not visible in this plot, it is largely hollow in the middle.
\label{InfoDynEngine}}
\end{figure}

Any number of interfaces could be developed for the Melody Triangle. We have
developed two: a standard screen-based interface where a user moves tokens with
a mouse in and around a triangle on screen, and a multi-user interactive

Additionally, the Melody Triangle serves as an effective tool for experimental investigations into musical preference and its relationship to the information dynamics models.

%As the Melody Triangle essentially operates on a stream of symbols, it is possible to apply the Melody Triangle to the design of non-sonic content.

\begin{figure}
\centering
\includegraphics[width=0.9\linewidth]{figs/TheTriangle.pdf}
\caption{The Melody Triangle\label{TheTriangle}}
\end{figure}

\section{Musical Preference and Information Dynamics}
We carried out a preliminary study that sought to identify any correlation between
aesthetic preference and the information-theoretic measures of the Melody
Triangle. In this study, participants were asked to use the screen-based interface,
but it was simplified so that all they could do was move tokens around. To help