comparison: draft.tex @ 34:25846c37a08a
"New bits, rearranged figure placement."

author:   samer
date:     Wed, 14 Mar 2012 13:17:05 +0000
parents:  a9c8580cb8ca
children: 194c7ec7e35d
%and selective or evaluative phases \cite{Boden1990}, and would have
%applications in tools for computer aided composition.


\section{Theoretical review}

\subsection{Entropy and information}
Let $X$ denote some variable whose value is initially unknown to our
hypothetical observer. We will treat $X$ mathematically as a random variable,
with a value to be drawn from some set (or \emph{alphabet}) $\A$ and a
probability distribution representing the observer's beliefs about the
true value of $X$.
In this case, the observer's uncertainty about $X$ can be quantified
as the entropy $H(X)$ of the random variable. For a discrete variable
with probability mass function $p:\A \to [0,1]$, this is
\begin{equation}
	H(X) = \sum_{x\in\A} -p(x) \log p(x) = \expect{-\log p(X)},
\end{equation}
where $\expect{}$ is the expectation operator. The negative log-probability
$\ell(x) = -\log p(x)$ of a particular value $x$ can usefully be thought of as
the \emph{surprisingness} of the value $x$ should it be observed, and
hence the entropy is the expected surprisingness.
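
For example, if $\A = \{a,b,c,d\}$ and the observer's beliefs are
$p = (\frac{1}{2},\frac{1}{4},\frac{1}{8},\frac{1}{8})$, then, taking logarithms
to base 2, the surprisingness values are $1$, $2$, $3$ and $3$ bits respectively, and
\begin{equation}
	H(X) = \frac{1}{2}\cdot 1 + \frac{1}{4}\cdot 2 + \frac{1}{8}\cdot 3 + \frac{1}{8}\cdot 3 = 1.75 \;\mathrm{bits},
\end{equation}
compared with the maximum of $2$ bits for a uniform distribution over the four values.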

Now suppose that the observer receives some new data $\Data$ that
causes a revision of its beliefs about $X$. The \emph{information}
in this new data \emph{about} $X$ can be quantified as the
Kullback-Leibler (KL) divergence between the posterior and prior
distributions $p(x|\Data)$ and $p(x)$ respectively:
\begin{equation}
	\mathcal{I}_{\Data\to X} = D(p_{X|\Data} || p_{X})
		= \sum_{x\in\A} p(x|\Data) \log \frac{p(x|\Data)}{p(x)}.
\end{equation}
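
As a small illustration, suppose the prior over $\A = \{a,b,c,d\}$ is uniform and
the data $\Data$ rule out $c$ and $d$, leaving a posterior that is uniform over
$\{a,b\}$. Then
\begin{equation}
	\mathcal{I}_{\Data\to X}
		= \frac{1}{2}\log\frac{1/2}{1/4} + \frac{1}{2}\log\frac{1/2}{1/4} = \log 2,
\end{equation}
that is, one bit of information about $X$; correspondingly, the observer's
entropy about $X$ drops from two bits to one.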

When there are multiple variables $X_1, X_2$
\etc which the observer believes to be dependent, then the observation of
one may change its beliefs and hence yield information about the
others. The joint and conditional entropies as described in any
textbook on information theory (\eg \cite{CoverThomas}) then quantify
the observer's expected uncertainty about groups of variables given the
values of others. In particular, the \emph{mutual information}
$I(X_1;X_2)$ is both the expected information
in an observation of $X_2$ about $X_1$ and the expected reduction
in uncertainty about $X_1$ after observing $X_2$:
\begin{equation}
	I(X_1;X_2) = H(X_1) - H(X_1|X_2),
\end{equation}
where $H(X_1|X_2) = H(X_1,X_2) - H(X_2)$ is the conditional entropy
of $X_1$ given $X_2$. A little algebra shows that $I(X_1;X_2)=I(X_2;X_1)$,
so the mutual information is symmetric in its arguments. A conditional
form of the mutual information can be formulated analogously:
\begin{equation}
	I(X_1;X_2|X_3) = H(X_1|X_3) - H(X_1|X_2,X_3).
\end{equation}
These relationships between the various entropies and mutual
informations are conveniently visualised in Venn diagram-like \emph{information diagrams}
or I-diagrams \cite{Yeung1991} such as the one in \figrf{venn-example}.
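
A standard example shows how these quantities interact: if $X_1$ and $X_2$ are
independent fair coin flips and $X_3 = X_1 \oplus X_2$ is their exclusive-or,
then $I(X_1;X_2) = 0$ while $I(X_1;X_2|X_3) = 1$ bit, since once $X_3$ is known,
$X_2$ determines $X_1$ completely. The corresponding co-information
$I_{123} = I(X_1;X_2) - I(X_1;X_2|X_3) = -1$ bit, which illustrates that the
central region of an I-diagram, unlike the other regions, can be negative.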

\begin{fig}{venn-example}
	\newcommand\rad{2.2em}%
	\newcommand\circo{circle (3.4em)}%
	\newcommand\labrad{4.3em}
	The total shaded area is the joint entropy $H(X_1,X_2,X_3)$.
	The central area $I_{123}$ is the co-information \cite{McGill1954}.
	Some other information measures are indicated in the legend.
	}
\end{fig}


\subsection{Entropy and information in sequences}

Suppose that $(\ldots,X_{-1},X_0,X_1,\ldots)$ is a stationary sequence of
in uncertainty about the future on learning $X_t$, given the past.
Due to the symmetry of the mutual information, it can also be written
as
\begin{equation}
%	\IXZ_t
	I(X_0;\fut{X}_0|\past{X}_0) = h_\mu - r_\mu,
%	\label{<++>}
\end{equation}
where $r_\mu = H(X_0|\fut{X}_0,\past{X}_0)$
is the \emph{residual} \cite{AbdallahPlumbley2010} or
\emph{erasure} \cite{VerduWeissman2006} entropy rate, so that the PIR is the
difference between the entropy rate and the erasure entropy rate:
$b_\mu = h_\mu - r_\mu$.
These relationships are illustrated in \Figrf{predinfo-bg}, along with
several of the information measures we have discussed so far.

\subsection{Other sequential information measures}

James et al.\ \cite{JamesEllisonCrutchfield2011} study the predictive information
rate and also examine some related measures. In particular, they identify
$\sigma_\mu$, the difference between the multi-information rate and the excess
entropy, as an interesting quantity that measures the predictive benefit of
model-building (that is, maintaining an internal state summarising past
observations in order to make better predictions). They also identify
$w_\mu = \rho_\mu + b_{\mu}$, which they call the \emph{local exogenous
information} rate.

\subsection{First order Markov chains}
These are the simplest non-trivial models to which information dynamics methods
can be applied. In \cite{AbdallahPlumbley2009} we showed that the predictive information
rate can be expressed simply in terms of the entropy rate of the Markov chain.
If we let $a$ denote the transition matrix of the Markov chain and $h_a$ its
entropy rate, then its predictive information rate $b_a$ is
\begin{equation}
	b_a = h_{a^2} - h_a,
\end{equation}
where $a^2 = aa$, the transition matrix squared, is the transition matrix
of the `skip one' Markov chain obtained by leaving out every other observation.

\subsection{Higher order Markov chains}
Second and higher order Markov chains can be treated in a similar way by transforming
them to a first order representation. If we are dealing
with an $N$th order model, this is done by forming a new alphabet of possible
observations consisting of all possible $N$-tuples of symbols from the base
alphabet. An observation in this new model represents a block of $N$ observations
from the base model. The next observation represents the block of $N$ obtained by
shifting the previous block along by one step. If the base alphabet has $K$
symbols, the new Markov chain is parameterised by a sparse $K^N\times K^N$
transition matrix $\hat{a}$, and the predictive information rate can be computed
as
\begin{equation}
	b_{\hat{a}} = h_{\hat{a}^{N+1}} - N h_{\hat{a}},
\end{equation}
where $\hat{a}^{N+1}$ is the $(N+1)$th power of the transition matrix.
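
To make the construction concrete, consider a binary base alphabet ($K=2$) and a
second order model ($N=2$). The new alphabet is $\{00,01,10,11\}$, and a
transition is possible only between overlapping blocks, that is, from $x_1x_2$ to
$x_2x_3$, with probability equal to the original model's probability of $x_3$
given the context $x_1x_2$. Each row of the $4\times 4$ matrix $\hat{a}$
therefore has at most $K=2$ non-zero entries: from state $01$ the chain can move
to $10$ or $11$, but never to $00$.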

\begin{fig}{wundt}
	\raisebox{-4em}{\colfig[0.43]{wundt}}
%	{\ \shortstack{{\Large$\longrightarrow$}\\ {\scriptsize\emph{exposure}}}\ }
	{\ {\large$\longrightarrow$}\ }
	perceived value. Repeated exposure sometimes results
	in a move to the left along the curve \cite{Berlyne71}.
	}
\end{fig}

\section{Information Dynamics in Analysis}

\subsection{Musicological Analysis}
In \cite{AbdallahPlumbley2009}, methods based on the theory described above
were used to analyse two pieces of music in the minimalist style
by Philip Glass: \emph{Two Pages} (1969) and \emph{Gradus} (1968).
The analysis was done using a first-order Markov chain model, with the
enhancement that the transition matrix of the model was allowed to
evolve dynamically as the notes were processed, and was estimated (in
a Bayesian way) as a \emph{distribution} over possible transition matrices,
rather than a point estimate. [Bayesian surprise, other component of IPI].
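
One standard formulation of such a model, sketched here only for concreteness,
places an independent Dirichlet prior on each row of the transition matrix.
Conjugacy then gives a closed-form update: after $n_{ij}$ observed transitions
from symbol $i$ to symbol $j$, the posterior over row $i$ is again Dirichlet with
parameters $\alpha_{ij} + n_{ij}$, and the information carried by each new
observation about the transition matrix itself is the KL divergence between
successive posteriors, in the sense of $\mathcal{I}_{\Data\to X}$ above.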

\begin{fig}{twopages}
	\colfig[0.96]{matbase/fig9471} % update from mbc paper
%	\colfig[0.97]{matbase/fig72663}\\ % later update from mbc paper (Keith's new picks)
	\vspace*{1em}
Markov chains, by a random sampling method. In figure \ref{InfoDynEngine} we see
a representation of how these matrices are distributed in the 3-d statistical
space; each of these points corresponds to a transition
matrix.\emph{self-plagiarised}

When we look at the distribution of transition matrices plotted in this space,
we see that it forms an arch shape that is fairly thin. It thus becomes a
reasonable approximation to pretend that it is just a sheet in two dimensions;
and so we stretch out this curved arc into a flat triangle. It is this triangular
system, or as an interactive installation, it involves a mapping to this statistical
space. When the user, through the interface, selects a position within the
triangle, the corresponding transition matrix is returned. Figure \ref{TheTriangle}
shows how the triangle maps to different measures of redundancy, entropy rate
and predictive information rate.\emph{self-plagiarised}

The three corners correspond to three different extremes of predictability and
unpredictability, which could be loosely characterised as `periodicity', `noise'
and `repetition'. Melodies from the `noise' corner have no discernible pattern;
they have high entropy rate, low predictive information rate and low redundancy.
These melodies are essentially totally random. A melody along the `periodicity'
dom. Or, conversely, that are predictable, but not entirely so. This triangular
space allows for an intuitive exploration of expectation and surprise in temporal
sequences based on a simple model of how one might guess the next event given
the previous one.\emph{self-plagiarised}

\begin{figure}
	\centering
	\includegraphics[width=\linewidth]{figs/mtriscat}
	\caption{The population of transition matrices distributed along three axes of
	redundancy, entropy rate and predictive information rate (all measured in bits).
	The concentrations of points along the redundancy axis correspond
	to Markov chains which are roughly periodic with periods of 2 (redundancy 1 bit),
	3, 4, \etc all the way to period 8 (redundancy 3 bits). The colour of each point
	represents its PIR---note that the highest values are found at intermediate entropy
	and redundancy, and that the distribution as a whole makes a curved triangle. Although
	not visible in this plot, it is largely hollow in the middle.
	\label{InfoDynEngine}}
\end{figure}

Any number of interfaces could be developed for the Melody Triangle. We have
developed two: a standard screen-based interface where a user moves tokens with
a mouse in and around a triangle on screen, and a multi-user interactive

Additionally, the Melody Triangle serves as an effective tool for experimental
investigations into musical preference and its relationship to the information
dynamics models.

%As the Melody Triangle essentially operates on a stream of symbols, it is possible to apply the Melody Triangle to the design of non-sonic content.

\begin{figure}
	\centering
	\includegraphics[width=0.9\linewidth]{figs/TheTriangle.pdf}
	\caption{The Melody Triangle\label{TheTriangle}}
\end{figure}

\section{Musical Preference and Information Dynamics}
We carried out a preliminary study that sought to identify any correlation between
aesthetic preference and the information-theoretic measures of the Melody
Triangle. In this study, participants were asked to use the screen-based interface,
but it was simplified so that all they could do was move tokens around. To help