annotate notes/em.txt @ 167:416b555df3b2 finetune

More on returning fine tuning (but we're treating different shifts of the same pitch as different notes at the moment which is not right)
author Chris Cannam
date Tue, 20 May 2014 17:49:07 +0100
parents f1f8c84339d0
children
rev   line source
Chris@19 1
Chris@19 2 I agree with you - having a look at a model that does not support
Chris@19 3 convolution would help. You'll find attached 'hnmf.m', which is
Chris@19 4 essentially the same model without convolution. So the simplified model
Chris@19 5 is: P(w,t) = P(t) \sum_{s,p} P(w|p,s)P(p|t)P(s|p,t)
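
As a rough illustration of the simplified model above (this is not the attached 'hnmf.m', just a NumPy sketch), P(w,t) can be formed from its factors with a single tensor contraction. The array sizes, the random factors and the helper name `normalise` are assumptions made for the example only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_w, n_p, n_s, n_t = 6, 3, 2, 4  # hypothetical sizes: frequency bins, pitches, sources, frames

def normalise(a, axis):
    """Scale `a` so that it sums to one along `axis`."""
    return a / a.sum(axis=axis, keepdims=True)

# P(w|p,s): one spectral template per (pitch, source); each sums to one over w.
P_w_ps = normalise(rng.random((n_w, n_p, n_s)), axis=0)
# P(p|t): pitch activation; sums to one over p in every frame.
P_p_t = normalise(rng.random((n_p, n_t)), axis=0)
# P(s|p,t): source contribution; sums to one over s for every (p, t).
P_s_pt = normalise(rng.random((n_s, n_p, n_t)), axis=0)
# P(t): relative energy of each frame; sums to one over t.
P_t = normalise(rng.random(n_t), axis=0)

# P(w,t) = P(t) * sum_{s,p} P(w|p,s) P(p|t) P(s|p,t)
P_wt = P_t * np.einsum('wps,pt,spt->wt', P_w_ps, P_p_t, P_s_pt)
```

Because every factor is a normalised distribution, P_wt sums to one over all (w,t), and summing it over w alone recovers P(t).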
Chris@19 6
Chris@19 7 Also, a more recent (and much more efficient) version of the CMJ system
Chris@19 8 converts the model from a convolutive to a linear one, while still keeping
Chris@19 9 the shift-invariance support. That is achieved by having a pre-extracted
Chris@19 10 4-D dictionary that also supports templates that are pre-shifted across
Chris@19 11 log-frequency (so that the system would not need to compute the
Chris@19 12 convolutions during the EM step). I have uploaded the source code on
Chris@19 13 SoundSoftware [i], and you can find the related paper in [ii]. This
Chris@19 14 system has exactly the same performance as the CMJ one, but is much
Chris@19 15 easier to understand/implement, and is over 50 times faster.
Chris@19 16
Chris@19 17 [i] https://code.soundsoftware.ac.uk/projects/amt_mssiplca_fast
Chris@19 18 [ii] http://www.ecmlpkdd2013.org/wp-content/uploads/2013/09/MLMU_benetos.pdf
Chris@19 19
Chris@19 20 > In eqn 12,
Chris@19 21 > Pt(p) =
Chris@19 22 > sum[w,f,s] ( P(p,f,s|w,t) Vw,t ) /
Chris@19 23 > sum[p,w,f,s] ( P(p,f,s|w,t) Vw,t )
Chris@19 24 >
Chris@19 25 > P(p,f,s|w,t) is the result of the E-step (and a time-frequency
Chris@19 26 > distribution), and Vw,t is the input spectrogram (also a
Chris@19 27 > time-frequency distribution), right?
Chris@19 28
Chris@19 29 Right! Basically, P(p,f,s|w,t) is a 5-dimensional matrix, essentially
Chris@19 30 the model without the sums (the sums convert P(p,f,s|w,t) into a 2-D
Chris@19 31 matrix P(w,t)).
Chris@19 32
Chris@19 33 > So I read this as something like: update the pitch probability
Chris@19 34 > distribution for time t so that its value for a pitch p is the ratio
Chris@19 35 > of the sum of the expression P(p,f,s|w,t) Vw,t for *that* pitch
Chris@19 36 > variable to the sum of the same expression across *all* pitch
Chris@19 37 > variables.
Chris@19 38
Chris@19 39 The equation you quoted essentially takes the 5-dimensional quantity
Chris@19 40 P(p,f,s|w,t) Vw,t and marginalises it to P(p,t), i.e. it sums over all
Chris@19 41 other dimensions. All these 'unknown' parameters, e.g. P(s|p,t), are
Chris@19 42 generated from this 5-dimensional posterior distribution.
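
To make the marginalisation concrete, here is a small NumPy sketch of the eqn 12 update as quoted above. The posterior here is a random stand-in rather than the output of a real E-step, and the axis order (p, f, s, w, t) and array sizes are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n_p, n_f, n_s, n_w, n_t = 3, 2, 2, 5, 4  # hypothetical sizes

# Stand-in for P(p,f,s|w,t), axes (p, f, s, w, t).
# For every (w, t) it sums to one over the latent variables (p, f, s).
post = rng.random((n_p, n_f, n_s, n_w, n_t))
post /= post.sum(axis=(0, 1, 2), keepdims=True)

# Stand-in for the input spectrogram V[w, t].
V = rng.random((n_w, n_t))

# P(p,f,s|w,t) * V[w,t]; V broadcasts over the leading (p, f, s) axes.
weighted = post * V

# Numerator of eqn 12: sum over w, f, s, leaving a (p, t) array.
num = weighted.sum(axis=(1, 2, 3))
# Denominator: the same expression additionally summed over p.
den = num.sum(axis=0, keepdims=True)

P_p_t = num / den  # Pt(p): sums to one over p in every frame
```

The dependence on p is simply the first axis of the 5-dimensional array: the numerator keeps that axis (and t) and sums out the rest, while the denominator sums it out too.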
Chris@19 43
Chris@19 44 > But what does it mean to refer to P(p,f,s|w,t) for a single pitch
Chris@19 45 > variable, given that P(p,f,s|w,t) is just a time-frequency
Chris@19 46 > distribution? There doesn't seem to be any dependence on p in it. I
Chris@19 47 > think this is where I'm missing the (hopefully obvious) fundamental
Chris@19 48 > thing.
Chris@19 49
Chris@19 50 P(p,f,s|w,t) is not a time-frequency distribution; it is a 5-dimensional
Chris@19 51 posterior distribution of the 3 unknown model parameters given time and
Chris@19 52 frequency.
Chris@19 53
Chris@19 54 The basic concept of EM is that you have a latent variable in your
Chris@19 55 model, e.g. p; in the E-step, you compute the posterior given the
Chris@19 56 known/input data (e.g. P(p|w,t)). For the M-step, you compute the
Chris@19 57 complete likelihood given the original input, P(p|w,t)Vwt; and you
Chris@19 58 marginalise over the variables that you don't care about, e.g. if you
Chris@19 59 want to find P(p|t), you compute \sum_w P(p|w,t)V_wt; finally, you
Chris@19 60 normalise that result according to your model, so if your model has a
Chris@19 61 P(p|t) component, you normalise so that P(p) for a given timeframe sums
Chris@19 62 to one (this is the denominator in the equation you showed).
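
The E-step / M-step recipe above can be sketched for the simplified (non-convolutive) model P(w|t) = \sum_{s,p} P(w|p,s)P(p|t)P(s|p,t). This is only an illustration under stated assumptions, not the code from 'hnmf.m' or the CMJ system: the templates P(w|p,s) are held fixed, the input V is random, and the function names, sizes and iteration count are all made up for the example.

```python
import numpy as np

def normalise(a, axis):
    """Scale `a` so that it sums to one along `axis`."""
    return a / a.sum(axis=axis, keepdims=True)

def log_likelihood(V, P_w_ps, P_p_t, P_s_pt):
    """sum_{w,t} V[w,t] * log P(w|t) under the current parameters."""
    P_w_t = np.einsum('wps,pt,spt->wt', P_w_ps, P_p_t, P_s_pt)
    return float(np.sum(V * np.log(P_w_t)))

def em_step(V, P_w_ps, P_p_t, P_s_pt):
    """One EM iteration updating P(p|t) and P(s|p,t); templates stay fixed."""
    # Model "without the sums": P(w|p,s) P(p|t) P(s|p,t), axes (w, p, s, t).
    joint = np.einsum('wps,pt,spt->wpst', P_w_ps, P_p_t, P_s_pt)
    # E-step: posterior P(p,s|w,t), normalised over the latent axes p and s.
    post = normalise(joint, axis=(1, 2))
    # Weight the posterior by the input spectrogram V[w,t].
    weighted = post * V[:, None, None, :]
    # M-step for P(p|t): sum out w and s, then normalise over p per frame.
    new_P_p_t = normalise(weighted.sum(axis=(0, 2)), axis=0)
    # M-step for P(s|p,t): sum out w, reorder to (s, p, t), normalise over s.
    new_P_s_pt = normalise(weighted.sum(axis=0).transpose(1, 0, 2), axis=0)
    return new_P_p_t, new_P_s_pt

rng = np.random.default_rng(2)
n_w, n_p, n_s, n_t = 8, 3, 2, 5  # hypothetical sizes

P_w_ps = normalise(rng.random((n_w, n_p, n_s)), axis=0)  # fixed templates
V = rng.random((n_w, n_t))                               # stand-in input
P_p_t = normalise(rng.random((n_p, n_t)), axis=0)        # random initialisation
P_s_pt = normalise(rng.random((n_s, n_p, n_t)), axis=0)  # random initialisation

ll = [log_likelihood(V, P_w_ps, P_p_t, P_s_pt)]
for _ in range(30):
    P_p_t, P_s_pt = em_step(V, P_w_ps, P_p_t, P_s_pt)
    ll.append(log_likelihood(V, P_w_ps, P_p_t, P_s_pt))
```

Each update is the same three moves: weight the posterior by V, sum over the variables you do not care about, and normalise over the variable the component is a distribution of. The list `ll` tracks the log-likelihood, which EM should not decrease from one iteration to the next.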
Chris@19 63