I agree with you - having a look at a model that does not support
convolution would help. You'll find attached 'hnmf.m', which is
essentially the same model without convolution. So the simplified model
is: P(w,t) = P(t) \sum_{s,p} P(w|p,s)P(p|t)P(s|p,t)

Also, a more recent (and much more efficient) version of the CMJ system
converts the model from a convolutive to a linear one, while still
keeping the shift-invariance support. That is achieved by having a
pre-extracted 4-D dictionary that also contains templates pre-shifted
across log-frequency (so that the system does not need to compute the
convolutions during the EM step). I have uploaded the source code on
SoundSoftware [i], and you can find the related paper in [ii]. This
system has exactly the same performance as the CMJ one, but is much
easier to understand/implement, and is over 50 times faster.

[i] https://code.soundsoftware.ac.uk/projects/amt_mssiplca_fast
[ii] http://www.ecmlpkdd2013.org/wp-content/uploads/2013/09/MLMU_benetos.pdf

> In eqn 12,
> Pt(p) =
> sum[w,f,s] ( P(p,f,s|w,t) Vw,t ) /
> sum[p,w,f,s] ( P(p,f,s|w,t) Vw,t )
>
> P(p,f,s|w,t) is the result of the E-step (and a time-frequency
> distribution), and Vw,t is the input spectrogram (also a
> time-frequency distribution), right?

Right! Basically, P(p,f,s|w,t) is a 5-dimensional matrix: essentially
the model without the sums, normalised by P(w,t) (summing the
unnormalised terms over p, f and s gives the 2-D matrix P(w,t)).
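
To make the "model without the sums" idea concrete, here is a small NumPy
sketch for the simplified (non-convolutive) model. It is only an
illustration, not code from hnmf.m: the array sizes, the random values and
the names Pw_ps, Pp_t, Ps_pt and Pt are all assumptions made up for the
example. With the shift variable f added, the same array would have one
more dimension (5-D instead of 4-D).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: W log-frequency bins, P pitches, S sources, T frames.
W, P, S, T = 6, 4, 2, 5

def normalise(x, axis):
    """Scale x so that it sums to one along the given axis."""
    return x / x.sum(axis=axis, keepdims=True)

# Randomly initialised model components (assumed shapes, illustration only).
Pw_ps = normalise(rng.random((W, P, S)), axis=0)   # P(w|p,s), sums to 1 over w
Pp_t = normalise(rng.random((P, T)), axis=0)       # P(p|t),   sums to 1 over p
Ps_pt = normalise(rng.random((S, P, T)), axis=0)   # P(s|p,t), sums to 1 over s
Pt = normalise(rng.random(T), axis=0)              # P(t),     sums to 1 over t

# "The model without the sums": one term per (w, p, s, t).
joint = np.einsum('wps,pt,spt,t->wpst', Pw_ps, Pp_t, Ps_pt, Pt)

# Summing over the latent variables p and s gives the 2-D model P(w,t).
Pwt = joint.sum(axis=(1, 2))

# Dividing by P(w,t) gives the posterior P(p,s|w,t); for every (w,t) it
# sums to one over (p,s), so it is not a time-frequency distribution.
posterior = joint / Pwt[:, None, None, :]
```

The posterior keeps one axis per latent variable, which is why it makes
sense to pick out the slice for a single pitch p.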

> So I read this as something like: update the pitch probability
> distribution for time t so that its value for a pitch p is the ratio
> of the sum of the expression P(p,f,s|w,t) Vw,t for *that* pitch
> variable to the sum of the same expression across *all* pitch
> variables.

The equation you quoted essentially takes the 5-dimensional quantity
P(p,f,s|w,t) V_{w,t} and marginalises it down to a function of p and t,
i.e. it sums over all other dimensions (w, f and s). All these 'unknown'
parameters, e.g. P(s|p,t), are generated from this 5-dimensional
posterior distribution.

> But what does it mean to refer to P(p,f,s|w,t) for a single pitch
> variable, given that P(p,f,s|w,t) is just a time-frequency
> distribution? There doesn't seem to be any dependence on p in it. I
> think this is where I'm missing the (hopefully obvious) fundamental
> thing.

P(p,f,s|w,t) is not a time-frequency distribution; it is a 5-dimensional
posterior distribution of the 3 latent variables (p, f and s) given time
and frequency.

The basic concept of EM is that you have a latent variable in your
model, e.g. p; in the E-step, you compute the posterior given the
known/input data (e.g. P(p|w,t)). For the M-step, you weight that
posterior by the original input, P(p|w,t) V_{w,t}, and you marginalise
over the variables that you don't care about, e.g. if you want to find
P(p|t), you compute \sum_w P(p|w,t) V_{w,t}; finally, you normalise that
result according to your model, so if your model has a P(p|t) component,
you normalise so that P(p|t) for a given time frame sums to one over p
(this is the denominator in the equation you showed).
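
The E-step / M-step recipe above can be sketched in NumPy for the
simplified model P(w,t) = P(t) \sum_{s,p} P(w|p,s)P(p|t)P(s|p,t). Treat
this as a rough illustration under my own assumptions rather than the
hnmf.m implementation: the function name em_simplified, the fixed
dictionary, the random initialisation and taking P(t) directly from the
frame energies are all choices made for the example.

```python
import numpy as np

TINY = np.finfo(float).tiny  # guards against division by zero and log(0)

def normalise(x, axis):
    """Scale x so that it sums to one along the given axis."""
    return x / np.maximum(x.sum(axis=axis, keepdims=True), TINY)

def em_simplified(V, Pw_ps, n_iter=30, seed=0):
    """EM updates for P(p|t) and P(s|p,t) with a fixed dictionary P(w|p,s).

    V     : (W, T) non-negative input spectrogram
    Pw_ps : (W, P, S) templates, each summing to one over w
    Returns P(p|t), P(s|p,t) and the log-likelihood before each update.
    """
    rng = np.random.default_rng(seed)
    W, P, S = Pw_ps.shape
    T = V.shape[1]
    Pp_t = normalise(rng.random((P, T)), axis=0)      # P(p|t)
    Ps_pt = normalise(rng.random((S, P, T)), axis=0)  # P(s|p,t)
    Pt = V.sum(axis=0) / V.sum()                      # P(t): energy per frame
    loglik = []
    for _ in range(n_iter):
        # E-step: posterior P(p,s|w,t). P(t) cancels in the ratio, so it
        # is left out of the un-summed terms.
        joint = np.einsum('wps,pt,spt->wpst', Pw_ps, Pp_t, Ps_pt)
        Pw_t = joint.sum(axis=(1, 2))                 # P(w|t)
        post = joint / np.maximum(Pw_t, TINY)[:, None, None, :]
        loglik.append(float(np.sum(V * np.log(np.maximum(Pt * Pw_t, TINY)))))

        # M-step: weight the posterior by the input, sum out what we do
        # not need, then normalise as the model requires.
        weighted = post * V[:, None, None, :]         # P(p,s|w,t) V_{w,t}
        num_ps = weighted.sum(axis=0)                 # sum over w -> (P, S, T)
        Pp_t = normalise(num_ps.sum(axis=1), axis=0)  # sum over s, norm over p
        Ps_pt = normalise(num_ps.transpose(1, 0, 2), axis=0)  # norm over s
    return Pp_t, Ps_pt, loglik

# Small usage example on random data (sizes are arbitrary).
rng = np.random.default_rng(1)
templates = normalise(rng.random((6, 4, 2)), axis=0)  # hypothetical P(w|p,s)
V = rng.random((6, 5))                                # hypothetical spectrogram
Pp_t, Ps_pt, loglik = em_simplified(V, templates)
```

The line that computes Pp_t is the simplified-model counterpart of the
equation you quoted: the sum over w and s is the numerator, and the
normalisation over p is the denominator.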