| wolffd@0 | 1 function [centres, options, post, errlog] = kmeans(centres, data, options) | 
| wolffd@0 | 2 %KMEANS	Trains a k means cluster model. | 
| wolffd@0 | 3 % | 
| wolffd@0 | 4 %	Description | 
| wolffd@0 | 5 %	 CENTRES = KMEANS(CENTRES, DATA, OPTIONS) uses the batch K-means | 
| wolffd@0 | 6 %	algorithm to set the centres of a cluster model. The matrix DATA | 
| wolffd@0 | 7 %	represents the data which is being clustered, with each row | 
| wolffd@0 | 8 %	corresponding to a vector. The sum of squares error function is used. | 
| wolffd@0 | 9 %	The point at which a local minimum is achieved is returned as | 
| wolffd@0 | 10 %	CENTRES.  The error value at that point is returned in OPTIONS(8). | 
| wolffd@0 | 11 % | 
| wolffd@0 | 12 %	[CENTRES, OPTIONS, POST, ERRLOG] = KMEANS(CENTRES, DATA, OPTIONS) | 
| wolffd@0 | 13 %	also returns the cluster number (in a one-of-N encoding) for each | 
| wolffd@0 | 14 %	data point in POST and a log of the error values after each cycle in | 
| wolffd@0 | 15 %	ERRLOG.    The optional parameters have the following | 
| wolffd@0 | 16 %	interpretations. | 
| wolffd@0 | 17 % | 
| wolffd@0 | 18 %	OPTIONS(1) is set to 1 to display error values; also logs error | 
| wolffd@0 | 19 %	values in the return argument ERRLOG. If OPTIONS(1) is set to 0, then | 
| wolffd@0 | 20 %	only warning messages are displayed.  If OPTIONS(1) is -1, then | 
| wolffd@0 | 21 %	nothing is displayed. | 
| wolffd@0 | 22 % | 
| wolffd@0 | 23 %	OPTIONS(2) is a measure of the absolute precision required for the | 
| wolffd@0 | 24 %	value of CENTRES at the solution.  If the absolute difference between | 
| wolffd@0 | 25 %	the values of CENTRES between two successive steps is less than | 
| wolffd@0 | 26 %	OPTIONS(2), then this condition is satisfied. | 
| wolffd@0 | 27 % | 
| wolffd@0 | 28 %	OPTIONS(3) is a measure of the precision required of the error | 
| wolffd@0 | 29 %	function at the solution.  If the absolute difference between the | 
| wolffd@0 | 30 %	error functions between two successive steps is less than OPTIONS(3), | 
| wolffd@0 | 31 %	then this condition is satisfied. Both this and the previous | 
| wolffd@0 | 32 %	condition must be satisfied for termination. | 
| wolffd@0 | 33 % | 
| wolffd@0 | 34 %	OPTIONS(14) is the maximum number of iterations; default 100. | 
| wolffd@0 | 35 % | 
| wolffd@0 | 36 %	See also | 
| wolffd@0 | 37 %	GMMINIT, GMMEM | 
| wolffd@0 | 38 % | 
| wolffd@0 | 39 | 
| wolffd@0 | 40 %	Copyright (c) Ian T Nabney (1996-2001) | 
| wolffd@0 | 41 | 
| wolffd@0 | 42 [ndata, data_dim] = size(data); | 
| wolffd@0 | 43 [ncentres, dim] = size(centres); | 
| wolffd@0 | 44 | 
| wolffd@0 | 45 if dim ~= data_dim | 
| wolffd@0 | 46   error('Data dimension does not match dimension of centres') | 
| wolffd@0 | 47 end | 
| wolffd@0 | 48 | 
| wolffd@0 | 49 if (ncentres > ndata) | 
| wolffd@0 | 50   error('More centres than data') | 
| wolffd@0 | 51 end | 
| wolffd@0 | 52 | 
| wolffd@0 | 53 % Sort out the options | 
| wolffd@0 | 54 if (options(14)) | 
| wolffd@0 | 55   niters = options(14); | 
| wolffd@0 | 56 else | 
| wolffd@0 | 57   niters = 100; | 
| wolffd@0 | 58 end | 
| wolffd@0 | 59 | 
| wolffd@0 | 60 store = 0; | 
| wolffd@0 | 61 if (nargout > 3) | 
| wolffd@0 | 62   store = 1; | 
| wolffd@0 | 63   errlog = zeros(1, niters); | 
| wolffd@0 | 64 end | 
| wolffd@0 | 65 | 
| wolffd@0 | 66 % Check if centres and posteriors need to be initialised from data | 
| wolffd@0 | 67 if (options(5) == 1) | 
| wolffd@0 | 68   % Do the initialisation | 
| wolffd@0 | 69   perm = randperm(ndata); | 
| wolffd@0 | 70   perm = perm(1:ncentres); | 
| wolffd@0 | 71 | 
| wolffd@0 | 72   % Assign first ncentres (permuted) data points as centres | 
| wolffd@0 | 73   centres = data(perm, :); | 
| wolffd@0 | 74 end | 
| wolffd@0 | 75 % Matrix to make unit vectors easy to construct | 
| wolffd@0 | 76 id = eye(ncentres); | 
| wolffd@0 | 77 | 
| wolffd@0 | 78 % Main loop of algorithm | 
| wolffd@0 | 79 for n = 1:niters | 
| wolffd@0 | 80 | 
| wolffd@0 | 81   % Save old centres to check for termination | 
| wolffd@0 | 82   old_centres = centres; | 
| wolffd@0 | 83 | 
| wolffd@0 | 84   % Calculate posteriors based on existing centres | 
| wolffd@0 | 85   d2 = dist2(data, centres); | 
| wolffd@0 | 86   % Assign each point to nearest centre | 
| wolffd@0 | 87   [minvals, index] = min(d2', [], 1); | 
| wolffd@0 | 88   post = id(index,:); | 
| wolffd@0 | 89 | 
| wolffd@0 | 90   num_points = sum(post, 1); | 
| wolffd@0 | 91   % Adjust the centres based on new posteriors | 
| wolffd@0 | 92   for j = 1:ncentres | 
| wolffd@0 | 93     if (num_points(j) > 0) | 
| wolffd@0 | 94       centres(j,:) = sum(data(find(post(:,j)),:), 1)/num_points(j); | 
| wolffd@0 | 95     end | 
| wolffd@0 | 96   end | 
| wolffd@0 | 97 | 
| wolffd@0 | 98   % Error value is total squared distance from cluster centres | 
| wolffd@0 | 99   e = sum(minvals); | 
| wolffd@0 | 100   if store | 
| wolffd@0 | 101     errlog(n) = e; | 
| wolffd@0 | 102   end | 
| wolffd@0 | 103   if options(1) > 0 | 
| wolffd@0 | 104     fprintf(1, 'Cycle %4d  Error %11.6f\n', n, e); | 
| wolffd@0 | 105   end | 
| wolffd@0 | 106 | 
| wolffd@0 | 107   if n > 1 | 
| wolffd@0 | 108     % Test for termination | 
| wolffd@0 | 109     if max(max(abs(centres - old_centres))) < options(2) & ... | 
| wolffd@0 | 110         abs(old_e - e) < options(3) | 
| wolffd@0 | 111       options(8) = e; | 
| wolffd@0 | 112       return; | 
| wolffd@0 | 113     end | 
| wolffd@0 | 114   end | 
| wolffd@0 | 115   old_e = e; | 
| wolffd@0 | 116 end | 
| wolffd@0 | 117 | 
| wolffd@0 | 118 % If we get here, then we haven't terminated in the given number of | 
| wolffd@0 | 119 % iterations. | 
| wolffd@0 | 120 options(8) = e; | 
| wolffd@0 | 121 if (options(1) >= 0) | 
| wolffd@0 | 122   disp(maxitmess); | 
| wolffd@0 | 123 end | 
| wolffd@0 | 124 |