#!/usr/bin/python
#
# Copyright (C) Christian Thurau, 2010.
# Licensed under the GNU General Public License (GPL).
# http://www.gnu.org/licenses/gpl.txt
"""
PyMF Fuzzy C-means clustering (convex matrix factorization).
Copyright (C) Christian Thurau, 2010. GNU General Public License (GPL).
"""

import numpy as np

import dist
from nmf import NMF

__all__ = ["Cmeans"]

class Cmeans(NMF):
    """
    Cmeans(data, num_bases=4)

    Fuzzy c-means soft clustering. Factorize a data matrix into two matrices
    s.t. F = | data - W*H | is minimal. H is restricted to convexity (columns
    sum to 1); W is the weighted mean over the corresponding samples in data.
    Note that fuzzy c-means minimizes a membership-weighted sum of squared
    distances rather than the Frobenius norm, so ferr should be treated as a
    rough quality measure only.

    Parameters
    ----------
    data : array_like, shape (_data_dimension, _num_samples)
        the input data
    num_bases: int, optional
        Number of bases to compute (column rank of W and row rank of H).
        4 (default)

    Attributes
    ----------
    W : "data_dimension x num_bases" matrix of basis vectors
    H : "num_bases x num_samples" matrix of coefficients
    ferr : Frobenius norm (after calling .factorize())

    Example
    -------
    Applying C-means to a toy data set:

    >>> import numpy as np
    >>> from cmeans import Cmeans
    >>> data = np.array([[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]])
    >>> cmeans_mdl = Cmeans(data, num_bases=2, niter=10)
    >>> cmeans_mdl.initialization()
    >>> cmeans_mdl.factorize()

    The basis vectors are now stored in cmeans_mdl.W, the coefficients in
    cmeans_mdl.H. To compute coefficients for an existing set of basis
    vectors, simply copy W to cmeans_mdl.W and set compute_w to False:

    >>> data = np.array([[1.5], [1.2]])
    >>> W = [[1.0, 0.0], [0.0, 1.0]]
    >>> cmeans_mdl = Cmeans(data, num_bases=2)
    >>> cmeans_mdl.initialization()
    >>> cmeans_mdl.W = W
    >>> cmeans_mdl.factorize(compute_w=False, niter=50)

    The result is a set of coefficients cmeans_mdl.H, s.t. data = W * cmeans_mdl.H.
    """

    def update_h(self):
        # compute fuzzy memberships: H[i,j] is the degree to which sample j
        # belongs to centre i, following the standard c-means update
        # H[i,j] = 1 / sum_k (d(w_i, x_j) / d(w_k, x_j))**(2/(m-1))
        m = 1.75  # fuzzifier; as m -> 1 the assignments approach hard k-means
        tmp_dist = dist.pdist(self.W, self.data, metric='l2') + self._EPS
        self.H[:,:] = 0.0

        # accumulate the sum of distance ratios for every centre i ...
        for i in range(self._num_bases):
            for k in range(self._num_bases):
                self.H[i,:] += (tmp_dist[i,:]/tmp_dist[k,:])**(2.0/(m-1))

        # ... and invert it to obtain the memberships
        self.H = np.where(self.H > 0, 1.0/self.H, 0)

    def update_w(self):
        # each centre is the membership-weighted mean of all samples
        for i in range(self._num_bases):
            tmp = (self.H[i:i+1,:] * self.data).sum(axis=1)
            self.W[:,i] = tmp/(self.H[i,:].sum() + self._EPS)
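
# A minimal, self-contained sketch of the membership update performed in
# Cmeans.update_h() above, vectorized in plain NumPy for illustration. The
# helper name `fcm_memberships` is hypothetical and not part of PyMF; it
# assumes Euclidean distances between the columns of W and data, and the
# same fuzzifier m used above.
def fcm_memberships(W, data, m=1.75, eps=1e-9):
    W = np.asarray(W, dtype=float)
    data = np.asarray(data, dtype=float)
    # d[i, j] = Euclidean distance from centre i (column of W) to sample j;
    # eps guards against division by zero when a sample sits on a centre
    d = np.sqrt(((W[:, :, None] - data[:, None, :])**2).sum(axis=0)) + eps
    # ratio[i, k, j] = (d[i, j] / d[k, j])**(2/(m-1)); summing over k and
    # inverting yields memberships whose columns sum to 1
    ratio = (d[:, None, :] / d[None, :, :])**(2.0/(m - 1.0))
    return 1.0 / ratio.sum(axis=1)

if __name__ == '__main__':
    # quick sanity check of the sketch: membership columns should be convex
    W = np.array([[1.0, 0.0], [0.0, 1.0]])
    data = np.array([[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]])
    H = fcm_memberships(W, data)
    print(H.sum(axis=0))  # each entry should be (close to) 1.0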