annotate toolboxes/MIRtoolbox1.3.2/somtoolbox/som_demo4.m @ 0:cc4b1211e677 tip

initial commit to HG from Changeset: 646 (e263d8a21543) added further path and more save "camirversion.m"
author Daniel Wolff
date Fri, 19 Aug 2016 13:07:06 +0200

%SOM_DEMO4 Data analysis using the SOM.

% Contributed to SOM Toolbox 2.0, February 11th, 2000 by Juha Vesanto
% http://www.cis.hut.fi/projects/somtoolbox/

% Version 1.0beta juuso 071197
% Version 2.0beta juuso 090200 070600

clf reset;
f0 = gcf;
echo on



clc
% ==========================================================
% SOM_DEMO4 - DATA ANALYSIS USING THE SOM
% ==========================================================

% In this demo, the IRIS data set is analysed using the SOM. First,
% the data are read from an ASCII file (please make sure 'iris.data'
% can be found on the current path), normalized, and a map is
% trained. Since the data also have labels, the map is labelled.

try,
  sD = som_read_data('iris.data');
catch
  echo off

  warning('File ''iris.data'' not found. Using simulated data instead.')

  D = randn(50,4);
  D(:,1) = D(:,1)+5; D(:,2) = D(:,2)+3.5;
  D(:,3) = D(:,3)/2+1.5; D(:,4) = D(:,4)/2+0.3;
  D2 = randn(100,4); D2(:,2) = sort(D2(:,2));
  D2(:,1) = D2(:,1)+6.5; D2(:,2) = D2(:,2)+2.8;
  D2(:,3) = D2(:,3)+5; D2(:,4) = D2(:,4)/2+1.5;
  sD = som_data_struct([D; D2],'name','iris (simulated)',...
                       'comp_names',{'SepalL','SepalW','PetalL','PetalW'});
  sD = som_label(sD,'add',[1:50]','Setosa');
  sD = som_label(sD,'add',[51:100]','Versicolor');
  sD = som_label(sD,'add',[101:150]','Virginica');

  echo on
end

sD = som_normalize(sD,'var');
sM = som_make(sD);
sM = som_autolabel(sM,sD,'vote');

pause % Strike any key to visualize the map...

clc
% VISUAL INSPECTION OF THE MAP
% ============================

% The first step in the analysis of the map is visual inspection.
% Here are the U-matrix, component planes and labels (you may
% need to enlarge the figure in order to make out the labels).

som_show(sM,'umat','all','comp',[1:4],'empty','Labels','norm','d');
som_show_add('label',sM.labels,'textsize',8,'textcolor','r','subplot',6);

% From this first visualization, one can see that:
%  - there are essentially two clusters
%  - PetalL and PetalW are highly correlated
%  - SepalL is somewhat correlated to PetalL and PetalW
%  - one cluster corresponds to the Setosa species and exhibits
%    small petals and short but wide sepals
%  - the other cluster corresponds to Virginica and Versicolor,
%    such that Versicolor has smaller leaves (both sepal and
%    petal) than Virginica
%  - inside both clusters, SepalL and SepalW are highly correlated

pause % Strike any key to continue...

% Next, the projection of the data set is investigated. A
% principal component projection is made for the data and applied
% to the map. A color code is created by spreading a colormap over
% the projection. Distance matrix information is extracted from the
% U-matrix and modified using knowledge of the zero-hit
% (interpolative) units. Finally, three visualizations are shown:
% the color code with clustering information and the number of
% hits in each unit, the projection, and the labels.

echo off

f1=figure;
[Pd,V,me,l] = pcaproj(sD,2); Pm = pcaproj(sM,V,me); % PC-projection
Code = som_colorcode(Pm);                           % color coding
hits = som_hits(sM,sD);                             % hits
U = som_umat(sM);                                   % U-matrix
Dm = U(1:2:size(U,1),1:2:size(U,2));                % distance matrix
Dm = 1-Dm(:)/max(Dm(:)); Dm(find(hits==0)) = 0;     % clustering info

subplot(1,3,1)
som_cplane(sM,Code,Dm);
hold on
som_grid(sM,'Label',cellstr(int2str(hits)),...
         'Line','none','Marker','none','Labelcolor','k');
hold off
title('Color code')

subplot(1,3,2)
som_grid(sM,'Coord',Pm,'MarkerColor',Code,'Linecolor','k');
hold on, plot(Pd(:,1),Pd(:,2),'k+'), hold off, axis tight, axis equal
title('PC projection')

subplot(1,3,3)
som_cplane(sM,'none')
hold on
som_grid(sM,'Label',sM.labels,'Labelsize',8,...
         'Line','none','Marker','none','Labelcolor','r');
hold off
title('Labels')

echo on

% From these figures one can see that:
%  - the projection confirms the existence of two different clusters
%  - interpolative units seem to divide the Virginica
%    flowers into two classes, the difference being in the size of
%    sepal leaves

pause % Strike any key to continue...

% Finally, perhaps the most informative figure of all: simple
% scatter plots and histograms of all variables. The species
% information is coded as a fifth variable: 1 for Setosa, 2 for
% Versicolor and 3 for Virginica. Original data points are in the
% upper triangle, map prototype values in the lower triangle, and
% histograms on the diagonal: black for the data set and red for
% the map prototype values. The color coding of the data samples
% has been copied from the map (from the BMU of each sample). Note
% that the variable values have been denormalized.

echo off

% denormalize and add species information
names = sD.comp_names; names{end+1} = 'species';
D = som_denormalize(sD.data,sD); dlen = size(D,1);
s = zeros(dlen,1)+NaN; s(strcmp(sD.labels,'Setosa'))=1;
s(strcmp(sD.labels,'Versicolor'))=2; s(strcmp(sD.labels,'Virginica'))=3;
D = [D, s];
M = som_denormalize(sM.codebook,sM); munits = size(M,1);
s = zeros(munits,1)+NaN; s(strcmp(sM.labels,'Setosa'))=1;
s(strcmp(sM.labels,'Versicolor'))=2; s(strcmp(sM.labels,'Virginica'))=3;
M = [M, s];

f2=figure;

% color coding copied from the map
bmus = som_bmus(sM,sD); Code_data = Code(bmus,:);

k=1;
for i=1:5, for j=1:5,
    if i<j, i1=i; i2=j; else i1=j; i2=i; end
    subplot(5,5,k); cla
    if i<j,
      som_grid('rect',[dlen 1],'coord',D(:,[i1 i2]),...
               'Line','none','MarkerColor',Code_data,'Markersize',2);
      title(sprintf('%s vs. %s',names{i1},names{i2}))
    elseif i>j,
      som_grid(sM,'coord',M(:,[i1 i2]),...
               'markersize',2,'MarkerColor',Code);
      title(sprintf('%s vs. %s',names{i1},names{i2}))
    else
      if i1<5, b = 10; else b = 3; end
      [nd,x] = hist(D(:,i1),b); nd=nd/sum(nd);
      nm = hist(M(:,i1),x); nm = nm/sum(nm);
      h=bar(x,nd,0.8); set(h,'EdgeColor','none','FaceColor','k');
      hold on
      h=bar(x,nm,0.3); set(h,'EdgeColor','none','FaceColor','r');
      hold off
      title(names{i1})
    end
    k=k+1;
  end
end

echo on

% This visualization shows quite a lot of information:
% distributions of single variables and of pairs of variables, both
% in the data and on the map. If the number of variables were even
% slightly larger, a very big display would be needed for this to
% remain convenient.

% From this visualization we can confirm many of the earlier
% conclusions, for example:
%  - there are two clusters: 'Setosa' (blue, dark green) and
%    'Virginica'/'Versicolor' (light green, yellow, reds)
%    (see almost any of the subplots)
%  - PetalL and PetalW have a high linear correlation (see
%    subplots 4,3 and 3,4)
%  - SepalL is correlated (at least in the bigger cluster) with
%    PetalL and PetalW (in subplots 1,3 1,4 3,1 and 4,1)
%  - SepalL and SepalW have a clear linear correlation, but it
%    is slightly different for the two main clusters (in
%    subplots 2,1 and 1,2)

pause % Strike any key to cluster the map...

close(f1), close(f2), figure(f0), clf

clc
% CLUSTERING OF THE MAP
% =====================

% Visual inspection already hinted that there are at least two
% clusters in the data, and that the properties of the clusters are
% different from each other (especially the relation of SepalL and
% SepalW). For further investigation, the map needs to be
% partitioned.

% Here, the KMEANS_CLUSTERS function is used to find an initial
% partitioning. The plot shows the Davies-Bouldin clustering
% index, which is minimized by the best clustering.

subplot(1,3,1)
[c,p,err,ind] = kmeans_clusters(sM, 7); % find at most 7 clusters
plot(1:length(ind),ind,'x-')
[dummy,i] = min(ind)
cl = p{i};

% The Davies-Bouldin index seems to indicate that there are
% two clusters on the map. Here are the clustering info
% calculated previously and the partitioning result:

subplot(1,3,2)
som_cplane(sM,Code,Dm)
subplot(1,3,3)
som_cplane(sM,cl)

% You could also use the function SOM_SELECT to manually make or
% modify the partitioning.
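
% For example, the k-means partitioning above could be adjusted by
% hand as sketched below (commented out, since the cluster indices
% are hypothetical; for interactive selection use SOM_SELECT):
%
%   cl2 = cl;               % copy the k-means partitioning
%   cl2(cl2==3) = 2;        % e.g. merge cluster 3 into cluster 2
%   som_cplane(sM,cl2)      % show the modified partitioning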

% After this, the analysis would proceed with summarization of the
% results, and analysis of each cluster one at a time.
% Unfortunately, you have to do that yourself. The SOM Toolbox does
% not, yet, have functions for those purposes.
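
% As a small sketch of such a summary (not part of the original
% demo), the denormalized prototype statistics of each cluster can
% be printed out. This assumes 'cl' holds one cluster index per map
% unit, as returned by KMEANS_CLUSTERS above.

Dn = som_denormalize(sM.codebook,sM); % prototypes in original units
for cidx = 1:max(cl),
  fprintf('cluster %d: %d map units\n', cidx, sum(cl==cidx));
  disp(mean(Dn(cl==cidx,:),1));       % mean prototype of the cluster
end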

pause % Strike any key to continue...


clf
clc
% MODELING
% ========

% One can also build models on top of the SOM. Typically, these
% models are simple local or nearest-neighbor models.

% Here, the SOM is used for probability density estimation. Each map
% prototype is the center of a Gaussian kernel, the parameters
% of which are estimated from the data. The Gaussian mixture
% model is estimated with the function SOM_ESTIMATE_GMM and the
% probabilities can be calculated with SOM_PROBABILITY_GMM.

[K,P] = som_estimate_gmm(sM,sD);
[pd,Pdm,pmd] = som_probability_gmm(sD,sM,K,P);

% Here is the probability density function value for the first data
% sample (x = sD.data(1,:)) in terms of each map unit (m):

som_cplane(sM,Pdm(:,1))
colorbar
title('p(x|m)')

pause % Strike any key to continue...

% Here, the SOM is used for classification. Although the SOM can be
% used for classification as such, one has to remember that it does
% not utilize class information at all, and thus its results are
% inherently suboptimal. However, with small modifications, the
% network can take the class into account. The function
% SOM_SUPERVISED does this.

% Learning vector quantization (LVQ) is an algorithm that is very
% similar to the SOM in many respects. However, it is specifically
% designed for classification. In the SOM Toolbox, the functions
% LVQ1 and LVQ3 implement two versions of this algorithm.
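
% An illustrative LVQ sketch (commented out; the argument order --
% labeled codebook, labeled data, training length, learning rate --
% is an assumption, so check HELP LVQ1 before relying on it):
%
%   sMlvq = som_supervised(sD,'small');                  % labeled initial codebook
%   sMlvq = lvq1(sMlvq, sD, 50*length(sD.labels), 0.05); % fine-tune with LVQ1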

% Here, the function SOM_SUPERVISED is used to create a classifier
% for the IRIS data set:

sM = som_supervised(sD,'small');

som_show(sM,'umat','all');
som_show_add('label',sM.labels,'TextSize',8,'TextColor','r')

sD2 = som_label(sD,'clear','all');
sD2 = som_autolabel(sD2,sM);       % classification
ok = strcmp(sD2.labels,sD.labels); % correct classifications
100*(1-sum(ok)/length(ok))         % error percentage (%)

echo off