comparison toolboxes/MIRtoolbox1.3.2/somtoolbox/som_demo4.m @ 0:e9a9cd732c1e tip

first hg version after svn

author | wolffd
---|---
date | Tue, 10 Feb 2015 15:05:51 +0000
parents |
children |
comparison | -1:000000000000 → 0:e9a9cd732c1e

%SOM_DEMO4 Data analysis using the SOM.

% Contributed to SOM Toolbox 2.0, February 11th, 2000 by Juha Vesanto
% http://www.cis.hut.fi/projects/somtoolbox/

% Version 1.0beta juuso 071197
% Version 2.0beta juuso 090200 070600

clf reset;
f0 = gcf;
echo on

clc
% ==========================================================
% SOM_DEMO4 - DATA ANALYSIS USING THE SOM
% ==========================================================

% In this demo, the IRIS data set is analysed using the SOM. First,
% the data is read from an ASCII file (make sure it can be found on
% the current path), normalized, and a map is trained. Since the
% data also has labels, the map is labelled.

try
  sD = som_read_data('iris.data');
catch
  echo off

  warning('File ''iris.data'' not found. Using simulated data instead.')

  D = randn(50,4);
  D(:,1) = D(:,1)+5;     D(:,2) = D(:,2)+3.5;
  D(:,3) = D(:,3)/2+1.5; D(:,4) = D(:,4)/2+0.3;
  D2 = randn(100,4); D2(:,2) = sort(D2(:,2));
  D2(:,1) = D2(:,1)+6.5; D2(:,2) = D2(:,2)+2.8;
  D2(:,3) = D2(:,3)+5;   D2(:,4) = D2(:,4)/2+1.5;
  sD = som_data_struct([D; D2],'name','iris (simulated)',...
                       'comp_names',{'SepalL','SepalW','PetalL','PetalW'});
  sD = som_label(sD,'add',[1:50]','Setosa');
  sD = som_label(sD,'add',[51:100]','Versicolor');
  sD = som_label(sD,'add',[101:150]','Virginica');

  echo on
end

sD = som_normalize(sD,'var');
sM = som_make(sD);
sM = som_autolabel(sM,sD,'vote');

pause % Strike any key to visualize the map...

clc
% VISUAL INSPECTION OF THE MAP
% ============================

% The first step in the analysis of the map is visual inspection.
% Here are the U-matrix, the component planes, and the labels (you
% may need to enlarge the figure in order to make out the labels).

som_show(sM,'umat','all','comp',[1:4],'empty','Labels','norm','d');
som_show_add('label',sM.labels,'textsize',8,'textcolor','r','subplot',6);

% From this first visualization, one can see that:
%  - there are essentially two clusters
%  - PetalL and PetalW are highly correlated
%  - SepalL is somewhat correlated to PetalL and PetalW
%  - one cluster corresponds to the Setosa species and exhibits
%    small petals and short but wide sepals
%  - the other cluster corresponds to Virginica and Versicolor
%    such that Versicolor has smaller leaves (both sepal and
%    petal) than Virginica
%  - inside both clusters, SepalL and SepalW are highly correlated

pause % Strike any key to continue...

% Next, the projection of the data set is investigated. A
% principal component projection is made for the data and applied
% to the map. The color coding is created by spreading a colormap
% over the projection. Distance matrix information is extracted
% from the U-matrix and modified by knowledge of the zero-hit
% (interpolative) units. Finally, three visualizations are shown:
% the color code (with clustering information and the number of
% hits in each unit), the projection, and the labels.

echo off

f1 = figure;
[Pd,V,me,l] = pcaproj(sD,2); Pm = pcaproj(sM,V,me); % PC projection
Code = som_colorcode(Pm);                           % color coding
hits = som_hits(sM,sD);                             % hits
U = som_umat(sM);                                   % U-matrix
Dm = U(1:2:size(U,1),1:2:size(U,2));                % distance matrix
Dm = 1-Dm(:)/max(Dm(:)); Dm(hits==0) = 0;           % clustering info

subplot(1,3,1)
som_cplane(sM,Code,Dm);
hold on
som_grid(sM,'Label',cellstr(int2str(hits)),...
         'Line','none','Marker','none','Labelcolor','k');
hold off
title('Color code')

subplot(1,3,2)
som_grid(sM,'Coord',Pm,'MarkerColor',Code,'Linecolor','k');
hold on, plot(Pd(:,1),Pd(:,2),'k+'), hold off, axis tight, axis equal
title('PC projection')

subplot(1,3,3)
som_cplane(sM,'none')
hold on
som_grid(sM,'Label',sM.labels,'Labelsize',8,...
         'Line','none','Marker','none','Labelcolor','r');
hold off
title('Labels')

echo on

% From these figures one can see that:
%  - the projection confirms the existence of two different clusters
%  - interpolative units seem to divide the Virginica
%    flowers into two classes, the difference being in the size of
%    sepal leaves

pause % Strike any key to continue...

% Finally, perhaps the most informative figure of all: simple
% scatter plots and histograms of all variables. The species
% information is coded as a fifth variable: 1 for Setosa, 2 for
% Versicolor and 3 for Virginica. Original data points are in the
% upper triangle, map prototype values in the lower triangle, and
% histograms on the diagonal: black for the data set and red for
% the map prototype values. The color coding of the data samples
% has been copied from the map (from the BMU of each sample). Note
% that the variable values have been denormalized.

echo off

% denormalize and add species information
names = sD.comp_names; names{end+1} = 'species';
D = som_denormalize(sD.data,sD); dlen = size(D,1);
s = zeros(dlen,1)+NaN; s(strcmp(sD.labels,'Setosa'))=1;
s(strcmp(sD.labels,'Versicolor'))=2; s(strcmp(sD.labels,'Virginica'))=3;
D = [D, s];
M = som_denormalize(sM.codebook,sM); munits = size(M,1);
s = zeros(munits,1)+NaN; s(strcmp(sM.labels,'Setosa'))=1;
s(strcmp(sM.labels,'Versicolor'))=2; s(strcmp(sM.labels,'Virginica'))=3;
M = [M, s];

f2 = figure;

% color coding copied from the map
bmus = som_bmus(sM,sD); Code_data = Code(bmus,:);

k = 1;
for i=1:5,
  for j=1:5,
    if i<j, i1=i; i2=j; else i1=j; i2=i; end
    subplot(5,5,k); cla
    if i<j, % upper triangle: data samples
      som_grid('rect',[dlen 1],'coord',D(:,[i1 i2]),...
               'Line','none','MarkerColor',Code_data,'Markersize',2);
      title(sprintf('%s vs. %s',names{i1},names{i2}))
    elseif i>j, % lower triangle: map prototypes
      som_grid(sM,'coord',M(:,[i1 i2]),...
               'markersize',2,'MarkerColor',Code);
      title(sprintf('%s vs. %s',names{i1},names{i2}))
    else % diagonal: histograms of data (black) and prototypes (red)
      if i1<5, b = 10; else b = 3; end
      [nd,x] = hist(D(:,i1),b); nd = nd/sum(nd);
      nm = hist(M(:,i1),x);     nm = nm/sum(nm);
      h = bar(x,nd,0.8); set(h,'EdgeColor','none','FaceColor','k');
      hold on
      h = bar(x,nm,0.3); set(h,'EdgeColor','none','FaceColor','r');
      hold off
      title(names{i1})
    end
    k = k+1;
  end
end

echo on

% This visualization shows quite a lot of information:
% distributions of single variables and of pairs of variables, both
% in the data and on the map. With even slightly more variables, a
% very large display would be needed for it to remain convenient.

% From this visualization we can confirm many of the earlier
% conclusions, for example:
%  - there are two clusters: 'Setosa' (blue, dark green) and
%    'Virginica'/'Versicolor' (light green, yellow, reds)
%    (see almost any of the subplots)
%  - PetalL and PetalW have a high linear correlation (see
%    subplots 4,3 and 3,4)
%  - SepalL is correlated (at least in the bigger cluster) with
%    PetalL and PetalW (in subplots 1,3 1,4 3,1 and 4,1)
%  - SepalL and SepalW have a clear linear correlation, but it
%    is slightly different for the two main clusters (in
%    subplots 2,1 and 1,2)

pause % Strike any key to cluster the map...

close(f1), close(f2), figure(f0), clf

clc
% CLUSTERING OF THE MAP
% =====================

% Visual inspection already hinted that there are at least two
% clusters in the data, and that the properties of the clusters
% differ from each other (especially the relation between SepalL
% and SepalW). For further investigation, the map needs to be
% partitioned.

% Here, the KMEANS_CLUSTERS function is used to find an initial
% partitioning. The plot shows the Davies-Bouldin clustering
% index, which is minimized by the best clustering.

subplot(1,3,1)
[c,p,err,ind] = kmeans_clusters(sM, 7); % find at most 7 clusters
plot(1:length(ind),ind,'x-')
[dummy,i] = min(ind)
cl = p{i};

% The Davies-Bouldin index seems to indicate that there are
% two clusters on the map. Shown below are the clustering info
% calculated previously and the partitioning result:

subplot(1,3,2)
som_cplane(sM,Code,Dm)
subplot(1,3,3)
som_cplane(sM,cl)

% You could also use the function SOM_SELECT to make or modify
% the partitioning manually.

% After this, the analysis would proceed with a summarization of
% the results and an analysis of each cluster one at a time.
% Unfortunately, you have to do that yourself; the SOM Toolbox does
% not yet have functions for those purposes.

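% As a starting point, here is a minimal sketch of such a summary
% (not part of the original demo): the denormalized mean of the
% prototype vectors in each cluster of the partitioning above.
Md = som_denormalize(sM.codebook,sM);
for c = 1:max(cl),
  fprintf('cluster %d (%d units):',c,sum(cl==c));
  fprintf(' %.2f',mean(Md(cl==c,:),1));
  fprintf('\n');
end
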
pause % Strike any key to continue...

clf
clc
% MODELING
% ========

% One can also build models on top of the SOM. Typically, these
% models are simple local or nearest-neighbor models, as in the
% small sketch below.

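% As an illustration (an assumption, not part of the original demo),
% a nearest-prototype model: a sample is mapped to its BMU and the
% (denormalized) prototype vector is read out as the local response.
x = sD.data(1,:);                        % an example (already normalized) sample
bmu = som_bmus(sM,x);                    % index of the nearest prototype
som_denormalize(sM.codebook(bmu,:),sM)   % local model response
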
% Here, the SOM is used for probability density estimation. Each map
% prototype is the center of a Gaussian kernel, the parameters
% of which are estimated from the data. The Gaussian mixture
% model is estimated with the function SOM_ESTIMATE_GMM, and the
% probabilities can be calculated with SOM_PROBABILITY_GMM.

[K,P] = som_estimate_gmm(sM,sD);
[pd,Pdm,pmd] = som_probability_gmm(sD,sM,K,P);

% Here is the probability density function value for the first data
% sample (x = sD.data(1,:)) in terms of each map unit (m):

som_cplane(sM,Pdm(:,1))
colorbar
title('p(x|m)')

pause % Strike any key to continue...

% Here, the SOM is used for classification. Although the SOM can be
% used for classification as such, one has to remember that it does
% not utilize class information at all, and thus its results are
% inherently suboptimal. However, with small modifications, the
% network can take the class into account. The function
% SOM_SUPERVISED does this.

% Learning vector quantization (LVQ) is an algorithm that is very
% similar to the SOM in many respects. However, it is specifically
% designed for classification. In the SOM Toolbox, the functions
% LVQ1 and LVQ3 implement two versions of this algorithm (a usage
% sketch follows the classification example below).

% Here, the function SOM_SUPERVISED is used to create a classifier
% for the IRIS data set:

sM = som_supervised(sD,'small');

som_show(sM,'umat','all');
som_show_add('label',sM.labels,'TextSize',8,'TextColor','r')

sD2 = som_label(sD,'clear','all');
sD2 = som_autolabel(sD2,sM);       % classification
ok = strcmp(sD2.labels,sD.labels); % correct classifications
100*(1-sum(ok)/length(ok))         % error percentage (%)

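% A hedged sketch (not part of the original demo): the labeled
% codebook could be fine-tuned with LVQ1. This assumes the toolbox
% interface sM = lvq1(sM,D,rlen,alpha); the training length and the
% initial learning rate below are illustrative guesses only.
sM2 = lvq1(sM,sD,30*size(sM.codebook,1),0.05);
sD3 = som_autolabel(som_label(sD,'clear','all'),sM2); % re-classify
100*(1-sum(strcmp(sD3.labels,sD.labels))/length(sD3.labels)) % error percentage (%) after LVQ1
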
echo off