%SOM_DEMO4 Data analysis using the SOM.

% Contributed to SOM Toolbox 2.0, February 11th, 2000 by Juha Vesanto
% http://www.cis.hut.fi/projects/somtoolbox/

% Version 1.0beta juuso 071197
% Version 2.0beta juuso 090200 070600

clf reset;
f0 = gcf;
echo on


clc
% ==========================================================
% SOM_DEMO4 - DATA ANALYSIS USING THE SOM
% ==========================================================

% In this demo, the IRIS data set is analysed using the SOM. First,
% the data is read from an ASCII file (please make sure it can be
% found on the current path), normalized, and a map is trained.
% Since the data also has labels, the map is labelled.

try
  sD = som_read_data('iris.data');
catch
  echo off

  warning('File ''iris.data'' not found. Using simulated data instead.')

  D = randn(50,4);
  D(:,1) = D(:,1)+5;     D(:,2) = D(:,2)+3.5;
  D(:,3) = D(:,3)/2+1.5; D(:,4) = D(:,4)/2+0.3;
  D2 = randn(100,4); D2(:,2) = sort(D2(:,2));
  D2(:,1) = D2(:,1)+6.5; D2(:,2) = D2(:,2)+2.8;
  D2(:,3) = D2(:,3)+5;   D2(:,4) = D2(:,4)/2+1.5;
  sD = som_data_struct([D; D2],'name','iris (simulated)',...
                       'comp_names',{'SepalL','SepalW','PetalL','PetalW'});
  sD = som_label(sD,'add',[1:50]','Setosa');
  sD = som_label(sD,'add',[51:100]','Versicolor');
  sD = som_label(sD,'add',[101:150]','Virginica');

  echo on
end

sD = som_normalize(sD,'var');
sM = som_make(sD);
sM = som_autolabel(sM,sD,'vote');

pause % Strike any key to visualize the map...

clc
% VISUAL INSPECTION OF THE MAP
% ============================

% The first step in the analysis of the map is visual inspection.
% Here are the U-matrix, the component planes and the labels (you
% may need to enlarge the figure in order to make out the labels).

som_show(sM,'umat','all','comp',[1:4],'empty','Labels','norm','d');
som_show_add('label',sM.labels,'textsize',8,'textcolor','r','subplot',6);

% From this first visualization, one can see that:
%  - there are essentially two clusters
%  - PetalL and PetalW are highly correlated
%  - SepalL is somewhat correlated with PetalL and PetalW
%  - one cluster corresponds to the Setosa species and exhibits
%    small petals and short but wide sepals
%  - the other cluster corresponds to Virginica and Versicolor,
%    such that Versicolor has smaller leaves (both sepal and
%    petal) than Virginica
%  - inside both clusters, SepalL and SepalW are highly correlated

pause % Strike any key to continue...

% Next, the projection of the data set is investigated. A
% principal component projection is made for the data and applied
% to the map. A color code is then created by spreading a colormap
% over the projection. Distance matrix information is extracted
% from the U-matrix and modified by knowledge of the zero-hit
% (interpolative) units.
% Finally, three visualizations are shown: the color code, with
% clustering information and the number of hits in each unit; the
% projection; and the labels.

echo off

f1=figure;
[Pd,V,me,l] = pcaproj(sD,2); Pm = pcaproj(sM,V,me); % PC-projection
Code = som_colorcode(Pm);                % color coding
hits = som_hits(sM,sD);                  % hits
U = som_umat(sM);                        % U-matrix
Dm = U(1:2:size(U,1),1:2:size(U,2));     % distance matrix
Dm = 1-Dm(:)/max(Dm(:)); Dm(find(hits==0)) = 0; % clustering info

subplot(1,3,1)
som_cplane(sM,Code,Dm);
hold on
som_grid(sM,'Label',cellstr(int2str(hits)),...
         'Line','none','Marker','none','Labelcolor','k');
hold off
title('Color code')

subplot(1,3,2)
som_grid(sM,'Coord',Pm,'MarkerColor',Code,'Linecolor','k');
hold on, plot(Pd(:,1),Pd(:,2),'k+'), hold off, axis tight, axis equal
title('PC projection')

subplot(1,3,3)
som_cplane(sM,'none')
hold on
som_grid(sM,'Label',sM.labels,'Labelsize',8,...
         'Line','none','Marker','none','Labelcolor','r');
hold off
title('Labels')

echo on

% From these figures one can see that:
%  - the projection confirms the existence of two different clusters
%  - interpolative units seem to divide the Virginica
%    flowers into two classes, the difference being in the size of
%    the sepal leaves

pause % Strike any key to continue...

% Finally, perhaps the most informative figure of all: simple
% scatter plots and histograms of all variables. The species
% information is coded as a fifth variable: 1 for Setosa, 2 for
% Versicolor and 3 for Virginica. Original data points are in the
% upper triangle, map prototype values are in the lower triangle,
% and histograms are on the diagonal: black for the data set and
% red for the map prototype values. The color coding of the data
% samples has been copied from the map (from the BMU of each
% sample). Note that the variable values have been denormalized.

echo off

% denormalize and add species information
names = sD.comp_names; names{end+1} = 'species';
D = som_denormalize(sD.data,sD); dlen = size(D,1);
s = zeros(dlen,1)+NaN; s(strcmp(sD.labels,'Setosa'))=1;
s(strcmp(sD.labels,'Versicolor'))=2; s(strcmp(sD.labels,'Virginica'))=3;
D = [D, s];
M = som_denormalize(sM.codebook,sM); munits = size(M,1);
s = zeros(munits,1)+NaN; s(strcmp(sM.labels,'Setosa'))=1;
s(strcmp(sM.labels,'Versicolor'))=2; s(strcmp(sM.labels,'Virginica'))=3;
M = [M, s];

f2=figure;

% color coding copied from the map
bmus = som_bmus(sM,sD); Code_data = Code(bmus,:);

k=1;
for i=1:5,
  for j=1:5,
    if i<j, i1=i; i2=j; else i1=j; i2=i; end
    subplot(5,5,k); cla
    if i<j, % upper triangle: data samples, colored by their BMUs
      scatter(D(:,i1),D(:,i2),6,Code_data,'filled');
      title(sprintf('%s vs. %s',names{i1},names{i2}))
    elseif i>j, % lower triangle: map prototypes
      som_grid(sM,'coord',M(:,[i1 i2]),...
               'markersize',2,'MarkerColor',Code);
      title(sprintf('%s vs. %s',names{i1},names{i2}))
    else % diagonal: histograms (black = data, red = map prototypes)
      if i1<5, b = 10; else b = 3; end
      [nd,x] = hist(D(:,i1),b); nd=nd/sum(nd);
      nm = hist(M(:,i1),x); nm = nm/sum(nm);
      h=bar(x,nd,0.8); set(h,'EdgeColor','none','FaceColor','k');
      hold on
      h=bar(x,nm,0.3); set(h,'EdgeColor','none','FaceColor','r');
      hold off
      title(names{i1})
    end
    k=k+1;
  end
end

echo on

% This visualization shows quite a lot of information: the
% distributions of single variables and of pairs of variables, both
% in the data and on the map. If the number of variables were even
% slightly higher, a very large display would be needed for the
% figure to remain usable.

% From this visualization we can confirm many of the earlier
% conclusions, for example:
%  - there are two clusters: 'Setosa' (blue, dark green) and
%    'Virginica'/'Versicolor' (light green, yellow, reds)
%    (see almost any of the subplots)
%  - PetalL and PetalW have a high linear correlation (see
%    subplots 4,3 and 3,4)
%  - SepalL is correlated (at least in the bigger cluster) with
%    PetalL and PetalW (see subplots 1,3 1,4 3,1 and 4,1)
%  - SepalL and SepalW have a clear linear correlation, but it
%    is slightly different for the two main clusters (see
%    subplots 2,1 and 1,2)

pause % Strike any key to cluster the map...

close(f1), close(f2), figure(f0), clf

clc
% CLUSTERING OF THE MAP
% =====================

% Visual inspection has already hinted that there are at least two
% clusters in the data, and that the properties of the clusters are
% different from each other (especially the relation of SepalL and
% SepalW). For further investigation, the map needs to be
% partitioned.

% Here, the KMEANS_CLUSTERS function is used to find an initial
% partitioning. The plot shows the Davies-Bouldin clustering
% index, which is minimized by the best partitioning.

subplot(1,3,1)
[c,p,err,ind] = kmeans_clusters(sM, 7); % find at most 7 clusters
plot(1:length(ind),ind,'x-')
[dummy,i] = min(ind)
cl = p{i};

% The Davies-Bouldin index seems to indicate that there are
% two clusters on the map. Here are the clustering info
% calculated previously and the partitioning result:

subplot(1,3,2)
som_cplane(sM,Code,Dm)
subplot(1,3,3)
som_cplane(sM,cl)

% You could also use the function SOM_SELECT to manually make or
% modify the partitioning.

% After this, the analysis would proceed with a summary of the
% results and an analysis of each cluster one at a time.
% Unfortunately, you have to do that yourself; the SOM Toolbox does
% not, yet, have functions for those purposes. A minimal sketch of
% such a per-cluster summary is given below.
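
% The sketch propagates the k-means partitioning CL from the map
% units to the data samples through their BMUs, and prints the mean
% of each (denormalized) variable in each cluster. The variable
% names used below (sample_cl, Dd, cmeans) are illustrative only,
% not part of the Toolbox.

sample_cl = cl(som_bmus(sM,sD));  % cluster index of each data sample
Dd = som_denormalize(sD.data,sD); % data in the original units
for ci = 1:max(cl),
  cmeans = mean(Dd(sample_cl==ci,:),1);  % per-cluster variable means
  fprintf('cluster %d (n=%d):',ci,sum(sample_cl==ci));
  fprintf(' %.2f',cmeans); fprintf('\n');
end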

pause % Strike any key to continue...

clf
clc
% MODELING
% ========

% One can also build models on top of the SOM. Typically, these
% models are simple local or nearest-neighbor models.

% Here, the SOM is used for probability density estimation. Each
% map prototype is the center of a Gaussian kernel, the parameters
% of which are estimated from the data. The Gaussian mixture model
% is estimated with the function SOM_ESTIMATE_GMM, and the
% probabilities can be calculated with SOM_PROBABILITY_GMM.

[K,P] = som_estimate_gmm(sM,sD);
[pd,Pdm,pmd] = som_probability_gmm(sD,sM,K,P); % p(x), p(x|m), p(m|x)

% Here is the probability density function value of the first data
% sample (x = sD.data(1,:)) in terms of each map unit (m):

som_cplane(sM,Pdm(:,1))
colorbar
title('p(x|m)')

pause % Strike any key to continue...

% Here, the SOM is used for classification. Although the SOM can be
% used for classification as such, one has to remember that it does
% not utilize class information at all, and thus its results are
% inherently suboptimal. However, with small modifications, the
% network can take the classes into account. The function
% SOM_SUPERVISED does this.

% Learning vector quantization (LVQ) is an algorithm that is very
% similar to the SOM in many respects. However, it is specifically
% designed for classification. In the SOM Toolbox, the functions
% LVQ1 and LVQ3 implement two versions of this algorithm (a short
% LVQ1 sketch is given at the end of this demo).

% Here, the function SOM_SUPERVISED is used to create a classifier
% for the IRIS data set:

sM = som_supervised(sD,'small');

som_show(sM,'umat','all');
som_show_add('label',sM.labels,'TextSize',8,'TextColor','r')

sD2 = som_label(sD,'clear','all');
sD2 = som_autolabel(sD2,sM);       % classification
ok = strcmp(sD2.labels,sD.labels); % correctly classified samples
100*(1-sum(ok)/length(ok))         % error percentage (%)

echo off
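
% As mentioned above, a minimal LVQ1 sketch. It assumes the Toolbox
% signature sM = lvq1(sM,D,rlen,alpha) (running length rlen,
% learning rate alpha); the parameter values are illustrative, not
% tuned.

sM2 = lvq1(sM,sD,50*length(sM.labels),0.05);          % fine-tune prototypes
sD3 = som_autolabel(som_label(sD,'clear','all'),sM2); % classify the data
ok = strcmp(sD3.labels,sD.labels);
100*(1-sum(ok)/length(ok)) % LVQ1 error percentage (%)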