function sData = som_read_data(filename, varargin)

%SOM_READ_DATA Read data from an ascii file in SOM_PAK format.
%
% sD = som_read_data(filename, dim, [missing])
% sD = som_read_data(filename, [missing])
%
%  sD = som_read_data('system.data');
%  sD = som_read_data('system.data',10);
%  sD = som_read_data('system.data','*');
%  sD = som_read_data('system.data',10,'*');
%
%  Input and output arguments ([]'s are optional):
%   filename    (string) input file name
%   [dim]       (scalar) input space dimension
%   [missing]   (string) string which indicates a missing component
%                        value, 'NaN' by default
%
%   sD          (struct) data struct
%
% Reads data from an ascii file. The file must be in SOM_PAK format,
% except that it may lack the input space dimension from the first
% line.
%
% For more help, try 'type som_read_data' or check out online documentation.
% See also  SOM_WRITE_DATA, SOM_READ_COD, SOM_WRITE_COD, SOM_DATA_STRUCT.

%%%%%%%%%%%%% DETAILED DESCRIPTION %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% som_read_data
%
% PURPOSE
%
% Reads data from an ascii file in SOM_PAK format.
%
% SYNTAX
%
%  sD = som_read_data(filename)
%  sD = som_read_data(..., dim)
%  sD = som_read_data(..., 'missing')
%  sD = som_read_data(..., dim, 'missing')
%
% DESCRIPTION
%
% This function is offered for compatibility with SOM_PAK, a SOM software
% package in C. It reads data from a file in SOM_PAK format.
%
% The SOM_PAK data file format is as follows. The first line must
% contain the input space dimension and nothing else. The following
% lines are comment lines, empty lines or data lines. Unlike programs
% in SOM_PAK, this function can also determine the input dimension
% from the first data lines, if the input space dimension line is
% missing. Note that the SOM_PAK format is not fully supported: data
% vector 'weight' and 'fixed' properties are ignored (they are treated
% as labels).
%
% Each data line contains one data vector and its labels: first the
% values of the vector components, separated by whitespace, then the
% labels, also separated by whitespace. Missing values in a vector are
% identified by the missing value marker, which is given as the last
% input argument ('NaN' by default). The missing values are stored as
% NaNs in the data struct.
%
% Comment lines start with '#'. Comment lines as well as empty lines are
% ignored, except for comment lines that start with '#n' or '#l'. Such a
% line should contain the names of the vector components ('#n') or the
% label names ('#l'), separated by whitespace.
%
% NOTE: The maximum value Matlab is able to deal with (realmax)
% should not appear in the input file. This is because function sscanf is
% not able to read NaNs: during the read phase the missing value markers
% are converted to the value realmax, and every realmax in the data is
% converted to NaN afterwards.
%
% REQUIRED INPUT ARGUMENTS
%
%  filename    (string) input filename
%
% OPTIONAL INPUT ARGUMENTS
%
%  dim         (scalar) input space dimension
%  missing     (string) string used to denote missing components (NaNs);
%                       default is 'NaN'
%
% OUTPUT ARGUMENTS
%
%  sD          (struct) the resulting data struct
%
% EXAMPLES
%
% The basic usage is:
%  sD = som_read_data('system.data');
%
% If you know the input space dimension beforehand, and the file does
% not contain it on the first line, it helps if you specify it as the
% second argument:
%  sD = som_read_data('system.data',9);
%
% If missing components in the data are marked with a string other
% than 'NaN', specify that string as the last argument:
%  sD = som_read_data('system.data',9,'*')
%  sD = som_read_data('system.data','NaN')
%
% Here's an example data file:
%
% 5
% #n one two three four five
% #l ID
% 10 2 3 4 5 1stline label
% 0.4 0.3 0.2 0.5 0.1 2ndline label1 label2
% # comment line: missing components are indicated by 'x':s
% 1 x 1 x 1 3rdline missing_components
% x 1 2 2 2
% x x x x x 5thline emptyline
%
% SEE ALSO
%
%  som_write_data   Writes data structs/matrices to a file in SOM_PAK format.
%  som_read_cod     Reads a map from a file in SOM_PAK format.
%  som_write_cod    Writes a map struct to a file in SOM_PAK format.
%  som_data_struct  Creates data structs.

% Copyright (c) 1997-2000 by the SOM toolbox programming team.
% http://www.cis.hut.fi/projects/somtoolbox/

% Version 1.0beta ecco 221097
% Version 2.0beta ecco 060899, juuso 151199

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% check arguments

narginchk(1, 3)  % check no. of input args is correct

dont_care       = 'NaN';      % default don't care string
comment_start   = '#';        % the char a SOM_PAK comment line starts with
comp_name_line  = '#n';       % string denoting a special comment line,
                              % which contains names of each component
label_name_line = '#l';       % string denoting a special comment line,
                              % which contains names of each label
block_size      = 1000;       % block size used in file read

kludge          = num2str(realmax, 100); % used in sscanf


% open input file

fid = fopen(filename);
if fid < 0
  error(['Cannot open ' filename]);
end

% process input arguments

if nargin == 2
  if ischar(varargin{1})
    dont_care = varargin{1};
  else
    dim       = varargin{1};
  end
elseif nargin == 3
  dim       = varargin{1};
  dont_care = varargin{2};
end

% if the data dimension is not specified, find out what it is

if nargin == 1 || (nargin == 2 && ischar(varargin{1}))

  fpos1 = ftell(fid); c1 = 0;      % read first non-comment line
  while c1 == 0,
    line1 = strrep(fgetl(fid), dont_care, kludge);
    [l1, c1] = sscanf(line1, '%f ');
  end

  fpos2 = ftell(fid); c2 = 0;      % read second non-comment line
  while c2 == 0,
    line2 = strrep(fgetl(fid), dont_care, kludge);
    [l2, c2] = sscanf(line2, '%f ');
  end

  if (c1 == 1 && c2 ~= 1) || (c1 == c2 && c1 == 1 && l1 == 1)
    dim = l1;                      % first line holds the dimension
    fseek(fid, fpos2, -1);
  elseif (c1 == c2)
    dim = c1;                      % no dimension line: count components
    fseek(fid, fpos1, -1);
    warning on
    warning(['Automatically determined data dimension is ' ...
             num2str(dim) '. Is it correct?']);
  else
    error(['Invalid header line: ' line1]);
  end
end

% check the dimension is valid

if dim < 1 || dim ~= round(dim)
  error(['Illegal data dimension: ' num2str(dim)]);
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% read data

sData       = som_data_struct(zeros(1, dim), 'name', filename);
lnum        = 0;                                    % data vector counter
data_temp   = zeros(block_size, dim);
labs_temp   = cell(block_size, 1);
comp_names  = sData.comp_names;
label_names = sData.label_names;
form        = [repmat('%g',[1 dim-1]) '%g%[^ \t]'];

limit       = block_size;
while 1,
  li = fgetl(fid);                         % read next line
  if ~ischar(li), break, end;              % is this the end of file?

  % all missing components are replaced by value realmax because
  % sscanf is not able to read NaNs
  li = strrep(li, dont_care, kludge);
  [data, c, err, n] = sscanf(li, form);
  if c < dim % if there were fewer numbers than dim on the input file line
    if c == 0
      if strncmp(li, comp_name_line, 2) % component name line?
        li = strrep(li(3:end), kludge, dont_care); i = 0; c = 1;
        while c
          [s, c, e, n] = sscanf(li, '%s%[^ \t]');
          if ~isempty(s), i = i + 1; comp_names{i} = s; li = li(n:end); end
        end

        if i ~= dim
          error(['Illegal number of component names: ' num2str(i) ...
                 ' (dimension is ' num2str(dim) ')']);
        end
      elseif strncmp(li, label_name_line, 2) % label name line?
        li = strrep(li(3:end), kludge, dont_care); i = 0; c = 1;
        while c
          [s, c, e, n] = sscanf(li, '%s%[^ \t]');
          if ~isempty(s), i = i + 1; label_names{i} = s; li = li(n:end); end
        end
      elseif ~strncmp(li, comment_start, 1) % not a comment, is it error?
        [s, c, e, n] = sscanf(li, '%s%[^ \t]');
        if c
          error(['Invalid vector on input file data line ' ...
                 num2str(lnum+1) ': [' deblank(li) ']']),
        end
      end
    else
      error(['Only ' num2str(c) ' vector components on input file data line ' ...
             num2str(lnum+1) ' (dimension is ' num2str(dim) ')']);
    end

  else

    lnum = lnum + 1;                % this was a line containing data vector
    data_temp(lnum, 1:dim) = data'; % add data to struct

    if lnum == limit                % reserve more memory if necessary
      data_temp(lnum+1:lnum+block_size, 1:dim) = zeros(block_size, dim);
      [dummy nl] = size(labs_temp);
      labs_temp(lnum+1:lnum+block_size,1:nl) = cell(block_size, nl);
      limit = limit + block_size;
    end

    % read labels

    if n < length(li)
      li = strrep(li(n:end), kludge, dont_care); i = 0; n = 1; c = 1;
      while c
        [s, c, e, n_new] = sscanf(li(n:end), '%s%[^ \t]');
        if c, i = i + 1; labs_temp{lnum, i} = s; n = n + n_new - 1; end
      end
    end
  end
end

% close input file
if fclose(fid) < 0, error(['Cannot close file ' filename]);
else fprintf(2, '\rdata read ok \n'); end

% set values: convert the realmax placeholders back to NaNs
data_temp(data_temp == realmax) = NaN;
sData.data        = data_temp(1:lnum,:);
sData.labels      = labs_temp(1:lnum,:);
sData.comp_names  = comp_names;
sData.label_names = label_names;

return;

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
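%% usage sketch (illustrative only, not part of the function)
%
% The file format described in the help text can be tried out with a
% minimal round trip like the one below. It assumes that the SOM
% Toolbox is on the Matlab path and that the current directory is
% writable. The file name 'example.data' and its contents are
% hypothetical. Run the lines at the command prompt without the
% leading comment characters:
%
%   fid = fopen('example.data', 'w');
%   fprintf(fid, '3\n');               % input space dimension
%   fprintf(fid, '#n a b c\n');        % component names
%   fprintf(fid, '1 2 3 first\n');     % data vector with one label
%   fprintf(fid, '* 5 6 second\n');    % '*' marks a missing component
%   fclose(fid);
%
%   sD = som_read_data('example.data', '*');
%   isnan(sD.data(2,1))                % should be true if the read worked
%
% Choosing a marker such as '*' that does not occur inside the labels
% keeps the marker replacement from touching the label text.
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%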