changeset 6:e2337cd691b1 tip

Finished writing the MATLAB code to replicate all observations made in the article. Added the article to the repository. Renamed the two main scripts ("1-get_mirex_estimates.rb" and "2-generate_smith2013_ismir.m") to use underscores instead of dashes (the dashes were annoying within MATLAB). Added new Michael Jackson figure.
author Jordan Smith <jordan.smith@eecs.qmul.ac.uk>
date Wed, 05 Mar 2014 01:02:26 +0000
parents 8d896eec680e
children
files 1-get_mirex_estimates.rb 1_get_mirex_estimates.rb 2-generate_smith2013_ismir.m 2_generate_smith2013_ismir.m collect_all_public_annotations.m datacubes.mat do_correlation_analyses.m make_structure_image.m match_mirex_to_public_data_results.mat plots/MJ_dont_care.jpg readme.txt smith&chew2013-meta_analysis_of_mirex.pdf
diffstat 12 files changed, 398 insertions(+), 739 deletions(-)
--- a/1-get_mirex_estimates.rb	Sat Feb 22 21:25:43 2014 +0000
+++ /dev/null	Thu Jan 01 00:00:00 1970 +0000
@@ -1,138 +0,0 @@
-require "CSV"
-require "open-uri"
-# require "simplexml"
-mirex_path = "/Users/jordan/Desktop/MIREX_data"    # EDIT THIS TO BE YOUR OWN DESIRED PATH.
-                                               # IT WILL NEED TO HOLD ROUGHLY 70 MB OF DATA.
-
-
-# tmp = File.open(filename,'w')
-#     tmptxt = []
-#     open(uri) {|f|
-#         f.each_line {|line| tmptxt.push(line)}
-#     }
-#     tmp.write(tmptxt)
-#     tmp.close
-#     
-
-def url_download(uri, filename=".")
-    open(filename, 'w') do |foo|
-        foo.print open(uri).read
-    end
-end
-
-def convert_file(filename)
-    ann_out_file = filename[0..-4] + "_gt.txt"
-    alg_out_file = filename[0..-4] + "_pred.txt"
-    ann_out = File.open(ann_out_file,'w')
-    alg_out = File.open(alg_out_file,'w')
-    text = File.open(filename,'r').readlines[1..-4].join("").split(/[\[\]]/)
-    text = File.open(filename,'r').readlines(sep=",").join("").split(/[\[\]]/)
-    ann = text[2].split(/[\{\}]/)
-    alg = text[4].split(/[\{\}]/)
-    ann_out.write(json_2_text(ann))
-    alg_out.write(json_2_text(alg))
-    ann_out.close
-    alg_out.close
-end
-
-def json_2_text(json)
-    txt = []
-    (1..json.length).step(2).to_a.each do |indx|
-        line = json[indx]
-        els = line.split(",")
-        # Make a LAB-style annotation (3-column):
-        # txt.push([els[0].split(" ")[-1].to_f, els[1].split(" ")[-1].to_f, els[2].split("\"")[-1]].join("\t"))
-        # Make a TXT-style annotation (2-column):
-        txt.push([els[0].split(" ")[-1].to_f, els[2].split("\"")[-1]].join("\t"))
-    end
-    txt.push([json[-1].split(",")[1].split(" ")[-1].to_f, "End"].join("\t"))
-    return txt.join("\n")
-end
-
-
-# # # #         PART 1:  DOWNLOAD ALL THE STRUCTURAL ANALYSIS EVALUTION DATA PUBLISHED BY MIREX
-
-# Define list of algorithms and datasets:
-algos = ["SP1", "SMGA2", "MHRAF1", "SMGA1", "SBV1", "KSP2", "OYZS1", "KSP3", "KSP1"]
-datasets = ["mrx09", "mrx10_1", "mrx10_2", "sal"]
-year = "2012"
-puts "Thanks for starting the script! Stay tuned for periodic updates."
-
-# Create appropriate directory tree and download CSV files:
-Dir.mkdir(mirex_path) unless File.directory?(mirex_path)
-puts("Downloading CSV files...\n")
-datasets.each do |dset|
-    # Make dataset directory:
-    dir_path = File.join(mirex_path,dset)
-    Dir.mkdir(dir_path) unless File.directory?(dir_path)
-    algos.each do |algo|
-        # Make algorithm directory:
-        algo_path = File.join(mirex_path,dset,algo)
-        Dir.mkdir(algo_path) unless File.directory?(algo_path)
-        # Download the CSV file to this directory:
-        algocsvpath = File.join(mirex_path,dset,algo,"per_track_results.csv")
-        csv_path = File.join(("http://nema.lis.illinois.edu/nema_out/mirex"+year),"/results/struct",dset,algo,"per_track_results.csv")
-        url_download(csv_path, algocsvpath)
-    end
-end
-
-puts "..done with that."
-
-puts "Now we will download all the files output by each algorithm. This could take a while depending on your connection."
-puts "Since this script points to " + datasets.length.to_s + " datasets and " + algos.length.to_s + " algorithms, you should expect to wait however long it takes between each of the next lines to appear, times " + (datasets.length*algos.length).to_s + "."
-
-# Read each CSV file and download all the json files it points to:
-datasets.each do |dset|
-    algos.each do |algo|
-        puts( "Starting to download "+dset+ " dataset for " + algo + " algorithm...\n")
-        algocsvpath = File.join(mirex_path,dset,algo,"per_track_results.csv")
-        csv_data = File.read(algocsvpath).split("\n")
-        header = csv_data.delete_at(0)
-        download_folder = File.join(mirex_path,dset,algo)
-        # For each line in the spreadsheet, extract the songid and download the corresponding json document.
-        csv_data.each do |line|
-            line = line.split(",")
-            song_id = line[1]
-            url = "http://nema.lis.illinois.edu/nema_out/mirex" + year + "/results/struct/" + dset + "/" + algo.downcase + "segments" + song_id.delete("_") + ".js"
-            download_path = File.join(download_folder,song_id + ".js")
-            # download_path = download_folder + "/" + song_id + ".js"
-            url_download(url, download_path)
-        end
-    end
-    puts("Done with " + dset + " dataset!\n")
-end
-
-puts "..done with that."
-
-puts "Now, a much faster step: turning all the json files you downloaded into simpler text files."
-# Scan for all the json files, and convert each one into two text files, one for the algorithm output, one for the annotation:
-all_json_files = Dir.glob(File.join(mirex_path,"*","*","*.js"))
-all_json_files.each do |file|
-    convert_file(file)
-    puts file
-end
-
-puts "..done with that."
-
-puts "Now, PART 2 of the script: we download all the zip files (from various websites) that contain the public collections of ground truth files. This will only take a couple minutes, depending on connection speed (it's about 4 MB total)."
-
-
-# # # #         PART 2:  GET (AND CONVERT) THE ANNOTATION DATA PUBLISHED BY OTHERS
-
-# Download and unzip all public annotations
-list_of_db_urls = ["https://staff.aist.go.jp/m.goto/RWC-MDB/AIST-Annotation/AIST.RWC-MDB-P-2001.CHORUS.zip", "https://staff.aist.go.jp/m.goto/RWC-MDB/AIST-Annotation/AIST.RWC-MDB-C-2001.CHORUS.zip", "https://staff.aist.go.jp/m.goto/RWC-MDB/AIST-Annotation/AIST.RWC-MDB-J-2001.CHORUS.zip", "https://staff.aist.go.jp/m.goto/RWC-MDB/AIST-Annotation/AIST.RWC-MDB-G-2001.CHORUS.zip", "http://www.music.mcgill.ca/~jordan/salami/releases/SALAMI_data_v1.2.zip", "http://www.ifs.tuwien.ac.at/mir/audiosegmentation/dl/ep_groundtruth_excl_Paulus.zip", "http://musicdata.gforge.inria.fr/IRISA.RWC-MDB-P-2012.SEMLAB_v003_full.zip", "http://musicdata.gforge.inria.fr/IRISA.RWC-MDB-P-2012.SEMLAB_v003_reduced.zip", "http://musicdata.gforge.inria.fr/IRISA.RWC-MDB-P-2001.BLOCKS_v001.zip", "http://www.isophonics.net/files/annotations/The%20Beatles%20Annotations.tar.gz", "http://www.isophonics.net/files/annotations/Carole%20King%20Annotations.tar.gz", "http://www.isophonics.net/files/annotations/Queen%20Annotations.tar.gz", "http://www.isophonics.net/files/annotations/Michael%20Jackson%20Annotations.tar.gz", "http://www.isophonics.net/files/annotations/Zweieck%20Annotations.tar.gz", "http://www.cs.tut.fi/sgn/arg/paulus/beatles_sections_TUT.zip", "http://www.iua.upf.edu/~perfe/annotations/sections/beatles/structure_Beatles.rar"]
-
-public_data_path = File.join(mirex_path,"public_data")
-Dir.mkdir(public_data_path) unless File.directory?(public_data_path)
-list_of_db_urls.each do |db_url|
-    open(File.join(public_data_path,File.basename(db_url)), 'wb') do |foo|
-      foo.print open(db_url).read
-    end
-end
-
-# # # #         NOW, PLEASE EXIT THE SCRIPT, AND UNZIP ALL THOSE PACKAGES.
-# # # #         WHEN YOU'RE DONE, GO ONTO THE PARENT MATLAB FILE TO RUN THE ANALYSES.
-puts "..done with that.\n\n"
-puts "Script apppears to have ended successfully. All files were downloaded and saved to " + public_data_path +"."
-puts "To continue please unpack all zip files, start MATLAB, and run 2-generate_smith2013_ismir.m. You can read more on README."
-puts "Important: be sure that the zip files unpack into the correct file structure. Again, see README for details."
\ No newline at end of file
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/1_get_mirex_estimates.rb	Wed Mar 05 01:02:26 2014 +0000
@@ -0,0 +1,138 @@
+require "csv"
+require "open-uri"
+# require "simplexml"
+mirex_path = "/Users/jordan/Desktop/MIREX_data"    # EDIT THIS TO BE YOUR OWN DESIRED PATH.
+                                               # IT WILL NEED TO HOLD ROUGHLY 70 MB OF DATA.
+
+
+# tmp = File.open(filename,'w')
+#     tmptxt = []
+#     open(uri) {|f|
+#         f.each_line {|line| tmptxt.push(line)}
+#     }
+#     tmp.write(tmptxt)
+#     tmp.close
+#     
+
+def url_download(uri, filename=".")
+    open(filename, 'w') do |foo|
+        foo.print open(uri).read
+    end
+end
+
+def convert_file(filename)
+    ann_out_file = filename[0..-4] + "_gt.txt"
+    alg_out_file = filename[0..-4] + "_pred.txt"
+    ann_out = File.open(ann_out_file,'w')
+    alg_out = File.open(alg_out_file,'w')
+    text = File.open(filename,'r').readlines[1..-4].join("").split(/[\[\]]/)
+    text = File.open(filename,'r').readlines(sep=",").join("").split(/[\[\]]/)
+    ann = text[2].split(/[\{\}]/)
+    alg = text[4].split(/[\{\}]/)
+    ann_out.write(json_2_text(ann))
+    alg_out.write(json_2_text(alg))
+    ann_out.close
+    alg_out.close
+end
+
+def json_2_text(json)
+    txt = []
+    (1..json.length).step(2).to_a.each do |indx|
+        line = json[indx]
+        els = line.split(",")
+        # Make a LAB-style annotation (3-column):
+        # txt.push([els[0].split(" ")[-1].to_f, els[1].split(" ")[-1].to_f, els[2].split("\"")[-1]].join("\t"))
+        # Make a TXT-style annotation (2-column):
+        txt.push([els[0].split(" ")[-1].to_f, els[2].split("\"")[-1]].join("\t"))
+    end
+    txt.push([json[-1].split(",")[1].split(" ")[-1].to_f, "End"].join("\t"))
+    return txt.join("\n")
+end
+
+
+# # # #         PART 1:  DOWNLOAD ALL THE STRUCTURAL ANALYSIS EVALUATION DATA PUBLISHED BY MIREX
+
+# Define list of algorithms and datasets:
+algos = ["SP1", "SMGA2", "MHRAF1", "SMGA1", "SBV1", "KSP2", "OYZS1", "KSP3", "KSP1"]
+datasets = ["mrx09", "mrx10_1", "mrx10_2", "sal"]
+year = "2012"
+puts "Thanks for starting the script! Stay tuned for periodic updates."
+
+# Create appropriate directory tree and download CSV files:
+Dir.mkdir(mirex_path) unless File.directory?(mirex_path)
+puts("Downloading CSV files...\n")
+datasets.each do |dset|
+    # Make dataset directory:
+    dir_path = File.join(mirex_path,dset)
+    Dir.mkdir(dir_path) unless File.directory?(dir_path)
+    algos.each do |algo|
+        # Make algorithm directory:
+        algo_path = File.join(mirex_path,dset,algo)
+        Dir.mkdir(algo_path) unless File.directory?(algo_path)
+        # Download the CSV file to this directory:
+        algocsvpath = File.join(mirex_path,dset,algo,"per_track_results.csv")
+        csv_path = File.join(("http://nema.lis.illinois.edu/nema_out/mirex"+year),"/results/struct",dset,algo,"per_track_results.csv")
+        url_download(csv_path, algocsvpath)
+    end
+end
+
+puts "..done with that."
+
+puts "Now we will download all the files output by each algorithm. This could take a while depending on your connection."
+puts "Since this script points to " + datasets.length.to_s + " datasets and " + algos.length.to_s + " algorithms, you should expect to wait however long it takes between each of the next lines to appear, times " + (datasets.length*algos.length).to_s + "."
+
+# Read each CSV file and download all the json files it points to:
+datasets.each do |dset|
+    algos.each do |algo|
+        puts( "Starting to download "+dset+ " dataset for " + algo + " algorithm...\n")
+        algocsvpath = File.join(mirex_path,dset,algo,"per_track_results.csv")
+        csv_data = File.read(algocsvpath).split("\n")
+        header = csv_data.delete_at(0)
+        download_folder = File.join(mirex_path,dset,algo)
+        # For each line in the spreadsheet, extract the songid and download the corresponding json document.
+        csv_data.each do |line|
+            line = line.split(",")
+            song_id = line[1]
+            url = "http://nema.lis.illinois.edu/nema_out/mirex" + year + "/results/struct/" + dset + "/" + algo.downcase + "segments" + song_id.delete("_") + ".js"
+            download_path = File.join(download_folder,song_id + ".js")
+            # download_path = download_folder + "/" + song_id + ".js"
+            url_download(url, download_path)
+        end
+    end
+    puts("Done with " + dset + " dataset!\n")
+end
+
+puts "..done with that."
+
+puts "Now, a much faster step: turning all the json files you downloaded into simpler text files."
+# Scan for all the json files, and convert each one into two text files, one for the algorithm output, one for the annotation:
+all_json_files = Dir.glob(File.join(mirex_path,"*","*","*.js"))
+all_json_files.each do |file|
+    convert_file(file)
+    puts file
+end
+
+puts "..done with that."
+
+puts "Now, PART 2 of the script: we download all the zip files (from various websites) that contain the public collections of ground truth files. This will only take a couple minutes, depending on connection speed (it's about 4 MB total)."
+
+
+# # # #         PART 2:  GET (AND CONVERT) THE ANNOTATION DATA PUBLISHED BY OTHERS
+
+# Download and unzip all public annotations
+list_of_db_urls = ["https://staff.aist.go.jp/m.goto/RWC-MDB/AIST-Annotation/AIST.RWC-MDB-P-2001.CHORUS.zip", "https://staff.aist.go.jp/m.goto/RWC-MDB/AIST-Annotation/AIST.RWC-MDB-C-2001.CHORUS.zip", "https://staff.aist.go.jp/m.goto/RWC-MDB/AIST-Annotation/AIST.RWC-MDB-J-2001.CHORUS.zip", "https://staff.aist.go.jp/m.goto/RWC-MDB/AIST-Annotation/AIST.RWC-MDB-G-2001.CHORUS.zip", "http://www.music.mcgill.ca/~jordan/salami/releases/SALAMI_data_v1.2.zip", "http://www.ifs.tuwien.ac.at/mir/audiosegmentation/dl/ep_groundtruth_excl_Paulus.zip", "http://musicdata.gforge.inria.fr/IRISA.RWC-MDB-P-2012.SEMLAB_v003_full.zip", "http://musicdata.gforge.inria.fr/IRISA.RWC-MDB-P-2012.SEMLAB_v003_reduced.zip", "http://musicdata.gforge.inria.fr/IRISA.RWC-MDB-P-2001.BLOCKS_v001.zip", "http://www.isophonics.net/files/annotations/The%20Beatles%20Annotations.tar.gz", "http://www.isophonics.net/files/annotations/Carole%20King%20Annotations.tar.gz", "http://www.isophonics.net/files/annotations/Queen%20Annotations.tar.gz", "http://www.isophonics.net/files/annotations/Michael%20Jackson%20Annotations.tar.gz", "http://www.isophonics.net/files/annotations/Zweieck%20Annotations.tar.gz", "http://www.cs.tut.fi/sgn/arg/paulus/beatles_sections_TUT.zip", "http://www.iua.upf.edu/~perfe/annotations/sections/beatles/structure_Beatles.rar"]
+
+public_data_path = File.join(mirex_path,"public_data")
+Dir.mkdir(public_data_path) unless File.directory?(public_data_path)
+list_of_db_urls.each do |db_url|
+    open(File.join(public_data_path,File.basename(db_url)), 'wb') do |foo|
+      foo.print open(db_url).read
+    end
+end
+
+# # # #         NOW, PLEASE EXIT THE SCRIPT, AND UNZIP ALL THOSE PACKAGES.
+# # # #         WHEN YOU'RE DONE, GO ONTO THE PARENT MATLAB FILE TO RUN THE ANALYSES.
+puts "..done with that.\n\n"
+puts "Script appears to have ended successfully. All files were downloaded and saved to " + public_data_path +"."
+puts "To continue please unpack all zip files, start MATLAB, and run 2_generate_smith2013_ismir.m. You can read more on README."
+puts "Important: be sure that the zip files unpack into the correct file structure. Again, see README for details."
\ No newline at end of file
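The two string-building steps in the script above (the per-track URL and the 2-column "TXT-style" annotation) can be sketched in isolation as follows. This is a minimal illustration rather than code from the repository: the helper names `segment_url` and `segments_to_txt`, the Hash layout of a segment, and the sample song ID are all assumptions made for the example. The URL concatenation itself is copied from the download loop.

```ruby
# Hypothetical helpers; names and data layout are illustrative only.

# Build the URL of one per-track result file, using the same string
# concatenation as the script's download loop.
def segment_url(year, dset, algo, song_id)
  "http://nema.lis.illinois.edu/nema_out/mirex" + year +
    "/results/struct/" + dset + "/" + algo.downcase +
    "segments" + song_id.delete("_") + ".js"
end

# Turn a list of segments into a 2-column annotation:
# one "onset<TAB>label" line per segment, then a final "offset<TAB>End" line.
# Assumes each segment is a Hash with :onset, :offset and :label keys.
def segments_to_txt(segments)
  lines = segments.map { |seg| [seg[:onset].to_f, seg[:label]].join("\t") }
  lines.push([segments.last[:offset].to_f, "End"].join("\t")) unless segments.empty?
  lines.join("\n")
end

# Made-up sample data, not taken from MIREX.
segments = [
  { onset: 0,    offset: 12.5, label: "A" },
  { onset: 12.5, offset: 30,   label: "B" }
]
puts segments_to_txt(segments)
puts segment_url("2012", "mrx09", "SP1", "mrx09_000001")
```

Working from already-parsed segments keeps the formatting step separate from the regex-based splitting of the downloaded .js files, which makes it easier to check on its own.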
--- a/2-generate_smith2013_ismir.m	Sat Feb 22 21:25:43 2014 +0000
+++ /dev/null	Thu Jan 01 00:00:00 1970 +0000
@@ -1,95 +0,0 @@
-% This is a script that reproduces the results of the following ISMIR paper:
-%
-% Smith, J. B. L., and E. Chew. 2013. A meta-analysis of the MIREX Structural
-% Segmentation task. Proceedings of the International Society for Music
-% Information Retrieval Conference. Curitiba, Brazil. 251–6.
-%
-% DO read the README.txt before running this file. Among other things, it will
-% explain that you must download some data and place it in the correct file
-% tree before this script will work.
-%
-% DO read the comments in this script before running it. You may want to run
-% it piece by piece in order to understand exactly what it is doing.
-
-
-%%
-% STEP 0: Set up parameters.
-% Name the MIREX datasets and algorithms desired.
-dsets = {'mrx09','mrx10_1','mrx10_2','sal'};
-algos = {'KSP1','KSP2','KSP3','MHRAF1','OYZS1','SBV1','SMGA1','SMGA2','SP1'};
-
-% YOU MUST SET THE FOLLOWING PATH YOURSELF!
-% Set it to be the same as the path given at the top of '1-get_mirex_estimates.rb'.
-base_directory = '/Users/me/Desktop/MIREX_data';
-
-% You should get a copy of the evalution scripts in the Code.SoundSoftware
-% repository. Wherever you put it, set the following path accordingly:
-addpath('/Users/me/Desktop/whereiputmymatlabfiles/structural_analysis_evaluation')
-% You should also, clearly, add the current path (where this file is):
-addpath('.')
-
-% Check that we have access to the correct dependencies.
-
-if exist('compare_structures.m')~=2,
-    fprintf('I could not locate ''compare_structures.m'', part of the Structural Analysis Evaluation project. Please read the help for this file before proceeding.\n')
-end
-if exist('load_annotation.m')~=2,
-    fprintf('I could not locate ''load_annotation.m'', part of the Structural Analysis Evaluation project. Please read the help for this file before proceeding.\n')
-end
-if exist('collect_all_mirex_annotations')~=2,
-    fprintf('I could not locate ''collect_all_mirex_annotations.m'', which should be in the same folder as this file. Something really screwed up has happened, clearly! Please read the help for this file before proceeding.\n')
-end
-
-%%
-% STEP 1: Download data from MIREX website:
-%
-%   - Ground truth files
-%   - Algorithm output
-%   - Reported evaluation results
-% (See README.txt to see where to get this data.)
-
-
-%%
-% STEP 2: Import all this data into some Matlab structures.
-%
-% 1. Assemble MIREX ground truth file data in Matlab.
-[mirex_truth mirex_dset_origin] = collect_all_mirex_annotations(base_directory, dsets, algos);
-% 2. Assemble MIREX algorithm output data in Matlab.
-mirex_output = collect_all_mirex_algo_output_data(base_directory, dsets, algos);
-% 3. Assemble MIREX evaluation results in Matlab.
-mirex_results = collect_all_mirex_results(base_directory, dsets, algos);
-% 4. Download public repositories of annotations.
-% NB: You must do this manually, as per the README.
-% 5. Assemble public ground truth data in Matlab.
-[public_truth public_dset_origin] = collect_all_public_annotations(base_directory);
-
-
-%%
-% STEP 3: Match MIREX and public data.
-%
-% With this information, we can now easily search for matches between
-% MIREX and public ground truth.
-[pub2mir, mir2pub, P] = match_mirex_to_public_data(mirex_truth, public_truth, mirex_dset_origin, public_dset_origin);
-% If you have already done this, do not repeat this time-consuming step. Instead:
-% load match_mirex_to_public_data_results
-
-%%
-% STEP 4: Compile datacubes
-% 1. Compute extra evaluation measures using MIREX algorithm output.
-% 2. Compute extra features of the annotations (song length, mean segment length, etc.).
-% 3. Put it all together in a giant MEGADATACUBE.
-[datacube newcube extracube indexing_info] = compile_datacubes(mirex_truth, ...
-    mirex_dset_origin, public_truth, mirex_output, mirex_results, mir2pub);
-% If you have already done this, do not repeat this time-consuming step. Instead:
-% load datacubes
-megadatacube = [datacube newcube extracube];
-
-
-%%
-% Step 5: Do the statistics!
-% 1. Compute correlations between all these parameters.
-% 2. Display correlation figures.
-% 3. Display analysis result figure.
-do_correlation_analyses   % this one is just a script because it does not return any values.
-
-% You are now finished!
\ No newline at end of file
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/2_generate_smith2013_ismir.m	Wed Mar 05 01:02:26 2014 +0000
@@ -0,0 +1,99 @@
+% This is a script that reproduces the results of the following ISMIR paper:
+%
+% Smith, J. B. L., and E. Chew. 2013. A meta-analysis of the MIREX Structural
+% Segmentation task. Proceedings of the International Society for Music
+% Information Retrieval Conference. Curitiba, Brazil. 251–6.
+%
+% DO read the README.txt before running this file. Among other things, it will
+% explain that you must download some data and place it in the correct file
+% tree before this script will work.
+%
+% DO read the comments in this script before running it. You may want to run
+% it piece by piece in order to understand exactly what it is doing.
+
+
+%%
+% STEP 0: Set up parameters.
+% Name the MIREX datasets and algorithms desired.
+dsets = {'mrx09','mrx10_1','mrx10_2','sal'};
+algos = {'KSP1','KSP2','KSP3','MHRAF1','OYZS1','SBV1','SMGA1','SMGA2','SP1'};
+
+% YOU MUST SET THE FOLLOWING PATH YOURSELF!
+% Set it to be the same as the path given at the top of '1_get_mirex_estimates.rb'.
+base_directory = '/Users/me/Desktop/MIREX_data';
+base_directory = '/Users/jordan/Desktop/MIREX_data';
+
+
+% You should get a copy of the evaluation scripts in the Code.SoundSoftware
+% repository. Wherever you put it, set the following path accordingly:
+addpath('/Users/me/Desktop/whereiputmymatlabfiles/structural_analysis_evaluation')
+addpath('/Users/jordan/Documents/structural_analysis_evaluation')
+
+% You should also, clearly, add the current path (where this file is):
+addpath('.')
+
+% Check that we have access to the correct dependencies.
+
+if exist('compare_structures.m')~=2,
+    fprintf('I could not locate ''compare_structures.m'', part of the Structural Analysis Evaluation project. Please read the help for this file before proceeding.\n')
+end
+if exist('load_annotation.m')~=2,
+    fprintf('I could not locate ''load_annotation.m'', part of the Structural Analysis Evaluation project. Please read the help for this file before proceeding.\n')
+end
+if exist('collect_all_mirex_annotations')~=2,
+    fprintf('I could not locate ''collect_all_mirex_annotations.m'', which should be in the same folder as this file. Something really screwed up has happened, clearly! Please read the help for this file before proceeding.\n')
+end
+
+%%
+% STEP 1: Download data from MIREX website:
+%
+%   - Ground truth files
+%   - Algorithm output
+%   - Reported evaluation results
+% (See README.txt to see where to get this data.)
+
+
+%%
+% STEP 2: Import all this data into some Matlab structures.
+%
+% 1. Assemble MIREX ground truth file data in Matlab.
+[mirex_truth mirex_dset_origin] = collect_all_mirex_annotations(base_directory, dsets, algos);
+% 2. Assemble MIREX algorithm output data in Matlab.
+mirex_output = collect_all_mirex_algo_output_data(base_directory, dsets, algos);
+% 3. Assemble MIREX evaluation results in Matlab.
+mirex_results = collect_all_mirex_results(base_directory, dsets, algos);
+% 4. Download public repositories of annotations.
+% NB: You must do this manually, as per the README.
+% 5. Assemble public ground truth data in Matlab.
+[public_truth public_dset_origin] = collect_all_public_annotations(base_directory);
+
+
+%%
+% STEP 3: Match MIREX and public data.
+%
+% With this information, we can now easily search for matches between
+% MIREX and public ground truth.
+[pub2mir, mir2pub, P] = match_mirex_to_public_data(mirex_truth, public_truth, mirex_dset_origin, public_dset_origin);
+% If you have already done this, do not repeat this time-consuming step. Instead:
+% load match_mirex_to_public_data_results
+
+%%
+% STEP 4: Compile datacubes
+% 1. Compute extra evaluation measures using MIREX algorithm output.
+% 2. Compute extra features of the annotations (song length, mean segment length, etc.).
+% 3. Put it all together in a giant MEGADATACUBE.
+[datacube newcube extracube indexing_info] = compile_datacubes(mirex_truth, ...
+    mirex_dset_origin, public_truth, mirex_output, mirex_results, mir2pub);
+megadatacube = [datacube newcube extracube];
+% If you have already done this, do not repeat this time-consuming step. Instead:
+% load datacubes
+
+
+%%
+% Step 5: Do the statistics!
+% 1. Compute correlations between all these parameters.
+% 2. Display correlation figures.
+% 3. Display analysis result figure.
+do_correlation_analyses   % this one is just a script because it does not return any values.
+
+% You are now finished!
\ No newline at end of file
--- a/collect_all_public_annotations.m	Sat Feb 22 21:25:43 2014 +0000
+++ b/collect_all_public_annotations.m	Wed Mar 05 01:02:26 2014 +0000
@@ -115,6 +115,7 @@
 end
 
 % Load EP data
+% NOTE WELL: if you encounter an error here, are you sure you moved the file ep_groundtruth_txt.zip to your public_data directory and unzipped it?
 [tmp all_files tmp1] = fileattrib(strcat(ep_dir,'/*.txt'));
 for j=1:length(all_files),
     if all_files(j).directory==0 & all_files(j).GroupRead==1,
@@ -177,6 +178,7 @@
 end
 
 % Load SALAMI data
+% NOTE WELL: if you encounter an error here, are you sure you unzipped the data.zip file *within* the SALAMI data file?
 [tmp all_files tmp1] = fileattrib(strcat(salami_dir,'/*'));
 for j=1:length(all_files),
     if all_files(j).directory == 0 & all_files(j).GroupRead==1,
Binary file datacubes.mat has changed
--- a/do_correlation_analyses.m	Sat Feb 22 21:25:43 2014 +0000
+++ b/do_correlation_analyses.m	Wed Mar 05 01:02:26 2014 +0000
@@ -47,98 +47,103 @@
 
 
 
-% Section 3.1: "Does this indicate that the algorithms are better at boundary precision than recall? In fact, the opposite is the case: average bp6 bp.5 was simply consistently worse for most algorithms."
-% For all algos:
+fprintf('Section 3.1: ''Does this indicate that the algorithms are better at boundary precision than recall? In fact, the opposite is the case: average bp6 bp.5 was simply consistently worse for most algorithms.''\n')
+fprintf('For all algos:\n')
 mean(median(megadatacube(:,indexing_info(2).manual_set([3 4 7 8]),:),3),1)
-% For each algo:
+fprintf('For each algo:\n')
 mean(megadatacube(:,indexing_info(2).manual_set([3 4 7 8]),:),1)
-% Recall (the second pair of values) surpass precision (the first pair of values) for most of the algorithm runs. There are two exceptions: algorithms 4 (R a little less than P) and 5 (P much better than R).
+fprintf('Recall (the second pair of values) surpasses precision (the first pair of values) for most of the algorithm runs. There are two exceptions: algorithms 4 (R a little less than P) and 5 (P much better than R).\n')
 
+fprintf('Are the trends qualitatively similar across datasets? (Section 3.1: ''...the findings of this section were consistent across the datasets, albeit with some variation in significance levels.'')\n')
+fprintf('Fig 1a\n')
+fprintf('All the datasets:\n')
+figure(1),[asig pval a a_] = do_correlation(megadatacube, lab_measures, indexing_info(1).manual_set, [1:9], -1, 0, 1, -1, indexing_info(1).labels, 1);
+fprintf('Isophonics et al.:\n')
+figure(2),[asig pval a a_] = do_correlation(megadatacube, ismember(mirex_dset_origin,1), indexing_info(1).manual_set, [1:9], -1, 0, 1, -1, indexing_info(1).labels, 1);
+fprintf('RWC (AIST):\n')
+figure(3),[asig pval a a_] = do_correlation(megadatacube, ismember(mirex_dset_origin,3), indexing_info(1).manual_set, [1:9], -1, 0, 1, -1, indexing_info(1).labels, 1);
+fprintf('SALAMI:\n')
+figure(4),[asig pval a a_] = do_correlation(megadatacube, ismember(mirex_dset_origin,4), indexing_info(1).manual_set, [1:9], -1, 0, 1, -1, indexing_info(1).labels, 1);
+fprintf('Fig 1b\n')
+fprintf('All the datasets:\n')
+figure(1), [asig pval a a_] = do_correlation(megadatacube, lab_measures, indexing_info(1).manual_set, [1:9], -1, 1, 0, -1, indexing_info(1).labels, 1);
+fprintf('Isophonics et al.:\n')
+figure(2), [asig pval a a_] = do_correlation(megadatacube, ismember(mirex_dset_origin,1), indexing_info(1).manual_set, [1:9], -1, 1, 0, -1, indexing_info(1).labels, 1);
+fprintf('RWC (AIST):\n')
+figure(3), [asig pval a a_] = do_correlation(megadatacube, ismember(mirex_dset_origin,3), indexing_info(1).manual_set, [1:9], -1, 1, 0, -1, indexing_info(1).labels, 1);
+fprintf('SALAMI:\n')
+figure(4), [asig pval a a_] = do_correlation(megadatacube, ismember(mirex_dset_origin,4), indexing_info(1).manual_set, [1:9], -1, 1, 0, -1, indexing_info(1).labels, 1);
+fprintf('Fig 2a\n')
+fprintf('All the datasets:\n')
+figure(1), [asig pval a a_] = do_correlation(megadatacube, seg_measures, indexing_info(2).manual_set, [1:9], -1, 0, 1, -1, indexing_info(2).labels, 1);
+fprintf('Isophonics et al.:\n')
+figure(2), [asig pval a a_] = do_correlation(megadatacube, ismember(mirex_dset_origin,1), indexing_info(2).manual_set, [1:9], -1, 0, 1, -1, indexing_info(2).labels, 1);
+fprintf('RWC (INRIA):\n')
+figure(3), [asig pval a a_] = do_correlation(megadatacube, ismember(mirex_dset_origin,2), indexing_info(2).manual_set, [1:9], -1, 0, 1, -1, indexing_info(2).labels, 1);
+fprintf('RWC (AIST):\n')
+figure(4), [asig pval a a_] = do_correlation(megadatacube, ismember(mirex_dset_origin,3), indexing_info(2).manual_set, [1:9], -1, 0, 1, -1, indexing_info(2).labels, 1);
+fprintf('SALAMI:\n')
+figure(5), [asig pval a a_] = do_correlation(megadatacube, ismember(mirex_dset_origin,4), indexing_info(2).manual_set, [1:9], -1, 0, 1, -1, indexing_info(2).labels, 1);
+fprintf('Fig 2b\n')
+fprintf('All the datasets:\n')
+figure(1), [asig pval a a_] = do_correlation(megadatacube, seg_measures, indexing_info(2).manual_set, [1:9], -1, 1, 0, -1, indexing_info(2).labels, 1);
+fprintf('Isophonics et al.:\n')
+figure(2), [asig pval a a_] = do_correlation(megadatacube, ismember(mirex_dset_origin,1), indexing_info(2).manual_set, [1:9], -1, 1, 0, -1, indexing_info(2).labels, 1);
+fprintf('RWC (INRIA):\n')
+figure(3), [asig pval a a_] = do_correlation(megadatacube, ismember(mirex_dset_origin,2), indexing_info(2).manual_set, [1:9], -1, 1, 0, -1, indexing_info(2).labels, 1);
+fprintf('RWC (AIST):\n')
+figure(4), [asig pval a a_] = do_correlation(megadatacube, ismember(mirex_dset_origin,3), indexing_info(2).manual_set, [1:9], -1, 1, 0, -1, indexing_info(2).labels, 1);
+fprintf('SALAMI:\n')
+figure(5), [asig pval a a_] = do_correlation(megadatacube, ismember(mirex_dset_origin,4), indexing_info(2).manual_set, [1:9], -1, 1, 0, -1, indexing_info(2).labels, 1);
 
-% Are the trends qualitatively similar across datasets? (Section 3.1: "...the findings of this section were consistent across the datasets, albeit with some variation in significance levels.")
-% % % Fig 1a
-% All the datasets:
-figure,[asig pval a a_] = do_correlation(megadatacube, lab_measures, indexing_info(1).manual_set, [1:9], -1, 0, 1, -1, indexing_info(1).labels, 1);
-% Isophonics et al.:
-figure,[asig pval a a_] = do_correlation(megadatacube, ismember(mirex_dset_origin,1), indexing_info(1).manual_set, [1:9], -1, 0, 1, -1, indexing_info(1).labels, 1);
-% RWC (AIST):
-figure,[asig pval a a_] = do_correlation(megadatacube, ismember(mirex_dset_origin,3), indexing_info(1).manual_set, [1:9], -1, 0, 1, -1, indexing_info(1).labels, 1);
-% SALAMI:
-figure,[asig pval a a_] = do_correlation(megadatacube, ismember(mirex_dset_origin,4), indexing_info(1).manual_set, [1:9], -1, 0, 1, -1, indexing_info(1).labels, 1);
-% % % Fig 1b
-% All the datasets:
-figure, [asig pval a a_] = do_correlation(megadatacube, lab_measures, indexing_info(1).manual_set, [1:9], -1, 1, 0, -1, indexing_info(1).labels, 1);
-% Isophonics et al.:
-figure, [asig pval a a_] = do_correlation(megadatacube, ismember(mirex_dset_origin,1), indexing_info(1).manual_set, [1:9], -1, 1, 0, -1, indexing_info(1).labels, 1);
-% RWC (AIST):
-figure, [asig pval a a_] = do_correlation(megadatacube, ismember(mirex_dset_origin,3), indexing_info(1).manual_set, [1:9], -1, 1, 0, -1, indexing_info(1).labels, 1);
-% SALAMI:
-figure, [asig pval a a_] = do_correlation(megadatacube, ismember(mirex_dset_origin,4), indexing_info(1).manual_set, [1:9], -1, 1, 0, -1, indexing_info(1).labels, 1);
-% % % Fig 2a
-% All the datasets:
-figure, [asig pval a a_] = do_correlation(megadatacube, seg_measures, indexing_info(2).manual_set, [1:9], -1, 0, 1, -1, indexing_info(2).labels, 1);
-% Isophonics et al.:
-figure, [asig pval a a_] = do_correlation(megadatacube, ismember(mirex_dset_origin,1), indexing_info(2).manual_set, [1:9], -1, 0, 1, -1, indexing_info(2).labels, 1);
-% RWC (INRIA):
-figure, [asig pval a a_] = do_correlation(megadatacube, ismember(mirex_dset_origin,2), indexing_info(2).manual_set, [1:9], -1, 0, 1, -1, indexing_info(2).labels, 1);
-% RWC (AIST):
-figure, [asig pval a a_] = do_correlation(megadatacube, ismember(mirex_dset_origin,3), indexing_info(2).manual_set, [1:9], -1, 0, 1, -1, indexing_info(2).labels, 1);
-% SALAMI:
-figure, [asig pval a a_] = do_correlation(megadatacube, ismember(mirex_dset_origin,4), indexing_info(2).manual_set, [1:9], -1, 0, 1, -1, indexing_info(2).labels, 1);
-% % % Fig 2b
-% All the datasets:
-figure, [asig pval a a_] = do_correlation(megadatacube, seg_measures, indexing_info(2).manual_set, [1:9], -1, 1, 0, -1, indexing_info(2).labels, 1);
-% Isophonics et al.:
-figure, [asig pval a a_] = do_correlation(megadatacube, ismember(mirex_dset_origin,1), indexing_info(2).manual_set, [1:9], -1, 1, 0, -1, indexing_info(2).labels, 1);
-% RWC (INRIA):
-figure, [asig pval a a_] = do_correlation(megadatacube, ismember(mirex_dset_origin,2), indexing_info(2).manual_set, [1:9], -1, 1, 0, -1, indexing_info(2).labels, 1);
-% RWC (AIST):
-figure, [asig pval a a_] = do_correlation(megadatacube, ismember(mirex_dset_origin,3), indexing_info(2).manual_set, [1:9], -1, 1, 0, -1, indexing_info(2).labels, 1);
-% SALAMI:
-figure, [asig pval a a_] = do_correlation(megadatacube, ismember(mirex_dset_origin,4), indexing_info(2).manual_set, [1:9], -1, 1, 0, -1, indexing_info(2).labels, 1);
 
+fprintf('Section 3.2: ''While the middle half of the values of nsa [number of segments in annotation] ranges from 7 to 13 segments, the middle values for nse [number of segments for estimated description] for most algorithms range from 11 to 20 segments. The two exceptions are MHRAF and OYZS [algorithms 4 and 5], for which both msle and nse match the distributions seen in the annotations.''\n')
 
-% Section 3.2: "While the middle half of the values of nsa [number of segments in annotation] ranges from 7 and 13 segments, the middle values for nse [number of segments for estimated description] for most algorithms range from 11 to 20 segments. The two exceptions are MHRAF and OYZS [algorithms 4 and 5], for which both msle and nse match the distributions seen in the annotations."
-
-% Index 17 gives the number of segments in the annotation; 21 gives the number of segments in the estimated description of the algorithm.
-% Boxplot shows general trend of overestimating number of segments.
+fprintf('Index 17 gives the number of segments in the annotation; 21 gives the number of segments in the estimated description of the algorithm.\n')
+fprintf('Boxplot shows general trend of overestimating number of segments.\n')
 H = boxplot(megadatacube(:,[17 21],:))
-% Take the middle half of the data for annotated and estimated segments. Look at the range.
+fprintf('Take the middle half of the data for annotated and estimated segments. Look at the range.\n')
 
 tmp = sort(megadatacube(:,17,:));
 tmp = sort(tmp(:));
 tmp(round(length(tmp)/4)), tmp(3*round(length(tmp)/4))
-% The middle half of the annotated descriptions have 7 to 13 segments.
+fprintf('The middle half of the annotated descriptions have 7 to 13 segments.\n')
 
 tmp2 = sort(megadatacube(:,21,:));
 [tmp2(round(length(tmp2)/4),:,:), tmp2(round(length(tmp2)*3/4),:,:)]
-% Setting aside algorithms 4 and 5, the others all have middle ranges of roughly 11 to 24.
+fprintf('Setting aside algorithms 4 and 5, the others all have middle ranges of roughly 11 to 24.\n')
 tmp2 = sort(tmp2(:));
 tmp2(round(length(tmp2)/4)), tmp2(3*round(length(tmp2)/4))
-% Averaging the other algorithms together, the middle range is exactly 10 to 20.
+fprintf('Averaging the other algorithms together, the middle range is exactly 10 to 20.\n')
 
 
+% Of all the songs that have been pinpointed in one dataset or another, sort them by pw_f, and look at the best and worst performing songs.
+% PW_F is the 3rd metric (index 3 along the second dimension) of the megadatacube.
+% Take the mean PW_F across all the algorithms.
+tmp = mean(megadatacube(:,3,:),3);
+% Now we will rank by song, and look at the origin of the top and bottom PW_Fs.
 
-
-
-do blah
-% % % % % % % % % % % % The rest of this is still under construction, so I have inserted an error in the previous line to halt the script.
-
-%   %   %   %   %   %   %   %   %   %   %   ENd OF REAL WORK AREA   %   %   %   %   %   %   %   %   %   %   %   %   %
-
-
+find(mirex_dset_origin==1)
 
 % Look at best 10 and worst 10 songs in each dataset, according to PW_F metric.
 % Average results across algorithms for this one.
+
+% First, we will not let the fact that many versions of some algorithms exist skew the results.
+% So, we replace algo no. 3 with the mean of algos 1, 2, 3 and 9 (KSP1, KSP2, KSP3, and SP1)
+tmp_datacube = datacube;
+tmp_datacube(:,:,3) = mean(tmp_datacube(:,:,[1:3,9]),3);
+% And replace algo 7 with algos 7 and 8 (SMGA1 and SMGA2)
+tmp_datacube(:,:,7) = mean(tmp_datacube(:,:,7:8),3);
+% Now there are just 5 unique algorithms:
 unique_algorithms = [3 4 5 6 7];
-tmp = datacube;
-tmp(:,:,3) = mean(tmp(:,:,[1:3,9]),3);
-tmp(:,:,7) = mean(tmp(:,:,7:8),3);
-tmp = mean(tmp(mirex_dset_origin==1,:,unique_algorithms),3);
-[tmp1 order] = sortrows(tmp,-3);
-order1 = lab_measures(order);
-pub_songids = mir2pub(order);
-values = tmp1((pub_songids>0),3);
+% Let TMP_DC_RESULTS be the average performance across the algorithms on the main set of metrics (those in DATACUBE) for all the songs in the first dataset, i.e., Isophonics and Beatles.
+tmp_dc_results = mean(tmp_datacube(mirex_dset_origin==1,:,unique_algorithms),3);
+% Sort the songs in decreasing order of the third metric (which is PW_F)
+[tmp_dc_results order] = sortrows(tmp_dc_results,-3);
+% order1 = lab_measures(order);
+pub_songids = mir2pub(order);   % These are the matched IDs of the songs
+values = tmp_dc_results((pub_songids>0),3);  % We want the match to be >0 --- i.e., we only care about positively identified songs
+% Now scoop up all the filenames of the songs.
 filenames = {};
 for i=1:length(pub_songids),
     if pub_songids(i)>0,
@@ -146,439 +151,44 @@
     end
 end
 
-mirid = pub2mir(336);
-make_structure_image(mirid, miranns, MD, mirdset, X, MR)
-saveas(gcf,'./plots/MJ_dont_care.jpg')
-make_structure_image(121, miranns, MD, mirdset, X, MR)
-saveas(gcf,'./plots/play_the_game.jpg')
 
-% Plot difficulty by album:
+fprintf('Section 4: ''The piece with the highest median pwf is The Beatles'' ''''Her Majesty''''...''\n')
+fprintf('''The next-best Beatles song, ''''I Will'''', is an instance where both the states and sequences hypotheses apply well...''\n')
+fprintf('(Note: due to a change in the script, the song ''''Her Majesty'''' is no longer identified properly, and hence does not show up here in the results. Instead, the top two songs are ''''I Will'''' and ''''Penny Lane''''.)\n')
+fprintf('%f, %s\n',tmp_dc_results(1,3),filenames{1})
+fprintf('%f, %s\n',tmp_dc_results(2,3),filenames{2})
 
+fprintf('Section 4: ''At the bottom is Jackson''s ''''They Don''t Care About Us''''.''\n')
+fprintf('%f, %s\n',tmp_dc_results(end,3),filenames{end})
 
-genres = {};
-subgenres = {};
-issalami = zeros(length(filenames),1);
-for i=1:length(filenames),
-    file = filenames{i};
-    if strfind(file,'SALAMI_data'),
-        issalami(i)=1;
-        salami_id = file(79:85);
-        salami_id = salami_id(1:strfind(salami_id,'/')-1);
-        salami_row = find(aaux.metadata{1}==str2num(salami_id));
-        genres{end+1} = cell2mat(aaux.metadata{15}(salami_row));
-        subgenres{end+1} = cell2mat(aaux.metadata{16}(salami_row));
+fprintf('Section 4: ''Conspicuously, 17 of the easiest 20 songs (again, with respect to pwf) are Beatles tunes, while only 2 of the most difficult 20 songs are---the rest being Michael Jackson, Queen and Carole King songs.''\n')
+fprintf('The easiest 20 songs:\n')
+for i=1:20,
+    fprintf('%s\n',filenames{i})
+end
+fprintf('The hardest 20 songs:\n')
+for i=1:20,
+    fprintf('%s\n',filenames{end+1-i})
+end
+
+values = tmp_dc_results(:,3);
+values = values(pub_songids>0);
+groups = public_dset_origin(pub_songids(pub_songids>0),:);
+artists = zeros(size(values));
+for i=1:length(values),
+    if groups(i,1)~=2,
+        artists(i) = 4; % Beatles
+    else
+        artists(i) = groups(i,2);
     end
 end
-gs = grp2idx(genres);
-subgs = grp2idx(subgenres);
-boxplot(values(find(issalami)),transpose(genres))
-axis([0.5 5.5 0 1])
-saveas(gcf,'salami_breakdown.png')
-boxplot(values(find(issalami)),transpose(subgenres),'colors',cmap(round(gs*63/6),:),'orientation','horizontal')
+% Kruskal-Wallis test:
+fprintf('Section 5: ''Taking the median pwf across the algorithms and comparing this value for the 274 annotations identified as one of these four artists, a Kruskal-Wallis test confirms that the groups differ.''\n')
+[P, anovatab, stats] = kruskalwallis(values, artists);
+fprintf('Section 5: ''A multiple comparison test reveals that pwf is significantly greater for the Beatles group than the three others.''\n')
+multcompare(stats)
+fprintf('Note: in the version created for the article, the Zweieck songs were not identified, and these sentences refer to 4 artists when in fact these comparisons refer to 5 artists.\n')
 
-[tmp1 tmp2] = hist(subgs,max(subgs)-1);
-tmp1 = find(tmp1>5);  % do these subgenres only
-tmp1 = ismember(subgs,tmp1);
-tmp2 = find(issalami);
-boxplot(values(tmp2(tmp1)),transpose(subgenres(tmp1)),'colors',cmap(round(gs(tmp1)*63/6),:),'orientation','horizontal')
-
-
-
-
-
-% Look at scatter plots so that we can qualitatively attribute the correlations to things (e.g., low-precision variance).
-tmpcube = mean(datacube,3);
-for i=1:4,
-    for j=i+1:5,
-        subplot(5,5,i+(j-1)*5)
-        scatter(tmpcube(:,i),tmpcube(:,j),'x')
-    end
-end
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-clf,imagesc(a.*(abs(a)>.7))
-set(gca,'XTickLabel',[],'XTick',(1:50)-.5)
-set(gca,'YTickLabel',s,'YTick',(1:50))
-t = text((1:50)-.5,51*ones(1,50),s);
-set(t,'HorizontalAlignment','right','VerticalAlignment','top', 'Rotation',90);
-hold on
-for i=1:9,
-    plot([0 50],[i*5 i*5],'w')
-    plot([i*5 i*5],[0 50],'w')
-end
-
-% a = corr([datacube(1:300,:,1) newcube(1:300,:,1) newmetriccube(1:300,:,1)]);
-
-a = corr([datacube(lab_measures,:,1) newcube(lab_measures,:,1) newmetriccube(lab_measures,:,1)]);
-b = corr([datacube(seg_measures,:,1) newcube(seg_measures,:,1) newmetriccube(seg_measures,:,1)]);
-
-% Look at label measures only in this case.
-imagesc(sortrows(transpose(sortrows((abs(a)>0.7)))))
-[t1 t2] = (sortrows(transpose(sortrows((abs(a)>0.7)))));
-
-
-b = zeros(size(a));
-for j=[3,4,5,6,7,9],
-    b = b+corr([datacube(:,:,j) newcube(:,:,j) newmetriccube(:,:,j)]);
-end
-b=b/6;
-
-
-% Look at correlations among all figures, but pay attention to pvalues too.
-% Only plot those less than 0.05, with conservative bonferroni correction.
-megadatacube_l = [datacube(lab_measures,:,:) newcube(lab_measures,:,:) newmetriccube(lab_measures,:,:)];
-megadatacube_s = [datacube(seg_measures,:,:) newcube(seg_measures,:,:) newmetriccube(seg_measures,:,:)];
-% megadatacube_l = median(megadatacube_l(:,use_these_labels,:),3);
-% megadatacube_s = median(megadatacube_s(:,use_these_segs,:),3);
-
-
-
-megadatacube_all = median(megadatacube_l(:,[use_these_labels use_these_segs use_these_extras],:),3);
-megadatacube_all(:,16:17) = 1 - megadatacube_all(:,16:17);
-[al pval] = corr(megadatacube_all);
-m = length(al)*(length(al)-1)/2;
-imagesc(al.*((pval*m)<0.05))
-al_ = al.*((pval*m)<0.05);
-al_ = tril(al_ .* (abs(al_)>.5));
-imagesc(al_)
-for i=1:length(al_),
-    for j=1:length(al_),
-        if (al_(i,j)~=0) & (i~=j),
-            text(j-.35,i,num2str(al_(i,j),2))
-        end
-    end
-end
-% [bl pvbl] = corr(megadatacube_all,'type','Kendall');
-m = length(bl)*(length(bl)-1)/2;
-imagesc(bl.*((pvbl*m)<0.05))
-bl_ = bl.*((pvbl*m)<0.05);
-bl_ = tril(bl_) % .* (abs(bl_)>.0));
-imagesc(bl_)
-for i=1:length(bl_),
-    for j=1:length(bl_),
-        if (bl_(i,j)~=0) & (i~=j),
-            text(j-.35,i,num2str(bl_(i,j),2))
-        end
-    end
-end
-
-% Or, we could do this: Take all the computed Kendall taus, i.e., the non-diagonal elements of bl.
-taus = bl(find(bl<1));
-taus = taus-mean(taus);
-taus = taus/std(taus);
-P = normcdf(-abs(taus));
-ind = find(P<=0.05);
-taus = bl(find(bl<1));
-taus(ind)
-
-c = colormap;
-c(32,:) = [1 1 1];
-c(31,:) = [1 1 1];
-c = min(1,c*1.6);
-colormap(c)
-set(gca,'XTickLabel',[],'XTick',(1:length(al_))-.4)
-set(gca,'YTickLabel',s([use_these_labels use_these_segs use_these_extras]),'YTick',(1:length(al_)))
-t = text((1:length(al_))-.3,(length(al_)+1)*ones(1,length(al_))+.3,s([use_these_labels use_these_segs use_these_extras]));
-set(t,'HorizontalAlignment','right','VerticalAlignment','top', 'Rotation',90);
-axis([0 31 0 31])
-saveas(gcf,'./plots/all_correlations.jpg')
-
-s = {'S_o','S_u','pw_f','pw_p','pw_r','rand','bf1','bp1','br1','bf6','bp6','br6','mt2c','mc2t','ds','len','nsa','nla','msla','nspla','nse','nle','msle','nsple','ob','ol','pw_f_x','pw_p_x','pw_r_x','K','asp','acp','I_AE_x','H_EA_x','H_AE_x','S_o_x','S_u_x','rand','mt2c_x','mc2t_x','m','f','d_ae_x','d_ea_x','b_f1_x','b_p1_x','b_r1_x','b_f6_x','b_p6_x','b_r6_x'};
-s_type = [1,2,3,1,2,3,6,4,5,6,4,5,4,5, 7,7,7,7,7,7,7,7,7,7,7,7,3,1,2,3,2,1,3,1,2,1,2, 3,4,5,5,4,7,7,3,1,2,3,1,2];
-megadatacube_s(:,40:41,:) = 1 - megadatacube_s(:,40:41,:);
-megadatacube_s(:,51,:) = 2*megadatacube_s(:,38,:).*megadatacube_s(:,39,:)./(megadatacube_s(:,38,:)+megadatacube_s(:,39,:));
-% This makes a new 51st metric which is a combination of m and f.
-s_type(51) = 6;
-s{51} = 'mf';
-
-
-% [a pval] = corr(median([datacube(lab_measures,:,1) newcube(lab_measures,:,1) newmetriccube(lab_measures,:,1)],3));
-[a pval] = corr(mean(megadatacube_l,3));
-m = length(a)*(length(a)-1)/2;
-imagesc(a.*((pval*m)<0.05))
-a_ = a.*((pval*m)<0.05);
-c = colormap;
-c(32,:) = [1 1 1];
-colormap(c)
-
-% I want to make a claim about song length correlating to the algorithms or not. Let us make sure it is valid across all algorithms, and is not just applicable to the median:
-for j=1:9,
-    a = corr([datacube(lab_measures,:,j) newcube(lab_measures,:,j) newmetriccube(lab_measures,:,j)]);
-    a(16,[17 19 21 23])
-end
-
-% BoxPlot of the number of segments in each algorithm output
-boxplot(reshape(newcube(:,7,:),[length(newcube),9,1]))
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-% Now again, we will want to run the correlation study by taking medians across algorithms (do the metrics rank the songs the same way?) and medians across songs (do the metrics rank the algorithms the same way?).
-
-% Take the label metrics only, take median across songs:
-% tmpcube = median(megadatacube_l(:,sind_manual1,:),1);
-% tmpcube = transpose(reshape(tmpcube,size(tmpcube,2),size(tmpcube,3)));
-% [a pval] = corr(tmpcube,'type','Kendall');
-% m = length(a)*(length(a)-1)/2;
-% a.*((pval*m)<0.05); % This is the matrix of values that are significant.
-% Alternatively, we can plot all the metrics, treat them as random normal variables, and select only those that stand out.
-
-
-
-% [asig pval a] = do_correlation(megadatacube, songs, metrics, algos, algo_groups, merge_algos (1 = do, 0 = do not), merge_songs, merge_dsets, metric_labels)
-[asig pval a] = do_correlation(megadatacube, lab_measures, sind_manual1, [1:9], -1, 0, 1, -1, s_manual1)
-[asig pval a] = do_correlation(megadatacube, lab_measures, [use_these_labels use_these_segs], [1:9], -1, 0, 1, -1, s([use_these_labels use_these_segs]))
-
-[asig pval a] = do_correlation(megadatacube, lab_measures, [1:12], [1:9], -1, 0, 1, -1, s(1:12))
-
-
-[a pval] = corr(megadatacube_l(:,:,1),'type','Kendall');
-
-
-
-% Take the label metrics only, take median across algorithms:
-tmpcube = median(megadatacube_l(:,sind_manual1,:),3);
-[a pval] = corr(tmpcube); %,'type','Kendall');
-m = length(a)*(length(a)-1)/2;
-a.*((pval*m)<0.05); % This is the matrix of values that are significant.
-% However, with so many data points (over 1400) it is very easy to be significant...
-
-
-
-
-imagesc(a.*((pval*m)<0.05))
-al_ = al.*((pval*m)<0.05);
-al_ = tril(al_ .* (abs(al_)>.5));
-imagesc(al_)
-for i=1:length(al_),
-    for j=1:length(al_),
-        if (al_(i,j)~=0) & (i~=j),
-            text(j-.35,i,num2str(al_(i,j),2))
-        end
-    end
-end
-
-
-clf,imagesc(a.*(abs(a)>.7))
-set(gca,'XTickLabel',[],'XTick',(1:50)-.5)
-set(gca,'YTickLabel',s,'YTick',(1:50))
-t = text((1:50)-.5,51*ones(1,50),s);
-set(t,'HorizontalAlignment','right','VerticalAlignment','top', 'Rotation',90);
-hold on
-for i=1:9,
-    plot([0 50],[i*5 i*5],'w')
-    plot([i*5 i*5],[0 50],'w')
-end
-
-% a = corr([datacube(1:300,:,1) newcube(1:300,:,1) extracube(1:300,:,1)]);
-
-a = corr([datacube(lab_measures,:,1) newcube(lab_measures,:,1) extracube(lab_measures,:,1)]);
-b = corr([datacube(seg_measures,:,1) newcube(seg_measures,:,1) extracube(seg_measures,:,1)]);
-
-% Look at label measures only in this case.
-imagesc(sortrows(transpose(sortrows((abs(a)>0.7)))))
-[t1 t2] = (sortrows(transpose(sortrows((abs(a)>0.7)))));
-
-
-b = zeros(size(a));
-for j=[3,4,5,6,7,9],
-    b = b+corr([datacube(:,:,j) newcube(:,:,j) extracube(:,:,j)]);
-end
-b=b/6;
-
-
-% Look at correlations among all figures, but pay attention to pvalues too.
-% Only plot those less than 0.05, with conservative bonferroni correction.
-megadatacube_l = [datacube(lab_measures,:,:) newcube(lab_measures,:,:) extracube(lab_measures,:,:)];
-megadatacube_s = [datacube(seg_measures,:,:) newcube(seg_measures,:,:) extracube(seg_measures,:,:)];
-% megadatacube_l = median(megadatacube_l(:,use_these_labels,:),3);
-% megadatacube_s = median(megadatacube_s(:,use_these_segs,:),3);
-
-
-
-megadatacube_all = median(megadatacube_l(:,[use_these_labels use_these_segs use_these_extras],:),3);
-megadatacube_all(:,16:17) = 1 - megadatacube_all(:,16:17);
-[al pval] = corr(megadatacube_all);
-m = length(al)*(length(al)-1)/2;
-imagesc(al.*((pval*m)<0.05))
-al_ = al.*((pval*m)<0.05);
-al_ = tril(al_ .* (abs(al_)>.5));
-imagesc(al_)
-for i=1:length(al_),
-    for j=1:length(al_),
-        if (al_(i,j)~=0) & (i~=j),
-            text(j-.35,i,num2str(al_(i,j),2))
-        end
-    end
-end
-% [bl pvbl] = corr(megadatacube_all,'type','Kendall');
-m = length(bl)*(length(bl)-1)/2;
-imagesc(bl.*((pvbl*m)<0.05))
-bl_ = bl.*((pvbl*m)<0.05);
-bl_ = tril(bl_) % .* (abs(bl_)>.0));
-imagesc(bl_)
-for i=1:length(bl_),
-    for j=1:length(bl_),
-        if (bl_(i,j)~=0) & (i~=j),
-            text(j-.35,i,num2str(bl_(i,j),2))
-        end
-    end
-end
-
-% Or, we could do this: Take all the computed Kendall taus, i.e., the non-diagonal elements of bl.
-taus = bl(find(bl<1));
-taus = taus-mean(taus);
-taus = taus/std(taus);
-P = normcdf(-abs(taus));
-ind = find(P<=0.05);
-taus = bl(find(bl<1));
-taus(ind)
-
-c = colormap;
-c(32,:) = [1 1 1];
-c(31,:) = [1 1 1];
-c = min(1,c*1.6);
-colormap(c)
-set(gca,'XTickLabel',[],'XTick',(1:length(al_))-.4)
-set(gca,'YTickLabel',s([use_these_labels use_these_segs use_these_extras]),'YTick',(1:length(al_)))
-t = text((1:length(al_))-.3,(length(al_)+1)*ones(1,length(al_))+.3,s([use_these_labels use_these_segs use_these_extras]));
-set(t,'HorizontalAlignment','right','VerticalAlignment','top', 'Rotation',90);
-axis([0 31 0 31])
-saveas(gcf,'./plots/all_correlations.jpg')
-
-s = {'S_o','S_u','pw_f','pw_p','pw_r','rand','bf1','bp1','br1','bf6','bp6','br6','mt2c','mc2t','ds','len','nsa','nla','msla','nspla','nse','nle','msle','nsple','ob','ol','pw_f_x','pw_p_x','pw_r_x','K','asp','acp','I_AE_x','H_EA_x','H_AE_x','S_o_x','S_u_x','rand','mt2c_x','mc2t_x','m','f','d_ae_x','d_ea_x','b_f1_x','b_p1_x','b_r1_x','b_f6_x','b_p6_x','b_r6_x'};
-s_type = [1,2,3,1,2,3,6,4,5,6,4,5,4,5, 7,7,7,7,7,7,7,7,7,7,7,7,3,1,2,3,2,1,3,1,2,1,2, 3,4,5,5,4,7,7,3,1,2,3,1,2];
-megadatacube_s(:,38:39,:) = 1 - megadatacube_s(:,38:39,:);
-megadatacube_s(:,51,:) = 2*megadatacube_s(:,38,:).*megadatacube_s(:,39,:)./(megadatacube_s(:,38,:)+megadatacube_s(:,39,:));
-% This makes a new 51st metric which is a combination of m and f.
-s_type(51) = 6;
-s{51} = 'mf';
-
-
-% [a pval] = corr(median([datacube(lab_measures,:,1) newcube(lab_measures,:,1) extracube(lab_measures,:,1)],3));
-[a pval] = corr(mean(megadatacube_l,3));
-m = length(a)*(length(a)-1)/2;
-imagesc(a.*((pval*m)<0.05))
-a_ = a.*((pval*m)<0.05);
-c = colormap;
-c(32,:) = [1 1 1];
-colormap(c)
-
-% I want to make a claim about song length correlating to the algorithms or not. Let us make sure it is valid across all algorithms, and is not just applicable to the median:
-for j=1:9,
-    a = corr([datacube(lab_measures,:,j) newcube(lab_measures,:,j) extracube(lab_measures,:,j)]);
-    a(16,[17 19 21 23])
-end
-
-% BoxPlot of the number of segments in each algorithm output
-boxplot(reshape(newcube(:,7,:),[length(newcube),9,1]))
-
-% Look at best 10 and worst 10 songs in each dataset, according to PW_F metric.
-% Average results across algorithms for this one.
-unique_algorithms = [3 4 5 6 7];
-tmp = datacube;
-tmp(:,:,3) = mean(tmp(:,:,[1:3,9]),3);
-tmp(:,:,7) = mean(tmp(:,:,7:8),3);
-tmp = mean(tmp(lab_measures,:,unique_algorithms),3);
-[tmp1 order] = sortrows(tmp,-3);
-order1 = lab_measures(order);
-pub_songids = X.mir2pub(order1);
-values = tmp1((pub_songids>0),3);
-filenames = {};
-for i=1:length(pub_songids),
-    if pub_songids(i)>0,
-        filenames{end+1} = public_truth(pub_songids(i)).file;
-    end
-end
-
-mirid = pub2mir(336);
-make_structure_image(mirid, mirex_truth, mirex_output, mirex_dset_origin, X, mirex_results)
+% Create composite structure diagram for Michael Jackson song:
+make_structure_image(pub2mir(pub_songids(end)),mirex_truth, mirex_output, mirex_results, mirex_dset_origin)
 saveas(gcf,'./plots/MJ_dont_care.jpg')
-make_structure_image(121, mirex_truth, mirex_output, mirex_dset_origin, X, mirex_results)
-saveas(gcf,'./plots/play_the_game.jpg')
-
-% Plot difficulty by album:
-
-
-genres = {};
-subgenres = {};
-issalami = zeros(length(filenames),1);
-for i=1:length(filenames),
-    file = filenames{i};
-    if strfind(file,'SALAMI_data'),
-        issalami(i)=1;
-        salami_id = file(79:85);
-        salami_id = salami_id(1:strfind(salami_id,'/')-1);
-        salami_row = find(aaux.metadata{1}==str2num(salami_id));
-        genres{end+1} = cell2mat(aaux.metadata{15}(salami_row));
-        subgenres{end+1} = cell2mat(aaux.metadata{16}(salami_row));
-    end
-end
-gs = grp2idx(genres);
-subgs = grp2idx(subgenres);
-boxplot(values(find(issalami)),transpose(genres))
-axis([0.5 5.5 0 1])
-saveas(gcf,'salami_breakdown.png')
-boxplot(values(find(issalami)),transpose(subgenres),'colors',cmap(round(gs*63/6),:),'orientation','horizontal')
-
-[tmp1 tmp2] = hist(subgs,max(subgs)-1);
-tmp1 = find(tmp1>5);  % do these subgenres only
-tmp1 = ismember(subgs,tmp1);
-tmp2 = find(issalami);
-boxplot(values(tmp2(tmp1)),transpose(subgenres(tmp1)),'colors',cmap(round(gs(tmp1)*63/6),:),'orientation','horizontal')
\ No newline at end of file
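Editorial sketch, not part of the changeset: the "middle half" figures that the script above prints for Section 3.2 come from sorting the values and reading off the entries roughly one quarter and three quarters of the way through. The snippet below illustrates that idea on invented toy data; the variable names `x`, `lo` and `hi` are hypothetical, and it uses `round(3*n/4)` for the upper index, which can differ by one position from the `3*round(n/4)` form used in part of the script.

```matlab
% Toy data, invented for illustration (e.g. segment counts for 8 songs).
x = [5 3 9 1 7 11 13 15];

% Flatten and sort in ascending order, as the script does with megadatacube slices.
x = sort(x(:));
n = length(x);

% Read off the values bounding the middle half of the data.
% max/min guard against an index of 0 or beyond n for very short inputs.
lo = x(max(1, round(n/4)));
hi = x(min(n, round(3*n/4)));

fprintf('Middle half spans %d to %d\n', lo, hi);
```

This approach needs no toolbox functions; a quantile routine would interpolate between neighbouring values and so may report slightly different bounds.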
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/make_structure_image.m	Wed Mar 05 01:02:26 2014 +0000
@@ -0,0 +1,42 @@
+function make_structure_image(mirid, mirex_truth, mirex_output, mirex_results, mirex_dset_origin)
+% Quick script to accept an input song number and output an image
+% of the annotation against the estimated descriptions.
+
+% Identify the dataset corresponding to the ID of that song.
+dset = mirex_dset_origin(mirid);
+% Provide the index into that dataset of the ID of that song.
+id_in_dset = mirid + 1 - find(mirex_dset_origin==dset,1);
+
+% Assemble all the song descriptions. Start with the annotation, then collect all the algorithm outputs.
+descs = {};
+descs{1} = mirex_truth(mirid);
+for i=1:9,
+    descs{i+1} = mirex_output(dset).algo(i).song(id_in_dset);
+end
+% descs = descs([1 2 3 4 10 5 6 7 8 9]);
+descorder = [1 2 3 4 6 7 8 9 10 5];
+
+cmap = colormap;
+
+gcf; clf, hold on
+text_extent = max(descs{1}.tim);
+for i=2:length(descs),
+    text_extent = max([text_extent, max(descs{i}.tim)]);
+end
+for i=1:length(descs),
+    s_colors = grp2idx(descs{i}.lab);
+    cmap_rows = round(s_colors*length(cmap)/max(s_colors));
+    y = descorder(i)*(-1);
+    for n=1:length(descs{i}.tim)-1,
+        pos = [descs{i}.tim(n), y, descs{i}.tim(n+1)-descs{i}.tim(n), 0.8];
+        cmap_row = cmap(cmap_rows(n),:);
+        rectangle('Position',pos,'FaceColor',cmap_row);
+    end
+    if i>1,
+        pwftmp = mirex_results(dset).algo(i-1).results(id_in_dset,3);
+        text(text_extent + 2,y+.5,num2str(pwftmp,2))
+    end
+end
+axis([0 text_extent+17 -10 0])
+s = {'Ground Truth','KSP1','KSP2','KSP3','SP1','MHRAF','OYZS','SBV','SMGA1','SMGA2'};
+set(gca,'YTickLabel',fliplr(s),'YTick',(-9.5:-0.5))
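Editorial sketch, not part of the changeset: make_structure_image.m converts a global song index (`mirid`) into an index within its own dataset by subtracting the position of the first song that belongs to that dataset. This only works if the songs of each dataset are stored contiguously in `mirex_dset_origin`, which the function appears to assume. The toy vector below is invented for illustration; `first_in_dset` is a hypothetical helper variable.

```matlab
% Toy origin vector (invented): three songs from dataset 1, two from
% dataset 2, and four from dataset 3, stored contiguously.
mirex_dset_origin = [1 1 1 2 2 3 3 3 3];

mirid = 7;                                           % global index of the song
dset = mirex_dset_origin(mirid);                     % dataset the song belongs to
first_in_dset = find(mirex_dset_origin == dset, 1);  % first song of that dataset
id_in_dset = mirid + 1 - first_in_dset;              % 1-based index within the dataset

fprintf('Song %d is song %d of dataset %d\n', mirid, id_in_dset, dset);
```

If the datasets were interleaved instead, something like `sum(mirex_dset_origin(1:mirid) == dset)` would be the safer way to count the song's position within its dataset.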
Binary file match_mirex_to_public_data_results.mat has changed
Binary file plots/MJ_dont_care.jpg has changed
--- a/readme.txt	Sat Feb 22 21:25:43 2014 +0000
+++ b/readme.txt	Wed Mar 05 01:02:26 2014 +0000
@@ -8,22 +8,23 @@
 
 1: You will need Ruby and Matlab and a connection to the Internet.
 
-2: You will need to download a version of the Structural Analysis Evaluation project, also hosted on SoundSoftware. You can do so here:
+2: Download a version of the Structural Analysis Evaluation project, also hosted on SoundSoftware. You can do so here:
 <https://code.soundsoftware.ac.uk/projects/structural_analysis_evaluation/repository>
 
-3: You will need to edit some of the Ruby and Matlab files you have downloaded, in order to point the program to the desired folders:
-   >> In "1-get_mirex_estimates.rb", set the path to download all the data
+3: Edit the directory paths set in the Ruby and Matlab files you have downloaded, in order to point the program to the desired folders:
+   >> In "1_get_mirex_estimates.rb", set the path to download all the data
 NOTE: We recommend making this the "./mirex_data" path, since some of the data is already there!
-   >> In "2-generate_smith2013_ismir", set the exact same path
-   >> In "2-generate_smith2013_ismir", set the path for the "structural analysis evaluation" repository
+   >> In "2_generate_smith2013_ismir", set the exact same path
+   >> In "2_generate_smith2013_ismir", set the path for the "structural analysis evaluation" repository
 
-4. Run the Ruby script "1-get_mirex_estimates.rb" and wait a while for all the data to download. Alternatively, because this takes a long time, just unzip the contents of the included MIREX_DATA.zip file.
+4. Run the Ruby script "1_get_mirex_estimates.rb" and wait a while for all the data to download. [Alternatively, because this takes a long time, just unzip the contents of the included MIREX_DATA.zip file. It's all there. Running the Ruby script just allows you to literally reproduce my work.]
 
 5. Unzip all the folders that you obtained.
 	Note: in this version, one of the repositories, the Ewald Peiszer repository, is included already as a zip file ("ep_groundtruth_txt.zip"). If you set "./mirex_data" as the download path in Step 3, then just unzip it here. Otherwise, move it to wherever the rest of the zips are.
+	Note: the SALAMI data contains a zipfile, so after unzipping it, you will need to unzip one of its contents again.
 	Note: due to inconsistencies in how different zipping programs handle things, the folder structure upon unzipping may be inconsistent. Please look at the Ground Truth Directory map below and make sure your files unzip in the same way. If they don't, you'll have to move things around until the structure matches.
 
-6. Run the Matlab script "2-generate_smith2013_ismir" and wait for all the data to be assembled, and for the figures to be generated. They will appear in "./plots". This repository includes what those pictures *should* look like. Hopefully you overwrite them with exact replicas.
+6. Run the Matlab script "2_generate_smith2013_ismir" and wait for all the data to be assembled, and for the figures to be generated. They will appear in "./plots". This repository includes what those pictures *should* look like. Hopefully you overwrite them with exact replicas.
 
 7. You're done! Hey, that wasn't so bad.
 
@@ -37,7 +38,7 @@
 
 ===== Ground Truth Directory map =====
 
-When your ground truth is all downloaded and unzipped, it should look like this:
+When your ground truth is all downloaded and unzipped, it should look like this (relevant for Step 5 above):
 
 *
 |-- AIST.RWC-MDB-C-2001.CHORUS
Binary file smith&chew2013-meta_analysis_of_mirex.pdf has changed