# HG changeset patch # User mas01cr # Date 1217844034 0 # Node ID 74824093c1c48242e2c296c22fa84447278647d4 # Parent 8764d114ce80b306a858f43853b07dc1fbef89af Implement O((M+N) log(M+N)) duplicate key detection in batchinsert, rather than naive O(N^2). Note that I haven't measured the constants in those complexity expressions; I am anticipating that 40000 is a sufficiently large N for this to matter. Write a test case for duplicate keys, too. Use 0037, since no-one else seems to be writing tests, and everything is merged onto the trunk these days anyway. diff -r 8764d114ce80 -r 74824093c1c4 audioDB.h --- a/audioDB.h Fri Aug 01 16:21:51 2008 +0000 +++ b/audioDB.h Mon Aug 04 10:00:34 2008 +0000 @@ -10,6 +10,8 @@ #include #include #include +#include +#include #include #include #include diff -r 8764d114ce80 -r 74824093c1c4 insert.cpp --- a/insert.cpp Fri Aug 01 16:21:51 2008 +0000 +++ b/insert.cpp Mon Aug 04 10:00:34 2008 +0000 @@ -210,8 +210,14 @@ } char *thisTimesFileName = new char[MAXSTR]; char *thisPowerFileName = new char[MAXSTR]; - - do{ + + std::set s; + + for (unsigned k = 0; k < dbH->numFiles; k++) { + s.insert(fileTable + k*O2_FILETABLE_ENTRY_SIZE); + } + + do { filesIn->getline(thisFile,MAXSTR); if(key && key!=inFile) { keysIn->getline(thisKey,MAXSTR); @@ -238,18 +244,10 @@ error("batchinsert failed: no more room in database", thisFile); } - // Linear scan of filenames check for pre-existing feature - unsigned alreadyInserted=0; - - for(unsigned k=0; knumFiles; k++) - if(strncmp(fileTable + k*O2_FILETABLE_ENTRY_SIZE, thisKey, strlen(thisKey)+1)==0){ - alreadyInserted=1; - break; - } - - if(alreadyInserted) { + if(s.count(thisKey)) { VERB_LOG(0, "key already exists in database: %s\n", thisKey); } else { + s.insert(thisKey); // Make a track index table of features to file indexes unsigned numVectors = (statbuf.st_size-sizeof(int))/(sizeof(double)*dbH->dim); if(!numVectors) { diff -r 8764d114ce80 -r 74824093c1c4 tests/0037/run-test.sh --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/0037/run-test.sh Mon Aug 04 10:00:34 2008 +0000 @@ -0,0 +1,84 @@ +#! /bin/bash + +. ../test-utils.sh + +if [ -f testdb ]; then rm -f testdb; fi + +${AUDIODB} -d testdb -N + +intstring 2 > testfeature01 +floatstring 0 1 >> testfeature01 +floatstring 1 0 >> testfeature01 +intstring 2 > testfeature10 +floatstring 1 0 >> testfeature10 +floatstring 0 1 >> testfeature10 + +cat > testfeaturefiles < testquery +floatstring 0 0.5 >> testquery + +${AUDIODB} -d testdb -Q nsequence -l 1 -f testquery > testoutput +echo testfeature01 1 > test-expected-output +echo 0 0 0 >> test-expected-output +echo 2 0 1 >> test-expected-output +echo testfeature10 1 >> test-expected-output +echo 0 0 1 >> test-expected-output +echo 2 0 0 >> test-expected-output +cmp testoutput test-expected-output + +${AUDIODB} -d testdb -Q nsequence -l 1 -f testquery -n 2 > testoutput +cmp testoutput test-expected-output + +${AUDIODB} -d testdb -Q nsequence -l 1 -f testquery -n 5 > testoutput +cmp testoutput test-expected-output + +${AUDIODB} -d testdb -Q nsequence -l 1 -f testquery -n 1 > testoutput +echo testfeature01 0 > test-expected-output +echo 0 0 0 >> test-expected-output +echo testfeature10 0 >> test-expected-output +echo 0 0 1 >> test-expected-output +cmp testoutput test-expected-output + +echo "query point (0.5,0.0)" +intstring 2 > testquery +floatstring 0.5 0 >> testquery + +${AUDIODB} -d testdb -Q nsequence -l 1 -f testquery > testoutput +echo testfeature01 1 > test-expected-output +echo 0 0 1 >> test-expected-output +echo 2 0 0 >> test-expected-output +echo testfeature10 1 >> test-expected-output +echo 0 0 0 >> test-expected-output +echo 2 0 1 >> test-expected-output +cmp testoutput test-expected-output + +${AUDIODB} -d testdb -Q nsequence -l 1 -f testquery -n 2 > testoutput +cmp testoutput test-expected-output + +${AUDIODB} -d testdb -Q nsequence -l 1 -f testquery -n 5 > testoutput +cmp testoutput test-expected-output + +${AUDIODB} -d testdb -Q nsequence -l 1 -f testquery -n 1 > testoutput +echo testfeature01 0 > test-expected-output +echo 0 0 1 >> test-expected-output +echo testfeature10 0 >> test-expected-output +echo 0 0 0 >> test-expected-output +cmp testoutput test-expected-output + +exit 104 diff -r 8764d114ce80 -r 74824093c1c4 tests/0037/short-description --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/0037/short-description Mon Aug 04 10:00:34 2008 +0000 @@ -0,0 +1,1 @@ +batchinsert repeated key handling \ No newline at end of file