.. auditok documentation.

auditok, an AUDIo TOKenization module
=====================================

**auditok** is a module that can be used as a generic tool for data
tokenization. Although its core motivation is **Acoustic Activity
Detection** (AAD) and extraction from audio streams (i.e. detecting
where an acoustic activity occurs within an audio stream and
extracting the corresponding portion of the signal), it can easily be
adapted to other tasks.

Globally speaking, it can be used to extract, from a sequence of
observations, all sub-sequences that meet a certain number of
criteria in terms of:

1. Minimum length of a **valid** token (i.e. sub-sequence)
2. Maximum length of a valid token
3. Maximum number of tolerated consecutive **non-valid** observations
   within a valid token

Examples of a non-valid observation are: a non-numeric ASCII symbol
if you are interested in sub-sequences of numeric symbols, or a silent
audio window (of 10, 20 or 100 milliseconds for instance) if what
interests you are audio regions made up of a sequence of "noisy"
windows (whatever the kind of noise: speech, baby cry, laughter, etc.).

The most important component of `auditok` is the `StreamTokenizer` class.
An instance of this class encapsulates a `DataValidator` and can be
configured to detect the desired regions from a stream.
The `auditok.core.StreamTokenizer.tokenize` method accepts a `DataSource`
object that has a `read` method. Read data can be of any type accepted
by the `validator`.
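To make the three criteria above concrete, here is a minimal, self-contained
sketch of such a tokenization loop, written for this document as an
illustration. It is not `auditok`'s actual implementation, and the name
`tokenize_sequence` is made up here:

```python
def tokenize_sequence(data, is_valid, min_length, max_length,
                      max_continuous_silence):
    """Extract (token, start, end) triples from a sequence of observations.

    Illustrative sketch only; auditok's StreamTokenizer is more featureful.
    """
    tokens = []
    token = []      # observations accumulated for the current token
    start = 0       # index where the current token started
    silence = 0     # current run of consecutive non-valid observations

    def close():
        # Emit the current token if it is long enough, then reset
        nonlocal silence
        if len(token) >= min_length:
            tokens.append((list(token), start, start + len(token) - 1))
        token.clear()
        silence = 0

    for i, obs in enumerate(data):
        if is_valid(obs):
            if not token:
                start = i
            token.append(obs)
            silence = 0
        elif token:
            if silence < max_continuous_silence:
                # Tolerate this non-valid observation inside the token
                token.append(obs)
                silence += 1
            else:
                close()
        if len(token) == max_length:
            close()
    close()
    return tokens

# Upper case letters play the role of "noisy" observations
print(tokenize_sequence("aaaABCDEFbbGHIJKccc", str.isupper,
                        min_length=1, max_length=9999,
                        max_continuous_silence=0))
# [(['A', 'B', 'C', 'D', 'E', 'F'], 3, 8), (['G', 'H', 'I', 'J', 'K'], 11, 15)]
```

Replace the string and `str.isupper` with audio windows and an energy-based
validator and you get the essence of what the audio examples later in this
document do.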
As the main aim of this module is **Audio Activity Detection**,
it provides the `auditok.util.ADSFactory` factory class that makes
it very easy to create an `AudioDataSource` (a class that implements
`DataSource`) object, be that from:

- A file on the disk
- A buffer of data
- The built-in microphone (requires PyAudio)

The `AudioDataSource` class inherits from `DataSource` and supplies
a higher abstraction level than `AudioSource` thanks to a bunch of
handy features:

- Define a fixed-length analysis window (`block_size`)
- Allow overlap between two consecutive analysis windows
  (`hop_size` < `block_size`). This can be very important if your
  validator uses the **spectral** information of audio data instead
  of raw audio samples.
- Limit the amount (i.e. duration) of read data (very useful when reading
  data from the microphone)
- Record and rewind data (also useful if you read data from the microphone
  and you want to process it many times offline and/or save it)

Last but not least, the current version has only one audio window validator,
based on signal energy.

Requirements
============

`auditok` requires `PyAudio <https://people.csail.mit.edu/hubert/pyaudio/>`_
for audio acquisition and playback.

Illustrative examples with strings
==================================

Let us look at some examples using the `auditok.util.StringDataSource` class,
created for test and illustration purposes. Imagine that each character of
`auditok.util.StringDataSource` data represents an audio slice of 100 ms, for
example. In the following examples we will use upper case letters to represent
noisy audio slices (i.e.
analysis windows or frames) and lower case letters for
silent frames.

Extract sub-sequences of consecutive upper case letters
-------------------------------------------------------

We want to extract sub-sequences of characters that have:

- A minimum length of 1 (`min_length` = 1)
- A maximum length of 9999 (`max_length` = 9999)
- Zero consecutive lower case characters within them (`max_continuous_silence` = 0)

We also create an `UpperCaseChecker` whose `is_valid` method returns `True` if the
checked character is upper case and `False` otherwise.

.. code:: python

    from auditok import StreamTokenizer, StringDataSource, DataValidator

    class UpperCaseChecker(DataValidator):
        def is_valid(self, frame):
            return frame.isupper()

    dsource = StringDataSource("aaaABCDEFbbGHIJKccc")
    tokenizer = StreamTokenizer(validator=UpperCaseChecker(),
                                min_length=1, max_length=9999,
                                max_continuous_silence=0)

    tokenizer.tokenize(dsource)

The output is a list of two tuples; each contains the extracted sub-sequence and
its start and end position in the original sequence respectively:

.. code:: python

    [(['A', 'B', 'C', 'D', 'E', 'F'], 3, 8), (['G', 'H', 'I', 'J', 'K'], 11, 15)]

Tolerate up to two non-valid (lower case) letters within an extracted sequence
------------------------------------------------------------------------------

To do so, we set `max_continuous_silence` to 2:
.. code:: python

    from auditok import StreamTokenizer, StringDataSource, DataValidator

    class UpperCaseChecker(DataValidator):
        def is_valid(self, frame):
            return frame.isupper()

    dsource = StringDataSource("aaaABCDbbEFcGHIdddJKee")
    tokenizer = StreamTokenizer(validator=UpperCaseChecker(),
                                min_length=1, max_length=9999,
                                max_continuous_silence=2)

    tokenizer.tokenize(dsource)

Output:

.. code:: python

    [(['A', 'B', 'C', 'D', 'b', 'b', 'E', 'F', 'c', 'G', 'H', 'I', 'd', 'd'], 3, 16),
     (['J', 'K', 'e', 'e'], 18, 21)]

Notice the trailing lower case letters "dd" and "ee" at the end of the two
tokens. The default behavior of `StreamTokenizer` is to keep the *trailing
silence* if it doesn't exceed `max_continuous_silence`. This can be changed
using the `DROP_TRAILING_SILENCE` mode (see the next example).

Remove trailing silence
-----------------------

Trailing silence can be useful for many sound recognition applications, including
speech recognition. Moreover, from the point of view of the human auditory system,
keeping a trailing low-energy signal helps avoid abrupt signal cuts.

If you want to remove it anyway, you can do so by setting `mode` to
`StreamTokenizer.DROP_TRAILING_SILENCE`:
.. code:: python

    from auditok import StreamTokenizer, StringDataSource, DataValidator

    class UpperCaseChecker(DataValidator):
        def is_valid(self, frame):
            return frame.isupper()

    dsource = StringDataSource("aaaABCDbbEFcGHIdddJKee")
    tokenizer = StreamTokenizer(validator=UpperCaseChecker(),
                                min_length=1, max_length=9999,
                                max_continuous_silence=2,
                                mode=StreamTokenizer.DROP_TRAILING_SILENCE)

    tokenizer.tokenize(dsource)

Output:

.. code:: python

    [(['A', 'B', 'C', 'D', 'b', 'b', 'E', 'F', 'c', 'G', 'H', 'I'], 3, 14),
     (['J', 'K'], 18, 19)]

Limit the length of detected tokens
-----------------------------------

Imagine that you just want to detect and recognize a small part of a long
acoustic event (e.g. engine noise, water flow, etc.) and avoid having that
event hog the tokenizer, preventing it from feeding the event to the next
processing step (i.e. a sound recognizer). You can do this by:

- limiting the length of a detected token

and

- using a callback function as an argument to `StreamTokenizer.tokenize`,
  so that the tokenizer delivers a token as soon as it is detected.

The following code limits the length of a token to 5:
.. code:: python

    from auditok import StreamTokenizer, StringDataSource, DataValidator

    class UpperCaseChecker(DataValidator):
        def is_valid(self, frame):
            return frame.isupper()

    dsource = StringDataSource("aaaABCDEFGHIJKbbb")
    tokenizer = StreamTokenizer(validator=UpperCaseChecker(),
                                min_length=1, max_length=5,
                                max_continuous_silence=0)

    def print_token(data, start, end):
        print("token = '{0}', starts at {1}, ends at {2}".format(''.join(data), start, end))

    tokenizer.tokenize(dsource, callback=print_token)

Output:

.. code::

    token = 'ABCDE', starts at 3, ends at 7
    token = 'FGHIJ', starts at 8, ends at 12
    token = 'K', starts at 13, ends at 13

Using real audio data
=====================

In this section we will use `ADSFactory`, `AudioEnergyValidator` and `StreamTokenizer`
for an AAD demonstration using audio data. Before we get any further, it is worth
explaining a certain number of points.

The `ADSFactory.ads` method is called to create an `AudioDataSource` object that can be
passed to `StreamTokenizer.tokenize`. `ADSFactory.ads` accepts a number of keyword
arguments, none of which is mandatory. The returned `AudioDataSource` object can
however greatly differ depending on the passed arguments. Further details can be found
in the respective method documentation. Note however the following two calls, which
create an `AudioDataSource` that reads data from an audio file and from the built-in
microphone respectively.
.. code:: python

    from auditok import ADSFactory

    # Get an AudioDataSource from a file
    file_ads = ADSFactory.ads(filename="path/to/file/")

    # Get an AudioDataSource from the built-in microphone.
    # The returned object has the default values for sampling
    # rate, sample width and number of channels. See the method's
    # documentation for customized values.
    mic_ads = ADSFactory.ads()

For `StreamTokenizer`, the parameters `min_length`, `max_length` and `max_continuous_silence`
are expressed in terms of number of frames. If you want a `max_length` of *2 seconds* for
your detected sound events and your *analysis window* is *10 ms* long, you have to specify
a `max_length` of 200 (`int(2. / (10. / 1000)) == 200`). For a `max_continuous_silence` of
*300 ms*, for instance, the value to pass to `StreamTokenizer` is 30 (`int(0.3 / (10. / 1000)) == 30`).

Where do you get the size of the **analysis window** from?

Well, this is a parameter you pass to `ADSFactory.ads`. By default `ADSFactory.ads` uses
an analysis window of 10 ms. The number of samples that 10 ms of signal contain will
vary depending on the sampling rate of your audio source (file, microphone, etc.).
For a sampling rate of 16 kHz (16000 samples per second), we have 160 samples for 10 ms.
Therefore you can use block sizes of 160, 320 and 1600 for analysis windows of 10, 20 and
100 ms respectively.

.. code:: python

    from auditok import ADSFactory

    file_ads = ADSFactory.ads(filename="path/to/file/", block_size=160)

    file_ads = ADSFactory.ads(filename="path/to/file/", block_size=320)

    # If no sampling rate is specified, ADSFactory uses 16 kHz as the default
    # rate for the microphone.
    # If you want to use a window of 100 ms, use
    # a block size of 1600.
    mic_ads = ADSFactory.ads(block_size=1600)

So if you are not sure what your analysis window's duration in seconds is, use the following:

.. code:: python

    my_ads = ADSFactory.ads(...)
    analysis_win_seconds = float(my_ads.get_block_size()) / my_ads.get_sampling_rate()
    analysis_window_ms = analysis_win_seconds * 1000

    # For a `max_continuous_silence` of 300 ms use:
    max_continuous_silence = int(300. / analysis_window_ms)

    # Which is the same as:
    max_continuous_silence = int(0.3 / (analysis_window_ms / 1000))

Examples
--------

Extract isolated phrases from an utterance
------------------------------------------

We will build an `AudioDataSource` using a wave file from the `dataset` module.
The file contains isolated pronunciations of digits from 1 to 6
in Arabic, as well as breath-in/out between 2 and 3. The code will play the
original file then the detected sounds separately. Note that we use an
`energy_threshold` of 65; this parameter should be carefully chosen. It depends
on microphone quality, background noise and the amplitude of the events you want to
detect.
.. code:: python

    from auditok import ADSFactory, AudioEnergyValidator, StreamTokenizer, player_for, dataset

    # We set the `record` argument to True so that we can rewind the source
    asource = ADSFactory.ads(filename=dataset.one_to_six_arabic_16000_mono_bc_noise, record=True)

    validator = AudioEnergyValidator(sample_width=asource.get_sample_width(), energy_threshold=65)

    # Default analysis window is 10 ms (float(asource.get_block_size()) / asource.get_sampling_rate())
    # min_length=20 : minimum length of a valid audio activity is 20 * 10 == 200 ms
    # max_length=400 : maximum length of a valid audio activity is 400 * 10 == 4000 ms == 4 seconds
    # max_continuous_silence=30 : maximum tolerated silence within a valid audio activity is 30 * 10 == 300 ms
    tokenizer = StreamTokenizer(validator=validator, min_length=20,
                                max_length=400, max_continuous_silence=30)

    asource.open()
    tokens = tokenizer.tokenize(asource)

    # Play detected regions back
    player = player_for(asource)

    # Rewind and read the whole signal
    asource.rewind()
    original_signal = []

    while True:
        w = asource.read()
        if w is None:
            break
        original_signal.append(w)

    original_signal = ''.join(original_signal)

    print("Playing the original file...")
    player.play(original_signal)

    print("Playing detected regions...")
    for t in tokens:
        print("Token starts at {0} and ends at {1}".format(t[1], t[2]))
        data = ''.join(t[0])
        player.play(data)

    assert len(tokens) == 8

The tokenizer extracts 8 audio regions from the signal, including all isolated digits
(from 1 to 6) as well as the 2-phase respiration of the subject.
You might have noticed that, in the original file, the last three digits are
closer to each other than the previous ones. If you want them to be extracted
as one single phrase, you can do so by tolerating a larger continuous silence
within a detection:

.. code:: python

    tokenizer.max_continuous_silence = 50
    asource.rewind()
    tokens = tokenizer.tokenize(asource)

    for t in tokens:
        print("Token starts at {0} and ends at {1}".format(t[1], t[2]))
        data = ''.join(t[0])
        player.play(data)

    assert len(tokens) == 6

Trim leading and trailing silence
---------------------------------

The tokenizer in the following example is set up to remove the silence
that precedes the first acoustic activity or follows the last activity
in a record. It preserves whatever it finds between the two activities.
In other words, it removes the leading and trailing silence.

The sampling rate is 44100 samples per second; we'll use an analysis window
of 100 ms (i.e. `block_size` == 4410).

The energy threshold is 50.

The tokenizer will start accumulating windows from the moment it encounters
the first analysis window with an energy >= 50. ALL the following windows will be
kept, regardless of their energy. At the end of the analysis, it will drop trailing
windows with an energy below 50.

This is an interesting example because the audio file we're analyzing contains a very
brief noise that occurs within the leading silence. We certainly do not want our tokenizer
to trigger at this point and consider whatever comes after as useful signal.
To force the tokenizer to ignore that brief event we use two other parameters, `init_min`
and `init_max_silence`.
By setting `init_min` = 3 and `init_max_silence` = 1 we tell the tokenizer
that a valid event must start with at least 3 noisy windows, between which there
is at most 1 silent window.

Even with this configuration, the tokenizer may still detect that noise as a valid
event (if it actually contains 3 consecutive noisy frames). To circumvent this we use
a sufficiently large analysis window (here of 100 ms) to ensure that the brief noise
is surrounded by a much longer silence, so that the energy of the overall analysis
window stays below 50.

With a shorter analysis window (of 10 ms for instance, `block_size` == 441), the brief
noise contributes more to the energy calculation, which yields an energy of over 50 for
the window. Again, we can deal with this situation by using a higher energy threshold
(55 for example).

.. code:: python

    from auditok import ADSFactory, AudioEnergyValidator, StreamTokenizer, player_for, dataset
    import pyaudio

    # record = True so that we'll be able to rewind the source.
    asource = ADSFactory.ads(filename=dataset.was_der_mensch_saet_mono_44100_lead_trail_silence,
                             record=True, block_size=4410)
    asource.open()

    original_signal = []
    # Read the whole signal
    while True:
        w = asource.read()
        if w is None:
            break
        original_signal.append(w)

    original_signal = ''.join(original_signal)

    # Rewind the source
    asource.rewind()

    # Create a validator with an energy threshold of 50
    validator = AudioEnergyValidator(sample_width=asource.get_sample_width(), energy_threshold=50)

    # Create a tokenizer with a practically unlimited token length and
    # tolerated continuous silence within a token.
    # Note the DROP_TRAILING_SILENCE mode that ensures trailing silence is removed.
    trimmer = StreamTokenizer(validator, min_length=20, max_length=99999999,
                              init_min=3, init_max_silence=1,
                              max_continuous_silence=9999999,
                              mode=StreamTokenizer.DROP_TRAILING_SILENCE)

    tokens = trimmer.tokenize(asource)

    # Make sure we only have one token
    assert len(tokens) == 1, "Should have detected one single token"

    trimmed_signal = ''.join(tokens[0][0])

    player = player_for(asource)

    print("Playing original signal (with leading and trailing silence)...")
    player.play(original_signal)
    print("Playing trimmed signal...")
    player.play(trimmed_signal)

Online audio signal processing
------------------------------

In the next example, audio data is directly acquired from the built-in microphone.
The `tokenize` method is passed a callback function so that audio activities
are delivered as soon as they are detected. Each detected activity is played
back using the built-in audio output device.
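Before tuning `energy_threshold`, it may help to see roughly what an energy-based
validator computes. The sketch below was written for this document; the exact
formula and scaling used by auditok's `AudioEnergyValidator` may differ. It derives
a log-energy value from one window of signed 16-bit mono PCM samples and compares
it to a threshold:

```python
import math
import struct

def window_log_energy(window_bytes):
    """Log energy (in dB-like units) of a window of signed 16-bit PCM samples.

    Illustrative sketch; auditok's AudioEnergyValidator may use a
    different formula and scaling.
    """
    n = len(window_bytes) // 2  # 2 bytes per 16-bit sample
    samples = struct.unpack("<{0}h".format(n), window_bytes)
    mean_square = sum(s * s for s in samples) / float(n)
    # The floor avoids log10(0) for an all-zero (perfectly silent) window
    return 10. * math.log10(max(mean_square, 1e-10))

def is_noisy_window(window_bytes, energy_threshold=50):
    # A window counts as "noisy" if its log energy reaches the threshold
    return window_log_energy(window_bytes) >= energy_threshold

# A loud square-wave-like window vs. a nearly silent one
loud = struct.pack("<4h", 10000, -10000, 10000, -10000)
quiet = struct.pack("<4h", 1, -1, 1, -1)
print(is_noisy_window(loud), is_noisy_window(quiet))   # True False
```

With a 16 kHz source and a 10 ms analysis window, `window_bytes` would hold
160 samples (320 bytes) instead of the toy 4-sample windows used above.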
As mentioned before, signal energy is strongly related to many factors such as
microphone sensitivity, background noise (including noise inherent to the hardware),
distance and your operating system's sound settings. Try a lower `energy_threshold`
if your noise does not seem to be detected, and a higher threshold if you notice
over-detection (i.e. the `echo` method prints a detection where you have made no noise).

.. code:: python

    from auditok import ADSFactory, AudioEnergyValidator, StreamTokenizer, player_for
    import pyaudio

    # record = True so that we'll be able to rewind the source.
    # max_time = 10: read 10 seconds from the microphone
    asource = ADSFactory.ads(record=True, max_time=10)

    validator = AudioEnergyValidator(sample_width=asource.get_sample_width(), energy_threshold=50)
    tokenizer = StreamTokenizer(validator=validator, min_length=20,
                                max_length=250, max_continuous_silence=30)

    player = player_for(asource)

    def echo(data, start, end):
        print("Acoustic activity at: {0}--{1}".format(start, end))
        player.play(''.join(data))

    asource.open()

    tokenizer.tokenize(asource, callback=echo)

If you want to re-run the tokenizer after changing one or more parameters, use the
following code:

.. code:: python

    asource.rewind()
    # Change the energy threshold, for example
    tokenizer.validator.set_energy_threshold(55)
    tokenizer.tokenize(asource, callback=echo)

In case you want to play the whole recorded signal back, use:

.. code:: python

    player.play(asource.get_audio_source().get_data_buffer())

Contributing
============

**auditok** is on `GitHub <https://github.com/amsehili/auditok>`_.
You're welcome to fork it and contribute.

Amine SEHILI
September 2015

License
=======

This package is published under the GNU GPL Version 3.