changeset 33:d28d94bf6b39
doc update
author   | Amine Sehili <amine.sehili@gmail.com>
date     | Wed, 02 Dec 2015 11:10:54 +0100
parents  | ea905bc19458
children | 3609351078ce
files    | auditok/__init__.py auditok/core.py auditok/dataset.py auditok/io.py auditok/util.py doc/apireference.rst doc/apitutorial.rst doc/cmdline.rst doc/conf.py doc/index.rst
diffstat | 10 files changed, 44 insertions(+), 516 deletions(-)
--- a/auditok/__init__.py	Tue Dec 01 20:13:02 2015 +0100
+++ b/auditok/__init__.py	Wed Dec 02 11:10:54 2015 +0100
@@ -1,485 +1,10 @@
 """
-`auditok` is a module that can be used as a generic tool for data
-tokenization. Although its core motivation is **Acoustic Activity
-Detection** (AAD) and extraction from audio streams (i.e. detect
-where a noise/an acoustic activity occurs within an audio stream and
-extract the corresponding portion of signal), it can easily be
-adapted to other tasks.
+:author:
-
-Globally speaking, it can be used to extract, from a sequence of
-observations, all sub-sequences that meet a certain number of
-criteria in terms of:
+Amine SEHILI <amine.sehili@gmail.com>
+September-December, 2015
-
-1. Minimum length of a **valid** token (i.e. sub-sequence)
-2. Maximum length of a valid token
-3. Maximum tolerated consecutive **non-valid** observations within
-   a valid token
-
-Examples of a non-valid observation are: a non-numeric ASCII symbol
-if you are interested in sub-sequences of numeric symbols, or a silent
-audio window (of 10, 20 or 100 milliseconds for instance) if what
-interests you are audio regions made up of a sequence of "noisy"
-windows (whatever kind of noise: speech, baby cry, laughter, etc.).
-
-The most important component of `auditok` is the `auditok.core.StreamTokenizer`
-class. An instance of this class encapsulates a `DataValidator` and can be
-configured to detect the desired regions from a stream. The
-`StreamTokenizer.tokenize` method accepts a `DataSource` object that has a
-`read` method. Read data can be of any type accepted by the `validator`.
-
-As the main aim of this module is **Acoustic Activity Detection**,
-it provides the `auditok.util.ADSFactory` factory class that makes
-it very easy to create an `AudioDataSource` (a class that implements
-`DataSource`) object, be that from:
-
- - A file on the disk
- - A buffer of data
- - The built-in microphone (requires PyAudio)
-
-The `AudioDataSource` class inherits from `DataSource` and supplies
-a higher abstraction level than `AudioSource` thanks to a number of
-handy features (a sketch combining them follows the Requirements note below):
-
- - Define a fixed-length `block_size` (i.e. analysis window)
- - Allow overlap between two consecutive analysis windows
-   (`hop_size` < `block_size`). This can be very important if your
-   validator uses the **spectral** information of audio data instead
-   of raw audio samples.
- - Limit the amount (i.e. duration) of read data (very useful when
-   reading data from the microphone)
- - Record and rewind data (also useful if you read data from the
-   microphone and you want to process it many times off-line and/or
-   save it)
-
-Last but not least, the current version has only one audio window
-validator, based on signal energy.
-
-Requirements
-------------
-`auditok` requires [PyAudio](http://people.csail.mit.edu/hubert/pyaudio/)
-for audio acquisition and playback.
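The features above map onto keyword arguments of `ADSFactory.ads`. Here is a minimal
sketch of how they can be combined; `record`, `max_time` and `block_size` follow the
examples further down this tutorial, while `hop_size` is an assumption suggested by
the overlap feature described above:

    #!python
    from auditok import ADSFactory

    # Read at most 5 seconds from the microphone, using 20 ms analysis
    # windows (320 samples at the default 16KHz rate) that overlap by
    # half, and record the stream so it can be rewound later.
    ads = ADSFactory.ads(record=True, max_time=5,
                         block_size=320, hop_size=160)  # hop_size is an assumption

    ads.open()
    while True:
        frame = ads.read()
        if frame is None:
            break
    ads.rewind()  # the recorded data can now be re-processed off-line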
-Illustrative examples with strings
-----------------------------------
-Let us look at some examples using the `auditok.util.StringDataSource` class
-created for test and illustration purposes. Imagine that each character of
-`auditok.util.StringDataSource` data represents an audio slice of 100 ms, for
-example. In the following examples we will use upper case letters to represent
-noisy audio slices (i.e. analysis windows or frames) and lower case letters
-for silent frames.
-
-## Extract sub-sequences of consecutive upper case letters
-We want to extract sub-sequences of characters that have:
-
- - A minimum length of 1 (`min_length` = 1)
- - A maximum length of 9999 (`max_length` = 9999)
- - Zero consecutive lower case characters within them (`max_continuous_silence` = 0)
-
-We also create the `UpperCaseChecker` whose `is_valid` method returns `True` if the
-checked character is upper case and `False` otherwise.
-
-    #!python
-    from auditok import StreamTokenizer, StringDataSource, DataValidator
-
-    class UpperCaseChecker(DataValidator):
-        def is_valid(self, frame):
-            return frame.isupper()
-
-    dsource = StringDataSource("aaaABCDEFbbGHIJKccc")
-    tokenizer = StreamTokenizer(validator=UpperCaseChecker(),
-                min_length=1, max_length=9999, max_continuous_silence=0)
-
-    tokenizer.tokenize(dsource)
-
-The output is a list of two tuples, each containing the extracted sub-sequence
-and its start and end positions in the original sequence:
-
-    #!python
-    [(['A', 'B', 'C', 'D', 'E', 'F'], 3, 8), (['G', 'H', 'I', 'J', 'K'], 11, 15)]
-
-## Tolerate up to 2 non-valid (lower case) letters within an extracted sequence
-
-To do so, we set `max_continuous_silence` = 2:
-
-    #!python
-    from auditok import StreamTokenizer, StringDataSource, DataValidator
-
-    class UpperCaseChecker(DataValidator):
-        def is_valid(self, frame):
-            return frame.isupper()
-
-    dsource = StringDataSource("aaaABCDbbEFcGHIdddJKee")
-    tokenizer = StreamTokenizer(validator=UpperCaseChecker(),
-                min_length=1, max_length=9999, max_continuous_silence=2)
-
-    tokenizer.tokenize(dsource)
-
-output:
-
-    #!python
-    [(['A', 'B', 'C', 'D', 'b', 'b', 'E', 'F', 'c', 'G', 'H', 'I', 'd', 'd'], 3, 16), (['J', 'K', 'e', 'e'], 18, 21)]
-
-Notice the trailing lower case letters "dd" and "ee" at the end of the two
-tokens. The default behavior of `StreamTokenizer` is to keep the *trailing
-silence* if it doesn't exceed `max_continuous_silence`. This can be changed
-using the `DROP_TAILING_SILENCE` mode (see next example).
-
-## Remove trailing silence
-
-Trailing silence can be useful for many sound recognition applications,
-including speech recognition. Moreover, from the point of view of the human
-auditory system, a trailing low-energy signal helps avoid abrupt signal cuts.
-
-If you want to remove it anyway, you can do so by setting `mode` to
-`StreamTokenizer.DROP_TAILING_SILENCE`:
-
-    #!python
-    from auditok import StreamTokenizer, StringDataSource, DataValidator
-
-    class UpperCaseChecker(DataValidator):
-        def is_valid(self, frame):
-            return frame.isupper()
-
-    dsource = StringDataSource("aaaABCDbbEFcGHIdddJKee")
-    tokenizer = StreamTokenizer(validator=UpperCaseChecker(),
-                min_length=1, max_length=9999, max_continuous_silence=2,
-                mode=StreamTokenizer.DROP_TAILING_SILENCE)
-
-    tokenizer.tokenize(dsource)
-
-output:
-
-    #!python
-    [(['A', 'B', 'C', 'D', 'b', 'b', 'E', 'F', 'c', 'G', 'H', 'I'], 3, 14), (['J', 'K'], 18, 19)]
-
-## Limit the length of detected tokens
-
-Imagine that you just want to detect and recognize a small part of a long
-acoustic event (e.g. engine noise, water flow, etc.) and prevent that event
-from hogging the tokenizer, delaying its delivery to the next processing
-step (i.e. a sound recognizer). You can do this by:
-
- - limiting the length of a detected token
-
- and
-
- - using a callback function as an argument to `StreamTokenizer.tokenize`
-   so that the tokenizer delivers a token as soon as it is detected.
-
-The following code limits the length of a token to 5:
-
-    #!python
-    from auditok import StreamTokenizer, StringDataSource, DataValidator
-
-    class UpperCaseChecker(DataValidator):
-        def is_valid(self, frame):
-            return frame.isupper()
-
-    dsource = StringDataSource("aaaABCDEFGHIJKbbb")
-    tokenizer = StreamTokenizer(validator=UpperCaseChecker(),
-                min_length=1, max_length=5, max_continuous_silence=0)
-
-    def print_token(data, start, end):
-        print("token = '{0}', starts at {1}, ends at {2}".format(''.join(data), start, end))
-
-    tokenizer.tokenize(dsource, callback=print_token)
-
-output:
-
-    "token = 'ABCDE', starts at 3, ends at 7"
-    "token = 'FGHIJ', starts at 8, ends at 12"
-    "token = 'K', starts at 13, ends at 13"
-
-Using real audio data
----------------------
-In this section we will use `ADSFactory`, `AudioEnergyValidator` and
-`StreamTokenizer` for an AAD demonstration using audio data. Before we go
-any further, it is worth explaining a few points.
-
-The `ADSFactory.ads` method is called to create an `AudioDataSource` object
-that can be passed to `StreamTokenizer.tokenize`. `ADSFactory.ads` accepts a
-number of keyword arguments, none of which is mandatory. The returned
-`AudioDataSource` object can however greatly differ depending on the passed
-arguments. Further details can be found in the respective method
-documentation. Note however the following two calls, which create an
-`AudioDataSource` that reads data from an audio file and from the built-in
-microphone respectively.
-
-    #!python
-    from auditok import ADSFactory
-
-    # Get an AudioDataSource from a file
-    file_ads = ADSFactory.ads(filename = "path/to/file/")
-
-    # Get an AudioDataSource from the built-in microphone.
-    # The returned object uses the default values for sampling
-    # rate, sample width and number of channels; see the method's
-    # documentation for customized values.
-    mic_ads = ADSFactory.ads()
-
-For `StreamTokenizer`, the parameters `min_length`, `max_length` and
-`max_continuous_silence` are expressed in terms of number of frames. If you
-want a `max_length` of *2 seconds* for your detected sound events and your
-*analysis window* is *10 ms* long, you have to specify a `max_length` of 200
-(`int(2. / (10. / 1000)) == 200`). For a `max_continuous_silence` of *300 ms*,
-for instance, the value to pass to `StreamTokenizer` is 30
-(`int(0.3 / (10. / 1000)) == 30`).
-
-Where do you get the size of the **analysis window** from?
-
-It is a parameter you pass to `ADSFactory.ads`. By default, `ADSFactory.ads`
-uses an analysis window of 10 ms. The number of samples that 10 ms of signal
-contains will vary depending on the sampling rate of your audio source (file,
-microphone, etc.). For a sampling rate of 16KHz (16000 samples per second),
-we have 160 samples for 10 ms. Therefore you can use block sizes of 160, 320
-and 1600 for analysis windows of 10, 20 and 100 ms respectively.
-
-    #!python
-    from auditok import ADSFactory
-
-    file_ads = ADSFactory.ads(filename = "path/to/file/", block_size = 160)
-
-    file_ads = ADSFactory.ads(filename = "path/to/file/", block_size = 320)
-
-    # If no sampling rate is specified, ADSFactory uses 16KHz as the default
-    # rate for the microphone. If you want to use a window of 100 ms, use
-    # a block size of 1600
-    mic_ads = ADSFactory.ads(block_size = 1600)
-
-So if you're not sure what your analysis window in seconds is, use the following:
-
-    #!python
-    my_ads = ADSFactory.ads(...)
-    analysis_win_seconds = float(my_ads.get_block_size()) / my_ads.get_sampling_rate()
-    analysis_window_ms = analysis_win_seconds * 1000
-
-    # For a `max_continuous_silence` of 300 ms use:
-    max_continuous_silence = int(300. / analysis_window_ms)
-
-    # Which is the same as:
-    max_continuous_silence = int(0.3 / (analysis_window_ms / 1000))
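Since this conversion comes up for every frame-based parameter, a small helper
keeps it in one place. This is an illustrative sketch, not part of the `auditok`
API:

    #!python
    def seconds_to_frames(duration, window_seconds):
        """Convert a duration in seconds to a number of analysis windows."""
        return int(duration / window_seconds)

    # Assuming a 10 ms analysis window:
    window = 10. / 1000
    min_length = seconds_to_frames(0.2, window)              # 200 ms -> 20 frames
    max_length = seconds_to_frames(2.0, window)              # 2 s    -> 200 frames
    max_continuous_silence = seconds_to_frames(0.3, window)  # 300 ms -> 30 frames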
-Examples
---------
-
-## Extract isolated phrases from an utterance
-We will build an `AudioDataSource` using a wave file from the `dataset`
-module. The file contains isolated pronunciations of digits from 1 to 6 in
-Arabic, as well as a breath-in/out between digits 2 and 3. The code will play
-the original file, then the detected sounds separately. Note that we use an
-`energy_threshold` of 65; this parameter should be chosen carefully, as it
-depends on microphone quality, background noise and the amplitude of the
-events you want to detect.
-
-    #!python
-    from auditok import ADSFactory, AudioEnergyValidator, StreamTokenizer, player_for, dataset
-
-    # We set the `record` argument to True so that we can rewind the source
-    asource = ADSFactory.ads(filename=dataset.one_to_six_arabic_16000_mono_bc_noise, record=True)
-
-    validator = AudioEnergyValidator(sample_width=asource.get_sample_width(), energy_threshold=65)
-
-    # Default analysis window is 10 ms (float(asource.get_block_size()) / asource.get_sampling_rate())
-    # min_length=20 : minimum length of a valid audio activity is 20 * 10 == 200 ms
-    # max_length=400 : maximum length of a valid audio activity is 400 * 10 == 4000 ms == 4 seconds
-    # max_continuous_silence=30 : maximum tolerated silence within a valid audio activity is 30 * 10 == 300 ms
-    tokenizer = StreamTokenizer(validator=validator, min_length=20, max_length=400, max_continuous_silence=30)
-
-    asource.open()
-    tokens = tokenizer.tokenize(asource)
-
-    # Play detected regions back
-    player = player_for(asource)
-
-    # Rewind and read the whole signal
-    asource.rewind()
-    original_signal = []
-
-    while True:
-        w = asource.read()
-        if w is None:
-            break
-        original_signal.append(w)
-
-    original_signal = b''.join(original_signal)
-
-    print("Playing the original file...")
-    player.play(original_signal)
-
-    print("Playing detected regions...")
-    for t in tokens:
-        print("Token starts at {0} and ends at {1}".format(t[1], t[2]))
-        data = b''.join(t[0])
-        player.play(data)
-
-    assert len(tokens) == 8
-
-The tokenizer extracts 8 audio regions from the signal, including all isolated
-digits (from 1 to 6) as well as the 2-phase respiration of the subject. You
-might have noticed that, in the original file, the last three digits are
-closer to each other than the previous ones. If you want them to be extracted
-as one single phrase, you can do so by tolerating a larger continuous silence
-within a detection:
-
-    #!python
-    tokenizer.max_continuous_silence = 50
-    asource.rewind()
-    tokens = tokenizer.tokenize(asource)
-
-    for t in tokens:
-        print("Token starts at {0} and ends at {1}".format(t[1], t[2]))
-        data = b''.join(t[0])
-        player.play(data)
-
-    assert len(tokens) == 6
-
-## Trim leading and trailing silence
-
-The tokenizer in the following example is set up to remove the silence that
-precedes the first acoustic activity and the silence that follows the last
-activity in a record. It preserves whatever it finds between the two
-activities. In other words, it removes the leading and trailing silence.
-
-The sampling rate is 44100 samples per second; we'll use an analysis window
-of 100 ms (i.e. block_size == 4410).
-
-The energy threshold is 50.
-
-The tokenizer will start accumulating windows from the moment it encounters
-the first analysis window with an energy >= 50. ALL the following windows are
-kept, regardless of their energy. At the end of the analysis, it drops
-trailing windows with an energy below 50.
-
-This is an interesting example because the audio file we're analyzing
-contains a very brief noise that occurs within the leading silence. We
-certainly do not want our tokenizer to trigger at this point and consider
-whatever comes after as useful signal. To force the tokenizer to ignore that
-brief event we use two other parameters, `init_min` and `init_max_silence`.
-With `init_min=3` and `init_max_silence=1` we tell the tokenizer that a valid
-event must start with at least 3 noisy windows, between which there is at
-most 1 silent window.
-
-Even with this configuration the tokenizer can still detect that noise as a
-valid event (if it actually contains 3 consecutive noisy frames). To
-circumvent this we use a large enough analysis window (here 100 ms) to ensure
-that the brief noise is averaged with a much longer stretch of silence, so
-that the energy of the overall analysis window stays below 50.
-
-When using a shorter analysis window (of 10 ms for instance, block_size == 441),
-the brief noise contributes more to the energy calculation, which yields an
-energy of over 50 for the window. Again, we can deal with this situation by
-using a higher energy threshold (55, for example).
-
-    #!python
-    from auditok import ADSFactory, AudioEnergyValidator, StreamTokenizer, player_for, dataset
-
-    # record = True so that we'll be able to rewind the source.
-    asource = ADSFactory.ads(filename=dataset.was_der_mensch_saet_mono_44100_lead_tail_silence,
-                             record=True, block_size=4410)
-    asource.open()
-
-    original_signal = []
-    # Read the whole signal
-    while True:
-        w = asource.read()
-        if w is None:
-            break
-        original_signal.append(w)
-
-    original_signal = b''.join(original_signal)
-
-    # Rewind the source
-    asource.rewind()
-
-    # Create a validator with an energy threshold of 50
-    validator = AudioEnergyValidator(sample_width=asource.get_sample_width(), energy_threshold=50)
-
-    # Create a tokenizer with an unlimited token length and unlimited
-    # continuous silence within a token. Note the DROP_TAILING_SILENCE
-    # mode that ensures trailing silence is removed.
-    trimmer = StreamTokenizer(validator, min_length=20, max_length=99999999,
-                              init_min=3, init_max_silence=1,
-                              max_continuous_silence=9999999,
-                              mode=StreamTokenizer.DROP_TAILING_SILENCE)
-
-    tokens = trimmer.tokenize(asource)
-
-    # Make sure we only have one token
-    assert len(tokens) == 1, "Should have detected one single token"
-
-    trimmed_signal = b''.join(tokens[0][0])
-
-    player = player_for(asource)
-
-    print("Playing original signal (with leading and trailing silence)...")
-    player.play(original_signal)
-    print("Playing trimmed signal...")
-    player.play(trimmed_signal)
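The examples above hinge on a sensible `energy_threshold` (65, then 50 or 55).
If you want to probe what a given source actually yields, you can estimate the
energy of a few analysis windows yourself. The sketch below uses the
standard-library `audioop` module and a log scale; it is only an approximation
and may not match the exact quantity `AudioEnergyValidator` computes internally:

    #!python
    import audioop
    import math

    def window_log_energy(window, sample_width):
        # Root mean square of the raw samples
        rms = audioop.rms(window, sample_width)
        if rms == 0:
            return float("-inf")
        # Log scale, roughly comparable to thresholds like 50 or 65
        return 20 * math.log10(rms)

    # With an open, rewindable AudioDataSource named asource:
    asource.rewind()
    for _ in range(10):
        w = asource.read()
        if w is None:
            break
        print(window_log_energy(w, asource.get_sample_width()))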
-
-## Online audio signal processing
-
-In the next example, audio data is acquired directly from the built-in
-microphone. The `tokenize` method is passed a callback function so that audio
-activities are delivered as soon as they are detected. Each detected activity
-is played back using the built-in audio output device.
-
-As mentioned before, signal energy is strongly related to many factors such
-as microphone sensitivity, background noise (including noise inherent to the
-hardware), distance and your operating system's sound settings. Try a lower
-`energy_threshold` if your noise does not seem to be detected, and a higher
-threshold if you notice over-detection (i.e. `echo` prints a detection where
-you have made no noise).
-
-    #!python
-    from auditok import ADSFactory, AudioEnergyValidator, StreamTokenizer, player_for
-
-    # record = True so that we'll be able to rewind the source.
-    # max_time = 10: read 10 seconds from the microphone
-    asource = ADSFactory.ads(record=True, max_time=10)
-
-    validator = AudioEnergyValidator(sample_width=asource.get_sample_width(), energy_threshold=50)
-    tokenizer = StreamTokenizer(validator=validator, min_length=20, max_length=250, max_continuous_silence=30)
-
-    player = player_for(asource)
-
-    def echo(data, start, end):
-        print("Acoustic activity at: {0}--{1}".format(start, end))
-        player.play(b''.join(data))
-
-    asource.open()
-
-    tokenizer.tokenize(asource, callback=echo)
-
-If you want to re-run the tokenizer after changing one or more parameters,
-use the following code:
-
-    #!python
-    asource.rewind()
-    # Change the energy threshold, for example
-    tokenizer.validator.set_energy_threshold(55)
-    tokenizer.tokenize(asource, callback=echo)
-
-In case you want to play back the whole recorded signal, use:
-
-    #!python
-    player.play(asource.get_audio_source().get_data_buffer())
-
-Contributing
-------------
-`auditok` is on [GitHub](https://github.com/amsehili). You're welcome to fork
-it and contribute.
-
-@author: Amine SEHILI <amine.sehili@gmail.com>
-September, 2015
-
-License
--------
+:License:
 
 This package is published under GNU GPL Version 3.
 """
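Playing tokens back is one option; saving them is another. The sketch below
writes each detected token to a WAV file with the standard-library `wave`
module, reusing the source's getters seen above (the `get_channels` accessor
is assumed here):

    #!python
    import wave

    def save_token(data, start, end, asource, prefix="token"):
        # `data` is the list of byte strings a token is made of
        fname = "{0}_{1}_{2}.wav".format(prefix, start, end)
        fp = wave.open(fname, "wb")
        fp.setnchannels(asource.get_channels())  # assumed accessor; 1 for mono sources
        fp.setsampwidth(asource.get_sample_width())
        fp.setframerate(asource.get_sampling_rate())
        fp.writeframes(b''.join(data))
        fp.close()

    # Usage with the output of tokenizer.tokenize(asource):
    # for data, start, end in tokens:
    #     save_token(data, start, end, asource)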
--- a/auditok/core.py	Tue Dec 01 20:13:02 2015 +0100
+++ b/auditok/core.py	Wed Dec 02 11:10:54 2015 +0100
@@ -1,7 +1,16 @@
+"""
+This module gathers processing (i.e. tokenization) classes.
+
+Class summary
+=============
+
+.. autosummary::
+
+        StreamTokenizer
+"""
+
 from auditok.util import DataValidator
-
-
 __all__ = ["StreamTokenizer"]
 
@@ -220,13 +229,13 @@
         New mode, must be one of:
 
-        - a. `StreamTokenizer.STRICT_MIN_LENGTH`
+        - `StreamTokenizer.STRICT_MIN_LENGTH`
 
-        - b. `StreamTokenizer.DROP_TRAILING_SILENCE`
+        - `StreamTokenizer.DROP_TRAILING_SILENCE`
 
-        - c. `StreamTokenizer.STRICT_MIN_LENGTH | StreamTokenizer.DROP_TRAILING_SILENCE`
+        - `StreamTokenizer.STRICT_MIN_LENGTH | StreamTokenizer.DROP_TRAILING_SILENCE`
 
-        - d. `0`
+        - `0`
 
         See `StreamTokenizer.__init__` for more information about the mode.
         """
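The third mode in the list above is simply a bitwise OR of the two flags.
Reusing the `UpperCaseChecker` from the tutorial, the sketch below shows the
combination syntax and the documented effect of `STRICT_MIN_LENGTH` (the
continuation of a token truncated at `max_length` is dropped if it is shorter
than `min_length`); the expected output is an assumption based on that
description:

    #!python
    from auditok import StreamTokenizer, StringDataSource, DataValidator

    class UpperCaseChecker(DataValidator):
        def is_valid(self, frame):
            return frame.isupper()

    dsource = StringDataSource("aaaABCDEFGHIJKbbb")
    tokenizer = StreamTokenizer(validator=UpperCaseChecker(),
                min_length=3, max_length=5, max_continuous_silence=0,
                mode=StreamTokenizer.STRICT_MIN_LENGTH | StreamTokenizer.DROP_TRAILING_SILENCE)

    # 'K', the continuation of the token truncated at max_length, is only
    # 1 frame long (< min_length) and should therefore be dropped:
    print(tokenizer.tokenize(dsource))
    # expected: [(['A', 'B', 'C', 'D', 'E'], 3, 7), (['F', 'G', 'H', 'I', 'J'], 8, 12)]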
--- a/auditok/dataset.py	Tue Dec 01 20:13:02 2015 +0100
+++ b/auditok/dataset.py	Wed Dec 02 11:10:54 2015 +0100
@@ -1,10 +1,5 @@
 """
-.. py:module:: dataset
-    :synopsis: This module contains links to audio files you can use for test purposes.
-
-.. moduleauthor:: Amine SEHILI <amine.sehili@gmail.com>
-
-September 2015
+This module contains links to audio files you can use for test purposes.
 """
 
 import os
--- a/auditok/io.py	Tue Dec 01 20:13:02 2015 +0100
+++ b/auditok/io.py	Wed Dec 02 11:10:54 2015 +0100
@@ -1,5 +1,5 @@
 """
-Module for low-level audio input-output operations. 
+Module for low-level audio input-output operations.
 
 Class summary
 =============
--- a/auditok/util.py	Tue Dec 01 20:13:02 2015 +0100
+++ b/auditok/util.py	Wed Dec 02 11:10:54 2015 +0100
@@ -1,5 +1,3 @@
-#!/usr/bin/env python
-# -*- coding: utf-8 -*-
 """
 Class summary
 =============
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/doc/apireference.rst	Wed Dec 02 11:10:54 2015 +0100
@@ -0,0 +1,11 @@
+`auditok` API Reference
+=======================
+
+.. toctree::
+    :titlesonly:
+    :maxdepth: 3
+
+    auditok.core <core.rst>
+    auditok.util <util.rst>
+    auditok.io <io.rst>
+    auditok.dataset <dataset.rst>
--- a/doc/apitutorial.rst	Tue Dec 01 20:13:02 2015 +0100
+++ b/doc/apitutorial.rst	Wed Dec 02 11:10:54 2015 +0100
@@ -70,7 +70,7 @@
 Let us look at some examples using the :class:`auditok.util.StringDataSource` class
 created for test and illustration purposes. Imagine that each character of
-:class:`auditok.util.StringDataSource` data represent an audio slice of 100 ms for
+:class:`auditok.util.StringDataSource` data represents an audio slice of 100 ms for
 example. In the following examples we will use upper case letters to represent
 noisy audio slices (i.e. analysis windows or frames) and lower case letter for
 silent frames.
--- a/doc/cmdline.rst	Tue Dec 01 20:13:02 2015 +0100
+++ b/doc/cmdline.rst	Wed Dec 02 11:10:54 2015 +0100
@@ -18,15 +18,17 @@
 
 .. figure:: figures/figure_1.png
     :align: center
-    :alt: alternate text
+    :alt: Output from a detector that tolerates silence periods up to 300 ms
     :figclass: align-center
+    :scale: 40 %
 
 2. the detector splits an audio activity event into many activities if the
 within activity silence is over 0.2 second:
 
 .. figure:: figures/figure_2.png
     :align: center
-    :alt: alternate text
+    :alt: Output from a detector that tolerates silence periods up to 200 ms
    :figclass: align-center
+    :scale: 40 %
 
 
 ******************
@@ -186,8 +188,8 @@
 
 Put some text here...
 
-2nd Practical use case example: build ab (very) basic voice control application
-###############################################################################
+2nd Practical use case example: build a (very) basic voice control application
+##############################################################################
 
 `This repository <https://github.com/amsehili/gspeech-rec>`_ supplies a bash script
 that can send audio data to Google's Speech Recognition service and get its
 transcription. In the following we will use **auditok** as a lower layer component
--- a/doc/conf.py	Tue Dec 01 20:13:02 2015 +0100
+++ b/doc/conf.py	Wed Dec 02 11:10:54 2015 +0100
@@ -112,9 +112,9 @@
 # The theme to use for HTML and HTML Help pages.  See the documentation for
 # a list of builtin themes.
-#html_theme = 'sphinxdoc'
+html_theme = 'sphinxdoc'
 
-
+"""
 # on_rtd is whether we are on readthedocs.org
 import os
 on_rtd = os.environ.get('READTHEDOCS', None) == 'True'
@@ -123,7 +123,7 @@
     import sphinx_rtd_theme
     html_theme = 'sphinx_rtd_theme'
     html_theme_path = [sphinx_rtd_theme.get_html_theme_path()]
-
+"""
 # Theme options are theme-specific and customize the look and feel of a theme
 # further.  For a list of options available for each theme, see the
 # documentation.
--- a/doc/index.rst	Tue Dec 01 20:13:02 2015 +0100
+++ b/doc/index.rst	Wed Dec 02 11:10:54 2015 +0100
@@ -34,30 +34,18 @@
 
     sudo python setup.py install
 
-
 Getting started
 ---------------
 
 .. toctree::
     :titlesonly:
-    :maxdepth: 1
+    :maxdepth: 3
 
     Command-line Usage Guide <cmdline.rst>
     API Tutorial <apitutorial.rst>
+    API Reference <apireference.rst>
 
-API Reference
--------------
-
-.. toctree::
-    :titlesonly:
-    :maxdepth: 1
-
-    auditok.core <core.rst>
-    auditok.util <util.rst>
-    auditok.io <io.rst>
-    auditok.dataset <dataset.rst>
-
 Indices and tables