[![Build Status](https://travis-ci.org/amsehili/auditok.svg?branch=master)](https://travis-ci.org/amsehili/auditok)
[![Documentation Status](https://readthedocs.org/projects/auditok/badge/?version=latest)](http://auditok.readthedocs.org/en/latest/?badge=latest)

AUDIo TOKenizer
===============

`auditok` is an **Audio Activity Detection** tool that can process online data (read from an audio device or from standard input) as well as audio files. It can be used as a command line program and offers an easy-to-use API.

A more detailed version of this user guide, an API tutorial and an API reference can be found at [Readthedocs](http://auditok.readthedocs.org/en/latest/).

- [Two-figure explanation](https://github.com/amsehili/auditok#two-figure-explanation)
- [Requirements](https://github.com/amsehili/auditok#requirements)
- [Installation](https://github.com/amsehili/auditok#installation)
- [Command line usage](https://github.com/amsehili/auditok#command-line-usage)
    - [Try the detector with your voice](https://github.com/amsehili/auditok#try-the-detector-with-your-voice)
    - [Play back detections](https://github.com/amsehili/auditok#play-back-detections)
    - [Set detection threshold](https://github.com/amsehili/auditok#set-detection-threshold)
    - [Set format for printed detections information](https://github.com/amsehili/auditok#set-format-for-printed-detections-information)
    - [Plot signal and detections](https://github.com/amsehili/auditok#plot-signal-and-detections)
    - [Save plot as image or PDF](https://github.com/amsehili/auditok#save-plot-as-image-or-pdf)
    - [Read data from file](https://github.com/amsehili/auditok#read-data-from-file)
    - [Limit the length of acquired/read data](https://github.com/amsehili/auditok#limit-the-length-of-aquired-data)
    - [Save the whole acquired audio signal](https://github.com/amsehili/auditok#save-the-whole-acquired-audio-signal)
    - [Save each detection into a separate audio file](https://github.com/amsehili/auditok#save-each-detection-into-a-separate-audio-file)
- [Setting detection parameters](https://github.com/amsehili/auditok#setting-detection-parameters)
- [Some practical use cases](https://github.com/amsehili/auditok#some-practical-use-cases)
    - [1st practical use case: generate a subtitles template](https://github.com/amsehili/auditok#1st-practical-use-case-generate-a-subtitles-template)
    - [2nd Practical use case example: build a (very) basic voice control application](https://github.com/amsehili/auditok#2nd-practical-use-case-example-build-a-very-basic-voice-control-application)
- [License](https://github.com/amsehili/auditok#license)
- [Author](https://github.com/amsehili/auditok#author)

Two-figure explanation
----------------------
The following two figures illustrate an audio signal (blue) and regions detected as valid audio activities (green rectangles) according to a given threshold (red dashed line). They respectively depict the detection result when:

1. the detector tolerates phases of silence of up to 0.3 second (300 ms) within an audio activity (also referred to as an acoustic event):
![](doc/figures/figure_1.png)

2. the detector splits an audio activity into several activities if the silence within the activity exceeds 0.2 second:
![](doc/figures/figure_2.png)

Beyond plotting the signal and detections, you can play back audio activities as they are detected, save them, or run a user command each time there is an activity,
optionally passing the file name of the audio activity as an argument to the command.

Requirements
------------
`auditok` can be used with standard Python!

However, if you want more features, the following packages are needed:

- [pydub](https://github.com/jiaaro/pydub): read audio files of popular audio formats (ogg, mp3, etc.) or extract audio from a video file
- [PyAudio](http://people.csail.mit.edu/hubert/pyaudio/): read audio data from the microphone and play back detections
- [matplotlib](http://matplotlib.org/): plot audio signal and detections (see figures above)
- [numpy](http://www.numpy.org): required by matplotlib. Also used for math operations instead of standard Python if available
- Optionally, you can use `sox` or `parecord` for data acquisition and feed `auditok` using a pipe.


Installation
------------

    git clone https://github.com/amsehili/auditok.git
    cd auditok
    python setup.py install

Command line usage
------------------

### Try the detector with your voice

The first thing you want to check is perhaps how `auditok` detects your voice. If you have installed `PyAudio`, just run (`Ctrl-C` to stop):

    auditok

This will print `id`, `start-time` and `end-time` for each detected activity.
If you don't have `PyAudio`, you can use `sox` for data acquisition (`sudo apt-get install sox`) and tell `auditok` to read data from standard input:

    rec -q -t raw -r 16000 -c 1 -b 16 -e signed - | auditok -i - -r 16000 -w 2 -c 1

Note that when data is read from standard input, the same audio parameters must be used for both `sox` (or any other data generation/acquisition tool) and `auditok`. The following table summarizes the audio parameters.

| Audio parameter | sox option | `auditok` option | `auditok` default     |
| --------------- |------------|------------------|-----------------------|
| Sampling rate   | -r         | -r               | 16000                 |
| Sample width    | -b (bits)  | -w (bytes)       | 2                     |
| Channels        | -c         | -c               | 1                     |
| Encoding        | -e         | None             | always signed integer |

Because the values used above match `auditok`'s defaults, the previous command can be shortened to:

    rec -q -t raw -r 16000 -c 1 -b 16 -e signed - | auditok -i -

### PyAudio

When capturing input with PyAudio, you may need to adjust the device index with `-I` if multiple input devices are available. Use `lsusb -t` to get the list of USB devices, or use `arecord -l` if you're using a non-USB input device. If you don't know what index to use, just try `0`, `1`, `2` and so on, outputting the audio using `-E` (echo) until you hear the sound.

You may also get an error `[Errno -9981] Input overflowed` from PyAudio. If that's the case, you need a bigger frame buffer.
Use `-F` with 2048 or 4096 (the default is 1024).

### Play back detections

    auditok -E

**or**

    rec -q -t raw -r 16000 -c 1 -b 16 -e signed - | auditok -i - -E

Option `-E` stands for echo, so `auditok` plays back whatever it detects.
Using `-E` requires `PyAudio`. If you don't have `PyAudio` and want to play detections with sox, use the `-C` option:

    rec -q -t raw -r 16000 -c 1 -b 16 -e signed - | auditok -i - -C "play -q -t raw -r 16000 -c 1 -b 16 -e signed $"

The `-C` option tells `auditok` to interpret its content as a command that should be run whenever `auditok` detects an audio activity, replacing the `$` with the name of a temporary file into which the activity is saved as raw audio. Here we use `play` to play the activity, giving the necessary `play` arguments for raw data.

`rec` and `play` are just aliases for `sox`.

The `-C` option can be useful in many cases. Imagine a command that sends audio data over a network only if there is an audio activity and saves bandwidth during silence.

### Set detection threshold

If you notice that there are too many detections, use a higher value for the energy threshold. (The current version only implements a `validator` based on an energy threshold; the use of spectral information is also desirable and might be part of future releases.) To change the energy threshold (default: 50), use option `-e`:

    auditok -E -e 55

**or**

    rec -q -t raw -r 16000 -c 1 -b 16 -e signed - | auditok -i - -e 55 -C "play -q -t raw -r 16000 -c 1 -b 16 -e signed $"

If, however, you find that the detector is missing some or all of your audio activities, use a lower value for `-e`.

### Set format for printed detections information

By default, `auditok` prints the `id`, `start-time` and `end-time` of each detected activity:

    1 1.87 2.67
    2 3.05 3.73
    3 3.97 4.49
    ...
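
Because the default output is plain whitespace-separated text, it lends itself to post-processing with standard tools. The following is a minimal sketch, not part of `auditok`: it assumes the default `id start-time end-time` layout with times in seconds, and `durations` is a hypothetical helper name. It prints the length of each detection, using sample lines copied from the example output:

```shell
# Hypothetical helper (not part of auditok): reads "id start end" lines on
# standard input and prints "id duration", with the duration in seconds.
durations() {
    awk '{ printf "%s %.2f\n", $1, $3 - $2 }'
}

# Sample lines copied from the example output; in practice the output of
# auditok would be piped into `durations` instead.
printf '%s\n' "1 1.87 2.67" "2 3.05 3.73" "3 3.97 4.49" | durations
```

For the three sample lines this prints `1 0.80`, `2 0.68` and `3 0.52`.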

If you want to customize the output format, use the `--printf` option:

    auditok -e 55 --printf "[{id}]: {start} to {end}"

Output:

    [1]: 0.22 to 0.67
    [2]: 2.81 to 4.18
    [3]: 5.53 to 6.44
    [4]: 7.32 to 7.82
    ...

Keywords `{id}`, `{start}` and `{end}` can be placed and repeated anywhere in the text. Time is shown in seconds. If you want more detailed time information, use `--time-format`:

    auditok -e 55 --printf "[{id}]: {start} to {end}" --time-format "%h:%m:%s.%i"

Output:

    [1]: 00:00:01.080 to 00:00:01.760
    [2]: 00:00:02.420 to 00:00:03.440
    [3]: 00:00:04.930 to 00:00:05.570
    [4]: 00:00:05.690 to 00:00:06.020
    [5]: 00:00:07.470 to 00:00:07.980
    ...

Valid time directives are: `%h` (hours), `%m` (minutes), `%s` (seconds) and `%i` (milliseconds). Two other directives, `%S` (default) and `%I`, can be used for absolute time in seconds and milliseconds respectively.

### Plot signal and detections

Use option `-p`. Requires `matplotlib` and `numpy`.

    auditok ... -p

### Save plot as image or PDF

    auditok ... --save-image output.png

Requires `matplotlib` and `numpy`. Accepted formats: eps, jpeg, jpg, pdf, pgf, png, ps, raw, rgba, svg, svgz, tif, tiff.

### Read data from file

    auditok -i input.wav ...

Install `pydub` for other audio formats.

### Limit the length of acquired data

    auditok -M 12 ...

Time is in seconds.
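
When raw audio is piped in from another tool, it can help to know how much data a given `-M` value corresponds to. For uncompressed PCM this is simply sampling rate × sample width × channels × duration. A minimal sketch, assuming `auditok`'s default audio parameters (16000 Hz, 2 bytes per sample, 1 channel); `raw_bytes` is a hypothetical helper, not an `auditok` command:

```shell
# Hypothetical helper (not part of auditok): number of bytes of raw PCM
# audio for a given duration.
# Arguments: sampling rate (Hz), sample width (bytes), channels, seconds.
raw_bytes() {
    echo $(( $1 * $2 * $3 * $4 ))
}

# 12 seconds at auditok's default parameters (16000 Hz, 2 bytes, mono)
raw_bytes 16000 2 1 12   # prints 384000
```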

### Save the whole acquired audio signal

    auditok -O output.wav ...

Install `pydub` for other audio formats.


### Save each detection into a separate audio file

    auditok -o det_{N}_{start}_{end}.wav ...

You can use free text and place `{N}`, `{start}` and `{end}` wherever you want; they will be replaced by the detection number, `start-time` and `end-time` respectively. Another example:

    auditok -o {start}-{end}.wav ...

Install `pydub` for more audio formats.


Setting detection parameters
----------------------------

Alongside the threshold option `-e` seen so far, a few other options can have a great impact on the detector's behavior. These options are summarized in the following table:


| Option | Description                                                                      | Unit    | Default      |
| -------|----------------------------------------------------------------------------------|---------|--------------|
| `-n`   | Minimum length an accepted audio activity should have                            | second  | 0.2 (200 ms) |
| `-m`   | Maximum length an accepted audio activity should reach                           | second  | 5.0          |
| `-s`   | Maximum length of a continuous silence period within an accepted audio activity  | second  | 0.3 (300 ms) |
| `-d`   | Drop trailing silence from an accepted audio activity                            | boolean | False        |
| `-a`   | Analysis window length (default value should be good)                            | second  | 0.01 (10 ms) |

Some practical use cases
------------------------

### 1st practical use case: generate a subtitles template

Using `--printf` and `--time-format`, the following command, run on an input audio or video file, will generate an **srt** file template. The template can later be edited with a subtitles editor, which reduces the time needed to define where each utterance starts and ends:

    auditok -e 55 -i input.wav -m 10 --printf "{id}\n{start} --> {end}\nPut some text here...\n" --time-format "%h:%m:%s.%i"

Output:

    1
    00:00:00.730 --> 00:00:01.460
    Put some text here...

    2
    00:00:02.440 --> 00:00:03.900
    Put some text here...

    3
    00:00:06.410 --> 00:00:06.970
    Put some text here...

    4
    00:00:07.260 --> 00:00:08.340
    Put some text here...

    5
    00:00:09.510 --> 00:00:09.820
    Put some text here...

### 2nd Practical use case example: build a (very) basic voice control application

[This repository](https://github.com/amsehili/gspeech-rec) supplies a bash script that can send audio data to Google's
Speech Recognition service and get its transcription. In the following we will use **auditok** as a lower layer component
of a voice control application.
The basic idea is to tell **auditok** to run, for each detected audio activity, a certain
number of commands that make up the rest of our voice control application.

Assume you have installed **sox** and downloaded the Speech Recognition script. The sequence of commands to run is:

1- Convert raw audio data to flac using **sox**:

    sox -t raw -r 16000 -c 1 -b 16 -e signed raw_input output.flac

2- Send flac audio data to Google and get its filtered transcription using [speech-rec.sh](https://github.com/amsehili/gspeech-rec/blob/master/speech-rec.sh):

    speech-rec.sh -i output.flac -r 16000

3- Use **grep** to select lines that contain *transcript*:

    grep transcript


4- Launch the following script, giving it the transcription as input:

    #!/bin/bash

    # Read one line of transcription from standard input
    read -r line

    # Launch firefox if the line contains "open firefox" (case-insensitive)
    if echo "$line" | grep -qi "open firefox"
    then
        echo "Launch command: 'firefox &' ... "
        firefox &
    fi

    exit 0

As you can see, the script handles a single voice command. It runs firefox if the text it receives contains **open firefox**.
Save the script into a file named voice-control.sh (don't forget to run **chmod u+x voice-control.sh**).

Now, thanks to option `-C`, we will chain the four instructions with a pipe and tell **auditok** to run them each time it detects
an audio activity.
Try the following command and say *open firefox*:

    rec -q -t raw -r 16000 -c 1 -b 16 -e signed - | auditok -i - -M 5 -m 3 -n 1 --debug-file file.log -e 60 -C "sox -t raw -r 16000 -c 1 -b 16 -e signed $ audio.flac ; speech-rec.sh -i audio.flac -r 16000 | grep transcript | ./voice-control.sh"

Here we used option `-M 5` to limit the amount of audio data read to 5 seconds (**auditok** stops when there is no more data) and
option `-n 1` to tell **auditok** to only accept tokens of 1 second or more and discard any shorter token.

With `--debug-file file.log`, all processing steps are written to file.log with their timestamps, including every command that is run and the file name it was given.


License
-------
`auditok` is published under the GNU General Public License Version 3.

Author
------
Amine Sehili