[![Build Status](https://travis-ci.org/amsehili/auditok.svg?branch=master)](https://travis-ci.org/amsehili/auditok)
[![Documentation Status](https://readthedocs.org/projects/auditok/badge/?version=latest)](http://auditok.readthedocs.org/en/latest/?badge=latest)

AUDIo TOKenizer
===============

`auditok` is an **Audio Activity Detection** tool that can process online data (read from an audio device or from standard input) as well as audio files. It can be used as a command line program and offers an easy-to-use API.

A more detailed version of this user guide, as well as an API tutorial and API reference, can be found at [Readthedocs](http://auditok.readthedocs.org/en/latest/).

- [Two-figure explanation](https://github.com/amsehili/auditok#two-figure-explanation)
- [Requirements](https://github.com/amsehili/auditok#requirements)
- [Installation](https://github.com/amsehili/auditok#installation)
- [Command line usage](https://github.com/amsehili/auditok#command-line-usage)
    - [Try the detector with your voice](https://github.com/amsehili/auditok#try-the-detector-with-your-voice)
    - [Play back detections](https://github.com/amsehili/auditok#play-back-detections)
    - [Set detection threshold](https://github.com/amsehili/auditok#set-detection-threshold)
    - [Set format for printed detections information](https://github.com/amsehili/auditok#set-format-for-printed-detections-information)
    - [Plot signal and detections](https://github.com/amsehili/auditok#plot-signal-and-detections)
    - [Save plot as image or PDF](https://github.com/amsehili/auditok#save-plot-as-image-or-pdf)
    - [Read data from file](https://github.com/amsehili/auditok#read-data-from-file)
    - [Limit the length of acquired/read data](https://github.com/amsehili/auditok#limit-the-length-of-aquired-data)
    - [Save the whole acquired audio signal](https://github.com/amsehili/auditok#save-the-whole-acquired-audio-signal)
    - [Save each detection into a separate audio file](https://github.com/amsehili/auditok#save-each-detection-into-a-separate-audio-file)
- [Setting detection parameters](https://github.com/amsehili/auditok#setting-detection-parameters)
- [Some practical use cases](https://github.com/amsehili/auditok#some-practical-use-cases)
    - [1st practical use case: generate a subtitles template](https://github.com/amsehili/auditok#1st-practical-use-case-generate-a-subtitles-template)
    - [2nd Practical use case example: build a (very) basic voice control application](https://github.com/amsehili/auditok#2nd-practical-use-case-example-build-a-very-basic-voice-control-application)
- [License](https://github.com/amsehili/auditok#license)
- [Author](https://github.com/amsehili/auditok#author)

Two-figure explanation
----------------------
The following two figures illustrate an audio signal (blue) and regions detected as valid audio activities (green rectangles) according to a given threshold (red dashed line). They respectively depict the detection result when:

1. the detector tolerates phases of silence of up to 0.3 second (300 ms) within an audio activity (also referred to as an acoustic event):
![](https://raw.githubusercontent.com/amsehili/auditok/master/doc/figures/figure_1.png)

2. the detector splits an audio activity event into many activities if the within-activity silence is over 0.2 second:
![](https://raw.githubusercontent.com/amsehili/auditok/master/doc/figures/figure_2.png)

Beyond plotting the signal and detections, you can play back audio activities as they are detected, save them, or run a user command each time there is an activity,
optionally using the file name of the audio activity as an argument for the command.

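The difference between the two figures can be sketched in a few lines of Python. This is only an illustration of the idea, not `auditok`'s actual implementation: the function name `split_activities` and its parameters are hypothetical, each frame is assumed to have already been classified as above or below the threshold, and trailing silence is simply dropped. With the same input, tolerating 0.3 second of silence yields one activity, while tolerating only 0.2 second yields two.

```python
def split_activities(frames, frame_dur=0.01, min_dur=0.2, max_silence=0.3):
    """Group per-frame decisions (True = above threshold) into activities.

    Returns a list of (start, end) times in seconds. Trailing silence is
    dropped from each activity to keep the sketch short.
    """
    max_gap = round(max_silence / frame_dur)  # silent frames tolerated inside an activity
    min_len = round(min_dur / frame_dur)      # shortest activity that is kept, in frames
    regions = []
    start = None        # first active frame of the activity being built
    last_active = None  # most recent active frame seen

    def close():
        # Keep the activity only if it is long enough.
        if last_active - start + 1 >= min_len:
            regions.append((round(start * frame_dur, 3),
                            round((last_active + 1) * frame_dur, 3)))

    for i, active in enumerate(frames):
        if active:
            if start is None:
                start = i
            last_active = i
        elif start is not None and i - last_active > max_gap:
            close()  # silence lasted longer than tolerated: end the activity
            start = None
    if start is not None:
        close()
    return regions


# 0.5 s of activity, 0.25 s of silence, 0.5 s of activity (10 ms frames).
frames = [True] * 50 + [False] * 25 + [True] * 50
print(split_activities(frames, max_silence=0.3))  # one activity
print(split_activities(frames, max_silence=0.2))  # two activities
```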
Requirements
------------
`auditok` can be used with standard Python!

However, if you want more features, the following packages are needed:
- [pydub](https://github.com/jiaaro/pydub): read audio files in popular audio formats (ogg, mp3, etc.) or extract audio from a video file
- [PyAudio](http://people.csail.mit.edu/hubert/pyaudio/): read audio data from the microphone and play back detections
- [matplotlib](http://matplotlib.org/): plot the audio signal and detections (see figures above)
- [numpy](http://www.numpy.org): required by matplotlib. Also used for math operations instead of standard Python, if available
- Optionally, you can use `sox` or `parecord` for data acquisition and feed `auditok` using a pipe.


Installation
------------

    git clone https://github.com/amsehili/auditok.git
    cd auditok
    python setup.py install

Command line usage
------------------

### Try the detector with your voice

The first thing you want to check is perhaps how `auditok` detects your voice. If you have installed `PyAudio`, just run (`Ctrl-C` to stop):

    auditok

This will print `id`, `start-time` and `end-time` for each detected activity. If you don't have `PyAudio`, you can use `sox` for data acquisition (`sudo apt-get install sox`) and tell `auditok` to read data from standard input:

    rec -q -t raw -r 16000 -c 1 -b 16 -e signed - | auditok -i - -r 16000 -w 2 -c 1

Note that when data is read from standard input, the same audio parameters must be used for both `sox` (or any other data generation/acquisition tool) and `auditok`. The following table summarizes audio parameters.

| Audio parameter | sox option | `auditok` option | `auditok` default     |
| --------------- |------------|------------------|-----------------------|
| Sampling rate   |     -r     |       -r         |      16000            |
| Sample width    |  -b (bits) |     -w (bytes)   |      2                |
| Channels        |  -c        |     -c           |      1                |
| Encoding        |  -e        |     None         | always signed integer |

According to this table, the `auditok` defaults already match the `sox` parameters used here, so the previous command can be shortened to:

    rec -q -t raw -r 16000 -c 1 -b 16 -e signed - | auditok -i -

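Because `sox` expects the sample width in bits while `auditok` expects it in bytes, it is easy to pass mismatched values. The following Python sketch derives both option strings from a single set of parameters; the helper names `audio_options` and `bytes_per_second` are hypothetical and only the options listed in the table above are used.

```python
def audio_options(sampling_rate=16000, sample_width_bytes=2, channels=1):
    """Return matching option strings for `rec` (sox) and `auditok`."""
    bits = sample_width_bytes * 8  # sox -b takes bits, auditok -w takes bytes
    sox_opts = f"-t raw -r {sampling_rate} -c {channels} -b {bits} -e signed"
    auditok_opts = f"-r {sampling_rate} -w {sample_width_bytes} -c {channels}"
    return sox_opts, auditok_opts


def bytes_per_second(sampling_rate=16000, sample_width_bytes=2, channels=1):
    """Amount of raw audio data that flows through the pipe each second."""
    return sampling_rate * sample_width_bytes * channels


sox_opts, auditok_opts = audio_options()
print(f"rec -q {sox_opts} - | auditok -i - {auditok_opts}")
print(bytes_per_second(), "bytes of raw audio per second")
```

With the default values, the printed command is the same as the first pipe example of this section.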
### Play back detections

    auditok -E

**or**

    rec -q -t raw -r 16000 -c 1 -b 16 -e signed - | auditok -i - -E

Option `-E` stands for echo, so `auditok` plays back whatever it detects. Using `-E` requires `PyAudio`. If you don't have `PyAudio` and want to play detections with sox, use the `-C` option:

    rec -q -t raw -r 16000 -c 1 -b 16 -e signed - | auditok -i - -C "play -q -t raw -r 16000 -c 1 -b 16 -e signed $"

The `-C` option tells `auditok` to interpret its content as a command that should be run whenever `auditok` detects an audio activity, replacing the `$` with the name of a temporary file into which the activity is saved as raw audio. Here we use `play` to play the activity, giving the necessary `play` arguments for raw data.

`rec` and `play` are just aliases for `sox`.

The `-C` option can be useful in many cases. Imagine a command that sends audio data over a network only if there is an audio activity and saves bandwidth during silence.

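The `$` substitution described above can be pictured with the following Python sketch. It is not taken from `auditok`'s source: the function names are hypothetical, and deleting the temporary file once the command returns is an assumption made for the example.

```python
import os
import shlex
import subprocess
import tempfile


def build_command(template, filename):
    """Replace every `$` in the template with the (shell-quoted) file name."""
    return template.replace("$", shlex.quote(filename))


def run_on_detection(template, raw_audio):
    """Save raw audio bytes to a temporary file and run the command on it."""
    with tempfile.NamedTemporaryFile(suffix=".raw", delete=False) as tmp:
        tmp.write(raw_audio)
    try:
        # shell=True so that templates containing pipes or `;` work.
        command = build_command(template, tmp.name)
        return subprocess.run(command, shell=True).returncode
    finally:
        # Assumption: the file is only needed while the command runs.
        os.remove(tmp.name)


print(build_command("play -q -t raw -r 16000 -c 1 -b 16 -e signed $",
                    "/tmp/activity.raw"))
```

A function like `run_on_detection` would be called once per detected activity, with the bytes of that activity as `raw_audio`.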
### Set detection threshold

If you notice that there are too many detections, use a higher value for the energy threshold. (The current version only implements a `validator` based on an energy threshold. The use of spectral information is also desirable and might be part of future releases.) To change the energy threshold (default: 50), use option `-e`:

    auditok -E -e 55

**or**

    rec -q -t raw -r 16000 -c 1 -b 16 -e signed - | auditok -i - -e 55 -C "play -q -t raw -r 16000 -c 1 -b 16 -e signed $"

If, however, you find that the detector is missing some or all of your audio activities, use a lower value for `-e`.

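To get a feeling for what a threshold of 50 or 55 means, here is a minimal Python sketch of an energy-based validator. It assumes that the energy of an analysis window is expressed as `10 * log10` of the mean squared sample value; this README does not state the exact formula, so treat the numbers as indicative only. The function names are hypothetical.

```python
import math
import struct


def log_energy(frame_bytes):
    """Log energy of a window of 16-bit signed little-endian mono samples."""
    n = len(frame_bytes) // 2
    if n == 0:
        return float("-inf")
    samples = struct.unpack(f"<{n}h", frame_bytes[:2 * n])
    mean_square = sum(s * s for s in samples) / n
    # Assumed formula: 10 * log10(mean of squared samples).
    return 10 * math.log10(mean_square) if mean_square > 0 else float("-inf")


def is_valid(frame_bytes, threshold=50):
    """A window counts as audio activity if its log energy reaches the threshold."""
    return log_energy(frame_bytes) >= threshold


# Two 10 ms windows at 16 kHz (160 samples) with constant amplitude.
loud = struct.pack("<160h", *([1000] * 160))
quiet = struct.pack("<160h", *([100] * 160))
print(log_energy(loud), is_valid(loud, 55))
print(log_energy(quiet), is_valid(quiet, 55))
```

Under this assumption, raising the threshold by 20 corresponds to requiring samples that are ten times larger in amplitude.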
### Set format for printed detections information

By default, `auditok` prints the `id`, `start-time` and `end-time` of each detected activity:

    1 1.87 2.67
    2 3.05 3.73
    3 3.97 4.49
    ...

If you want to customize the output format, use the `--printf` option:

    auditok -e 55 --printf "[{id}]: {start} to {end}"

Output:

    [1]: 0.22 to 0.67
    [2]: 2.81 to 4.18
    [3]: 5.53 to 6.44
    [4]: 7.32 to 7.82
    ...

Keywords `{id}`, `{start}` and `{end}` can be placed and repeated anywhere in the text. Time is shown in seconds. If you want more detailed time information, use `--time-format`:

    auditok -e 55 --printf "[{id}]: {start} to {end}" --time-format "%h:%m:%s.%i"

Output:

    [1]: 00:00:01.080 to 00:00:01.760
    [2]: 00:00:02.420 to 00:00:03.440
    [3]: 00:00:04.930 to 00:00:05.570
    [4]: 00:00:05.690 to 00:00:06.020
    [5]: 00:00:07.470 to 00:00:07.980
    ...

Valid time directives are: `%h` (hours), `%m` (minutes), `%s` (seconds) and `%i` (milliseconds). Two other directives, `%S` (default) and `%I`, can be used for absolute time in seconds and milliseconds respectively.

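The way these directives combine with the `--printf` keywords can be mimicked in Python as follows. This is a sketch, not `auditok`'s code: `format_time` and `format_detection` are hypothetical names, and the two-decimal rendering of `%S` is an assumption chosen to resemble the default output shown above.

```python
def format_time(seconds, fmt="%S"):
    """Render a time in seconds using the directives %h %m %s %i %S %I."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    replacements = {
        "%h": f"{hours:02d}",
        "%m": f"{minutes:02d}",
        "%s": f"{secs:02d}",
        "%i": f"{millis:03d}",
        "%S": f"{total_ms / 1000:.2f}",  # absolute seconds (assumed 2 decimals)
        "%I": str(total_ms),             # absolute milliseconds
    }
    for directive, value in replacements.items():
        fmt = fmt.replace(directive, value)
    return fmt


def format_detection(template, det_id, start, end, time_fmt="%S"):
    """Fill the {id}, {start} and {end} keywords of a --printf style template."""
    return template.format(id=det_id,
                           start=format_time(start, time_fmt),
                           end=format_time(end, time_fmt))


print(format_detection("[{id}]: {start} to {end}", 1, 1.08, 1.76, "%h:%m:%s.%i"))
print(format_detection("{id} {start} {end}", 1, 1.87, 2.67))
```

The first call reproduces the first line of the `--time-format` output above; the second one reproduces the first line of the default output.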
### Plot signal and detections

Use option `-p`. Requires `matplotlib` and `numpy`.

    auditok ...  -p

### Save plot as image or PDF

    auditok ...  --save-image output.png

Requires `matplotlib` and `numpy`. Accepted formats: eps, jpeg, jpg, pdf, pgf, png, ps, raw, rgba, svg, svgz, tif, tiff.

### Read data from file

    auditok -i input.wav ...

Install `pydub` for other audio formats.

### Limit the length of aquired data

    auditok -M 12 ...

Time is in seconds.

### Save the whole acquired audio signal

    auditok -O output.wav ...

Install `pydub` for other audio formats.


### Save each detection into a separate audio file

    auditok -o det_{N}_{start}_{end}.wav ...

You can use free text and place `{N}`, `{start}` and `{end}` wherever you want; they will be replaced by the detection number, `start-time` and `end-time` respectively. Another example:

    auditok -o {start}-{end}.wav ...

Install `pydub` for more audio formats.


Setting detection parameters
----------------------------

Alongside the threshold option `-e` seen so far, a few other options can have a great impact on the detector's behavior. These options are summarized in the following table:


| Option | Description                                                                      | Unit    | Default          |
| -------|----------------------------------------------------------------------------------|---------|------------------|
| `-n`   | Minimum length an accepted audio activity should have                            | second  |   0.2 (200 ms)   |
| `-m`   | Maximum length an accepted audio activity should reach                           | second  |   5              |
| `-s`   | Maximum length of a continuous silence period within an accepted audio activity  | second  |   0.3 (300 ms)   |
| `-d`   | Drop trailing silence from an accepted audio activity                            | boolean |   False          |
| `-a`   | Analysis window length (default value should be good)                            | second  |   0.01 (10 ms)   |

Some practical use cases
------------------------

### 1st practical use case: generate a subtitles template

Using `--printf` and `--time-format`, the following command, used with an input audio or video file, will generate an **srt** file template that can later be edited with a subtitles editor, which reduces the time needed to define where each utterance starts and ends:

    auditok -e 55 -i input.wav -m 10 --printf "{id}\n{start} --> {end}\nPut some text here...\n" --time-format "%h:%m:%s.%i"

Output:

    1
    00:00:00.730 --> 00:00:01.460
    Put some text here...

    2
    00:00:02.440 --> 00:00:03.900
    Put some text here...

    3
    00:00:06.410 --> 00:00:06.970
    Put some text here...

    4
    00:00:07.260 --> 00:00:08.340
    Put some text here...

    5
    00:00:09.510 --> 00:00:09.820
    Put some text here...

### 2nd Practical use case example: build a (very) basic voice control application

[This repository](https://github.com/amsehili/gspeech-rec) supplies a bash script that can send audio data to Google's
Speech Recognition service and get its transcription. In the following we will use **auditok** as a lower layer component
of a voice control application. The basic idea is to tell **auditok** to run, for each detected audio activity, a certain
number of commands that make up the rest of our voice control application.

Assume you have installed **sox** and downloaded the Speech Recognition script. The sequence of commands to run is:

1- Convert raw audio data to flac using **sox**:

    sox -t raw -r 16000 -c 1 -b 16 -e signed raw_input output.flac

2- Send flac audio data to Google and get its filtered transcription using [speech-rec.sh](https://github.com/amsehili/gspeech-rec/blob/master/speech-rec.sh):

    speech-rec.sh -i output.flac -r 16000

3- Use **grep** to select lines that contain *transcript*:

    grep transcript


4- Launch the following script, giving it the transcription as input:

    #!/bin/bash

    read line

    RES=`echo "$line" | grep -i "open firefox"`

    if [[ $RES ]]
      then
        echo "Launch command: 'firefox &' ... "
        firefox &
        exit 0
    fi

    exit 0

As you can see, the script can handle one single voice command. It runs firefox if the text it receives contains **open firefox**.
Save the script into a file named voice-control.sh (don't forget to run **chmod u+x voice-control.sh**).

Now, thanks to option `-C`, we will use the four instructions with a pipe and tell **auditok** to run them each time it detects
an audio activity. Try the following command and say *open firefox*:

    rec -q -t raw -r 16000 -c 1 -b 16 -e signed - | auditok -i - -M 5 -m 3 -n 1 --debug-file file.log -e 60 -C "sox -t raw -r 16000 -c 1 -b 16 -e signed $ audio.flac ; speech-rec.sh -i audio.flac -r 16000 | grep transcript | ./voice-control.sh"

Here we used option `-M 5` to limit the amount of read audio data to 5 seconds (**auditok** stops if there is no more data) and
option `-n 1` to tell **auditok** to only accept tokens of 1 second or more and discard any token shorter than 1 second.

With `--debug-file file.log`, all processing steps are written into file.log with their timestamps, including any run command and the file name the command was given.


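The bash script of this use case only knows one command. As a possible next step, the same idea can be written in Python with a table of phrases, so that adding a voice command means adding one entry. This is a sketch under assumptions: the phrase table, the `xterm` entry and the function names are hypothetical, and the transcription is assumed to arrive as a single line on standard input, as in the pipe shown in this section. To use it in place of voice-control.sh, save it to a file and add a call to `main()` at the end.

```python
import subprocess
import sys

# Hypothetical table: spoken phrase -> program to launch.
COMMANDS = {
    "open firefox": ["firefox"],
    "open terminal": ["xterm"],
}


def match_command(transcript, commands=COMMANDS):
    """Return the argv of the first phrase found in the transcript, or None."""
    text = transcript.lower()  # case-insensitive, like `grep -i`
    for phrase, argv in commands.items():
        if phrase in text:
            return argv
    return None


def main():
    """Read one transcription line from standard input and act on it."""
    line = sys.stdin.readline()
    argv = match_command(line)
    if argv is not None:
        print("Launch command:", " ".join(argv))
        subprocess.Popen(argv)  # start the program without waiting for it
```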
License
-------
`auditok` is published under the GNU General Public License Version 3.

Author
------
Amine Sehili (<amine.sehili@gmail.com>)
