.. auditok documentation.

auditok, an AUDIo TOKenization module
=====================================


**auditok** is a module that can be used as a generic tool for data
tokenization. Although its core motivation is **Acoustic Activity
Detection** (AAD) and extraction from audio streams (i.e. detect
where a noise/an acoustic activity occurs within an audio stream and
extract the corresponding portion of signal), it can easily be
adapted to other tasks.

Globally speaking, it can be used to extract, from a sequence of
observations, all sub-sequences that meet a certain number of
criteria in terms of:

1. Minimum length of a **valid** token (i.e. sub-sequence)
2. Maximum length of a valid token
3. Maximum tolerated consecutive **non-valid** observations within
   a valid token

Examples of a non-valid observation are: a non-numeric ASCII symbol
if you are interested in sub-sequences of numeric symbols, or a silent
audio window (of 10, 20 or 100 milliseconds for instance) if what
interests you are audio regions made up of a sequence of "noisy"
windows (whatever kind of noise: speech, baby cry, laughter, etc.).

The most important component of `auditok` is the `StreamTokenizer` class.
An instance of this class encapsulates a `DataValidator` and can be
configured to detect the desired regions from a stream.
The `auditok.core.StreamTokenizer.tokenize` method accepts a `DataSource`
object that has a `read` method. Read data can be of any type accepted
by the `validator`.

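To give a concrete feel of this contract, here is a minimal sketch of a
custom `DataSource`. The `ListDataSource` class below is a hypothetical
example, not part of the library; it assumes that `DataSource` can be
imported from the package top level like the other classes used in this
document, and that `read` returns the next observation or `None` when
there is no more data, as the examples below suggest.

.. code:: python

    from auditok import DataSource

    class ListDataSource(DataSource):
        """Hypothetical DataSource that yields items from a Python list."""

        def __init__(self, data):
            self._data = list(data)
            self._position = 0

        def read(self):
            # Return the next observation, or None when the stream is exhausted
            if self._position >= len(self._data):
                return None
            item = self._data[self._position]
            self._position += 1
            return item
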
As the main aim of this module is **Audio Activity Detection**,
it provides the `auditok.util.ADSFactory` factory class that makes
it very easy to create an `AudioDataSource` (a class that implements
`DataSource`) object, whether from:

- A file on the disk
- A buffer of data
- The built-in microphone (requires PyAudio)

The `AudioDataSource` class inherits from `DataSource` and supplies
a higher abstraction level than `AudioSource` thanks to a bunch of
handy features (a sketch combining them follows this list):

- Define a fixed-length block_size (i.e. analysis window)
- Allow overlap between two consecutive analysis windows (hop_size < block_size).
  This can be very important if your validator uses the **spectral**
  information of audio data instead of raw audio samples.
- Limit the amount (i.e. duration) of read data (very useful when reading
  data from the microphone)
- Record and rewind data (also useful if you read data from the microphone
  and you want to process it many times offline and/or save it)

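As a sketch of how these features can be combined (the keyword names below
follow the feature list above and the examples later in this document; treat
this call as an assumption to check against the `ADSFactory.ads` documentation):

.. code:: python

    from auditok import ADSFactory

    # 20 ms analysis windows (at 16KHz) with 50% overlap (hop_size < block_size),
    # read at most 10 seconds and keep the data so the source can be rewound
    ads = ADSFactory.ads(block_size=320, hop_size=160, max_time=10, record=True)
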

Last but not least, the current version has only one audio window
validator, which is based on signal energy.

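Based on its usage later in this document, creating such a validator looks
as follows (the `sample_width` value is normally obtained from the audio
source; 2 bytes, i.e. 16-bit samples, is assumed here for illustration):

.. code:: python

    from auditok import AudioEnergyValidator

    # Accept an analysis window if its energy is >= 55
    validator = AudioEnergyValidator(sample_width=2, energy_threshold=55)
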
Requirements
============

`auditok` requires `PyAudio <http://people.csail.mit.edu/hubert/pyaudio/>`_
for audio acquisition and playback.


Illustrative examples with strings
==================================

Let us look at some examples using the `auditok.util.StringDataSource` class
created for test and illustration purposes. Imagine that each character of
`auditok.util.StringDataSource` data represents an audio slice of 100 ms for
example. In the following examples we will use upper case letters to represent
noisy audio slices (i.e. analysis windows or frames) and lower case letters for
silent frames.

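For instance, assuming (as the outputs below suggest) that `read` returns one
character per call and `None` at the end of the string:

.. code:: python

    from auditok import StringDataSource

    dsource = StringDataSource("aAb")
    print(dsource.read())  # 'a'
    print(dsource.read())  # 'A'
    print(dsource.read())  # 'b'
    print(dsource.read())  # None
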

Extract sub-sequences of consecutive upper case letters
-------------------------------------------------------

We want to extract sub-sequences of characters that have:

- A minimum length of 1 (`min_length` = 1)
- A maximum length of 9999 (`max_length` = 9999)
- Zero consecutive lower case characters within them (`max_continuous_silence` = 0)

We also create the `UpperCaseChecker` whose `is_valid` method returns `True` if the
checked character is in upper case and `False` otherwise.

.. code:: python

    from auditok import StreamTokenizer, StringDataSource, DataValidator

    class UpperCaseChecker(DataValidator):
        def is_valid(self, frame):
            return frame.isupper()

    dsource = StringDataSource("aaaABCDEFbbGHIJKccc")
    tokenizer = StreamTokenizer(validator=UpperCaseChecker(),
                                min_length=1, max_length=9999, max_continuous_silence=0)

    tokenizer.tokenize(dsource)

The output is a list of two tuples, each of which contains the extracted
sub-sequence and its start and end positions in the original sequence
respectively:

.. code:: python

    [(['A', 'B', 'C', 'D', 'E', 'F'], 3, 8), (['G', 'H', 'I', 'J', 'K'], 11, 15)]

Tolerate up to two non-valid (lower case) letters within an extracted sequence
--------------------------------------------------------------------------------

To do so, we set `max_continuous_silence` = 2:

.. code:: python

    from auditok import StreamTokenizer, StringDataSource, DataValidator

    class UpperCaseChecker(DataValidator):
        def is_valid(self, frame):
            return frame.isupper()

    dsource = StringDataSource("aaaABCDbbEFcGHIdddJKee")
    tokenizer = StreamTokenizer(validator=UpperCaseChecker(),
                                min_length=1, max_length=9999, max_continuous_silence=2)

    tokenizer.tokenize(dsource)

output:

.. code:: python

    [(['A', 'B', 'C', 'D', 'b', 'b', 'E', 'F', 'c', 'G', 'H', 'I', 'd', 'd'], 3, 16), (['J', 'K', 'e', 'e'], 18, 21)]

Notice the trailing lower case letters "dd" and "ee" at the end of the two
tokens. The default behavior of `StreamTokenizer` is to keep the *trailing
silence* if it doesn't exceed `max_continuous_silence`. This can be changed
using the `DROP_TRAILING_SILENCE` mode (see next example).

Remove trailing silence
-----------------------

Trailing silence can be useful for many sound recognition applications, including
speech recognition. Moreover, from the human auditory system point of view, trailing
low energy signal helps avoid abrupt signal cuts.

If you want to remove it anyway, you can do it by setting `mode` to `StreamTokenizer.DROP_TRAILING_SILENCE`:

.. code:: python

    from auditok import StreamTokenizer, StringDataSource, DataValidator

    class UpperCaseChecker(DataValidator):
        def is_valid(self, frame):
            return frame.isupper()

    dsource = StringDataSource("aaaABCDbbEFcGHIdddJKee")
    tokenizer = StreamTokenizer(validator=UpperCaseChecker(),
                                min_length=1, max_length=9999, max_continuous_silence=2,
                                mode=StreamTokenizer.DROP_TRAILING_SILENCE)

    tokenizer.tokenize(dsource)

output:

.. code:: python

    [(['A', 'B', 'C', 'D', 'b', 'b', 'E', 'F', 'c', 'G', 'H', 'I'], 3, 14), (['J', 'K'], 18, 19)]

Limit the length of detected tokens
-----------------------------------

Imagine that you just want to detect and recognize a small part of a long
acoustic event (e.g. engine noise, water flow, etc.) and avoid having that
event hog the tokenizer, preventing it from feeding the event to the next
processing step (i.e. a sound recognizer). You can do this by:

- limiting the length of a detected token.

and

- using a callback function as an argument to `StreamTokenizer.tokenize`
  so that the tokenizer delivers a token as soon as it is detected.

The following code limits the length of a token to 5:

.. code:: python

    from auditok import StreamTokenizer, StringDataSource, DataValidator

    class UpperCaseChecker(DataValidator):
        def is_valid(self, frame):
            return frame.isupper()

    dsource = StringDataSource("aaaABCDEFGHIJKbbb")
    tokenizer = StreamTokenizer(validator=UpperCaseChecker(),
                                min_length=1, max_length=5, max_continuous_silence=0)

    def print_token(data, start, end):
        print("token = '{0}', starts at {1}, ends at {2}".format(''.join(data), start, end))

    tokenizer.tokenize(dsource, callback=print_token)

output:

.. code:: python

    token = 'ABCDE', starts at 3, ends at 7
    token = 'FGHIJ', starts at 8, ends at 12
    token = 'K', starts at 13, ends at 13

Using real audio data
=====================

In this section we will use `ADSFactory`, `AudioEnergyValidator` and `StreamTokenizer`
for an AAD demonstration using audio data. Before we get any further, it is worth
explaining a certain number of points.

The `ADSFactory.ads` method is called to create an `AudioDataSource` object that can be
passed to `StreamTokenizer.tokenize`. `ADSFactory.ads` accepts a number of keyword
arguments, none of which is mandatory. The returned `AudioDataSource` object can
however greatly differ depending on the passed arguments. Further details can be found
in the respective method documentation. Note however the following two calls, which
create an `AudioDataSource` that reads data from an audio file and from the built-in
microphone respectively.

.. code:: python

    from auditok import ADSFactory

    # Get an AudioDataSource from a file
    file_ads = ADSFactory.ads(filename = "path/to/file/")

    # Get an AudioDataSource from the built-in microphone
    # The returned object has the default values for sampling
    # rate, sample width and number of channels. See the method's
    # documentation for customized values
    mic_ads = ADSFactory.ads()

For `StreamTokenizer`, the parameters `min_length`, `max_length` and `max_continuous_silence`
are expressed in terms of number of frames. If you want a `max_length` of *2 seconds* for
your detected sound events and your *analysis window* is *10 ms* long, you have to specify
a `max_length` of 200 (`int(2. / (10. / 1000)) == 200`). For a `max_continuous_silence` of *300 ms*
for instance, the value to pass to `StreamTokenizer` is 30 (`int(0.3 / (10. / 1000)) == 30`).

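As a quick sanity check, here is the same frame-count arithmetic written out
for a hypothetical 10 ms analysis window:

.. code:: python

    analysis_window_ms = 10.  # assumed analysis window duration

    # a maximum event duration of 2 seconds -> 200 analysis windows
    max_length = int(2000. / analysis_window_ms)

    # 300 ms of tolerated continuous silence -> 30 analysis windows
    max_continuous_silence = int(300. / analysis_window_ms)
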

Where do you get the size of the **analysis window** from?

Well, this is a parameter you pass to `ADSFactory.ads`. By default `ADSFactory.ads` uses
an analysis window of 10 ms. The number of samples that 10 ms of signal contains will
vary depending on the sampling rate of your audio source (file, microphone, etc.).
For a sampling rate of 16KHz (16000 samples per second), we have 160 samples for 10 ms.
Therefore you can use block sizes of 160, 320 and 1600 for analysis windows of 10, 20 and 100
ms respectively.

.. code:: python

    from auditok import ADSFactory

    file_ads = ADSFactory.ads(filename = "path/to/file/", block_size = 160)

    file_ads = ADSFactory.ads(filename = "path/to/file/", block_size = 320)

    # If no sampling rate is specified, ADSFactory uses 16KHz as the default
    # rate for the microphone. If you want to use a window of 100 ms, use
    # a block size of 1600
    mic_ads = ADSFactory.ads(block_size = 1600)

So if you're not sure what your analysis window in seconds is, use the following:

.. code:: python

    my_ads = ADSFactory.ads(...)
    analysis_win_seconds = float(my_ads.get_block_size()) / my_ads.get_sampling_rate()
    analysis_window_ms = analysis_win_seconds * 1000

    # For a `max_continuous_silence` of 300 ms use:
    max_continuous_silence = int(300. / analysis_window_ms)

    # Which is the same as:
    max_continuous_silence = int(0.3 / (analysis_window_ms / 1000))


Examples
--------

Extract isolated phrases from an utterance
------------------------------------------

We will build an `AudioDataSource` using a wave file from the `dataset` module.
The file contains isolated pronunciations of digits from 1 to 6
in Arabic, as well as a breath-in/out between 2 and 3. The code will play the
original file then the detected sounds separately. Note that we use an
`energy_threshold` of 65; this parameter should be carefully chosen. It depends
on microphone quality, background noise and the amplitude of events you want to
detect.

.. code:: python

    from auditok import ADSFactory, AudioEnergyValidator, StreamTokenizer, player_for, dataset

    # We set the `record` argument to True so that we can rewind the source
    asource = ADSFactory.ads(filename=dataset.one_to_six_arabic_16000_mono_bc_noise, record=True)

    validator = AudioEnergyValidator(sample_width=asource.get_sample_width(), energy_threshold=65)

    # Default analysis window is 10 ms (float(asource.get_block_size()) / asource.get_sampling_rate())
    # min_length=20 : minimum length of a valid audio activity is 20 * 10 == 200 ms
    # max_length=400 : maximum length of a valid audio activity is 400 * 10 == 4000 ms == 4 seconds
    # max_continuous_silence=30 : maximum length of a tolerated silence within a valid audio activity is 30 * 10 == 300 ms
    tokenizer = StreamTokenizer(validator=validator, min_length=20, max_length=400, max_continuous_silence=30)

    asource.open()
    tokens = tokenizer.tokenize(asource)

    # Play detected regions back

    player = player_for(asource)

    # Rewind and read the whole signal
    asource.rewind()
    original_signal = []

    while True:
        w = asource.read()
        if w is None:
            break
        original_signal.append(w)

    original_signal = ''.join(original_signal)

    print("Playing the original file...")
    player.play(original_signal)

    print("Playing detected regions...")
    for t in tokens:
        print("Token starts at {0} and ends at {1}".format(t[1], t[2]))
        data = ''.join(t[0])
        player.play(data)

    assert len(tokens) == 8


The tokenizer extracts 8 audio regions from the signal, including all isolated digits
(from 1 to 6) as well as the 2-phase respiration of the subject. You might have noticed
that, in the original file, the last three digits are closer to each other than the
previous ones. If you want them to be extracted as one single phrase, you can do so
by tolerating a larger continuous silence within a detection:

.. code:: python

    tokenizer.max_continuous_silence = 50
    asource.rewind()
    tokens = tokenizer.tokenize(asource)

    for t in tokens:
        print("Token starts at {0} and ends at {1}".format(t[1], t[2]))
        data = ''.join(t[0])
        player.play(data)

    assert len(tokens) == 6


Trim leading and trailing silence
---------------------------------

The tokenizer in the following example is set up to remove the silence
that precedes the first acoustic activity or follows the last activity
in a record. It preserves whatever it finds between the two activities.
In other words, it removes the leading and trailing silence.

The sampling rate is 44100 samples per second, and we'll use an analysis window
of 100 ms (i.e. block_size == 4410).

The energy threshold is 50.

The tokenizer will start accumulating windows as soon as it encounters
the first analysis window with an energy >= 50. ALL the following windows will be
kept regardless of their energy. At the end of the analysis, it will drop trailing
windows with an energy below 50.

This is an interesting example because the audio file we're analyzing contains a very
brief noise that occurs within the leading silence. We certainly do not want our tokenizer
to trigger at this point and consider whatever comes after it as a useful signal.
To force the tokenizer to ignore that brief event we use two other parameters, `init_min`
and `init_max_silence`. By setting `init_min` = 3 and `init_max_silence` = 1 we tell the tokenizer
that a valid event must start with at least 3 noisy windows, between which there
is at most 1 silent window.

Even with this configuration, the tokenizer can still detect that noise as a valid event
(if it actually contains 3 consecutive noisy frames). To circumvent this we use a large
enough analysis window (here of 100 ms) to ensure that the brief noise is surrounded by a
much longer silence, so that the energy of the overall analysis window stays below 50.

When using a shorter analysis window (of 10 ms for instance, block_size == 441), the brief
noise contributes more to the energy calculation, which yields an energy of over 50 for the window.
Again, we can deal with this situation by using a higher energy threshold (55 for example).

.. code:: python

    from auditok import ADSFactory, AudioEnergyValidator, StreamTokenizer, player_for, dataset
    import pyaudio

    # record = True so that we'll be able to rewind the source.
    asource = ADSFactory.ads(filename=dataset.was_der_mensch_saet_mono_44100_lead_trail_silence,
                             record=True, block_size=4410)
    asource.open()

    original_signal = []
    # Read the whole signal
    while True:
        w = asource.read()
        if w is None:
            break
        original_signal.append(w)

    original_signal = ''.join(original_signal)

    # Rewind the source
    asource.rewind()

    # Create a validator with an energy threshold of 50
    validator = AudioEnergyValidator(sample_width=asource.get_sample_width(), energy_threshold=50)

    # Create a tokenizer with an unlimited token length and unlimited continuous silence
    # within a token. Note the DROP_TRAILING_SILENCE mode that ensures trailing silence
    # is removed.
    trimmer = StreamTokenizer(validator, min_length=20, max_length=99999999,
                              init_min=3, init_max_silence=1,
                              max_continuous_silence=9999999,
                              mode=StreamTokenizer.DROP_TRAILING_SILENCE)

    tokens = trimmer.tokenize(asource)

    # Make sure we only have one token
    assert len(tokens) == 1, "Should have detected one single token"

    trimmed_signal = ''.join(tokens[0][0])

    player = player_for(asource)

    print("Playing original signal (with leading and trailing silence)...")
    player.play(original_signal)
    print("Playing trimmed signal...")
    player.play(trimmed_signal)


Online audio signal processing
------------------------------

In the next example, audio data is directly acquired from the built-in microphone.
The `tokenize` method is passed a callback function so that audio activities
are delivered as soon as they are detected. Each detected activity is played
back using the built-in audio output device.

As mentioned before, signal energy is strongly related to many factors such as
microphone sensitivity, background noise (including noise inherent to the hardware),
distance and your operating system sound settings. Try a lower `energy_threshold`
if your noise does not seem to be detected and a higher threshold if you notice
an over-detection (the `echo` method prints a detection where you have made no noise).

.. code:: python

    from auditok import ADSFactory, AudioEnergyValidator, StreamTokenizer, player_for
    import pyaudio

    # record = True so that we'll be able to rewind the source.
    # max_time = 10: read 10 seconds from the microphone
    asource = ADSFactory.ads(record=True, max_time=10)

    validator = AudioEnergyValidator(sample_width=asource.get_sample_width(), energy_threshold=50)
    tokenizer = StreamTokenizer(validator=validator, min_length=20, max_length=250, max_continuous_silence=30)

    player = player_for(asource)

    def echo(data, start, end):
        print("Acoustic activity at: {0}--{1}".format(start, end))
        player.play(''.join(data))

    asource.open()

    tokenizer.tokenize(asource, callback=echo)

If you want to re-run the tokenizer after changing one or more parameters, use the following code:

.. code:: python

    asource.rewind()
    # change the energy threshold for example
    tokenizer.validator.set_energy_threshold(55)
    tokenizer.tokenize(asource, callback=echo)

In case you want to play the whole recorded signal back, use:

.. code:: python

    player.play(asource.get_audio_source().get_data_buffer())


Contributing
============

**auditok** is on `GitHub <https://github.com/amsehili/auditok>`_. You're welcome to fork it and contribute.


Amine SEHILI <amine.sehili[_at_]gmail.com>
September 2015

License
=======

This package is published under GNU GPL Version 3.