.. auditok documentation.

auditok, an AUDIo TOKenization module
=====================================


**auditok** is a module that can be used as a generic tool for data
tokenization. Although its core motivation is **Acoustic Activity
Detection** (AAD) and extraction from audio streams (i.e. detecting
where an acoustic activity occurs within an audio stream and
extracting the corresponding portion of the signal), it can easily be
adapted to other tasks.

Broadly speaking, it can be used to extract, from a sequence of
observations, all sub-sequences that meet a certain number of
criteria in terms of:

1. Minimum length of a **valid** token (i.e. sub-sequence)
2. Maximum length of a valid token
3. Maximum number of tolerated consecutive **non-valid** observations
   within a valid token

Examples of a non-valid observation are: a non-numeric ASCII symbol
if you are interested in sub-sequences of numeric symbols, or a silent
audio window (of 10, 20 or 100 milliseconds for instance) if what
interests you are audio regions made up of a sequence of "noisy"
windows (whatever the kind of noise: speech, a baby cry, laughter, etc.).
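
The third criterion is what makes this more than plain grouping: with the tolerated number of non-valid observations set to 0, tokenization reduces to extracting runs of consecutive valid observations. Here is a minimal pure-Python sketch of that special case (an illustration of the concept only, not auditok's implementation):

```python
from itertools import groupby

def simple_tokenize(seq, is_valid, min_length=1):
    """Group runs of consecutive valid observations into tokens.

    Equivalent to tokenization with max_continuous_silence == 0:
    each token is (observations, start_index, end_index).
    """
    tokens = []
    position = 0
    for valid, group in groupby(seq, key=is_valid):
        items = list(group)
        if valid and len(items) >= min_length:
            tokens.append((items, position, position + len(items) - 1))
        position += len(items)
    return tokens

# Sub-sequences of numeric symbols within a character stream
print(simple_tokenize("ab12cd345e", str.isdigit))
# [(['1', '2'], 2, 3), (['3', '4', '5'], 6, 8)]
```

The real `StreamTokenizer` generalizes this by also bounding token length and tolerating short runs of non-valid observations inside a token.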

The most important component of `auditok` is the `StreamTokenizer` class.
An instance of this class encapsulates a `DataValidator` and can be
configured to detect the desired regions from a stream.
The `auditok.core.StreamTokenizer.tokenize` method accepts a `DataSource`
object that has a `read` method. Read data can be of any type accepted
by the `validator`.
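
The `DataSource` contract is therefore very small: an object whose `read` method returns the next observation, or `None` when the stream is exhausted. A minimal stand-in illustrating that contract (a sketch of the interface only, not the library's class):

```python
class ListDataSource:
    """Mimics the DataSource contract: read() returns the next
    observation, or None when there is no more data."""

    def __init__(self, data):
        self._data = list(data)
        self._index = 0

    def read(self):
        if self._index >= len(self._data):
            return None
        item = self._data[self._index]
        self._index += 1
        return item

src = ListDataSource("abc")
print(src.read(), src.read(), src.read(), src.read())
# a b c None
```

Any object shaped like this can be handed to `StreamTokenizer.tokenize`, as long as the validator accepts the items `read` returns.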


As the main aim of this module is **Audio Activity Detection**,
it provides the `auditok.util.ADSFactory` factory class that makes
it very easy to create an `AudioDataSource` (a class that implements
`DataSource`) object, be that from:

- A file on the disk
- A buffer of data
- The built-in microphone (requires PyAudio)


The `AudioDataSource` class inherits from `DataSource` and supplies
a higher abstraction level than `AudioSource` thanks to a bunch of
handy features:

- Define a fixed-length `block_size` (i.e. analysis window)
- Allow overlap between two consecutive analysis windows
  (`hop_size` < `block_size`). This can be very important if your
  validator uses the **spectral** information of audio data instead
  of raw audio samples.
- Limit the amount (i.e. duration) of read data (very useful when reading
  data from the microphone)
- Record and rewind data (also useful if you read data from the microphone
  and you want to process it many times offline and/or save it)


Last but not least, the current version has only one audio window
validator, based on signal energy.

Requirements
============

`auditok` requires `PyAudio <http://people.csail.mit.edu/hubert/pyaudio/>`_
for audio acquisition and playback.


Illustrative examples with strings
==================================

Let us look at some examples using the `auditok.util.StringDataSource` class,
created for test and illustration purposes. Imagine that each character of
`auditok.util.StringDataSource` data represents an audio slice of 100 ms, for
example. In the following examples we will use upper case letters to represent
noisy audio slices (i.e. analysis windows or frames) and lower case letters for
silent frames.


Extract sub-sequences of consecutive upper case letters
-------------------------------------------------------

We want to extract sub-sequences of characters that have:

- A minimum length of 1 (`min_length` = 1)
- A maximum length of 9999 (`max_length` = 9999)
- Zero consecutive lower case characters within them (`max_continuous_silence` = 0)

We also create the `UpperCaseChecker`, whose `is_valid` method returns `True` if the
checked character is upper case and `False` otherwise.

.. code:: python

    from auditok import StreamTokenizer, StringDataSource, DataValidator

    class UpperCaseChecker(DataValidator):
        def is_valid(self, frame):
            return frame.isupper()

    dsource = StringDataSource("aaaABCDEFbbGHIJKccc")
    tokenizer = StreamTokenizer(validator=UpperCaseChecker(),
                                min_length=1, max_length=9999,
                                max_continuous_silence=0)

    tokenizer.tokenize(dsource)

The output is a list of two tuples, each containing the extracted sub-sequence and its
start and end position in the original sequence respectively:

.. code:: python

    [(['A', 'B', 'C', 'D', 'E', 'F'], 3, 8), (['G', 'H', 'I', 'J', 'K'], 11, 15)]

Tolerate up to two non-valid (lower case) letters within an extracted sequence
------------------------------------------------------------------------------

To do so, we set `max_continuous_silence` = 2:

.. code:: python

    from auditok import StreamTokenizer, StringDataSource, DataValidator

    class UpperCaseChecker(DataValidator):
        def is_valid(self, frame):
            return frame.isupper()

    dsource = StringDataSource("aaaABCDbbEFcGHIdddJKee")
    tokenizer = StreamTokenizer(validator=UpperCaseChecker(),
                                min_length=1, max_length=9999,
                                max_continuous_silence=2)

    tokenizer.tokenize(dsource)


output:

.. code:: python

    [(['A', 'B', 'C', 'D', 'b', 'b', 'E', 'F', 'c', 'G', 'H', 'I', 'd', 'd'], 3, 16), (['J', 'K', 'e', 'e'], 18, 21)]

Notice the trailing lower case letters "dd" and "ee" at the end of the two
tokens. The default behavior of `StreamTokenizer` is to keep the *trailing
silence* if it doesn't exceed `max_continuous_silence`. This can be changed
using the `DROP_TRAILING_SILENCE` mode (see next example).

Remove trailing silence
-----------------------

Trailing silence can be useful for many sound recognition applications, including
speech recognition. Moreover, from the point of view of the human auditory system,
trailing low energy signal helps avoid abrupt signal cuts.

If you want to remove it anyway, you can do so by setting `mode` to
`StreamTokenizer.DROP_TRAILING_SILENCE`:

.. code:: python

    from auditok import StreamTokenizer, StringDataSource, DataValidator

    class UpperCaseChecker(DataValidator):
        def is_valid(self, frame):
            return frame.isupper()

    dsource = StringDataSource("aaaABCDbbEFcGHIdddJKee")
    tokenizer = StreamTokenizer(validator=UpperCaseChecker(),
                                min_length=1, max_length=9999,
                                max_continuous_silence=2,
                                mode=StreamTokenizer.DROP_TRAILING_SILENCE)

    tokenizer.tokenize(dsource)

output:

.. code:: python

    [(['A', 'B', 'C', 'D', 'b', 'b', 'E', 'F', 'c', 'G', 'H', 'I'], 3, 14), (['J', 'K'], 18, 19)]


Limit the length of detected tokens
-----------------------------------

Imagine that you just want to detect and recognize a small part of a long
acoustic event (e.g. engine noise, water flow, etc.) and prevent that
event from hogging the tokenizer, delaying its delivery to the next
processing step (i.e. a sound recognizer). You can do this by:

- limiting the length of a detected token

and

- using a callback function as an argument to `StreamTokenizer.tokenize`,
  so that the tokenizer delivers a token as soon as it is detected.

The following code limits the length of a token to 5:

.. code:: python

    from auditok import StreamTokenizer, StringDataSource, DataValidator

    class UpperCaseChecker(DataValidator):
        def is_valid(self, frame):
            return frame.isupper()

    dsource = StringDataSource("aaaABCDEFGHIJKbbb")
    tokenizer = StreamTokenizer(validator=UpperCaseChecker(),
                                min_length=1, max_length=5,
                                max_continuous_silence=0)

    def print_token(data, start, end):
        print("token = '{0}', starts at {1}, ends at {2}".format(''.join(data), start, end))

    tokenizer.tokenize(dsource, callback=print_token)


output::

    token = 'ABCDE', starts at 3, ends at 7
    token = 'FGHIJ', starts at 8, ends at 12
    token = 'K', starts at 13, ends at 13


Using real audio data
=====================

In this section we will use `ADSFactory`, `AudioEnergyValidator` and `StreamTokenizer`
for an AAD demonstration using audio data. Before we go any further, it is worth
explaining a certain number of points.

The `ADSFactory.ads` method is called to create an `AudioDataSource` object that can be
passed to `StreamTokenizer.tokenize`. `ADSFactory.ads` accepts a number of keyword
arguments, none of which is mandatory. The returned `AudioDataSource` object can
however differ greatly depending on the passed arguments. Further details can be found
in the respective method documentation. Note however the following two calls, which
create an `AudioDataSource` that reads data from an audio file and from the built-in
microphone respectively.

.. code:: python

    from auditok import ADSFactory

    # Get an AudioDataSource from a file
    file_ads = ADSFactory.ads(filename="path/to/file/")

    # Get an AudioDataSource from the built-in microphone
    # The returned object has the default values for sampling
    # rate, sample width and number of channels. See the method's
    # documentation for customized values
    mic_ads = ADSFactory.ads()

For `StreamTokenizer`, the parameters `min_length`, `max_length` and `max_continuous_silence`
are expressed in terms of number of frames. If you want a `max_length` of *2 seconds* for
your detected sound events and your *analysis window* is *10 ms* long, you have to specify
a `max_length` of 200 (`int(2. / (10. / 1000)) == 200`). For a `max_continuous_silence` of *300 ms*,
for instance, the value to pass to `StreamTokenizer` is 30 (`int(0.3 / (10. / 1000)) == 30`).
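
These conversions are simple arithmetic; a small helper (hypothetical, not part of auditok) makes them explicit and rounds to the nearest frame to avoid floating-point surprises:

```python
def duration_to_frames(duration_s, analysis_window_s=0.01):
    """Convert a duration in seconds to a number of analysis windows."""
    return int(round(duration_s / analysis_window_s))

# max_length for 2-second events with a 10 ms analysis window
print(duration_to_frames(2.0))   # 200

# max_continuous_silence for 300 ms of tolerated silence
print(duration_to_frames(0.3))   # 30
```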


Where do you get the size of the **analysis window** from?


Well, this is a parameter you pass to `ADSFactory.ads`. By default, `ADSFactory.ads` uses
an analysis window of 10 ms. The number of samples that 10 ms of signal contains will
vary depending on the sampling rate of your audio source (file, microphone, etc.).
For a sampling rate of 16 kHz (16000 samples per second), we have 160 samples for 10 ms.
Therefore you can use block sizes of 160, 320 and 1600 for analysis windows of 10, 20 and 100
ms respectively.

.. code:: python

    from auditok import ADSFactory

    file_ads = ADSFactory.ads(filename="path/to/file/", block_size=160)

    file_ads = ADSFactory.ads(filename="path/to/file/", block_size=320)

    # If no sampling rate is specified, ADSFactory uses 16 kHz as the default
    # rate for the microphone. If you want to use a window of 100 ms, use
    # a block size of 1600
    mic_ads = ADSFactory.ads(block_size=1600)

So if you are not sure what your analysis window's duration in seconds is, use the following:

.. code:: python

    my_ads = ADSFactory.ads(...)
    analysis_win_seconds = float(my_ads.get_block_size()) / my_ads.get_sampling_rate()
    analysis_window_ms = analysis_win_seconds * 1000

    # For a `max_continuous_silence` of 300 ms use:
    max_continuous_silence = int(300. / analysis_window_ms)

    # Which is the same as:
    max_continuous_silence = int(0.3 / (analysis_window_ms / 1000))


Examples
--------

Extract isolated phrases from an utterance
------------------------------------------

We will build an `AudioDataSource` using a wave file from the bundled `dataset` module.
The file contains isolated pronunciations of the digits from 1 to 6
in Arabic, as well as breath-in/out between 2 and 3. The code will play the
original file then the detected sounds separately. Note that we use an
`energy_threshold` of 65; this parameter should be carefully chosen. It depends
on microphone quality, background noise and the amplitude of the events you want to
detect.

.. code:: python

    from auditok import ADSFactory, AudioEnergyValidator, StreamTokenizer, player_for, dataset

    # We set the `record` argument to True so that we can rewind the source
    asource = ADSFactory.ads(filename=dataset.one_to_six_arabic_16000_mono_bc_noise, record=True)

    validator = AudioEnergyValidator(sample_width=asource.get_sample_width(), energy_threshold=65)

    # Default analysis window is 10 ms (float(asource.get_block_size()) / asource.get_sampling_rate())
    # min_length=20 : minimum length of a valid audio activity is 20 * 10 == 200 ms
    # max_length=400 : maximum length of a valid audio activity is 400 * 10 == 4000 ms == 4 seconds
    # max_continuous_silence=30 : maximum length of a tolerated silence within a valid audio activity is 30 * 10 == 300 ms
    tokenizer = StreamTokenizer(validator=validator, min_length=20, max_length=400, max_continuous_silence=30)

    asource.open()
    tokens = tokenizer.tokenize(asource)

    # Play detected regions back

    player = player_for(asource)

    # Rewind and read the whole signal
    asource.rewind()
    original_signal = []

    while True:
        w = asource.read()
        if w is None:
            break
        original_signal.append(w)

    original_signal = ''.join(original_signal)

    print("Playing the original file...")
    player.play(original_signal)

    print("Playing detected regions...")
    for t in tokens:
        print("Token starts at {0} and ends at {1}".format(t[1], t[2]))
        data = ''.join(t[0])
        player.play(data)

    assert len(tokens) == 8


The tokenizer extracts 8 audio regions from the signal, including all isolated digits
(from 1 to 6) as well as the 2-phase respiration of the subject. You might have noticed
that, in the original file, the last three digits are closer to each other than the
previous ones. If you want them to be extracted as one single phrase, you can do so
by tolerating a larger continuous silence within a detection:

.. code:: python

    tokenizer.max_continuous_silence = 50
    asource.rewind()
    tokens = tokenizer.tokenize(asource)

    for t in tokens:
        print("Token starts at {0} and ends at {1}".format(t[1], t[2]))
        data = ''.join(t[0])
        player.play(data)

    assert len(tokens) == 6


Trim leading and trailing silence
---------------------------------

The tokenizer in the following example is set up to remove the silence
that precedes the first acoustic activity or follows the last activity
in a record. It preserves whatever it finds between the two activities.
In other words, it removes the leading and trailing silence.

The sampling rate is 44100 samples per second; we'll use an analysis window of 100 ms
(i.e. block_size == 4410).

The energy threshold is 50.

The tokenizer will start accumulating windows from the moment it encounters
the first analysis window with an energy >= 50. ALL the following windows will be
kept, regardless of their energy. At the end of the analysis, it will drop trailing
windows with an energy below 50.

This is an interesting example because the audio file we're analyzing contains a very
brief noise that occurs within the leading silence. We certainly do not want our tokenizer
to trigger at this point and consider whatever comes after as useful signal.
To force the tokenizer to ignore that brief event, we use two other parameters, `init_min`
and `init_max_silence`. By setting `init_min` = 3 and `init_max_silence` = 1 we tell the tokenizer
that a valid event must start with at least 3 noisy windows, between which there
is at most 1 silent window.

Even with this configuration, the tokenizer could still detect that noise as a valid event
(if it actually contains 3 consecutive noisy frames). To circumvent this, we use a
sufficiently large analysis window (here 100 ms) to ensure that the brief noise is
surrounded by a much longer silence, so that the energy of the overall analysis window
stays below 50.

When using a shorter analysis window (of 10 ms for instance, block_size == 441), the brief
noise contributes more to the energy calculation, which yields an energy of over 50 for the window.
Again, we can deal with this situation by using a higher energy threshold (55 for example).
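
The dilution effect can be checked with a few lines of standalone arithmetic (a sketch for intuition, not auditok's actual energy computation): the same brief burst raises the log energy of a 10 ms window far more than that of a 100 ms window.

```python
import math

def log_energy(samples):
    """Log mean-square energy of a window, on a dB-like scale."""
    mean_square = sum(s * s for s in samples) / len(samples)
    return 10 * math.log10(mean_square)

burst = [3000] * 44                        # ~1 ms of noise at 44100 Hz
short_window = burst + [1] * (441 - 44)    # 10 ms window (block_size 441)
long_window = burst + [1] * (4410 - 44)    # 100 ms window (block_size 4410)

# The burst dominates the short window but is diluted in the long one,
# so a fixed threshold can reject the long window yet accept the short one
print(round(log_energy(short_window)))  # 60
print(round(log_energy(long_window)))   # 50
```

The sample values and the scale here are arbitrary; the point is only the ~10 dB gap that a tenfold longer window puts between the two measurements.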

.. code:: python

    from auditok import ADSFactory, AudioEnergyValidator, StreamTokenizer, player_for, dataset

    # record = True so that we'll be able to rewind the source.
    asource = ADSFactory.ads(filename=dataset.was_der_mensch_saet_mono_44100_lead_trail_silence,
                             record=True, block_size=4410)
    asource.open()

    original_signal = []
    # Read the whole signal
    while True:
        w = asource.read()
        if w is None:
            break
        original_signal.append(w)

    original_signal = ''.join(original_signal)

    # Rewind the source
    asource.rewind()

    # Create a validator with an energy threshold of 50
    validator = AudioEnergyValidator(sample_width=asource.get_sample_width(), energy_threshold=50)

    # Create a tokenizer with an unlimited token length and unlimited continuous silence within a token
    # Note the DROP_TRAILING_SILENCE mode that ensures removing trailing silence
    trimmer = StreamTokenizer(validator, min_length=20, max_length=99999999,
                              init_min=3, init_max_silence=1,
                              max_continuous_silence=9999999,
                              mode=StreamTokenizer.DROP_TRAILING_SILENCE)

    tokens = trimmer.tokenize(asource)

    # Make sure we only have one token
    assert len(tokens) == 1, "Should have detected one single token"

    trimmed_signal = ''.join(tokens[0][0])

    player = player_for(asource)

    print("Playing original signal (with leading and trailing silence)...")
    player.play(original_signal)
    print("Playing trimmed signal...")
    player.play(trimmed_signal)


Online audio signal processing
------------------------------

In the next example, audio data is directly acquired from the built-in microphone.
The `tokenize` method is passed a callback function so that audio activities
are delivered as soon as they are detected. Each detected activity is played
back using the built-in audio output device.

As mentioned before, signal energy is strongly related to many factors such as
microphone sensitivity, background noise (including noise inherent to the hardware),
distance and your operating system's sound settings. Try a lower `energy_threshold`
if your noise does not seem to be detected, and a higher threshold if you notice
over-detection (i.e. the `echo` function prints a detection where you have made no noise).

.. code:: python

    from auditok import ADSFactory, AudioEnergyValidator, StreamTokenizer, player_for

    # record = True so that we'll be able to rewind the source.
    # max_time = 10: read 10 seconds from the microphone
    asource = ADSFactory.ads(record=True, max_time=10)

    validator = AudioEnergyValidator(sample_width=asource.get_sample_width(), energy_threshold=50)
    tokenizer = StreamTokenizer(validator=validator, min_length=20, max_length=250, max_continuous_silence=30)

    player = player_for(asource)

    def echo(data, start, end):
        print("Acoustic activity at: {0}--{1}".format(start, end))
        player.play(''.join(data))

    asource.open()

    tokenizer.tokenize(asource, callback=echo)

If you want to re-run the tokenizer after changing one or more parameters, use the following code:

.. code:: python

    asource.rewind()
    # change the energy threshold, for example
    tokenizer.validator.set_energy_threshold(55)
    tokenizer.tokenize(asource, callback=echo)

In case you want to play back the whole recorded signal, use:

.. code:: python

    player.play(asource.get_audio_source().get_data_buffer())

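
If you would rather save the recording than play it, the raw buffer can be written to a WAV file with the standard `wave` module (a sketch; in practice the sample width, channel count and sampling rate would come from the `AudioDataSource` accessors such as `get_sample_width()` and `get_sampling_rate()`):

```python
import wave

def save_raw_buffer(filename, data, sample_width=2, channels=1, rate=16000):
    """Write a raw PCM buffer (bytes) to a WAV file."""
    with wave.open(filename, "wb") as wfp:
        wfp.setnchannels(channels)
        wfp.setsampwidth(sample_width)
        wfp.setframerate(rate)
        wfp.writeframes(data)

# e.g.: save_raw_buffer("recording.wav", asource.get_audio_source().get_data_buffer())
```
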

Contributing
============

**auditok** is on `GitHub <https://github.com/amsehili/auditok>`_. You're welcome to fork it and contribute.


Amine SEHILI <amine.sehili[_at_]gmail.com>
September 2015

License
=======

This package is published under the GNU GPL, Version 3.