changeset 351:3d6e4d8f6903
Update docstring
author | Amine Sehili <amine.sehili@gmail.com> |
---|---|
date | Tue, 31 Mar 2020 22:21:13 +0200 |
parents | 1076056833c5 |
children | 02f4aa16598a |
files | auditok/core.py |
diffstat | 1 files changed, 415 insertions(+), 257 deletions(-) |
--- a/auditok/core.py Wed Jan 22 23:21:44 2020 +0100 +++ b/auditok/core.py Tue Mar 31 22:21:13 2020 +0200 @@ -36,61 +36,99 @@ strict_min_dur=False, **kwargs ): - """Splits audio data and returns a generator of `AudioRegion`s + """ + Split audio data and return a generator of `AudioRegion`s - :Parameters: + Parameters + ---------- + input : str, bytes, AudioSource, AudioReader, AudioRegion or None + input audio data. If str, it should be a path to an existing audio file. + If bytes, input is considered as raw audio data. If None, read audio + from microphone. + Every object that is not an `AudioReader` will be transformed into an + `AudioReader` before processing. If it is a `str` that refers to a raw + audio file, `bytes` or None, audio parameters should be provided using + kwargs (i.e., `sampling_rate`, `sample_width` and `channels` or their + aliases). + If `input` is a str then audio format will be guessed from file extension. + `audio_format` (alias `fmt`) kwarg can also be given to specify audio + format explicitly. If none of these options is available, rely on the + backend (currently only pydub is supported) to load data. + min_dur : float, default: 0.2 + minimum duration in seconds of a detected audio event. By using large + values for `min_dur`, very short audio events (e.g., very short 1-word + utterances like 'yes' or 'no') can be missed. Using very short + values might result in a high number of short, useless audio events. + max_dur : float, default: 5 + maximum duration in seconds of a detected audio event. If an audio event + lasts more than `max_dur` it will be truncated. If the continuation of a + truncated audio event is shorter than `min_dur` then this continuation + is accepted as a valid audio event if `strict_min_dur` is False. + Otherwise it is rejected. + max_silence : float, default: 0.3 + maximum duration of continuous silence within an audio event. There + might be many silent gaps of this duration within one audio event.
If + the continuous silence happens at the end of the event then it is kept as + part of the event if `drop_trailing_silence` is False (default). + drop_trailing_silence : bool, default: False + Whether to remove trailing silence from detected events. To avoid abrupt + cuts in speech, trailing silence should be kept, therefore + `drop_trailing_silence` should be False. + strict_min_dur : bool, default: False + strict minimum duration. Do not accept an audio event if it is shorter + than `min_dur` even if it is contiguous to the latest valid event. This + happens if the latest detected event had reached `max_dur`. - input: str, bytes, AudioSource, AudioRegion, AudioReader - input audio data. If str, it should be a path to an existing audio - file. If bytes, input is considered as raw audio data. - min_dur: float - minimun duration in seconds of a detected audio event. Default: 0.2. - Using large values, very short audio events (e.g., very short 1-word - utterances like 'yes' or 'no') can be missed. - Using very short values might result in a high number of short, - unuseful audio events. - max_dur: float - maximum duration in seconds of a detected audio event. Default: 5. - max_silence: float - maximum duration of consecutive silence within an audio event. There - might be many silent gaps of this duration within an audio event. - drop_trailing_silence: bool - drop trailing silence from detected events - strict_min_dur: bool - strict minimum duration. Drop an event if it is shorter than ´min_dur´ - even if it is continguous to the latest valid event. This happens if - the the latest event had reached ´max_dur´. - analysis_window, aw: float - duration of analysis window in seconds. Default: 0.05 second (50 ms). - A value up to 0.1 second (100 ms) should be good for most use-cases. - You might need a different value, especially if you use a custom - validator. - audio_format, fmt: str - type of audio date (e.g., wav, ogg, raw, etc.).
This will only be used - if ´input´ is a string path to audio file. If not given, audio type - will be guessed from file name extension or from file header. - sampling_rate, sr: int - sampling rate of audio data. Only needed for raw audio files/data. - sample_width, sw: int - number of bytes used to encode an audio sample, typically 1, 2 or 4. - Only needed for raw audio files/data. - channels, ch: int - nuumber of channels of audio data. Only needed for raw audio files. - use_channel, uc: int, str - which channel to use if input has multichannel audio data. Can be an - int (0 being the first channel), or one of the following values: - - None, "any": a valid frame from one any given channel makes - parallel frames from all other channels automatically valid. - - 'mix': compute average channel (i.e. mix down all channels) - max_read, mr: float - maximum data to read in seconds. Default: `None`, read until there is - no more data to read. - validator, val: DataValidator + Kwargs + ------ + analysis_window, aw : float, default: 0.05 (50 ms) + duration of analysis window in seconds. A value between 0.01 (10 ms) and + 0.1 (100 ms) should be good for most use-cases. + audio_format, fmt : str + type of audio data (e.g., wav, ogg, flac, raw, etc.). This will only be + used if `input` is a string path to an audio file. If not given, audio + type will be guessed from file name extension or from file header. + sampling_rate, sr : int + sampling rate of audio data. Required if `input` is a raw audio file, + a bytes object or None (i.e., read from microphone). + sample_width, sw : int + number of bytes used to encode one audio sample, typically 1, 2 or 4. + Required for raw data, see `sampling_rate`. + channels, ch : int + number of channels of audio data. Required for raw data, see + `sampling_rate`. + use_channel, uc : {None, "mix"} or int + which channel to use for split if `input` has multiple audio channels.
+ Regardless of which channel is used for splitting, returned audio events + contain data from *all* channels, just as `input`. + The following values are accepted: + - None (alias "any"): accept audio activity from any channel, even + if other channels are silent. This is the default behavior. + - "mix" ("avg" or "average"): mix down all channels (i.e. compute + average channel) and split the resulting channel. + - int (>= 0, < `channels`): use one channel, specified by integer + id, for split. + large_file : bool, default: False + If True, AND if `input` is a path to a *wav* or a *raw* audio file + (and only these two formats) then audio data is lazily loaded to memory + (i.e., one analysis window at a time). Otherwise the whole file is loaded + to memory before split. Set to True if the size of the file is larger + than available memory. + max_read, mr : float, default: None (read until end of stream) + maximum data to read from source in seconds. + validator, val : callable, DataValidator custom data validator. If `None` (default), an `AudioEnergyValidator` is - used with the given energy threshold. - energy_threshold, eth: float - energy threshlod for audio activity detection, default: 50. If a custom - validator is given, this argumemt will be ignored. + used with the given energy threshold. Can be a callable or an instance + of `DataValidator` that implements `is_valid`. In either case, it will be + called with a window of audio data as the first parameter. + energy_threshold, eth : float, default: 50 + energy threshold for audio activity detection. Audio regions that have + enough windows with a signal energy equal to or above this threshold + are considered valid audio events. Here we are referring to this quantity + as the energy of the signal but, to be more accurate, it is the log energy + of the signal computed as: 10 * log10(dot(x, x) / len(x)) + If `validator` is given, this argument is ignored. 
""" if min_dur <= 0: raise ValueError("'min_dur' ({}) must be > 0".format(min_dur)) @@ -210,22 +248,23 @@ `duration` and `analysis_window` can be in seconds or milliseconds but must be in the same unit. - :Parameters: + Parameters + ---------- - duration: float + duration : float a given duration in seconds or ms. analysis_window: float size of analysis window, in the same unit as `duration`. - round_fn: callable + round_fn : callable function called to round the result. Default: `round`. - epsilon: float + epsilon : float small value to add to the division result before rounding. E.g., `0.3 / 0.1 = 2.9999999999999996`, when called with `round_fn=math.floor` returns `2` instead of `3`. Adding a small value to `0.3 / 0.1` avoids this error. - Returns: - -------- + Returns + ------- nb_windows: int minimum number of `analysis_window`'s to cover `duration`. That means that `analysis_window * nb_windows >= duration`. @@ -246,23 +285,28 @@ sample_width, channels, ): - """Create and return an `AudioRegion`. + """ + Helper function to create an `AudioRegion` from parameters returned by the + tokenization object. It takes care of setting up region `start` and `end` + in metadata. - :Parameters: + Parameters + ---------- frame_duration: float duration of analysis window in seconds - start_frame: int + start_frame : int index of the first analysis window - samling_rate: int + sampling_rate : int sampling rate of audio data - sample_width: int + sample_width : int number of bytes of one audio sample - channels: int + channels : int number of channels of audio data - Returns: - audio_region: AudioRegion + Returns + ------- + audio_region : AudioRegion AudioRegion whose start time is calculated as: `1000 * start_frame * frame_duration` """ @@ -274,6 +318,24 @@ def _read_chunks_online(max_read, **kwargs): + """ + Helper function to read audio data from an online blocking source + (i.e., microphone).
Used to build an `AudioRegion` and can intercept + KeyboardInterrupt so that reading stops as soon as this exception is + raised. Makes building `AudioRegion`s in [i]Python sessions and Jupyter + notebooks more user-friendly. + + Parameters + ---------- + max_read : float + maximum amount of data to read in seconds. + kwargs : + audio parameters (sampling_rate, sample_width and channels). + + See also + -------- + `AudioRegion.build` + """ reader = AudioReader(None, block_dur=0.5, max_read=max_read, **kwargs) reader.open() data = [] @@ -297,6 +359,28 @@ def _read_offline(input, skip=0, max_read=None, **kwargs): + """ + Helper function to read audio data from an offline source (i.e., a file). Used to + build `AudioRegion`s. + + Parameters + ---------- + input : str, bytes + path to audio file (if str), or a bytes object representing raw audio + data. + skip : float, default 0 + amount of data to skip from the beginning of audio source. + max_read : float, default: None + maximum amount of audio data to read. If None (default), read until + end of stream. + kwargs : + audio parameters (sampling_rate, sample_width and channels). + + See also + -------- + `AudioRegion.build` + + """ audio_source = get_audio_source(input, **kwargs) audio_source.open() if skip is not None and skip > 0: @@ -329,6 +413,10 @@ class _SecondsView: + """A class to create a view of `AudioRegion` that can be sliced using + indices in seconds. + """ + def __init__(self, region): self._region = region @@ -350,6 +438,10 @@ class _MillisView(_SecondsView): + """A class to create a view of `AudioRegion` that can be sliced using + indices in milliseconds. + """ + def __getitem__(self, index): err_msg = ( "Slicing AudioRegion by milliseconds requires indices of type " @@ -376,6 +468,9 @@ class _AudioRegionMetadata(dict): + """A class to store `AudioRegion`'s metadata. 
+ """ + def __getattr__(self, name): if name in self: return self[name] @@ -396,18 +491,28 @@ class AudioRegion(object): def __init__(self, data, sampling_rate, sample_width, channels, meta=None): """ - A class for detected audio events. + AudioRegion encapsulates raw audio data and provides an interface to + perform simple operations on it. Use `AudioRegion.load` to build an + `AudioRegion` from different types of objects. - :Parameters: + Parameters + ---------- + data : bytes + raw audio data as a bytes object + sampling_rate : int + sampling rate of audio data + sample_width : int + number of bytes of one audio sample + channels : int + number of channels of audio data + meta : dict, default: None + any collection of <key:value> elements used to build metadata for this + `AudioRegion`. Metadata can be accessed via `region.meta.key` if `key` + is a valid python attribute name, or via `region.meta[key]` if not. - data: bytes - audio data - samling_rate: int - sampling rate of audio data - sample_width: int - number of bytes of one audio sample - channels: int - number of channels of audio data + See also + -------- + AudioRegion.load """ check_audio_data(data, sample_width, channels) self._data = data @@ -423,7 +528,8 @@ self._meta = None self._seconds_view = _SecondsView(self) - self.s = self.sec + self.sec = self.seconds + self.s = self.seconds self._millis_view = _MillisView(self) self.ms = self.millis @@ -438,16 +544,60 @@ @classmethod def load(cls, input, skip=0, max_read=None, **kwargs): + """ + Create an `AudioRegion` by loading data from `input`. + + Parameters + ---------- + input : None, str, bytes, AudioSource + source to load data from. If None, load data from microphone. If + bytes, create region from raw data. If str, load data from file. + Input can also be an AudioSource object. + skip : float, default: 0 + amount, in seconds, of audio data to skip from source. If read from + microphone, `skip` must be 0, otherwise a ValueError is raised.
+ max_read : float, default: None + amount, in seconds, of audio data to read from source. If read from + microphone, `max_read` should not be None, otherwise a ValueError is + raised. + + audio_format, fmt : str + type of audio data (e.g., wav, ogg, flac, raw, etc.). This will only + be used if `input` is a string path to an audio file. If not given, + audio type will be guessed from file name extension or from file + header. + sampling_rate, sr : int + sampling rate of audio data. Required if `input` is a raw audio file, + a bytes object or None (i.e., read from microphone). + sample_width, sw : int + number of bytes used to encode one audio sample, typically 1, 2 or 4. + Required for raw data, see `sampling_rate`. + channels, ch : int + number of channels of audio data. Required for raw data, see + `sampling_rate`. + large_file : bool, default: False + If True, AND if `input` is a path to a *wav* or a *raw* audio file + (and only these two formats) then audio file is not fully loaded to + memory. Set to True to only load `max_read` data from file. + + Returns + ------- + region: AudioRegion + + Raises + ------ + ValueError if `input` is None and `skip` != 0, or if `max_read` is None. + """ if input is None: + if skip > 0: + raise ValueError( + "'skip' should be 0 when reading from microphone" + ) if max_read is None or max_read < 0: raise ValueError( "'max_read' should not be None when reading from " "microphone" ) - if skip > 0: - raise ValueError( - "'skip' should be 0 when reading from microphone" - ) data, sampling_rate, sample_width, channels = _read_chunks_online( max_read, **kwargs ) @@ -459,7 +609,7 @@ return cls(data, sampling_rate, sample_width, channels) @property - def sec(self): + def seconds(self): return self._seconds_view @property @@ -500,20 +650,21 @@ return self._channels def play(self, progress_bar=False, player=None, **progress_bar_kwargs): - """Play audio region + """ + Play audio region.
- :Parameters: - - player: AudioPalyer, default: None - audio player to use. if None (default), use `player_for(self)` + Parameters + ---------- + progress_bar : bool, default: False + whether to use a progress bar while playing audio. Default: False. + `progress_bar` requires `tqdm`; if it is not installed, no progress bar + will be shown. + player : AudioPlayer, default: None + audio player to use. If None (default), use `player_for()` to get a new audio player. - - progress_bar bool, default: False - whether to use a progress bar while playing audio. Default: False. - - progress_bar_kwargs: kwargs - keyword arguments to pass to progress_bar object. Currently only - `tqdm` is supported. + progress_bar_kwargs : kwargs + keyword arguments to pass to the `tqdm` progress bar builder (e.g., + use `leave=False` to clean up the screen when play finishes). """ if player is None: player = player_for(self) @@ -521,47 +672,51 @@ self._data, progress_bar=progress_bar, **progress_bar_kwargs ) - def save(self, file, format=None, exists_ok=True, **audio_parameters): - """Save audio region to file. + def save( + self, file, audio_format=None, exists_ok=True, **audio_parameters + ): + """ + Save audio region to file. - :Parameters: + Parameters + ---------- + file : str + path to output audio file. May contain a `{duration}` placeholder + as well as any placeholder that this region's metadata might + contain (e.g., regions returned by `split` contain metadata with + `start` and `end` attributes that can be used to build output file + names as `{meta.start}` and `{meta.end}`). See examples using + placeholders with formatting. - file: str, file-like object - path to output file or a file-like object. If ´str´, it may contain - and ´{duration}´ place holders as well as any place holder that - this region's metadata might contain (e.g., ´{meta.start}´). + audio_format : str + format used to save audio data. If None (default), format is guessed + from file name's extension.
If file name has no extension, audio + data is saved as a raw (headerless) audio file. + exists_ok : bool, default: True + If True, overwrite `file` if a file with the same name exists. + If False, raise an `IOError` if `file` exists. + audio_parameters: dict + any keyword arguments to be passed to audio saving backend. + FIXME: this is not yet implemented! + Returns + ------- + file: str + name of output file with replaced placeholders. - format: str - type of audio file. If None (default), file type is guessed from - `file`'s extension. If `file` is not a ´str´ or does not have - an extension, audio data is as a raw (headerless) audio file. - exists_ok: bool, default: True - If True, overwrite ´file´ if a file with the same name exists. - If False, raise an ´IOError´ if the file exists. - audio_parameters: dict - any keyword arguments to be passed to audio saving backend - (e.g. bitrate, etc.) + Raises + ------ + IOError if `file` exists and `exists_ok` is False. - :Returns: - - file: str, file-like object - name of the file of file-like object to which audio data was - written. If parameter ´file´ was a ´str´ with at least one {start}, - {end} or {duration} place holders. - - :Raises: - - IOError if ´file´ exists and ´exists_ok´ is False. - - Example: + Example + ------- .. code:: python region = AudioRegion(b'\0' * 2 * 24000, sampling_rate=16000, sample_width=2, channels=1) - region.meta = {"start": 2.25, "end": 2.25 + region.duration} + region.meta.start = 2.25 + region.meta.end = 2.25 + region.duration region.save('audio_{meta.start}-{meta.end}.wav') audio_2.25-3.75.wav region.save('region_{meta.start:.3f}_{duration:.3f}.wav') @@ -574,7 +729,7 @@ to_file( self._data, file, - format, + audio_format, sr=self.sr, sw=self.sw, ch=self.ch, @@ -591,7 +746,8 @@ strict_min_dur=False, **kwargs ): - """Split region. See :auditok.split() for split parameters description. + """Split audio region. See `auditok.split()` for split parameters + description.
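The `{meta.start}`/`{duration}` placeholders in the `save` example above boil down to plain `str.format` with attribute access; a minimal illustration (the `Meta` class below is a hypothetical stand-in for region metadata):

```python
class Meta:
    # Hypothetical stand-in for an AudioRegion's metadata object
    start = 2.25
    end = 3.75

# Attribute access inside replacement fields, as used by `region.save(...)`
filename = "region_{meta.start:.3f}_{duration:.3f}.wav".format(
    meta=Meta, duration=Meta.end - Meta.start
)
filename  # 'region_2.250_1.500.wav'
```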
""" if kwargs.get("max_read", kwargs.get("mr")) is not None: warn_msg = "'max_read' (or 'mr') should not be used with " @@ -827,165 +983,167 @@ Class for stream tokenizers. It implements a 4-state automaton scheme to extract sub-sequences of interest on the fly. - :Parameters: + Parameters + ---------- - `validator` : - Callable or an instance of DataValidator that implements - `is_valid` method. + validator : callable, DataValidator (must implement `is_valid`) + called with each data frame read from source. Should take one positional + argument and return True or False for valid and invalid frames + respectively. - `min_length` : *(int)* - Minimum number of frames of a valid token. This includes all - tolerated non valid frames within the token. + min_length : int + Minimum number of frames of a valid token. This includes all + tolerated non valid frames within the token. - `max_length` : *(int)* - Maximum number of frames of a valid token. This includes all - tolerated non valid frames within the token. + max_length : int + Maximum number of frames of a valid token. This includes all + tolerated non valid frames within the token. - `max_continuous_silence` : *(int)* - Maximum number of consecutive non-valid frames within a token. - Note that, within a valid token, there may be many tolerated - *silent* regions that contain each a number of non valid frames up - to `max_continuous_silence` + `max_continuous_silence` : *(int)* + Maximum number of consecutive non-valid frames within a token. + Note that, within a valid token, there may be many tolerated + *silent* regions that contain each a number of non valid frames up + to `max_continuous_silence` - `init_min` : *(int, default=0)* - Minimum number of consecutive valid frames that must be - **initially** gathered before any sequence of non valid frames can - be tolerated. This option is not always needed, it can be used to - drop non-valid tokens as early as possible. 
**Default = 0** means - that the option is by default ineffective. + `init_min` : *(int, default=0)* + Minimum number of consecutive valid frames that must be + **initially** gathered before any sequence of non valid frames can + be tolerated. This option is not always needed, it can be used to + drop non-valid tokens as early as possible. **Default = 0** means + that the option is by default ineffective. - `init_max_silence` : *(int, default=0)* - Maximum number of tolerated consecutive non-valid frames if the - number already gathered valid frames has not yet reached - 'init_min'.This argument is normally used if `init_min` is used. - **Default = 0**, by default this argument is not taken into - consideration. + `init_max_silence` : *(int, default=0)* + Maximum number of tolerated consecutive non-valid frames if the + number of already gathered valid frames has not yet reached + `init_min`. This argument is normally used if `init_min` is used. + **Default = 0**: by default this argument is not taken into + consideration. - `mode` : *(int, default=0)* - `mode` can be: + `mode` : *(int, default=0)* + `mode` can be: - 1. `StreamTokenizer.NORMAL`: - Do not drop trailing silence, and accept a token shorter than - `min_length` if it is the continuation of the latest delivered token. + 1. `StreamTokenizer.NORMAL`: + Do not drop trailing silence, and accept a token shorter than + `min_length` if it is the continuation of the latest delivered token. - 2. `StreamTokenizer.STRICT_MIN_LENGTH`: - if token *i* is delivered because `max_length` - is reached, and token *i+1* is immediately adjacent to - token *i* (i.e. token *i* ends at frame *k* and token *i+1* starts - at frame *k+1*) then accept token *i+1* only of it has a size of at - least `min_length`. The default behavior is to accept token *i+1* - event if it is shorter than `min_length` (given that the above - conditions are fulfilled of course). + 2.
`StreamTokenizer.STRICT_MIN_LENGTH`: + if token *i* is delivered because `max_length` + is reached, and token *i+1* is immediately adjacent to + token *i* (i.e. token *i* ends at frame *k* and token *i+1* starts + at frame *k+1*) then accept token *i+1* only if it has a size of at + least `min_length`. The default behavior is to accept token *i+1* + even if it is shorter than `min_length` (given that the above + conditions are fulfilled of course). - :Examples: + :Examples: - In the following code, without `STRICT_MIN_LENGTH`, the 'BB' token is - accepted although it is shorter than `min_length` (3), because it - immediately follows the latest delivered token: + In the following code, without `STRICT_MIN_LENGTH`, the 'BB' token is + accepted although it is shorter than `min_length` (3), because it + immediately follows the latest delivered token: + + .. code:: python + + from auditok import (StreamTokenizer, + StringDataSource, + DataValidator) + + class UpperCaseChecker(DataValidator): + def is_valid(self, frame): + return frame.isupper() + + + dsource = StringDataSource("aaaAAAABBbbb") + tokenizer = StreamTokenizer(validator=UpperCaseChecker(), + min_length=3, + max_length=4, + max_continuous_silence=0) + + tokenizer.tokenize(dsource) + + :output: .. code:: python - from auditok import (StreamTokenizer, - StringDataSource, - DataValidator) + [(['A', 'A', 'A', 'A'], 3, 6), (['B', 'B'], 7, 8)] - class UpperCaseChecker(DataValidator): - def is_valid(self, frame): - return frame.isupper() + The following tokenizer will however reject the 'BB' token: - dsource = StringDataSource("aaaAAAABBbbb") - tokenizer = StreamTokenizer(validator=UpperCaseChecker(), - min_length=3, - max_length=4, - max_continuous_silence=0) + ..
code:: python + dsource = StringDataSource("aaaAAAABBbbb") + tokenizer = StreamTokenizer(validator=UpperCaseChecker(), + min_length=3, max_length=4, + max_continuous_silence=0, + mode=StreamTokenizer.STRICT_MIN_LENGTH) + tokenizer.tokenize(dsource) + + :output: + + .. code:: python + + [(['A', 'A', 'A', 'A'], 3, 6)] + + + 3. `StreamTokenizer.DROP_TRAILING_SILENCE`: drop all trailing non-valid + frames from a token to be delivered if and only if it is not + **truncated**. This can be a bit tricky. A token is actually delivered + if: - a. `max_continuous_silence` is reached + + :or: + + - b. Its length reaches `max_length`. This is called a **truncated** + token + + In the current implementation, a `StreamTokenizer`'s decision is only + based on already seen data and on incoming data. Thus, if a token is + truncated at a non-valid but tolerated frame (`max_length` is reached + but `max_continuous_silence` not yet) any trailing silence will be kept + because it can potentially be part of a valid token (if `max_length` was + bigger). But if `max_continuous_silence` is reached before + `max_length`, the delivered token will not be considered as truncated + but as a result of *normal* end of detection (i.e. no more valid data). + In that case the trailing silence can be removed if you use the + `StreamTokenizer.DROP_TRAILING_SILENCE` mode. + + :Example: + + .. code:: python + + tokenizer = StreamTokenizer( + validator=UpperCaseChecker(), + min_length=3, + max_length=6, + max_continuous_silence=3, + mode=StreamTokenizer.DROP_TRAILING_SILENCE + ) + + dsource = StringDataSource("aaaAAAaaaBBbbbb") tokenizer.tokenize(dsource) - :output: - .. code:: python - [(['A', 'A', 'A', 'A'], 3, 6), (['B', 'B'], 7, 8)] + :output: + + .. code:: python + + [(['A', 'A', 'A', 'a', 'a', 'a'], 3, 8), (['B', 'B'], 9, 10)] + The first token is delivered with its trailing silence because it is + truncated while the second one has its trailing frames removed.
- The following tokenizer will however reject the 'BB' token: + Without `StreamTokenizer.DROP_TRAILING_SILENCE` the output would be: - .. code:: python + .. code:: python - dsource = StringDataSource("aaaAAAABBbbb") - tokenizer = StreamTokenizer(validator=UpperCaseChecker(), - min_length=3, max_length=4, - max_continuous_silence=0, - mode=StreamTokenizer.STRICT_MIN_LENGTH) - tokenizer.tokenize(dsource) + [ + (['A', 'A', 'A', 'a', 'a', 'a'], 3, 8), + (['B', 'B', 'b', 'b', 'b'], 9, 13) + ] - :output: - .. code:: python - - [(['A', 'A', 'A', 'A'], 3, 6)] - - - 3. `StreamTokenizer.DROP_TRAILING_SILENCE`: drop all tailing non-valid - frames from a token to be delivered if and only if it is not - **truncated**. This can be a bit tricky. A token is actually delivered - if: - a. `max_continuous_silence` is reached - - :or: - - - b. Its length reaches `max_length`. This is called a **truncated** - token - - In the current implementation, a `StreamTokenizer`'s decision is only - based on already seen data and on incoming data. Thus, if a token is - truncated at a non-valid but tolerated frame (`max_length` is reached - but `max_continuous_silence` not yet) any tailing silence will be kept - because it can potentially be part of valid token (if `max_length` was - bigger). But if `max_continuous_silence` is reached before - `max_length`, the delivered token will not be considered as truncated - but a result of *normal* end of detection (i.e. no more valid data). - In that case the tariling silence can be removed if you use the - `StreamTokenizer.DROP_TRAILING_SILENCE` mode. - - :Example: - - .. code:: python - - tokenizer = StreamTokenizer( - validator=UpperCaseChecker(), - min_length=3, - max_length=6, - max_continuous_silence=3, - mode=StreamTokenizer.DROP_TRAILING_SILENCE - ) - - dsource = StringDataSource("aaaAAAaaaBBbbbb") - tokenizer.tokenize(dsource) - - :output: - - .. 
code:: python - - [(['A', 'A', 'A', 'a', 'a', 'a'], 3, 8), (['B', 'B'], 9, 10)] - - The first token is delivered with its tailing silence because it is - truncated while the second one has its tailing frames removed. - - Without `StreamTokenizer.DROP_TRAILING_SILENCE` the output would be: - - .. code:: python - - [ - (['A', 'A', 'A', 'a', 'a', 'a'], 3, 8), - (['B', 'B', 'b', 'b', 'b'], 9, 13) - ] - - - 4. `(StreamTokenizer.STRICT_MIN_LENGTH | - StreamTokenizer.DROP_TRAILING_SILENCE)`: - use both options. That means: first remove tailing silence, then ckeck - if the token still has at least a length of `min_length`. + 4. `(StreamTokenizer.STRICT_MIN_LENGTH | + StreamTokenizer.DROP_TRAILING_SILENCE)`: + use both options. That means: first remove trailing silence, then check + if the token still has at least a length of `min_length`. """ SILENCE = 0
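To make the automaton described above concrete, here is a simplified pure-Python sketch of the tokenization loop (NORMAL mode only: no `init_min`/`init_max_silence`, no mode flags, and no continuation rule for tokens following a truncated one — it is an illustration, not auditok's actual implementation):

```python
def simple_tokenize(data, is_valid, min_length, max_length,
                    max_continuous_silence):
    # Collect runs of valid frames, tolerating up to max_continuous_silence
    # consecutive invalid frames inside a token; deliver (frames, start, end)
    # triples whose length lies in [min_length, max_length].
    tokens, token, start, silence = [], [], 0, 0
    for i, frame in enumerate(data):
        if is_valid(frame) or (token and silence < max_continuous_silence):
            if not token:
                start = i
            silence = 0 if is_valid(frame) else silence + 1
            token.append(frame)
            if len(token) == max_length:  # truncated token
                tokens.append((token, start, i))
                token, silence = [], 0
        elif len(token) >= min_length:  # normal end of detection
            tokens.append((token, start, start + len(token) - 1))
            token, silence = [], 0
        else:  # too short, drop it
            token, silence = [], 0
    if len(token) >= min_length:  # flush last pending token
        tokens.append((token, start, start + len(token) - 1))
    return tokens

simple_tokenize("aaaAAAaaaBBbbbb", str.isupper, 3, 6, 3)
# [(['A', 'A', 'A', 'a', 'a', 'a'], 3, 8), (['B', 'B', 'b', 'b', 'b'], 9, 13)]
```

With the parameters of the `DROP_TRAILING_SILENCE` example, this sketch reproduces the "without `DROP_TRAILING_SILENCE`" output shown above; because it lacks the continuation rule, the `"aaaAAAABBbbb"` example yields only the `'AAAA'` token, as under `STRICT_MIN_LENGTH`.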