comparison man/man1/gensai.1 @ 0:5242703e91d3 tip
Initial checkin for AIM92 aimR8.2 (last updated May 1997).
author | tomwalters |
---|---|
date | Fri, 20 May 2011 15:19:45 +0100 |
parents | |
children |
-1:000000000000 | 0:5242703e91d3 |
---|---|
1 .TH GENSAI 1 "26 May 1995" | |
2 .LP | |
3 .SH NAME | |
4 .LP | |
5 gensai \- generate stabilised auditory image | |
6 .LP | |
7 .SH SYNOPSIS/SYNTAX | |
8 .LP | |
9 gensai [ option=value | -option ] filename | |
10 .LP | |
11 .SH DESCRIPTION | |
12 .LP | |
13 | |
14 Periodic sounds give rise to static, rather than oscillating, | |
15 perceptions, indicating that temporal integration is applied to the NAP | |
16 in the production of our initial perception of a sound -- our auditory | |
17 image. Traditionally, auditory temporal integration is represented by | |
18 a simple leaky integration process and AIM provides a bank of lowpass | |
19 filters to enable the user to generate auditory spectra (Patterson, | |
20 1994a) and auditory spectrograms (Patterson et al., 1992b). However, | |
21 the leaky integrator removes the phase-locked fine structure observed | |
22 in the NAP, and this conflicts with perceptual data indicating that | |
23 the fine structure plays an important role in determining sound | |
24 quality and source identification (Patterson, 1994b; Patterson and | |
25 Akeroyd, 1995). As a result, AIM includes two modules which preserve | |
26 much of the time-interval information in the NAP during temporal | |
27 integration, and which produce a better representation of our auditory | |
28 images. In the functional version of AIM, this is accomplished with | |
29 strobed temporal integration (Patterson et al., 1992a,b), and this is | |
30 the topic of this manual entry. | |
31 | |
32 .LP | |
33 | |
34 In the physiological version of AIM, the auditory image is constructed | |
35 with a bank of autocorrelators (Slaney and Lyon, 1990; Meddis and | |
36 Hewitt, 1991). The autocorrelation module is an aimTool rather than | |
37 an integral part of the main program 'gen'. The appropriate tool is | |
38 'acgram'. Type 'manaim acgram' for the documentation. The module | |
39 extracts periodicity information and preserves intra-period fine | |
40 structure by autocorrelating each channel of the NAP separately. The | |
41 correlogram is the multi-channel version of this process. It was | |
42 originally introduced as a model of pitch perception (Licklider, | |
43 1951). It is not yet known whether STI or autocorrelation is more | |
44 realistic, or more efficient, as a means of simulating our perceived | |
45 auditory images. At present, the purpose is to provide a software | |
46 package that can be used to compare these auditory representations in | |
47 a way not previously possible. | |
48 | |
49 .RE | |
50 .LP | |
51 .SH STROBED TEMPORAL INTEGRATION | |
52 .PP | |
53 | |
54 In strobed temporal integration, a bank of delay lines is used to form | |
55 a buffer store for the NAP, one delay line per channel, and as the NAP | |
56 proceeds along the buffer it decays linearly with time, at about 2.5 | |
57 %/ms. Each channel of the buffer is assigned a strobe unit which | |
58 monitors activity in that channel looking for local maxima in the | |
59 stream of NAP pulses. When one is found, the unit initiates temporal | |
60 integration in that channel; that is, it transfers a copy of the NAP | |
61 at that instant to the corresponding channel of an image buffer and | |
62 adds it point-for-point with whatever is already there. The local | |
63 maximum itself is mapped to the 0-ms point in the image buffer. The | |
64 multi-channel version of this STI process is AIM's representation of | |
65 our auditory image of a sound. Periodic and quasi-periodic sounds | |
66 cause regular strobing which leads to simulated auditory images that | |
67 are static, or nearly static, but with the same temporal resolution as | |
68 the NAP. Dynamic sounds are represented as a sequence of auditory | |
69 image frames. If the rate of change in a sound is not too rapid, as in | |
70 diphthongs, features are seen to move smoothly as the sound proceeds, | |
71 much as objects move smoothly in animated cartoons. | |
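
The integration step described above can be sketched for a single channel as follows. This is only an illustrative sketch, not the AIM implementation: the function name sti_channel, its arguments, and the exact form of the linear decay weighting are assumptions, the strobe positions are taken as given, and the decay of the image itself is left out.

```python
import numpy as np

def sti_channel(nap, strobes, fs_hz, pwidth_ms=35.0, napdecay_pct_per_ms=2.5):
    """Build one channel of an auditory image by strobed temporal integration.

    nap     : 1-D array holding one channel of the neural activity pattern
    strobes : sample indices of the NAP peaks that trigger integration
    Returns (interval_ms, image), where interval_ms[k] is the time interval
    between a NAP sample and the strobe that follows it.
    """
    n = int(round(pwidth_ms * fs_hz / 1000.0)) + 1
    lag = np.arange(n)                       # samples before the strobe
    interval_ms = lag * 1000.0 / fs_hz
    # Linear decay along the NAP buffer, clipped at zero (assumed form).
    weight = np.clip(1.0 - napdecay_pct_per_ms / 100.0 * interval_ms, 0.0, None)
    image = np.zeros(n)
    for s in strobes:
        src = s - lag                        # NAP sample feeding each image point
        ok = src >= 0                        # skip lags before the start of the NAP
        image[ok] += nap[src[ok]] * weight[ok]   # add point-for-point
    return interval_ms, image
```

With a pulse train of 8 ms period, every strobe lays the preceding pulses down at intervals of 0, 8, 16, ... ms, so the image stays in place from strobe to strobe while the weighting reduces the contribution of older pulses.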
72 | |
73 .LP | |
74 It is important to emphasise that the triggering is done on a | |
75 channel by channel basis and that triggering is asynchronous | |
76 across channels, inasmuch as the major peaks in one channel occur | |
77 at different times from the major peaks in other channels. It | |
78 is this aspect of the triggering process that causes the | |
79 alignment of the auditory image and which accounts for the loss | |
80 of phase information in the auditory system (Patterson, 1987). | |
81 | |
82 .LP | |
83 | |
84 The auditory image has the same vertical dimension as the neural | |
85 activity pattern (filter centre frequency). The continuous time | |
86 dimension of the neural activity pattern becomes a local, | |
87 time-interval dimension in the auditory image; specifically, it is | |
88 "the time interval between a given pulse and the succeeding strobe | |
89 pulse". In order to preserve the direction of asymmetry of features | |
90 that appear in the NAP, the time-interval origin is plotted towards | |
91 the right-hand edge of the image, with increasing, positive time | |
92 intervals proceeding towards the left. | |
93 | |
94 .LP | |
95 .SH OPTIONS | |
96 .LP | |
97 .SS "Display options for the auditory image" | |
98 .PP | |
99 | |
100 The options that control the positioning of the window in which the | |
101 auditory image appears are the same as those used to set up the | |
102 earlier windows, as are the options that control the level of the | |
103 image within the display. In addition, there are three new options | |
104 that are required to present this new auditory representation. The | |
105 options are frstep_aid, pwid_aid, and nwid_aid; the suffix "_aid" | |
106 means "auditory image display". These options are described here | |
107 before the options that control the image construction process itself, | |
108 as they occur first in the options list. There are also three extra | |
109 display options for presenting the auditory image in its spiral form; | |
110 these options have the suffix "_spd" for "spiral display"; they are | |
111 described in the manual entry for 'genspl'. | |
112 | |
113 .LP | |
114 .TP 17 | |
115 frstep_aid | |
116 The frame step interval, or the update interval for the auditory image display | |
117 .RS | |
118 Default units: ms. Default value: 16 ms. | |
119 .RE | |
120 .RS | |
121 | |
122 Conceptually, the auditory image exists continuously in time. The | |
123 simulation of the image produced by AIM is not continuous; rather it | |
124 is like an animated cartoon. Frames of the cartoon are calculated at | |
125 discrete points in time, and then the sequence of frames is replayed | |
126 to reveal the dynamics of the sound, or the lack of dynamics in the | |
127 case of periodic sounds. When the sound is changing at a rate where | |
128 we hear smooth glides, the structures in the simulated auditory image | |
129 move much like objects in a cartoon. frstep_aid determines the time | |
130 interval between frames of the auditory image cartoon. Frames are | |
131 calculated at time zero and integer multiples of frstep_aid. | |
132 | |
133 .RE | |
134 | |
135 The default value (16 ms) is reasonable for musical sounds and speech | |
136 sounds. For a detailed examination of the development of the image of | |
137 brief transient sounds frstep_aid should be decreased to 4 or even 2 | |
138 ms. | |
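
As a small illustration of the frame step, the sketch below lists the times at which frames would be calculated, assuming frames fall at time zero and at integer multiples of the frame step up to the end of the sound. The function name and the length_ms argument are hypothetical and are not part of AIM.

```python
def frame_times_ms(length_ms, frstep_ms=16.0):
    """Times (ms) of the image frames for a sound of the given length."""
    n_steps = int(length_ms // frstep_ms)    # whole frame steps that fit
    return [k * frstep_ms for k in range(n_steps + 1)]
```

Under these assumptions, one second of sound at the default step gives 63 frames; reducing the step to 4 ms gives 251.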
139 .LP | |
140 .TP 16 | |
141 pwidth_sai | |
142 | |
143 The maximum positive time interval presented in the display of the | |
144 auditory image (to the left of 0 ms). | |
145 | |
146 .RS | |
147 Default units: ms. Default value: 35 ms. | |
148 .RE | |
149 .LP | |
150 .TP 16 | |
151 nwidth_sai | |
152 | |
153 The maximum negative time interval presented in the display of the | |
154 auditory image (to the right of 0 ms). | |
155 | |
156 .RS | |
157 Default units: ms. Default value: -5 ms. | |
158 .RE | |
159 | |
160 .LP | |
161 .TP 12 | |
162 animate | |
163 Present the frames of the simulated auditory image as a cartoon. | |
164 .RS | |
165 Switch. Default off. | |
166 .RE | |
167 .RS | |
168 | |
169 With reasonable resolution and a reasonable frame rate, the auditory | |
170 cartoon for a second of sound will require on the order of 1 Mbyte of | |
171 storage. As a result, auditory cartoons are only stored at the | |
172 specific request of the user. When the animate flag is set to `on', | |
173 the bit maps that constitute the frames of the auditory cartoon are | |
174 stored in computer memory. They can then be replayed as an auditory | |
175 cartoon by pressing `carriage return'. To exit the instruction, type | |
176 "q" for `quit' or "control c". The bit maps are discarded unless | |
177 option bitmap=on. | |
178 | |
179 .RE | |
180 .LP | |
181 .SS "Storage options for the auditory image " | |
182 .PP | |
183 | |
184 A record of the auditory image can be stored in two ways depending on | |
185 the purpose for which it is stored. The actual numerical values of | |
186 the auditory image can be stored as previously, by setting output=on. | |
187 In this case, a file with a .sai suffix will be created in accordance | |
188 with the conventions of the software. These values can be recalled | |
189 for further processing with the aimTools. In this regard the SAI | |
190 module is like any previous module. | |
191 | |
192 .LP | |
193 It is also possible to store the bit maps which are displayed on | |
194 the screen for the auditory image cartoon. The bit maps require | |
195 less storage space and reload more quickly, so this is the | |
196 preferred mode of storage when one simply wants to review the | |
197 visual image. | |
198 .LP | |
199 .TP 10 | |
200 bitmap | |
201 Produce a bit-map storage file | |
202 .RS | |
203 Switch. Default value: off. | |
204 .RE | |
205 .RS | |
206 | |
207 When the bitmap option is set to `on', the bit maps are stored in a | |
208 file with the suffix .ctn. The bitmaps are reloaded into memory using | |
209 the commands review, or xreview, followed by the file name without the | |
210 suffix .ctn. The auditory image can then be replayed, as with animate, | |
211 by typing `carriage return'. xreview is the newer and preferred | |
212 display routine. It enables the user to select subsets of the cartoon | |
213 and to change the rate of play via a convenient control window. | |
214 | |
215 | |
216 | |
217 .LP | |
218 The strobe mechanism is relatively simple. A trigger threshold | |
219 value is maintained for each channel and when a NAP pulse exceeds | |
220 the threshold a trigger pulse is generated at the time associated | |
221 with the maximum of the peak. The threshold value is then reset | |
222 to a value somewhat above the height of the current NAP peak and | |
223 the threshold value decays exponentially with time thereafter. | |
224 | |
225 | |
226 | |
227 There are six options with the suffix "_ai", short for | |
228 'auditory image'. Four of these control STI itself -- stdecay_ai, | |
229 stcrit_ai, stthresh_ai and decay_ai. The option stinfo_ai is a switch | |
230 that causes the software to produce information about the current STI | |
231 analysis for demonstration or diagnostic purposes. The final option, | |
232 napdecay_ai, controls the decay rate for the NAP while it flows down | |
233 the NAP buffer. | |
234 | |
235 .LP | |
236 .TP 17 | |
237 napdecay_ai | |
238 Decay rate for the neural activity pattern (NAP) | |
239 .RS | |
240 Default units: %/ms. Default value: 2.5 %/ms. | |
241 .RE | |
242 .RS | |
243 | |
244 napdecay_ai determines the rate at which the information in the neural | |
245 activity pattern decays as it proceeds along the auditory buffer that | |
246 stores the NAP prior to temporal integration. | |
247 .RE | |
248 | |
249 | |
250 .LP | |
251 .TP 16 | |
252 stdecay_ai | |
253 Strobe threshold decay rate | |
254 .RS | |
255 Default units: %/ms. Default value: 5 %/ms. | |
256 .RE | |
257 .RS | |
258 stdecay_ai determines the rate at which the strobe threshold decays. | |
259 .RE | |
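
The strobe mechanism described above (a threshold that is reset just above each strobed peak and then decays) can be sketched for one channel as follows. This is a hedged illustration rather than the AIM code: the name find_strobes, the reset_factor value of 1.05, and the reading of stdecay_ai as a multiplicative loss per millisecond are all assumptions.

```python
def find_strobes(nap, fs_hz, stdecay_pct_per_ms=5.0, reset_factor=1.05):
    """Return the sample indices of the strobe points in one NAP channel."""
    # Per-sample multiplier equivalent to losing stdecay % of the threshold per ms.
    keep = (1.0 - stdecay_pct_per_ms / 100.0) ** (1000.0 / fs_hz)
    threshold = 0.0
    strobes = []
    for i in range(1, len(nap) - 1):
        is_peak = nap[i - 1] < nap[i] >= nap[i + 1]   # local maximum
        if is_peak and nap[i] > threshold:
            strobes.append(i)                         # strobe at the peak maximum
            threshold = reset_factor * nap[i]         # reset just above the peak
        else:
            threshold *= keep                         # decay until the next strobe
    return strobes
```

For a NAP with one large peak per 8 ms period and a smaller peak half way through each period, the raised threshold masks the smaller peaks and only the large peaks strobe.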
260 .LP | |
261 General purpose pitch mechanisms based on peak picking are | |
262 notoriously difficult to design, and the trigger mechanism just | |
263 described would not work well on an arbitrary acoustic waveform. | |
264 The reason that this simple trigger mechanism is sufficient for | |
265 the construction of the auditory image is that NAP functions are | |
266 highly constrained. The microstructure reveals a function that | |
267 rises from zero to a local maximum smoothly and returns smoothly | |
268 back to zero where it stays for more than half of a period of the | |
269 centre frequency of that channel. On the longer time scale, the | |
270 amplitude of successive peaks changes only relatively slowly with | |
271 respect to time. As a result, for periodic sounds there tends | |
272 to be one clear maximum per period in all but the lowest channels | |
273 where there is an integer number of maxima per period. The | |
274 simplicity of the NAP functions follows from the fact that the | |
275 acoustic waveform has passed through a narrow band filter and so | |
276 it has a limited number of degrees of freedom. In all but the | |
277 highest frequency channels, the output of the auditory filter | |
278 resembles a modulated sine wave whose frequency is near the | |
279 centre frequency of the filter. Thus the neural activity pattern | |
280 is largely restricted to a set of peaks which are modified | |
281 versions of the positive halves of a sine wave, and the remaining | |
282 degrees of freedom appear as relatively slow changes in peak | |
283 amplitude and relatively small changes in peak time (or phase). | |
284 .LP | |
285 .LP | |
286 When the acoustic input terminates, the auditory image must | |
287 decay. In the ASP model the form of the decay is exponential and | |
288 the decay rate is determined by decay_ai. | |
289 .LP | |
290 .TP 18 | |
291 decay_ai | |
292 SAI decay time constant | |
293 .RS | |
294 Default units: ms. Default value: 30 ms. | |
295 .RE | |
296 .RS | |
297 decay_ai determines the rate at which the auditory image decays. | |
298 .RE | |
299 .RS | |
300 | |
301 In addition, decay_ai determines the rate at which the strength of the | |
302 auditory image increases and the level to which it asymptotes if the | |
303 sound continues indefinitely. In an exponential process, the asymptote | |
304 is reached when the increment provided by each new cycle of the sound | |
305 equals the amount that the image decays over the same period. | |
306 | |
307 .RE | |
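
The build-up and asymptote described for decay_ai can be illustrated numerically. The sketch below assumes that decay_ai acts as an exponential time constant (the image retains exp(-t/decay) of its level after t ms) and that each cycle of the sound adds a fixed increment; both function names are hypothetical.

```python
import math

def image_asymptote(increment, period_ms, decay_ms=30.0):
    """Level at which the decay over one period equals the increment per cycle."""
    retain = math.exp(-period_ms / decay_ms)   # fraction surviving one period
    return increment / (1.0 - retain)

def build_up(increment, period_ms, n_cycles, decay_ms=30.0):
    """Image level just after each of the first n_cycles cycles."""
    retain = math.exp(-period_ms / decay_ms)
    level = 0.0
    history = []
    for _ in range(n_cycles):
        level = level * retain + increment     # decay, then add the new cycle
        history.append(level)
    return history
```

Under these assumptions, a sound with an 8 ms period and the default decay of 30 ms settles at a little over four times the contribution of a single cycle.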
308 .SH MOTIVATION | |
309 .LP | |
310 .SS "Auditory temporal integration: The problem " | |
311 .PP | |
312 Image stabilisation and temporal smearing. | |
313 .LP | |
314 When the input to the auditory system is a periodic sound like | |
315 pt_8ms or ae_8ms, the output of the cochlea is a rapidly flowing | |
316 neural activity pattern on which the information concerning the | |
317 source repeats every 8 ms. Consider the display problem that | |
318 would arise if one attempted to present a one second sample of | |
319 either pt_8ms or ae_8ms with the resolution and format of Figure | |
320 5.2. In that figure each 8 ms period of the sound occupies about | |
321 4 cm of width. There are 125 repetitions of the period in one | |
322 second and so a paper version of the complete NAP would be 5 | |
323 metres in length. If the NAP were presented as a real-time flow | |
324 process, the paper would have to move past a typical window at | |
325 the rate of 5 metres per second! At this rate, the temporal | |
326 detail within the cycle would be lost. The image would be stable | |
327 but the information would be reduced to horizontal banding. The | |
328 fine-grain temporal information is lost because the integration | |
329 time of the visual system is long with respect to the rate of | |
330 flow of information when the record is moving at 5 metres a | |
331 second. | |
332 .LP | |
333 Traditional models of auditory temporal integration are similar | |
334 to visual models. They assume that we hear a stable auditory | |
335 image in response to a periodic sound because the neural activity | |
336 is passed through a temporal weighting function that integrates | |
337 over time. The output does not fluctuate if the integration time | |
338 is long enough. Unfortunately, the simple model of temporal | |
339 integration does not work for the auditory system. If the output | |
340 is to be stable, the integrator must integrate over 10 or more | |
341 cycles of the sound. We hear stable images for pitches as low | |
342 as, say, 50 cycles per second, which suggests that the integration | |
343 time of the auditory system would have to be 200 ms at the | |
344 minimum. Such an integrator would cause far more smearing of | |
345 auditory information than we know occurs. For example, phase | |
346 shifts that produce small changes half way through the period of | |
347 a pulse train are often audible (see Patterson, 1987, for a | |
348 review). Small changes of this sort would be obscured by lengthy | |
349 temporal integration. | |
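
The two back-of-envelope figures used in this argument, the 5 metre paper record and the 200 ms integration time, can be checked with a few lines. The function names are hypothetical and the inputs are the values quoted in the text.

```python
def nap_record_length_m(duration_s, period_ms, cm_per_period):
    """Length of a paper record of the NAP at the given plotting scale."""
    n_periods = duration_s * 1000.0 / period_ms
    return n_periods * cm_per_period / 100.0

def min_integration_time_ms(lowest_pitch_hz, n_cycles=10):
    """Integration time needed to span n_cycles of the lowest pitch."""
    return n_cycles * 1000.0 / lowest_pitch_hz
```

One second of an 8 ms periodic sound at 4 cm per period gives 125 periods and a 5 metre record; ten cycles at 50 cycles per second gives 200 ms.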
350 .LP | |
351 Thus the problem in modelling auditory temporal integration is | |
352 to determine how the auditory system can integrate information | |
353 to form a stable auditory image without losing the fine-grain | |
354 temporal information within the individual cycles of periodic | |
355 sounds. In visual terms, the problem is how to present a neural | |
356 activity pattern at a rate of 5 metres per second while at the | |
357 same time enabling the viewer to see features within periods | |
358 greater than about 4 ms. | |
359 .LP | |
360 .SS "Periodic sounds and information packets. " | |
361 .PP | |
362 Now consider temporal integration from an information processing | |
363 perspective, and in particular, the problem of preserving formant | |
364 information in the auditory image. The shape of the neural | |
365 activity pattern within the period of a vowel sound provides | |
366 information about the resonances of the vocal tract (see Figure | |
367 3.6), and thus the identity of the vowel. The information about | |
368 the source arrives in packets whose duration is the period of the | |
369 source. Many of the sounds in speech and music have the property | |
370 that the source information changes relatively slowly when | |
371 compared with the repetition rate of the source wave (i.e. the | |
372 pitch). Thus, from an information processing point of view, one | |
373 would like to combine source information from neighbouring | |
374 packets, while at the same time taking care not to smear the | |
375 source information contained within the individual packets. In | |
376 short, one would like to perform quantised temporal integration, | |
377 integrating over cycles but not within cycles of the sound. | |
378 .LP | |
379 .SH EXAMPLES | |
380 .LP | |
381 This first pair of examples is intended to illustrate the | |
382 dominant forms of motion that appear in the auditory image, and | |
383 the fact that shapes can be tracked across the image provided the | |
384 rate of change is not excessive. The first example is a pitch | |
385 glide for a note with fixed timbre. The second example involves | |
386 formant motion (a form of timbre glide) in a monotone voice (i.e. | |
387 for a relatively fixed pitch). | |
388 .LP | |
389 .SS "A pitch glide in the auditory image " | |
390 .PP | |
391 Up to this point, we have focussed on the way that TQTI can | |
392 convert a fast flowing NAP pattern into a stabilised auditory | |
393 image. The mechanism is not, however, limited to continuous or | |
394 stationary sounds. The data file cegc contains pulse trains that | |
395 produce pitches near the musical notes C3, E3, G3, and C4, along | |
396 with glides from one note to the next. The notes are relatively | |
397 long and the pitch glides are relatively slow. As a result, each | |
398 note forms a stabilised auditory image and there is smooth motion | |
399 from one note image to the next. The stimulus file cegc is | |
400 intended to support several examples including ones involving the | |
401 spiral representation of the auditory image and its relationship | |
402 to musical consonance in the next chapter. For brevity, the | |
403 current example is limited to the transition from C to E near the | |
404 start of the file. The pitch of musical notes is determined by | |
405 the lower harmonics when they are present and so the command for | |
406 the example is: | |
407 .LP | |
408 gensai mag=16 min=100 max=2000 start=100 length=600 | |
409 duration_sai=32 cegc | |
410 .LP | |
411 In point of fact, the pulse train associated with the first note | |
412 has a period of 8 ms like pt_8ms and so this "C" is actually a | |
413 little below the musical note C3. Since the initial C is the | |
414 same as pt_8ms, the onset of the first note is the same as in the | |
415 previous example; however, four cycles of the pulse train pattern | |
416 build up in the window because it has been set to show 32 ms of | |
417 'auditory image time'. During the transition, the period of the | |
418 stimulus decreases from 32/4 ms down to 32/5 ms, and so the image | |
419 stabilises with five cycles in the window. The period of E is | |
420 4/5 that of C. | |
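
The arithmetic behind the four-cycle and five-cycle images can be sketched as below. The helper names are hypothetical, and the equal-tempered value for C3 assumes A4 = 440 Hz, which the text does not state.

```python
def cycles_in_window(window_ms, period_ms):
    """Number of stimulus cycles that fit in the auditory image window."""
    return window_ms / period_ms

def pitch_hz(period_ms):
    """Repetition rate of a pulse train with the given period."""
    return 1000.0 / period_ms

# Equal-tempered C3 lies 21 semitones below A4 (assumed to be 440 Hz).
C3_HZ = 440.0 * 2.0 ** (-21.0 / 12.0)
```

With a 32 ms window, the 8 ms pulse train (125 Hz, a little below C3 at about 130.8 Hz) shows four cycles, and the 6.4 ms pulse train (156.25 Hz) shows five.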
421 .LP | |
422 During the transition, in the lower channels associated with the | |
423 first and second harmonic, the individual SAI pulses march from | |
424 left to right in time and, at the same time, they move up in | |
425 frequency as the energy of these harmonics moves out of lower | |
426 filters and into higher filters. In these low channels the | |
427 motion is relatively smooth because the SAI pulse has a duration | |
428 which is a significant proportion of the period of the sound. As | |
429 the pitch rises and the periods get shorter, each new NAP cycle | |
430 contributes a NAP pulse which is shifted a little to the right | |
431 relative to the corresponding SAI pulse. This increases the | |
432 leading edge of the SAI pulse without contributing to the lagging | |
433 edge. As a result, the leading edge builds, the lagging edge | |
434 decays, and the SAI pulse moves to the right. The SAI pulses are | |
435 asymmetric during the motion, with the trailing edge more shallow | |
436 than the leading edge, and the effect is greater towards the left | |
437 edge of the image because the discrepancies over four cycles are | |
438 larger than the discrepancies over one cycle. The effects are | |
439 larger for the second harmonic than for the first harmonic | |
440 because the width of the pulses of the second harmonic is a | |
441 smaller proportion of the period. During the pitch glide the SAI | |
442 pulses have a reduced peak height because the activity is | |
443 distributed over more channels and over longer durations. | |
444 .LP | |
445 The SAI pulses associated with the higher harmonics are | |
446 relatively narrow with regard to the changes in period during the | |
447 pitch glide. As a result there is more blurring of the image | |
448 during the glide in the higher channels. Towards the right-hand | |
449 edge, for the column that shows correlations over one cycle, the | |
450 blurring is minimal. Towards the left-hand edge the details of | |
451 the pattern are blurred and we see mainly activity moving in | |
452 vertical bands from left to right. When the glide terminates the | |
453 fine structure reforms from right to left across the image and | |
454 the stationary image for the note E appears. | |
455 .LP | |
456 The details of the motion are more readily observed when the | |
457 image is played in slow motion. If the disc space is available | |
458 (about 1.3 Mbytes), it is useful to generate a cegc.img file | |
459 using the image option. The auditory image can then be played | |
460 in slow motion using the review command and the slow down option | |
461 "-". | |
462 .LP | |
463 .LP | |
464 .SS "Formant motion in the auditory image " | |
465 .PP | |
466 The vowels of speech are quasi-periodic sounds and the period for | |
467 the average male speaker is on the order of 8 ms. As the | |
468 articulators change the shape of the vocal tract during speech, | |
469 formants appear in the auditory image and move about. The | |
470 position and motion of the formants represent the speech | |
471 information conveyed by the voiced parts of speech. When the | |
472 speaker uses a monotone voice, the pitch remains relatively | |
473 steady and the motion of the formants is essentially in the | |
474 vertical dimension. An example of monotone voiced speech is | |
475 provided in the file leo which is the acoustic waveform of the | |
476 word 'leo'. The auditory image of leo can be produced using the | |
477 command | |
478 .LP | |
479 gensai mag=12 segment=40 duration_sai=20 leo | |
480 .LP | |
481 The dominant impression on first observing the auditory image of | |
482 leo is the motion in the formation of the "e" sound, the | |
483 transition from "e" to "o", and the formation of the "o" sound. | |
484 .LP | |
485 The vocal cords come on at the start of the "l" sound but the | |
486 tip of the tongue is pressed against the roof of the mouth just | |
487 behind the teeth and so it restricts the air flow and the start | |
488 of the "l" does not contain much energy. As a result, in the | |
489 auditory image, the presence of the "l" is primarily observed in | |
490 the transition from the "l" to the "e". That is, as the three | |
491 formants in the auditory image of the "e" come on and grow | |
492 stronger, the second formant glides into its "e" position from | |
493 below, indicating that the second formant was recently at a lower | |
494 frequency for the previous sound. | |
495 .LP | |
496 In the "e", the first formant is low, centred on the third | |
497 harmonic at the bottom of the auditory image. The second formant | |
498 is high, up near the third formant. The lower portion of the | |
499 fourth formant shows along the upper edge of the image. | |
500 Recognition systems that ignore temporal fine structure often | |
501 have difficulty determining whether a high frequency | |
502 concentration of energy is a single broad formant or a pair of | |
503 narrower formants close together. This makes it more difficult | |
504 to distinguish "e". In the auditory image, information about the | |
505 pulsing of the vocal cords is maintained and the temporal | |
506 fluctuation of the formant shapes makes it easier to distinguish | |
507 that there are two overlapping formants rather than a single | |
508 large formant. | |
509 .LP | |
510 As the "e" changes into the "o", the second formant moves back | |
511 down onto the eighth harmonic and the first formant moves up to | |
512 a position between the third and fourth harmonics. The third and | |
513 fourth formants remain relatively fixed in frequency but they | |
514 become softer as the "o" takes over. During the transition, the | |
515 second formant becomes fuzzy and moves down a set of vertical | |
516 ridges at multiples of the period. | |
517 .LP | |
518 .LP | |
519 .SS "The vowel triangle: aiua " | |
520 .PP | |
521 In speech research, the vowels are specified by the centre | |
522 frequencies of their formants. The first two formants carry the | |
523 most information and it is common to see sets of vowels | |
524 represented on a graph whose axes are the centre frequencies of | |
525 the first and second formant. Not all combinations of these | |
526 formant frequencies occur in speech; rather, the vowels occupy a | |
527 triangular region within this vowel space and the points of the | |
528 triangle are represented by /a/ as in paw, /i/ as in beet, and /u/ as | |
529 in toot. The file aiua contains a synthetic speech wave that | |
530 provides a tour around the vowel triangle from /a/ to /i/ to /u/ | |
531 and back to /a/, and there are smooth transitions from one vowel | |
532 to the next. The auditory image of aiua can be generated using | |
533 the command | |
534 .LP | |
535 gensai mag=12 segment=40 duration=20 aiua | |
536 .LP | |
537 The initial vowel /a/ has a high first formant centred on the | |
538 fifth harmonic and a low second formant centred between the | |
539 seventh and eighth harmonics (for these low formants the harmonic | |
540 number can be determined by counting the number of SAI peaks in | |
541 one period of the image). The third formant is at the top of the | |
542 image and it is reasonably strong, although relatively short in | |
543 duration. As the sound changes from /a/ to /i/, the first formant | |
544 moves successively down through the low harmonics and comes to | |
545 rest on the second harmonic. At the same time the second formant | |
546 moves all the way up to a position adjacent to the third formant, | |
547 similar to the "e" in leo. All three of the formants are | |
548 relatively strong. During the transition from the /i/ to the /u/, | |
549 the third formant becomes much weaker. The second formant | |
550 moves down onto the seventh harmonic and it remains relatively | |
551 weak. The first formant remains centred on the second harmonic | |
552 and it is relatively strong. Finally, the formants return to | |
553 their /a/ positions. | |
554 .LP | |
555 .LP | |
556 .SS "Speaker separation in the auditory image " | |
557 .PP | |
558 One of the more intriguing aspects of speech recognition is our | |
559 ability to hear out one voice in the presence of competing voices | |
560 -- the proverbial cocktail party phenomenon. It is assumed that | |
561 we use pitch differences to help separate the voices. In support | |
562 of this view, several researchers have presented listeners with | |
563 pairs of vowels and shown that they can discriminate the vowels | |
564 better when they have different pitches (Summerfield & Assman, | |
565 1989). The final example involves a double vowel stimulus, /a/ | |
566 with /i/, and it shows that stable images of the dominant | |
567 formants of both vowels appear in the image. The file dblvow | |
568 (double vowel) contains seven double-vowel pulses. The amplitude | |
569 of the /a/ is fixed at a moderate level; the amplitude of the /i/ | |
570 begins at a level 12 dB greater than that of the /a/ and it | |
571 decreases 4 dB with each successive pulse, and so they are equal | |
572 in level in the fourth pulse. Each pulse is 200 ms in duration | |
573 with 20 ms rise and fall times that are included within the 200 | |
574 ms. There are 80 ms silent gaps between pulses and a gap of 80 | |
575 ms at the start of the file. The auditory image can be generated | |
576 with the command | |
577 .LP | |
578 gensai mag=12 samplerate=10000 segment=40 duration=20 dblvow | |
579 .LP | |
580 The pitches of the /a/ and the /i/ are 100 and 125 Hz, respectively. | |
581 The image reveals a strong first formant centred on the second | |
582 harmonic of 125 Hz (8 ms), and strong third and fourth formants | |
583 with a period of 8 ms (125 Hz). These are the formants of the /i/, | |
584 which is the stronger of the two vowels at this point. In | |
585 between the first and second formants of the /i/ are the first | |
586 and second formants of the /a/ at a somewhat lower level. The | |
587 formants of the /a/ show their proper period, 10 ms. The | |
588 triggering mechanism can stabilise the formants of both vowels | |
589 at their proper periods because the triggering is done on a | |
590 channel by channel basis. The upper formants of the /a/ fall in | |
591 the same channels as the upper formants of the /i/ and since they | |
592 are much weaker, they are repressed by the /i/ formants. | |
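
The layout of the dblvow stimulus described above can be tabulated with a short sketch. The function name and the tuple layout are hypothetical; the numbers are those given in the text (seven pulses of 200 ms, 80 ms gaps, and the /i/ starting 12 dB above the /a/ and dropping 4 dB per pulse).

```python
def dblvow_schedule(n_pulses=7, start_rel_db=12.0, step_db=4.0,
                    pulse_ms=200.0, gap_ms=80.0):
    """Return (pulse number, onset in ms, level of /i/ relative to /a/ in dB)."""
    rows = []
    onset = gap_ms                            # the file begins with a silent gap
    for k in range(n_pulses):
        rows.append((k + 1, onset, start_rel_db - k * step_db))
        onset += pulse_ms + gap_ms            # next pulse follows the next gap
    return rows
```

Under these figures the two vowels are equal in level in the fourth pulse, which starts 920 ms into the file, and the /i/ is 12 dB below the /a/ in the seventh.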
593 .LP | |
594 As the example proceeds, the formants of the /i/ become | |
595 progressively weaker. In the image of the fifth burst of the | |
596 double vowel we see evidence of both the upper formants of the /i/ | |
597 and the upper formants of the /a/ in the same channel. | |
598 Finally, in the last burst the first formant of the /i/ has | |
599 disappeared from the lowest channels entirely. There is still | |
600 some evidence of /i/ in the region of the upper formants but it | |
601 is the formants of the /a/ that now dominate in the high frequency | |
602 region. | |
603 .LP | |
604 .SH SEE ALSO | |
605 .LP | |
606 .SH COPYRIGHT | |
607 .LP | |
608 Copyright (c) Applied Psychology Unit, Medical Research Council, 1995 | |
609 .LP | |
610 Permission to use, copy, modify, and distribute this software without fee | |
611 is hereby granted for research purposes, provided that this copyright | |
612 notice appears in all copies and in all supporting documentation, and that | |
613 the software is not redistributed for any fee (except for a nominal | |
614 shipping charge). Anyone wanting to incorporate all or part of this | |
615 software in a commercial product must obtain a license from the Medical | |
616 Research Council. | |
617 .LP | |
618 The MRC makes no representations about the suitability of this | |
619 software for any purpose. It is provided "as is" without express or | |
620 implied warranty. | |
621 .LP | |
622 THE MRC DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING | |
623 ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS, IN NO EVENT SHALL | |
624 THE A.P.U. BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES | |
625 OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, | |
626 WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, | |
627 ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS | |
628 SOFTWARE. | |
629 .LP | |
630 .SH ACKNOWLEDGEMENTS | |
631 .LP | |
632 The AIM software was developed for Unix workstations by John | |
633 Holdsworth and Mike Allerhand of the MRC APU, under the direction of | |
634 Roy Patterson. The physiological version of AIM was developed by | |
635 Christian Giguere. The options handler is by Paul Manson. The revised | |
636 SAI module is by Jay Datta. Michael Akeroyd extended the postscript | |
637 facilities and developed the xreview routine for auditory image | |
638 cartoons. | |
639 .LP | |
640 The project was supported by the MRC and grants from the U.K. Defense | |
641 Research Agency, Farnborough (Research Contract 2239); the EEC Esprit | |
642 BR Programme, Project ACTS (3207); and the U.K. Hearing Research Trust. | |
643 |