
Initial checkin for AIM92 aimR8.2 (last updated May 1997).
author tomwalters
date Fri, 20 May 2011 15:19:45 +0100
1 Revised for JASA, 3 April 95
2
3
4 Time-domain modelling of peripheral auditory processing:
5 A modular architecture and a software platform*
6
7 Roy D. Patterson and Mike H. Allerhand
8 MRC Applied Psychology Unit, 15 Chaucer Road, Cambridge CB2 2EF, UK
9
10 Christian Giguère
11 Laboratory of Experimental Audiology, University Hospital Utrecht, 3508 GA Utrecht, The Netherlands
12
13 (Received December, 1994) (Revised 31 March 1995)
14
15 A software package with a modular architecture has been developed to
16 support perceptual modelling of the fine-grain spectro-temporal
17 information observed in the auditory nerve. The package contains both
18 functional and physiological modules to simulate auditory spectral
19 analysis, neural encoding and temporal integration, including new
20 forms of periodicity-sensitive temporal integration that generate
21 stabilized auditory images. Combinations of the modules enable the
22 user to approximate a wide variety of existing, time-domain, auditory
23 models. Sequences of auditory images can be replayed to produce
24 cartoons of auditory perceptions that illustrate the dynamic response
25 of the auditory system to everyday sounds.
26
27 PACS numbers: 43.64.Bt, 43.66.Ba, 43.71.An
28
29 Running head: Auditory Image Model Software
30
31
32 INTRODUCTION
33
34 Several years ago, we developed a functional model of the cochlea to
35 simulate the phase-locked activity that complex sounds produce in the
36 auditory nerve. The purpose was to investigate the role of the
37 fine-grain timing information in auditory perception generally
38 (Patterson et al., 1992a; Patterson and Akeroyd, 1995), and in speech
39 perception in particular (Patterson, Holdsworth and Allerhand, 1992b).
40 The architecture of the resulting Auditory Image Model (AIM) is shown
41 in the left-hand column of Fig. 1. The responses of the three modules
42 to the vowel in 'hat' are shown in the three panels of Fig. 2.
43 Briefly, the spectral analysis stage converts the sound wave into the
44 model's representation of basilar membrane motion (BMM). For the vowel
45 in 'hat', each glottal cycle generates a version of the basic vowel
46 structure in the BMM (top panel). The neural encoding stage
47 stabilizes the BMM in level and sharpens features like vowel formants,
48 to produce a simulation of the neural activity pattern (NAP) produced
49 by the sound in the auditory nerve (middle panel). The temporal
50 integration stage stabilizes the repeating structure in the NAP and
51 produces a simulation of our perception of the vowel (bottom panel),
52 referred to as the auditory image. Sequences of simulated images can
53 be generated at regular intervals and replayed as an animated cartoon
54 to show the dynamic behaviour of the auditory images produced by
55 everyday sounds.
56
57 An earlier version of the AIM software was made available to
58 collaborators via the Internet. From there it spread to the speech and
59 music communities, indicating a more general interest in auditory
60 models than we had originally anticipated. This has prompted us to
61 prepare documentation and a formal release of the software (AIM R7).
62
63 A number of users wanted to compare the outputs from the functional
64 model, which is almost level independent, with those from
65 physiological models of the cochlea, which are fundamentally level
66 dependent. Others wanted to compare the auditory images produced by
67 strobed temporal integration with correlograms. As a result, we have
68 installed alternative modules for each of the three main stages as
69 shown in the right-hand column of Fig. 1. The alternative spectral
70 analysis module is a non-linear, transmission-line filterbank based on
71 Giguère and Woodland (1994a). The neural encoding module is based on
72 the inner haircell model of Meddis (1988). The temporal integration
73 module generates correlograms like those of Slaney and Lyon (1990) or
74 Meddis and Hewitt (1991), using the algorithm proposed by Allerhand
75 and Patterson (1992). The responses of the three modules to the vowel
76 in 'hat' are shown in Fig. 3 for the case where the level of the vowel
77 is 60 dB SPL. The patterns are broadly similar to those of the
78 functional modules but the details differ, particularly at the output
79 of the third stage. The differences grow more pronounced when the
80 level of the vowel is reduced to 30 dB SPL or increased to 90 dB SPL.
81 Figures 2 and 3 together illustrate how the software can be used to
82 compare and contrast different auditory models. The new modules also
83 open the way to time-domain simulation of hearing impairment and
84 distortion products of cochlear origin.
85
86 Switches were installed to enable the user to shift from the
87 functional to the physiological version of AIM at the output of each
88 stage of the model. This architecture enables the system to implement
89 other popular auditory models such as the gammatone-filterbank,
90 Meddis-haircell, correlogram models proposed by Assmann and
91 Summerfield (1990), Meddis and Hewitt (1991), and Brown and Cooke
92 (1994). The remainder of this letter describes the integrated software
93 package with emphasis on the functional and physiological routes, and
94 on practical aspects of obtaining the software package.*
95
96
97
98 I. THE AUDITORY IMAGE MODEL
99
100 A. The spectral analysis stage
101
102 Spectral analysis is performed by a bank of auditory filters which
103 converts a digitized wave into an array of filtered waves like those
104 shown in the top panels of Figs 2 and 3. The set of waves is AIM's
105 representation of basilar membrane motion. The software distributes
106 the filters linearly along a frequency scale measured in Equivalent
107 Rectangular Bandwidths (ERB's). The ERB scale was proposed by Glasberg
108 and Moore (1990) based on physiological research summarized in
109 Greenwood (1990) and psychoacoustic research summarized in Patterson
110 and Moore (1986). The constants of the ERB function can also be set to
111 produce a reasonable approximation to the Bark scale. Options enable
112 the user to specify the number of channels in the filterbank and the
113 minimum and maximum filter center frequencies.
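
To make the spacing concrete, the following C fragment places filter
center frequencies linearly on the ERB-rate scale of Glasberg and Moore
(1990). It is an illustration written for this description, not code
taken from the AIM package, and the function names are ours.

    #include <math.h>
    #include <stdio.h>

    /* Number of ERBs below frequency f (Hz), after Glasberg and Moore (1990). */
    static double erb_rate(double f)
    {
        return 21.4 * log10(4.37 * f / 1000.0 + 1.0);
    }

    /* Inverse of erb_rate: frequency (Hz) at a given ERB-rate value. */
    static double erb_rate_to_hz(double e)
    {
        return (pow(10.0, e / 21.4) - 1.0) * 1000.0 / 4.37;
    }

    int main(void)
    {
        const int    nchan = 75;      /* number of filterbank channels */
        const double fmin  = 100.0;   /* minimum center frequency (Hz) */
        const double fmax  = 6000.0;  /* maximum center frequency (Hz) */
        double emin = erb_rate(fmin); /* about 3.3 ERBs                */
        double emax = erb_rate(fmax); /* about 30.6 ERBs               */
        int i;

        /* Center frequencies spaced linearly on the ERB-rate scale. */
        for (i = 0; i < nchan; i++)
            printf("channel %2d: %8.1f Hz\n", i,
                   erb_rate_to_hz(emin + (emax - emin) * i / (nchan - 1)));
        return 0;
    }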
114
115 AIM provides both a functional auditory filter and a physiological
116 auditory filter for generating the BMM: the former is a linear,
117 gammatone filter (Patterson et al., 1992a); the latter is a
118 non-linear, transmission-line filter (Giguère and Woodland, 1994a).
119 The impulse response of the gammatone filter provides an excellent fit
120 to the impulse response of primary auditory neurons in cats, and its
121 amplitude characteristic is very similar to that of the 'roex' filter
122 commonly used to represent the human auditory filter. The motivation
123 for the gammatone filterbank and the available implementations are
124 summarized in Patterson (1994a). The input wave is passed through an
125 optional middle-ear filter adapted from Lutman and Martin (1979).
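
For reference, the gammatone impulse response has the form sketched below,
with the order of 4 and bandwidth factor of 1.019 that are commonly quoted
for the human auditory filter. The fragment illustrates the filter shape
only; it is not the filterbank implementation distributed with the package.

    #include <math.h>

    #define PI 3.14159265358979323846

    /* ERB of a filter centered at fc (Hz), Glasberg and Moore (1990). */
    static double erb(double fc)
    {
        return 24.7 * (4.37 * fc / 1000.0 + 1.0);
    }

    /* Gammatone impulse response at time t (s) for center frequency fc (Hz):
     * g(t) = t^(n-1) exp(-2 pi b ERB(fc) t) cos(2 pi fc t), with n = 4 and
     * b = 1.019; the phase and gain terms are omitted for simplicity. */
    static double gammatone(double t, double fc)
    {
        const int    n = 4;
        const double b = 1.019;

        if (t < 0.0)
            return 0.0;
        return pow(t, n - 1) * exp(-2.0 * PI * b * erb(fc) * t)
                             * cos(2.0 * PI * fc * t);
    }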
126
127 In the physiological version, a 'wave digital filter' is used to
128 implement the classical, one-dimensional, transmission-line
129 approximation to cochlear hydrodynamics. A feedback circuit
130 representing the fast motile response of the outer haircells generates
131 level-dependent basilar membrane motion (Giguère and Woodland,
132 1994a). The filterbank generates combination tones of the type
133 f1-n(f2-f1) which propagate to the appropriate channel, and it has the
134 potential to generate cochlear echoes. Options enable the user to
135 customize the transmission-line filter by specifying the feedback gain
136 and saturation level of the outer haircell circuit. The middle-ear
137 filter forms an integral part of the simulation in this case.
138 Together, it and the transmission-line filterbank provide a
139 bi-directional model of auditory spectral analysis.
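
To make the distortion products concrete, the short fragment below simply
evaluates f1 - n(f2 - f1) for an assumed pair of primaries; it is arithmetic
only, not part of the transmission-line model.

    #include <stdio.h>

    int main(void)
    {
        const double f1 = 1000.0, f2 = 1200.0;   /* example primaries (Hz) */
        int n;

        /* f1 - n(f2 - f1): 800, 600 and 400 Hz for these primaries. */
        for (n = 1; n <= 3; n++)
            printf("n = %d: %6.0f Hz\n", n, f1 - n * (f2 - f1));
        return 0;
    }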
140
141 The upper panels of Figs 2 and 3 show the responses of the two
142 filterbanks to the vowel in 'hat'. They have 75 channels covering the
143 frequency range 100 to 6000 Hz (3.3 to 30.6 ERB's). In the
144 high-frequency channels, the filters are broad and the glottal pulses
145 generate impulse responses which decay relatively quickly. In the
146 low-frequency channels, the filters are narrow and so they resolve
147 individual, continuous harmonics. The rightward skew in the
148 low-frequency channels is the 'phase lag,' or 'propagation delay,' of
149 the cochlea, which arises because the narrower low-frequency filters
150 respond more slowly to input. The transmission-line filterbank shows
151 more ringing in the valleys than the gammatone filterbank because of
152 its dynamic signal compression; as amplitude decreases, the damping of
153 the basilar membrane is reduced to increase sensitivity and frequency
154 resolution.
155
156
157 B. The neural encoding stage
158
159 The second stage of AIM simulates the mechanical/neural transduction
160 process performed by the inner haircells. It converts the BMM into a
161 neural activity pattern (NAP), which is AIM's representation of the
162 afferent activity in the auditory nerve. Two alternative simulations
163 are provided for generating the NAP: a bank of two-dimensional
164 adaptive-thresholding units (Holdsworth and Patterson, 1993), or a
165 bank of inner haircell simulators (Meddis, 1988).
166
167 The adaptive thresholding mechanism is a functional representation of
168 neural encoding. It begins by rectifying and compressing the BMM; then
169 it applies adaptation in time and suppression across frequency. The
170 adaptation and suppression are coupled and they jointly sharpen
171 features like vowel formants in the compressed BMM representation.
172 Briefly, an adaptive threshold value is maintained for each channel
173 and updated at the sampling rate. The new value is the largest of a)
174 the previous value reduced by a fast-acting temporal decay factor, b)
175 the previous value reduced by a longer-term temporal decay factor, c)
176 the adapted level in the channel immediately above, reduced by a
177 frequency spread factor, or d) the adapted level in the channel
178 immediately below, reduced by the same frequency spread factor. The
179 mechanism produces output whenever the input exceeds the adaptive
180 threshold, and the output level is the difference between the input
181 and the adaptive threshold. The parameters that control the spread of
182 activity in time and frequency are options in AIM.
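
The fragment below is a deliberately simplified sketch of this mechanism for
one multi-channel sample. Two simplifications are ours and are not taken from
the description above: the two temporal decay factors are collapsed into a
single per-sample factor, and the adapted level is raised to the input
whenever the input exceeds it. The parameter values are placeholders, not
AIM's defaults.

    #define NCHAN 75

    static double thr[NCHAN];               /* adapted level per channel */

    /* Convert one rectified, compressed BMM sample into one NAP sample. */
    static void adaptive_threshold(const double bmm[NCHAN], double nap[NCHAN])
    {
        const double t_decay  = 0.995;      /* temporal decay per sample (placeholder)   */
        const double f_spread = 0.9;        /* cross-channel spread factor (placeholder) */
        double prev[NCHAN];
        int ch;

        for (ch = 0; ch < NCHAN; ch++)
            prev[ch] = thr[ch];             /* adapted levels from the last sample */

        for (ch = 0; ch < NCHAN; ch++) {
            double t = prev[ch] * t_decay;                /* decayed own level */

            if (ch + 1 < NCHAN && prev[ch + 1] * f_spread > t)
                t = prev[ch + 1] * f_spread;              /* spread from above */
            if (ch > 0 && prev[ch - 1] * f_spread > t)
                t = prev[ch - 1] * f_spread;              /* spread from below */

            if (bmm[ch] > t) {
                nap[ch] = bmm[ch] - t;      /* output is input minus threshold   */
                t = bmm[ch];                /* assumed update rule (see lead-in) */
            } else {
                nap[ch] = 0.0;
            }
            thr[ch] = t;
        }
    }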
183
184 The Meddis (1988) module simulates the operation of an individual
185 inner haircell; specifically, it simulates the flow of
186 neurotransmitter across three reservoirs that are postulated to exist
187 in and around the haircell. The module reproduces important properties
188 of single afferent fibres such as two-component time adaptation and
189 phase-locking. The transmitter flow equations are solved using the
190 wave-digital-filter algorithm described in Giguère and Woodland
191 (1994a). There is one haircell simulator for each channel of the
192 filterbank. Options allow the user to shift the entire rate-intensity
193 function to a higher or lower level, and to specify the type of fibre
194 (medium or high spontaneous-rate).
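
The general form of the three-reservoir scheme can be sketched as below: a
free transmitter pool q, the synaptic cleft c and a reprocessing store w,
advanced here with a naive Euler step. The parameter values are placeholders
(the published values are given by Meddis, 1988), and the AIM module solves
the equations with the wave-digital-filter algorithm rather than this update.

    typedef struct {
        double q, c, w;                     /* reservoir contents */
    } haircell;

    /* Advance one haircell by one sample; s is the (suitably scaled) BMM
     * input and dt the sampling period.  The cleft contents c are returned;
     * their level governs the probability of an afferent spike. */
    static double meddis_step(haircell *h, double s, double dt)
    {
        /* Placeholder constants -- see Meddis (1988) for the published values. */
        const double A = 5.0, B = 300.0, g = 2000.0;
        const double y = 5.05, l = 2500.0, r = 6580.0, x = 66.3, M = 1.0;

        double k  = (s + A > 0.0) ? g * (s + A) / (s + A + B) : 0.0;
        double dq = (y * (M - h->q) + x * h->w - k * h->q) * dt;  /* free pool    */
        double dc = (k * h->q - (l + r) * h->c) * dt;             /* cleft        */
        double dw = (r * h->c - x * h->w) * dt;                   /* reprocessing */

        h->q += dq;  h->c += dc;  h->w += dw;
        return h->c;
    }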
195
196 The middle panels in Figures 2 and 3 show the NAPs obtained with
197 adaptive thresholding and the Meddis module in response to BMMs from
198 the gammatone and transmission-line filterbanks of Figs. 2 and 3,
199 respectively. The phase lag of the BMM is preserved in the NAP. The
200 positive half-cycles of the BMM waves have been sharpened in time, an
201 effect which is more obvious in the adaptive thresholding NAP.
202 Sharpening is also evident in the frequency dimension of the adaptive
203 thresholding NAP. The individual 'haircells' are not coupled across
204 channels in the Meddis module, and thus there is no frequency
205 sharpening in this case. The physiological NAP reveals that the
206 activity between glottal pulses in the high-frequency channels is due
207 to the strong sixth harmonic in the first formant of the vowel.
208
209
210 C. The temporal integration stage
211
212 Periodic sounds give rise to static, rather than oscillating,
213 perceptions, indicating that temporal integration is applied to the NAP
214 in the production of our initial perception of a sound -- our auditory
215 image. Traditionally, auditory temporal integration is represented by
216 a simple leaky integration process, and AIM provides a bank of lowpass
217 filters to enable the user to generate auditory spectra (Patterson,
218 1994a) and auditory spectrograms (Patterson et al., 1992b). However,
219 the leaky integrator removes the phase-locked fine structure observed
220 in the NAP, and this conflicts with perceptual data indicating that
221 the fine structure plays an important role in determining sound
222 quality and source identification (Patterson, 1994b; Patterson and
223 Akeroyd, 1995). As a result, AIM includes two modules which preserve
224 much of the time-interval information in the NAP during temporal
225 integration, and which produce a better representation of our auditory
226 images. In the functional version of AIM, this is accomplished with
227 strobed temporal integration (Patterson et al., 1992a,b); in the
228 physiological version, it is accomplished with a bank of
229 autocorrelators (Slaney and Lyon, 1990; Meddis and Hewitt, 1991).
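
The simple leaky integration mentioned above amounts to a one-pole lowpass
filter applied to each NAP channel; a minimal sketch is given below, with an
assumed 10-ms time constant rather than an AIM default.

    #include <math.h>

    #define NCHAN 75

    static double spectrum[NCHAN];          /* integrator state per channel */

    /* Integrate one multi-channel NAP sample; fs is the sampling rate (Hz). */
    static void leaky_integrate(const double nap[NCHAN], double fs)
    {
        const double tau = 0.010;                   /* time constant (s), assumed */
        const double a   = exp(-1.0 / (tau * fs));  /* per-sample decay           */
        int ch;

        for (ch = 0; ch < NCHAN; ch++)
            spectrum[ch] = a * spectrum[ch] + (1.0 - a) * nap[ch];
    }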
230
231 In the case of strobed temporal integration (STI), a bank of delay
232 lines is used to form a buffer store for the NAP, one delay line per
233 channel, and as the NAP proceeds along the buffer it decays linearly
234 with time, at about 2.5%/ms. Each channel of the buffer is assigned a
235 strobe unit which monitors activity in that channel looking for local
236 maxima in the stream of NAP pulses. When one is found, the unit
237 initiates temporal integration in that channel; that is, it transfers
238 a copy of the NAP at that instant to the corresponding channel of an
239 image buffer and adds it point-for-point with whatever is already
240 there. The local maximum itself is mapped to the 0-ms point in the
241 image buffer. The multi-channel version of this STI process produces
242 AIM's representation of our auditory image of a sound. Periodic and
243 quasi-periodic sounds cause regular strobing which leads to simulated
244 auditory images that are static, or nearly static, and which have the
245 same temporal resolution as the NAP. Dynamic sounds are represented
246 as a sequence of auditory image frames. If the rate of change in a
247 sound is not too rapid, as in diphthongs, features are seen to move
248 smoothly as the sound proceeds, much as characters move smoothly in
249 animated cartoons.
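
A single-channel sketch of STI is given below. The buffer length and strobe
criterion are simplified assumptions made for this illustration (the strobe
logic in the AIM module is more elaborate), but the decay of about 2.5%/ms
and the point-for-point addition at the strobe instant follow the
description above.

    #define FS      20000                   /* sampling rate (Hz)             */
    #define BUFLEN  (FS * 35 / 1000)        /* 35-ms history buffer (assumed) */

    static double history[BUFLEN];          /* most recent NAP samples        */
    static double image[BUFLEN];            /* accumulated auditory image     */

    /* Process one NAP sample; the caller supplies one sample of lookahead so
     * that a local maximum can be recognized at the current sample. */
    static void sti_sample(double nap_prev, double nap, double nap_next)
    {
        int i;

        /* Shift the delay line: history[i] holds the sample i ticks in the past. */
        for (i = BUFLEN - 1; i > 0; i--)
            history[i] = history[i - 1];
        history[0] = nap;

        /* Strobe on a local maximum in the stream of NAP pulses. */
        if (nap > 0.0 && nap > nap_prev && nap >= nap_next) {
            for (i = 0; i < BUFLEN; i++) {
                double age_ms = i * 1000.0 / FS;          /* age of the sample  */
                double weight = 1.0 - 0.025 * age_ms;     /* ~2.5% decay per ms */

                if (weight < 0.0)
                    weight = 0.0;
                /* The strobe point lands at the 0-ms bin of the image buffer. */
                image[i] += history[i] * weight;
            }
        }
    }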
250
251 An alternative form of temporal integration is provided by the
252 correlogram (Slaney and Lyon, 1990; Meddis and Hewitt, 1991). It
253 extracts periodicity information and preserves intra-period fine
254 structure by autocorrelating each channel of the NAP. The correlogram
255 is the multi-channel version of this process. It was originally
256 introduced as a model of pitch perception (Licklider, 1951) with a
257 neural wiring diagram to illustrate that it was physiologically
258 plausible. To date, however, there is no physiological evidence for
259 autocorrelation in the auditory system, and the installation of the
260 module in the physiological route was a matter of convenience. The
261 current implementation is a recursive, or running, autocorrelation. A
262 functionally equivalent FFT-based method is also provided (Allerhand
263 and Patterson, 1992). A comparison of the correlogram in the bottom
264 panel of Fig. 3 with the auditory image in the bottom panel of Fig. 2
265 shows that the vowel structure is more symmetric in the correlogram,
266 and that the level contrasts there are larger. It is not
267 yet known whether one of the representations is more realistic or more
268 useful. The present purpose is to note that the software package can
269 be used to compare auditory representations in a way not previously
270 possible.
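
For completeness, the fragment below computes one correlogram frame as a
direct short-term autocorrelation of each NAP channel. The direct O(N*L)
form is written for clarity only; the module itself uses the recursive
running autocorrelation or the FFT-based equivalent mentioned above, and
the window and lag ranges here are illustrative assumptions.

    #define NCHAN   75
    #define WINLEN  700                     /* 35-ms window at 20 kHz (assumed) */
    #define MAXLAG  700                     /* lags out to 35 ms (assumed)      */

    /* nap[ch][t] holds one windowed frame of the NAP;
     * corr[ch][lag] receives the corresponding correlogram frame. */
    static void correlogram_frame(double nap[NCHAN][WINLEN],
                                  double corr[NCHAN][MAXLAG])
    {
        int ch, lag, t;

        for (ch = 0; ch < NCHAN; ch++) {
            for (lag = 0; lag < MAXLAG; lag++) {
                double sum = 0.0;

                for (t = lag; t < WINLEN; t++)
                    sum += nap[ch][t] * nap[ch][t - lag];
                corr[ch][lag] = sum;
            }
        }
    }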
271
272
273
274 II. THE SOFTWARE/HARDWARE PLATFORM
275
276 i. The software package: The code is distributed as a compressed
277 archive (in unix tar format), and can be obtained via ftp from the
278 address: ftp.mrc-apu.cam.ac.uk (Name=anonymous; Password=<your email
279 address>). All the software is contained in a single archive:
280 pub/aim/aim.tar.Z. The associated text file pub/aim/ReadMe contains
281 instructions for installing and compiling the software. The AIM
282 package consists of a makefile and several sub-directories. Five of
283 these (filter, glib, model, stitch and wdf) contain the C code for
284 AIM. An aim/tools directory contains C code for ancillary software
285 tools. These software tools are provided for pre/post-processing of
286 model input/output. A variety of functions are offered, including:
287 stimulus generation, signal processing, and data manipulation. An
288 aim/man directory contains on-line manual pages describing AIM and the
289 software tools. An aim/scripts directory contains demonstration
290 scripts for a guided tour through the model. Sounds used to test and
291 demonstrate the model are provided in the aim/waves directory. These
292 sounds were sampled at 20 kHz, and each sample is a 2-byte number in
293 little-endian byte order; a tool is provided to swap byte order when
294 necessary.
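
Since byte order matters when the sound files are moved between machines,
the fragment below shows one way to read the 2-byte samples portably: the
low byte is assembled first, so the result is independent of the host's
native byte order. The file name is an assumed example.

    #include <stdio.h>

    int main(void)
    {
        FILE *fp = fopen("aim/waves/example", "rb");  /* assumed file name */
        unsigned char b[2];
        long n = 0;

        if (fp == NULL)
            return 1;

        while (fread(b, 1, 2, fp) == 2) {
            /* Little-endian: low byte first, then sign-extend the 16-bit value. */
            int sample = b[0] | (b[1] << 8);

            if (sample >= 32768)
                sample -= 65536;
            n++;
            (void)sample;                   /* process the sample here */
        }
        printf("%ld samples read (%.2f s at 20 kHz)\n", n, n / 20000.0);
        fclose(fp);
        return 0;
    }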
295
296 ii. System requirements: The software is written in C. The code
297 generated by the native C compilers included with Ultrix (version 4.3a
298 and above) and SunOS (version 4.1.3 and above) has been extensively
299 tested. The code from the GNU C compiler (version 2.5.7 and above) is
300 also reliable. The total disc usage of the AIM source code is about
301 700 kbytes. The package also includes 500 kbytes of sources for
302 ancillary software tools, and 200 kbytes of documentation. The
303 executable programs occupy about 1000 kbytes, and executable programs
304 for ancillary tools occupy 7000 kbytes. About 800 kbytes of temporary
305 space are required for object files during compilation. The graphical
306 interface uses X11 (R4 and above) with either the OpenWindows or Motif
307 user interface. The programs can be compiled using the base Xlib
308 library (libX11.a), and will run on both 1-bit (mono) and multi-plane
309 (colour or greyscale) displays.
310
311 iii. Compilation and operation: The makefile includes targets to
312 compile the source code for AIM and the associated tools on a range of
313 machines (DEC, SUN, SGI, HP); the targets differ only in the pathnames
314 for the local X11 base library (libX11.a) and header files (X11/X.h
315 and X11/Xlib.h). AIM can be compiled without the display code if the
316 graphics interface is not required or if X11 is not available (make
317 noplot). The executable for AIM is called gen. Compilation also
318 generates symbolic links to gen, such as genbmm, gennap and gensai,
319 which are used to select the desired output (BMM, NAP or SAI). The
320 links and the executables for the aim/tools are installed in the
321 aim/bin directory after compilation. Options are specified as:
322 name=value on the command line; unspecified options are assigned
323 default values. The model output takes the form of binary data routed
324 by default to the model's graphical displays. Output can also be
325 routed to plotting hardware or other post-processing software.
326
327
328
329 III. APPLICATIONS AND SUMMARY
330
331 In hearing research, the functional version of AIM has been used to
332 model phase perception (Patterson, 1987), octave perception
333 (Patterson et al., 1993), and timbre perception (Patterson, 1994b).
334 The physiological version has been used to simulate cochlear hearing
335 loss (Giguère, Woodland, and Robinson, 1993; Giguère and Woodland,
336 1994b), and combination tones of cochlear origin (Giguère, Kunov, and
337 Smoorenburg, 1995). In speech research, the functional version has
338 been used to explain syllabic stress (Allerhand et al., 1992), and
339 both versions have been used as preprocessors for speech recognition
340 systems (e.g. Patterson, Anderson, and Allerhand, 1994; Giguère et
341 al., 1993). In summary, the AIM software package provides a modular
342 architecture for time-domain computational studies of peripheral
343 auditory processing.
344
345
346 * Instructions for acquiring the software package electronically are
347 presented in Section II. This document refers to AIM R7, which is the
348 first official release.
349
350
351 ACKNOWLEDGEMENTS
352
353 The gammatone filterbank, adaptive thresholding, and much of the
354 software platform were written by John Holdsworth; the options handler
355 is by Paul Manson, and the revised STI module by Jay Datta. Michael
356 Akeroyd extended the PostScript facilities and developed the xreview
357 routine for auditory image cartoons. The software development was
358 supported by grants from DRA Farnborough (U.K.), Esprit BR 3207 (EEC),
359 and the Hearing Research Trust (U.K.). We thank Malcolm Slaney and
360 Michael Akeroyd for helpful comments on an earlier version of the
361 paper.
362
363
364 Allerhand, M., and Patterson, R.D. (1992). "Correlograms and auditory
365 images," Proc. Inst. Acoust. 14, 281-288.
366
367 Allerhand, M., Butterfield, S., Cutler, A., and Patterson, R.D.
368 (1992). "Assessing syllable strength via an auditory model," Proc.
369 Inst. Acoust. 14, 297-304.
370
371 Assmann, P.F., and Summerfield, Q. (1990). "Modelling the perception
372 of concurrent vowels: Vowels with different fundamental frequencies,"
373 J. Acoust. Soc. Am. 88, 680-697.
374
375 Brown, G.J., and Cooke, M. (1994). "Computational auditory scene
376 analysis," Computer Speech and Language 8, 297-336.
377
378 Giguère, C., Woodland, P.C., and Robinson, A.J. (1993). "Application
379 of an auditory model to the computer simulation of hearing impairment:
380 Preliminary results," Can. Acoust. 21, 135-136.
381
382 Giguère, C., and Woodland, P.C. (1994a). "A computational model of
383 the auditory periphery for speech and hearing research. I. Ascending
384 path," J. Acoust. Soc. Am. 95, 331-342.
385
386 Giguère, C., and Woodland, P.C. (1994b). "A computational model of
387 the auditory periphery for speech and hearing research. II. Descending
388 paths," J. Acoust. Soc. Am. 95, 343-349.
389
390 Giguère, C., Kunov, H., and Smoorenburg, G.F. (1995). "Computational
391 modelling of psycho-acoustic combination tones and distortion-product
392 otoacoustic emissions," 15th Int. Cong. on Acoustics, Trondheim
393 (Norway), 26-30 June.
394
395 Glasberg, B.R., and Moore, B.C.J. (1990). "Derivation of auditory
396 filter shapes from notched-noise data," Hear. Res. 47, 103-138.
397
398 Greenwood, D.D. (1990). "A cochlear frequency-position function for
399 several species - 29 years later," J. Acoust. Soc. Am. 87, 2592-2605.
400
401 Holdsworth, J.W., and Patterson, R.D. (1991). "Analysis of
402 waveforms," UK Patent No. GB 2-234-078-A (23.1.91). London: UK
403 Patent Office.
404
405 Licklider, J. C. R. (1951). "A duplex theory of pitch perception,"
406 Experientia 7, 128-133.
407
408 Lutman, M.E. and Martin, A.M. (1979). "Development of an
409 electroacoustic analogue model of the middle ear and acoustic reflex,"
410 J. Sound. Vib. 64, 133-157.
411
412 Meddis, R. (1988). "Simulation of auditory-neural transduction:
413 Further studies," J. Acoust. Soc. Am. 83, 1056-1063.
414
415 Meddis, R. and Hewitt, M.J. (1991). "Modelling the perception of
416 concurrent vowels with different fundamental frequencies," J. Acoust.
417 Soc. Am. 91, 233-245.
418
419 Patterson, R.D. (1987). "A pulse ribbon model of monaural phase
420 perception," J. Acoust. Soc. Am., 82, 1560-1586.
421
422 Patterson, R.D. (1994a). "The sound of a sinusoid: Spectral models,"
423 J. Acoust. Soc. Am. 96, 1409-1418.
424
425 Patterson, R.D. (1994b). "The sound of a sinusoid: Time-interval
426 models." J. Acoust. Soc. Am. 96, 1419-1428.
427
428 Patterson, R.D. and Akeroyd, M. A. (1995). "Time-interval patterns and
429 sound quality," in: Advances in Hearing Research: Proceedings of the
430 10th International Symposium on Hearing, edited by G. Manley, G.
431 Klump, C. Koppl, H. Fastl, & H. Oeckinghaus, World Scientific,
432 Singapore, (in press).
433
434 Patterson, R.D., Anderson, T., and Allerhand, M. (1994). "The auditory
435 image model as a preprocessor for spoken language," in Proc. Third
436 ICSLP, Yokohama, Japan, 1395-1398.
437
438 Patterson, R.D., Milroy, R. and Allerhand, M. (1993). "What is the
439 octave of a harmonically rich note?" In: Proc. 2nd Int. Conf. on Music
440 and the Cognitive Sciences, edited by I. Cross and I. Deliège (Harwood,
441 Switzerland) 69-81.
442
443 Patterson, R.D., and Moore, B.C.J. (1986). "Auditory filters and
444 excitation patterns as representations of frequency resolution," in
445 Frequency Selectivity in Hearing, edited by B. C. J. Moore, (Academic,
446 London) pp. 123-177.
447
448 Patterson, R.D., Holdsworth, J., and Allerhand, M. (1992b). "Auditory
449 models as preprocessors for speech recognition," In: The Auditory
450 Processing of Speech: From the auditory periphery to words, edited by
451 M. E. H. Schouten (Mouton de Gruyter, Berlin) 67-83.
452
453 Patterson, R.D., Robinson, K., Holdsworth, J., McKeown, D., Zhang, C.,
454 and Allerhand, M. (1992a). "Complex sounds and auditory images," In:
455 Auditory physiology and perception, edited by Y Cazals, L. Demany, and
456 K. Horner (Pergamon, Oxford) 429-446.
457
458 Slaney, M. and Lyon, R.F. (1990). "A perceptual pitch detector," in
459 Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing,
460 Albuquerque, New Mexico, April 1990.
461
462
463 Figure 1. The three-stage structure of the AIM software package.
464 Left-hand column: functional route; right-hand column: physiological
465 route. For each module, the figure shows the function (bold type), the
466 implementation (in the rectangle), and the simulation it produces
467 (italics).
468
469 Figure 2. Responses of the model to the vowel in 'hat' processed
470 through the functional route: (top) basilar membrane motion, (middle)
471 neural activity pattern, and (bottom) auditory image.
472
473 Figure 3. Responses of the model to the vowel in 'hat' processed
474 through the physiological route: (top) basilar membrane motion,
475 (middle) neural activity pattern, and (bottom) autocorrelogram image.
476