diff docs/PAG95.doc @ 0:5242703e91d3 tip

Initial checkin for AIM92 aimR8.2 (last updated May 1997).
author tomwalters
date Fri, 20 May 2011 15:19:45 +0100
parents
children
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/docs/PAG95.doc	Fri May 20 15:19:45 2011 +0100
@@ -0,0 +1,476 @@
+Revised for JASA, 3 April 95
+
+
+Time-domain modelling of peripheral auditory processing:
+	A modular architecture and a software platform*
+
+Roy D. Patterson and Mike H. Allerhand
+MRC Applied Psychology Unit, 15 Chaucer Road, Cambridge  CB2 2EF, UK 
+
+Christian Giguère
+Laboratory of Experimental Audiology, University Hospital Utrecht,
+3508 GA Utrecht, The Netherlands
+
+(Received		December, 1994)   (Revised 31 March 1995)
+
+A software package with a modular architecture has been developed to
+support perceptual modelling of the fine-grain spectro-temporal
+information observed in the auditory nerve. The package contains both
+functional and physiological modules to simulate auditory spectral
+analysis, neural encoding and temporal integration, including new
+forms of periodicity-sensitive temporal integration that generate
+stabilized auditory images. Combinations of the modules enable the
+user to approximate a wide variety of existing, time-domain, auditory
+models. Sequences of auditory images can be replayed to produce
+cartoons of auditory perceptions that illustrate the dynamic response
+of the auditory system to everyday sounds.
+
+PACS numbers: 43.64.Bt, 43.66.Ba, 43.71.An
+
+Running head: Auditory Image Model Software
+
+
+INTRODUCTION
+
+Several years ago, we developed a functional model of the cochlea to
+simulate the phase-locked activity that complex sounds produce in the
+auditory nerve. The purpose was to investigate the role of the
+fine-grain timing information in auditory perception generally
+(Patterson et al., 1992a; Patterson and Akeroyd, 1995), and in speech
+perception in particular (Patterson, Holdsworth and Allerhand, 1992b).
+The architecture of the resulting Auditory Image Model (AIM) is shown
+in the left-hand column of Fig. 1. The responses of the three modules
+to the vowel in 'hat' are shown in the three panels of Fig. 2.
+Briefly, the spectral analysis stage converts the sound wave into the
+model's representation of basilar membrane motion (BMM). For the vowel
+in 'hat', each glottal cycle generates a version of the basic vowel
+structure in the BMM (top panel).  The neural encoding stage
+stabilizes the BMM in level and sharpens features like vowel formants,
+to produce a simulation of the neural activity pattern (NAP) produced
+by the sound in the auditory nerve (middle panel).  The temporal
+integration stage stabilizes the repeating structure in the NAP and
+produces a simulation of our perception of the vowel (bottom panel),
+referred to as the auditory image.  Sequences of simulated images can
+be generated at regular intervals and replayed as an animated cartoon
+to show the dynamic behaviour of the auditory images produced by
+everyday sounds.  
+
+An earlier version of the AIM software was made available to
+collaborators via the Internet. From there it spread to the speech and
+music communities, indicating a more general interest in auditory
+models than we had originally anticipated. This has prompted us to
+prepare documentation and a formal release of the software (AIM R7).
+
+A number of users wanted to compare the outputs from the functional
+model, which is almost level independent, with those from
+physiological models of the cochlea, which are fundamentally level
+dependent. Others wanted to compare the auditory images produced by
+strobed temporal integration with correlograms. As a result, we have
+installed alternative modules for each of the three main stages as
+shown in the right-hand column of Fig. 1.  The alternative spectral
+analysis module is a non-linear, transmission line filterbank based on
+Giguère and Woodland (1994a). The neural encoding module is based on
+the inner haircell model of Meddis (1988).  The temporal integration
+module generates correlograms like those of Slaney and Lyon (1990) or
+Meddis and Hewitt (1991), using the algorithm proposed by Allerhand
+and Patterson (1992). The responses of the three modules to the vowel
+in 'hat' are shown in Fig. 3 for the case where the level of the vowel
+is 60 dB SPL. The patterns are broadly similar to those of the
+functional modules but the details differ, particularly at the output
+of the third stage. The differences grow more pronounced when the
+level of the vowel is reduced to 30 dB SPL or increased to 90 dB SPL.
+Figures 2 and 3 together illustrate how the software can be used to
+compare and contrast different auditory models.  The new modules also
+open the way to time-domain simulation of hearing impairment and
+distortion products of cochlear origin.
+
+Switches were installed to enable the user to shift from the
+functional to the physiological version of AIM at the output of each
+stage of the model. This architecture enables the system to implement
+other popular auditory models such as the gammatone-filterbank,
+Meddis-haircell, correlogram models proposed by Assmann and
+Summerfield (1990), Meddis and Hewitt (1991), and Brown and Cooke
+(1994). The remainder of this letter describes the integrated software
+package with emphasis on the functional and physiological routes, and
+on practical aspects of obtaining the software package.*
+
+
+
+I. THE AUDITORY IMAGE MODEL
+
+A. The spectral analysis stage 
+
+Spectral analysis is performed by a bank of auditory filters which
+converts a digitized wave into an array of filtered waves like those
+shown in the top panels of Figs 2 and 3.  The set of waves is AIM's
+representation of basilar membrane motion.  The software distributes
+the filters linearly along a frequency scale measured in Equivalent
+Rectangular Bandwidths (ERBs). The ERB scale was proposed by Glasberg
+and Moore (1990) based on physiological research summarized in
+Greenwood (1990) and psychoacoustic research summarized in Patterson
+and Moore (1986). The constants of the ERB function can also be set to
+produce a reasonable approximation to the Bark scale. Options enable
+the user to specify the number of channels in the filterbank and the
+minimum and maximum filter center frequencies.
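+
+To make the channel spacing concrete, the following C fragment
+distributes filter centre frequencies linearly on the ERB-rate scale of
+Glasberg and Moore (1990). It is a minimal sketch rather than an
+extract from the AIM source, and the function names are illustrative
+only.
+
+    #include <math.h>
+
+    /* ERB(f) = 24.7 (4.37 f/1000 + 1) Hz;
+       ERB-rate(f) = 21.4 log10(4.37 f/1000 + 1) ERBs.  */
+    static double hz_to_erb_rate(double f) { return 21.4 * log10(4.37e-3 * f + 1.0); }
+    static double erb_rate_to_hz(double e) { return (pow(10.0, e / 21.4) - 1.0) / 4.37e-3; }
+
+    /* Fill cf[0..nchan-1] with centre frequencies spaced linearly in ERBs
+       between fmin and fmax, e.g. 75 channels from 100 to 6000 Hz.       */
+    void erb_space(double *cf, int nchan, double fmin, double fmax)
+    {
+        double elo = hz_to_erb_rate(fmin), ehi = hz_to_erb_rate(fmax);
+        int i;
+        for (i = 0; i < nchan; i++)
+            cf[i] = erb_rate_to_hz(elo + (ehi - elo) * i / (nchan - 1));
+    }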
+
+AIM provides both a functional auditory filter and a physiological
+auditory filter for generating the BMM: the former is a linear,
+gammatone filter (Patterson et al., 1992a); the latter is a
+non-linear, transmission-line filter (Giguère and Woodland, 1994a).
+The impulse response of the gammatone filter provides an excellent fit
+to the impulse response of primary auditory neurons in cats, and its
+amplitude characteristic is very similar to that of the 'roex' filter
+commonly used to represent the human auditory filter. The motivation
+for the gammatone filterbank and the available implementations are
+summarized in Patterson (1994a). The input wave is passed through an
+optional middle-ear filter adapted from Lutman and Martin (1979).
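+
+For reference, the gammatone impulse response has the form
+gt(t) = t^(n-1) exp(-2 pi b t) cos(2 pi fc t + phi), with order n = 4
+and bandwidth b = 1.019 ERB(fc) in the implementations cited above. The
+C fragment below simply evaluates this expression; it is an
+illustrative sketch, not the efficient recursive filterbank used in
+AIM, and the function name is ours.
+
+    #include <math.h>
+
+    #define TWO_PI 6.283185307179586
+
+    /* Direct evaluation of the gammatone impulse response for one channel
+       with centre frequency fc; order n = 4, start phase phi = 0.        */
+    void gammatone_ir(double *out, int nsamp, double srate, double fc)
+    {
+        const int    n = 4;
+        const double b = 1.019 * 24.7 * (4.37e-3 * fc + 1.0);  /* 1.019 ERB(fc) */
+        int i;
+        for (i = 0; i < nsamp; i++) {
+            double t = i / srate;
+            out[i] = pow(t, n - 1) * exp(-TWO_PI * b * t) * cos(TWO_PI * fc * t);
+        }
+    }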
+
+In the physiological version, a 'wave digital filter' is used to
+implement the classical, one-dimensional, transmission-line
+approximation to cochlear hydrodynamics. A feedback circuit
+representing the fast motile response of the outer haircells generates
+level-dependent basilar membrane motion (Giguère and Woodland,
+1994a). The filterbank generates combination tones of the type
+f1-n(f2-f1) which propagate to the appropriate channel, and it has the
+potential to generate cochlear echoes. Options enable the user to
+customize the transmission line filter by specifying the feedback gain
+and saturation level of the outer haircell circuit. The middle ear
+filter forms an integral part of the simulation in this case.
+Together, the middle-ear filter and the transmission line filterbank
+provide a bi-directional model of auditory spectral analysis.
+
+The upper panels of Figs 2 and 3 show the responses of the two
+filterbanks to the vowel in 'hat'. They have 75 channels covering the
+frequency range 100 to 6000 Hz (3.3 to 30.6 ERBs). In the
+high-frequency channels, the filters are broad and the glottal pulses
+generate impulse responses which decay relatively quickly. In the
+low-frequency channels, the filters are narrow and so they resolve
+individual harmonics, which appear as continuous waves. The rightward skew in the
+low-frequency channels is the 'phase lag,' or 'propagation delay,' of
+the cochlea, which arises because the narrower low-frequency filters
+respond more slowly to input. The transmission line filterbank shows
+more ringing in the valleys than the gammatone filterbank because of
+its dynamic signal compression; as amplitude decreases the damping of
+the basilar membrane is reduced to increase sensitivity and frequency
+resolution.
+
+
+B. The neural encoding stage
+
+The second stage of AIM simulates the mechanical/neural transduction
+process performed by the inner haircells. It converts the BMM into a
+neural activity pattern (NAP), which is AIM's representation of the
+afferent activity in the auditory nerve. Two alternative simulations
+are provided for generating the NAP: a bank of two-dimensional
+adaptive-thresholding units (Holdsworth and Patterson, 1993), or a
+bank of inner haircell simulators (Meddis, 1988).
+
+The adaptive thresholding mechanism is a functional representation of
+neural encoding. It begins by rectifying and compressing the BMM; then
+it applies adaptation in time and suppression across frequency. The
+adaptation and suppression are coupled and they jointly sharpen
+features like vowel formants in the compressed BMM representation.
+Briefly, an adaptive threshold value is maintained for each channel
+and updated at the sampling rate. The new value is the largest of a)
+the previous value reduced by a fast-acting temporal decay factor, b)
+the previous value reduced by a longer-term temporal decay factor, c)
+the adapted level in the channel immediately above, reduced by a
+frequency spread factor, or d) the adapted level in the channel
+immediately below, reduced by the same frequency spread factor. The
+mechanism produces output whenever the input exceeds the adaptive
+threshold, and the output level is the difference between the input
+and the adaptive threshold. The parameters that control the spread of
+activity in time and frequency are options in AIM.
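+
+The per-sample update can be sketched in C as follows. This is a
+minimal illustration of the rule just described, not the AIM module
+itself: the decay and spread constants are placeholders rather than the
+AIM defaults, and the final step, which raises the threshold to the
+input whenever an output is produced, is an assumption of the sketch.
+
+    /* One sample of the adaptive-thresholding rule across channels.
+       bmm[]    : rectified, compressed BMM (one sample per channel)
+       thresh[] : adaptive threshold state, updated in place
+       nap[]    : NAP output for this sample                          */
+    void adaptive_threshold_step(const double *bmm, double *thresh,
+                                 double *nap, int nchan)
+    {
+        const double fast_decay  = 0.85;   /* a) fast-acting decay (illustrative)  */
+        const double slow_decay  = 0.995;  /* b) longer-term decay (illustrative)  */
+        const double freq_spread = 0.90;   /* c,d) cross-channel spread factor     */
+        double below_prev = 0.0;           /* pre-update threshold of channel ch-1 */
+        int ch;
+        for (ch = 0; ch < nchan; ch++) {
+            double prev = thresh[ch];
+            double t = prev * fast_decay;                   /* a) */
+            double v = prev * slow_decay;                   /* b) */
+            if (v > t) t = v;
+            if (ch + 1 < nchan) {                           /* c) channel above */
+                v = thresh[ch + 1] * freq_spread;
+                if (v > t) t = v;
+            }
+            if (ch > 0) {                                   /* d) channel below */
+                v = below_prev * freq_spread;
+                if (v > t) t = v;
+            }
+            nap[ch]    = (bmm[ch] > t) ? bmm[ch] - t : 0.0; /* output above threshold */
+            thresh[ch] = (bmm[ch] > t) ? bmm[ch] : t;       /* assumed reset to input */
+            below_prev = prev;
+        }
+    }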
+
+The Meddis (1988) module simulates the operation of an individual
+inner haircell; specifically, it simulates the flow of
+neurotransmitter across three reservoirs that are postulated to exist
+in and around the haircell. The module reproduces important properties
+of single afferent fibres such as two-component time adaptation and
+phase-locking. The transmitter flow equations are solved using the
+wave-digital-filter algorithm described in Giguère and Woodland
+(1994a). There is one haircell simulator for each channel of the
+filterbank. Options allow the user to shift the entire rate-intensity
+function to a higher or lower level, and to specify the type of fibre
+(medium or high spontaneous-rate).
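+
+The transmitter flow can be outlined with the three coupled reservoir
+equations of Meddis (1988): free transmitter q, cleft contents c, and a
+reprocessing store w, driven by a permeability
+k(t) = g (s + A)/(s + A + B) when s + A > 0 and zero otherwise, where s
+is the instantaneous BMM amplitude. The fragment below integrates these
+equations with a plain forward-Euler step purely for illustration; AIM
+itself solves them with the wave-digital-filter algorithm of Giguère
+and Woodland (1994a), and the rate constants take the values published
+by Meddis (1988), so they are passed in rather than restated here.
+
+    typedef struct {
+        double A, B, g;      /* permeability parameters               */
+        double y, l, r, x;   /* replenishment, loss, reuptake, return */
+        double M, h;         /* max free transmitter, firing gain     */
+        double q, c, w;      /* reservoir state                       */
+    } haircell;
+
+    /* One sample of BMM input s; returns the probability of an afferent
+       spike in the interval dt.                                         */
+    double meddis_step(haircell *hc, double s, double dt)
+    {
+        double k = (s + hc->A > 0.0)
+                 ? hc->g * (s + hc->A) / (s + hc->A + hc->B)
+                 : 0.0;
+        double dq = hc->y * (hc->M - hc->q) + hc->x * hc->w - k * hc->q;
+        double dc = k * hc->q - hc->l * hc->c - hc->r * hc->c;
+        double dw = hc->r * hc->c - hc->x * hc->w;
+        hc->q += dt * dq;
+        hc->c += dt * dc;
+        hc->w += dt * dw;
+        return hc->h * hc->c * dt;
+    }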
+
+The middle panels in Figures 2 and 3 show the NAPs obtained with
+adaptive thresholding and the Meddis module in response to BMMs from
+the gammatone and transmission line filterbanks of Figs 2 and 3,
+respectively. The phase lag of the BMM is preserved in the NAP. The
+positive half-cycles of the BMM waves have been sharpened in time, an
+effect which is more obvious in the adaptive thresholding NAP.
+Sharpening is also evident in the frequency dimension of the adaptive
+thresholding NAP. The individual 'haircells' are not coupled across
+channels in the Meddis module, and thus there is no frequency
+sharpening in this case. The physiological NAP reveals that the
+activity between glottal pulses in the high-frequency channels is due
+to the strong sixth harmonic in the first formant of the vowel.
+
+
+C. The temporal integration stage
+
+Periodic sounds give rise to static, rather than oscillating,
+perceptions, indicating that temporal integration is applied to the NAP
+in the production of our initial perception of a sound -- our auditory
+image. Traditionally, auditory temporal integration is represented by
+a simple leaky integration process and AIM provides a bank of lowpass
+filters to enable the user to generate auditory spectra (Patterson,
+1994a) and auditory spectrograms (Patterson et al., 1992b). However,
+the leaky integrator removes the phase-locked fine structure observed
+in the NAP, and this conflicts with perceptual data indicating that
+the fine structure plays an important role in determining sound
+quality and source identification (Patterson, 1994b; Patterson and
+Akeroyd, 1995). As a result, AIM includes two modules which preserve
+much of the time-interval information in the NAP during temporal
+integration, and which produce a better representation of our auditory
+images. In the functional version of AIM, this is accomplished with
+strobed temporal integration (Patterson et al., 1992a,b); in the
+physiological version, it is accomplished with a bank of
+autocorrelators (Slaney and Lyon, 1990; Meddis and Hewitt, 1991).
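+
+The conventional leaky integrator mentioned above amounts to a
+first-order lowpass filter applied to each channel of the NAP. A
+minimal sketch, with an illustrative time constant passed as a
+parameter, is given below; the interval-preserving alternatives are
+described next.
+
+    #include <math.h>
+
+    /* In-place leaky integration of one NAP channel with time constant
+       tau_ms (milliseconds); the result is one channel of an auditory
+       spectrogram.                                                     */
+    void leaky_integrate(double *nap, int nsamp, double srate, double tau_ms)
+    {
+        double a = exp(-1.0 / (srate * tau_ms * 1e-3));  /* per-sample decay */
+        double y = 0.0;
+        int i;
+        for (i = 0; i < nsamp; i++) {
+            y = a * y + (1.0 - a) * nap[i];
+            nap[i] = y;
+        }
+    }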
+
+In the case of strobed temporal integration (STI), a bank of delay
+lines is used to form a buffer store for the NAP, one delay line per
+channel, and as the NAP proceeds along the buffer it decays linearly
+with time, at about 2.5 %/ms. Each channel of the buffer is assigned a
+strobe unit which monitors activity in that channel looking for local
+maxima in the stream of NAP pulses. When one is found, the unit
+initiates temporal integration in that channel; that is, it transfers
+a copy of the NAP at that instant to the corresponding channel of an
+image buffer and adds it point-for-point with whatever is already
+there. The local maximum itself is mapped to the 0-ms point in the
+image buffer. The multi-channel version of this STI process produces
+AIM's representation of our auditory image of a sound. Periodic and
+quasi-periodic sounds cause regular strobing which leads to simulated
+auditory images that are static, or nearly static, and which have the
+same temporal resolution as the NAP.  Dynamic sounds are represented
+as a sequence of auditory image frames. If the rate of change in a
+sound is not too rapid, as in diphthongs, features are seen to move
+smoothly as the sound proceeds, much as characters move smoothly in
+animated cartoons.
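+
+A single-channel sketch of the strobe-and-add operation is given below.
+It is illustrative only: the strobe criterion is reduced to a simple
+local-maximum test on the most recent complete sample, and the linear
+decay of the NAP buffer (about 2.5%/ms) and of the image itself are
+omitted.
+
+    /* On each new NAP sample, test whether the previous sample is a local
+       maximum; if so, add the NAP buffer into the image with that maximum
+       mapped to the 0-ms point.  nap_buf holds the channel's recent NAP
+       samples, newest last; image is one channel of the auditory image.  */
+    void sti_strobe(const double *nap_buf, int buflen, double *image, int imglen)
+    {
+        int t0 = buflen - 2;              /* candidate strobe point         */
+        int i;
+        if (t0 < 1)
+            return;
+        if (nap_buf[t0] <= nap_buf[t0 - 1] || nap_buf[t0] < nap_buf[t0 + 1])
+            return;                       /* not a local maximum: no strobe */
+        for (i = 0; i < imglen && i <= t0; i++)
+            image[i] += nap_buf[t0 - i];  /* point-for-point addition       */
+    }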
+
+An alternative form of temporal integration is provided by the
+correlogram (Slaney and Lyon, 1990; Meddis and Hewitt, 1991). It
+extracts periodicity information and preserves intra-period fine
+structure by autocorrelating each channel of the NAP. The correlogram
+is the multi-channel version of this process. It was originally
+introduced as a model of pitch perception (Licklider, 1951) with a
+neural wiring diagram to illustrate that it was physiologically
+plausible. To date, however, there is no physiological evidence for
+autocorrelation in the auditory system, and the installation of the
+module in the physiological route was a matter of convenience. The
+current implementation is a recursive, or running, autocorrelation. A
+functionally equivalent FFT-based method is also provided (Allerhand
+and Patterson, 1992). A comparison of the correlogram in the bottom
+panel of Fig. 3 with the auditory image in the bottom panel of Fig. 2
+shows that the vowel structure is more symmetric in the correlogram
+and there are larger level contrasts in the correlogram.  It is not
+yet known whether one of the representations is more realistic or more
+useful. The present purpose is to note that the software package can
+be used to compare auditory representations in a way not previously
+possible.
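+
+The recursive form can be sketched per channel as follows: at each new
+NAP sample, the product of that sample with the sample one lag earlier
+is accumulated into an exponentially decaying sum, one accumulator per
+lag. The decay constant and the calling convention are illustrative;
+the FFT-based method of Allerhand and Patterson (1992) computes an
+equivalent result block by block.
+
+    /* One update of a running autocorrelation for a single channel.
+       nap[0..t] are the NAP samples so far; acf[0..nlags-1] holds the
+       exponentially weighted lag products, i.e. one column of the
+       correlogram for this channel.                                   */
+    void running_acf_step(const double *nap, int t,
+                          double *acf, int nlags, double decay)
+    {
+        int lag;
+        for (lag = 0; lag < nlags && lag <= t; lag++)
+            acf[lag] = decay * acf[lag] + nap[t] * nap[t - lag];
+    }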
+
+
+
+II. THE SOFTWARE/HARDWARE PLATFORM
+
+i. The software package: The code is distributed as a compressed
+archive (in unix tar format), and can be obtained via ftp from the
+address: ftp.mrc-apu.cam.ac.uk (Name=anonymous; Password=<your email
+address>). All the software is contained in a single archive:
+pub/aim/aim.tar.Z. The associated text file pub/aim/ReadMe contains
+instructions for installing and compiling the software.  The AIM
+package consists of a makefile and several sub-directories.  Five of
+these (filter, glib, model, stitch and wdf) contain the C code for
+AIM. An aim/tools directory contains C code for ancillary software
+tools.  These software tools are provided for pre/post-processing of
+model input/output. A variety of functions are offered, including:
+stimulus generation, signal processing, and data manipulation.  An
+aim/man directory contains on-line manual pages describing AIM and the
+software tools.  An aim/scripts directory contains demonstration
+scripts for a guided tour through the model. Sounds used to test and
+demonstrate the model are provided in the aim/waves directory. These
+sounds were sampled at 20 kHz, and each sample is a 2-byte number in
+little-endian byte order; a tool is provided to swap byte order when
+necessary.
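+
+On big-endian machines the 2-byte samples must therefore have their
+byte order swapped before use. The supplied tool does this; for
+illustration, the operation amounts to the following (the function name
+is ours, not that of the tool):
+
+    #include <stddef.h>
+
+    /* Swap the two bytes of every 16-bit sample in the buffer, in place. */
+    void swap_bytes_16(unsigned char *buf, size_t nbytes)
+    {
+        size_t i;
+        for (i = 0; i + 1 < nbytes; i += 2) {
+            unsigned char tmp = buf[i];
+            buf[i]     = buf[i + 1];
+            buf[i + 1] = tmp;
+        }
+    }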
+
+ii. System requirements: The software is written in C. The code
+generated by the native C compilers included with Ultrix (version 4.3a
+and above) and SunOS (version 4.1.3 and above) has been extensively
+tested. The code from the GNU C compiler (version 2.5.7 and above) is
+also reliable.  The total disc usage of the AIM source code is about
+700 kbytes.  The package also includes 500 kbytes of sources for
+ancillary software tools, and 200 kbytes of documentation. The
+executable programs occupy about 1000 kbytes, and executable programs
+for ancillary tools occupy 7000 kbytes. About 800 kbytes of temporary
+space are required for object files during compilation. The graphical
+interface uses X11 (R4 and above) with either the OpenWindows or Motif
+user interface. The programs can be compiled using the base Xlib
+library (libX11.a), and will run on both 1-bit (mono) and multi-plane
+(colour or greyscale) displays.
+
+iii. Compilation and operation: The makefile includes targets to
+compile the source code for AIM and the associated tools on a range of
+machines (DEC, SUN, SGI, HP); the targets differ only in the pathnames
+for the local X11 base library (libX11.a) and header files (X11/X.h
+and X11/Xlib.h).  AIM can be compiled without the display code if the
+graphics interface is not required or if X11 is not available (make
+noplot).  The executable for AIM is called gen. Compilation also
+generates symbolic links to gen, such as genbmm, gennap and gensai,
+which are used to select the desired output (BMM, NAP or SAI). The
+links and the executables for the aim/tools are installed in the
+aim/bin directory after compilation.  Options are specified as:
+name=value on the command line; unspecified options are assigned
+default values.  The model output takes the form of binary data routed
+by default to the model's graphical displays. Output can also be
+routed to plotting hardware, or other post-processing software.
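+
+To illustrate the name=value convention, a minimal parser might look
+like the fragment below. The option names and default values shown are
+hypothetical examples rather than the actual AIM option set; the real
+options are documented in the manual pages under aim/man.
+
+    #include <stdio.h>
+    #include <stdlib.h>
+    #include <string.h>
+
+    int main(int argc, char **argv)
+    {
+        int    channels = 75;                    /* hypothetical defaults */
+        double min_cf   = 100.0, max_cf = 6000.0;
+        int i;
+        for (i = 1; i < argc; i++) {
+            char *eq = strchr(argv[i], '=');
+            if (eq == NULL)
+                continue;                        /* not a name=value pair */
+            *eq = '\0';                          /* split name from value */
+            if      (strcmp(argv[i], "channels") == 0) channels = atoi(eq + 1);
+            else if (strcmp(argv[i], "min_cf")   == 0) min_cf   = atof(eq + 1);
+            else if (strcmp(argv[i], "max_cf")   == 0) max_cf   = atof(eq + 1);
+        }
+        printf("channels=%d min_cf=%g max_cf=%g\n", channels, min_cf, max_cf);
+        return 0;
+    }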
+
+
+
+III. APPLICATIONS AND SUMMARY
+
+In hearing research, the functional version of AIM has been used to
+model phase perception (Patterson, 1987), octave perception
+(Patterson et al., 1993), and timbre perception (Patterson, 1994b).
+The physiological version has been used to simulate cochlear hearing
+loss (Giguère, Woodland, and Robinson, 1993; Giguère and Woodland,
+1994b), and combination tones of cochlear origin (Giguère, Kunov, and
+Smoorenburg, 1995). In speech research, the functional version has
+been used to explain syllabic stress (Allerhand et al., 1992), and
+both versions have been used as preprocessors for speech recognition
+systems (e.g. Patterson, Anderson, and Allerhand, 1994; Giguère et
+al., 1993).  In summary, the AIM software package provides a modular
+architecture for time-domain computational studies of peripheral
+auditory processing.
+
+
+* Instructions for acquiring the software package electronically are
+presented in Section II.  This document refers to AIM R7 which is the
+first official release.
+
+
+ACKNOWLEDGEMENTS
+
+The gammatone filterbank, adaptive thresholding, and much of the
+software platform were written by John Holdsworth; the options handler
+is by Paul Manson, and the revised STI module by Jay Datta. Michael
+Akeroyd extended the postscript facilities and developed the xreview
+routine for auditory image cartoons. The software development was
+supported by grants from DRA Farnborough (U.K.), Esprit BR 3207 (EEC),
+and the Hearing Research Trust (U.K.). We thank Malcolm Slaney and
+Michael Akeroyd for helpful comments on an earlier version of the
+paper.
+
+
+Allerhand, M., and Patterson, R.D. (1992). "Correlograms and auditory
+images," Proc.  Inst. Acoust. 14, 281-288.
+
+Allerhand, M., Butterfield, S., Cutler, A., and Patterson, R.D.
+(1992). "Assessing syllable strength via an auditory model," Proc.
+Inst. Acoust. 14, 297-304.
+
+Assmann, P.F., and Summerfield, Q. (1990). "Modelling the perception
+of concurrent vowels: Vowels with different fundamental frequencies,"
+J. Acoust. Soc. Am. 88, 680-697.
+
+Brown, G.J., and Cooke, M. (1994) "Computational auditory scene
+analysis," Computer Speech and Language 8, 297-336.
+
+Giguère, C., Woodland, P.C., and Robinson, A.J. (1993). "Application
+of an auditory model to the computer simulation of hearing impairment:
+Preliminary results," Can. Acoust.  21, 135-136.
+
+Giguère, C., and Woodland, P.C. (1994a). "A computational model of
+the auditory periphery for speech and hearing research. I. Ascending
+path," J. Acoust. Soc. Am. 95, 331-342.
+
+Giguère, C., and Woodland, P.C. (1994b). "A computational model of
+the auditory periphery for speech and hearing research. II. Descending
+paths," J. Acoust. Soc. Am. 95, 343-349.
+
+Giguère, C., Kunov, H., and Smoorenburg, G.F. (1995). "Computational
+modelling of psycho-acoustic combination tones and distortion-product
+otoacoustic emissions," 15th Int. Cong. on Acoustics, Trondheim
+(Norway), 26-30 June.
+
+Glasberg, B.R., and Moore, B.C.J. (1990). "Derivation of auditory
+filter shapes from notched-noise data," Hear. Res. 47, 103-138.
+
+Greenwood, D.D. (1990). "A cochlear frequency-position function for
+several species - 29 years later," J. Acoust. Soc. Am. 87, 2592-2605.
+
+Holdsworth, J.W., and Patterson, R.D. (1991).  "Analysis of
+waveforms," UK Patent No.  GB 2-234-078-A (23.1.91).  London: UK
+Patent Office.
+
+Licklider, J. C. R. (1951). "A duplex theory of pitch perception,"
+Experientia 7, 128-133.
+
+Lutman, M.E. and Martin, A.M. (1979). "Development of an
+electroacoustic analogue model of the middle ear and acoustic reflex,"
+J. Sound. Vib. 64, 133-157.
+
+Meddis, R. (1988). "Simulation of auditory-neural transduction:
+Further studies," J.  Acoust. Soc. Am. 83, 1056-1063.
+
+Meddis, R. and Hewitt, M.J. (1991). "Modelling the perception of
+concurrent vowels with different fundamental frequencies," J. Acoust.
+Soc. Am. 91, 233-245.
+
+Patterson, R.D. (1987). "A pulse ribbon model of monaural phase
+perception," J. Acoust.  Soc. Am., 82, 1560-1586.
+
+Patterson, R.D. (1994a). "The sound of a sinusoid: Spectral models,"
+J. Acoust. Soc. Am.  96, 1409-1418.
+
+Patterson, R.D. (1994b). "The sound of a sinusoid: Time-interval
+models." J. Acoust. Soc.  Am. 96, 1419-1428.
+
+Patterson, R.D. and Akeroyd, M. A. (1995). "Time-interval patterns and
+sound quality," in: Advances in Hearing Research: Proceedings of the
+10th International Symposium on Hearing, edited by G. Manley, G.
+Klump, C. Köppl, H. Fastl, & H. Oeckinghaus, World Scientific,
+Singapore, (in press).
+
+Patterson, R.D., Anderson, T., and Allerhand, M. (1994). "The auditory
+image model as a preprocessor for spoken language," in Proc. Third
+ICSLP, Yokohama, Japan, 1395-1398.
+
+Patterson, R.D., Milroy, R. and Allerhand, M. (1993). "What is the
+octave of a harmonically rich note?" In: Proc. 2nd Int. Conf. on Music
+and the Cognitive Sciences, edited by I. Cross and I. Deliège (Harwood,
+Switzerland) 69-81.
+
+Patterson, R.D., and Moore, B.C.J. (1986). "Auditory filters and
+excitation patterns as representations of frequency resolution," in
+Frequency Selectivity in Hearing, edited by B. C. J. Moore, (Academic,
+London) pp. 123-177.
+
+Patterson, R.D., Holdsworth, J., and Allerhand, M. (1992). "Auditory
+Models as preprocessors for speech recognition," In: The Auditory
+Processing of Speech: From the auditory periphery to words, edited by
+M. E. H. Schouten (Mouton de Gruyter, Berlin) 67-83.
+
+Patterson, R.D., Robinson, K., Holdsworth, J., McKeown, D., Zhang, C.,
+and Allerhand, M. (1992). "Complex sounds and auditory images," In:
+Auditory physiology and perception, edited by Y Cazals, L. Demany, and
+K. Horner (Pergamon, Oxford) 429-446.
+
+Slaney, M. and Lyon, R.F. (1990).  "A perceptual pitch detector," in
+Proc. IEEE Int. Conf.  Acoust., Speech, Signal Processing,
+Albuquerque, New Mexico, April 1990.
+
+
+Figure 1. The three-stage structure of the AIM software package.
+Left-hand column: functional route, right-hand column: physiological
+route. For each module, the figure shows the function (bold type), the
+implementation (in the rectangle), and the simulation it produces
+(italics).
+
+Figure 2. Responses of the model to the vowel in 'hat' processed
+through the functional route: (top) basilar membrane motion, (middle)
+neural activity pattern, and (bottom) auditory image.
+
+Figure 3. Responses of the model to the vowel in 'hat' processed
+through the physiological route: (top) basilar membrane motion,
+(middle) neural activity pattern, and (bottom) autocorrelogram image.
+