.TH GENSAI 1 "26 May 1995"
.LP
.SH NAME
.LP
gensai \- generate stabilised auditory image
.LP
.SH SYNOPSIS/SYNTAX
.LP
gensai [ option=value | -option ] filename
.LP
.SH DESCRIPTION
.LP
Periodic sounds give rise to static, rather than oscillating,
perceptions, indicating that temporal integration is applied to the NAP
in the production of our initial perception of a sound -- our auditory
image. Traditionally, auditory temporal integration is represented by
a simple leaky integration process, and AIM provides a bank of lowpass
filters to enable the user to generate auditory spectra (Patterson,
1994a) and auditory spectrograms (Patterson et al., 1992b). However,
the leaky integrator removes the phase-locked fine structure observed
in the NAP, and this conflicts with perceptual data indicating that
the fine structure plays an important role in determining sound
quality and source identification (Patterson, 1994b; Patterson and
Akeroyd, 1995). As a result, AIM includes two modules which preserve
much of the time-interval information in the NAP during temporal
integration, and which produce a better representation of our auditory
images. In the functional version of AIM, this is accomplished with
strobed temporal integration (Patterson et al., 1992a,b), and this is
the topic of this manual entry.
.LP
In the physiological version of AIM, the auditory image is constructed
with a bank of autocorrelators (Slaney and Lyon, 1990; Meddis and
Hewitt, 1991).
The autocorrelation module is an aimTool rather than
an integral part of the main program 'gen'. The appropriate tool is
'acgram'; type 'manaim acgram' for the documentation. The module
extracts periodicity information and preserves intra-period fine
structure by autocorrelating each channel of the NAP separately. The
correlogram is the multi-channel version of this process; it was
originally introduced as a model of pitch perception (Licklider,
1951). It is not yet known whether STI or autocorrelation is the more
realistic, or the more efficient, means of simulating our perceived
auditory images. At present, the purpose is to provide a software
package that can be used to compare these auditory representations in
a way not previously possible.
.RE
.LP
.SH STROBED TEMPORAL INTEGRATION
.PP
In strobed temporal integration, a bank of delay lines is used to form
a buffer store for the NAP, one delay line per channel, and as the NAP
proceeds along the buffer it decays linearly with time, at about 2.5
%/ms. Each channel of the buffer is assigned a strobe unit which
monitors activity in that channel, looking for local maxima in the
stream of NAP pulses. When one is found, the unit initiates temporal
integration in that channel; that is, it transfers a copy of the NAP
at that instant to the corresponding channel of an image buffer and
adds it point-for-point with whatever is already there. The local
maximum itself is mapped to the 0-ms point in the image buffer. The
multi-channel version of this STI process is AIM's representation of
our auditory image of a sound.
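.LP
The per-channel operation just described can be sketched in code. The following Python fragment is an illustrative sketch only, not the gensai implementation: the sample period, the way the linear NAP decay is applied, and the simple local-maximum strobe criterion are all assumptions made for clarity.

```python
def strobed_temporal_integration(nap, image, dt_ms=0.1, nap_decay=0.025):
    """Sketch of STI for one channel.

    nap   : one channel of the NAP buffer, oldest sample first
    image : running auditory-image buffer for this channel
    """
    n = len(nap)
    # The NAP decays linearly (~2.5 %/ms) according to how long each
    # sample has been in the buffer; clip at zero.
    decayed = [max(0.0, s * (1.0 - nap_decay * dt_ms * (n - 1 - i)))
               for i, s in enumerate(nap)]
    for t in range(1, n - 1):
        # Strobe unit: fire on a local maximum in the pulse stream.
        if decayed[t - 1] < decayed[t] >= decayed[t + 1] and decayed[t] > 0.0:
            # Transfer a copy of the NAP so the strobe point maps to
            # 0 ms, adding point-for-point to the image buffer.
            for interval in range(min(t + 1, len(image))):
                image[interval] += decayed[t - interval]
    return image
```

With periodic input the strobes recur once per period, so successive copies of the NAP add constructively and the image buffer converges on a static pattern that retains the NAP's temporal resolution.
.LP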
Periodic and quasi-periodic sounds
cause regular strobing, which leads to simulated auditory images that
are static, or nearly static, but with the same temporal resolution as
the NAP. Dynamic sounds are represented as a sequence of auditory
image frames. If the rate of change in a sound is not too rapid, as in
diphthongs, features are seen to move smoothly as the sound proceeds,
much as objects move smoothly in animated cartoons.
.LP
It is important to emphasise that the triggering is done on a
channel-by-channel basis and that triggering is asynchronous
across channels, inasmuch as the major peaks in one channel occur
at different times from the major peaks in other channels. It
is this aspect of the triggering process that causes the
alignment of the auditory image and which accounts for the loss
of phase information in the auditory system (Patterson, 1987).
.LP
The auditory image has the same vertical dimension as the neural
activity pattern (filter centre frequency). The continuous time
dimension of the neural activity pattern becomes a local,
time-interval dimension in the auditory image; specifically, it is
"the time interval between a given pulse and the succeeding strobe
pulse". In order to preserve the direction of asymmetry of features
that appear in the NAP, the time-interval origin is plotted towards
the right-hand edge of the image, with increasing, positive time
intervals proceeding towards the left.
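.LP
For comparison, the autocorrelation alternative used in the physiological version of AIM reduces, per channel, to forming products of the NAP with delayed copies of itself. This Python fragment is a minimal sketch (the window and lag range are arbitrary assumptions, and the 'acgram' tool itself may differ):

```python
def channel_autocorrelation(nap, max_lag):
    # Short-term autocorrelation of one NAP channel: for each lag,
    # sum the products of the NAP with a copy delayed by that lag.
    return [sum(nap[t] * nap[t + lag] for t in range(len(nap) - lag))
            for lag in range(max_lag)]
```

Applying this function to every channel of the NAP yields the correlogram, a frequency-by-lag surface; a periodic sound produces a ridge at the lag equal to its period.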
.LP
.SH OPTIONS
.LP
.SS "Display options for the auditory image"
.PP
The options that control the positioning of the window in which the
auditory image appears are the same as those used to set up the
earlier windows, as are the options that control the level of the
image within the display. In addition, there are three new options
that are required to present this new auditory representation. The
options are frstep_aid, pwid_aid, and nwid_aid; the suffix "_aid"
means "auditory image display". These options are described here
before the options that control the image construction process itself,
as they occur first in the options list. There are also three extra
display options for presenting the auditory image in its spiral form;
these options have the suffix "_spd" for "spiral display"; they are
described in the manual entry for 'genspl'.
.LP
.TP 17
frstep_aid
The frame step interval, or the update interval for the auditory image display
.RS
Default units: ms. Default value: 16 ms.
.RE
.RS
Conceptually, the auditory image exists continuously in time. The
simulation of the image produced by AIM is not continuous; rather, it
is like an animated cartoon. Frames of the cartoon are calculated at
discrete points in time, and then the sequence of frames is replayed
to reveal the dynamics of the sound, or the lack of dynamics in the
case of periodic sounds.
When the sound is changing at a rate where
we hear smooth glides, the structures in the simulated auditory image
move much like objects in a cartoon. frstep_aid determines the time
interval between frames of the auditory image cartoon. Frames are
calculated at time zero and at integer multiples of segment_sai.
.RE
.LP
The default value (16 ms) is reasonable for musical sounds and speech
sounds. For a detailed examination of the development of the image of
brief transient sounds, frstep_aid should be decreased to 4 or even 2
ms.
.LP
.TP 16
pwidth_sai
The maximum positive time interval presented in the display of the
auditory image (to the left of 0 ms).
.RS
Default units: ms. Default value: 35 ms.
.RE
.LP
.TP 16
nwidth_sai
The maximum negative time interval presented in the display of the
auditory image (to the right of 0 ms).
.RS
Default units: ms. Default value: -5 ms.
.RE
.LP
.TP 12
animate
Present the frames of the simulated auditory image as a cartoon.
.RS
Switch. Default: off.
.RE
.RS
With reasonable resolution and a reasonable frame rate, the auditory
cartoon for a second of sound will require on the order of 1 Mbyte of
storage. As a result, auditory cartoons are only stored at the
specific request of the user.
When the animate flag is set to `on',
the bit maps that constitute the frames of the auditory cartoon are
stored in computer memory. They can then be replayed as an auditory
cartoon by pressing `carriage return'. To exit, type
"q" for `quit' or "control c". The bit maps are discarded unless
option bitmap=on.
.RE
.LP
.SS "Storage options for the auditory image"
.PP
A record of the auditory image can be stored in two ways, depending on
the purpose for which it is stored. The actual numerical values of
the auditory image can be stored, as previously, by setting output=on.
In this case, a file with a .sai suffix will be created in accordance
with the conventions of the software. These values can be recalled
for further processing with the aimTools. In this regard the SAI
module is like any previous module.
.LP
It is also possible to store the bit maps which are displayed on
the screen for the auditory image cartoon. The bit maps require
less storage space and reload more quickly, so this is the
preferred mode of storage when one simply wants to review the
visual image.
.LP
.TP 10
bitmap
Produce a bit-map storage file
.RS
Switch. Default value: off.
.RE
.RS
When the bitmap option is set to `on', the bit maps are stored in a
file with the suffix .ctn. The bit maps are reloaded into memory using
the command review, or xreview, followed by the file name without the
suffix .ctn.
The auditory image can then be replayed, as with animate,
by typing `carriage return'. xreview is the newer and preferred
display routine. It enables the user to select subsets of the cartoon
and to change the rate of play via a convenient control window.
.RE
.LP
The strobe mechanism is relatively simple. A trigger threshold
value is maintained for each channel, and when a NAP pulse exceeds
the threshold a trigger pulse is generated at the time associated
with the maximum of the peak. The threshold value is then reset
to a value somewhat above the height of the current NAP peak, and
the threshold value decays exponentially with time thereafter.
.LP
There are six options with the suffix "_ai", short for
'auditory image'. Four of these control STI itself -- stdecay_ai,
stcrit_ai, stthresh_ai and decay_ai. The option stinfo_ai is a switch
that causes the software to produce information about the current STI
analysis for demonstration or diagnostic purposes. The final option,
napdecay_ai, controls the decay rate for the NAP while it flows down
the NAP buffer.
.LP
.TP 17
napdecay_ai
Decay rate for the neural activity pattern (NAP)
.RS
Default units: %/ms. Default value: 2.5 %/ms.
.RE
.RS
napdecay_ai determines the rate at which the information in the neural
activity pattern decays as it proceeds along the auditory buffer that
stores the NAP prior to temporal integration.
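The decay is linear, so at the default rate of 2.5 %/ms a NAP sample loses all of its amplitude after 1/0.025 = 40 ms in the buffer. A hypothetical helper (not part of gensai) illustrating the arithmetic:

```python
def nap_amplitude(initial, age_ms, rate=0.025):
    # Linear NAP decay at `rate` (fraction per ms), clipped at zero.
    return max(0.0, initial * (1.0 - rate * age_ms))

# After 20 ms, half the amplitude remains; after 40 ms, none does.
```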
.RE
.LP
.TP 16
stdecay_ai
Strobe threshold decay rate
.RS
Default units: %/ms. Default value: 5 %/ms.
.RE
.RS
stdecay_ai determines the rate at which the strobe threshold decays.
.RE
.LP
General-purpose pitch mechanisms based on peak picking are
notoriously difficult to design, and the trigger mechanism just
described would not work well on an arbitrary acoustic waveform.
The reason that this simple trigger mechanism is sufficient for
the construction of the auditory image is that NAP functions are
highly constrained. The microstructure reveals a function that
rises from zero to a local maximum smoothly and returns smoothly
back to zero, where it stays for more than half of a period of the
centre frequency of that channel. On the longer time scale, the
amplitude of successive peaks changes only relatively slowly with
respect to time. As a result, for periodic sounds there tends
to be one clear maximum per period in all but the lowest channels,
where there is an integer number of maxima per period. The
simplicity of the NAP functions follows from the fact that the
acoustic waveform has passed through a narrowband filter and so
it has a limited number of degrees of freedom. In all but the
highest frequency channels, the output of the auditory filter
resembles a modulated sine wave whose frequency is near the
centre frequency of the filter.
Thus the neural activity pattern
is largely restricted to a set of peaks which are modified
versions of the positive halves of a sine wave, and the remaining
degrees of freedom appear as relatively slow changes in peak
amplitude and relatively small changes in peak time (or phase).
.LP
When the acoustic input terminates, the auditory image must
decay. In the ASP model the form of the decay is exponential, and
the decay rate is determined by decay_ai.
.LP
.TP 18
decay_ai
SAI decay time constant
.RS
Default units: ms. Default value: 30 ms.
.RE
.RS
decay_ai determines the rate at which the auditory image decays.
.RE
.RS
In addition, decay_ai determines the rate at which the strength of the
auditory image increases and the level to which it asymptotes if the
sound continues indefinitely. In an exponential process, the asymptote
is reached when the increment provided by each new cycle of the sound
equals the amount that the image decays over the same period.
.RE
.SH MOTIVATION
.LP
.SS "Auditory temporal integration: The problem"
.PP
Image stabilisation and temporal smearing.
.LP
When the input to the auditory system is a periodic sound like
pt_8ms or ae_8ms, the output of the cochlea is a rapidly flowing
neural activity pattern on which the information concerning the
source repeats every 8 ms.
Consider the display problem that
would arise if one attempted to present a one-second sample of
either pt_8ms or ae_8ms with the resolution and format of Figure
5.2. In that figure each 8-ms period of the sound occupies about
4 cm of width. There are 125 repetitions of the period in one
second, and so a paper version of the complete NAP would be 5
metres in length. If the NAP were presented as a real-time flow
process, the paper would have to move past a typical window at
the rate of 5 metres per second! At this rate, the temporal
detail within the cycle would be lost. The image would be stable
but the information would be reduced to horizontal banding. The
fine-grain temporal information is lost because the integration
time of the visual system is long with respect to the rate of
flow of information when the record is moving at 5 metres a
second.
.LP
Traditional models of auditory temporal integration are similar
to visual models. They assume that we hear a stable auditory
image in response to a periodic sound because the neural activity
is passed through a temporal weighting function that integrates
over time. The output does not fluctuate if the integration time
is long enough. Unfortunately, this simple model of temporal
integration does not work for the auditory system. If the output
is to be stable, the integrator must integrate over 10 or more
cycles of the sound. We hear stable images for pitches as low
as, say, 50 cycles per second, which suggests that the integration
time of the auditory system would have to be 200 ms at the
minimum.
Such an integrator would cause far more smearing of
auditory information than we know occurs. For example, phase
shifts that produce small changes half way through the period of
a pulse train are often audible (see Patterson, 1987, for a
review). Small changes of this sort would be obscured by lengthy
temporal integration.
.LP
Thus the problem in modelling auditory temporal integration is
to determine how the auditory system can integrate information
to form a stable auditory image without losing the fine-grain
temporal information within the individual cycles of periodic
sounds. In visual terms, the problem is how to present a neural
activity pattern at a rate of 5 metres per second while at the
same time enabling the viewer to see features within periods
greater than about 4 ms.
.LP
.SS "Periodic sounds and information packets"
.PP
Now consider temporal integration from an information-processing
perspective, and in particular, the problem of preserving formant
information in the auditory image. The shape of the neural
activity pattern within the period of a vowel sound provides
information about the resonances of the vocal tract (see Figure
3.6), and thus the identity of the vowel. The information about
the source arrives in packets whose duration is the period of the
source. Many of the sounds in speech and music have the property
that the source information changes relatively slowly when
compared with the repetition rate of the source wave (i.e. the
pitch).
Thus, from an information-processing point of view, one
would like to combine source information from neighbouring
packets, while at the same time taking care not to smear the
source information contained within the individual packets. In
short, one would like to perform quantised temporal integration,
integrating over cycles but not within cycles of the sound.
.LP
.SH EXAMPLES
.LP
This first pair of examples is intended to illustrate the
dominant forms of motion that appear in the auditory image, and
the fact that shapes can be tracked across the image provided the
rate of change is not excessive. The first example is a pitch
glide for a note with fixed timbre. The second example involves
formant motion (a form of timbre glide) in a monotone voice (i.e.
for a relatively fixed pitch).
.LP
.SS "A pitch glide in the auditory image"
.PP
Up to this point, we have focussed on the way that TQTI can
convert a fast-flowing NAP pattern into a stabilised auditory
image. The mechanism is not, however, limited to continuous or
stationary sounds. The data file cegc contains pulse trains that
produce pitches near the musical notes C3, E3, G3, and C4, along
with glides from one note to the next. The notes are relatively
long and the pitch glides are relatively slow. As a result, each
note forms a stabilised auditory image and there is smooth motion
from one note image to the next.
The stimulus file cegc is
intended to support several examples, including ones involving the
spiral representation of the auditory image and its relationship
to musical consonance in the next chapter. For brevity, the
current example is limited to the transition from C to E near the
start of the file. The pitch of musical notes is determined by
the lower harmonics when they are present, and so the command for
the example is:
.LP
gensai mag=16 min=100 max=2000 start=100 length=600 duration_sai=32 cegc
.LP
In point of fact, the pulse train associated with the first note
has a period of 8 ms, like pt_8ms, and so this "C" is actually a
little below the musical note C3. Since the initial C is the
same as pt_8ms, the onset of the first note is the same as in the
previous example; however, four cycles of the pulse train pattern
build up in the window because it has been set to show 32 ms of
'auditory image time'. During the transition, the period of the
stimulus decreases from 32/4 ms down to 32/5 ms, and so the image
stabilises with five cycles in the window. The period of E is
4/5 that of C.
.LP
During the transition, in the lower channels associated with the
first and second harmonics, the individual SAI pulses march from
left to right in time and, at the same time, they move up in
frequency as the energy of these harmonics moves out of lower
filters and into higher filters. In these low channels the
motion is relatively smooth because the SAI pulse has a duration
which is a significant proportion of the period of the sound.
As
the pitch rises and the periods get shorter, each new NAP cycle
contributes a NAP pulse which is shifted a little to the right
relative to the corresponding SAI pulse. This increases the
leading edge of the SAI pulse without contributing to the lagging
edge. As a result, the leading edge builds, the lagging edge
decays, and the SAI pulse moves to the right. The SAI pulses are
asymmetric during the motion, with the trailing edge more shallow
than the leading edge, and the effect is greater towards the left
edge of the image because the discrepancies over four cycles are
larger than the discrepancies over one cycle. The effects are
larger for the second harmonic than for the first harmonic
because the width of the pulses of the second harmonic is a
smaller proportion of the period. During the pitch glide the SAI
pulses have a reduced peak height because the activity is
distributed over more channels and over longer durations.
.LP
The SAI pulses associated with the higher harmonics are
relatively narrow with regard to the changes in period during the
pitch glide. As a result there is more blurring of the image
during the glide in the higher channels. Towards the right-hand
edge, for the column that shows correlations over one cycle, the
blurring is minimal. Towards the left-hand edge the details of
the pattern are blurred and we see mainly activity moving in
vertical bands from left to right. When the glide terminates, the
fine structure reforms from right to left across the image and
the stationary image for the note E appears.
.LP
The details of the motion are more readily observed when the
image is played in slow motion. If the disc space is available
(about 1.3 Mbytes), it is useful to generate a cegc.img file
using the image option. The auditory image can then be played
in slow motion using the review command and the slow-down option
"-".
.LP
.SS "Formant motion in the auditory image"
.PP
The vowels of speech are quasi-periodic sounds, and the period for
the average male speaker is on the order of 8 ms. As the
articulators change the shape of the vocal tract during speech,
formants appear in the auditory image and move about. The
position and motion of the formants represent the speech
information conveyed by the voiced parts of speech. When the
speaker uses a monotone voice, the pitch remains relatively
steady and the motion of the formants is essentially in the
vertical dimension. An example of monotone voiced speech is
provided in the file leo, which is the acoustic waveform of the
word 'leo'. The auditory image of leo can be produced using the
command
.LP
gensai mag=12 segment=40 duration_sai=20 leo
.LP
The dominant impression on first observing the auditory image of
leo is the motion in the formation of the "e" sound, the
transition from "e" to "o", and the formation of the "o" sound.
.LP
The vocal cords come on at the start of the "l" sound, but the
tip of the tongue is pressed against the roof of the mouth just
behind the teeth, and so it restricts the air flow and the start
of the "l" does not contain much energy. As a result, in the
auditory image, the presence of the "l" is primarily observed in
the transition from the "l" to the "e". That is, as the three
formants in the auditory image of the "e" come on and grow
stronger, the second formant glides into its "e" position from
below, indicating that the second formant was recently at a lower
frequency for the previous sound.
.LP
In the "e", the first formant is low, centred on the third
harmonic at the bottom of the auditory image. The second formant
is high, up near the third formant. The lower portion of the
fourth formant shows along the upper edge of the image.
Recognition systems that ignore temporal fine structure often
have difficulty determining whether a high-frequency
concentration of energy is a single broad formant or a pair of
narrower formants close together. This makes it more difficult
to distinguish "e". In the auditory image, information about the
pulsing of the vocal cords is maintained, and the temporal
fluctuation of the formant shapes makes it easier to distinguish
that there are two overlapping formants rather than a single
large formant.
.LP
As the "e" changes into the "o", the second formant moves back
down onto the eighth harmonic and the first formant moves up to
a position between the third and fourth harmonics.
The third and
fourth formants remain relatively fixed in frequency, but they
become softer as the "o" takes over. During the transition, the
second formant becomes fuzzy and moves down a set of vertical
ridges at multiples of the period.
.LP
.SS "The vowel triangle: aiua"
.PP
In speech research, the vowels are specified by the centre
frequencies of their formants. The first two formants carry the
most information, and it is common to see sets of vowels
represented on a graph whose axes are the centre frequencies of
the first and second formants. Not all combinations of these
formant frequencies occur in speech; rather, the vowels occupy a
triangular region within this vowel space, and the points of the
triangle are represented by /a/ as in paw, /i/ as in beet, and /u/ as
in toot. The file aiua contains a synthetic speech wave that
provides a tour around the vowel triangle from /a/ to /i/ to /u/
and back to /a/, and there are smooth transitions from one vowel
to the next. The auditory image of aiua can be generated using
the command
.LP
gensai mag=12 segment=40 duration=20 aiua
.LP
The initial vowel /a/ has a high first formant centred on the
fifth harmonic and a low second formant centred between the
seventh and eighth harmonics (for these low formants the harmonic
number can be determined by counting the number of SAI peaks in
one period of the image). The third formant is at the top of the
image and it is reasonably strong, although relatively short in
duration.
As the sound changes from /a/ to /i/, the first formant
moves successively down through the low harmonics and comes to
rest on the second harmonic. At the same time, the second formant
moves all the way up to a position adjacent to the third formant,
similar to the "e" in leo. All three formants are
relatively strong. During the transition from the /i/ to the
/u/, the third formant becomes much weaker. The second formant
moves down onto the seventh harmonic and remains relatively
weak. The first formant remains centred on the second harmonic
and is relatively strong. Finally, the formants return to
their /a/ positions.
.LP
.SS "Speaker separation in the auditory image "
.PP
One of the more intriguing aspects of speech recognition is our
ability to hear out one voice in the presence of competing voices
-- the proverbial cocktail party phenomenon. It is assumed that
we use pitch differences to help separate the voices. In support
of this view, several researchers have presented listeners with
pairs of vowels and shown that listeners can discriminate the vowels
better when the vowels have different pitches (Summerfield and
Assmann, 1989). The final example involves a double-vowel stimulus,
/a/ with /i/, and it shows that stable images of the dominant
formants of both vowels appear in the image. The file dblvow
(double vowel) contains seven double-vowel pulses.
The amplitude
of the /a/ is fixed at a moderate level; the amplitude of the
/i/ begins at a level 12 dB greater than that of the /a/ and it
decreases 4 dB with each successive pulse, so the two vowels are
equal in level in the fourth pulse. Each pulse is 200 ms in duration
with 20-ms rise and fall times that are included within the 200
ms. There are 80-ms silent gaps between pulses and a gap of 80
ms at the start of the file. The auditory image can be generated
with the command
.LP
gensai mag=12 samplerate=10000 segment=40 duration=20 dblvow
.LP
The pitches of the /a/ and the /i/ are 100 and 125 Hz, respectively.
The image reveals a strong first formant centred on the second
harmonic of 125 Hz, and strong third and fourth formants
with a period of 8 ms (125 Hz). These are the formants of the
/i/, which is the stronger of the two vowels at this point. In
between the first and second formants of the /i/ are the first
and second formants of the /a/ at a somewhat lower level. The
formants of the /a/ show their proper period, 10 ms. The
triggering mechanism can stabilise the formants of both vowels
at their proper periods because the triggering is done on a
channel-by-channel basis. The upper formants of the /a/ fall in
the same channels as the upper formants of the /i/ and, since they
are much weaker, they are suppressed by the /i/ formants.
.LP
As the example proceeds, the formants of the /i/ become
progressively weaker.
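.LP
The level schedule and burst timing of dblvow described above can be
tabulated with a short sketch. This is an illustration only; the
function names are hypothetical and not part of the AIM software.

```python
# Relative level and onset time of the /i/ in the seven
# double-vowel pulses of dblvow: the /i/ starts 12 dB above the
# fixed-level /a/ and drops 4 dB with each successive pulse;
# each pulse lasts 200 ms with 80-ms gaps, and there is an
# 80-ms gap at the start of the file.

def i_level_db(pulse):
    """Level of the /i/ relative to the /a/, in dB (pulse = 1..7)."""
    return 12 - 4 * (pulse - 1)

def pulse_onset_ms(pulse):
    """Onset time of each 200-ms pulse, in ms."""
    return 80 + (pulse - 1) * (200 + 80)

print([i_level_db(p) for p in range(1, 8)])
# [12, 8, 4, 0, -4, -8, -12]
print(pulse_onset_ms(5))  # 1200 -- onset of the fifth burst
```

The difference reaches 0 dB at the fourth pulse, which is why the /a/
formants come to dominate the image in the later bursts.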
In the image of the fifth burst of the
double vowel we see evidence of both the upper formants of the
/i/ and the upper formants of the /a/ in the same channels.
Finally, in the last burst the first formant of the /i/ has
disappeared from the lowest channels entirely. There is still
some evidence of the /i/ in the region of the upper formants, but it
is the formants of the /a/ that now dominate in the high-frequency
region.
.LP
.SH SEE ALSO
.LP
.SH COPYRIGHT
.LP
Copyright (c) Applied Psychology Unit, Medical Research Council, 1995
.LP
Permission to use, copy, modify, and distribute this software without fee
is hereby granted for research purposes, provided that this copyright
notice appears in all copies and in all supporting documentation, and that
the software is not redistributed for any fee (except for a nominal
shipping charge). Anyone wanting to incorporate all or part of this
software in a commercial product must obtain a license from the Medical
Research Council.
.LP
The MRC makes no representations about the suitability of this
software for any purpose. It is provided "as is" without express or
implied warranty.
.LP
THE MRC DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING
ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL
THE A.P.U.
BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES
OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS,
WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION,
ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS
SOFTWARE.
.LP
.SH ACKNOWLEDGEMENTS
.LP
The AIM software was developed for Unix workstations by John
Holdsworth and Mike Allerhand of the MRC APU, under the direction of
Roy Patterson. The physiological version of AIM was developed by
Christian Giguere. The options handler is by Paul Manson. The revised
SAI module is by Jay Datta. Michael Akeroyd extended the PostScript
facilities and developed the xreview routine for auditory image
cartoons.
.LP
The project was supported by the MRC and grants from the U.K. Defence
Research Agency, Farnborough (Research Contract 2239); the EEC Esprit
BR Programme, Project ACTS (3207); and the U.K. Hearing Research Trust.