.TH GENSAI 1 "26 May 1995"
.LP
.SH NAME
.LP
gensai \- generate stabilised auditory image
.LP
.SH SYNOPSIS/SYNTAX
.LP
gensai [ option=value | -option ] filename
.LP
.SH DESCRIPTION
.LP
Periodic sounds give rise to static, rather than oscillating,
perceptions, indicating that temporal integration is applied to the NAP
in the production of our initial perception of a sound -- our auditory
image. Traditionally, auditory temporal integration is represented by
a simple leaky integration process, and AIM provides a bank of lowpass
filters to enable the user to generate auditory spectra (Patterson,
1994a) and auditory spectrograms (Patterson et al., 1992b). However,
the leaky integrator removes the phase-locked fine structure observed
in the NAP, and this conflicts with perceptual data indicating that
the fine structure plays an important role in determining sound
quality and source identification (Patterson, 1994b; Patterson and
Akeroyd, 1995). As a result, AIM includes two modules which preserve
much of the time-interval information in the NAP during temporal
integration, and which produce a better representation of our auditory
images. In the functional version of AIM, this is accomplished with
strobed temporal integration (Patterson et al., 1992a,b), and this is
the topic of this manual entry.
.LP
In the physiological version of AIM, the auditory image is constructed
with a bank of autocorrelators (Slaney and Lyon, 1990; Meddis and
Hewitt, 1991).
The autocorrelation module is an aimTool rather than
an integral part of the main program 'gen'. The appropriate tool is
'acgram'; type 'manaim acgram' for the documentation. The module
extracts periodicity information and preserves intra-period fine
structure by autocorrelating each channel of the NAP separately. The
correlogram is the multi-channel version of this process; it was
originally introduced as a model of pitch perception (Licklider,
1951). It is not yet known whether STI or autocorrelation is the more
realistic, or the more efficient, means of simulating our perceived
auditory images. At present, the purpose is to provide a software
package that can be used to compare these auditory representations in
a way not previously possible.
.RE
.LP
.SH STROBED TEMPORAL INTEGRATION
.PP
In strobed temporal integration, a bank of delay lines is used to form
a buffer store for the NAP, one delay line per channel, and as the NAP
proceeds along the buffer it decays linearly with time, at about 2.5
%/ms. Each channel of the buffer is assigned a strobe unit which
monitors activity in that channel, looking for local maxima in the
stream of NAP pulses. When one is found, the unit initiates temporal
integration in that channel; that is, it transfers a copy of the NAP
at that instant to the corresponding channel of an image buffer and
adds it point-for-point with whatever is already there. The local
maximum itself is mapped to the 0-ms point in the image buffer. The
multi-channel version of this STI process is AIM's representation of
our auditory image of a sound.
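.LP
The per-channel operation just described can be sketched in code. The following Python fragment is an illustrative sketch only, not the gensai implementation: the sample period, the way the linear NAP decay is applied, and the simple local-maximum strobe criterion are all assumptions made for clarity.

```python
def strobed_temporal_integration(nap, image, dt_ms=0.1, nap_decay=0.025):
    """Sketch of STI for one channel.

    nap   : one channel of the NAP buffer, oldest sample first
    image : running auditory-image buffer for this channel
    """
    n = len(nap)
    # The NAP decays linearly (~2.5 %/ms) according to how long each
    # sample has been in the buffer; clip at zero.
    decayed = [max(0.0, s * (1.0 - nap_decay * dt_ms * (n - 1 - i)))
               for i, s in enumerate(nap)]
    for t in range(1, n - 1):
        # Strobe unit: fire on a local maximum in the pulse stream.
        if decayed[t - 1] < decayed[t] >= decayed[t + 1] and decayed[t] > 0.0:
            # Transfer a copy of the NAP so the strobe point maps to
            # 0 ms, adding point-for-point to the image buffer.
            for interval in range(min(t + 1, len(image))):
                image[interval] += decayed[t - interval]
    return image
```

With periodic input the strobes recur once per period, so successive copies of the NAP add constructively and the image buffer converges on a static pattern that retains the NAP's temporal resolution.
.LP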
Periodic and quasi-periodic sounds
cause regular strobing, which leads to simulated auditory images that
are static, or nearly static, but with the same temporal resolution as
the NAP. Dynamic sounds are represented as a sequence of auditory
image frames. If the rate of change in a sound is not too rapid, as in
diphthongs, features are seen to move smoothly as the sound proceeds,
much as objects move smoothly in animated cartoons.
.LP
It is important to emphasise that the triggering is done on a
channel-by-channel basis and that triggering is asynchronous
across channels, inasmuch as the major peaks in one channel occur
at different times from the major peaks in other channels. It
is this aspect of the triggering process that causes the
alignment of the auditory image and which accounts for the loss
of phase information in the auditory system (Patterson, 1987).
.LP
The auditory image has the same vertical dimension as the neural
activity pattern (filter centre frequency). The continuous time
dimension of the neural activity pattern becomes a local,
time-interval dimension in the auditory image; specifically, it is
"the time interval between a given pulse and the succeeding strobe
pulse". In order to preserve the direction of asymmetry of features
that appear in the NAP, the time-interval origin is plotted towards
the right-hand edge of the image, with increasing, positive time
intervals proceeding towards the left.
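.LP
For comparison, the autocorrelation alternative used in the physiological version of AIM reduces, per channel, to forming products of the NAP with delayed copies of itself. This Python fragment is a minimal sketch (the window and lag range are arbitrary assumptions, and the 'acgram' tool itself may differ):

```python
def channel_autocorrelation(nap, max_lag):
    # Short-term autocorrelation of one NAP channel: for each lag,
    # sum the products of the NAP with a copy delayed by that lag.
    return [sum(nap[t] * nap[t + lag] for t in range(len(nap) - lag))
            for lag in range(max_lag)]
```

Applying this function to every channel of the NAP yields the correlogram, a frequency-by-lag surface; a periodic sound produces a ridge at the lag equal to its period.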
.LP
.SH OPTIONS
.LP
.SS "Display options for the auditory image"
.PP
The options that control the positioning of the window in which the
auditory image appears are the same as those used to set up the
earlier windows, as are the options that control the level of the
image within the display. In addition, there are three new options
that are required to present this new auditory representation. The
options are frstep_aid, pwid_aid, and nwid_aid; the suffix "_aid"
means "auditory image display". These options are described here
before the options that control the image construction process itself,
as they occur first in the options list. There are also three extra
display options for presenting the auditory image in its spiral form;
these options have the suffix "_spd" for "spiral display"; they are
described in the manual entry for 'genspl'.
.LP
.TP 17
frstep_aid
The frame step interval, or the update interval for the auditory image display
.RS
Default units: ms. Default value: 16 ms.
.RE
.RS
Conceptually, the auditory image exists continuously in time. The
simulation of the image produced by AIM is not continuous; rather, it
is like an animated cartoon. Frames of the cartoon are calculated at
discrete points in time, and then the sequence of frames is replayed
to reveal the dynamics of the sound, or the lack of dynamics in the
case of periodic sounds.
When the sound is changing at a rate where
we hear smooth glides, the structures in the simulated auditory image
move much like objects in a cartoon. frstep_aid determines the time
interval between frames of the auditory image cartoon. Frames are
calculated at time zero and at integer multiples of segment_sai.
.RE
.LP
The default value (16 ms) is reasonable for musical sounds and speech
sounds. For a detailed examination of the development of the image of
brief transient sounds, frstep_aid should be decreased to 4 or even 2
ms.
.LP
.TP 16
pwidth_sai
The maximum positive time interval presented in the display of the
auditory image (to the left of 0 ms).
.RS
Default units: ms. Default value: 35 ms.
.RE
.LP
.TP 16
nwidth_sai
The maximum negative time interval presented in the display of the
auditory image (to the right of 0 ms).
.RS
Default units: ms. Default value: -5 ms.
.RE
.LP
.TP 12
animate
Present the frames of the simulated auditory image as a cartoon.
.RS
Switch. Default: off.
.RE
.RS
With reasonable resolution and a reasonable frame rate, the auditory
cartoon for a second of sound will require on the order of 1 Mbyte of
storage. As a result, auditory cartoons are only stored at the
specific request of the user.
When the animate flag is set to `on',
the bit maps that constitute the frames of the auditory cartoon are
stored in computer memory. They can then be replayed as an auditory
cartoon by pressing `carriage return'. To exit, type
"q" for `quit' or "control c". The bit maps are discarded unless
option bitmap=on.
.RE
.LP
.SS "Storage options for the auditory image"
.PP
A record of the auditory image can be stored in two ways, depending on
the purpose for which it is stored. The actual numerical values of
the auditory image can be stored, as previously, by setting output=on.
In this case, a file with a .sai suffix will be created in accordance
with the conventions of the software. These values can be recalled
for further processing with the aimTools. In this regard the SAI
module is like any previous module.
.LP
It is also possible to store the bit maps which are displayed on
the screen for the auditory image cartoon. The bit maps require
less storage space and reload more quickly, so this is the
preferred mode of storage when one simply wants to review the
visual image.
.LP
.TP 10
bitmap
Produce a bit-map storage file
.RS
Switch. Default value: off.
.RE
.RS
When the bitmap option is set to `on', the bit maps are stored in a
file with the suffix .ctn. The bit maps are reloaded into memory using
the command review, or xreview, followed by the file name without the
suffix .ctn.
The auditory image can then be replayed, as with animate,
by typing `carriage return'. xreview is the newer and preferred
display routine. It enables the user to select subsets of the cartoon
and to change the rate of play via a convenient control window.
.RE
.LP
The strobe mechanism is relatively simple. A trigger threshold
value is maintained for each channel, and when a NAP pulse exceeds
the threshold a trigger pulse is generated at the time associated
with the maximum of the peak. The threshold value is then reset
to a value somewhat above the height of the current NAP peak, and
the threshold value decays exponentially with time thereafter.
.LP
There are six options with the suffix "_ai", short for
'auditory image'. Four of these control STI itself -- stdecay_ai,
stcrit_ai, stthresh_ai and decay_ai. The option stinfo_ai is a switch
that causes the software to produce information about the current STI
analysis for demonstration or diagnostic purposes. The final option,
napdecay_ai, controls the decay rate for the NAP while it flows down
the NAP buffer.
.LP
.TP 17
napdecay_ai
Decay rate for the neural activity pattern (NAP)
.RS
Default units: %/ms. Default value: 2.5 %/ms.
.RE
.RS
napdecay_ai determines the rate at which the information in the neural
activity pattern decays as it proceeds along the auditory buffer that
stores the NAP prior to temporal integration.
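The decay is linear, so at the default rate of 2.5 %/ms a NAP sample loses all of its amplitude after 1/0.025 = 40 ms in the buffer. A hypothetical helper (not part of gensai) illustrating the arithmetic:

```python
def nap_amplitude(initial, age_ms, rate=0.025):
    # Linear NAP decay at `rate` (fraction per ms), clipped at zero.
    return max(0.0, initial * (1.0 - rate * age_ms))

# After 20 ms, half the amplitude remains; after 40 ms, none does.
```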
.RE
.LP
.TP 16
stdecay_ai
Strobe threshold decay rate
.RS
Default units: %/ms. Default value: 5 %/ms.
.RE
.RS
stdecay_ai determines the rate at which the strobe threshold decays.
.RE
.LP
General-purpose pitch mechanisms based on peak picking are
notoriously difficult to design, and the trigger mechanism just
described would not work well on an arbitrary acoustic waveform.
The reason that this simple trigger mechanism is sufficient for
the construction of the auditory image is that NAP functions are
highly constrained. The microstructure reveals a function that
rises from zero to a local maximum smoothly and returns smoothly
back to zero, where it stays for more than half of a period of the
centre frequency of that channel. On the longer time scale, the
amplitude of successive peaks changes only relatively slowly with
respect to time. As a result, for periodic sounds there tends
to be one clear maximum per period in all but the lowest channels,
where there is an integer number of maxima per period. The
simplicity of the NAP functions follows from the fact that the
acoustic waveform has passed through a narrowband filter and so
it has a limited number of degrees of freedom. In all but the
highest frequency channels, the output of the auditory filter
resembles a modulated sine wave whose frequency is near the
centre frequency of the filter.
Thus the neural activity pattern
is largely restricted to a set of peaks which are modified
versions of the positive halves of a sine wave, and the remaining
degrees of freedom appear as relatively slow changes in peak
amplitude and relatively small changes in peak time (or phase).
.LP
When the acoustic input terminates, the auditory image must
decay. In the ASP model the form of the decay is exponential, and
the decay rate is determined by decay_ai.
.LP
.TP 18
decay_ai
SAI decay time constant
.RS
Default units: ms. Default value: 30 ms.
.RE
.RS
decay_ai determines the rate at which the auditory image decays.
.RE
.RS
In addition, decay_ai determines the rate at which the strength of the
auditory image increases and the level to which it asymptotes if the
sound continues indefinitely. In an exponential process, the asymptote
is reached when the increment provided by each new cycle of the sound
equals the amount that the image decays over the same period.
.RE
.SH MOTIVATION
.LP
.SS "Auditory temporal integration: The problem"
.PP
Image stabilisation and temporal smearing.
.LP
When the input to the auditory system is a periodic sound like
pt_8ms or ae_8ms, the output of the cochlea is a rapidly flowing
neural activity pattern on which the information concerning the
source repeats every 8 ms.
Consider the display problem that
would arise if one attempted to present a one-second sample of
either pt_8ms or ae_8ms with the resolution and format of Figure
5.2. In that figure each 8-ms period of the sound occupies about
4 cm of width. There are 125 repetitions of the period in one
second, and so a paper version of the complete NAP would be 5
metres in length. If the NAP were presented as a real-time flow
process, the paper would have to move past a typical window at
the rate of 5 metres per second! At this rate, the temporal
detail within the cycle would be lost. The image would be stable
but the information would be reduced to horizontal banding. The
fine-grain temporal information is lost because the integration
time of the visual system is long with respect to the rate of
flow of information when the record is moving at 5 metres a
second.
.LP
Traditional models of auditory temporal integration are similar
to visual models. They assume that we hear a stable auditory
image in response to a periodic sound because the neural activity
is passed through a temporal weighting function that integrates
over time. The output does not fluctuate if the integration time
is long enough. Unfortunately, this simple model of temporal
integration does not work for the auditory system. If the output
is to be stable, the integrator must integrate over 10 or more
cycles of the sound. We hear stable images for pitches as low
as, say, 50 cycles per second, which suggests that the integration
time of the auditory system would have to be 200 ms at the
minimum.
Such an integrator would cause far more smearing of
auditory information than we know occurs. For example, phase
shifts that produce small changes half way through the period of
a pulse train are often audible (see Patterson, 1987, for a
review). Small changes of this sort would be obscured by lengthy
temporal integration.
.LP
Thus the problem in modelling auditory temporal integration is
to determine how the auditory system can integrate information
to form a stable auditory image without losing the fine-grain
temporal information within the individual cycles of periodic
sounds. In visual terms, the problem is how to present a neural
activity pattern at a rate of 5 metres per second while at the
same time enabling the viewer to see features within periods
greater than about 4 ms.
.LP
.SS "Periodic sounds and information packets"
.PP
Now consider temporal integration from an information-processing
perspective, and in particular, the problem of preserving formant
information in the auditory image. The shape of the neural
activity pattern within the period of a vowel sound provides
information about the resonances of the vocal tract (see Figure
3.6), and thus the identity of the vowel. The information about
the source arrives in packets whose duration is the period of the
source. Many of the sounds in speech and music have the property
that the source information changes relatively slowly when
compared with the repetition rate of the source wave (i.e. the
pitch).
Thus, from an information-processing point of view, one
would like to combine source information from neighbouring
packets, while at the same time taking care not to smear the
source information contained within the individual packets. In
short, one would like to perform quantised temporal integration,
integrating over cycles but not within cycles of the sound.
.LP
.SH EXAMPLES
.LP
This first pair of examples is intended to illustrate the
dominant forms of motion that appear in the auditory image, and
the fact that shapes can be tracked across the image provided the
rate of change is not excessive. The first example is a pitch
glide for a note with fixed timbre. The second example involves
formant motion (a form of timbre glide) in a monotone voice (i.e.
for a relatively fixed pitch).
.LP
.SS "A pitch glide in the auditory image"
.PP
Up to this point, we have focussed on the way that TQTI can
convert a fast-flowing NAP pattern into a stabilised auditory
image. The mechanism is not, however, limited to continuous or
stationary sounds. The data file cegc contains pulse trains that
produce pitches near the musical notes C3, E3, G3, and C4, along
with glides from one note to the next. The notes are relatively
long and the pitch glides are relatively slow. As a result, each
note forms a stabilised auditory image and there is smooth motion
from one note image to the next.
The stimulus file cegc is
intended to support several examples, including ones involving the
spiral representation of the auditory image and its relationship
to musical consonance in the next chapter. For brevity, the
current example is limited to the transition from C to E near the
start of the file. The pitch of musical notes is determined by
the lower harmonics when they are present, and so the command for
the example is:
.LP
gensai mag=16 min=100 max=2000 start=100 length=600 duration_sai=32 cegc
.LP
In point of fact, the pulse train associated with the first note
has a period of 8 ms, like pt_8ms, and so this "C" is actually a
little below the musical note C3. Since the initial C is the
same as pt_8ms, the onset of the first note is the same as in the
previous example; however, four cycles of the pulse train pattern
build up in the window because it has been set to show 32 ms of
'auditory image time'. During the transition, the period of the
stimulus decreases from 32/4 ms down to 32/5 ms, and so the image
stabilises with five cycles in the window. The period of E is
4/5 that of C.
.LP
During the transition, in the lower channels associated with the
first and second harmonics, the individual SAI pulses march from
left to right in time and, at the same time, they move up in
frequency as the energy of these harmonics moves out of lower
filters and into higher filters. In these low channels the
motion is relatively smooth because the SAI pulse has a duration
which is a significant proportion of the period of the sound.
As
the pitch rises and the periods get shorter, each new NAP cycle
contributes a NAP pulse which is shifted a little to the right
relative to the corresponding SAI pulse. This increases the
leading edge of the SAI pulse without contributing to the lagging
edge. As a result, the leading edge builds, the lagging edge
decays, and the SAI pulse moves to the right. The SAI pulses are
asymmetric during the motion, with the trailing edge more shallow
than the leading edge, and the effect is greater towards the left
edge of the image because the discrepancies over four cycles are
larger than the discrepancies over one cycle. The effects are
larger for the second harmonic than for the first harmonic
because the width of the pulses of the second harmonic is a
smaller proportion of the period. During the pitch glide the SAI
pulses have a reduced peak height because the activity is
distributed over more channels and over longer durations.
.LP
The SAI pulses associated with the higher harmonics are
relatively narrow with regard to the changes in period during the
pitch glide. As a result there is more blurring of the image
during the glide in the higher channels. Towards the right-hand
edge, for the column that shows correlations over one cycle, the
blurring is minimal. Towards the left-hand edge the details of
the pattern are blurred and we see mainly activity moving in
vertical bands from left to right. When the glide terminates, the
fine structure reforms from right to left across the image and
the stationary image for the note E appears.
.LP
The details of the motion are more readily observed when the
image is played in slow motion. If the disc space is available
(about 1.3 Mbytes), it is useful to generate a cegc.img file
using the image option. The auditory image can then be played
in slow motion using the review command and the slow-down option
"-".
.LP
.SS "Formant motion in the auditory image"
.PP
The vowels of speech are quasi-periodic sounds, and the period for
the average male speaker is on the order of 8 ms. As the
articulators change the shape of the vocal tract during speech,
formants appear in the auditory image and move about. The
position and motion of the formants represent the speech
information conveyed by the voiced parts of speech. When the
speaker uses a monotone voice, the pitch remains relatively
steady and the motion of the formants is essentially in the
vertical dimension. An example of monotone voiced speech is
provided in the file leo, which is the acoustic waveform of the
word 'leo'. The auditory image of leo can be produced using the
command
.LP
gensai mag=12 segment=40 duration_sai=20 leo
.LP
The dominant impression on first observing the auditory image of
leo is the motion in the formation of the "e" sound, the
transition from "e" to "o", and the formation of the "o" sound.
.LP
The vocal cords come on at the start of the "l" sound, but the
tip of the tongue is pressed against the roof of the mouth just
behind the teeth, and so it restricts the air flow and the start
of the "l" does not contain much energy. As a result, in the
auditory image, the presence of the "l" is primarily observed in
the transition from the "l" to the "e". That is, as the three
formants in the auditory image of the "e" come on and grow
stronger, the second formant glides into its "e" position from
below, indicating that the second formant was recently at a lower
frequency for the previous sound.
.LP
In the "e", the first formant is low, centred on the third
harmonic at the bottom of the auditory image. The second formant
is high, up near the third formant. The lower portion of the
fourth formant shows along the upper edge of the image.
Recognition systems that ignore temporal fine structure often
have difficulty determining whether a high-frequency
concentration of energy is a single broad formant or a pair of
narrower formants close together. This makes it more difficult
to distinguish "e". In the auditory image, information about the
pulsing of the vocal cords is maintained, and the temporal
fluctuation of the formant shapes makes it easier to distinguish
that there are two overlapping formants rather than a single
large formant.
.LP
As the "e" changes into the "o", the second formant moves back
down onto the eighth harmonic and the first formant moves up to
a position between the third and fourth harmonics.
The third and
fourth formants remain relatively fixed in frequency, but they
become softer as the "o" takes over. During the transition, the
second formant becomes fuzzy and moves down a set of vertical
ridges at multiples of the period.
.LP
.SS "The vowel triangle: aiua"
.PP
In speech research, the vowels are specified by the centre
frequencies of their formants. The first two formants carry the
most information, and it is common to see sets of vowels
represented on a graph whose axes are the centre frequencies of
the first and second formants. Not all combinations of these
formant frequencies occur in speech; rather, the vowels occupy a
triangular region within this vowel space, and the points of the
triangle are represented by /a/ as in paw, /i/ as in beet, and /u/ as
in toot. The file aiua contains a synthetic speech wave that
provides a tour around the vowel triangle from /a/ to /i/ to /u/
and back to /a/, and there are smooth transitions from one vowel
to the next. The auditory image of aiua can be generated using
the command
.LP
gensai mag=12 segment=40 duration=20 aiua
.LP
The initial vowel /a/ has a high first formant centred on the
fifth harmonic and a low second formant centred between the
seventh and eighth harmonics (for these low formants the harmonic
number can be determined by counting the number of SAI peaks in
one period of the image). The third formant is at the top of the
image and it is reasonably strong, although relatively short in
duration.
As the sound changes from /a/ to /i/, the first formant
moves successively down through the low harmonics and comes to
rest on the second harmonic. At the same time, the second formant
moves all the way up to a position adjacent to the third formant,
similar to the "e" in leo. All three formants are
relatively strong. During the transition from the /i/ to the
/u/, the third formant becomes much weaker. The second formant
moves down onto the seventh harmonic and remains relatively
weak. The first formant remains centred on the second harmonic
and is relatively strong. Finally, the formants return to
their /a/ positions.
.LP
.SS "Speaker separation in the auditory image "
.PP
One of the more intriguing aspects of speech recognition is our
ability to hear out one voice in the presence of competing voices
-- the proverbial cocktail party phenomenon. It is assumed that
we use pitch differences to help separate the voices. In support
of this view, several researchers have presented listeners with
pairs of vowels and shown that listeners can discriminate the vowels
better when the vowels have different pitches (Summerfield and
Assmann, 1989). The final example involves a double-vowel stimulus,
/a/ with /i/, and it shows that stable images of the dominant
formants of both vowels appear in the image. The file dblvow
(double vowel) contains seven double-vowel pulses.
The amplitude
of the /a/ is fixed at a moderate level; the amplitude of the
/i/ begins at a level 12 dB greater than that of the /a/ and it
decreases 4 dB with each successive pulse, so the two vowels are
equal in level in the fourth pulse. Each pulse is 200 ms in duration
with 20-ms rise and fall times that are included within the 200
ms. There are 80-ms silent gaps between pulses and a gap of 80
ms at the start of the file. The auditory image can be generated
with the command
.LP
gensai mag=12 samplerate=10000 segment=40 duration=20 dblvow
.LP
The pitches of the /a/ and the /i/ are 100 and 125 Hz, respectively.
The image reveals a strong first formant centred on the second
harmonic of 125 Hz, and strong third and fourth formants
with a period of 8 ms (125 Hz). These are the formants of the
/i/, which is the stronger of the two vowels at this point. In
between the first and second formants of the /i/ are the first
and second formants of the /a/ at a somewhat lower level. The
formants of the /a/ show their proper period, 10 ms. The
triggering mechanism can stabilise the formants of both vowels
at their proper periods because the triggering is done on a
channel-by-channel basis. The upper formants of the /a/ fall in
the same channels as the upper formants of the /i/ and, since they
are much weaker, they are suppressed by the /i/ formants.
.LP
As the example proceeds, the formants of the /i/ become
progressively weaker.
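.LP
The level schedule and burst timing of dblvow described above can be
tabulated with a short sketch. This is an illustration only; the
function names are hypothetical and not part of the AIM software.

```python
# Relative level and onset time of the /i/ in the seven
# double-vowel pulses of dblvow: the /i/ starts 12 dB above the
# fixed-level /a/ and drops 4 dB with each successive pulse;
# each pulse lasts 200 ms with 80-ms gaps, and there is an
# 80-ms gap at the start of the file.

def i_level_db(pulse):
    """Level of the /i/ relative to the /a/, in dB (pulse = 1..7)."""
    return 12 - 4 * (pulse - 1)

def pulse_onset_ms(pulse):
    """Onset time of each 200-ms pulse, in ms."""
    return 80 + (pulse - 1) * (200 + 80)

print([i_level_db(p) for p in range(1, 8)])
# [12, 8, 4, 0, -4, -8, -12]
print(pulse_onset_ms(5))  # 1200 -- onset of the fifth burst
```

The difference reaches 0 dB at the fourth pulse, which is why the /a/
formants come to dominate the image in the later bursts.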
In the image of the fifth burst of the
double vowel we see evidence of both the upper formants of the
/i/ and the upper formants of the /a/ in the same channels.
Finally, in the last burst the first formant of the /i/ has
disappeared from the lowest channels entirely. There is still
some evidence of the /i/ in the region of the upper formants, but it
is the formants of the /a/ that now dominate in the high-frequency
region.
.LP
.SH SEE ALSO
.LP
.SH COPYRIGHT
.LP
Copyright (c) Applied Psychology Unit, Medical Research Council, 1995
.LP
Permission to use, copy, modify, and distribute this software without fee
is hereby granted for research purposes, provided that this copyright
notice appears in all copies and in all supporting documentation, and that
the software is not redistributed for any fee (except for a nominal
shipping charge). Anyone wanting to incorporate all or part of this
software in a commercial product must obtain a license from the Medical
Research Council.
.LP
The MRC makes no representations about the suitability of this
software for any purpose. It is provided "as is" without express or
implied warranty.
.LP
THE MRC DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING
ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL
THE A.P.U.
BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES
OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS,
WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION,
ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS
SOFTWARE.
.LP
.SH ACKNOWLEDGEMENTS
.LP
The AIM software was developed for Unix workstations by John
Holdsworth and Mike Allerhand of the MRC APU, under the direction of
Roy Patterson. The physiological version of AIM was developed by
Christian Giguere. The options handler is by Paul Manson. The revised
SAI module is by Jay Datta. Michael Akeroyd extended the PostScript
facilities and developed the xreview routine for auditory image
cartoons.
.LP
The project was supported by the MRC and grants from the U.K. Defence
Research Agency, Farnborough (Research Contract 2239); the EEC Esprit
BR Programme, Project ACTS (3207); and the U.K. Hearing Research Trust.