man/man1/gensai.1 @ 0:5242703e91d3 tip
Initial checkin for AIM92 aimR8.2 (last updated May 1997).
author: tomwalters
date: Fri, 20 May 2011 15:19:45 +0100
.TH GENSAI 1 "26 May 1995" .LP .SH NAME .LP gensai \- generate stabilised auditory image .LP .SH SYNOPSIS/SYNTAX .LP gensai [ option=value | -option ] filename .LP .SH DESCRIPTION .LP Periodic sounds give rise to static, rather than oscillating, perceptions indicating that temporal integration is applied to the NAP in the production of our initial perception of a sound -- our auditory image. Traditionally, auditory temporal integration is represented by a simple leaky integration process and AIM provides a bank of lowpass filters to enable the user to generate auditory spectra (Patterson, 1994a) and auditory spectrograms (Patterson et al., 1992b). However, the leaky integrator removes the phase-locked fine structure observed in the NAP, and this conflicts with perceptual data indicating that the fine structure plays an important role in determining sound quality and source identification (Patterson, 1994b; Patterson and Akeroyd, 1995). As a result, AIM includes two modules which preserve much of the time-interval information in the NAP during temporal integration, and which produce a better representation of our auditory images. In the functional version of AIM, this is accomplished with strobed temporal integration (Patterson et al., 1992a,b), and this is the topic of this manual entry. .LP In the physiological version of AIM, the auditory image is constructed with a bank of autocorrelators (Slaney and Lyon, 1990; Meddis and Hewitt, 1991). The autocorrelation module is an aimTool rather than an integral part of the main program 'gen'. The appropriate tool is 'acgram'. Type 'manaim acgram' for the documentation. The module extracts periodicity information and preserves intra-period fine structure by autocorrelating each channel of the NAP separately. The correlogram is the multi-channel version of this process. It was originally introduced as a model of pitch perception (Licklider, 1951). 
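The per-channel autocorrelation underlying the correlogram can be sketched as follows (Python; a purely illustrative reimplementation with invented names, not the aimTools 'acgram' code):

```python
import numpy as np

def correlogram(nap, max_lag):
    """Autocorrelate each channel of a NAP separately.

    nap     : array of shape (channels, samples), a simulated neural
              activity pattern (non-negative activity values).
    max_lag : number of lags to compute, in samples.

    Returns an array of shape (channels, max_lag); each row is the
    autocorrelation of one channel, and the multi-channel stack is
    the correlogram.
    """
    channels, samples = nap.shape
    out = np.zeros((channels, max_lag))
    for ch in range(channels):
        x = nap[ch]
        for lag in range(max_lag):
            # correlate the channel with a delayed copy of itself
            out[ch, lag] = np.dot(x[:samples - lag], x[lag:])
    return out
```

For a periodic channel the rows show peaks at lags equal to multiples of the period, which is how the correlogram extracts periodicity while preserving the intra-period fine structure of each channel.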
It is not yet known whether STI or autocorrelation is more realistic, or more efficient, as a means of simulating our perceived auditory images. At present, the purpose is to provide a software package that can be used to compare these auditory representations in a way not previously possible. .RE .LP .SH STROBED TEMPORAL INTEGRATION .PP In strobed temporal integration, a bank of delay lines is used to form a buffer store for the NAP, one delay line per channel, and as the NAP proceeds along the buffer it decays linearly with time, at about 2.5 %/ms. Each channel of the buffer is assigned a strobe unit which monitors activity in that channel looking for local maxima in the stream of NAP pulses. When one is found, the unit initiates temporal integration in that channel; that is, it transfers a copy of the NAP at that instant to the corresponding channel of an image buffer and adds it point-for-point with whatever is already there. The local maximum itself is mapped to the 0-ms point in the image buffer. The multi-channel version of this STI process is AIM's representation of our auditory image of a sound. Periodic and quasi-periodic sounds cause regular strobing which leads to simulated auditory images that are static, or nearly static, but with the same temporal resolution as the NAP. Dynamic sounds are represented as a sequence of auditory image frames. If the rate of change in a sound is not too rapid, as in diphthongs, features are seen to move smoothly as the sound proceeds, much as objects move smoothly in animated cartoons. .LP It is important to emphasise that the triggering is done on a channel-by-channel basis and that triggering is asynchronous across channels, inasmuch as the major peaks in one channel occur at different times from the major peaks in other channels. It is this aspect of the triggering process that causes the alignment of the auditory image and accounts for the loss of phase information in the auditory system (Patterson, 1987). 
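A single channel of this strobed temporal integration can be sketched as follows (Python; an illustrative simplification of the prose above: the 2.5 %/ms buffer decay and the image decay are omitted, every local maximum strobes, and all names are invented):

```python
import numpy as np

def sti_channel(nap, width):
    """One channel of strobed temporal integration (illustrative).

    nap   : 1-D array, one channel of the NAP (non-negative).
    width : image-buffer length in samples (positive time intervals).

    Each local maximum in the NAP acts as a strobe: the NAP history
    preceding (and including) the strobe is added point-for-point
    into the image buffer, with the strobe itself mapped to index 0,
    the 0-ms point.  (In the display the 0-ms point sits towards the
    right-hand edge; here index k is simply the time interval, in
    samples, between a pulse and the succeeding strobe.)
    """
    image = np.zeros(width)
    for t in range(1, len(nap) - 1):
        if nap[t] > nap[t - 1] and nap[t] >= nap[t + 1] and nap[t] > 0:
            # strobe found: integrate the NAP history into the image
            for k in range(width):
                if t - k >= 0:
                    image[k] += nap[t - k]
    return image
```

For a periodic pulse train with period P samples, the accumulated image shows activity only at intervals 0, P, 2P, ..., which is why regular strobing yields a static image with NAP-level temporal resolution.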
.LP The auditory image has the same vertical dimension as the neural activity pattern (filter centre frequency). The continuous time dimension of the neural activity pattern becomes a local, time-interval dimension in the auditory image; specifically, it is "the time interval between a given pulse and the succeeding strobe pulse". In order to preserve the direction of asymmetry of features that appear in the NAP, the time-interval origin is plotted towards the right-hand edge of the image, with increasing, positive time intervals proceeding towards the left. .LP .SH OPTIONS .LP .SS "Display options for the auditory image" .PP The options that control the positioning of the window in which the auditory image appears are the same as those used to set up the earlier windows, as are the options that control the level of the image within the display. In addition, there are three new options that are required to present this new auditory representation. The options are frstep_aid, pwid_aid, and nwid_aid; the suffix "_aid" means "auditory image display". These options are described here before the options that control the image construction process itself, as they occur first in the options list. There are also three extra display options for presenting the auditory image in its spiral form; these options have the suffix "_spd" for "spiral display"; they are described in the manual entry for 'genspl'. .LP .TP 17 frstep_aid The frame step interval, or the update interval for the auditory image display .RS Default units: ms. Default value: 16 ms. .RE .RS Conceptually, the auditory image exists continuously in time. The simulation of the image produced by AIM is not continuous; rather it is like an animated cartoon. Frames of the cartoon are calculated at discrete points in time, and then the sequence of frames is replayed to reveal the dynamics of the sound, or the lack of dynamics in the case of periodic sounds. 
When the sound is changing at a rate where we hear smooth glides, the structures in the simulated auditory image move much like objects in a cartoon. frstep_aid determines the time interval between frames of the auditory image cartoon. Frames are calculated at time zero and integer multiples of frstep_aid. .RE The default value (16 ms) is reasonable for musical sounds and speech sounds. For a detailed examination of the development of the image of brief transient sounds, frstep_aid should be decreased to 4 or even 2 ms. .LP .TP 16 pwid_aid The maximum positive time interval presented in the display of the auditory image (to the left of 0 ms). .RS Default units: ms. Default value: 35 ms. .RE .LP .TP 16 nwid_aid The maximum negative time interval presented in the display of the auditory image (to the right of 0 ms). .RS Default units: ms. Default value: -5 ms. .RE .LP .TP 12 animate Present the frames of the simulated auditory image as a cartoon. .RS Switch. Default value: off. .RE .RS With reasonable resolution and a reasonable frame rate, the auditory cartoon for a second of sound will require on the order of 1 Mbyte of storage. As a result, auditory cartoons are only stored at the specific request of the user. When the animate flag is set to `on', the bit maps that constitute the frames of the auditory cartoon are stored in computer memory. They can then be replayed as an auditory cartoon by pressing `carriage return'. To exit, type "q" for `quit' or "control c". The bit maps are discarded unless option bitmap=on. .RE .LP .SS "Storage options for the auditory image" .PP A record of the auditory image can be stored in two ways depending on the purpose for which it is stored. The actual numerical values of the auditory image can be stored as previously, by setting output=on. In this case, a file with a .sai suffix will be created in accordance with the conventions of the software. These values can be recalled for further processing with the aimTools. 
In this regard the SAI module is like any previous module. .LP It is also possible to store the bit maps which are displayed on the screen for the auditory image cartoon. The bit maps require less storage space and reload more quickly, so this is the preferred mode of storage when one simply wants to review the visual image. .LP .TP 10 bitmap Produce a bit-map storage file .RS Switch. Default value: off. .RE .RS When the bitmap option is set to `on', the bit maps are stored in a file with the suffix .ctn. The bitmaps are reloaded into memory using the command review, or xreview, followed by the file name without the suffix .ctn. The auditory image can then be replayed, as with animate, by typing `carriage return'. xreview is the newer and preferred display routine. It enables the user to select subsets of the cartoon and to change the rate of play via a convenient control window. .RE .LP The strobe mechanism is relatively simple. A trigger threshold value is maintained for each channel and when a NAP pulse exceeds the threshold a trigger pulse is generated at the time associated with the maximum of the peak. The threshold value is then reset to a value somewhat above the height of the current NAP peak and the threshold value decays exponentially with time thereafter. There are six options with the suffix "_ai", short for 'auditory image'. Four of these control STI itself -- stdecay_ai, stcrit_ai, stthresh_ai and decay_ai. The option stinfo_ai is a switch that causes the software to produce information about the current STI analysis for demonstration or diagnostic purposes. The final option, napdecay_ai, controls the decay rate for the NAP while it flows down the NAP buffer. .LP .TP 17 napdecay_ai Decay rate for the neural activity pattern (NAP) .RS Default units: %/ms. Default value: 2.5 %/ms. 
.RE .RS napdecay_ai determines the rate at which the information in the neural activity pattern decays as it proceeds along the auditory buffer that stores the NAP prior to temporal integration. .RE .LP .TP 16 stdecay_ai Strobe threshold decay rate .RS Default units: %/ms. Default value: 5 %/ms. .RE .RS stdecay_ai determines the rate at which the strobe threshold decays. .RE .LP General purpose pitch mechanisms based on peak picking are notoriously difficult to design, and the trigger mechanism just described would not work well on an arbitrary acoustic waveform. The reason that this simple trigger mechanism is sufficient for the construction of the auditory image is that NAP functions are highly constrained. The microstructure reveals a function that rises from zero to a local maximum smoothly and returns smoothly back to zero where it stays for more than half of a period of the centre frequency of that channel. On the longer time scale, the amplitude of successive peaks changes only relatively slowly with respect to time. As a result, for periodic sounds there tends to be one clear maximum per period in all but the lowest channels, where there is an integer number of maxima per period. The simplicity of the NAP functions follows from the fact that the acoustic waveform has passed through a narrow band filter and so it has a limited number of degrees of freedom. In all but the highest frequency channels, the output of the auditory filter resembles a modulated sine wave whose frequency is near the centre frequency of the filter. Thus the neural activity pattern is largely restricted to a set of peaks which are modified versions of the positive halves of a sine wave, and the remaining degrees of freedom appear as relatively slow changes in peak amplitude and relatively small changes in peak time (or phase). .LP When the acoustic input terminates, the auditory image must decay. 
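The trigger threshold scheme described above (a strobe when a NAP peak exceeds the threshold, a reset somewhat above the peak, and exponential decay thereafter, cf. stdecay_ai) can be sketched as follows (Python; the reset factor and per-sample decay rate are invented illustrative values, not AIM's defaults):

```python
def strobe_times(nap, reset_factor=1.1, decay_per_sample=0.05):
    """Find strobe points with an adaptive threshold (illustrative).

    A NAP pulse that exceeds the current threshold generates a strobe
    at the time of the peak's maximum; the threshold is then reset
    somewhat above that peak (reset_factor) and decays exponentially
    (decay_per_sample) until the next strobe.
    """
    threshold = 0.0
    strobes = []
    for t in range(1, len(nap) - 1):
        is_peak = nap[t] > nap[t - 1] and nap[t] >= nap[t + 1]
        if is_peak and nap[t] > threshold:
            strobes.append(t)
            threshold = reset_factor * nap[t]
        # exponential decay of the threshold between strobes
        threshold *= (1.0 - decay_per_sample)
    return strobes
```

Because each strobe lifts the threshold above the current peak, the small secondary ripples between the main NAP peaks of a periodic sound fall below threshold and do not strobe, which is why this simple mechanism yields roughly one strobe per period.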
In the ASP model the form of the decay is exponential and the decay rate is determined by decay_ai. .LP .TP 18 decay_ai SAI decay time constant .RS Default units: ms. Default value: 30 ms. .RE .RS decay_ai determines the rate at which the auditory image decays. .RE .RS In addition, decay_ai determines the rate at which the strength of the auditory image increases and the level to which it asymptotes if the sound continues indefinitely. In an exponential process, the asymptote is reached when the increment provided by each new cycle of the sound equals the amount that the image decays over the same period. .RE .SH MOTIVATION .LP .SS "Auditory temporal integration: The problem" .PP Image stabilisation and temporal smearing. .LP When the input to the auditory system is a periodic sound like pt_8ms or ae_8ms, the output of the cochlea is a rapidly flowing neural activity pattern on which the information concerning the source repeats every 8 ms. Consider the display problem that would arise if one attempted to present a one second sample of either pt_8ms or ae_8ms with the resolution and format of Figure 5.2. In that figure each 8 ms period of the sound occupies about 4 cm of width. There are 125 repetitions of the period in one second and so a paper version of the complete NAP would be 5 metres in length. If the NAP were presented as a real-time flow process, the paper would have to move past a typical window at the rate of 5 metres per second! At this rate, the temporal detail within the cycle would be lost. The image would be stable but the information would be reduced to horizontal banding. The fine-grain temporal information is lost because the integration time of the visual system is long with respect to the rate of flow of information when the record is moving at 5 metres a second. .LP Traditional models of auditory temporal integration are similar to visual models. 
They assume that we hear a stable auditory image in response to a periodic sound because the neural activity is passed through a temporal weighting function that integrates over time. The output does not fluctuate if the integration time is long enough. Unfortunately, the simple model of temporal integration does not work for the auditory system. If the output is to be stable, the integrator must integrate over 10 or more cycles of the sound. We hear stable images for pitches as low as, say, 50 cycles per second, which suggests that the integration time of the auditory system would have to be 200 ms at the minimum. Such an integrator would cause far more smearing of auditory information than we know occurs. For example, phase shifts that produce small changes halfway through the period of a pulse train are often audible (see Patterson, 1987, for a review). Small changes of this sort would be obscured by lengthy temporal integration. .LP Thus the problem in modelling auditory temporal integration is to determine how the auditory system can integrate information to form a stable auditory image without losing the fine-grain temporal information within the individual cycles of periodic sounds. In visual terms, the problem is how to present a neural activity pattern at a rate of 5 metres per second while at the same time enabling the viewer to see features within periods greater than about 4 ms. .LP .SS "Periodic sounds and information packets" .PP Now consider temporal integration from an information processing perspective, and in particular, the problem of preserving formant information in the auditory image. The shape of the neural activity pattern within the period of a vowel sound provides information about the resonances of the vocal tract (see Figure 3.6), and thus the identity of the vowel. The information about the source arrives in packets whose duration is the period of the source. 
Many of the sounds in speech and music have the property that the source information changes relatively slowly when compared with the repetition rate of the source wave (i.e. the pitch). Thus, from an information processing point of view, one would like to combine source information from neighbouring packets, while at the same time taking care not to smear the source information contained within the individual packets. In short, one would like to perform quantised temporal integration, integrating over cycles but not within cycles of the sound. .LP .SH EXAMPLES .LP The first pair of examples is intended to illustrate the dominant forms of motion that appear in the auditory image, and the fact that shapes can be tracked across the image provided the rate of change is not excessive. The first example is a pitch glide for a note with fixed timbre. The second example involves formant motion (a form of timbre glide) in a monotone voice (i.e. for a relatively fixed pitch). .LP .SS "A pitch glide in the auditory image" .PP Up to this point, we have focussed on the way that TQTI can convert a fast-flowing NAP pattern into a stabilised auditory image. The mechanism is not, however, limited to continuous or stationary sounds. The data file cegc contains pulse trains that produce pitches near the musical notes C3, E3, G3, and C4, along with glides from one note to the next. The notes are relatively long and the pitch glides are relatively slow. As a result, each note forms a stabilised auditory image and there is smooth motion from one note image to the next. The stimulus file cegc is intended to support several examples, including ones involving the spiral representation of the auditory image and its relationship to musical consonance in the next chapter. For brevity, the current example is limited to the transition from C to E near the start of the file. 
The pitch of musical notes is determined by the lower harmonics when they are present and so the command for the example is: .LP gensai mag=16 min=100 max=2000 start=100 length=600 duration_sai=32 cegc .LP In point of fact, the pulse train associated with the first note has a period of 8 ms like pt_8ms and so this "C" is actually a little below the musical note C3. Since the initial C is the same as pt_8ms, the onset of the first note is the same as in the previous example; however, four cycles of the pulse train pattern build up in the window because it has been set to show 32 ms of 'auditory image time'. During the transition, the period of the stimulus decreases from 32/4 ms down to 32/5 ms, and so the image stabilises with five cycles in the window. The period of E is 4/5 that of C. .LP During the transition, in the lower channels associated with the first and second harmonic, the individual SAI pulses march from left to right in time and, at the same time, they move up in frequency as the energy of these harmonics moves out of lower filters and into higher filters. In these low channels the motion is relatively smooth because the SAI pulse has a duration which is a significant proportion of the period of the sound. As the pitch rises and the periods get shorter, each new NAP cycle contributes a NAP pulse which is shifted a little to the right relative to the corresponding SAI pulse. This increases the leading edge of the SAI pulse without contributing to the lagging edge. As a result, the leading edge builds, the lagging edge decays, and the SAI pulse moves to the right. The SAI pulses are asymmetric during the motion, with the trailing edge more shallow than the leading edge, and the effect is greater towards the left edge of the image because the discrepancies over four cycles are larger than the discrepancies over one cycle. 
The effects are larger for the second harmonic than for the first harmonic because the widths of the pulses of the second harmonic are a smaller proportion of the period. During the pitch glide the SAI pulses have a reduced peak height because the activity is distributed over more channels and over longer durations. .LP The SAI pulses associated with the higher harmonics are relatively narrow with regard to the changes in period during the pitch glide. As a result there is more blurring of the image during the glide in the higher channels. Towards the right-hand edge, for the column that shows correlations over one cycle, the blurring is minimal. Towards the left-hand edge the details of the pattern are blurred and we see mainly activity moving in vertical bands from left to right. When the glide terminates the fine structure reforms from right to left across the image and the stationary image for the note E appears. .LP The details of the motion are more readily observed when the image is played in slow motion. If the disc space is available (about 1.3 Mbytes), it is useful to generate a cegc.ctn file using the bitmap option. The auditory image can then be played in slow motion using the review command and the slow-down option "-". .LP .SS "Formant motion in the auditory image" .PP The vowels of speech are quasi-periodic sounds and the period for the average male speaker is on the order of 8 ms. As the articulators change the shape of the vocal tract during speech, formants appear in the auditory image and move about. The position and motion of the formants represent the speech information conveyed by the voiced parts of speech. When the speaker uses a monotone voice, the pitch remains relatively steady and the motion of the formants is essentially in the vertical dimension. An example of monotone voiced speech is provided in the file leo which is the acoustic waveform of the word 'leo'. 
The auditory image of leo can be produced using the command .LP gensai mag=12 segment=40 duration_sai=20 leo .LP The dominant impression on first observing the auditory image of leo is the motion in the formation of the "e" sound, the transition from "e" to "o", and the formation of the "o" sound. .LP The vocal cords come on at the start of the "l" sound but the tip of the tongue is pressed against the roof of the mouth just behind the teeth and so it restricts the air flow and the start of the "l" does not contain much energy. As a result, in the auditory image, the presence of the "l" is primarily observed in the transition from the "l" to the "e". That is, as the three formants in the auditory image of the "e" come on and grow stronger, the second formant glides into its "e" position from below, indicating that the second formant was recently at a lower frequency for the previous sound. .LP In the "e", the first formant is low, centred on the third harmonic at the bottom of the auditory image. The second formant is high, up near the third formant. The lower portion of the fourth formant shows along the upper edge of the image. Recognition systems that ignore temporal fine structure often have difficulty determining whether a high frequency concentration of energy is a single broad formant or a pair of narrower formants close together. This can make a vowel like "e" more difficult to distinguish. In the auditory image, information about the pulsing of the vocal cords is maintained and the temporal fluctuation of the formant shapes makes it easier to distinguish that there are two overlapping formants rather than a single large formant. .LP As the "e" changes into the "o", the second formant moves back down onto the eighth harmonic and the first formant moves up to a position between the third and fourth harmonics. The third and fourth formants remain relatively fixed in frequency but they become softer as the "o" takes over. 
During the transition, the second formant becomes fuzzy and moves down a set of vertical ridges at multiples of the period. .LP .SS "The vowel triangle: aiua" .PP In speech research, the vowels are specified by the centre frequencies of their formants. The first two formants carry the most information and it is common to see sets of vowels represented on a graph whose axes are the centre frequencies of the first and second formant. Not all combinations of these formant frequencies occur in speech; rather, the vowels occupy a triangular region within this vowel space and the points of the triangle are represented by /a/ as in paw, /i/ as in beet, and /u/ as in toot. The file aiua contains a synthetic speech wave that provides a tour around the vowel triangle from /a/ to /i/ to /u/ and back to /a/, and there are smooth transitions from one vowel to the next. The auditory image of aiua can be generated using the command .LP gensai mag=12 segment=40 duration_sai=20 aiua .LP The initial vowel /a/ has a high first formant centred on the fifth harmonic and a low second formant centred between the seventh and eighth harmonics (for these low formants the harmonic number can be determined by counting the number of SAI peaks in one period of the image). The third formant is at the top of the image and it is reasonably strong, although relatively short in duration. As the sound changes from /a/ to /i/, the first formant moves successively down through the low harmonics and comes to rest on the second harmonic. At the same time the second formant moves all the way up to a position adjacent to the third formant, similar to the "e" in leo. All three of the formants are relatively strong. During the transition from the /i/ to the /u/, the third formant becomes much weaker. The second formant moves down onto the seventh harmonic and it remains relatively weak. The first formant remains centred on the second harmonic and it is relatively strong. 
Finally, the formants return to their /a/ positions. .LP .SS "Speaker separation in the auditory image" .PP One of the more intriguing aspects of speech recognition is our ability to hear out one voice in the presence of competing voices -- the proverbial cocktail party phenomenon. It is assumed that we use pitch differences to help separate the voices. In support of this view, several researchers have presented listeners with pairs of vowels and shown that they can discriminate the vowels better when they have different pitches (Summerfield and Assmann, 1989). The final example involves a double vowel stimulus, /a/ with /i/, and it shows that stable images of the dominant formants of both vowels appear in the image. The file dblvow (double vowel) contains seven double-vowel pulses. The amplitude of the /a/ is fixed at a moderate level; the amplitude of the /i/ begins at a level 12 dB greater than that of the /a/ and it decreases 4 dB with each successive pulse, and so they are equal in level in the fourth pulse. Each pulse is 200 ms in duration with 20 ms rise and fall times that are included within the 200 ms. There are 80 ms silent gaps between pulses and a gap of 80 ms at the start of the file. The auditory image can be generated with the command .LP gensai mag=12 samplerate=10000 segment=40 duration_sai=20 dblvow .LP The pitches of the /a/ and the /i/ are 100 and 125 Hz, respectively. The image reveals a strong first formant centred on the second harmonic of 125 Hz (8 ms), and strong third and fourth formants with a period of 8 ms (125 Hz). These are the formants of the /i/ which is the stronger of the two vowels at this point. In between the first and second formants of the /i/ are the first and second formants of the /a/ at a somewhat lower level. The formants of the /a/ show their proper period, 10 ms. The triggering mechanism can stabilise the formants of both vowels at their proper periods because the triggering is done on a channel-by-channel basis. 
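The level schedule of dblvow can be checked with a line of arithmetic (Python; illustrative only):

```python
# Relative level of the /i/ with respect to the /a/ over the seven
# pulses of dblvow: the /i/ starts 12 dB above the /a/ and drops
# 4 dB per pulse, so the difference reaches 0 dB (equal level) on
# the fourth pulse, and the /a/ dominates thereafter.
relative_db = [12 - 4 * n for n in range(7)]
print(relative_db)  # [12, 8, 4, 0, -4, -8, -12]
```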
The upper formants of the /a/ fall in the same channels as the upper formants of the /i/, and since they are much weaker, they are suppressed by the /i/ formants. .LP As the example proceeds, the formants of the /i/ become progressively weaker. In the image of the fifth burst of the double vowel we see evidence of both the upper formants of the /i/ and the upper formants of the /a/ in the same channel. Finally, in the last burst the first formant of the /i/ has disappeared from the lowest channels entirely. There is still some evidence of /i/ in the region of the upper formants but it is the formants of the /a/ that now dominate in the high frequency region. .LP .SH SEE ALSO .LP .SH COPYRIGHT .LP Copyright (c) Applied Psychology Unit, Medical Research Council, 1995 .LP Permission to use, copy, modify, and distribute this software without fee is hereby granted for research purposes, provided that this copyright notice appears in all copies and in all supporting documentation, and that the software is not redistributed for any fee (except for a nominal shipping charge). Anyone wanting to incorporate all or part of this software in a commercial product must obtain a license from the Medical Research Council. .LP The MRC makes no representations about the suitability of this software for any purpose. It is provided "as is" without express or implied warranty. .LP THE MRC DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS, IN NO EVENT SHALL THE A.P.U. BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE. .LP .SH ACKNOWLEDGEMENTS .LP The AIM software was developed for Unix workstations by John Holdsworth and Mike Allerhand of the MRC APU, under the direction of Roy Patterson. 
The physiological version of AIM was developed by Christian Giguere. The options handler is by Paul Manson. The revised SAI module is by Jay Datta. Michael Akeroyd extended the PostScript facilities and developed the xreview routine for auditory image cartoons. .LP The project was supported by the MRC and grants from the U.K. Defence Research Agency, Farnborough (Research Contract 2239); the EEC Esprit BR Programme, Project ACTS (3207); and the U.K. Hearing Research Trust.