aim92: man/man1/gensai.1 annotate

annotate man/man1/gensai.1 @ 0:5242703e91d3 tip

Initial checkin for AIM92 aimR8.2 (last updated May 1997).

author	tomwalters
date	Fri, 20 May 2011 15:19:45 +0100

rev	line source
tomwalters@0	1 .TH GENSAI 1 "26 May 1995"
tomwalters@0	2 .LP
tomwalters@0	3 .SH NAME
tomwalters@0	4 .LP
tomwalters@0	5 gensai \- generate stabilised auditory image
tomwalters@0	6 .LP
tomwalters@0	7 .SH SYNOPSIS/SYNTAX
tomwalters@0	8 .LP
tomwalters@0	9 gensai [ option=value \| -option ] filename
tomwalters@0	10 .LP
tomwalters@0	11 .SH DESCRIPTION
tomwalters@0	12 .LP
tomwalters@0	13
tomwalters@0	14 Periodic sounds give rise to static, rather than oscillating,
tomwalters@0	15 perceptions, indicating that temporal integration is applied to the NAP
tomwalters@0	16 in the production of our initial perception of a sound -- our auditory
tomwalters@0	17 image. Traditionally, auditory temporal integration is represented by
tomwalters@0	18 a simple leaky integration process and AIM provides a bank of lowpass
tomwalters@0	19 filters to enable the user to generate auditory spectra (Patterson,
tomwalters@0	20 1994a) and auditory spectrograms (Patterson et al., 1992b). However,
tomwalters@0	21 the leaky integrator removes the phase-locked fine structure observed
tomwalters@0	22 in the NAP, and this conflicts with perceptual data indicating that
tomwalters@0	23 the fine structure plays an important role in determining sound
tomwalters@0	24 quality and source identification (Patterson, 1994b; Patterson and
tomwalters@0	25 Akeroyd, 1995). As a result, AIM includes two modules which preserve
tomwalters@0	26 much of the time-interval information in the NAP during temporal
tomwalters@0	27 integration, and which produce a better representation of our auditory
tomwalters@0	28 images. In the functional version of AIM, this is accomplished with
tomwalters@0	29 strobed temporal integration (Patterson et al., 1992a,b), and this is
tomwalters@0	30 the topic of this manual entry.
tomwalters@0	31
tomwalters@0	32 .LP
tomwalters@0	33
tomwalters@0	34 In the physiological version of AIM, the auditory image is constructed
tomwalters@0	35 with a bank of autocorrelators (Slaney and Lyon, 1990; Meddis and
tomwalters@0	36 Hewitt, 1991). The autocorrelation module is an aimTool rather than
tomwalters@0	37 an integral part of the main program 'gen'. The appropriate tool is
tomwalters@0	38 'acgram'. Type 'manaim acgram' for the documentation. The module
tomwalters@0	39 extracts periodicity information and preserves intra-period fine
tomwalters@0	40 structure by autocorrelating each channel of the NAP separately. The
tomwalters@0	41 correlogram is the multi-channel version of this process. It was
tomwalters@0	42 originally introduced as a model of pitch perception (Licklider,
tomwalters@0	43 1951). It is not yet known whether STI or autocorrelation is more
tomwalters@0	44 realistic, or more efficient, as a means of simulating our perceived
tomwalters@0	45 auditory images. At present, the purpose is to provide a software
tomwalters@0	46 package that can be used to compare these auditory representations in
tomwalters@0	47 a way not previously possible.
tomwalters@0	48
tomwalters@0	49 .RE
tomwalters@0	50 .LP
tomwalters@0	51 .SH STROBED TEMPORAL INTEGRATION
tomwalters@0	52 .PP
tomwalters@0	53
tomwalters@0	54 In strobed temporal integration, a bank of delay lines is used to form
tomwalters@0	55 a buffer store for the NAP, one delay line per channel, and as the NAP
tomwalters@0	56 proceeds along the buffer it decays linearly with time, at about 2.5
tomwalters@0	57 %/ms. Each channel of the buffer is assigned a strobe unit which
tomwalters@0	58 monitors activity in that channel looking for local maxima in the
tomwalters@0	59 stream of NAP pulses. When one is found, the unit initiates temporal
tomwalters@0	60 integration in that channel; that is, it transfers a copy of the NAP
tomwalters@0	61 at that instant to the corresponding channel of an image buffer and
tomwalters@0	62 adds it point-for-point with whatever is already there. The local
tomwalters@0	63 maximum itself is mapped to the 0-ms point in the image buffer. The
tomwalters@0	64 multi-channel version of this STI process is AIM's representation of
tomwalters@0	65 our auditory image of a sound. Periodic and quasi-periodic sounds
tomwalters@0	66 cause regular strobing which leads to simulated auditory images that
tomwalters@0	67 are static, or nearly static, but with the same temporal resolution as
tomwalters@0	68 the NAP. Dynamic sounds are represented as a sequence of auditory
tomwalters@0	69 image frames. If the rate of change in a sound is not too rapid, as in
tomwalters@0	70 diphthongs, features are seen to move smoothly as the sound proceeds,
tomwalters@0	71 much as objects move smoothly in animated cartoons.
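The integration step described above can be sketched for a single channel as follows. This is a minimal illustration in Python, not the AIM implementation: the function name integrate_channel, its parameters, and the choice to apply the linear NAP decay as a weight on each copied sample are all assumptions made for the example. The strobe points are taken as given.

```python
def integrate_channel(nap, strobes, dt_ms, width_ms=35.0, napdecay_pct_per_ms=2.5):
    """Sketch of strobed temporal integration for one channel (hypothetical helper).

    nap      -- list of NAP samples for the channel
    strobes  -- sample indices at which a strobe (local maximum) was found
    dt_ms    -- sample interval in ms
    Returns an image buffer in which index k holds activity that occurred
    k * dt_ms before a strobe; index 0 is the 0-ms point.
    """
    n = int(round(width_ms / dt_ms)) + 1
    image = [0.0] * n
    for s in strobes:
        for k in range(n):
            j = s - k                      # NAP sample k steps before the strobe
            if j < 0:
                break
            # Assumed linear decay of the NAP with age in the buffer.
            weight = max(0.0, 1.0 - napdecay_pct_per_ms / 100.0 * k * dt_ms)
            image[k] += weight * nap[j]    # add point-for-point to the image
    return image
```

With a pulse train of 8-ms period, every strobe adds a pulse at the 0-ms point and weaker pulses at 8 ms, 16 ms and so on, which is why the image of a periodic sound stays in place rather than flowing.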
tomwalters@0	72
tomwalters@0	73 .LP
tomwalters@0	74 It is important to emphasise that the triggering is done on a
tomwalters@0	75 channel by channel basis and that triggering is asynchronous
tomwalters@0	76 across channels, inasmuch as the major peaks in one channel occur
tomwalters@0	77 at different times from the major peaks in other channels. It
tomwalters@0	78 is this aspect of the triggering process that causes the
tomwalters@0	79 alignment of the auditory image and which accounts for the loss
tomwalters@0	80 of phase information in the auditory system (Patterson, 1987).
tomwalters@0	81
tomwalters@0	82 .LP
tomwalters@0	83
tomwalters@0	84 The auditory image has the same vertical dimension as the neural
tomwalters@0	85 activity pattern (filter centre frequency). The continuous time
tomwalters@0	86 dimension of the neural activity pattern becomes a local,
tomwalters@0	87 time-interval dimension in the auditory image; specifically, it is
tomwalters@0	88 "the time interval between a given pulse and the succeeding strobe
tomwalters@0	89 pulse". In order to preserve the direction of asymmetry of features
tomwalters@0	90 that appear in the NAP, the time-interval origin is plotted towards
tomwalters@0	91 the right-hand edge of the image, with increasing, positive time
tomwalters@0	92 intervals proceeding towards the left.
tomwalters@0	93
tomwalters@0	94 .LP
tomwalters@0	95 .SH OPTIONS
tomwalters@0	96 .LP
tomwalters@0	97 .SS "Display options for the auditory image"
tomwalters@0	98 .PP
tomwalters@0	99
tomwalters@0	100 The options that control the positioning of the window in which the
tomwalters@0	101 auditory image appears are the same as those used to set up the
tomwalters@0	102 earlier windows, as are the options that control the level of the
tomwalters@0	103 image within the display. In addition, there are three new options
tomwalters@0	104 that are required to present this new auditory representation. The
tomwalters@0	105 options are frstep_aid, pwid_aid, and nwid_aid; the suffix "_aid"
tomwalters@0	106 means "auditory image display". These options are described here
tomwalters@0	107 before the options that control the image construction process itself,
tomwalters@0	108 as they occur first in the options list. There are also three extra
tomwalters@0	109 display options for presenting the auditory image in its spiral form;
tomwalters@0	110 these options have the suffix "_spd" for "spiral display"; they are
tomwalters@0	111 described in the manual entry for 'genspl'.
tomwalters@0	112
tomwalters@0	113 .LP
tomwalters@0	114 .TP 17
tomwalters@0	115 frstep_aid
tomwalters@0	116 The frame step interval, or the update interval for the auditory image display
tomwalters@0	117 .RS
tomwalters@0	118 Default units: ms. Default value: 16 ms.
tomwalters@0	119 .RE
tomwalters@0	120 .RS
tomwalters@0	121
tomwalters@0	122 Conceptually, the auditory image exists continuously in time. The
tomwalters@0	123 simulation of the image produced by AIM is not continuous; rather it
tomwalters@0	124 is like an animated cartoon. Frames of the cartoon are calculated at
tomwalters@0	125 discrete points in time, and then the sequence of frames is replayed
tomwalters@0	126 to reveal the dynamics of the sound, or the lack of dynamics in the
tomwalters@0	127 case of periodic sounds. When the sound is changing at a rate where
tomwalters@0	128 we hear smooth glides, the structures in the simulated auditory image
tomwalters@0	129 move much like objects in a cartoon. frstep_aid determines the time
tomwalters@0	130 interval between frames of the auditory image cartoon. Frames are
tomwalters@0	131 calculated at time zero and integer multiples of frstep_aid.
tomwalters@0	132
tomwalters@0	133 .RE
tomwalters@0	134
tomwalters@0	135 The default value (16 ms) is reasonable for musical sounds and speech
tomwalters@0	136 sounds. For a detailed examination of the development of the image of
tomwalters@0	137 brief transient sounds frstep_aid should be decreased to 4 or even 2
tomwalters@0	138 ms.
tomwalters@0	139 .LP
tomwalters@0	140 .TP 16
tomwalters@0	141 pwid_aid
tomwalters@0	142
tomwalters@0	143 The maximum positive time interval presented in the display of the
tomwalters@0	144 auditory image (to the left of 0 ms).
tomwalters@0	145
tomwalters@0	146 .RS
tomwalters@0	147 Default units: ms. Default value: 35 ms.
tomwalters@0	148 .RE
tomwalters@0	149 .LP
tomwalters@0	150 .TP 16
tomwalters@0	151 nwid_aid
tomwalters@0	152
tomwalters@0	153 The maximum negative time interval presented in the display of the
tomwalters@0	154 auditory image (to the right of 0 ms).
tomwalters@0	155
tomwalters@0	156 .RS
tomwalters@0	157 Default units: ms. Default value: -5 ms.
tomwalters@0	158 .RE
tomwalters@0	159
tomwalters@0	160 .LP
tomwalters@0	161 .TP 12
tomwalters@0	162 animate
tomwalters@0	163 Present the frames of the simulated auditory image as a cartoon.
tomwalters@0	164 .RS
tomwalters@0	165 Switch. Default off.
tomwalters@0	166 .RE
tomwalters@0	167 .RS
tomwalters@0	168
tomwalters@0	169 With reasonable resolution and a reasonable frame rate, the auditory
tomwalters@0	170 cartoon for a second of sound will require on the order of 1 Mbyte of
tomwalters@0	171 storage. As a result, auditory cartoons are only stored at the
tomwalters@0	172 specific request of the user. When the animate flag is set to `on',
tomwalters@0	173 the bit maps that constitute the frames of the auditory cartoon are
tomwalters@0	174 stored in computer memory. They can then be replayed as an auditory
tomwalters@0	175 cartoon by pressing `carriage return'. To exit the instruction, type
tomwalters@0	176 "q" for `quit' or "control c". The bit maps are discarded unless the
tomwalters@0	177 option bitmap=on is set.
tomwalters@0	178
tomwalters@0	179 .RE
tomwalters@0	180 .LP
tomwalters@0	181 .SS "Storage options for the auditory image "
tomwalters@0	182 .PP
tomwalters@0	183
tomwalters@0	184 A record of the auditory image can be stored in two ways depending on
tomwalters@0	185 the purpose for which it is stored. The actual numerical values of
tomwalters@0	186 the auditory image can be stored as previously, by setting output=on.
tomwalters@0	187 In this case, a file with a .sai suffix will be created in accordance
tomwalters@0	188 with the conventions of the software. These values can be recalled
tomwalters@0	189 for further processing with the aimTools. In this regard the SAI
tomwalters@0	190 module is like any previous module.
tomwalters@0	191
tomwalters@0	192 .LP
tomwalters@0	193 It is also possible to store the bit maps which are displayed on
tomwalters@0	194 the screen for the auditory image cartoon. The bit maps require
tomwalters@0	195 less storage space and reload more quickly, so this is the
tomwalters@0	196 preferred mode of storage when one simply wants to review the
tomwalters@0	197 visual image.
tomwalters@0	198 .LP
tomwalters@0	199 .TP 10
tomwalters@0	200 bitmap
tomwalters@0	201 Produce a bit-map storage file
tomwalters@0	202 .RS
tomwalters@0	203 Switch. Default value: off.
tomwalters@0	204 .RE
tomwalters@0	205 .RS
tomwalters@0	206
tomwalters@0	207 When the bitmap option is set to `on', the bit maps are stored in a
tomwalters@0	208 file with the suffix .ctn. The bitmaps are reloaded into memory using
tomwalters@0	209 the commands review, or xreview, followed by the file name without the
tomwalters@0	210 suffix .ctn. The auditory image can then be replayed, as with animate,
tomwalters@0	211 by typing `carriage return'. xreview is the newer and preferred
tomwalters@0	212 display routine. It enables the user to select subsets of the cartoon
tomwalters@0	213 and to change the rate of play via a convenient control window.
tomwalters@0	214
tomwalters@0	215
tomwalters@0	216
tomwalters@0	217 .LP
tomwalters@0	218 The strobe mechanism is relatively simple. A trigger threshold
tomwalters@0	219 value is maintained for each channel and when a NAP pulse exceeds
tomwalters@0	220 the threshold a trigger pulse is generated at the time associated
tomwalters@0	221 with the maximum of the peak. The threshold value is then reset
tomwalters@0	222 to a value somewhat above the height of the current NAP peak and
tomwalters@0	223 the threshold value decays exponentially with time thereafter.
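The strobe mechanism just described can be sketched for one channel as follows. This is an illustrative Python sketch under stated assumptions, not the AIM source code: the name find_strobes, the overshoot factor of 1.05 standing in for "somewhat above", and the reading of the decay rate as a fixed fractional loss per millisecond are all hypothetical choices.

```python
def find_strobes(nap, dt_ms, decay_pct_per_ms=5.0, overshoot=1.05):
    """Return the sample indices of strobe points in one NAP channel.

    A strobe is issued at a local maximum that exceeds the current
    threshold. The threshold is then reset to just above that peak
    (assumed factor: overshoot) and decays exponentially afterwards.
    """
    strobes = []
    threshold = 0.0
    # Per-sample multiplicative factor equivalent to the assumed %/ms decay.
    factor = (1.0 - decay_pct_per_ms / 100.0) ** dt_ms
    for i in range(1, len(nap) - 1):
        is_peak = nap[i - 1] < nap[i] >= nap[i + 1]
        if is_peak and nap[i] > threshold:
            strobes.append(i)
            threshold = overshoot * nap[i]
        else:
            threshold *= factor
    return strobes
```

With these assumed values, a secondary peak of half the height arriving 4 ms after a main peak stays below the threshold, while the next main peak 8 ms later exceeds it, so strobing locks to one peak per period.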
tomwalters@0	224
tomwalters@0	225
tomwalters@0	226
tomwalters@0	227 There are six options with the suffix "_ai", short for
tomwalters@0	228 'auditory image'. Four of these control STI itself -- stdecay_ai,
tomwalters@0	229 stcrit_ai, stthresh_ai and decay_ai. The option stinfo_ai is a switch
tomwalters@0	230 that causes the software to produce information about the current STI
tomwalters@0	231 analysis for demonstration or diagnostic purposes. The final option,
tomwalters@0	232 napdecay_ai, controls the decay rate for the NAP while it flows down
tomwalters@0	233 the NAP buffer.
tomwalters@0	234
tomwalters@0	235 .LP
tomwalters@0	236 .TP 17
tomwalters@0	237 napdecay_ai
tomwalters@0	238 Decay rate for the neural activity pattern (NAP)
tomwalters@0	239 .RS
tomwalters@0	240 Default units: %/ms. Default value: 2.5 %/ms.
tomwalters@0	241 .RE
tomwalters@0	242 .RS
tomwalters@0	243
tomwalters@0	244 napdecay_ai determines the rate at which the information in the neural
tomwalters@0	245 activity pattern decays as it proceeds along the auditory buffer that
tomwalters@0	246 stores the NAP prior to temporal integration.
tomwalters@0	247 .RE
tomwalters@0	248
tomwalters@0	249
tomwalters@0	250 .LP
tomwalters@0	251 .TP 16
tomwalters@0	252 stdecay_ai
tomwalters@0	253 Strobe threshold decay rate
tomwalters@0	254 .RS
tomwalters@0	255 Default units: %/ms. Default value: 5 %/ms.
tomwalters@0	256 .RE
tomwalters@0	257 .RS
tomwalters@0	258 stdecay_ai determines the rate at which the strobe threshold decays.
tomwalters@0	259 .RE
tomwalters@0	260 .LP
tomwalters@0	261 General purpose pitch mechanisms based on peak picking are
tomwalters@0	262 notoriously difficult to design, and the trigger mechanism just
tomwalters@0	263 described would not work well on an arbitrary acoustic waveform.
tomwalters@0	264 The reason that this simple trigger mechanism is sufficient for
tomwalters@0	265 the construction of the auditory image is that NAP functions are
tomwalters@0	266 highly constrained. The microstructure reveals a function that
tomwalters@0	267 rises from zero to a local maximum smoothly and returns smoothly
tomwalters@0	268 back to zero where it stays for more than half of a period of the
tomwalters@0	269 centre frequency of that channel. On the longer time scale, the
tomwalters@0	270 amplitude of successive peaks changes only relatively slowly with
tomwalters@0	271 respect to time. As a result, for periodic sounds there tends
tomwalters@0	272 to be one clear maximum per period in all but the lowest channels
tomwalters@0	273 where there is an integer number of maxima per period. The
tomwalters@0	274 simplicity of the NAP functions follows from the fact that the
tomwalters@0	275 acoustic waveform has passed through a narrow band filter and so
tomwalters@0	276 it has a limited number of degrees of freedom. In all but the
tomwalters@0	277 highest frequency channels, the output of the auditory filter
tomwalters@0	278 resembles a modulated sine wave whose frequency is near the
tomwalters@0	279 centre frequency of the filter. Thus the neural activity pattern
tomwalters@0	280 is largely restricted to a set of peaks which are modified
tomwalters@0	281 versions of the positive halves of a sine wave, and the remaining
tomwalters@0	282 degrees of freedom appear as relatively slow changes in peak
tomwalters@0	283 amplitude and relatively small changes in peak time (or phase).
tomwalters@0	284 .LP
tomwalters@0	285 .LP
tomwalters@0	286 When the acoustic input terminates, the auditory image must
tomwalters@0	287 decay. In the ASP model the form of the decay is exponential and
tomwalters@0	288 the decay rate is determined by decay_ai.
tomwalters@0	289 .LP
tomwalters@0	290 .TP 18
tomwalters@0	291 decay_ai
tomwalters@0	292 SAI decay time constant
tomwalters@0	293 .RS
tomwalters@0	294 Default units: ms. Default value: 30 ms.
tomwalters@0	295 .RE
tomwalters@0	296 .RS
tomwalters@0	297 decay_ai determines the rate at which the auditory image decays.
tomwalters@0	298 .RE
tomwalters@0	299 .RS
tomwalters@0	300
tomwalters@0	301 In addition, decay_ai determines the rate at which the strength of the
tomwalters@0	302 auditory image increases and the level to which it asymptotes if the
tomwalters@0	303 sound continues indefinitely. In an exponential process, the asymptote
tomwalters@0	304 is reached when the increment provided by each new cycle of the sound
tomwalters@0	305 equals the amount that the image decays over the same period.
tomwalters@0	306
tomwalters@0	307 .RE
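The balance between increment and decay that sets the asymptote can be worked through numerically. The sketch below assumes that decay_ai acts as an exponential time constant, so that the image is multiplied by exp(-T/decay_ai) over a period of T ms, and that each cycle adds a fixed increment. The function names are hypothetical and this is not a description of the AIM code.

```python
import math

def asymptote(increment, period_ms, decay_ms=30.0):
    """Level just after each increment once growth and decay balance.

    At the asymptote A, the decay over one period, A * (1 - keep),
    equals the increment, so A = increment / (1 - keep).
    """
    keep = math.exp(-period_ms / decay_ms)   # fraction surviving one period
    return increment / (1.0 - keep)

def simulate(increment, period_ms, decay_ms=30.0, cycles=200):
    """Build the level up cycle by cycle from an empty image."""
    keep = math.exp(-period_ms / decay_ms)
    level = 0.0
    for _ in range(cycles):
        level = level * keep + increment
    return level
```

Under these assumptions a longer decay_ai gives both a slower build-up and a higher asymptote, and a sound with a shorter period reaches a higher level than one with a longer period for the same increment per cycle.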
tomwalters@0	308 .SH MOTIVATION
tomwalters@0	309 .LP
tomwalters@0	310 .SS "Auditory temporal integration: The problem "
tomwalters@0	311 .PP
tomwalters@0	312 Image stabilisation and temporal smearing.
tomwalters@0	313 .LP
tomwalters@0	314 When the input to the auditory system is a periodic sound like
tomwalters@0	315 pt_8ms or ae_8ms, the output of the cochlea is a rapidly flowing
tomwalters@0	316 neural activity pattern on which the information concerning the
tomwalters@0	317 source repeats every 8 ms. Consider the display problem that
tomwalters@0	318 would arise if one attempted to present a one second sample of
tomwalters@0	319 either pt_8ms or ae_8ms with the resolution and format of Figure
tomwalters@0	320 5.2. In that figure each 8 ms period of the sound occupies about
tomwalters@0	321 4 cm of width. There are 125 repetitions of the period in one
tomwalters@0	322 second and so a paper version of the complete NAP would be 5
tomwalters@0	323 metres in length. If the NAP were presented as a real-time flow
tomwalters@0	324 process, the paper would have to move past a typical window at
tomwalters@0	325 the rate of 5 metres per second! At this rate, the temporal
tomwalters@0	326 detail within the cycle would be lost. The image would be stable
tomwalters@0	327 but the information would be reduced to horizontal banding. The
tomwalters@0	328 fine-grain temporal information is lost because the integration
tomwalters@0	329 time of the visual system is long with respect to the rate of
tomwalters@0	330 flow of information when the record is moving at 5 metres a
tomwalters@0	331 second.
tomwalters@0	332 .LP
tomwalters@0	333 Traditional models of auditory temporal integration are similar
tomwalters@0	334 to visual models. They assume that we hear a stable auditory
tomwalters@0	335 image in response to a periodic sound because the neural activity
tomwalters@0	336 is passed through a temporal weighting function that integrates
tomwalters@0	337 over time. The output does not fluctuate if the integration time
tomwalters@0	338 is long enough. Unfortunately, the simple model of temporal
tomwalters@0	339 integration does not work for the auditory system. If the output
tomwalters@0	340 is to be stable, the integrator must integrate over 10 or more
tomwalters@0	341 cycles of the sound. We hear stable images for pitches as low
tomwalters@0	342 as, say, 50 cycles per second, which suggests that the integration
tomwalters@0	343 time of the auditory system would have to be 200 ms at the
tomwalters@0	344 minimum. Such an integrator would cause far more smearing of
tomwalters@0	345 auditory information than we know occurs. For example, phase
tomwalters@0	346 shifts that produce small changes half way through the period of
tomwalters@0	347 a pulse train are often audible (see Patterson, 1987, for a
tomwalters@0	348 review). Small changes of this sort would be obscured by lengthy
tomwalters@0	349 temporal integration.
tomwalters@0	350 .LP
tomwalters@0	351 Thus the problem in modelling auditory temporal integration is
tomwalters@0	352 to determine how the auditory system can integrate information
tomwalters@0	353 to form a stable auditory image without losing the fine-grain
tomwalters@0	354 temporal information within the individual cycles of periodic
tomwalters@0	355 sounds. In visual terms, the problem is how to present a neural
tomwalters@0	356 activity pattern at a rate of 5 metres per second while at the
tomwalters@0	357 same time enabling the viewer to see features within periods
tomwalters@0	358 greater than about 4 ms.
tomwalters@0	359 .LP
tomwalters@0	360 .SS "Periodic sounds and information packets. "
tomwalters@0	361 .PP
tomwalters@0	362 Now consider temporal integration from an information processing
tomwalters@0	363 perspective, and in particular, the problem of preserving formant
tomwalters@0	364 information in the auditory image. The shape of the neural
tomwalters@0	365 activity pattern within the period of a vowel sound provides
tomwalters@0	366 information about the resonances of the vocal tract (see Figure
tomwalters@0	367 3.6), and thus the identity of the vowel. The information about
tomwalters@0	368 the source arrives in packets whose duration is the period of the
tomwalters@0	369 source. Many of the sounds in speech and music have the property
tomwalters@0	370 that the source information changes relatively slowly when
tomwalters@0	371 compared with the repetition rate of the source wave (i.e. the
tomwalters@0	372 pitch). Thus, from an information processing point of view, one
tomwalters@0	373 would like to combine source information from neighbouring
tomwalters@0	374 packets, while at the same time taking care not to smear the
tomwalters@0	375 source information contained within the individual packets. In
tomwalters@0	376 short, one would like to perform quantised temporal integration,
tomwalters@0	377 integrating over cycles but not within cycles of the sound.
tomwalters@0	378 .LP
tomwalters@0	379 .SH EXAMPLES
tomwalters@0	380 .LP
tomwalters@0	381 This first pair of examples is intended to illustrate the
tomwalters@0	382 dominant forms of motion that appear in the auditory image, and
tomwalters@0	383 the fact that shapes can be tracked across the image provided the
tomwalters@0	384 rate of change is not excessive. The first example is a pitch
tomwalters@0	385 glide for a note with fixed timbre. The second example involves
tomwalters@0	386 formant motion (a form of timbre glide) in a monotone voice (i.e.
tomwalters@0	387 for a relatively fixed pitch).
tomwalters@0	388 .LP
tomwalters@0	389 .SS "A pitch glide in the auditory image "
tomwalters@0	390 .PP
tomwalters@0	391 Up to this point, we have focussed on the way that TQTI (triggered, quantised temporal integration) can
tomwalters@0	392 convert a fast flowing NAP pattern into a stabilised auditory
tomwalters@0	393 image. The mechanism is not, however, limited to continuous or
tomwalters@0	394 stationary sounds. The data file cegc contains pulse trains that
tomwalters@0	395 produce pitches near the musical notes C3, E3, G3, and C4, along
tomwalters@0	396 with glides from one note to the next. The notes are relatively
tomwalters@0	397 long and the pitch glides are relatively slow. As a result, each
tomwalters@0	398 note forms a stabilised auditory image and there is smooth motion
tomwalters@0	399 from one note image to the next. The stimulus file cegc is
tomwalters@0	400 intended to support several examples including ones involving the
tomwalters@0	401 spiral representation of the auditory image and its relationship
tomwalters@0	402 to musical consonance in the next chapter. For brevity, the
tomwalters@0	403 current example is limited to the transition from C to E near the
tomwalters@0	404 start of the file. The pitch of musical notes is determined by
tomwalters@0	405 the lower harmonics when they are present and so the command for
tomwalters@0	406 the example is:
tomwalters@0	407 .LP
tomwalters@0	408 gensai mag=16 min=100 max=2000 start=100 length=600
tomwalters@0	409 duration_sai=32 cegc
tomwalters@0	410 .LP
tomwalters@0	411 In point of fact, the pulse train associated with the first note
tomwalters@0	412 has a period of 8 ms like pt_8ms and so this "C" is actually a
tomwalters@0	413 little below the musical note C3. Since the initial C is the
tomwalters@0	414 same as pt_8ms, the onset of the first note is the same as in the
tomwalters@0	415 previous example; however, four cycles of the pulse train pattern
tomwalters@0	416 build up in the window because it has been set to show 32 ms of
tomwalters@0	417 'auditory image time'. During the transition, the period of the
tomwalters@0	418 stimulus decreases from 32/4 ms down to 32/5 ms, and so the image
tomwalters@0	419 stabilises with five cycles in the window. The period of E is
tomwalters@0	420 4/5 that of C.
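The arithmetic behind these figures can be checked with a few lines of Python. The names are illustrative only, and the reference frequency for C3 assumes equal temperament with A4 = 440 Hz, which the text above does not specify.

```python
def cycles_in_window(period_ms, window_ms=32.0):
    """Number of stimulus periods that fit in the auditory image window."""
    return window_ms / period_ms

period_c = 8.0                      # ms, same as pt_8ms
period_e = period_c * 4 / 5         # ms, the period of E is 4/5 that of C
freq_c = 1000.0 / period_c          # Hz
freq_e = 1000.0 / period_e          # Hz
c3_equal_tempered = 440.0 * 2 ** (-21 / 12)   # assumed tuning, about 130.8 Hz

print(cycles_in_window(period_c), cycles_in_window(period_e))
print(freq_c, freq_e, round(c3_equal_tempered, 2))
```

An 8-ms period corresponds to 125 Hz, a little below the assumed C3, and the 6.4-ms period after the glide corresponds to 156.25 Hz, so the window holds four cycles before the glide and five after it.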
tomwalters@0	421 .LP
tomwalters@0	422 During the transition, in the lower channels associated with the
tomwalters@0	423 first and second harmonic, the individual SAI pulses march from
tomwalters@0	424 left to right in time and, at the same time, they move up in
tomwalters@0	425 frequency as the energy of these harmonics moves out of lower
tomwalters@0	426 filters and into higher filters. In these low channels the
tomwalters@0	427 motion is relatively smooth because the SAI pulse has a duration
tomwalters@0	428 which is a significant proportion of the period of the sound. As
tomwalters@0	429 the pitch rises and the periods get shorter, each new NAP cycle
tomwalters@0	430 contributes a NAP pulse which is shifted a little to the right
tomwalters@0	431 relative to the corresponding SAI pulse. This increases the
tomwalters@0	432 leading edge of the SAI pulse without contributing to the lagging
tomwalters@0	433 edge. As a result, the leading edge builds, the lagging edge
tomwalters@0	434 decays, and the SAI pulse moves to the right. The SAI pulses are
tomwalters@0	435 asymmetric during the motion, with the trailing edge more shallow
tomwalters@0	436 than the leading edge, and the effect is greater towards the left
tomwalters@0	437 edge of the image because the discrepancies over four cycles are
tomwalters@0	438 larger than the discrepancies over one cycle. The effects are
tomwalters@0	439 larger for the second harmonic than for the first harmonic
tomwalters@0	440 because the width of the pulses of the second harmonic is a
tomwalters@0	441 smaller proportion of the period. During the pitch glide the SAI
tomwalters@0	442 pulses have a reduced peak height because the activity is
tomwalters@0	443 distributed over more channels and over longer durations.
tomwalters@0	444 .LP
tomwalters@0	445 The SAI pulses associated with the higher harmonics are
tomwalters@0	446 relatively narrow with regard to the changes in period during the
tomwalters@0	447 pitch glide. As a result there is more blurring of the image
tomwalters@0	448 during the glide in the higher channels. Towards the right-hand
tomwalters@0	449 edge, for the column that shows correlations over one cycle, the
tomwalters@0	450 blurring is minimal. Towards the left-hand edge the details of
tomwalters@0	451 the pattern are blurred and we see mainly activity moving in
tomwalters@0	452 vertical bands from left to right. When the glide terminates the
tomwalters@0	453 fine structure reforms from right to left across the image and
tomwalters@0	454 the stationary image for the note E appears.
tomwalters@0	455 .LP
tomwalters@0	456 The details of the motion are more readily observed when the
tomwalters@0	457 image is played in slow motion. If the disc space is available
tomwalters@0	458 (about 1.3 Mbytes), it is useful to generate a cegc.img file
tomwalters@0	459 using the image option. The auditory image can then be played
tomwalters@0	460 in slow motion using the review command and the slow down option
tomwalters@0	461 "-".
tomwalters@0	462 .LP
tomwalters@0	463 .LP
tomwalters@0	464 .SS "Formant motion in the auditory image "
tomwalters@0	465 .PP
tomwalters@0	466 The vowels of speech are quasi-periodic sounds and the period for
tomwalters@0	467 the average male speaker is on the order of 8 ms. As the
tomwalters@0	468 articulators change the shape of the vocal tract during speech,
tomwalters@0	469 formants appear in the auditory image and move about. The
tomwalters@0	470 position and motion of the formants represent the speech
tomwalters@0	471 information conveyed by the voiced parts of speech. When the
tomwalters@0	472 speaker uses a monotone voice, the pitch remains relatively
tomwalters@0	473 steady and the motion of the formants is essentially in the
tomwalters@0	474 vertical dimension. An example of monotone voiced speech is
tomwalters@0	475 provided in the file leo which is the acoustic waveform of the
tomwalters@0	476 word 'leo'. The auditory image of leo can be produced using the
tomwalters@0	477 command
tomwalters@0	478 .LP
tomwalters@0	479 gensai mag=12 segment=40 duration_sai=20 leo
tomwalters@0	480 .LP
tomwalters@0	481 The dominant impression on first observing the auditory image of
tomwalters@0	482 leo is the motion in the formation of the "e" sound, the
tomwalters@0	483 transition from "e" to "o", and the formation of the "o" sound.
tomwalters@0	484 .LP
tomwalters@0	485 The vocal cords come on at the start of the "l" sound but the
tomwalters@0	486 tip of the tongue is pressed against the roof of the mouth just
tomwalters@0	487 behind the teeth and so it restricts the air flow and the start
tomwalters@0	488 of the "l" does not contain much energy. As a result, in the
tomwalters@0	489 auditory image, the presence of the "l" is primarily observed in
tomwalters@0	490 the transition from the "l" to the "e". That is, as the three
tomwalters@0	491 formants in the auditory image of the "e" come on and grow
tomwalters@0	492 stronger, the second formant glides into its "e" position from
tomwalters@0	493 below, indicating that the second formant was recently at a lower
tomwalters@0	494 frequency for the previous sound.
tomwalters@0	495 .LP
tomwalters@0	496 In the "e", the first formant is low, centred on the third
tomwalters@0	497 harmonic at the bottom of the auditory image. The second formant
tomwalters@0	498 is high, up near the third formant. The lower portion of the
tomwalters@0	499 fourth formant shows along the upper edge of the image.
tomwalters@0	500 Recognition systems that ignore temporal fine structure often
tomwalters@0	501 have difficulty determining whether a high frequency
tomwalters@0	502 concentration of energy is a single broad formant or a pair of
tomwalters@0	503 narrower formants close together. This makes it more difficult
tomwalters@0	504 to distinguish "e". In the auditory image, information about the
tomwalters@0	505 pulsing of the vocal cords is maintained and the temporal
tomwalters@0	506 fluctuation of the formant shapes makes it easier to distinguish
tomwalters@0	507 that there are two overlapping formants rather than a single
tomwalters@0	508 large formant.
tomwalters@0	509 .LP
tomwalters@0	510 As the "e" changes into the "o", the second formant moves back
tomwalters@0	511 down onto the eighth harmonic and the first formant moves up to
tomwalters@0	512 a position between the third and fourth harmonics. The third and
tomwalters@0	513 fourth formants remain relatively fixed in frequency but they
tomwalters@0	514 become softer as the "o" takes over. During the transition, the
tomwalters@0	515 second formant becomes fuzzy and moves down a set of vertical
tomwalters@0	516 ridges at multiples of the period.
tomwalters@0	517 .LP
tomwalters@0	518 .LP
tomwalters@0	519 .SS "The vowel triangle: aiua "
tomwalters@0	520 .PP
tomwalters@0	521 In speech research, the vowels are specified by the centre
tomwalters@0	522 frequencies of their formants. The first two formants carry the
tomwalters@0	523 most information and it is common to see sets of vowels
tomwalters@0	524 represented on a graph whose axes are the centre frequencies of
tomwalters@0	525 the first and second formant. Not all combinations of these
tomwalters@0	526 formant frequencies occur in speech; rather, the vowels occupy a
tomwalters@0	527 triangular region within this vowel space and the points of the
tomwalters@0	528 triangle are represented by /a/ as in paw, /i/ as in beet, and /u/ as
tomwalters@0	529 in toot. The file aiua contains a synthetic speech wave that
tomwalters@0	530 provides a tour around the vowel triangle from /a/ to /i/ to /u/
tomwalters@0	531 and back to /a/, and there are smooth transitions from one vowel
tomwalters@0	532 to the next. The auditory image of aiua can be generated using
tomwalters@0	533 the command
tomwalters@0	534 .LP
tomwalters@0	535 gensai mag=12 segment=40 duration=20 aiua
tomwalters@0	536 .LP
tomwalters@0	537 The initial vowel /a/ has a high first formant centred on the
tomwalters@0	538 fifth harmonic and a low second formant centred between the
tomwalters@0	539 seventh and eighth harmonics (for these low formants the harmonic
tomwalters@0	540 number can be determined by counting the number of SAI peaks in
tomwalters@0	541 one period of the image). The third formant is at the top of the
tomwalters@0	542 image and it is reasonably strong, although relatively short in
tomwalters@0	543 duration. As the sound changes from /a/ to /i/, the first formant
tomwalters@0	544 moves successively down through the low harmonics and comes to
tomwalters@0	545 rest on the second harmonic. At the same time the second formant
tomwalters@0	546 moves all the way up to a position adjacent to the third formant,
tomwalters@0	547 similar to the "e" in leo. All three of the formants are
tomwalters@0	548 relatively strong. During the transition from the /i/ to the /u/,
tomwalters@0	549 the third formant becomes much weaker. The second formant
tomwalters@0	550 moves down onto the seventh harmonic and it remains relatively
tomwalters@0	551 weak. The first formant remains centred on the second harmonic
tomwalters@0	552 and it is relatively strong. Finally, the formants return to
tomwalters@0	553 their /a/ positions.
tomwalters@0	554 .LP
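The peak-counting rule mentioned above can be sketched as a small calculation: a channel showing n SAI peaks in one period of the image is centred near the n-th harmonic, that is, near n times the fundamental. The pitch of aiua is not stated in this entry, so the 100 Hz value below is purely an assumption for illustration, and the function name is hypothetical (it is not part of AIM).

```python
def formant_hz_from_peak_count(peaks_per_period, f0_hz):
    """Approximate centre frequency of a formant from the number of
    SAI peaks counted within one period of the image.

    peaks_per_period: harmonic number read off the image (may be
        fractional when the formant lies between two harmonics).
    f0_hz: fundamental frequency of the voice (an assumed value here).
    """
    return peaks_per_period * f0_hz


ASSUMED_F0_HZ = 100.0  # hypothetical; the pitch of aiua is not given above

# /a/: first formant on the fifth harmonic, second formant between the
# seventh and eighth harmonics.
print(formant_hz_from_peak_count(5, ASSUMED_F0_HZ))
print(formant_hz_from_peak_count(7.5, ASSUMED_F0_HZ))
```

With a different fundamental the absolute frequencies scale accordingly; only the harmonic numbers come from the description above.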
tomwalters@0	555 .LP
tomwalters@0	556 .SS "Speaker separation in the auditory image "
tomwalters@0	557 .PP
tomwalters@0	558 One of the more intriguing aspects of speech recognition is our
tomwalters@0	559 ability to hear out one voice in the presence of competing voices
tomwalters@0	560 -- the proverbial cocktail party phenomenon. It is assumed that
tomwalters@0	561 we use pitch differences to help separate the voices. In support
tomwalters@0	562 of this view, several researchers have presented listeners with
tomwalters@0	563 pairs of vowels and shown that they can discriminate the vowels
tomwalters@0	564 better when they have different pitches (Summerfield & Assmann,
tomwalters@0	565 1989). The final example involves a double vowel stimulus, /a/
tomwalters@0	566 with /i/, and it shows that stable images of the dominant
tomwalters@0	567 formants of both vowels appear in the image. The file dblvow
tomwalters@0	568 (double vowel) contains seven double-vowel pulses. The amplitude
tomwalters@0	569 of the /a/ is fixed at a moderate level; the amplitude of the /i/
tomwalters@0	570 begins at a level 12 dB greater than that of the /a/ and it
tomwalters@0	571 decreases by 4 dB with each successive pulse, so that they are equal
tomwalters@0	572 in level in the fourth pulse. Each pulse is 200 ms in duration
tomwalters@0	573 with 20 ms rise and fall times that are included within the 200
tomwalters@0	574 ms. There are 80 ms silent gaps between pulses and a gap of 80
tomwalters@0	575 ms at the start of the file. The auditory image can be generated
tomwalters@0	576 with the command
tomwalters@0	577 .LP
tomwalters@0	578 gensai mag=12 samplerate=10000 segment=40 duration=20 dblvow
tomwalters@0	579 .LP
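The layout of dblvow described above can be tabulated with a short sketch. It uses only the figures given in the text (seven pulses of 200 ms, 80 ms gaps including the initial one, /i/ starting 12 dB above /a/ and dropping 4 dB per pulse); the function and constant names are hypothetical and are not part of AIM.

```python
N_PULSES = 7      # double-vowel pulses in the file
PULSE_MS = 200    # pulse duration, rise and fall times included
GAP_MS = 80       # silent gap before the first pulse and between pulses
START_DB = 12     # initial level of /i/ relative to /a/
STEP_DB = 4       # decrease in the /i/ level per pulse


def dblvow_schedule():
    """Return a list of (pulse number, onset in ms, /i/ level re /a/ in dB)."""
    schedule = []
    for k in range(N_PULSES):
        onset_ms = GAP_MS + k * (PULSE_MS + GAP_MS)
        rel_db = START_DB - k * STEP_DB
        schedule.append((k + 1, onset_ms, rel_db))
    return schedule


for pulse, onset_ms, rel_db in dblvow_schedule():
    print(f"pulse {pulse}: onset {onset_ms} ms, /i/ re /a/ {rel_db:+d} dB")
```

The table confirms that the two vowels are equal in level in the fourth pulse and that /a/ is the stronger vowel from the fifth pulse onwards.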
tomwalters@0	580 The pitches of the /a/ and the /i/ are 100 and 125 Hz, respectively.
tomwalters@0	581 The image reveals a strong first formant centred on the second
tomwalters@0	582 harmonic of 125 Hz (8 ms), and strong third and fourth formants
tomwalters@0	583 with a period of 8 ms (125 Hz). These are the formants of the /i/,
tomwalters@0	584 which is the stronger of the two vowels at this point. In
tomwalters@0	585 between the first and second formants of the /i/ are the first
tomwalters@0	586 and second formants of the /a/ at a somewhat lower level. The
tomwalters@0	587 formants of the /a/ show their proper period, 10 ms. The
tomwalters@0	588 triggering mechanism can stabilise the formants of both vowels
tomwalters@0	589 at their proper periods because the triggering is done on a
tomwalters@0	590 channel by channel basis. The upper formants of the /a/ fall in
tomwalters@0	591 the same channels as the upper formants of the /i/ and since they
tomwalters@0	592 are much weaker, they are repressed by the /i/ formants.
tomwalters@0	593 .LP
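The periods quoted above follow directly from the two pitches, and the ridges of each vowel sit at multiples of its own period along the time-interval axis of the image. A minimal sketch of that arithmetic is given below; the 32 ms span is an arbitrary illustrative value rather than a gensai setting, and the function names are hypothetical.

```python
def period_ms(f0_hz):
    """Period in milliseconds of a voice with fundamental f0_hz."""
    return 1000.0 / f0_hz


def ridge_delays_ms(f0_hz, max_ms):
    """Time intervals (ms) at which ridges for this pitch are expected,
    i.e. the multiples of the period that fit within max_ms."""
    p = period_ms(f0_hz)
    count = int(max_ms // p)
    return [k * p for k in range(1, count + 1)]


SPAN_MS = 32.0  # assumed span of the interval axis, for illustration only

print(period_ms(125.0), ridge_delays_ms(125.0, SPAN_MS))  # /i/
print(period_ms(100.0), ridge_delays_ms(100.0, SPAN_MS))  # /a/
```

Because the two sets of ridge positions differ, channels dominated by /i/ and channels dominated by /a/ can be told apart by the spacing of their ridges.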
tomwalters@0	594 As the example proceeds, the formants of the /i/ become
tomwalters@0	595 progressively weaker. In the image of the fifth burst of the
tomwalters@0	596 double vowel we see evidence of both the upper formants of the /i/
tomwalters@0	597 and the upper formants of the /a/ in the same channel.
tomwalters@0	598 Finally, in the last burst the first formant of the /i/ has
tomwalters@0	599 disappeared from the lowest channels entirely. There is still
tomwalters@0	600 some evidence of /i/ in the region of the upper formants but it
tomwalters@0	601 is the formants of the /a/ that now dominate in the high frequency
tomwalters@0	602 region.
tomwalters@0	603 .LP
tomwalters@0	604 .SH SEE ALSO
tomwalters@0	605 .LP
tomwalters@0	606 .SH COPYRIGHT
tomwalters@0	607 .LP
tomwalters@0	608 Copyright (c) Applied Psychology Unit, Medical Research Council, 1995
tomwalters@0	609 .LP
tomwalters@0	610 Permission to use, copy, modify, and distribute this software without fee
tomwalters@0	611 is hereby granted for research purposes, provided that this copyright
tomwalters@0	612 notice appears in all copies and in all supporting documentation, and that
tomwalters@0	613 the software is not redistributed for any fee (except for a nominal
tomwalters@0	614 shipping charge). Anyone wanting to incorporate all or part of this
tomwalters@0	615 software in a commercial product must obtain a license from the Medical
tomwalters@0	616 Research Council.
tomwalters@0	617 .LP
tomwalters@0	618 The MRC makes no representations about the suitability of this
tomwalters@0	619 software for any purpose. It is provided "as is" without express or
tomwalters@0	620 implied warranty.
tomwalters@0	621 .LP
tomwalters@0	622 THE MRC DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING
tomwalters@0	623 ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS, IN NO EVENT SHALL
tomwalters@0	624 THE A.P.U. BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES
tomwalters@0	625 OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS,
tomwalters@0	626 WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION,
tomwalters@0	627 ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS
tomwalters@0	628 SOFTWARE.
tomwalters@0	629 .LP
tomwalters@0	630 .SH ACKNOWLEDGEMENTS
tomwalters@0	631 .LP
tomwalters@0	632 The AIM software was developed for Unix workstations by John
tomwalters@0	633 Holdsworth and Mike Allerhand of the MRC APU, under the direction of
tomwalters@0	634 Roy Patterson. The physiological version of AIM was developed by
tomwalters@0	635 Christian Giguere. The options handler is by Paul Manson. The revised
tomwalters@0	636 SAI module is by Jay Datta. Michael Akeroyd extended the PostScript
tomwalters@0	637 facilities and developed the xreview routine for auditory image
tomwalters@0	638 cartoons.
tomwalters@0	639 .LP
tomwalters@0	640 The project was supported by the MRC and grants from the U.K. Defence
tomwalters@0	641 Research Agency, Farnborough (Research Contract 2239); the EEC Esprit
tomwalters@0	642 BR Programme, Project ACTS (3207); and the U.K. Hearing Research Trust.
tomwalters@0	643
