Frequently Asked Questions in Acoustics of Speech and Hearing - Part 1

Why do Speech Therapists need to study Acoustics?

There are a number of reasons why an understanding of sound and speech production, hearing and perception is relevant to speech and language therapy:

  • The acoustic form of language is part of the speech chain that links speaker to hearer. If you didn't study acoustics, your understanding would stop at articulation and start again at phonetic transcription.
  • Speech is highly variable and this influences its effectiveness as a means of communication. Effective communication means proper sound generation, transmission, analysis and decoding. To understand, say, how one vowel sound is 'clear' while another is 'muffled' you need to appreciate how the sound energy is generated and shaped by the vocal tract, how it can be affected by its transmission to the listener, and how the listener is able to tell which vowel has been articulated.
  • Science is built on quantitative analysis; on measurements not just opinions. How can we measure speech without instruments to capture and represent the acoustic signal? How can you do science without knowledge of how such instruments operate and what it is possible and practical to measure?
  • Clinical work is not about saying speech is 'good' or 'bad' but about understanding what about the speech is different to normal, what the possible causes for a disorder might be, and what the consequences are for normal communication. In acoustics we provide the concepts which explain in what ways, for example, disorders of the vocal folds, disorders of hearing, or background noise affect the ability to communicate.
  • Clinical work and research work require you to write reports on your activities in a scientific (i.e. non-subjective) manner. In Acoustics we get you to do simple experimental work and then try to describe what you have done, what you found out and why you did it.
  • Clinical work will require you to use machines such as tape recorders, computers or even an electro-laryngograph. An understanding of how these machines work and are used will enhance your effectiveness.
  • RCSLT specifies that you have to do it.

What is a decibel?

Basically a decibel is a measure of power that can be applied in a number of fields and to a number of physical phenomena. It is used in Acoustics, as we have seen, to measure the intensity of a sound; it is also used in radio telecommunications to measure the intensity of a radio signal, or used in telecommunications to measure the power in a signal communicated over a telephone line.

What makes a decibel different from other units is that it always expresses the power in the signal as a ratio to some standard or reference power. In other words, rather than being like length and having units such as metres, the decibel describes a power as being 'ten-times' or 'one-hundredth' of the power of some reference power. Thus when we say that the gain in an amplifier is 20 decibels, we mean that the amplifier has changed the power in the signal by 20 decibel units, or that the output of the amplifier is 20 decibels of power greater than the input to the amplifier. This is more informative than to say that the output of the amplifier is 20 watts of power, since we don't know how many watts of power were put in!

The other useful aspect of decibels is that they express these ratios to a standard power using a logarithmic scale rather than a linear scale. This may seem more awkward than useful to you, but in fact it arises because many physical processes work multiplicatively on power rather than additively. If you imagine a sound signal going through a wall, clearly it loses power (it has more intensity on one side than on the other) but the question is: does the wall remove a fixed amount of power (say 20 watts) or does it change the power by a constant fraction (say reduce by a factor of 2)? Well, the latter is the right answer. We can express the effect of the wall on the sound as a fixed ratio of powers, but we can't express it as a fixed number of watts. Thus, if we put 10 watts in we might get 5 watts out, and if we put 20 watts in we might get 10 watts out. Using decibels we might say that the wall has attenuated the signal by 3 decibels, and this would be true regardless of the input power.

Finally, some mathematics. If we have a power P and some reference power Pref, then the ratio expressed in decibels is just

10 log10 (P/Pref)

That is, we divide the measured power by the reference power, take the logarithm base 10, then multiply by 10.
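As a quick illustration, the decibel calculation can be computed directly; a minimal Python sketch with made-up numbers (the function name is our own):

```python
import math

def power_db(p, p_ref):
    """Power p expressed in decibels relative to a reference power p_ref."""
    return 10 * math.log10(p / p_ref)

# A power 100 times the reference is 20 dB above it;
# an equal power is 0 dB.
print(power_db(100.0, 1.0))   # → 20.0
print(power_db(1.0, 1.0))     # → 0.0
```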

Sometimes we do not have measurements of the power in the signal, but we do have measurements of the amplitude of the signal. Fortunately we can use the fact that the power in the signal is proportional to the square of the amplitude of the signal (don't worry about this, it is because power is current times voltage, and that both the voltage in the signal and the current the signal can generate are proportional to amplitude). Thus we can substitute the ratio of powers in the decibel formula with a squared ratio of amplitudes, which because of the effect of the log reduces to:

20 log10 (A/Aref)

Where A is the measured amplitude and Aref is the reference amplitude.

Finally, some useful identities:

Doubling power in signal = add 3 decibels

Halving power signal = subtract 3 decibels

Doubling amplitude of signal = add 6 decibels

Halving amplitude of signal = subtract 6 decibels

Multiplying amplitude of signal by 10 = add 20 decibels

Dividing amplitude of signal by 10 = subtract 20 decibels.
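These identities are easy to verify numerically; here is a short Python check (illustrative, using the two decibel formulas given above):

```python
import math

def power_db(ratio):
    """Decibel value of a power ratio."""
    return 10 * math.log10(ratio)

def amplitude_db(ratio):
    """Decibel value of an amplitude ratio."""
    return 20 * math.log10(ratio)

print(round(power_db(2), 1))       # doubling power     → 3.0
print(round(power_db(0.5), 1))     # halving power      → -3.0
print(round(amplitude_db(2), 1))   # doubling amplitude → 6.0
print(round(amplitude_db(10), 1))  # ×10 amplitude      → 20.0
```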

What is a logarithmic scale?

A ruler is a linear scale: it has marks on it corresponding to equal quantities of distance. One way of expressing this is to say that the ratio of successive intervals is equal to one. A logarithmic scale is different in that the ratio of successive intervals is not equal to one. Each interval on a logarithmic scale is some common factor larger than the previous interval. A typical ratio is 10, so that the marks on the scale read: 1, 10, 100, 1000, 10000, etc. Such a scale is useful if you are plotting a graph of values which have a very large range.

Since many aspects of perception are related to proportional change, logarithmic scales are very common in psychophysics. A graph of many perceptual scales against the logarithm of the stimulus size is a straight line over some range (this is known as the Weber-Fechner law). A scale of perceptual pitch against log (Hz) is a good example.

What is the difference between loudness and intensity?

Loudness is a perceptual or subjective quality of a sound; intensity is a physical or objective property. Although changes in intensity can cause changes in loudness, they are clearly two different scales. In particular, sounds which are below the threshold of audibility have a non-zero intensity, but zero loudness.

Intensity is measured in Wm⁻², but we usually prefer to use the Sound Pressure Level scale (dBSPL). Loudness can be measured in units called phons, where 10 phons is the perceived loudness associated with a pure tone of 1000Hz at 10dB above the threshold of audibility.

Why do we need to multiply by 20 in the decibel scale?

The number 20 has two causes: one that gives us a multiplier of 10, and one that gives us a multiplier of 2.

The factor of 10 is easy - we are working in decibels not bels. One bel (named after Alexander Graham Bell, by the way) is rather a large unit, roughly equal to a tripling in amplitude. So we multiply by 10 and work in tenths of a bel to give us more sensitivity.

The factor of 2 is there because we have ignored the fact that the correct definition of decibels is as a logarithmic ratio of powers not amplitudes. The correct definition would look like:

amplitude (dB) = 10.log10(measured_power/reference_power)

In sound, we prefer to work in Pascals: units of pressure rather than units of power. Fortunately, it is fairly easy to show that the power in a sound signal is proportional to the square of the pressure (imagine the air pressing against a membrane with some force F and moving it through some distance d; if the membrane is elastic, then d is proportional to the pressure; and the energy transferred is then just the force on the membrane F times d. But since both F and d are proportional to pressure, then the energy is proportional to the pressure squared). Thus the power ratio is numerically equal to the square of the pressure ratio, and we can write:

amplitude (dB) = 10.log10((measured_pressure/reference_pressure)^2)

or, with the magic of logarithms:

amplitude (dB) = 20.log10(measured_pressure/reference_pressure)

Why doesn't 0dB mean that nothing is measured?

The decibel scale is a logarithmic ratio scale: we start with a ratio of pressures, then take the logarithm and finally multiply by 20. If the ratio is a number greater than one, then the logarithm must produce a value greater than 0 (e.g. log(10) = 1). If the ratio is a number less than one, then the logarithm must produce a value less than 0 (e.g. log(0.1) = -1). If the ratio is equal to one (i.e. the two pressures are equal) then the logarithm returns 0 (since log(1) = 0). Thus the zero point on the decibel scale is simply the point at which the measured amplitude is equal to the reference amplitude.

Note that it doesn't make sense to do the calculation when there is no measurable pressure change, since the logarithm of zero is not defined (minus infinity).

Why is log(20) = 1.3?

Answer A: Well all this is saying is that 10^1.3 is equal to 20. Is this really so strange? After all 10^1.0 = 10, and 10^2.0 = 100, so 10^1.3 has got to be a number between 10 and 100 hasn't it?

Answer B: We can try to work out the logarithm of 20 approximately by the following method. We are seeking a value x in the expression:

10^x = 20

Let's first divide out a factor of 10:

10^(x-1) = 2

And now raise to the tenth power:

10^(10(x-1)) = 2^10

Since 2^10 = 1024, which is just over 10^3, we can now make the following approximation (~= means approximately):

10^(10(x-1)) ~= 10^3

Or, in other words:

10(x-1) ~= 3

(x-1) ~= 0.3

x ~= 1.3
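You can confirm the approximation with Python's math library:

```python
import math

# The exact value: log10(20) = 1.30103..., so 1.3 is a good approximation.
print(round(math.log10(20), 3))   # → 1.301

# And the step used above: 2^10 = 1024, which is just over 10^3.
print(2 ** 10)                    # → 1024
```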

How do we convert decibel values on the SPL scale back to pressure?

We need to invert the formula for decibels:

Amplitude(dBSPL) = 20.log10(Amplitude(Pa)/20µPa)

OK, first divide by 20:

Amplitude(dBSPL)/20 = log10(Amplitude(Pa)/20µPa)

Then raise 10 to the power of each side:

10^(Amplitude(dBSPL)/20) = Amplitude(Pa)/20µPa

And, finally, multiply by 20µPa:

Amplitude(Pa) = 20µPa × 10^(Amplitude(dBSPL)/20)

For example, 60dBSPL is

= 20µPa × 10^(60/20)

= 20µPa × 1000

= 20,000µPa

= 0.02Pa
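The inverted formula is easy to code; a minimal Python sketch (the function name is our own):

```python
def spl_to_pascal(db_spl):
    """Convert a level in dBSPL back to a pressure in pascals.

    The reference pressure for the SPL scale is 20 µPa = 20e-6 Pa.
    """
    return 20e-6 * 10 ** (db_spl / 20)

# 60 dBSPL corresponds to 0.02 Pa, as in the worked example above.
print(round(spl_to_pascal(60), 6))   # → 0.02
```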

How can you calculate the natural frequency of a simple resonator?

Although the concept of a simple resonator applies to a number of different simple systems (e.g. a pendulum, a tuning fork, a mass on a spring), there is no single formula which allows us to calculate what the natural frequency of a resonator will be given some measurement of its size or composition. For example: the natural frequency of a pendulum is controlled exclusively by the length of the pendulum; however the natural frequency of a mass on a spring will depend on the weight of the mass and the stiffness of the spring, and the natural frequency of a tuning fork will depend on both the length and the stiffness of the tines.

For a pendulum, the formula for its period is actually quite simple:

T = 2π√(l/g)

where T=period(s), l=length(m) and g=acceleration due to gravity (9.81ms⁻²).
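As a quick numerical check of the standard pendulum formula T = 2π√(l/g), here is an illustrative Python sketch:

```python
import math

def pendulum_period(length, g=9.81):
    """Period in seconds of a simple pendulum of the given length in metres."""
    return 2 * math.pi * math.sqrt(length / g)

# A 1 m pendulum swings with a period of about 2 s (roughly 0.5 Hz).
print(round(pendulum_period(1.0), 2))   # → 2.01
```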

But on the whole, we are better off measuring the resonant frequency rather than calculating it, using the method of forced oscillation.

How can you measure the resonant frequency of a simple resonator?

What we mean by a resonator is simply a system that exhibits a preference for the frequencies at which it likes to vibrate. This is the key to measuring its resonant frequency: shake it at different frequencies and find out which one causes the resonator to vibrate most. This is called measurement by 'forced oscillation'. In the laboratory we measure the resonant frequency of an acoustic resonator by feeding a sinusoidal pressure wave into the cavity and measuring the size of the resulting pressure variations in the cavity. Since we can assume that our sinewave generator produces the same amplitude vibrations at every frequency, we can simply say that the frequency at which we get the most output amplitude is the resonant frequency.

How are period and frequency inter-related?

Our definition of the period of something is simply how long it takes, and is measured in seconds. For a periodic waveform, we define its fundamental period (known simply as its period) as the time it takes to complete one cycle of vibration.

Our definition of the frequency of something is simply how many times it occurs within some space of time (for example the frequency of a bus service is expressed in 'number of buses per hour'). For sound waves, the vibrations are often very rapid and we get a large number of vibrations occurring within one second. So we measure the fundamental frequency of a periodic sound in terms of how many cycles of vibration occur within one second, or in other words, units of 'per second' or s⁻¹. However, this unit also has a special name in the S.I. system: Hertz (Hz).

Given these two definitions, then, we can say that the fundamental frequency in Hertz of a periodic waveform is simply the number of fundamental periods it completes in one second, i.e.

frequency (per second, or Hertz) = 1 (second) / period (seconds)
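In code this reciprocal relationship is one line (illustrative numbers):

```python
# A waveform with a 10 ms fundamental period repeats 100 times per second.
period = 0.01                 # fundamental period in seconds
frequency = 1 / period        # fundamental frequency in Hz
print(round(frequency, 3))    # → 100.0
```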

What is the difference between frequency, resonant frequency, natural frequency, and fundamental frequency?

Fundamental frequency is the correct name given to the repetition frequency of a complex periodic waveform, i.e. how many cycles of the waveform occur in one second.

Resonant frequency, or natural frequency, is the name given to the frequency which is 'most preferred' by a simple resonator, i.e. the frequency which it most likes to vibrate at, or equivalently, the stimulating frequency which gives the biggest response.

Otherwise, we should only use the term 'frequency' to describe simple periodic waveforms: i.e. sinewaves. This is why we can say that the fundamental frequency of a vowel is X Hz, but not that the frequency of a vowel is X Hz, because vowels are not simple periodic waveforms.

What is the relationship between period, frequency and wavelength of a periodic sound?

If we think of a periodic sound being generated by a loudspeaker, and the sound pressure waves travelling out from the speaker into space, then it is easy to see that in one second the sound will have travelled a distance numerically equal to the speed of sound, i.e. if the speed of sound is 330ms⁻¹, then in one second it will have travelled 330m.

If the sound has a fundamental frequency of f Hz, then in that 330m of sound spread out from the speaker there will be exactly f cycles, or in other words, each cycle will be spread along 330/f metres. This distance is called the wavelength:

wavelength = speed of sound / frequency

or, in symbols:

λ = c / f

Another way to think about this is to say that the wavelength must be equal to the distance the sound travels in one period, or:

wavelength = speed of sound . period


λ = c . T
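A small Python sketch of both relationships, taking the speed of sound as 330ms⁻¹ as above (illustrative numbers):

```python
def wavelength_from_frequency(f, c=330.0):
    """Wavelength in metres from frequency in Hz: lambda = c / f."""
    return c / f

def wavelength_from_period(t, c=330.0):
    """Wavelength in metres from period in seconds: lambda = c * T."""
    return c * t

# A 100 Hz tone (period 0.01 s) has a 3.3 m wavelength either way.
print(wavelength_from_frequency(100.0))        # → 3.3
print(round(wavelength_from_period(0.01), 3))  # → 3.3
```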

What is pitch?

Pitch is one of the three subjective attributes of sound. That is to say our auditory system provides us with sensations caused by changes in air pressure, and those sensations seem to vary along three basic dimensions of loudness, pitch and timbre. Thus when we hear two sounds that are different, we can associate those differences to a change in loudness, a change in pitch, a change in timbre, or some combination of these.

Pitch is related to the "musical" quality of a sound, that aspect of sounds generated by musical instruments that we label with "notes" such as "middle-C". When we sing we try to make noises that change in pitch in the same proportions as musical instruments change in pitch. We also use these changes in pitch when we speak to indicate the "tone" of our voice, for contrast, for emphasis or for questions.

Acoustically, pitch is related to the repetition frequency of the sound wave; that is how many pattern cycles complete per second. A low-pitched sound has few repetition cycles per second: perhaps a hundred. A high-pitched sound has many repetition cycles per second: perhaps a few thousand. Pitch doesn't tell us about the shape of the cycles, only how often they repeat. To speak at different pitches, I control the tension of the vocal folds in my larynx: if I tense my folds they vibrate more frequently, have a higher repetition frequency and generate sounds that give a sensation of a higher pitch.

What is timbre?

Timbre has a technical and a non-technical definition. Non-technically it is to do with the 'quality' of a sound, rather than its pitch or loudness. Different musical instruments are said to have a different quality or timbre even for the same musical note: e.g. a trumpet and a flute, or a guitar and a violin. Technically, we can say that a difference in timbre is the name given to our perception of the difference between two sounds which have the same perceived pitch and perceived loudness. There are both spectral and temporal aspects to timbre: this is easy to show by playing a piece of music backwards: the musical notes will be the same loudness, the same pitch and have the same spectral properties, but the instruments will still sound different - this is due to our sensitivity to how a sound builds up and dies away as well as to its spectral content.

How do you find and measure the individual harmonics in a complex periodic waveform?

We can use a special system called a band-pass filter to selectively remove all but small frequency regions from the input signal. A band-pass filter, as its name suggests, passes only sinusoidal components of an input signal which happen to lie within its operating band. We can build a bank of band-pass filters to examine in turn each frequency region, then measure the amplitude of the output signal. This output amplitude will tell us the amplitude of the sinusoidal components of the input signal in that region.

Explain harmonic numbering

A harmonic is simply a sinewave component of a complex periodic waveform. One important characteristic of harmonics is that they only occur at frequencies which are whole number multiples of the fundamental frequency of the complex. This means that if we know the fundamental frequency, say F, then the harmonics must occur at 1F, 2F, 3F, 4F, etc. We adopt a simple numbering system to identify harmonics: the first harmonic occurs at the fundamental frequency; the second harmonic occurs at twice the fundamental frequency, the third at three times, etc. Using this terminology we can make statements such as: "damping the vibrations of a guitar string half way along its length removes the first harmonic and all odd harmonics, leaving mainly the second and other even harmonics, effectively doubling the fundamental frequency of the note produced."
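The numbering scheme is simple enough to express in a couple of lines of Python (the function name is our own):

```python
def harmonic_frequencies(f0, count):
    """Frequencies (Hz) of the first `count` harmonics of fundamental f0."""
    return [n * f0 for n in range(1, count + 1)]

# The first five harmonics of a 100 Hz complex: the 1st harmonic is the
# fundamental itself, the 2nd is at twice the fundamental, and so on.
print(harmonic_frequencies(100, 5))   # → [100, 200, 300, 400, 500]
```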

What units are used to measure response?

Response is defined simply as the ratio of the size of the signal coming out of a system to the size of the signal going in. Thus it is a dimensionless quantity. Like many other ratio scales in speech, we often like to use decibels for response, since decibels are a convenient means for expressing a ratio scale in logarithmic form. The formula for response in dB is then

Response (dB) = 20.log10 (output size/input size)

How can we measure the frequency response of a system?

You need to find a way to excite the system with a sinusoidal input of known and variable frequency, and have a means for measuring the size of this signal going in and the size of the signal coming out. For an electrical system, you might use a voltmeter for this. You can then measure the output voltage and the input voltage; from these you can calculate the response in dB using the formula above, and this can be repeated for a range of different frequency sinusoids. A frequency response graph simply shows how this measured response changes with the frequency of the sinusoid.

What is a 'frequency domain' description?

When we talk about how a system changes a signal that passes through it we can do so in two ways. Firstly we can look at the waveform shape that went in and the waveform shape that came out. Since this involves looking at the waveform shape against time, this is called a 'time domain' study of the behaviour of a system. Unless the input signal is very simple: a pulse, say, or a sinusoid, the time domain description is not very useful. In particular it is hard to generalise from the time behaviour to make a prediction for how the system would behave with other shaped time waveforms. Thus we prefer a description of the operation of the system which plots the spectrum of the input signal, the spectrum of the output signal and the frequency response of the system. These graphs all have axes of frequency, and such an explanation is said to be a 'frequency domain' explanation. It is potentially more powerful than a time-domain explanation, since knowledge of the frequency response curve allows us to predict the output spectrum for any given input spectrum. The output is simply the input multiplied by the response.

How do you measure bandwidth from a graph?

We use 'bandwidth' to mean a region of the frequency response of the system where all the sinusoids pass through with roughly equal values of the response. The standard definition of 'roughly equal' is usually taken as 'within 3dB'. Thus to measure the bandwidth, we find the peak response of the system and find what limits of frequency are at just 3dB lower than this. This range in frequency, from the upper frequency limit at 3dB below peak to the lower frequency limit at 3dB below peak, is known as the bandwidth.
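Here is an illustrative Python sketch of reading off the 3dB bandwidth from sampled points of a response curve (the numbers are invented, not measurements):

```python
def bandwidth_3db(freqs, response_db):
    """Return (lower limit, upper limit, bandwidth) in Hz, where the limits
    bound the sampled region lying within 3 dB of the peak response."""
    peak = max(response_db)
    passband = [f for f, r in zip(freqs, response_db) if r >= peak - 3]
    return passband[0], passband[-1], passband[-1] - passband[0]

freqs = [100, 200, 300, 400, 500, 600, 700]   # Hz
resp  = [-10,  -4,  -2,   0,  -2,  -4, -10]   # response in dB, peak at 400 Hz
print(bandwidth_3db(freqs, resp))   # → (300, 500, 200)
```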

How do you avoid distortion on your recordings?

The answer is simple in that there is very little you can do apart from set the recording levels carefully. If the record level is too low, then the signal-to-noise ratio will be poor - and your recording will be drowned out by noises coming from the electronics and the tape. If the record level is too high, then the peaks of your recording will be clipped and the signal will sound harsh. Most recorders have a record level meter that is often labelled in 'VU' or volume units. On this scale, a value of 0dB represents a safe recording level. However clipping does not normally occur until about 3-6dB above this level. Thus you should aim to make your recording peak close to but below 0dB VU, and use the extra headroom for the occasional loud noise that might occur.

How do you measure distortion?

Normally, by distortion, we mean the introduction into our signal of frequency components not present in the original. This also gives us a method for establishing the quantity of distortion: input a signal with known frequency components and find the size of the largest frequency component in the recorded signal not present in the original. You can then express the size of this component as a percentage of the size of the largest true component, or as an amplitude ratio in dB.

What is magnetisation?

When a signal is recorded onto analogue magnetic tape, the amplitude of the signal is reflected in the 'magnetisation' of the tape: roughly meaning the amount of magnetism stored on the tape. Recording tape consists of a plastic backing coated with small metal-oxide crystals. Each crystal can be turned into a small permanent magnet by applying an external electro-magnetic field (from the record tape head), and the magnetised crystals can themselves induce a small electrical current in a coil positioned next to the tape (in the playback head). By this means we can use a signal to change the magnetic properties of the crystals on the tape and conversely use the magnetic properties of the crystals to recreate an electrical signal. Hey presto: a tape recorder!

What is signal-to-noise ratio?

A way of measuring how much extraneous signal is added to our recording by the electronics and the tape. For analogue tape recorders most noise is generated by random fluctuations in the magnetic properties of the crystals on the tape. Ideally we would want such noise to be much smaller than the signal we want to record. We can measure how good or bad a tape recorder is by measuring the size of the noise generated by the recorder and comparing it to the size of the signal we are recording. The ratio of the amplitude of the signal to the amplitude of the noise is called the signal-to-noise ratio. It is often expressed in dB.
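Expressed in decibels, the calculation looks like this (a Python sketch with invented amplitudes):

```python
import math

def snr_db(signal_amplitude, noise_amplitude):
    """Signal-to-noise ratio in dB from the two amplitudes."""
    return 20 * math.log10(signal_amplitude / noise_amplitude)

# A signal 100 times bigger in amplitude than the tape noise: 40 dB SNR.
print(snr_db(100.0, 1.0))   # → 40.0
```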

How do we measure the frequency response of a microphone?

This is a little tricky: ideally we would want to create a set of standard sounds: sinusoidal pressure waves of different frequencies but constant size. We could then simply measure the voltage output from the microphone as a function of frequency. Unfortunately, to produce sounds we need a loudspeaker, and how do we know what the frequency response of the loudspeaker is? To get the response of a loudspeaker we need to use a microphone to measure the size of the pressure variations it generates. But then how do we know the frequency response of the microphone? And thus we are back to where we started! Basically we have to rely on a reference loudspeaker for which we know its response, or a reference microphone for which we know its response. We normally use the latter, since it is much easier to build a microphone with an essentially flat frequency response than it is to build a speaker. So ... we use a speaker to generate some sinusoidal pressure waves and monitor the voltage produced by the test microphone and by the reference microphone. Any differences will be due to deficiencies in the test microphone.

What is a filter and what does filtering a signal mean?

You should be familiar with the concept of a resonator. This is a system which responds to the input of some excitation signal. A pendulum is a simple resonator, and its response to a "push", that is an impulsive signal, is to respond by vibrating (oscillating) at its resonant frequency. You may also have experienced the excitation of a simple resonator with sinusoidal signals. This is often done to establish the resonant frequency of a resonator. We excite the resonator at different frequencies and determine which input frequency gives the largest output response.

Filtering is just recasting these ideas into the frequency domain. Instead of saying that the pendulum responds to an impulse by vibrating, we can say that the pendulum has "filtered" the input impulse into a damped sinusoidal shape output. Likewise we can say that an input sinusoid to the pendulum has been filtered by a simple resonator; the output is a sinusoid with the same frequency as the input but a different amplitude.

What then is filtering? A coffee filter allows small particles of coffee to pass through, but keeps back the larger coffee grounds. In signal terms, the components are sinusoids rather than particles, and the selection of which to allow through is based on their frequency rather than their size. Furthermore, we have the concept that the selection can be "partial", that is, it affects the amplitude of the component rather than just accepting or rejecting it. Thus a low-pass filter is a system that tends to pass sinusoidal signals of relatively low frequency, while tending to hold back (attenuate) sinusoidal signals of relatively high frequency (much like the coffee filter lets through small particles and holds back large ones). However in signal terms we can build many other kinds of filter, for example a "high-pass" filter lets through high frequencies and attenuates low ones; these have less familiarity in the coffee domain.

Coming back to our pendulum, we can say that the impulse signal contains sinusoids at every frequency, and a simple resonator selects those which are close in frequency to its resonant frequency; the damped sinusoidal output signal consists of sinusoids from a limited frequency range about the resonant frequency of the pendulum. When we apply a single sinusoid to the pendulum, all it can do is to change the amplitude (and possibly phase) of the input to produce a filtered sinusoidal output of the same frequency.

What is a noise signal?

We can make a four-way division among signals. Firstly there is the distinction between Periodic and Aperiodic signals. Periodic signals have a waveform shape that repeats in time. That is, it is possible to predict the future of the waveform by looking at its past. One can isolate a region of the signal, called its period, which occurs over and over again in time. Periodic signals thus have a clear fundamental period, and hence a clear fundamental frequency (or repetition frequency). These sounds also give us a clear sensation of pitch. Aperiodic signals have waveform shapes that do not repeat, that is, it is impossible to predict the future of the waveform by looking at its past.

Each of these categories can be further subdivided: Periodic signals can be divided into Simple and Complex. Simple periodic signals are just sinusoids, while complex periodic waveforms are combinations of sinusoids. Aperiodic signals can be divided into Impulsive and Noise. Impulsive signals are aperiodic because they only occur once, with a concentration of energy at a particular time, while noise signals are aperiodic because they are generated by random or chaotic processes.

What is a spectrogram?

A spectrogram is a visual representation of the frequency content of a signal. A spectrogram shows how the quantity of energy in different frequency regions varies as a function of time. On a spectrogram, the signal is divided into many small time sections and each section is analysed in terms of what frequency components are present in the section. This analysis is called spectral analysis because the spectrum of each section is calculated and the quantity of each frequency component (that is each sinusoid) is measured from the spectrum. The quantity of each component is then converted to a grey level in which (normally) low energy components are converted to a white colour, while high energy components are converted to a black colour. These colours are then plotted on a vertical strip corresponding to the time at which the original signal segment occurred. The height of the coloured element on this vertical strip represents the frequency of the component.

Thus a spectrogram is a 3-dimensional analysis of a signal, the horizontal dimension is time, the vertical dimension is frequency, and the grey-scale shows the amount of energy occurring in the signal at each time and frequency.

What is the difference between a wide-band and a narrow-band spectrogram?

From the description of a spectrogram above, you will see that one part of the analysis process involves dividing the signal up into sections so that each section can be spectrally analysed. What we didn't say is how big these sections should be. Clearly if we chose very large sections, say of half a second, we wouldn't be able to see much of what is going on in a speech signal - each half second chunk of the picture would be a static set of colours. We know that speech signals change rapidly and we want to choose relatively small sections so that we can see the spectral content of the speech changing from moment to moment.

There is a problem however, as we make the sections smaller and smaller, it becomes more and more difficult to determine precisely which frequency components are present in the signal. You can see this in a qualitative way: if you only look at a small fraction of a single cycle of a sinusoid, it is very difficult to estimate the frequency: you can guess the duration of a whole cycle, but you might be considerably in error. In the same way, the smaller the section of signal we put into spectral analysis, the poorer the estimates of the frequency components it contains.

So we need to come to some kind of compromise: we want sections short enough to see the temporal detail in the speech signal, but long enough to see the frequency detail too. It turns out that there is no single best compromise value for speech signals. If we choose sections long enough that we can see the individual harmonics of larynx vibration, then these sections will be too long to see the time response of the vocal tract resonances (formants) as they respond to an excitation pulse. However, if you need to see source harmonics, then you need sections of this length. This is called a "narrow-band" spectrogram, and the sections are about 20ms long. On the other hand, if you want to see the detailed formant vibrations that occur within the larynx cycle, you need to use sections smaller than a single glottal cycle (which is about 5ms for women), so sections of about 3ms are commonly used. These short sections give us better temporal detail but poorer frequency detail. This is why the analysis is called a "wide-band" spectrogram. Wide-band analysis is most useful for finding formant frequencies.
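
The trade-off can be put in numbers: a window of duration T seconds can only resolve frequency components about 1/T Hz apart. A minimal Python sketch (using the 20ms and 3ms section lengths above):

```python
# Sketch: the time/frequency resolution trade-off in spectrogram analysis.
# An analysis section (window) of duration T seconds can only distinguish
# frequency components roughly 1/T Hz apart.

def analysis_bandwidth_hz(window_s):
    """Approximate frequency resolution of an analysis window of T seconds."""
    return 1.0 / window_s

narrow_band = analysis_bandwidth_hz(0.020)  # 20 ms window -> 50 Hz resolution
wide_band = analysis_bandwidth_hz(0.003)    # 3 ms window -> ~333 Hz resolution

# 50 Hz resolution is fine enough to separate the harmonics of a 100 Hz
# voice (narrow-band); 333 Hz resolution blurs the harmonics together but
# can follow rapid changes in time (wide-band).
print(narrow_band, wide_band)
```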

What should I look for in a speech spectrogram?

If you study a wide-band spectrogram of say a couple of words of speech you should be able to see some of the following events:

  • Larynx excitation pulses: these appear as vertical dark lines at intervals of between 5 and 10ms or so. These are also called striations. Each one of these is caused by the sudden pressure change that arises above the larynx when the vocal folds close suddenly, cutting off the flow of air from the lungs. This change is so sudden it creates a kind of pressure "pulse" which contains energy at a wide range of frequencies, commonly up to 4kHz or more.
  • Formant vibrations: between the striations you will see dark regions which only occur at particular frequencies. These regions, which often appear several hundred Hz across because of the limitations of the wide-band frequency analysis of the spectrograph, are caused by the ringing of the vocal tract resonances (or formants) as each pulse from the larynx excites them. If you look carefully, you may see that these vibrations are larger (darker) just after the pulse, and get paler as the energy in the vibrations is lost from the vocal tract.
  • Changes in formant frequency: You should see that the dark regions caused by formant vibration change in frequency through the utterance. You may see that the resonances have a kind of continuity in time, and slowly rise and fall in frequency through a syllable, and from one syllable to the next. These slow and smooth changes in formant frequencies are because the frequencies of the vocal tract resonances are set by the shape of the vocal tract tube, which in turn is controlled by the position of the articulators. Since the articulators move relatively slowly (a few syllables per second) the formant frequencies appear to move slowly too.
  • Turbulent sounds: In regions where there is no larynx vibration and hence no striations you should see some "speckled" rather noisy unstructured regions of dark colour, often towards the high frequency end of the picture. These are "noise" sounds caused by turbulence in the vocal tract, for instance: bursts, aspiration and frication. Bursts are often short vertical bars, a lot like a striation, caused by the sudden pressure change in the vocal tract when a stop articulation is released. Aspiration is turbulence that occurs in the larynx, caused by a narrowing of the airway from the lungs made by the vocal folds coming close together. Frication is turbulence that occurs at other points of narrowing in the vocal tract, made with the tongue or the lips. If you look carefully you may see differences in the frequency content of bursts or fricatives originating from different places of articulation. This is because the different articulator configurations shape the sound generated by the turbulence in different ways depending on the size and shape of the vocal tract tube in front of the constriction.

What is the difference between a harmonic, a sinusoid and a sinewave?

The term 'sinewave' is just shorthand for the term sinusoidal waveform, i.e. a waveform having a sinusoidal shape. The same can be said for the term 'sinusoid'. A harmonic is a sinewave component of a complex periodic sound, that is to say that (i) it is a sinewave, and (ii) it has been found to be a constituent of some complex sound. The particular characteristic of harmonics is that they must have frequencies which are whole number multiples of the fundamental frequency of the complex periodic sound.

What is meant by 'phase'?

When we talk of the phase of a sinewave, we mean some indication of where it is in its cycle at some given time. Since sinewaves are periodic, they go through the same amplitude values over and over again, once per period. If we say that at some time t the phase of the sinewave is some value θ, then we are giving an indication of exactly what proportion of the way through its cycle it was at time t. We can give a numerical value to phase using a scale of 360 degrees, rather like a position on the perimeter of a circle.

Two sinewaves of the same amplitude and period can still be distinguished if they are at different points in their cycle at any one time. If they are always at the same point in their cycles, we say they are 'in phase' (0 degrees). If they are exactly half a cycle out, we say they are 'out of phase' (180 degrees).
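
This can be checked numerically. The sketch below (with an arbitrarily chosen frequency and sample times) shows that two sinewaves 180 degrees apart are equal and opposite at every instant:

```python
import math

def sinewave(freq_hz, phase_deg, t):
    """Value of a unit-amplitude sinewave of given frequency and phase at time t (s)."""
    return math.sin(2 * math.pi * freq_hz * t + math.radians(phase_deg))

# two 100 Hz sinewaves: one at 0 degrees, one 180 degrees 'out of phase'
for t in (0.0, 0.0025, 0.004):  # a few instants within the 10 ms period
    a = sinewave(100, 0, t)
    b = sinewave(100, 180, t)
    print(round(a, 3), round(b, 3))  # always mirror images of each other
```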

How are Pascals converted to dBSPL?

The formula for amplitude in units of decibels on the Sound Pressure Level Scale is:

Amplitude(dBSPL) = 20 × log10( Amplitude(Pa) / 20µPa )

So if we know some sound has a pressure variation of, say, ±0.2Pa, we would first calculate how many times 0.2Pa is bigger than 20µPa (=0.00002Pa):

0.2/0.00002 = 10000

We then take the logarithm base 10 (i.e. ask how many powers of ten the ratio is):

log(10000) = 4

And finally multiply by 20:

4 x 20 = 80dBSPL

Thus a pressure of 0.2Pa corresponds to an amplitude of 80dBSPL.
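
The whole calculation can be wrapped into one small function, a sketch of the formula above in Python:

```python
import math

P_REF_PA = 20e-6  # reference pressure: 20 micropascals

def pa_to_dbspl(pressure_pa):
    """Convert a pressure amplitude in pascals to dB on the SPL scale."""
    return 20 * math.log10(pressure_pa / P_REF_PA)

print(pa_to_dbspl(0.2))    # ~80 dBSPL, as in the worked example
print(pa_to_dbspl(20e-6))  # ~0 dBSPL: the reference pressure itself
```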

What is a logarithm?

The logarithm function is simply a shorthand way of asking the question 'how many powers of 10 is this number?'. In other words if we have some number x, then log10(x) is the value y which satisfies the equation:

10^y = x

So, if x is 100, clearly y = 2, since 10^2 = 100. We can write log10(100) = 2.

Likewise if x is 0.1, then y = -1, since 10^-1 = 0.1. We can write log10(0.1) = -1.
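
Python's math module provides this function directly, so the two examples can be checked mechanically:

```python
import math

# log10(x) answers the question 'how many powers of 10 is x?'
print(math.log10(100))  # 2.0, since 10**2 == 100
print(math.log10(0.1))  # -1.0, since 10**-1 == 0.1

# raising 10 to the logarithm recovers the original number
print(10 ** math.log10(1234.5))  # ~1234.5
```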

What is the difference between loudness and intensity?

Loudness is a perceptual or subjective quality of a sound; intensity is a physical or objective property. Although changes in intensity can cause changes in loudness, they are clearly two different scales. In particular, sounds which are below the threshold of audibility have a non-zero intensity, but zero loudness.

Intensity is measured in Wm⁻² (watts per square metre), but we usually prefer to use the Sound Pressure Level scale (dBSPL). Loudness can be measured in units called phons, where 10 phons is the perceived loudness associated with a pure tone of 1000Hz at 10dB above the threshold of audibility.

Why do you need harmonics of decreasing size to make a square wave?

Answer A. Because you do. If they were all the same size you would get a different waveform.

Answer B. You can think of successive odd harmonics 'correcting' the curviness of previous combinations of harmonics. A sinewave at the fundamental is a pretty good first attempt at a square wave: it goes up and down at the right places, but it doesn't have a flat top or vertical sides. If we add a little bit of the third harmonic, we find that it has two effects: it reduces the size of the peaks in the first harmonic (since the third harmonic has a different sign at those times), and it also helps to square off the sides of the waveform. The first and the third together do an even better job, but there is still room for improvement: an even smaller fraction of the fifth harmonic helps in just the right way to correct the deficiencies of the first and third. And so on with the seventh, ninth, and the rest. Each time, a smaller and smaller amount of the next odd harmonic is useful to correct the deficiencies left by the lower numbered harmonics.
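
Answer B can be checked numerically. A square wave's Fourier series uses odd harmonics k with amplitude proportional to 1/k; the sketch below evaluates the partial sum on the 'flat top' of the wave and shows that each extra, smaller harmonic brings it closer to the ideal value of 1:

```python
import math

def square_approx(t, n_harmonics):
    """Partial Fourier series of a unit square wave (period 1 s):
    odd harmonics k = 1, 3, 5, ... with amplitudes 1/k."""
    total = 0.0
    for i in range(n_harmonics):
        k = 2 * i + 1
        total += math.sin(2 * math.pi * k * t) / k
    return (4 / math.pi) * total

# sample the middle of the flat top (t = 0.25), where the ideal value is 1.0
print(square_approx(0.25, 1))    # ~1.273: the fundamental alone overshoots
print(square_approx(0.25, 5))    # closer to 1.0
print(square_approx(0.25, 100))  # closer still
```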

Do you need an infinite number of harmonics to make a complex tone exactly?

Yes and No. Our definition of 'complex' simply means not a sinewave. Thus a complex tone which is made up of exactly two harmonics is still complex. On the other hand, a 'perfect' square wave would need an infinite number of harmonics to get exactly vertical sides and a perfectly flat top.

All Fourier promises is that you can get as close as you like with a finite number of harmonics. Remember that our hearing is limited to 20,000Hz, so any harmonics higher than that can't be heard anyway.

Why does squeezing in more periods in one second mean that the harmonics are further apart?

If we increase the fundamental frequency of a complex periodic sound we reduce the duration of one cycle, i.e. we decrease the period. If we are looking at a waveform as we make the change, we will see the individual cycles shorten and crowd together.

The whole point of harmonic analysis is that we add together sinewaves to make the complex waveform. Since the complex is periodic with period T, it must be the case that all component sinewaves will complete a whole number of cycles in time T. If they didn't - say they did one and a half cycles in T - they would add different amounts of amplitude into different cycles and then the sum would not be periodic in T.

If the period of the complex gets smaller, the periods of the harmonics must also get smaller. If the periods of the harmonics get smaller then they must increase in frequency, i.e. move up the spectrum. Since the period of the complex is always a whole number of periods of each harmonic, the frequencies of the harmonics must be whole number multiples of the fundamental frequency. We can think of this as saying that adjacent harmonics will differ by the value of the fundamental frequency. Thus if the fundamental frequency goes up, both the absolute values of the harmonics and their spacing will increase.
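
In code, this is just multiplication: raising the fundamental spreads the whole harmonic series out. A trivial sketch:

```python
def harmonic_frequencies(f0_hz, n):
    """Frequencies of the first n harmonics of a fundamental f0:
    whole-number multiples of f0."""
    return [k * f0_hz for k in range(1, n + 1)]

low = harmonic_frequencies(100, 5)   # [100, 200, 300, 400, 500]
high = harmonic_frequencies(150, 5)  # [150, 300, 450, 600, 750]

# adjacent harmonics are always one fundamental apart, so raising f0
# raises both the absolute harmonic frequencies and their spacing
print(low[1] - low[0], high[1] - high[0])  # 100 vs 150
```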

Why can an aperiodic sound be considered to have an infinite period?

When we say a waveform is aperiodic, all we are really saying is that it hasn't repeated within the interval over which it was observed. As far as our observation is concerned, the signal might as well have been periodic with a period longer than our observation interval. So we can treat the signal as periodic, merely with a very long period. When we come to do harmonic analysis, we find that we have so many harmonics available to us that we might as well assume they can occur at any frequency. For example, if our observation of the signal lasted one second, its period must be greater than 1 second, and so its fundamental frequency must be less than 1Hz. In that case its harmonics are no more than 1Hz apart, and our spectral analysis will look continuous since all the harmonics will have crowded together into a solid block.

How is signals & systems theory applied in speech and hearing science?

We can see the applications of the concepts of signals and systems theory in three areas in speech and hearing science:

1. In modelling speech production: we can explain the acoustic characteristics of speech sounds through the use of the so-called 'source-filter' model, by which we model separately and independently the sources of sound in the vocal tract (i.e. vocal fold vibration and turbulence) from the effect of the shape of the vocal tract on those sounds. In effect, we use our signals terminology to explain the characteristics of the sound sources, and our systems terminology to describe the characteristics of the vocal tract system. Without such a clear separation, it would be very difficult to explain why it seems to be the case, for example, that the pitch and the timbre of vowel sounds can be set independently. The power of this approach is that we can identify two separate influences on the final character of the speech sound: the character of the original source and the character of the subsequent system that shaped it.

2. In instrumental analysis: we can use the theory to build instruments to analyse sounds (e.g. the spectrograph), or to assess the performance of instruments (e.g. a tape recorder or an audiometer). The spectrograph uses band-pass filters to separate out the individual frequency components of a speech sound to display how the energy in the signal changes its character with time. It is the availability of such tools that has allowed us to gain an understanding of how the linguistic content of the message is encoded in sound.

3. In hearing: the peripheral hearing mechanism seems able to deliver to the brain information about sound which is directly related to the spectral analysis we developed in signals & systems theory. Even our simple experiments in harmonic synthesis were able to uncover three important facts about hearing: that our sound perception is relatively insensitive to harmonic phase (from which we can deduce that the ear performs a kind of spectral analysis), that sinusoids close in frequency give rise to the perception of beats (i.e. that the spectral analysis performed is limited in spectral resolution), and that the perception of pitch is not solely reliant on the presence of the first harmonic (i.e. that our hearing has a specialised mechanism to process pitch independently from the spectral analysis). It would be hard to explain our ability to perceive speech without explaining that the linguistic content is encoded in the distribution of energy in the signal across frequency and time, and that our hearing mechanism is able to decode exactly this information from the signal.

What is meant by a 'frequency domain' explanation?

We have met one 'time domain' picture of sound: the waveform. We have also met one 'frequency domain' picture of sound: the spectrum. Here we just use the term 'domain' to contrast the different x-axes in the two cases. The contrast is useful when we also introduce a frequency domain description for systems: the frequency response curve. This allows us to explain what happens to a signal entirely in terms of the constituent sinewaves taking part: i.e. we take each sinewave component in the input spectrum, change its size according to the system response at that frequency to find the output amplitude for the component on the output spectrum. This 'explanation' in terms of spectrum and frequency response only is an entirely frequency-based or frequency domain explanation.

What is the relationship between the time & frequency domain pictures for a sinewave through a resonator?

I hope the following two pictures make this clear. First we put a sinewave through a resonator at close to the resonant frequency of the resonator:

You should be able to see that this resonator gives about 20dB of gain at its resonant frequency, and that the output is quite a bit larger than the input. Contrast with this picture:

Here, you can see that the resonator gives about 20dB of attenuation at this distance from its natural frequency, so that the output signal is quite a bit smaller than the input.

Explain the frequency response graph for band-pass filters and the use of 'bandwidth'.

Look at this schematic representation of the frequency response of a band-pass filter:

A band-pass filter has a response that is close to unity around its centre frequency, with increasing amounts of attenuation on either side. We could use the frequency values for the low-frequency edge and for the high-frequency edge to describe the exact shape of the filter, but the standard preference is to use a combination of the centre frequency and a value called the 'bandwidth' which is a measure of how wide the filter response is towards its top. The definition of bandwidth is then simply the range of frequency values within which the filter response is within 3dB of the peak response. In other words, we find the peak response and draw a horizontal line 3dB below this. Where this line cuts the response curve gives us two frequencies, and the difference between them is the bandwidth.
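
The 'horizontal line 3dB below the peak' procedure can be carried out numerically. The sketch below assumes a textbook second-order band-pass resonance (an illustrative choice, not a filter from the text), whose bandwidth should come out close to centre frequency divided by Q:

```python
import math

def response_db(f_hz, fc=1000.0, q=5.0):
    """Magnitude response in dB of a second-order band-pass resonance
    with centre frequency fc and quality factor q (illustrative model)."""
    x = f_hz / fc
    mag = (x / q) / math.sqrt((1 - x * x) ** 2 + (x / q) ** 2)
    return 20 * math.log10(mag)

# find the peak, draw the line 3 dB below it, and read off the two crossings
freqs = range(500, 2001)
peak_db = max(response_db(f) for f in freqs)
passband = [f for f in freqs if response_db(f) >= peak_db - 3]
bandwidth = passband[-1] - passband[0]
print(bandwidth)  # close to fc / q = 200 Hz for this resonator
```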

How are band-pass filters used for analysis?

The easiest way to understand this is to imagine we have a complex periodic waveform of indefinite length and we seek to find which harmonics are present in the signal and what their amplitudes are. Let us also limit the region of the spectrum we are interested in to 0-20,000Hz, which covers all of human hearing, and let us limit the resolution of our analysis to 20Hz (i.e. we don't care if we are up to 20Hz out on the frequency of any harmonic). This then requires us to use 1000 band-pass filters, each 20Hz wide: the first looks at the region 0-20Hz, the second 20-40Hz, the third 40-60Hz, etc. We take each one of these 1000 filters in turn, and we pass our signal through it. What will come out the other side will be all components of the signal which happen to lie within the limits of the band-pass filter. So if the input signal has a component at 90Hz with an amplitude of 5 volts, then from the filter set at 80-100Hz we will see a sinewave output with an amplitude of 5 volts. We can then plot on our spectrum the fact that in the region between 80 and 100Hz the input signal has an amplitude of 5 volts, and then repeat for the other frequencies.
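
A toy version of one such band measurement can be written directly. Instead of a real analogue filter, the sketch below estimates the amplitude of any sinusoidal component falling in a given 20Hz band by correlating the signal with candidate sinewaves (a crude stand-in for a band-pass filter; the 90Hz, 5 volt test tone matches the example above):

```python
import math

def band_amplitude(signal, fs_hz, f_lo, f_hi):
    """Estimate the amplitude of a sinusoidal component lying between
    f_lo and f_hi Hz, by correlating the signal against candidate
    sinewaves (a crude stand-in for one band-pass filter)."""
    n = len(signal)
    best = 0.0
    for f in range(f_lo, f_hi):
        c = sum(signal[i] * math.cos(2 * math.pi * f * i / fs_hz) for i in range(n))
        s = sum(signal[i] * math.sin(2 * math.pi * f * i / fs_hz) for i in range(n))
        best = max(best, 2 * math.hypot(c, s) / n)
    return best

fs = 2000  # samples per second
tone = [5 * math.sin(2 * math.pi * 90 * i / fs) for i in range(fs)]  # 5 V at 90 Hz

print(band_amplitude(tone, fs, 80, 100))   # ~5: the 80-100 Hz 'filter' finds it
print(band_amplitude(tone, fs, 200, 220))  # ~0: nothing lies in this band
```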

Isn't the treatment of the filtering of complex sounds as the combination of the filtering of independent sinusoids only true in circumstances when the system is linear?

Smart Alec. Yes.

How does a system actually change a signal?

Answer A. The physical characteristics of the system interact with the physical characteristics of the signal. Thus an acoustic resonator is just a cavity of a particular size and shape; when we pass a signal through it, those components of the signal that 'match' frequencies with the resonant properties of the system pass through amplified or relatively unchanged, while those components which occur at other frequencies don't excite the resonator so much and are effectively attenuated.

Answer B. The process is simply best described mathematically rather than in words, but consider this 'wordy' explanation: each tiny pressure variation in the input sound will cause the resonator to vibrate. If it is a simple resonator, it will vibrate with a damped sinusoid at the resonant frequency. The size of this damped sinusoid depends on the size of the pressure variation, and the time at which the damped sinusoid takes place depends on the time at which the pressure variation takes place. Imagine now we have thousands of little pressure fluctuations per second, of all kinds of different sizes. Each one will give rise to a different damped sinusoid: they will be different because the pressure variations will be of different sizes, and they will be different because the pressure variations occurred at different times. What do we see coming out of the resonator but the sum of all the individual damped sinusoids? This combination of individual fluctuations is called the convolution of the input signal with the time response of the system - it is effectively the 'time domain' equivalent of the 'frequency domain' explanation (see above) of how systems process signals.
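
Answer B translates almost line for line into code. This sketch (with an arbitrarily chosen resonant frequency and decay time) launches one damped sinusoid per input sample and sums them, which is exactly time-domain convolution:

```python
import math

def damped_sinusoid(n, fs_hz=8000, f_res_hz=500.0, decay_s=0.002):
    """How a simple resonator 'rings' after a single pressure pulse."""
    return [math.exp(-(i / fs_hz) / decay_s) * math.sin(2 * math.pi * f_res_hz * i / fs_hz)
            for i in range(n)]

def convolve(signal, ringing):
    """Each input sample launches a scaled, delayed copy of the ringing;
    the output is the sum of all these copies (time-domain convolution)."""
    out = [0.0] * (len(signal) + len(ringing) - 1)
    for i, x in enumerate(signal):
        for j, h_j in enumerate(ringing):
            out[i + j] += x * h_j
    return out

# two pressure pulses of different sizes, 20 samples apart
pulses = [1.0] + [0.0] * 19 + [0.5]
h = damped_sinusoid(64)
output = convolve(pulses, h)  # the overlapping sum of two damped sinusoids
```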

Why can't the ear hear phase?

We can be pedantic later in the course, so for now let us agree with the question: it does seem to be true that for complex periodic sounds, changes in the phase of the component harmonics have extremely little effect on the perceived quality of the sound. First you should acknowledge that this is odd: if the harmonics change phase with respect to one another then the waveform shape will be different and you have every right to expect that our perception of the sound will be different too, but it's not.

Without going into the anatomical and physiological detail of the structures of the inner ear, the best explanation we can give is to say that although the acoustic signal going into the ear does indeed change with a shift in phase, the signals sent from the ear to the brain do not change with a shift in phase. What kind of representation preserves information about signals without being affected by phase? Well, simply the information about the amplitude and frequency of the component sinewaves. A corollary of this discovery about insensitivity to phase is that it is evidence for a kind of spectral analysis in the cochlea.

Copyright © 2023 Mark Huckvale

Last modified: 16:10 06-Jun-2010