Model-Based Approach

As discussed earlier, we want a simple signal processing model for HRTFs so that spatial audio can be synthesized efficiently on modern microprocessors. An ideal model would capture all of the salient spectral features of an HRTF that we use for localization. It would also be parameterized by enough individual data (e.g. head measurements) to be accurate, but not so much that it becomes impractical.

Many approaches to developing HRTF models have been tried in recent years. These include coding HRTF measurements using eigenvalue decomposition, using neural networks to reduce the HRTF data set, pole-zero models, and wave equation (physical) models. Many of these approaches aim only at reducing the data size of a set of measured HRTFs rather than actually modelling the physical interaction of the sound with the body.

The wave equation approach seems the most intuitive and natural way to attack the modelling problem. If the system of wave equations could be solved for all of the surfaces of an individual's torso, head and outer ear, then their exact HRTF could be computed. In practice this is an intractable problem. Recently some researchers have reduced the complexity of the physical model by considering only the pinna effects, and wave equation approaches restricted in this way have met with some reported success.

In earlier experiments, a team of researchers discovered that measured HRTFs could be accurately represented by a minimum-phase filter in series with a linear delay. This is significant for two reasons. First, it appears to imply that the HRTF is really a linear system that can be conceptually (and, more importantly, computationally) separated into distinct elements. One element would represent the head shadowing effects that produce the ITDs, while another would represent the filtering effects of the pinna (also incorporating the gain representing the ILDs). Second, this separation of components lends validity to the theory that azimuthal effects can be represented well with ITDs and ILDs, while echoes from the body and pinna cause spectral filtering, which varies with elevation.

Figure 1 Example reflections of sound due to the pinna for different elevations. This aids humans in localizing the elevation of the sound. Adapted from (Brown 1996).

For this project, we have embraced the wave equation modelling philosophy as the most natural, and studied the models that have been published. Surprisingly few people report work with this type of physical modelling approach. We have investigated a modelling scheme which uses this approach and is based on the reported work of Mr. Phillip Brown and Dr. Richard Duda at San Jose State University.

This is a simple model which contains three main components: the head shadow model, the pinna model and the room model. Because we are dealing with a linear system, the order of the components does not matter in the actual implementation. Figure 2 shows a block diagram of the model we implemented.

Figure 2 A block diagram representation of the modelling scheme. The three main components are the Head Shadow Model, the Pinna Model, and the Room Model.

 Head Shadow Model:

This component is responsible for generating the ITDs and ILDs caused by the shadowing effects of the head. The first element here is a simple pole-zero filter which is meant to approximate the Rayleigh spherical head model described earlier. It is parameterized by the angle difference between the location of the ear and the azimuth of the sound source.

Previously we mentioned that the gain of the low frequencies (below 1500 Hz) is not drastically affected by head shadowing. To achieve this effect, the head shadow model has a fixed pole and a zero that moves to produce the desired amount of roll-off, depending on the azimuth of the sound source.

Also, as part of the head shadow component there is a linear delay element. This accounts for the ITD produced by the waves propagating around a rigid sphere to reach the ear.

The pole zero filter suggested by Brown in his model is given by:

H(w, theta) = [1 + j*d(theta)*w/(2m)] / [1 + j*w/(2m)]
Where:
w is the angular frequency in radians per second
m = speed of sound / radius of the head
d(theta) = 1.05 + 0.95 * cos(1.2 * theta), with theta in degrees
and theta is the angular difference between the location of the ear and the azimuth of the sound source.
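
As a rough check on the behavior described above, the filter can be evaluated directly from its formula. The following is a minimal Python sketch, not our MATLAB implementation. The function names are hypothetical, and the speed of sound (343 m/s) and head radius (0.0875 m) are assumed values chosen only for illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value
HEAD_RADIUS = 0.0875    # m, assumed value for an average head


def zero_scale(theta_deg):
    """d(theta): scales the zero frequency; theta is in degrees."""
    return 1.05 + 0.95 * math.cos(math.radians(1.2 * theta_deg))


def head_shadow_response(freq_hz, theta_deg,
                         radius=HEAD_RADIUS, c=SPEED_OF_SOUND):
    """Complex response H(w, theta) of the one-pole, one-zero filter."""
    m = c / radius                 # speed of sound / head radius
    w = 2.0 * math.pi * freq_hz    # angular frequency in rad/s
    d = zero_scale(theta_deg)
    return (1 + 1j * d * w / (2 * m)) / (1 + 1j * w / (2 * m))


def gain_db(freq_hz, theta_deg):
    """Magnitude of the head shadow response in decibels."""
    return 20.0 * math.log10(abs(head_shadow_response(freq_hz, theta_deg)))
```

Calling gain_db(f, theta) over a range of frequencies shows the pattern in the text: the gain is 0 dB at DC for every angle, rises toward about +6 dB at high frequencies when theta = 0 (d = 2), and falls toward about -20 dB when theta = 150 (d = 0.1).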

The linear delay element for each ear (in seconds) is given by

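T(theta) = { -cos(theta)/m                  :  0  <= |theta| <  90
           { (|theta| - 90)*(pi/180)/m      :  90 <= |theta| <= 180

Here theta is in degrees. The factor pi/180 converts the arc angle in the second case to radians, so that the result is in seconds.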
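
The delay is easy to tabulate for each ear. The sketch below is a hedged Python illustration rather than our MATLAB code. It assumes theta is given in degrees, that the arc term is converted to radians so the result comes out in seconds, and it uses the same assumed speed of sound and head radius as defaults. The function name is hypothetical.

```python
import math


def head_delay_seconds(theta_deg, radius=0.0875, c=343.0):
    """Delay T(theta) in seconds, relative to arrival at the head center.

    theta_deg is the angle between the ear and the source azimuth, in
    degrees. radius and c are assumed illustrative values.
    """
    m = c / radius
    a = abs(theta_deg)
    if a > 180:
        raise ValueError("theta must lie in [-180, 180] degrees")
    if a < 90:
        # Ear on the source side: the wave reaches it directly.
        return -math.cos(math.radians(a)) / m
    # Ear on the shadowed side: the wave travels around the sphere.
    return math.radians(a - 90) / m
```

With these assumed values, a source directly opposite one ear gives head_delay_seconds(180) - head_delay_seconds(0) of roughly 0.66 ms, which is the largest ITD the model produces. Note that the near-ear delay is negative, so a practical implementation would likely add a constant offset to both ears to keep the filter causal.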

Figure 3 The frequency response of the Head Shadow Model from our MATLAB implementation, shown for various values of theta (theta is the angular difference between the location of the ear and the azimuth of the sound). Note that for frequencies below 1500 Hz, the sound is not drastically affected.

 Pinna model:

Brown's approach to developing his pinna model was to work only in the time domain. The elevation-dependent pinna effects are essentially caused by a series of echoes reflecting off the surfaces of the pinna and into the ear canal. If the relationship can be found between the elevation of the sound source and the characteristics of the important echoes (time delay and amplitude), this part of the system can be determined.

Figure 4 A measured HRIR (without the Head Shadow effects) with varying elevation. (Brown 1996).

In his lab, Brown made recordings of the HRTFs of individuals at several different spatial locations. After separating the head shadowing effects from the effects of the echoes, he studied the echo behavior as the elevation of the sound source changed. In all individuals, he noticed echoes which changed in a very regular way with elevation. Many of these echoes were produced by the pinna, but some had a distinctly different behavior and were attributed to reflections from the shoulder.

Figure 5 A representation of an HRIR over various elevations, with markings indicating significant echoes. These are the features of the HRIR that are desired to be captured. (Brown 1996).

The amplitudes of the echoes did not seem to change significantly with elevation, so fixed values were picked to characterize them. The time delays of the echoes changed dramatically, however, and Brown fit the following equation to the delay for each echo:

TE(n, theta, phi) = A(n) * cos(theta/2) * sin( D(n) * (90-phi) ) + B(n)
Where:
n is the echo number
theta is the angular difference between the location of the ear and the azimuth of the sound source
phi is the elevation of the sound source.

A, B and D are sets of scaling constants that are indexed by the echo number. After the first 6 echoes (counting the direct sound into the ear as the first echo) the amplitudes were negligible, so later echoes were not considered important cues.
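
To make the structure of the pinna model concrete, the delay equation can be used to place a few fixed-amplitude taps in a short impulse response. The following Python sketch is illustrative only. The amplitudes and the A, B and D values in ECHOES are hypothetical placeholders, not Brown's fitted constants. We also assume that delays are expressed in samples and that fractional delays are spread over two neighboring taps by linear interpolation, which is one simple choice among several.

```python
import math

# Hypothetical placeholder values for illustration only; these are NOT
# the constants fitted by Brown. Each entry is (amplitude, A, B, D),
# with delays assumed to be measured in samples.
ECHOES = [
    (1.0, 0.0, 0.0, 1.0),   # direct sound: no delay
    (0.5, 1.0, 2.0, 1.0),
    (-0.5, 5.0, 4.0, 0.5),
]


def echo_delay(A, B, D, theta_deg, phi_deg):
    """TE = A * cos(theta/2) * sin(D * (90 - phi)) + B, angles in degrees."""
    return (A * math.cos(math.radians(theta_deg / 2.0))
            * math.sin(math.radians(D * (90.0 - phi_deg))) + B)


def pinna_impulse_response(theta_deg, phi_deg, length=32):
    """Build a sparse impulse response with one tap pair per echo."""
    h = [0.0] * length
    for amp, A, B, D in ECHOES:
        delay = echo_delay(A, B, D, theta_deg, phi_deg)
        i = int(math.floor(delay))
        frac = delay - i
        # Split a fractional delay across two adjacent taps.
        if 0 <= i < length:
            h[i] += amp * (1.0 - frac)
        if 0 <= i + 1 < length:
            h[i + 1] += amp * frac
    return h
```

Convolving a sound with pinna_impulse_response(theta, phi) then applies the elevation-dependent echoes. Sweeping phi moves the taps in time while their amplitudes stay fixed, which is the behavior the model is meant to capture.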

As figures 6 and 7 show, this model appears to be a good visual match for the attributes of the echoes found in the measured HRTFs.

Figure 6 A measured HRIR with varying elevation. (Brown 1996).
Figure 7 A modeled HRIR with varying elevation from our MATLAB implementation.

 Room model:

It has long been recognized that the reflections in a room give us sonic cues about our ambient environment. These cues add to our perception that a sound presented over headphones is external, rather than coming from inside the head. To exploit these effects, we have added a room model component to our localization scheme.

This is a simple component, which produces an attenuated and delayed copy of the original sound. This copy of the sound is then fed through the head and pinna model with the same direction parameters as the sound to be localized. This means that we are simulating that the first room echo to arrive at the listener is also coming from the same direction as the sound source (possibly from a wall directly on the other side of the source from the listener location). This should not only aid in the sense of externalization, but also provide an additional cue for localization of the primary sound.

The delay time we used for this room echo was 15 ms, corresponding to a reflection coming from a wall approximately 2.5 meters behind the sound source. This time delay was kept constant, so we were essentially modeling a spherical room with a radius of 2.5 meters. This doesn't necessarily reflect any real physical environment, but it did seem to aid externalization and localization in informal listening tests.
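
The room component amounts to adding one delayed, attenuated copy of the source. The Python sketch below illustrates this under stated assumptions and is not our MATLAB code. The echo gain of 0.3 and the speed of sound of 343 m/s are assumed values, and the function names are hypothetical. Because the echo shares the direction of the direct sound and the system is linear, we assume here that the echo can be mixed into the source before the head and pinna models are applied.

```python
def echo_delay_seconds(wall_distance_m, c=343.0):
    """Extra travel time for a reflection off a wall behind the source.

    The reflected sound covers the source-to-wall distance twice.
    c is an assumed speed of sound in m/s.
    """
    return 2.0 * wall_distance_m / c


def add_room_echo(signal, sample_rate, delay_s=0.015, gain=0.3):
    """Return the signal plus one delayed, attenuated copy of itself.

    gain is an assumed attenuation; the text does not specify one.
    """
    n = int(round(delay_s * sample_rate))
    out = list(signal) + [0.0] * n   # make room for the echo tail
    for i, x in enumerate(signal):
        out[i + n] += gain * x
    return out
```

For a wall 2.5 meters behind the source, echo_delay_seconds(2.5) gives about 14.6 ms, which is consistent with the 15 ms figure used above.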

 MATLAB Code:

The MATLAB code that we wrote to implement our version of the Brown-model can be found below.

Headshad implements the pole-zero head shadow approximation.

Pinna implements the pinna and shoulder elevation effects.

Room implements the room model.

Localize generates a stimulus and implements the whole spatialization model.

A Technical Note about the FFT.


