Voice Tracking Camera: Video Conferencing of the Future

Zachary Knudsen, Jason Lessans, Michael Scholl
Department of Electrical and Systems Engineering


Video conferencing has become increasingly widespread in the workplace. Currently, a camera has to be positioned at the edge of the conference room in order to capture all the employees on screen. This forces people to face in the same direction, making communication awkward. We have designed a system that, when placed in the center of the table, uses three microphones to locate the speaker and turns the camera to face them so that the emphasis is on the person currently speaking.

Table of Contents:

1. Overview
2. Physical Model
3. Angle Estimation
4. Increasing Accuracy
5. Filtering and Other Improvements
6. Implementation
7. References

1. Overview

Our goal is to create a device that quickly and accurately determines the position of a person speaking and points a camera toward them.





2. Physical Model

The sound wave is modeled as a plane perpendicular to its direction of travel. The two microphones of a pair detect a sound wave approaching at an angle, θ, at two different times separated by a time delay, dT. This delay is determined by the speed of sound, Vs, the angle of arrival, θ, and the distance between the microphones, d: dT = d·sin(θ)/Vs, as shown in Figure 1 [1]. Inverting this equation gives the angle of arrival, which points toward the sound source.

Figure 1: Physical model of a microphone pair
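To make the figure concrete, the delay predicted by this relation can be computed for the 16 cm spacing used in our design (a small illustrative sketch; Vs = 343 m/s is an assumed room-temperature value not stated in the text):

```python
import math

# Worked example of the Figure 1 relation dT = d*sin(theta)/Vs.
V_SOUND = 343.0  # assumed speed of sound in air, m/s
D = 0.16         # microphone spacing from this design, m

def pair_delay(theta_deg):
    """Time delay (seconds) between the two microphones of a pair for a
    plane wave arriving theta degrees from broadside."""
    return D * math.sin(math.radians(theta_deg)) / V_SOUND

dt = pair_delay(30.0)
print(f"{dt * 1e6:.0f} microseconds")  # ~233 us for a 30-degree arrival
```

At the 80 kHz sampling rate used later, a 233 µs delay corresponds to a shift of roughly 19 samples, which is why the delay can be measured by counting samples at all.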

This model has limited validity, however: if a sound source is too close to the microphone pair, or, equivalently, if the distance between the microphones is too great, then the sound wave is more spherical than planar, which can lead to erroneous angle estimates. This situation is called the near-field case. The criterion for validity of the plane-wave approximation is as follows [1]:

Equation 1: R > 2d²/λ

Where R is the distance to the sound source, λ is the wavelength, and d is the distance between the microphones in a pair. In our design, the microphones of a pair are 16 cm apart, so the system can accurately locate human voices (maximum frequency 3700 Hz, hence a minimum λ of 9.27 cm) at distances of 55 cm (1.81 ft) and beyond.
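As a quick sanity check on these numbers, the far-field boundary of Equation 1 can be evaluated directly (a minimal sketch; the 343 m/s speed of sound is an assumed room-temperature value):

```python
# Far-field boundary from Equation 1: R > 2*d^2/lambda, evaluated at the
# shortest voice wavelength (highest frequency), where the bound is tightest.
V_SOUND = 343.0  # assumed speed of sound in air, m/s

def min_far_field_distance(d, f_max):
    """Smallest distance R (meters) at which the plane-wave model holds for
    a pair spaced d meters apart and sound frequencies up to f_max Hz."""
    wavelength = V_SOUND / f_max
    return 2 * d ** 2 / wavelength

R_min = min_far_field_distance(d=0.16, f_max=3700.0)
print(f"plane-wave model valid beyond {R_min * 100:.0f} cm")  # ~55 cm
```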


3. Angle Estimation

In order to determine the time delay, we sample the microphone output at 80 kHz and cross-correlate one second's worth of samples. The location of the cross-correlation peak gives a shift, n, in samples that is proportional to the time delay (dT = n/fs). Our equation for the angle therefore becomes [1]:

Equation 2: θ = arcsin(Vs·n / (fs·d))

Where Vs is the speed of sound, n is the measured shift, fs is the sampling frequency, and d is the distance between the two microphones. In a 360-degree environment, however, this equation has two solutions. To resolve the ambiguity, we added a third microphone perpendicular to the pair and measured the delay between it and the other microphones: if the delay is positive, the source is on the same side as the third microphone; if it is negative, the source is on the opposite side.
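The estimation step can be sketched in a few lines of numpy (not our actual firmware; the 80 kHz rate and 16 cm spacing are from the text, while the 343 m/s speed of sound and the simulated 12-sample delay are assumptions for illustration):

```python
import numpy as np

FS = 80_000      # sampling rate from the text, Hz
V_SOUND = 343.0  # assumed speed of sound, m/s
D = 0.16         # microphone spacing from the text, m

def estimate_angle(sig_a, sig_b, fs=FS, d=D):
    """Estimate the arrival angle (radians) for one microphone pair by
    cross-correlating its two signals and applying Equation 2:
    theta = arcsin(Vs * n / (fs * d)), where n is the peak lag in samples."""
    corr = np.correlate(sig_a, sig_b, mode="full")
    n = np.argmax(corr) - (len(sig_b) - 1)  # lag of the correlation peak
    s = np.clip(V_SOUND * n / (fs * d), -1.0, 1.0)
    return float(np.arcsin(s))

# Simulated check: broadband noise reaching the second mic 12 samples later.
rng = np.random.default_rng(0)
src = rng.standard_normal(4000)  # short noise burst as a stand-in for speech
n_true = 12
theta = estimate_angle(src[:-n_true], src[n_true:])
print(f"{np.degrees(theta):.1f} degrees")  # ~18.8 for a 12-sample delay
```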


4. Increasing Accuracy

With three different pairs of microphones, we end up with three angle estimates for the speaker's location. Because of sampling-rate limitations, however, some estimates are subject to a great deal of error, and the most accurate estimate typically comes from the pair most perpendicular to the arriving sound. Figure 2 shows that the pair most perpendicular to the direction of the sound wave is the pair for which the delay between its two microphones is smallest.

Figure 2: Resolution diagram for a microphone pair

The angle reported by this pair is the one that is chosen, and the delay to the remaining microphone is used to determine whether the speaker is in front of or behind the chosen pair.
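This selection rule can be sketched as follows (a hypothetical illustration of the logic described above, with made-up pair names and delay values, not our implementation):

```python
import math

def pick_best_pair(delays):
    """Given the measured lag (in samples) for each pair, trust the pair with
    the smallest absolute delay: it lies most perpendicular to the wavefront."""
    return min(delays, key=lambda name: abs(delays[name]))

def resolve_front_back(theta, third_mic_delay):
    """arcsin leaves two candidate angles, theta and pi - theta; the sign of
    the delay to the remaining (perpendicular) microphone picks between them."""
    return theta if third_mic_delay >= 0 else math.pi - theta

best = pick_best_pair({"AB": 9, "BC": -2, "CA": 7})
print(best)  # BC: smallest |delay|, hence most perpendicular to the source
```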


5. Filtering and Other Improvements

One issue we encountered was filtering out extraneous sounds, such as footsteps, air conditioning, and squeaky chairs. To deal with this, we focus only on sounds within the frequency range of the human voice, roughly 200 to 3700 Hz, and filter the rest out. Since it is impossible to completely remove all extraneous noise this way, we also ignore any sound below a certain loudness threshold, since the speaker's voice will most likely be louder than any other sound in the conference room.
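A crude version of this band-plus-threshold gate can be sketched with an FFT (an illustrative approximation only, not our actual filter; the frame length and threshold value are made-up tuning knobs):

```python
import numpy as np

FS = 80_000  # sampling rate from the text, Hz

def is_voice(frame, fs=FS, band=(200.0, 3700.0), threshold=1.0):
    """Return True only if the windowed spectral energy inside the voice band
    exceeds a threshold; failing frames are ignored as extraneous noise."""
    win = np.hanning(len(frame))                  # taper to limit leakage
    spectrum = np.abs(np.fft.rfft(frame * win)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    return spectrum[in_band].sum() / len(frame) > threshold

t = np.arange(2048) / FS
print(is_voice(np.sin(2 * np.pi * 1000 * t)))      # loud in-band tone: True
print(is_voice(0.2 * np.sin(2 * np.pi * 50 * t)))  # low-frequency hum: False
```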

Since the motor is positioned so close to the microphones, however, we found that vibrations and noise from the motor were sometimes loud enough to be interpreted as a person speaking. We therefore insulated these motor vibrations from the microphones by surrounding the motor with foam, an effective sound barrier.


6. Implementation

As a proof of concept, our design works well. It can track a speaker to within 10 degrees of their true location in less than three seconds, which is well within the 56-degree field of view of the camera. It also provides a high-resolution display from the camera and does a good job of ignoring extraneous noise, which is crucial to determining who is speaking. With the right adjustments, our design would be very valuable to companies that frequently hold web conferences.

There are still several problems with our current design. The largest is that we currently use expensive microphones: with the cheap microphones we had initially intended to use, the design could not detect a low-volume speaker because of their poor signal-to-noise ratio. The design also cannot properly handle simultaneous speakers, and our system for sending all of the audio and video to a computer involves many wires and an analog-to-digital converter, which is somewhat impractical and costly. Additionally, we have not yet found a good way to make the design visually appealing while shielding the motor noise, which is necessary so that conference participants are not distracted by it.

Ideally, we would test the design with inexpensive microphones of higher quality than the ones we initially intended to use. We would also encase the design in a dome-like structure that insulates the motor noise and covers the electrical components and wiring inside; a quieter motor would benefit the design as well. We also need some way of detecting simultaneous speakers so that the camera does not oscillate back and forth between them. Ideally, it would simply choose one of the speakers to track, probably the louder one, but we have not yet figured out how to accomplish this. Finally, all of the wires should be bundled in a rubber sheath so that they do not tangle and are less visually obtrusive.


7. References

[1] York, Joshua. “Acoustic Source Localization,” Undergraduate Research. Washington University, Summer/Fall 2008. Web. 25 Apr. 2011. http://ese.wustl.edu/ContentFiles/Research/UndergraduateResearch/CompletedProjects/WebPages/fl08/JoshuaYork/doa.html

Date Last Modified: May 9, 2011


