TOKYO -- Picking human voices or other specific sounds out of a background cacophony is not exactly easy, but technologies to do it accurately and quickly are developing rapidly.
One technology driver is the demand for better voice recognition to take advantage of the high-performance microphones built into gadgets like smartphones, tablets and navigation systems.
But companies are polishing new technologies for other applications as well, including the capture of sounds from far away, and the extraction of individual voices from multiparty teleconferences.
Nippon Telegraph and Telephone has taken the concept of a telephoto lens and applied it to acoustics, developing the world's first device that can zoom in on the voice of a single person standing 20 meters away.
The device, which NTT has dubbed the "zoom microphone," is linked to a camera, so when the camera focuses in on a person, the person's voice can be clearly heard. The system works so precisely that only the person in focus is heard, and other people talking nearby are inaudible.
At television recording studios, shotgun microphones are stationed here and there to capture the audio of the show. These cylindrical microphones can accurately capture the targeted sounds over short distances, but the working range is limited to several meters.
Acoustic energy spreads out as it travels -- in open air, sound intensity falls off with the square of the distance -- so by the time the sound waves reach a listening device 20 meters away, there is little left to collect.
NTT's zoom microphone does the seemingly impossible by using a set of 100 small microphones fitted to an array of 12 parabolic sound collecting dishes. The entire structure stands 1.5 meters tall and is 4 meters wide.
The microphones are common commercial products -- the secret is in how they are arranged. Sound waves reflect off the parabolic collectors, so the microphones capture a complex set of acoustic data, including the strength of the sound waves and their arrival times. The 100 microphones capture as many different sound characteristics as possible, and this information is used to distinguish and isolate the desired sounds.
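NTT has not published the algorithm behind its zoom microphone, but the principle of combining many channels by their arrival times is the basis of classic delay-and-sum beamforming. The sketch below is a minimal illustration of that general technique, not NTT's implementation: each channel is shifted to undo its propagation delay, so sound from the focused direction adds coherently while off-target sound partially cancels.

```python
import numpy as np

def delay_and_sum(signals, delays, sample_rate):
    """Align each microphone channel by its estimated arrival delay
    and average the result: sound from the focused direction adds
    coherently, while sound from other directions partially cancels."""
    n_mics, n_samples = signals.shape
    output = np.zeros(n_samples)
    for sig, delay in zip(signals, delays):
        shift = int(round(delay * sample_rate))
        output += np.roll(sig, -shift)  # undo the propagation delay
    return output / n_mics

# Toy example: the same 440 Hz tone reaches three microphones with
# different delays; aligning the channels restores a strong signal.
fs = 16000
t = np.arange(fs) / fs
delays = [0.0, 0.001, 0.002]
mics = np.stack([np.sin(2 * np.pi * 440 * (t - d)) for d in delays])
focused = delay_and_sum(mics, delays, fs)
```

With 100 channels instead of three, the same idea yields much sharper spatial selectivity, which is why handling so many channels at once matters.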
"The technology for simultaneously processing large amounts of audio information has advanced tremendously," explained Kenta Niwa of NTT's Media Intelligence Laboratories.
Systems used to be able to handle only the data from four to eight sound channels. But now there are devices that can handle 100 channels, and that is the advance that enabled NTT to use so many microphones for its zoom microphone.
NTT is working to make its device smaller and more functional. How might it be used? One way would be at sports events, for example to pick out the voice of a single soccer player in a match at a stadium. Another application would be for a remote conversation with someone in the middle of a noisy factory.
While NTT guns for distance, Toshiba is going for classification, with a technology that can sort out the participants in a meeting of around 10 people and identify who is saying what.
Toshiba's system can extract the remarks of a specific person from the recording of a meeting. It does this not only by classifying people according to their voice characteristics, but also by estimating the direction from the microphone to each speaker.
To estimate the direction to a speaker, the company exploits the differences in the time it takes sound to arrive, which show up as phase differences between the arriving waves. By combining this information with prerecorded reference sound waves, the direction to any given speaker can be estimated to within an angle of three degrees.
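The article does not detail Toshiba's method, but the geometry behind any such estimate is the standard time-difference-of-arrival (TDOA) relation: for a distant source and two capture points a known distance apart, the arrival-time difference fixes the angle via sin(theta) = c*dt/d. The sketch below illustrates that textbook relation under those far-field assumptions; the spacing and delay values are hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in air at room temperature

def direction_from_tdoa(delta_t, mic_spacing):
    """Estimate the off-axis angle (degrees) of a distant sound source
    from the time-difference of arrival between two capture points,
    using the far-field approximation sin(theta) = c * dt / d."""
    ratio = SPEED_OF_SOUND * delta_t / mic_spacing
    ratio = max(-1.0, min(1.0, ratio))  # clamp rounding noise into asin's domain
    return math.degrees(math.asin(ratio))

# A sound arriving 0.1 ms earlier at one of two points spaced
# 10 cm apart comes from roughly 20 degrees off-axis.
angle = direction_from_tdoa(1.0e-4, 0.10)
```

Timing resolution translates directly into angular resolution, which is why precise phase measurement lets the system resolve directions to within a few degrees.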
Toshiba also incorporated new thinking on how to combine the directional information with data about each speaker's vocal characteristics. If too much emphasis is placed on the directional information, two people situated in basically the same direction from the microphone can be incorrectly treated as a single person. So for Toshiba's system, the directional information is used only when necessary, for example when two speakers have similar voice characteristics.
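The decision logic described above -- voice characteristics first, direction only as a tiebreaker -- can be sketched as a simple rule. This is an illustrative reading of the article, not Toshiba's actual classifier, and the similarity and angle thresholds are hypothetical.

```python
def same_speaker(voice_similarity, angle_a, angle_b,
                 voice_threshold=0.8, angle_threshold=10.0):
    """Decide whether two utterances come from the same speaker.
    Voice similarity is the primary cue; the direction estimate is
    consulted only when the voices are too similar to tell apart."""
    if voice_similarity < voice_threshold:
        return False  # clearly different voices: direction not needed
    if abs(angle_a - angle_b) > angle_threshold:
        return False  # similar voices, but clearly different directions
    return True
```

Gating the directional cue this way avoids the failure mode the article describes, where two people sitting in roughly the same direction would otherwise be merged into one speaker.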
Technologies that rely on voice characteristics alone classify speakers with a precision of only 31% when a meeting has seven or more participants. The new technology achieves a precision of 74%. Running on an ordinary personal computer, it can classify the speakers in the recording of a three-hour meeting in around five seconds.
Devices such as car navigation systems have trouble hearing spoken commands in a noisy environment, so NEC has developed a way to cut out the noise.
The system uses a pair of microphones and compares the signals reaching each one to distinguish the human voice from noise sources such as the air conditioner and the car's audio system. Because removing the noise components distorts the remaining sound, the signal is then adjusted so the navigation system's speech recognition feature can do its job: the spoken sounds are compared with prerecorded model voice sounds and corrected as needed.
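NEC's exact processing is not described, but one classic way to remove noise given a reference channel is spectral subtraction: estimate the noise's magnitude spectrum from the reference, subtract it from each frame of the noisy signal, and keep the noisy phase. The sketch below shows that generic technique, not NEC's method; it also exhibits the distortion the article mentions, since magnitudes are floored at zero.

```python
import numpy as np

def spectral_subtraction(noisy, noise_ref, frame=512):
    """Minimal noise-suppression sketch: estimate the noise magnitude
    spectrum from a reference channel and subtract it frame by frame
    from the noisy channel, reusing the noisy signal's phase."""
    noise_mag = np.abs(np.fft.rfft(noise_ref[:frame]))
    out = np.zeros_like(noisy)
    for start in range(0, len(noisy) - frame + 1, frame):
        spec = np.fft.rfft(noisy[start:start + frame])
        mag = np.maximum(np.abs(spec) - noise_mag, 0.0)  # floor at zero
        out[start:start + frame] = np.fft.irfft(mag * np.exp(1j * np.angle(spec)))
    return out
```

The flooring step is what introduces distortion in the recovered speech, which is why a correction stage comparing the result against model voice sounds, as the article describes, is needed before speech recognition.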
NEC says its new technology enables speech recognition to work even when noise levels are five times higher than conventional technologies can handle.
"The spread of smartphones with speech recognition is what spurred us to rethink voice input technologies," explained Shinichi Ando, a manager in the NEC Information and Media Processing Laboratories.
Japan has fallen behind other countries in the manufacture of smartphones. But with these new voice recognition technologies, the country could soon recover ground in the field of audio software.