Sonic Searches Gain Speed

June 2002
By Maryann Lawlor

Software locates spoken words and phrases at maximum velocity.

Analysts who must search hours of audio recordings for key words of particular importance to a mission now can find them in a matter of seconds with nearly 100 percent accuracy. Because the technology supports any task that requires the search, analysis and monitoring of voice content, potential customers for the capability range from intelligence organizations looking for terrorist code words to customer service personnel seeking to improve client relations. Additional applications include knowledge management, training and education.

The advent of word processing programs greatly enhanced accessibility to information within text documents. Taking advantage of this capability when working with audio recordings, however, requires the transcription of the spoken word into the written word. This process is not only time-consuming but also risks introducing errors into the data set. In addition, if a search query is misspelled when entered, references to the term may be missed.

As audio and video capabilities became more affordable, organizations began to take advantage of them to record, save and share presentations, educational material and reports. However, locating specific references contained within hours of a taped presentation continued to be a tedious process.

A phonetic search engine (PSE) developed by Fast-Talk Communications Incorporated, Atlanta, addresses this problem by allowing users to search recorded speech and locate specific terms at a rate approximately 72,000 times faster than real time. Using the software, 20 hours of audio or digital content can be searched for a key word in one second with 99 percent accuracy.

Based on identifying and indexing phonemes, the smallest units of speech, Fast-Talk’s audio searching technology can identify and retrieve any word, proper name or phrase, regardless of who is speaking or how a word is spelled. As a result, hundreds of hours of telephone messages can be searched for specific terms in a matter of seconds.

According to Jackie Fenn, vice president and research fellow at the Gartner Group, Lowell, Massachusetts, the audio search field is at the beginning of a commercial growth spurt. Although the software benefits the commercial sector, any advance in technology also offers advantages to government agencies. “The government has been doing searches for years, but any time a commercial product is available it is a jump forward. It offers new approaches that they’ll be able to take advantage of,” she points out. Fenn adds that Fast-Talk also allows users to employ the capability with other technologies.

The software is the result of more than five years of research and development conducted by Dr. Mark Clements and Peter Cardillo at the Georgia Center for Advanced Telecommunications Technologies, Georgia Institute of Technology. Clements is the co-founder of Fast-Talk Communications, and Cardillo is one of the company’s developers.

Before the content of an audio recording can be searched, it must be indexed by the Fast-Talk phonetic preprocessing engine. Indexing can be performed on archived recordings or during the recording period. The engine creates a high-speed phonetic search track that is parallel to the spoken audio track and a discrete index file that is searchable immediately.

The preprocessing procedure takes into account several variables. Each language has its own set of phonemes and grammatical structures. In addition, the quality of the recording medium may vary. While broadcast recording generally yields high-quality audio, landline and cellular telephones render lower quality results.

Fast-Talk’s PSE uses multiple techniques to address these variables. Acoustic models describe the expected characteristics of the audio files to be preprocessed, and the engine applies the most appropriate model to the media at hand. For example, broadcast audio provides the cleanest sound track to search because the speech is clear and little or no background noise is present. A different acoustic model must be applied when preprocessing audio that is recorded from landline or cellular telephones or with lower-end recording equipment.

The preprocessing is done at a 10-to-1 ratio, so an hour of recorded audio can be indexed in six minutes. The time required to process the information decreases if faster, multiple or massively parallel computing architectures are used. During the preprocessing phase, the PSE creates a search track, which is the auxiliary data set that is searched when the user submits a retrieval request. Once the audio track’s content has been preprocessed, a user can search it immediately.

Although the search track requires about 6.4 kilobytes of additional storage for each second of recorded audio, the information is highly compressed. As a result, the cost of storage is approximately 10 cents per recorded hour of speech when commercially available secondary storage media are used.
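
The preprocessing and storage figures quoted above lend themselves to a quick back-of-the-envelope estimate. The sketch below is illustrative only: the function names are hypothetical, a kilobyte is assumed to be 1,000 bytes, and the `machines` parameter assumes ideal linear scaling across parallel hardware, which the article does not claim.

```python
INDEX_SPEEDUP = 10                    # preprocessing runs at a 10-to-1 ratio
SEARCH_TRACK_BYTES_PER_SECOND = 6400  # 6.4 KB per second, assuming 1 KB = 1,000 bytes


def indexing_minutes(audio_hours, machines=1):
    """Minutes needed to index the audio (assumes ideal linear scaling)."""
    return audio_hours * 60 / INDEX_SPEEDUP / machines


def search_track_megabytes(audio_hours):
    """Extra storage, in megabytes, taken up by the search track."""
    return audio_hours * 3600 * SEARCH_TRACK_BYTES_PER_SECOND / 1_000_000


print(indexing_minutes(1))        # one hour of audio
print(search_track_megabytes(1))  # search track for that hour
```

Under these assumptions, one hour of audio takes six minutes to index and adds roughly 23 megabytes of search track.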

Patrick Taylor, senior vice president, Fast-Talk Communications, explains that the system does not identify the actual words but rather the sounds. Compared to other audio searching techniques, this software’s accuracy actually improves when a query is made for phrases. “When you input a phrase into a text-search program, if you didn’t get the phrase precisely correct, you may get no hits. With Fast-Talk, the more you give it to work with, the better the result,” he explains.

To locate a term or phrase, the user inputs a query using a standard keyboard. Because the system is searching for sounds rather than typed text, spelling is inconsequential. This is especially beneficial when searching for proper names, Taylor says. If a user is searching for “Qaddafi,” the name can be submitted as Khaddafi, Quadafy, Kaddafi or even Kadoffee, and it still will be found, he contends.
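
The spelling tolerance Taylor describes can be illustrated with a deliberately crude stand-in. The sketch below is not Fast-Talk's algorithm: it is a toy normalization, with hypothetical rules and a hypothetical function name, that reduces a typed query to a rough sound key so that different spellings of the same name collapse to the same string.

```python
import re


def sound_key(spelling: str) -> str:
    """Reduce a spelling to a rough sound key (toy rules, illustration only)."""
    s = spelling.lower()
    # Map a few letter groups that commonly stand for the same sound.
    for letters, sound in (("kh", "k"), ("qu", "k"), ("q", "k"), ("ph", "f")):
        s = s.replace(letters, sound)
    # Treat every vowel (and y) as one generic vowel sound.
    s = re.sub(r"[aeiouy]", "a", s)
    # Collapse runs of the same character, such as "dd" or "ee".
    s = re.sub(r"(.)\1+", r"\1", s)
    return s


variants = ["Qaddafi", "Khaddafi", "Quadafy", "Kaddafi", "Kadoffee"]
keys = {sound_key(v) for v in variants}
print(keys)
```

All five spellings from the example reduce to the same key, which is the property a sound-based query relies on. A production system would use a real pronunciation dictionary and letter-to-sound model rather than rules like these.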

Once the text is entered, the PSE applies an internal dictionary and a letter-to-sound algorithm and converts it into a set of phonemes that are used for the search. Users can hunt for either single words or phrases. A Boolean grammar set features a temporal operator that specifies time intervals between terms. So the system can be instructed to look for “brain cancer” spoken within 60 seconds of “cellular telephone,” for example.
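
A temporal operator of this kind can be sketched as a filter over two lists of hit times. The code below is a minimal illustration, not the PSE's query grammar: the function name is hypothetical, and each term's hits are assumed to be available as offsets in seconds from the start of the recording.

```python
from bisect import bisect_left, bisect_right


def within(hits_a, hits_b, window_s):
    """Return (a, b) time pairs where b falls within window_s seconds of a."""
    b_sorted = sorted(hits_b)
    pairs = []
    for a in sorted(hits_a):
        # Binary-search the slice of b hits inside [a - window, a + window].
        lo = bisect_left(b_sorted, a - window_s)
        hi = bisect_right(b_sorted, a + window_s)
        pairs.extend((a, b) for b in b_sorted[lo:hi])
    return pairs


# Hypothetical hit times, in seconds, for the two phrases in the example.
brain_cancer = [100.0, 500.0]
cellular_telephone = [130.0, 700.0]
print(within(brain_cancer, cellular_telephone, 60))
```

With these made-up hit times, only the mention at 100 seconds has a companion term inside the 60-second window.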

The company’s researchers currently are developing the capability to input queries using digital audio devices such as desktop microphones, telephones or handheld devices. Taylor believes this feature will be available within the next year.

Search results are sorted by confidence level with both the track and time designation indicated. Clements notes that this feature is beneficial. “Because the PSE uses a phonetic model for searching and returns results based on the confidence level of the match, it does not have to make hard decisions about whether a piece of speech fits the search criteria. It simply reports its confidence in each result and sorts the complete list so candidates with the highest level of confidence are at the top of the list. The user may peruse the list as deeply as desired,” he says.

The technology offers additional advantages over other types of audio searching capabilities. For example, the large vocabulary continuous speech recognition approach converts speech into text that can be quickly searched. However, because the text transcripts are a collection of words in the recognizer’s dictionary, words not in that dictionary will not be found. When new vocabulary is added, reprocessing is required, which adds a significant amount of time to the procedure. Error rates for this method can be more than 40 percent.

Word spotting is another technique used to search audio files. While this method provides an open vocabulary, it is difficult to conduct a search faster than in real time, Clements explains. Although approximations can be introduced to accelerate the search, this enhancement can lead to decreased accuracy, he adds.

To date, the PSE has been employed in the Microsoft Windows NT and Windows 2000 environments, but it also can be used with other operating systems such as Linux. The kernel of the PSE is implemented as a library for direct linking from Microsoft Visual C++. It also includes an interface layer that is compliant with Microsoft’s Component Object Model (COM), which allows integration with any language that can access COM objects, such as Visual Basic, Active Server Pages and Java. Because the technology is a plug-in, audio/video asset management firms can offer other technologies that provide additional features and benefits.

If the search data are divided into multiple search tracks, banks of computers can be used to retrieve data concurrently. The results can be merged into one list and then sorted in confidence order as if only one search had been performed.
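
Merging per-track results into one confidence-ordered list can be sketched in a few lines. The example is an assumption-laden illustration, not the PSE's interface: the `Hit` record and `merge_results` function are hypothetical, and each machine is assumed to return its hits already sorted from highest to lowest confidence.

```python
import heapq
from typing import NamedTuple


class Hit(NamedTuple):
    confidence: float  # match confidence, higher is better
    track: str         # which search track produced the hit
    time_s: float      # offset into the recording, in seconds


def merge_results(*per_track_hits):
    """Merge lists that are each sorted by descending confidence."""
    return list(
        heapq.merge(*per_track_hits, key=lambda h: h.confidence, reverse=True)
    )


# Hypothetical results from two machines searching different tracks.
track_1 = [Hit(0.9, "track-1", 12.0), Hit(0.4, "track-1", 80.0)]
track_2 = [Hit(0.7, "track-2", 5.0)]
merged = merge_results(track_1, track_2)
print([h.confidence for h in merged])
```

Because each input is already ordered, the merge never needs to re-sort the full result set, so the combined list reads as if a single search had produced it.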

The PSE can be Web-enabled. This feature allows users to extract content remotely. Speed of performance depends on available bandwidth.

Taylor maintains that this technology is going to open up computer-related productivity in an entirely new dimension. Many organizations have gathered audio recordings but keep them only as backup in case questions arise. Fast-Talk’s software makes specific terms within a recording easily accessible. Also, printed transcripts of recordings present the specific terms that were used but not the intonation of a person’s voice, which could indicate whether a customer, for example, is simply stating a fact or is expressing frustration, he explains. By locating the segment of the recording that includes the search term, users can listen to the speaker and gain insight into the context of the comment.

Clements offers another example of how the technology can help businesses. “An interesting possibility for the PSE is the application of real-time scanning of large data sets. Suppose that an executive of a large corporation wants to know about all references to the company’s product name in real time, anywhere in the world,” he says. “A continuously running PSE could locate the occurrences and notify the executive—via e-mail, instant messaging, paging, telephone or fax—each time the product is mentioned. It even may be possible to detect the tone of the reference, that is positive or negative reference.”

Last September, the company introduced an evaluation kit for a Spanish version of the technology. To create the data set for early evaluation release, the firm used broadcast-quality radio and television content from media outlets in Mexico and Cuba. The data for additional regional dialects were incorporated for the final software version, which is currently available.

Fenn notes that organizations can use Fast-Talk software as a knowledge management or training tool. Recordings of internal meetings or general presentations can be shared with other employees who can search hours of content for the specific terms or sections they are interested in hearing.

“The education material space is the one showing the most interest. It will probably require some reshuffling of infrastructure, but in the next two to three years there will be some niche operations for it,” Fenn predicts.

Kevin Vest points out that government agencies have used word-spotting and other techniques to search audio recordings but none has the performance value of Fast-Talk’s software. Vest is the assistant vice president and new media technology division director of engineering, Science Applications International Corporation (SAIC), McLean, Virginia.

Late last year, SAIC began integrating Fast-Talk’s technology with Screening Room, a video search and republishing technology developed by Convera, Vienna, Virginia. The goal is to create scalable high-performance access to any analog or digital data from an ordinary Web browser, Vest explains.

“There are a whole range of products that will come out of this. We’ve just scratched the surface. We have demonstrated the technology to a number of people in the military and government sectors and the distance learning arena. They’re quite interested in it,” he offers.

Knowledge management and preservation are two other areas Vest believes can benefit from the technology. As more government employees reach retirement age, the technology can be used to capture their experience, and recordings can be easily accessed for training purposes, he adds.

Taylor points out that one area where the software can be particularly useful is in the federal prison system. Telephone conversations are recorded regularly so they can be reviewed for suspicious terms. Using Fast-Talk, the recordings can be reviewed almost immediately and with greater accuracy.

Although the current version of Fast-Talk’s searching software is most effective with broadcast-quality recordings, it can be used in telephony environments. The company has released a beta version that addresses the specific challenges telephony audio sources pose. The first version is scheduled for release this summer.

Vest notes that all the uses for this type of technology have yet to be discovered. “When you have a new technology, you have to look at what possibilities open up,” he says.


Additional information on Fast-Talk is available on the World Wide Web at
