The Mind-Blowing Promise of AI-Driven Voice Profiling
In the future, voice analysis of an intercepted phone call from an international terrorist to a crony could yield the caller’s age, gender, ethnicity, height, weight, health status, emotional state, educational level and socioeconomic class. Artificial intelligence-fueled voice forensics technology also may offer clues about location; room size; wall, ceiling and floor type; amount of clutter; kind of device, down to the specific model used to make the call; and possibly even facial characteristics of the caller.
Rita Singh, an associate research professor at the School of Computer Science, Carnegie Mellon University (CMU), leads a team developing the technology. “It’s all about deducing all kinds of biorelevant parameters from a person’s voice. Biorelevant parameters include everything from your physiological parameters, demographic, sociological. The list is very, very long,” Singh says.
Voice research dates back six to eight decades and draws from an array of fields, including medicine, psychology and architectural acoustics. Singh is now in India working on a book about the technology. “In my book, I have a list of more than 150 fields that have reported results about the human voice. There is a lot of information out there, and if you collate or track through that information, you find indications of signatures for a particular parameter, like your blood pressure, in your voice,” she says.
Despite decades of study on the human voice, the field of voice forensics is still young. Singh compares it to DNA science in the mid-1980s. “That’s where this technology is right now. It’s not off-the-shelf. It’s backbreaking work at this point. And it’s going to be backbreaking work for another 10 years. Or five years,” she says.
But with today’s technology, the science should advance relatively quickly. “It’s going to progress much faster than DNA did because we have powerful computers. We have powerful algorithms. We have all of the artificial intelligence power behind that. It’s still nascent, but it’s going to happen fast,” Singh says.
What is already clear, however, is that many variables can influence a person’s voice, and with the rise of artificial intelligence and machine learning, those variables can be detected and extracted. For example, height correlates with the length of the vocal cords and affects the resonance of the voice. The length of the vocal tract and the resulting resonance also can indicate a person’s race—or at least what most people would consider race. “I hate to use the word race because there is no such thing as race. There are three types of skeletal structures in the world, according to anthropologists—Mongoloid, Caucasoid and Negroid. That’s it. It has nothing to do with skin color,” Singh declares. “From the voice, you can tell what the skeletal structure of the person is like.”
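The physics behind that claim is well established. As a rough illustration, not drawn from Singh’s work, the classic uniform-tube model of the vocal tract ties the tract’s length directly to the resonant frequencies, or formants, that shape a voice; the formant values in the sketch below are illustrative textbook figures, not measurements from her research.

```python
# Back-of-the-envelope sketch: the uniform-tube ("quarter-wave") model of the
# vocal tract predicts resonances (formants) at F_n = (2n - 1) * c / (4 * L),
# so an average vocal tract length can be inferred from measured formants.
# The formant values below are illustrative, not taken from Singh's research.

SPEED_OF_SOUND = 35000.0  # cm/s, approximate value in the warm air of the vocal tract


def vocal_tract_length_cm(formants_hz):
    """Estimate vocal tract length from formant frequencies (Hz).

    Each formant F_n yields an estimate L = (2n - 1) * c / (4 * F_n);
    the per-formant estimates are averaged.
    """
    estimates = [
        (2 * n - 1) * SPEED_OF_SOUND / (4.0 * f)
        for n, f in enumerate(formants_hz, start=1)
    ]
    return sum(estimates) / len(estimates)


# Illustrative formant values for a neutral vowel from a typical adult speaker.
print(f"{vocal_tract_length_cm([500.0, 1500.0, 2500.0]):.1f} cm")  # roughly 17.5 cm
```

A longer tract produces lower formants, which is part of why taller speakers tend to sound deeper even when they speak at the same pitch.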
Some of the variables likely are surprising to most people. For example, room clutter affects emotions and can be detected in a person’s voice. The voice also can indicate whether a person is a leader within a group or has had a morning cup of coffee.
This month at a World Economic Forum event, Singh and her team plan to demonstrate a new capability: to re-create a speaker’s face based on artificial intelligence-driven voice forensics. “It’s not at the ‘wow’ stage yet. It’s still a work in progress, and it’s going to take years to perfect. But we’re able to start doing that now,” she reports. “There is enough information in the voice that we can conceivably, at some point in time, generate human faces.”
The science is new enough that researchers do not yet know how many facial features eventually could be derived from voice forensics. Some features—hair color, facial hair, baldness—may not be represented at all in the voice. But then again, they might. Singh says researchers could one day discover that a gene for baldness correlates with a gene for a particular kind of voice, for example.
She indicates that it is hard to describe over the phone exactly how the technology works. “I at least need a whiteboard in front of me,” Singh says.
But, she explains, it begins with a theory that a particular parameter might create a signature within the voice signal. If the researchers have reason to believe that a signature exists, then they design algorithms to perform feature engineering. Simply stated, feature engineering transforms the raw signal into measurable features from which machine learning algorithms can pick out those signatures.
In this case, Singh says, the feature engineering is based on a subfield she is trying to create that she terms microarticulometry. Articulometry is the measurement of human articulation. “Microarticulometry has to do with looking very finely into the voice signal to derive clues from it,” she says. “All of the parameters we try to deduce have signatures in the voice. Many of the parameters appear at very fine levels of the voice.”
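As a loose illustration of what such feature engineering can look like in practice, the sketch below reduces a recording to a vector of hand-designed acoustic descriptors using the open-source librosa library. It is a generic example under assumed choices: the features shown, such as MFCCs and pitch statistics, are common stand-ins, not Singh’s microarticulometric measures.

```python
# Illustrative feature engineering on a voice recording: hand-designed acoustic
# descriptors (MFCCs, spectral centroid, pitch statistics) are computed from the
# raw waveform so a downstream learner has candidate "signatures" to work with.
# A generic sketch, not the CMU team's actual pipeline.
import numpy as np
import librosa


def engineer_voice_features(path: str) -> np.ndarray:
    y, sr = librosa.load(path, sr=16000)                       # mono waveform at 16 kHz

    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)         # timbre / vocal tract shape
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)   # spectral "brightness"
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)              # fundamental frequency track

    # Collapse each time-varying descriptor into summary statistics so every
    # recording becomes one fixed-length feature vector.
    return np.concatenate([
        mfcc.mean(axis=1), mfcc.std(axis=1),
        [centroid.mean(), centroid.std()],
        [np.nanmean(f0), np.nanstd(f0)],
    ])
```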
The algorithms must be precisely tuned to detect and extract those very fine variables. That means the margin of error is minuscule. “You make a very tiny error, and you’re off,” Singh notes. “There may be patterns in the voice signal that I can’t see with my eyes. They may exist in high-dimensional mathematical spaces that can only be computationally analyzed. I cannot discover them using traditional methods. That’s where artificial intelligence comes in.”
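Conceptually, that is where a learning algorithm takes over: it searches the high-dimensional feature space for patterns correlated with a target parameter. The sketch below continues the example above with synthetic placeholder data and a hypothetical target, speaker height; it shows the generic idea only, not the CMU team’s models.

```python
# Continuation of the sketch above: once each recording is reduced to a feature
# vector, a learning algorithm searches that high-dimensional space for patterns
# correlated with a target parameter (here, hypothetically, speaker height).
# Synthetic placeholder data stands in for real recordings.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_recordings, n_features = 200, 30                 # e.g., the 30-dimensional vectors above
X = rng.normal(size=(n_recordings, n_features))    # stand-in engineered feature vectors
y = 170 + 5 * X[:, 0] + rng.normal(scale=4, size=n_recordings)  # stand-in heights (cm)

model = RandomForestRegressor(n_estimators=300, random_state=0)
scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
print(f"cross-validated MAE: {-scores.mean():.1f} cm")
```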
The technology can be helpful even with short messages, although the number of parameters will be limited. “The Coast Guard gets calls that sometimes have only one word: Mayday,” she says.
The U.S. Coast Guard has used the technology to help cope with prank callers. The service receives more than 200 false maritime distress calls a year, and the number is growing, according to a Department of Homeland Security Science and Technology Directorate article published last year. Every distress call launches an expensive search and rescue effort involving at least a small rescue boat, a C-130 fixed-wing aircraft or a rescue helicopter, and their crews. The cost of each outing can run from $10,000 to $250,000. One anonymous caller cost the Coast Guard $500,000 in 2014 by prompting 28 unnecessary rescue missions.
A Coast Guard spokeswoman says the maritime service does not currently provide any funding for the CMU team but still works with it.
“The Coast Guard has an agreement with CMU in that we get their assistance using the voice forensics technology with the hoax call cases, and they can publish their works with clearance from us,” says Lisa Novak, a Coast Guard public affairs officer. “We have used the lab to investigate several hoax call cases, starting about four years ago. The technology has been used in various ways, such as to identify logical suspects, to exclude suspects, to identify someone where there is a hoax call with multiple callers and to support referrals to the U.S. attorney’s office for criminal prosecution consideration, to name a few. It has been used as a tool to obtain more information and not relied on as a sole piece of evidence.”
The Coast Guard asked Singh to analyze hoax calls, and she came up with a list of what she could discover about the caller’s physical characteristics and environment. The list included age, height, weight, background noise and type of room where the call was made.
Singh reports that a lot has changed with the technology over the past year. “There have been improvements in every aspect since then,” she says. She cautions, however, that changes are incremental, so “nothing magical” has happened in recent months.
Singh, along with three other technology and business experts, is co-founding a company called telling.ai to commercialize the technology. The company will initially focus on health care to aid early diagnosis of diseases such as Parkinson’s and Alzheimer’s and to more easily identify different kinds of intoxication, including from recreational drugs.
Singh also reveals that she has worked with other agencies as well as the intelligence community, but she says she does not feel free to disclose much. In some cases, she suggests, lives could be at stake if she reveals the details of an investigation. In others, she simply does not know whether a case has been closed. “They don’t tell me. I give them my analysis, and they’re gone,” Singh says.
Because the field of study is young, many challenges and opportunities remain, she allows. For example, because of high migration levels and the melding of languages, determining a person’s region of origin is growing increasingly difficult. Also, it is difficult to determine whether a particular signature within the voice signal is tied to multiple parameters.
Still, the technology offers broad benefits. Singh indicates that she is open to recommendations for use cases. She suggests that voice forensics could be a noninvasive way of monitoring workers for signs of stress, fatigue or health issues. Additionally, the profiling technology could lead to more secure voice authentication processes. “That is something that should be doable in the near future,” she concludes.