IARPA's Artificial Intelligence Analyzes Behaviors
A research and development program to create automated software capable of detecting specified behaviors in videos has nearly reached its goal of detecting 75 percent of activities with a false alarm rate of only 2 percent.
The Deep Intermodal Video Analytics (DIVA) program, which is run by the Intelligence Advanced Research Projects Activity (IARPA), is “creating automatic activity detectors that can watch hours of video and highlight the few seconds when a person or vehicle does a specific activity,” the program website explains. Behaviors of interest include carrying heavy objects, loading those objects into a vehicle and then driving away. The DIVA activity detectors work in single- and multicamera streaming video environments and can be used to improve video forensic analysis and real-time alerting of threat scenarios, such as terrorist attacks and criminal activity, the website adds.
“The goal has been to cut through the increasingly overwhelming amounts of security-style video that’s out there—think closed-circuit TV-type of video—and automatically go through it and identify specific activities,” says Jack Cooper, IARPA’s DIVA program manager.
The artificial intelligence (AI) and machine learning (ML) technology could monitor security video at airports, border crossings, or government facilities where camera network operators are overwhelmed with the volume of real-time video to monitor, or it could be used forensically after incidents have occurred to identify relevant activities.
“Imagine an operator who’s got dozens of streams that they’re responsible for monitoring. Humans are very capable of analyzing visual information, but it just becomes too much for one person to look at the number of streams that may be important to them,” Cooper offers.
Analysts would define specific activities of interest, and the technology could either alert them in real time for live video or highlight those behaviors for review during a forensic investigation “instead of having eyeballs on every screen on every video that’s out there,” Cooper suggests. That would dramatically reduce the amount of video human analysts or operators have to view.
The four-year program ends late this month, and researchers are closing in on the final goals for detection and false alarms. “We are currently at 70 percent detection at that false alarm rate, and we’ve got a couple more bites at the apple to squeeze out that last little bit,” Cooper reveals. “We have made a lot of progress over the course of the program. I believe after phase one, we were at about 25 percent detection, and after phase two about 50 percent detection and now 70 percent.”
The major challenge has been similar to the challenge faced by many AI/ML programs: finding the right kind of labeled or annotated data to train the technology. “Data is king. Having large amounts and really high-quality, labeled examples of things you’re looking for is the secret sauce for a machine learning algorithm,” Cooper explains. “To really get to the level we were shooting for, that 75 percent, while maintaining a good false alarm rate, we really needed a lot of data.”
To meet the challenge, the researchers spent time early in the program collecting video samples and labeling the data. “As the program went on, as the teams had more examples, more information, to train their really powerful systems, that was the key,” Cooper explains. “Getting that data, organizing that data, labeling that data, that was really one of the central challenges of the program.”
Cooper stresses the need for the vast amounts of data to be labeled or annotated, adding to the challenge. The program lists more than 30 human behaviors for the technology to accurately identify, which is no easy feat for software. “The complexity of these activities—think about a person talking to another person—that can look very different in different circumstances. Some of the visual cues are very subtle there. So, maybe we need 100, 500, 1,000 examples of that to learn what that activity looks like and go find it.”
That challenge, though, also led to another success for the program: the creation of a new business, Visym Labs. Founded just two years ago as a spinoff from Systems and Technology Research, Visym has enlisted volunteers around the world to use their cellphones to record themselves performing a variety of behaviors. The annotated videos are then provided to the research community for training or testing algorithms.
“One of the teams was able to spin off a new company that has a faster way of creating and annotating data,” Cooper reports. “They actually were able to generate millions of labeled examples in a reasonable time. That technology is out there in the marketplace now, which is great.”
The Visym Labs website explains that large-scale data collection is usually performed in three steps: setting up cameras or scraping raw images or video from the web, sending that data to an annotation team for labeling and then to a verification team to ensure quality.
“This approach is slow, expensive, biased, nonscalable and almost universally does not get consent from each subject with a visible face,” the Visym website explains. “We construct visual data sets of people by enabling thousands of collectors worldwide to submit videos using a new mobile app. This mobile app allows collectors to record a video while annotating, which creates labeled videos in real time, containing only people who have explicitly consented to be included.”
The company also specializes in privacy sensor technology that “applies a private transformation encoded in the sensor optics/analog preprocessor forming a keyed image,” the website explains. It offers different levels of privacy. A high-privacy keyed image, for example, cannot be interpreted by a human without knowledge of the key encoded in the sensor optics, but the same image can be interpreted by a paired so-called key-net.
“Our goal is to create a platform for ethical visual AI that gives you the benefits of visual AI in private spaces like the home while preserving your civil liberties. To do this, we require massive training sets of human activities that are collected with consent, and a new type of visual sensor that is privacy-preserving by design,” Jeffrey Byrne, founder and chief executive officer of Visym, explains in an email exchange. “To date, we have collected over two million videos of people performing activities around the house, which will be highlighted in a new open-research challenge for human activity detection in videos. We will be formally launching this challenge in collaboration with NIST [National Institute of Standards and Technology] in spring 2022.”
Cooper says the DIVA program also considers privacy. “First, the DIVA program is about identifying the activity, not the individual. We’re just identifying that an activity is taking place, such as a person is carrying an object,” he says. Second, he adds, all IARPA programs go through a rigorous review to ensure they collect data through approved methods.
Cooper also touts the program’s real-time processing capabilities. “Another thing we’ve overcome is that we’ve been able to do a pretty good job at detecting these activities and keeping our processing to real time,” he says.
Only two teams—Carnegie Mellon University and the University of Maryland—remain in the competition. The Carnegie Mellon team has, with minimal retraining, applied its system to an activity detection challenge for self-driving cars. Autonomous cars, of course, must identify humans and some behaviors, such as carrying a child or crossing a road.
As part of the Office of the Director of National Intelligence, IARPA supports the entire intelligence community. Once the program is complete, it will be up to the various agencies to determine whether or how to use the technology. “We have technology transition plans in place, and that’s something that we prioritize for any IARPA program,” Cooper says. “We have multiple partners engaged, but we leave it in their hands on how to apply the technology to meet their mission.”
The IARPA team expects to meet its goals at the end of the program, but Cooper notes that more research could be beneficial. “I think we’re going to hit our goals, and those goals were set for a reason, but by no means is activity recognition in video a solved problem. How do you handle less labeled data? How do we go even faster than real time? How do we get the false alarms down another order of magnitude? All of those are valid research problems,” he says.