Two closely related science and technology programs aim to improve image location and search capabilities, saving intelligence analysts significant time and effort.
U.S. intelligence analysts often must wade through enormous amounts of imagery—both photographs and videos—to uncover the exact information needed. To make matters worse, data often does not contain geolocation tags, which indicate where the images were taken.
To address this, the Intelligence Advanced Research Projects Activity (IARPA) designed its Finder program to help analysts locate non-geotagged imagery, whether photographs or video. A second IARPA effort, the Aladdin Video program, seeks to improve search capabilities for specific events so that analysts can more quickly find the videos most relevant to their needs.
For imagery without geolocation tags, analysts work hard to deduce as much as they can using reference data from many sources, including overhead and ground-based images, digital elevation data, existing well-understood image collections, surface geology, geography and even cultural information. It is a time-consuming and labor-intensive exercise that often meets with limited success. “The goal of the Finder program is to develop tools to help analysts locate where in the world images or video were taken,” explains Jill Crisman, IARPA program manager for both Finder and Aladdin Video.
To do this, the system requires a model of the world. The model includes a wide variety of reference data, such as satellite imagery, maps and photographs publicly available on the Internet. “Finder compares what it sees in a query image or video that it is trying to locate with that world model,” Crisman says. “With Finder, analysts only have to look at a few possible locations. Before, they would actually have to search the entire region where they think it might be located to see if the information matches various locations. We’re building automated tools to try to filter out the stuff that is obviously not in those locations.”
Crisman likens the process to a person sorting through old photographs and trying to remember where they were taken. “Basically, Finder’s research is looking to use every last bit of information that’s in the query image or video. The visible terrain, the layout of the objects or even the vegetation can be hints to the query image or video’s location,” she states. “One of the nice things is that it actually creates a rationale that describes how the query was located so that analysts can keep track of what they did and what information was used.”
Three teams currently are engaged in the Finder program, which kicked off in June 2011, and they have created prototype systems to be used for evaluation. They are headed by Applied Research Associates, Albuquerque, New Mexico; ObjectVideo, Reston, Virginia; and BAE Systems, Arlington, Virginia.
The teams were evaluated earlier this year. They were given six weeks to locate 400 queries. The content dictates how easily an image or video can be located. The Eiffel Tower, for example, would be a no-brainer, Crisman indicates. A small hut in the middle of nowhere, on the other hand, would be more challenging. “Sometimes systems can actually pinpoint where an image or video was taken, so they know the person who took it was standing right here and looking in this direction. And other times, all it can really tell is that it’s sort of in a particular area. So if all you can see is vegetation in a query, we might only know that you’re in an area where that type of vegetation grows,” she says.
For the evaluation, actual analysts used the prototype systems to run a few queries. “We found with about 30 to 35 percent of the queries, they could actually locate within 45 minutes exactly where the photographers were standing. We could tell where they were standing and which way they were facing,” Crisman reports. Other photographs or videos contained less information and were more challenging. “But we were able to take the big region they were searching and reduce it down in about 70 or 80 percent of the queries,” she adds.
The teams will be evaluated next February and again in 2016, the final year of the four-year program. Each evaluation will require higher percentages over larger areas. “We’re also scaling the regions that they have to search and the regions they have to model by five times. For February of 2015, they will have to model an area that is 300,000 square kilometers in size, and they will have to search a region that is 50,000 square kilometers,” Crisman reveals.
One of the major remaining challenges is to continue reducing the workload for analysts. “Finder has analysts in the loop, and right now we find that some of the tasks the analysts are doing are a little bit too repetitive. We’re trying to take those tasks out and build tools to reduce that workload,” she states.
While Finder attempts to geolocate images and videos, Aladdin’s goal is to develop software tools that let analysts search large volumes of video much as they search text. Massive numbers of video clips are generated daily on many types of consumer electronics and uploaded to the Internet. In contrast to videos that are produced for broadcast or from planned surveillance, these clips, which anyone with a digital camera can produce, present a significant challenge for manual as well as automated analysis.
The Aladdin Video program seeks to combine the state of the art in video extraction, audio extraction, knowledge representation and search technologies in a revolutionary way to create a fast, accurate, robust and extensible technology that supports the multimedia analytic needs of the future. The goal is for the systems to retrieve, from a massive collection, the videos that contain a given event. The event is specified with a text description and some example videos.
“Aladdin is trying to build tools to allow analysts to work with that data. It has an automatic tagger. It looks at the video in the analyst’s queue and creates a sophisticated search index for the video library—what content it has on it. It’s like a giant card catalog of information about what’s in the videos that are in its library,” Crisman elaborates. “Aladdin processes the analyst’s query and searches the catalog to quickly provide a list of relevant video clips. The video clip list will include text summaries of exactly what’s happening in each video to justify it being a part of what the analyst is looking for.”
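The card catalog analogy can be illustrated with a toy inverted index. This is only a minimal sketch under stated assumptions, not the Aladdin teams' actual design: the clip identifiers, the tags and the functions `build_index` and `search` are all hypothetical, and a real system would derive tags automatically from video and audio content rather than from a hand-written dictionary.

```python
from collections import defaultdict

def build_index(tags_by_clip):
    """Map each tag to the set of clips it was applied to (hypothetical tagger output)."""
    index = defaultdict(set)
    for clip_id, tags in tags_by_clip.items():
        for tag in tags:
            index[tag.lower()].add(clip_id)
    return index

def search(index, query):
    """Rank clips by how many query words match their tags, best match first."""
    scores = defaultdict(int)
    for term in query.lower().split():
        for clip_id in index.get(term, ()):
            scores[clip_id] += 1
    # Sort by descending score, then by clip id so ties are ordered predictably.
    return sorted(scores, key=lambda clip_id: (-scores[clip_id], clip_id))

# Hypothetical library of three tagged clips.
library = {
    "clip1": ["bread", "knife", "kitchen"],
    "clip2": ["crowd", "dancing", "street"],
    "clip3": ["bread", "kitchen", "sandwich"],
}
index = build_index(library)
results = search(index, "sandwich bread kitchen")
print(results)
```

In this sketch the query is answered from the index alone, without rescanning any clip, which is the point of building the catalog ahead of time.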
For evaluation purposes, IARPA chose to use publicly available Internet videos of common events, including making a sandwich, working on a metal craft project, flash mobs and parkour, a physical training regimen inspired by military obstacle courses. “It’s a really fun program because it’s open to anyone in the world to participate,” she says.
The test and evaluation team found public videos with use agreements that allow them to be used for research. “They’ve taken these videos and made a test set, and they’ve actually tagged which videos have the events, and then we give it to anyone in the world who wants to participate. They search for the events we define, and then we rank them and score them on how well they can actually retrieve the videos we were looking for in that test set,” Crisman relates.
The public can participate because the Aladdin program evaluation is hosted by the National Institute of Standards and Technology’s Text Retrieval Conference Video (TRECVID) effort, which supports research in automatic segmentation, indexing and content-based retrieval of digital video. Goals include encouraging research in information retrieval based on large test collections; increasing communication among industry, academia and government by creating an open forum for the exchange of research ideas; speeding the transfer of technology from research labs into commercial products by demonstrating substantial improvements in retrieval methodologies on real-world problems; and increasing the availability of appropriate evaluation techniques for use by industry and academia, including development of new evaluation techniques more applicable to current systems. “We actually do have people around the world who participate in these evaluations,” Crisman states.
For event queries with few positive example videos, the systems currently achieve a mean average precision of about 10 percent. The metric summarizes how well a system ranks all of the positive videos toward the top of its results list. Crisman’s goal is to improve that to 15 percent this year. The numbers, however, are misleading because of the diversity of videos. Videos about making a sandwich, for example, include footage of people making sandwiches with their feet, which is outside the norm. Among the top 10 retrieved videos, the systems average 70 percent precision. Crisman aims to increase the precision through the remainder of the program. “The prototypes that the teams have been developing have been very, very good at searching when you have a lot of video examples. If you have 100 examples of making a sandwich, and you have a nice background collection of events that are not making a sandwich, the teams are doing quite well at finding those videos and creating a nice results list,” Crisman reports.
The prototypes are challenged, however, when fewer videos are available. “That’s where we’re pressing the research. The big challenge I’m pushing is to get the results better when you have fewer examples, which basically means that you have to do better content tagging with English words and do better searches. The two go hand in hand. That’s how you get better performance,” she says.
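The two figures cited above, mean average precision and precision among the top 10 results, measure different things, and a small worked example may help show why they can diverge. The sketch below uses the standard textbook definitions; it is not the official evaluation code, and the clip names and relevance labels are invented for illustration.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k retrieved clips that are truly relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for clip in ranked[:k] if clip in relevant) / k

def average_precision(ranked, relevant):
    """Average of the precision values at each rank where a relevant clip appears.

    Relevant clips that are never retrieved contribute zero, so the sum is
    divided by the total number of relevant clips.
    """
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, clip in enumerate(ranked, start=1):
        if clip in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def mean_average_precision(runs):
    """Mean of average precision over (ranked list, relevant set) pairs, one per event query."""
    return sum(average_precision(ranked, relevant) for ranked, relevant in runs) / len(runs)

# Invented results for two event queries.
query_a = (["a", "b", "c", "d"], {"a", "c"})   # relevant clips at ranks 1 and 3
query_b = (["x", "y"], {"y"})                  # relevant clip at rank 2
map_score = mean_average_precision([query_a, query_b])
print(round(map_score, 4))
print(precision_at_k(query_a[0], query_a[1], 2))
```

Because average precision penalizes every relevant video that is ranked low or missed entirely, a system can place mostly relevant clips in its first 10 results and still post a low mean average precision across a large, diverse collection.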
Aladdin is a five-year program that just started its fourth year. It will wrap up in February 2016. SRI International, Menlo Park, California, has two teams participating in Aladdin. One began the program as Sarnoff Laboratories, Princeton, New Jersey, before being acquired by SRI. The other two teams are led by Raytheon BBN Technologies, Cambridge, Massachusetts, and Carnegie Mellon University, Pittsburgh.
Crisman stresses that IARPA does not violate privacy rights. “All the data that we’re using for the Finder and Aladdin programs is publicly available data. If we’re using Internet videos, we go to great lengths to protect the privacy of the people in those videos,” she concludes.