Seeing Eye Systems Learn to Discern

May 2011
By Maryann Lawlor, SIGNAL Magazine


The Mind’s Eye program takes intelligence gathering to the next level. The system would be placed on an unmanned ground vehicle or on a pole, would recognize certain activity, then determine if the actions are suspicious. Fifteen research teams are tasked with creating visual intelligence that can interpret 48 verbs.

Transformative capability will enable surveillance ground platforms to recognize suspicious activity.

Persistent surveillance has been pegged as a crucial capability in current and future operations. Mind’s Eye is one of the Defense Advanced Research Projects Agency’s (DARPA’s) newest and most aggressive efforts to improve conditions for warfighters on the ground. The agency is working with the U.S. Army, industry and academia to create a way to educate video collection devices. Although existing cameras and sensors capture activity in an area, the mounds of visual data they collect are overwhelming to analysts and warfighters alike. Once visual intelligence is achieved, these information mountains will become actionable knowledge molehills that can be sent to commanders and perhaps directly to warfighters’ handheld computers in the field.

“We all know how rapidly our capacity to capture and stream video has grown, and we are frankly only at the beginning of this video and data explosion,” Lt. Col. James Donlon, USA, Mind’s Eye program manager, DARPA, Arlington, Virginia, explains. “A particular operational capability we are oriented on is to employ these new visual intelligence algorithms on manportable UGVs [unmanned ground vehicles]. What is important to recognize here is that the intent is not to take the human out of the loop, but rather to give warfighters better capabilities on the platforms of the future to make operations more effective.”

The Broad Agency Announcement (BAA) distinguishes machine-based visual intelligence by its focus on recognizing and reasoning about actions as well as objects in video. The machines must be able to not only visualize and manipulate scenes but also use these concepts in an imaginative process. If the capability can mimic the human mind exactly, platforms equipped with it will be able to do much more than sound an alarm when they see a familiar pattern. They will anticipate what might happen next, imagine alternative futures, fill in gaps in their knowledge or perceptual experience, and notice when an activity does not fit the norm, Col. Donlon relates.

The Mind’s Eye program focuses on creating the ability to discern 48 verbs. Among these action words are common terms such as approach, give, follow, dig and bury. The systems must be able to distinguish among the different meanings of the verbs. For example, a person can “give” a speech or “give” someone a package.

Once this capability is achieved, the “smart camera” could be mounted on a pole or in a UGV to assess situations in the same way that human scouts do today; this takes humans out of harm’s way. For example, when a group of men are observed exiting a truck, digging a hole, placing something in it and driving away, a platform with a visual intelligence system would recognize this as a potential danger and alert the proper authorities. To reach this conclusion, the system must be able to recognize each of the actions as they relate to one another.
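The idea of relating recognized actions to one another can be illustrated with a minimal sketch. It assumes a recognizer has already turned the video into an ordered list of verb labels; the labels, the pattern and the function names below are hypothetical and are not taken from the Mind’s Eye program.

```python
from typing import Iterable, Sequence

# Hypothetical verb labels mirroring the article's example:
# exit a truck, dig a hole, place something in it, drive away.
SUSPICIOUS_PATTERN = ("exit", "dig", "place", "leave")


def matches_in_order(observed: Iterable[str], pattern: Sequence[str]) -> bool:
    """Return True if every verb in `pattern` appears in `observed`
    in the same order, with any number of unrelated verbs in between."""
    remaining = iter(observed)
    # Each `any` call consumes the shared iterator up to the next match,
    # so later pattern verbs can only match later observations.
    return all(any(seen == verb for seen in remaining) for verb in pattern)


def should_alert(observed: Iterable[str]) -> bool:
    """Flag a verb sequence that contains the suspicious pattern."""
    return matches_in_order(observed, SUSPICIOUS_PATTERN)
```

With these assumed labels, `should_alert(["approach", "exit", "walk", "dig", "place", "leave"])` returns `True`, while the same verbs in a different order, such as `["dig", "exit", "place", "leave"]`, return `False`. A fielded system would have to reason about who performs each action and where, which this sketch ignores.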

Accomplishing this level of visual intelligence involves the ability to perform four tasks. In a given video, the integrated visual intelligence systems must recognize an action and judge whether one or more verbs are present according to human semantic categories. The systems will be expected to produce one or more sentences that describe a short video, suitable for the human-machine communication needed to task systems and report to humans. They must be able to resolve spatiotemporal gaps in video, suggesting what might have come before an action or predicting what will happen next. Finally, these systems will be expected to learn what is normal in longer-duration videos and detect anomalous events over the course of that input.
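The last of these tasks, learning what is normal and flagging what is not, can be sketched in a few lines. This is only an illustration under strong assumptions: it treats "normal" as the share of each verb in a log of previously recognized actions and flags verbs that fall below a threshold. The function names and the `min_share` parameter are hypothetical and say nothing about how the research teams approach the problem.

```python
from collections import Counter
from typing import Dict, Iterable, List


def learn_normal(verb_log: Iterable[str]) -> Dict[str, float]:
    """Estimate how often each verb occurs in a long stretch of routine video."""
    counts = Counter(verb_log)
    total = sum(counts.values())
    return {verb: n / total for verb, n in counts.items()}


def find_anomalies(events: Iterable[str],
                   normal: Dict[str, float],
                   min_share: float = 0.05) -> List[str]:
    """Return the observed verbs that were rare or unseen during learning."""
    # Verbs never seen during learning have a share of 0.0 and are always flagged.
    return [verb for verb in events if normal.get(verb, 0.0) < min_share]
```

For example, after learning from a log of 18 "walk" and 2 "carry" events, `find_anomalies(["walk", "dig", "carry"], normal)` returns `["dig"]`, because digging never appeared in the routine footage.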

The program began in September 2010, and 15 research teams called principal investigators (PIs) were awarded funding in January 2011. Twelve of the research teams include government, commercial and academic organizations that are developing fundamental machine-based visual intelligence for a camera. They will create a software subsystem integrating current state-of-the-art computer vision and artificial intelligence and add their own knowledge in the areas of visual event learning, spatiotemporal representations, machine-generated envisioning and visual inspection and concepts. Three companies—General Dynamics Robotic Systems, iRobot and Toyon Research Corporation—are collaborating to design the architectures to incorporate this visual intelligence software into a camera to be placed in a manportable UGV.

DARPA’s program has tasked these teams with technical challenges that over the decades the machine vision and artificial intelligence communities have avoided as too difficult, the colonel says. “To be fair, up until now, there has been a rational approach for these communities to take while a foundational body of work was pursued in machine vision and artificial intelligence. As a result, though, it is also fair to say that each field has stayed away from the most challenging aspects of visual intelligence, each assuming that the other discipline would provide a transformative capability that could be plugged into their own contributions. The result has been specialized research on increasingly narrow problems that does not provide much benefit to solving real-world problems. DARPA is about enabling the very best cutting-edge research in response to great technical challenges and channeling that inventive energy to solve real-world problems for the warfighter,” he states.

The 48 verbs were chosen after careful study of the most commonly used verbs in the English language related to action in the physical world, but the list is by no means all-inclusive, Col. Donlon admits. For visual intelligence researchers, the size of this list is daunting, but that is one reason a large number, rather than five or 10, was chosen, he adds.


The University of Pennsylvania’s Pelican flying robot can navigate itself over, under and through obstacles. Once developed, the Mind’s Eye system could be mounted on robotic platforms such as the Pelican to discern activities it observes.

“There are a variety of strategies that can be used in machine vision to classify small portions of these activities in particular inputs already,” he explains. “So in the end, this is a deliberately difficult challenge in terms of the breadth of concepts the machines are expected to address in the range of abilities that make up visual intelligence. This discourages incremental improvement under previously limited conditions—such as a small number of actions, specific inputs and singular performance goals. These challenging conditions will make necessary the novel and multidisciplinary approach to visual intelligence that is the focus of the program.”

Col. Donlon relates that at the first PI meeting in January, he was impressed with how far the 15 organizations’ research has come in the building-block areas for visual intelligence. “To my pleasant surprise, we’re seeing excellent initial progress already from these teams in processing a vast amount of training video that we’ve given them. They’re able to demonstrate that some of the important premises of the program of what machine vision and artificial intelligence can bring to the table already are in fact valid.

“These teams are responding very aggressively to this new challenge of addressing a much broader range of abilities and a broader range of activities. They’re demonstrating some excellent initial progress in recognizing the 48 verbs in the BAA. So I’m expecting significant progress by the end of this first year [of the program],” the colonel states. DARPA and the teams will conduct some experiments this summer, and by August he anticipates seeing advances in the range of visual intelligence abilities, he adds.

To this end, the U.S. Army has been a critical partner in the research from the beginning. The Mind’s Eye team is closely partnered with the U.S. Army Research Laboratory (ARL), Adelphi, Maryland, which has an active program in advancing the technical state of the art in robotics. Researchers there are adept at examining the very best current Army capabilities and accelerating research for more advanced applications, he explains.

The ARL has joined DARPA as an evaluation and transition partner in Mind’s Eye. “This degree of active cooperation is exceptional for fundamental research such as that found in Mind’s Eye, and it’s already paying off,” the colonel says. “They have been highly engaged from the very beginning with DARPA and all research teams, ensuring that the operational need inspires the very best ideas in visual intelligence. Even in this first year of the program, ARL is providing purpose-built payload boxes and interface control documents as motivation for very tangible technology transition potential,” he adds.

DARPA is strengthening the connection between research and operational need by facilitating direct interaction between scientists and Army UGVs as well as with active-duty scouts from the 82nd Airborne Division. Col. Donlon says this has been tremendously valuable in keeping all contributors focused on the ambitious humanlike abilities needed to reduce soldiers’ risk and workload. “When it comes to keeping a diverse performer community focused on a common goal, there isn’t a substitute for directly manipulating the Army’s best UGVs and learning about their capabilities and limitations or for hearing first-hand from scouts about the real-world challenges of reconnaissance and surveillance,” he states.

The colonel acknowledges that a robust body of work in machine vision exists. Much of this work is based on classifying patterns from the low-level features in video. The methodologies have worked fairly well for detecting and tracking objects—the nouns, such as people—and their attributes—the adjectives—and it will be necessary to use the best of this work to accomplish visual intelligence. However, the essential approaches taken in the best of these algorithms are not sufficient for recognizing the activity—the verbs—in video in a robust and generally applicable way.

Other activities also have helped the progress toward achieving intelligent vision. These events included the Performance Evaluation of Tracking and Surveillance International challenge, hosted in the United Kingdom, and the Video Surveillance Event Detection event, sponsored by the National Institute of Standards and Technology. Both activities demonstrated that machine vision detection of events in video has improved incrementally.

“A real strength in this work is that it is motivated by highly realistic surveillance video,” Col. Donlon explains. “Until now, when tackling tough problems such as these, it has been prudent to look at the videos and determine what patterns of value might be detectable, such as gatherings, thefts and bag abandonment. The community that responds to these tasks naturally builds as best [it] can on what [it is] already doing to demonstrate performance at detecting the patterns of interest for the purpose of responding to the challenge.

“I’ve taken a slightly different angle with Mind’s Eye. I have laid down a daunting technical challenge and have produced a large corpus of video to develop and test progress on that research. We are beginning with video that, while realistic, controls for certain sources of complexity at first, allowing us to relax those controls as our algorithms improve. So instead of incentivizing better performance when in a specific domain—say a particular airport surveillance video used for a single challenge—I am incentivizing broad performance in the whole of visual intelligence on somewhat simpler video, even if that performance is at first low. But I am demanding that the techniques developed will, from the very beginning, adapt to surprising new videos—from different settings or even novel variations of the actions to be recognized,” Col. Donlon says.

Currently, the Mind’s Eye team is not concerned primarily with comparing machine performance to an objective gold standard of perfectly correct performance. It is not even initially comparing performance of machines to each other. It is focused instead on comparing performance to average human performance on the same tasks given the same inputs. The program is not specifically about producing a smart camera; the smart camera in the BAA provides a vision and a source of technical constraints for the conditions under which visual intelligence, as Col. Donlon has defined it, would be of operational value to the warfighter.
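One way to picture a comparison against human performance rather than a gold standard is sketched below. The article does not describe the program’s actual scoring method, so this is purely an assumption: several human annotators say whether a verb is present in each clip, the majority view stands in for "average human performance," and the machine is scored by how often it agrees. All names are hypothetical.

```python
from typing import Sequence


def human_majority(votes: Sequence[bool]) -> bool:
    """True if more than half of the annotators judged the verb to be present."""
    return sum(votes) * 2 > len(votes)


def agreement_with_humans(machine_answers: Sequence[bool],
                          human_votes: Sequence[Sequence[bool]]) -> float:
    """Fraction of clips where the machine's yes/no answer matches the human majority."""
    matches = sum(
        answer == human_majority(votes)
        for answer, votes in zip(machine_answers, human_votes)
    )
    return matches / len(machine_answers)
```

In this sketch, a machine that answers `[True, False, True]` on three clips, where the annotators vote `[True, True, False]`, `[False, False, True]` and `[False, False, False]`, agrees with the human majority on two of the three clips.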

“In the future, I would very much like to demonstrate visual intelligence on such a smart camera, which is why I have a small number of system integrators looking at those considerations now. There is much in the way of good work already underway in industry and government studies to pair computing platforms with cameras, and we will of course take advantage of the state of the art when the time comes to demonstrate the operational value of visual intelligence,” he states.


