Robot See, Robot React
Scientists seek to provide computers with greater understanding of dynamic visual data.
Advances in visual processing may soon allow robot vehicles to travel autonomously across battlefields and city streets. Researchers are developing mathematical models that offer insight into how mechanical and biological systems interpret images for movement and navigation. The answers will provide a key to designing more sophisticated automated guidance systems for commercial and military use.
Enabling machines to navigate accurately using visual cues has been a major hurdle in robotics research. Although improvements in processing technology allow vehicles and aircraft to maneuver without human intervention, these systems continue to encounter difficulty in dynamic environments such as cities. While machines perform well in structured settings, the ability to perceive and move around in a totally unfamiliar setting would be a great step forward for robotics, experts say.
One of the institutions investigating the processes behind visual interpretation is the University of California at Los Angeles (UCLA). Researchers are studying dynamic vision, which is a computer’s ability to use visual data about its surroundings and apply it in a changing environment to perform assigned tasks. According to Stefano Soatto, associate professor at UCLA’s computer science department and head of the vision laboratory at the university’s engineering school, potential applications for dynamic vision include basic robotic tasks, remote or autonomous movement and manipulation, tracking, docking, remote surgery, visual navigation and driving. But performing tasks like these is not simple. “It’s not just a matter of connecting a camera to a computer. You need to interpret it. An image to a computer is just a bunch of zeros and ones. So how do you extract from these zeros and ones information about the shape of an object, its texture, position, orientation, motion and so forth?” he offers.
The crux of dynamic vision research is understanding how machines interpret the data they receive. For example, if a photograph is taken of an office that contains a desk, computer, books and a telephone, the human brain can immediately and effortlessly assess this information. But, if this image is scanned into a computer, it will be reduced to a series of numbers between zero and 255.
If the lighting in the room changes slightly and another picture is taken, a human will have no difficulty identifying the scene as exactly the same as the original image. However, when this data is entered into a computer, all of the numbers change. The same thing happens if the camera is moved slightly.
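To make the point concrete, the short sketch below shows what a photograph looks like to a computer and what a small change in lighting does to it. The synthetic "office scene" and the 10 percent dimming are invented for illustration and are not drawn from Soatto's work.

```python
# A computer sees an image only as an array of brightness values between
# 0 and 255; even a slight change in lighting alters nearly every number.
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a scanned photograph: an 8-bit grayscale image,
# i.e. a grid of integers from 0 (black) to 255 (white).
office_scene = rng.integers(0, 256, size=(480, 640), dtype=np.uint8)

# Simulate a second photo taken after the room lights dim slightly:
# scale every pixel's brightness down by 10 percent.
dimmer_scene = (office_scene.astype(np.float64) * 0.9).astype(np.uint8)

# A person would call these the same scene; the raw numbers disagree
# almost everywhere.
changed = np.mean(office_scene != dimmer_scene)
print(f"Fraction of pixel values that changed: {changed:.2%}")
```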
Much of how humans and machines interpret this information depends on a number of unknown factors, Soatto observes. “Your image depends on the shape of objects, on their material properties, light distribution, the position of the camera and its calibration—which is all unknown. All of these things influence your measurements in the sense that if you change any of them, your images—the values between zero and 255 recorded in the computer—are completely different,” he says.
The key ideas behind dynamic vision date back to the 1940s, but Soatto believes the first person to demonstrate working systems based on the theory was Ernst Dickmanns, a professor at the Universitaet der Bundeswehr in Munich, Germany. Between the late 1970s and 1994, Dickmanns developed passenger cars that drove autonomously on European highways at speeds of up to 111 miles per hour, reading speed signs and passing slower vehicles.
Dickmanns used a structured environment to make his system work. Highways are not arbitrary objects, but rather slabs of concrete with stripes painted on them and populated with cars, trucks and motorcycles. Soatto describes this as a very simple environment because many of its properties are known. “If you took Dickmanns’ car and put it on the streets of Los Angeles, it wouldn’t go anywhere because the complexity of the environment is infinitely superior. On the streets you have pedestrians, buildings, dogs—all sorts of things. It is very difficult to embed these [variables],” he maintains.
Dickmanns used this model and attempted to enrich it by augmenting the system. Soatto chose to approach the issue from another angle. “Suppose you know nothing about the environment. If you put me in a building I know nothing about, I can still get around. So I came in from that end, the so-called unstructured environment, and one of the questions that I posed was, ‘What exactly can I do in an environment I know nothing about?’” Soatto explains.
Called structure from motion, this theory asks whether the three-dimensional properties of a scene can be reconstructed using only images of it. To solve this problem, which Soatto describes as one of the central issues of dynamic vision, UCLA researchers start from measurements of light, which reach a computer as images, or arrays of positive numbers. This data is converted into estimates of particular properties of objects, such as shape and motion, within a framework built on stochastic process estimation, statistical modeling and optimization. The result can be posed in mathematical terms as a very complex optimization problem. “We use techniques from stochastic optimization and systems theory to address this problem by devising algorithms to do the estimation and analysis—for instance, to prove that a solution to this problem exists and that it is unique,” he says.
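The sketch below illustrates one small, classical piece of that machinery: linear triangulation, in which a point seen in two images is recovered in three dimensions by solving a small least-squares problem. The camera setup and coordinates are invented for illustration and do not represent the UCLA laboratory's actual algorithms, which must also estimate the unknown camera motion.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one 3-D point from two views.

    P1, P2 : 3x4 camera projection matrices.
    x1, x2 : (u, v) image coordinates of the same point in each view.
    Returns the estimated 3-D point in world coordinates.
    """
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The homogeneous solution is the right singular vector associated
    # with the smallest singular value of A.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]

# Invented example: identity intrinsics, a second camera translated one
# unit to the right, and a true point at (0.2, -0.1, 4.0).
K = np.eye(3)
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])

X_true = np.array([0.2, -0.1, 4.0, 1.0])
x1 = P1 @ X_true; x1 = x1[:2] / x1[2]
x2 = P2 @ X_true; x2 = x2[:2] / x2[2]

print(triangulate(P1, P2, x1, x2))  # approximately [0.2, -0.1, 4.0]
```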
The difficulty in designing dynamic vision systems lies in determining the best way to collect this highly variable data and extract the properties of a specific scene. “How can I recognize friends’ faces regardless of whether they are in bright sunlight, shade, wearing wigs, smiling, in different poses or orientations? I still manage to recognize them,” he notes.
But extracting this data and coalescing it into an algorithm is extremely difficult. No efficient system for handling variations in illumination exists today, Soatto maintains. He notes that facial recognition systems require that images be captured under very rigid parameters of lighting, distance and angle. If this framework is disrupted or ignored, the system does not work properly. The goal is to find an invariant property in an object and recognize it despite all variables.
One approach is to store images of an object under all possible conditions of lighting and angles. But this is not practical because the range of variables is almost infinite, he observes. Scientists know how to address certain types of invariances that are geometric in nature and relate them to an object’s position and orientation. However, how to handle variations in illumination levels and qualities remains unknown, forcing researchers to make assumptions in their calculations.
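As a toy illustration of what an invariant looks like, the sketch below strips overall brightness and contrast from two synthetic images of the same pattern before comparing them. This handles only the simplest kind of lighting change, a global gain and offset; it says nothing about shadows, highlights or a moving light source, which is precisely the part Soatto describes as unsolved. The images and numbers are invented.

```python
import numpy as np

def normalize(img):
    """Remove overall brightness and contrast: zero mean, unit variance.

    This discards one very simple class of lighting variation (a global
    gain and offset) and nothing more.
    """
    img = img.astype(np.float64)
    return (img - img.mean()) / img.std()

rng = np.random.default_rng(1)
face = rng.integers(0, 181, size=(64, 64)).astype(np.float64)

# The same "face" under brighter, higher-contrast lighting: every raw
# pixel value differs from the original.
brighter = 1.3 * face + 20.0

# After normalization the two images match, so a simple correlation
# score treats them as the same pattern.
score = np.mean(normalize(face) * normalize(brighter))
print(f"normalized correlation: {score:.3f}")  # prints 1.000
```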
Although dynamic vision seeks to understand the mathematical key to extract data from images, Soatto believes the solution must be independent of any specific algorithm or system used to solve it. “Whether it’s the eye and the brain or a camera and computer, it doesn’t matter because the mathematical properties of the problem are the same,” Soatto emphasizes.
The goal of UCLA’s laboratory research is to discover or analyze properties of the problem of dynamic vision that are device-independent. Once these properties are discovered, an algorithm can be implemented for a specific device, but “if we have gotten the math right, then any device should work,” he points out.
Though other researchers are studying the use of different bands of light, lasers and even acoustic imaging to guide autonomous vehicles and robots, the basic problem remains the same. “No matter how the image is acquired, once it gets into a computer, it’s a bunch of zeros and ones. So in a sense, the difficulty is not acquiring an image, it is interpreting it,” Soatto maintains.
The U.S. military has been a major contributor to research for autonomous aircraft. Although dynamic vision systems have difficulty in complex environments, such as city streets or a battlefield, they operate reasonably well in the air. Soatto notes that this is a matter of perspective. Although the physical space is the same, when viewed from a high altitude, environments such as city streets become relatively static and structured, he explains.
Soatto believes that a number of dynamic vision systems will be available in the near future, noting that Dickmanns’ system is already in commercial use on Freightliner trucks. The technology generates a warning if it detects any inconsistencies in highway driving patterns. For example, if a driver were to fall asleep behind the wheel, the system would keep the vehicle in its lane and sound an alert to wake the driver. Vision-based robots are working on factory floors, and the UCLA vision laboratory also is involved in research to provide U.S. Army helicopters with autonomous guidance and landing systems. “These systems are a reality—you don’t need to wait five years,” he says.
Additional information on dynamic vision is available on the World Wide Web at www.engineer.ucla.edu/stories/vision.htm.