Researchers working on behalf of the U.S. intelligence agencies can use reams of open source, anonymous data to foretell social turmoil such as disease outbreaks or international political unrest. Once fully developed, the capability to predict coming events may allow U.S. officials to more effectively respond to public health threats; to improve embassy security before an imminent attack; or to more quickly and effectively respond to humanitarian crises.
Both Google and Yahoo have proved they can detect major flu outbreaks based on Web search results, but the intelligence community may be able to out-Google Google. Researchers at the Intelligence Advanced Research Projects Activity (IARPA) intend to push the capability further, predicting events before they hit the news. Three teams—led by Virginia Tech, BBN Technologies and Hughes Research Laboratory—continually compete to see who can best predict events that create social disruptions.
The teams work on IARPA’s Open Source Indicators (OSI) program. “It’s a research program to develop and test methods of anticipating and detecting significant societal events. These include social and political crises, upheavals, as well as disease outbreaks, from publicly available data,” explains Jason Matheny, IARPA’s OSI program manager. “The goal is to find indicators in these publicly available data that give us insight into a society before the news covers it.”
The agency announced the program in August 2011, and the research kicked off in earnest in March 2012. Expected to end in April 2015, the program already is yielding results. The OSI has achieved early targets for accuracy, lead time, precision and recall, meaning researchers can accurately predict a range of events with a low false alarm rate, Matheny reports. “For flu, we’ve increased lead time by more than two weeks compared to the earliest public health reports that are available. With rare diseases, we’ve increased lead time by six days compared to the earliest public health reports. With civil unrest, or political crises, we’ve increased lead time by seven days compared to the earliest news reports. We’ve also made progress in developing methods that make use of a variety of anonymized data in real time very quickly.”
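As a rough illustration of what those measures mean, the sketch below scores a handful of warnings against later reports. It is not the OSI evaluation method: the function name `score_forecasts`, the event keys and all of the dates are invented for the example. Precision is the share of warnings that matched a real event, recall is the share of real events that were warned about, and lead time is the number of days between a warning and the earliest report.

```python
from datetime import date

def score_forecasts(forecasts, reported):
    """Score warnings against ground truth (hypothetical helper).

    forecasts: dict mapping an event key to the date the warning was issued.
    reported:  dict mapping an event key to the date of the earliest report.
    Returns (precision, recall, mean lead time in days).
    """
    hits = [key for key in forecasts if key in reported]
    precision = len(hits) / len(forecasts) if forecasts else 0.0
    recall = len(hits) / len(reported) if reported else 0.0
    lead_days = [(reported[key] - forecasts[key]).days for key in hits]
    mean_lead = sum(lead_days) / len(lead_days) if lead_days else 0.0
    return precision, recall, mean_lead

# Made-up data: three warnings, three events that actually occurred.
forecasts = {
    "protest-A": date(2013, 6, 3),
    "flu-B": date(2013, 6, 1),
    "protest-C": date(2013, 6, 10),  # false alarm: never reported
}
reported = {
    "protest-A": date(2013, 6, 10),
    "flu-B": date(2013, 6, 15),
    "unrest-D": date(2013, 6, 20),   # missed: no warning issued
}

precision, recall, mean_lead = score_forecasts(forecasts, reported)
print(precision, recall, mean_lead)
```

In this toy case two of three warnings were correct and two of three events were caught, with an average lead of 10.5 days. A real evaluation would also need rules for deciding when a warning and a report describe the same event.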
Matheny cites some examples of predictions that proved accurate. “The disturbances in Mexico involving large numbers of protests across the country last year, as well as Brazil, and election outcomes in Venezuela, are recent events where our teams made accurate forecasts,” he indicates.
Matheny says the program offers several benefits. “One is to provide the United States with early warning of societal crises overseas for support of humanitarian missions. For support of embassy security—for example, when embassy personnel should be on heightened alert due to a local crisis, as well as to provide travel alerts to U.S. citizens abroad,” he states. “A second range of benefits has to do with public health, providing early warning of disease outbreaks in order to mobilize the public health response overseas and to understand what might be coming to the United States, since most flu outbreaks are global in nature and pandemics don’t recognize political boundaries.”
One general benefit from OSI is to gain a better understanding of the dynamics of large-scale societal crises. “Can we better recognize the kinds of patterns that we see preceding major societal upheavals? That’s something that hasn’t been studied using the kinds of data that OSI is looking at: bringing—I guess you would call it—computational social science to the problem, trying to understand the kinds of patterns that precede the sorts of events that we care about,” Matheny says.
The teams use different approaches, but machine learning is a key element. “It’s really a research approach, or a technical approach, in which you train a system to recognize certain patterns in data. It doesn’t require that [the system] actually be artificially intelligent. It only requires that something be able to recognize patterns,” Matheny explains.
Machine learning is used in a range of different applications, from speech recognition to anti-spam software on personal computers. “We’re using, within the OSI program, machine learning to recognize the patterns in open source data that precede significant events. Those data are so large and varied that it’s impossible for a person to observe the patterns. It really does require a machine. It’s on the order of 100 terabytes a day of data,” he reveals.
The OSI program is focused solely on Latin America. “For research purposes, we look at 23 countries in Latin America, not including the Caribbean. Latin America is an ideal place to develop and test methods with the ultimate goal of being able to test those methods in other regions as well,” Matheny reveals.
He adds that the area is perfect for a number of reasons. “There’s a lot of open source data. The continent was an early adopter of social media, of the Web, and as a result has a lot of historical data that we can use to discover the patterns of activities that have preceded significant events,” he reports. Furthermore, he says, the continent has fairly reliable reporting. “In order to find out if we’re accurate, we need to have reliable ground truth data. That is, we need to have reliable reporting of the sorts of crises and disease outbreaks that we’re trying to detect or predict. Latin America has relatively good ground truth reporting. There are lots of newspapers to choose from. There are lots of official sources, and even in cases where you find conflicting reports, you can eventually get to the bottom of whether or not an event happened. That’s not true in all regions.”
Still, reading the open source intelligence tea leaves has its challenges, including the sheer volume of data. “Open source data is big. It moves quickly,” Matheny says. “For instance, tweets come in at about 10,000 per second. So, if you’re looking for signals like increased keyword references to disease symptoms or political sentiments, you have to be able to process data very quickly.”
Additionally, language varies substantially. “It can take a lot of work to develop systems that recognize that two or more words mean the same thing, so you have to develop a really flexible approach to recognizing these patterns in data,” he adds.
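A minimal sketch of the two ideas above, counting keyword references and treating different words as the same concept, might look like the following. The article does not describe the teams' actual systems, so everything here is an assumption made for illustration: the `CONCEPTS` table, the function names and the spike threshold are all hypothetical.

```python
from collections import Counter

# Hypothetical table mapping surface words to one canonical concept.
CONCEPTS = {
    "gripe": "flu",
    "influenza": "flu",
    "flu": "flu",
    "fiebre": "fever",
    "fever": "fever",
}

def count_concepts(messages):
    """Count how many messages mention each concept (once per message)."""
    counts = Counter()
    for text in messages:
        words = (token.strip(".,!?") for token in text.lower().split())
        counts.update({CONCEPTS[word] for word in words if word in CONCEPTS})
    return counts

def is_spike(today, baseline, factor=3.0):
    """Flag a count that exceeds `factor` times the baseline average."""
    if not baseline:
        return False
    return today > factor * (sum(baseline) / len(baseline))

messages = [
    "Tengo gripe y fiebre.",
    "Influenza season again",
    "Nice weather today",
    "flu flu flu!",
]
counts = count_concepts(messages)
print(counts["flu"], counts["fever"])
print(is_spike(30, [5, 6, 4]), is_spike(6, [5, 6, 4]))
```

A lookup table like this only covers words someone thought to list, which is one reason a more flexible, learned approach is attractive at the data volumes Matheny describes.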
Furthermore, even in Latin America, truth can be difficult to determine. “For test and evaluation, that means there’s some amount of human oversight to determine whether conflicting news stories are about the same event or about different events,” he says.
Google and Yahoo used only Web search queries to detect flu outbreaks in the early stages, but IARPA has improved the timeliness and accuracy, in part by using a wider array of data. Researchers can use data such as Web search queries, blogs, microblogs, Internet traffic or net flow, financial market data and Wikipedia edits to understand what’s happening within a society: whether people are getting sick in large numbers in the case of a flu outbreak; whether people are enraged at their government; or whether there is a humanitarian crisis. Matheny adds that negatively charged Wikipedia edits about a particular candidate or party can indicate political unrest, as can a significant increase in Web queries for an opposition party.
Dinner reservations also can be revealing. “In one case, a researcher came up with the idea to look at cancellations of restaurant reservations as an early indicator of a disease outbreak, the thought being that one of the first things people do when they get sick is to cancel their dinner dates. It turns out that’s right,” he relates. “You can detect flu outbreaks much faster and more accurately by not restricting yourself to looking at one stream of data like Web search queries, but instead by looking at multiple types of data.”
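One simple way to picture the use of multiple data streams is to express each stream's latest value as a deviation from its own recent history and then average those deviations. The sketch below does that with z-scores. It is an assumption offered for illustration, not the researchers' method, and the stream names and numbers are invented.

```python
from statistics import mean, pstdev

def zscore(history, latest):
    """How many standard deviations `latest` sits above the history mean."""
    sigma = pstdev(history)
    if sigma == 0:
        return 0.0
    return (latest - mean(history)) / sigma

def combined_signal(streams):
    """Average the per-stream z-scores.

    streams: dict mapping a stream name to (history, latest value).
    Returns (combined score, dict of per-stream scores).
    """
    scores = {name: zscore(history, latest)
              for name, (history, latest) in streams.items()}
    return mean(list(scores.values())), scores

# Made-up daily counts for two hypothetical streams.
streams = {
    "symptom_searches": ([100, 110, 90, 100], 130),
    "reservation_cancellations": ([20, 22, 18, 20], 26),
}
combined, scores = combined_signal(streams)
print(round(combined, 2), scores)
```

Putting every stream on the same scale lets very different kinds of data be combined, and a rise that appears in several streams at once is less likely to be noise in any single one.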
Social media also can be helpful. For example, large numbers of people may post Facebook statuses indicating that they feel lousy and are staying home from work. Others might tweet that they are staying home with sick children, or they might conduct Web searches for disease symptoms or local medical clinics. Matheny stresses, however, that all of the data used is anonymous. “The systems themselves don’t keep the identifiers within data. We don’t need it within the purposes of this program, which is to look at large-scale societal events. We’re not looking at individuals; we’re looking at what’s happening at the society scale,” he emphasizes.