The increasing presence of news sources on the Internet offers an unprecedented opportunity to access open-source intelligence for a variety of purposes. Researchers from several U.S. universities have collaborated to take advantage of these resources, creating a big data collection and distribution process applicable to disciplines ranging from social research to national security.
Developers of the Global Database of Events, Language and Tone (GDELT) created the worldwide, multidecade, georeferenced, daily event dataset and made it available for open research after one of the thought leaders involved needed information for projects. Kalev Leetaru, a Yahoo! fellow in residence at Georgetown University, required the massive amounts of data for his social sciences work. He and his partners collaborated to construct a catalog of human societal-scale behaviors and beliefs across every country down to the city level, and in cases with enough information, even down to major government buildings. This localization is one factor that sets the project apart from other similar efforts.
Specifying information to such a pinpointed location gives researchers much more context. Leetaru explains that as the system pulls in data about people, origins and locations and then connects them, users see key narratives surface through the meshing of latent and physical dimensions. GDELT compiles 30,000 to 100,000 new events every 24 hours, pulling from news sources across the globe. As more outlets become available online, the system will increase in robustness. The dataset stretches back to 1979, providing users with a decades-long frame of reference. Developers are working now on extending the data to include sources all the way back to the 1800s.
GDELT also goes through the text of articles to improve identifications. For example, if a riot takes place in a remote part of Africa, the dateline often will reflect the particular country’s capital, not the town where the event actually took place. By scanning entire articles, the tool discovers exactly where the problem occurred. The software also learns to differentiate the Hilton in Paris from Paris Hilton or Cairo, Illinois, from Cairo, Egypt.
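As a rough illustration of the kind of disambiguation described above, the following minimal sketch picks between candidate locations for an ambiguous place name by counting context words found in the full article text. The gazetteer entries, the clue words and the function name are hypothetical, invented for this example; they are not taken from GDELT's actual software, which the article does not describe in detail.

```python
import re

# Hypothetical mini-gazetteer: each ambiguous name maps to candidate places,
# each with a few context clue words. Invented for illustration only.
GAZETTEER = {
    "cairo": [
        {"place": "Cairo, Egypt", "clues": {"egypt", "egyptian", "nile"}},
        {"place": "Cairo, Illinois", "clues": {"illinois", "ohio", "mississippi"}},
    ],
}


def disambiguate(name, article_text):
    """Return the candidate place whose clue words appear most often in the text.

    Returns None when the name is unknown or when no clue word appears at all.
    """
    candidates = GAZETTEER.get(name.lower())
    if not candidates:
        return None
    words = re.findall(r"[a-z]+", article_text.lower())
    best_place, best_score = None, 0
    for candidate in candidates:
        # Count every occurrence of a clue word anywhere in the article.
        score = sum(1 for word in words if word in candidate["clues"])
        if score > best_score:
            best_place, best_score = candidate["place"], score
    return best_place
```

Scanning the whole article rather than only the dateline is what lets a mention of the Nile or of Illinois tip the balance; a production system presumably would rely on far richer evidence than a handful of keywords.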
The program pulls from sources ranging from major international ones to hyperlocal ones. It has 58 fields that capture all the available details about events and actors as well as 10 fields for details about each actor such as role, type and ethnic and religious affiliations. The software scans articles and extracts people, organizations and core themes the developers designed it to monitor. Leetaru explains that, basically, GDELT tacks on a list of events to an event database and tacks that list onto a graph.
Including ambiguous events from combat zones makes the system relevant for military or national security personnel. The website uses the example “unidentified gunmen stormed the mosque and killed 20 civilians” to show the type of data the system could pull. In fact, its reach into the media of certain areas could add a level of information to military operations from areas otherwise hard to access. “There really are not other datasets of this magnitude available,” Leetaru says.
GDELT offers knowledge of human behavior and decisions gathered from across time and locations to help identify patterns and shifts. Information can be accessed in various ways and formats. Each day, GDELT ranks all of the events from its Daily Update by the amount of media coverage each receives, and places them on a Google Earth Keyhole Markup Language map. Website visitors also can download different versions of event data or view the Global Knowledge Graph that helps to make connections between all the pieces of data.
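The ranking-and-mapping step described above can be sketched in a few lines. This is only a sketch under stated assumptions: the event dictionary keys ("name", "lat", "lon", "articles") and the function name are hypothetical, and the article does not say how GDELT actually builds its map files. The output follows the standard Keyhole Markup Language layout of placemarks with point coordinates.

```python
from xml.sax.saxutils import escape


def events_to_kml(events, top_n=10):
    """Rank events by article count and render the top ones as KML placemarks.

    Each event is a dict with hypothetical keys: "name", "lat", "lon", "articles".
    """
    ranked = sorted(events, key=lambda e: e["articles"], reverse=True)[:top_n]
    placemarks = []
    for event in ranked:
        # KML writes coordinates as longitude,latitude,altitude.
        placemarks.append(
            "  <Placemark>\n"
            f"    <name>{escape(event['name'])}</name>\n"
            f"    <description>{event['articles']} articles</description>\n"
            f"    <Point><coordinates>{event['lon']},{event['lat']},0</coordinates></Point>\n"
            "  </Placemark>"
        )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<kml xmlns="http://www.opengis.net/kml/2.2">\n'
        "<Document>\n" + "\n".join(placemarks) + "\n</Document>\n</kml>\n"
    )
```

Saving the returned string to a file with a .kml extension should let Google Earth display the most heavily covered events first in its places list.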
In addition to Leetaru, the other main forces behind the project are Philip A. Schrodt, now a senior research scientist at Parus Analytical Systems LLC, but formerly of Penn State University; Patrick Brandt of the University of Texas at Dallas; and John Beieler of Penn State University. Schrodt explains that the idea of gathering event data goes back years, but with far more limited resources. Approximately 20 years ago, the development of some automated coders expanded the number of sources; however, the numbers were still low. “It was a pretty slow process,” he explains. With the expansion of online news, generally available for free, and geolocation tools, processing stories in real time became a reality. Aggregators such as Google News or BBC Monitoring offer copious numbers of sources.
GDELT makes sense out of the hundreds of thousands of stories on certain topics. “Humans can’t read that much text and make sense of it,” Leetaru says. “GDELT is using a machine to re-compile that material.” It then produces a list of core happenings and lesser events. It even adds context such as the cause of a riot. “It really renders the news into a computable, quantitative format,” he explains.
The developers first released the tool in March 2013 with historical data; daily updates commenced in June. Schrodt states, “This has taken off faster than any social science dataset I’ve ever seen.” And when items go viral in such a manner, outside people come up with unexpected applications, he adds. “There’s all this big data in the political realm, but not all that many big datasets out there. People can see this immediately and run with it.”
Though big data is a hot commodity currently, few resources can capture human society. Developers are seeing white papers come out that make use of GDELT, and they have posted on their website the correct way to reference the project. In addition to social science researchers, people working with space-time data and in other fields also take advantage of the tool. Nongovernmental organizations could make use of the information in their work, especially as more reporting becomes available about happenings in underdeveloped areas such as the Central African Republic, where gathering data is difficult. Project personnel open the data to everyone, for whatever use individuals choose. Leetaru says this differs from the traditional academic model in which researchers create datasets and hoard them, trickling out papers over time.
However, anyone interested in accessing the dataset needs to understand it truly is big data. “This data is massive,” Leetaru says. “You can’t just load it into Excel. It requires technical and statistical skills to really use it. It’s not a magic dataset. It’s enormously powerful. It requires thought in how to use it.”
The open collaboration on GDELT requires a certain amount of technical expertise with big data, but little specific skill other than that. Developers say the blog they use to update interested parties on news has been useful. Project personnel encourage collaboration. The website has information for contacting them with ideas.
More and more information becomes available regularly, and the personnel behind GDELT believe that soon almost all the newspapers in the world will be online. “Every day the total volume of news media available online grows by some percent,” Leetaru says. “It truly is exponentially rising over time.” In addition, physical limitations do not apply to the Internet. “Potentially, space is infinite,” Leetaru adds. More small papers around the world are starting electronic versions, improving the resolution of datasets. The impact is especially meaningful for collecting information on third-world countries where papers traditionally have had only physical versions. The infinite space also affects GDELT because more core reporters can post blogs with additional resources, and citizen media can be incorporated as well, adding another layer of fidelity.
Developers have designed their tool to filter known spoof sites so they do not skew the data. Even though erroneous reports still can find their way into the system, the huge swath of information it takes in helps keep the presented data relevant. Because the system draws on so many sources, consistent reporting can be identified, and that information comes through in the final results. The developers explain that despite best efforts, mistakes are made in both human and machine coding. Schrodt believes sophisticated coders understand that data will not be perfect, but persons less familiar with such processes might not.
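One simple way to picture how spoof filtering and multi-source consistency could work together is sketched below. It is an assumption-laden illustration, not a description of GDELT's method: the blocklist contents, the (event_key, source_domain) report format, the function name and the min_sources threshold are all hypothetical.

```python
from collections import defaultdict

# Hypothetical blocklist of known spoof or satire domains.
SPOOF_DOMAINS = {"spoof-news.example"}


def corroborated_events(reports, min_sources=2):
    """Keep events reported by at least `min_sources` distinct non-spoof domains.

    Each report is an (event_key, source_domain) pair. Returns a dict mapping
    each surviving event key to its number of distinct sources.
    """
    sources_by_event = defaultdict(set)
    for event_key, domain in reports:
        if domain in SPOOF_DOMAINS:
            continue  # drop reports from known spoof sites
        sources_by_event[event_key].add(domain)
    return {
        event_key: len(domains)
        for event_key, domains in sources_by_event.items()
        if len(domains) >= min_sources
    }
```

Counting distinct domains rather than raw articles means one outlet repeating itself does not look like independent confirmation, although an erroneous story picked up by several outlets would still pass such a check.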
Developers have a grant from Google Translate to expand to other languages, increasing the number of sources that will contribute data. Advances in machine translation for the system could improve coverage of areas such as Latin America and China. Ultimately, Leetaru says, they hope to machine translate every source the program encounters.
GDELT 2.0 is under development; a new coder is halfway finished, according to Schrodt. It incorporates more sophisticated natural language processing software from Stanford University and should offer more knowledge of political actors. A grant from the National Science Foundation last summer allowed the researchers to perform more open-source work. Schrodt explains that an assortment of incremental improvements taken together should result in marked improvement in the precision of the data. Leetaru says the upgrade will extend the universe of what can be done with the data and the types of applications possible. Funding for the project comes in bits and pieces. “It’s really been from the grass roots,” he explains.
A challenge that faces GDELT involves a lack of specific places mentioned in stories. Schrodt says, “I don’t think [people] appreciate the fact that geolocation is dependent on what we see in the news reports.” Wichita, Kansas, shows up as a location for riots and other abnormal events because it is approximately the geographic center of the United States, meaning it is listed as the city location for articles that mention only the country. Leetaru explains that the system cannot magically catalogue every event on Earth. It is limited by what is reported. Certain areas naturally will have more coverage than others. This is why the team seeks out new sources. The older information has extra limitations because less was available electronically. “We put forth a view of the world as put forth by the world’s media,” he says. So when users look up information on a terrorist group, the most closely associated person might be a professor at a U.S. university considered a specialized expert source to whom the media turns. This connection is valuable if the user wants to know more and reaches out to the professor.
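The Wichita effect can be reproduced with a small sketch of centroid fallback. The lookup tables, the approximate coordinates and the function name are hypothetical and exist only to illustrate the behavior; the article does not describe GDELT's geocoder beyond this symptom.

```python
# Hypothetical lookup tables with approximate coordinates, for illustration only.
CITY_COORDS = {"Chicago": (41.88, -87.63), "Wichita": (37.69, -97.34)}
# Assumption: the country centroid sits near Wichita, Kansas.
COUNTRY_CENTROIDS = {"United States": (37.69, -97.34)}


def geolocate(city, country):
    """Return (lat, lon, resolution), falling back to the country centroid.

    `resolution` is "city" when the city is known and "country" when only the
    country could be matched. Returns None when neither is known.
    """
    if city in CITY_COORDS:
        lat, lon = CITY_COORDS[city]
        return lat, lon, "city"
    if country in COUNTRY_CENTROIDS:
        lat, lon = COUNTRY_CENTROIDS[country]
        return lat, lon, "country"
    return None
```

In this sketch, carrying a resolution flag alongside the coordinates lets a user set aside country-level matches, so events that name only the United States are not mistaken for events in Wichita.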
Moving forward, the goal is to seek new research, including massive classes of it. Leetaru adds that GDELT offers a fascinating testament to the world now, in which a project of this enormity can come into existence through piecemeal resources.