Desperately Seeking Big Data Standards
An industry group is racing to draft commonly accepted best practices for security and privacy.
A coalition of information technology companies has banded together to create industrywide standards to help both the public and private sectors manage large datasets. This effort is tapping some of the best minds in computing to work with datasets that are now reaching petabytes and exabytes in size.
Whether these large datasets come from harvesting the buying habits of customers on a store’s website or from a military fleet of drones performing aerial reconnaissance over Iraq or Afghanistan, managing huge volumes of information can be compared to trying to take a sip of water from a fire hose. New ways must be developed to manage and secure large volumes of information when traditional database applications fall short. In some cases, the inability to provide timely analysis of an extremely large dataset may result in a missed business opportunity or, as in the case of the intelligence community, failure to capture vital information for security purposes. In addition, the limitations of existing analytic tools when applied to big data may hamper a business or government organization’s ability to spot future weaknesses or to be proactive about important decisions.
“We are trying to identify scalable techniques for data-centric security and privacy,” says Sreeranga Rajan, chair of the Big Data Working Group and director of software systems innovation with Fujitsu Laboratories of America in Sunnyvale, California. This working group is a subset of a larger, not-for-profit, industry-based organization, the Cloud Security Alliance, which is concerned with broader questions of information assurance for the cloud computing industry.
“So far, we have 60 companies that are participating in the working group, companies both big and small,” he adds. Charter member companies of the working group, which was organized officially last August, include eBay and Verizon.
The working group spans hardware, software, data security, database management, storage and other categories of information technology companies, and Rajan insists he is being realistic in sizing up the challenge ahead. “It’s a large, collaborative group, but it’s very promising, and we’ve received a lot of interest from a lot of experts in this field,” he explains.
To focus the efforts of the Big Data Working Group, members have organized themselves into six subgroups, each with its own theme and goal.
The Big Data-Scale Crypto subgroup is working on cryptographic standards pertaining to “the volume, variety and velocity of the data and to scale that to the big data environment,” says Arnab Roy, research staff member, Fujitsu Laboratories of America. Specifically, this group is tackling issues of privacy and storage related to cryptography, as well as how remote server nodes can communicate securely with each other to protect data while facilitating sharing, he explains. “The scale is so massive, you need to be able to communicate between hundreds of thousands of nodes. You need to protect data that is governed by rich, complex policies. You need to determine how to protect data when you are gathering it at a much faster pace from diverse endpoints.” This group also is working on how to search within a large dataset while maintaining data security and integrity.
In many ways, cloud computing and big data go hand-in-hand, says Wilco van Ginkel, senior strategist with Verizon, who chairs the Cloud Infrastructure subgroup and also serves as a co-chair of the larger Big Data Working Group. That is because the very definition of cloud computing (many real and virtual servers clustered together to create sharable, large-scale computing resources) also makes possible the manipulation of the enormous datasets common in big data applications. “In order to process big data in any way shape or form, you need big computing power, and the cloud provides cheap, easy-to-scale computing power,” he explains. The cloud provides the processing and storage environment for big data. Van Ginkel also says his subgroup is exploring how cloud computing might allow data analysis to uncover how big datasets can interact with each other, perhaps revealing new ways to understand the information they contain.
The Data Analytics subgroup explores challenges in the area of data mining, according to Neel Sundaresan, senior director and head of eBay Research Labs and also a co-chair of the Big Data Working Group. When the Central Intelligence Agency dispatches a drone into the mountains of Afghanistan, analysts want to be able to sift through the big data files that are sent back in real time and obtain actionable information, in some cases in near real time, to deal with contingencies. This subgroup also is concerned with issues surrounding data sharing, privacy (including not revealing the personal information of individual users in a way that could identify them) and access control. Access control concerns allowing selective access to portions of a big dataset according to policies set by managers. Sundaresan adds that, because of the need for faster analysis, a fundamental change is being considered in when data analysis can take place. “Before, we used to think of analytics as something that came at the end. You run an experiment; you get some data; and then you analyze it.” One of the goals of his subgroup is to “provide tools and provide layers on top of the data that allow everybody to look at data and to make their own analysis,” he concludes.
The Framework and Taxonomy subgroup could be described as something of an umbrella for the larger working group. “When we talk about big data, we want to be sure there is a common understanding, a common definition of what big data is,” van Ginkel emphasizes. “We think this is important, because if you look at the other subgroups, there is a common foundation when we talk about big data.” He adds that these common definitions help to guide the work of the other subgroups and provide a common frame of reference for developing industrywide standards for big data.
Van Ginkel says that because the industry itself is still trying to settle on what defines big data, issues addressed by the Policy and Governance subgroup remain a “moving target.” He says the work of this subgroup, which is in many ways related to that of the Framework and Taxonomy subgroup, is evolving as more research is done in areas covered by the other subgroups.
The Privacy subgroup deals with what van Ginkel describes as the social and ethical uses of information gleaned from big data. Using his own company as an example, van Ginkel says that Verizon Wireless, which sells connectivity to smartphones and tablets, is amassing a growing dataset full of information about its clients. Privacy concerns, however, dictate that important decisions be made about how specific, personally identifiable information in that data is used, regardless of whether those decisions result, for example, in improved wireless data service for customers, he says. In addition, questions need to be resolved over the transnational use of big data. “The whole legal aspect of cloud computing is a challenge. I live in Canada, and if I use my mobile phone in Canada, according to Canadian law, my service provider, Bell, may be able to use my customer data in one way. As soon as I cross the border into the United States, the same data is traveling over Verizon’s network, and they may have different laws and rules governing the use of that data.” Finally, van Ginkel points out, there is the technical challenge of retaining the anonymity of certain personal data buried deep within big datasets.
While the subgroups continue their work, the larger Big Data Working Group also has several projects in various stages of development. One is an experimental platform of big datasets from a variety of industries so that the subgroups can test the results of their work as needed, Rajan says. “The experimental testbed would draw test data from the health care, financial and other sectors and could help accelerate the development work of the subgroups,” he explains. In addition, the Big Data Working Group is expected to release its first in-depth research report on the subject of big data and security during the Cloud Security Alliance congress. The Big Data Working Group also recently released its list of “Top Ten Big Data Security and Privacy Challenges,” related to the topic-driven work of its subgroups. Rajan says the goal of the list is to help assemble research proposals to raise funds for joint government and industry projects.