To meet the challenge of implementing big data, a new international scientific organization is forming to facilitate the sharing of research data and speed the pace of innovation. The group, called the Research Data Alliance, will comprise some of the top computer experts from around the world, representing all scientific disciplines.
Managing the staggering and constantly growing amount of information that composes big data is essential to the future of innovation. The U.S. delegation to the alliance’s first plenary session, being held next month in Sweden, is led by Francine Berman, a noted U.S. computer scientist, with backing from the National Science Foundation (NSF).
Meeting the challenges of how to harness big data is what makes organizing and starting the Research Data Alliance (RDA) so exciting, Berman says. “It has a very specific niche that is very complementary to a wide variety of activities. In the Research Data Alliance, what we’re aiming to do is create really tangible outcomes that drive data sharing, open access, research data sharing and exchange,” all of which, she adds, are vital to data-driven innovation in the academic, public and private sectors. The goal of the RDA is to build what she calls “coordinated pieces of infrastructure” that make it easier and more reasonable for people to share, exchange and discover data.
“It’s really hard to imagine forward innovation without getting a handle around the data issues,” emphasizes Berman, the U.S. leader of the RDA Council, who, along with colleagues from Australia and the European Union, is working to organize the alliance. Ross Wilkinson, executive director of the Australian National Data Service, and John Wood, secretary-general of the Association of Commonwealth Universities in London, are the other members of the council.
Along with designing the architecture of the network infrastructure that will support the alliance, the RDA also will serve as a self-governing community that will review and evaluate proposals to use the scientific data-sharing network for specific projects. Berman emphasizes that while the RDA is being designed with the needs of scientists in mind, she does not see this as a data-sharing tool for academics only. “This is something that needs to be done across various sectors in the community. We are focusing on open-access research data, but what we’re doing here will have implications in other parts of the data world.
“The reason for the Research Data Alliance is to facilitate specific short-term efforts, and those efforts should facilitate the sharing and exchange of data,” she adds. The true measure of the alliance’s work, Berman explains, will be the outcome of the efforts of the various working groups that are being assembled to evaluate and shape the research projects that they select in the future.
Berman, who is also a professor of computer science at Rensselaer Polytechnic Institute in Troy, New York, explains that while others see only the hurdles of taming big data, she looks at the opportunities for faster and better innovation that will accrue from learning how to process and share large datasets in a more timely way, especially for scientists conducting cutting-edge research.
“Size is not the only issue with big data. There is data that is complicated for a variety of reasons: there’s data that is complicated because there is so much of it you can’t visualize it, you can’t handle it; and there’s data for which we have to do very sophisticated analytics,” she explains. One example is data from the social sciences, which requires sophisticated relational analysis that she says is very hard to do. Berman also points to data covered by the Health Insurance Portability and Accountability Act (HIPAA), which is key to initiatives to make health care more available and affordable. This data is governed by strict rules and regulations to protect patient and doctor privacy.
As an example of the kinds of activities envisioned for the RDA, Berman outlines a hypothetical medical study into the incidence of asthma in large urban areas. “As we know, these are very complex problems; they have to do with open-access health records; they have to do with environmental records that have to do with smog and air quality; they have to do with population and geography. We would want to combine all these large datasets.” The key is to make these disparate and often incompatible datasets interoperable and make possible “the kinds of trends-from-noise analysis” that answer challenging questions.
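To make the interoperability problem concrete, the following is a minimal sketch of what combining such datasets can involve. It is not drawn from the RDA or from Berman's description: the three record sets, their field names, the figures and the helper functions are all hypothetical, and the only assumption is that each source identifies cities in a slightly different way.

```python
# Hypothetical records: every field name and number here is invented
# for illustration and does not come from any real dataset.
health_records = [
    {"city": "A", "asthma_cases": 1200},
    {"city": "B", "asthma_cases": 450},
]
air_quality = [
    {"city_name": "a", "mean_pm25": 18.5},
    {"city_name": "b", "mean_pm25": 9.0},
]
population = [
    {"CITY": "A ", "residents": 400_000},
    {"CITY": "B", "residents": 300_000},
]


def normalize_key(raw):
    """Map each source's spelling of a city onto one shared form."""
    return raw.strip().upper()


def index_by_city(rows, key_field):
    """Build a lookup from normalized city key to the original row."""
    return {normalize_key(row[key_field]): row for row in rows}


def combine(health, air, pop):
    """Join the three sources on the normalized city key.

    Cities missing from any source are skipped rather than guessed at.
    """
    air_by_city = index_by_city(air, "city_name")
    pop_by_city = index_by_city(pop, "CITY")
    combined = []
    for row in health:
        city = normalize_key(row["city"])
        if city not in air_by_city or city not in pop_by_city:
            continue
        residents = pop_by_city[city]["residents"]
        combined.append({
            "city": city,
            "cases_per_1000": 1000 * row["asthma_cases"] / residents,
            "mean_pm25": air_by_city[city]["mean_pm25"],
        })
    return combined


merged = combine(health_records, air_quality, population)
for row in merged:
    print(row)
```

Even in this toy case, most of the code reconciles field names and key formats rather than analyzing anything, which is the kind of friction shared standards and registries are meant to reduce.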
As outlined on the RDA’s website, membership is open to everyone, and individuals can apply through an email address found on the website. The goal is to cast as wide a net as possible among researchers in disciplines that work with very large datasets to enable effective problem solving. Proposals for projects may be submitted through the RDA’s online forum, available only to members, in the form of a “case statement” outlining the goal of the project with specific information on the involvement of the individuals proposing the project.
Following a peer-review process familiar to most academics, the RDA community is given approximately four weeks to offer comments on a proposal, and then the RDA Council takes another four weeks to deliberate and consider the merit of the idea. At the end of the process, if a proposal is approved, a project working group is formed, and that group then is responsible for executing the research data sharing project.
As envisioned, a project working group could include RDA members from other scientific disciplines whose insights and datasets might have relevance to the project at hand. Continuing with the example of the hypothetical asthma study, Berman foresees a future project working group including “people from the health industry. It might be people who know something about census datasets; it might be people who have a sense of environmental activities. You might envision a wide variety of people coming together to answer specific questions.”
The average project for the RDA is expected to run between a year and 18 months, and Berman describes these as projects that can overcome roadblocks to sharing data. “A concerted effort for a year, a year and a half, by a number of people who are willing to create some sort of tool, or policy, or implementation, or framework and adopt it, and then move on and start solving those problems.”
The initial work to form the RDA took place during several meetings of a steering group, which included Berman and six other prominent researchers, in Washington, D.C., last October. Since then, more than 120 U.S. and international organizations have joined in the organizing effort. The steering group received a financial shot in the arm last November when the NSF awarded a $2.5 million grant to Berman and Rensselaer Polytechnic Institute to facilitate her RDA work and to support other U.S. organizations that are involved.
Robert Chadduck, NSF program director for data and cyberinfrastructure, believes that the effort to organize the RDA “responds to a global community that has interests in advancing science and scholarship and education and commerce by developing relationships and capabilities for the exchange, sharing and understanding of science data.” He believes one of the key selling points for the RDA is the priority set by the working groups to identify and implement approved research projects within the defined timeframe of one year to 18 months.
The initial RDA plenary session is scheduled to take place in Gothenburg, Sweden, March 18-20, to establish governance and to set priorities for the alliance. In the months leading up to this session, groups have been meeting to draft proposals for projects that meet the initial criteria outlined by Berman.
Along with working groups that are meeting to propose projects for the RDA, other administrative working groups are meeting informally leading up to the plenary to devise how the RDA will be governed and managed as a global research consortium. “Some of the working groups will be really interested in standards that work for the alliance’s substantial community that help others get to common ground to answer particular questions,” Berman explains. Other working groups will focus on infrastructure, including a common registry, or a means of discovering common registries regarding specific topics, while another working group will focus on policy as it pertains to a topic. As an example, Berman points to the National Institutes of Health’s policy in which groups that were awarded research grants were required to enter the results of their work into the Protein Data Bank. That requirement, she says, created “a worldwide repository that has been a huge resource for the medical research community.”
RDA working groups also are expected to formulate common best practices for the alliance, an effort to help the research community standardize on the best ways, and best means, of archiving big datasets for easy sharing and to facilitate collaboration. “The academic community is struggling to figure out, ‘Where do you put that data that you declared in your data management plan is vital to the community?’” Berman says. She points to the university library community, which she says is “eager to work with academics to develop a framework, a business model for the management of big data related to research.”