The Nation's Data Hubs Drive Innovation
A focus on data science extends to improving machine learning.
The National Science Foundation’s Directorate for Computer and Information Science and Engineering is working to create a big data ecosystem. As part of that effort, the NSF, as it is known, is expanding the National Network of Big Data Regional Innovation Hubs, first created three years ago. The hubs, with one location for each U.S. Census region (the Midwest, Northeast, South and West), grew out of the need to aid the development of big data research and to help solve complex societal problems. The hubs are having a positive impact on the growth of machine learning by increasing access to data, methods, networks and expertise, experts say.
The hubs’ contributions to advancing machine learning will help the industry get past the so-called brittleness, or limitations, of machine learning, says Eric Davis, principal scientist at Galois, a research and development firm with headquarters in Portland, Oregon. “One of the more promising things I think that’s happened recently has been the fact that the U.S. government has built a network that they intended us to share standards and data with called the Big Data Hubs,” he notes. “And this has been a critical organizing factor in machine learning.”
The NSF is providing additional funding to operate the hubs and strengthen their role in engaging local and state governments, industry, nonprofits and academia in big data science. The effort necessitates a national coordinating office and aims to elevate the hubs’ output to a national level.
“We’ve shifted fundamentally to talk about data science and build capacity in data science, and of course, machine learning comes under that umbrella,” explains Melissa Cragin, executive director of the Midwest Big Data Regional Innovation Hub. “And so while we’re called the Big Data Innovation Hubs, we’re very focused on building capacity in data science, building expertise, access to data-related services and networks related to all things data science.”
That means giving “all kinds of communities” access to data-related skills, services, tools and opportunities, Cragin states. By developing public/private partnerships and working with groups to leverage these resources, the hubs can help coordinate solutions to “shared grand challenges,” she notes. The hub also is endeavoring to extend data science research and education to undergraduate institutions, including minority-serving institutions, to help add data skills for the developing workforce, she states.
The regional aspect allows each hub to identify the priority areas, or “spokes,” that it is pursuing. For the Midwest, issues relating to water quality; digital agriculture and unmanned aerial systems; and food, energy and water, among others, play a major role.
“One of the things that sets us apart from the other hubs, of course, is that we’re landlocked—apart from the Great Lakes and major rivers,” Cragin notes. “We’ve got water quality issues that are important for drinking and recreation, and most certainly for agriculture and the environment. Our regional focus eases the work of bringing together experts and resources from academia, industry, government and nonprofits around these grand challenges that we have in the region.”
In addition, the Midwest Hub is pursuing an effort called “Machine Learning Farm-to-Table,” Cragin says. Officials realized that researchers from the hub’s food, energy and water, and digital agriculture working groups didn’t have a lot of meetings in common and weren’t interacting often. “We saw that it was very important to get them to talk to each other because they each have data that the other needs,” Cragin stresses. “And the potential opportunity for cross-community interaction and collaboration and research is quite important. We saw that as a gap, so we put together the Machine Learning Farm-to-Table effort.”
The hub brought in industry, government and academia, as well as machine learning and other methodology experts to help domain scientists understand the large data sets being produced around plant sciences, agriculture, food, energy and water. The effort promoted new technologies for data science, applications of machine learning and statistical approaches.
“It’s very important that domain scientists begin to have better access to and use of computational and statistical methods that they may not have had training on before, so that they can decide what the best approaches are to answer questions they have now that they have these big new data sets,” she stresses.
Cragin admits that developing direct partnerships with private industry can sometimes be a challenge. “We’ve had good technology industry collaboration, but the difficulty has been the value proposition for some companies in other sectors in partnering with the hubs,” she shares. “We don’t provide direct services. We can get industry to come to our domain-based meetings and machine learning or big data meetings. And in the hub’s priority areas, such as water, digital agriculture and unmanned aerial systems, and plant sciences, it’s working, but getting sector companies to partner with the hubs has been a different kind of challenge.”
Naturally, technology or data-oriented companies that have a vested interest in advancing their products or expertise have offered partnerships. Companies including Microsoft Research, DataScience.com, Google and Amazon have had successful partnerships with the hub. Early in the process, Microsoft Research offered the four hubs $3 million of Microsoft Azure and cloud services so that the hubs could individually bring projects to the cloud, harness analytics and manage big data sets and processing.
These tools and partnerships, along with the hub’s everyday efforts to expand the big data ecosystem, are making an impact, Cragin asserts.
“Each hub is fully engaged in data science and machine learning, and now we’re beginning to think about how all the hubs are doing this,” Cragin observes. “How do we begin to provide both informed information resources, but also better direct links to real data sets that will support machine learning and research and development as well? We’re beginning to think about how to do that and to provide resources that will increase the use of data sets for machine learning.”
Meanwhile, the Northeast Big Data Regional Innovation Hub is based at Columbia University, home to the Data Science Institute (DSI), which was founded as part of former New York City Mayor Michael Bloomberg’s Applied Sciences Initiatives, explains the hub’s Executive Director René Bastón.
“The Northeast has extraordinary challenges as the oldest region in the country. We have the densest population, aging infrastructure and economic challenges that are rust-belt type of challenges,” Bastón states. “But at the same time, these are our strengths. For example, having the densest and most diverse population in the U.S. is a huge advantage for biomedical research. What we do at the Northeast Hub is try to understand how we can take advantage of the region’s strengths to bridge its societal challenges. It’s how we [can] cross-fertilize across all of these sectors, industry, academia, government, nonprofits, and create the kinds of systems and network effects that leverage data science approaches to benefit society.”
Like the other hubs, the Northeast Hub is working to advance machine learning through the improvement of data, access to data and building partnerships. “A big challenge in machine learning is getting access to large volumes of high-quality data sets, particularly if you’re doing supervised learning, and I think the impact of what we can do in that space is quite significant,” Bastón suggests. “Part of what the hubs are charged with is to make data more discoverable and more accessible and help integrate a number of heterogeneous data sets in any category.”
The Northeast Hub has a project, soon to be launched, that involves the notion of a so-called big data map. Working with some of the other regional data hubs, the Northeast Hub is looking at how to identify and develop the hardware and software resources available to create such a map.
As part of that initiative, the hubs will take a close look at the challenges or problems stakeholders are working on answering through machine learning, an idea suggested to Bastón by machine learning expert Andreas Müller, an associate research scientist at DSI and author of the O’Reilly Media book Introduction to Machine Learning with Python.
“Müller said that maybe the best way to create a data sharing resource was to gather the challenges people are trying to answer with machine learning, along with any available data they believe would be useful, rather than gathering as many data sets as possible,” Bastón shares. “Many cities with open data portals have used the latter approach and, frequently, only a fraction of the data really get used. Rather than doing something like that, we wanted to turn this on its head and gather the challenges, the problems that people are trying to solve with machine learning and ask them to upload any data sets that are associated with these challenges.”
Under a new pilot program to be introduced later this year called the Collaborative Resource and Understanding Exchange, the hub would then work with a community of users to create solutions to the challenges using the uploaded data sets. With funding from the NSF, the Northeast Hub’s pilot would help the machine learning community work with experts in specific domains to understand their challenges and available data and jointly create solutions to those questions, he explains.
“The platform also will work to overcome one of the hurdles from a supervised machine learning perspective, the lack of high-quality, prelabeled training sets,” Bastón states. “Those are tough to come by and it’s very onerous for any individual machine learning practitioner to label their own data, and it’s expensive to outsource the labeling. Our pilot program will attempt to solve this problem by crowdsourcing the labeling and much of the overall curation of the data sets that are uploaded.”
Meanwhile, the hub will reach out to domain experts, whether in city transportation or healthcare, who aren’t necessarily data scientists or machine learning practitioners, to learn which machine learning solutions they need.
“If it’s a successful experiment and it becomes a new paradigm for how you generate and manage the type of data that machine learning folks really need, and learn what’s really important to that community in how you set up that data and make it discoverable, with developed metadata schemas, that could really be something people can build off of,” Bastón offers. “So we’re very excited to get that launched.”
The Northeast Hub is also considering a project that would “help connect the dots” from an Internet of Things perspective for Smart Cities and examine how to use data in real time rather than retrospectively. “We’re doing a lot in the city space,” Bastón notes. “And we’re in a unique position at the hubs, in that we see across regions, across sectors, across industry, academia and government. Then to have real-time data to feed into a system, we could start to look at, say, the real-time health impacts of environment,” he adds.
“Learning how to take the different areas of converging technologies that have led to our current data science and machine learning capabilities and apply them on a large scale to hopefully have an outsized impact on the world creates fascinating possibilities.”