Making Sense of Big Data From Supercomputers
Big data can mean big problems for people trying to derive usable information from a large number of sources. Since coming into existence in March, the Scalable Data Management, Analysis and Visualization Institute has made strides toward resolving this issue for programs running on supercomputers. The young organization’s efforts apply to a variety of scientific fields—including nuclear physics—and its tools are open source so that others can take advantage of the findings.
Funding comes from the U.S. Department of Energy (DOE), under which the institute falls. For the next five years, the department will provide $5 million annually to support research among the members, which include seven universities and six national laboratories. Lawrence Berkeley National Laboratory is the lead institution, as one of its staff members, Arie Shoshani, was chosen as the institute’s director. The private company Kitware Incorporated also is a member, supplying its visualization toolkit for partners to use. Shoshani explains that his organization’s emphasis on scientific data, large-scale simulations and in-situ processing separates it from other agencies working on big data projects. Running data reduction, analysis and visualization tasks in situ means performing them on the same machine where the simulation takes place. These functions operate on the data while it is still in the system’s memory, before it is written to disk, which greatly reduces the amount of data that must be stored for further, or post-processing, analysis.
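The in-situ idea described above can be illustrated with a minimal sketch. The function names and the summary statistics chosen here are hypothetical, not the institute’s actual tools; the point is only that each timestep’s large output is reduced in memory, so only small summaries ever reach storage.

```python
import numpy as np

def simulate_step(step, size=1_000_000):
    # Stand-in for one timestep of a large-scale simulation
    # (hypothetical; a real code would produce mesh fields, not noise).
    rng = np.random.default_rng(step)
    return rng.normal(size=size)

def in_situ_reduce(field):
    # Reduce the full field to a small summary while it is still in
    # memory, before anything is written to disk.
    return {
        "min": float(field.min()),
        "max": float(field.max()),
        "mean": float(field.mean()),
        "std": float(field.std()),
    }

summaries = []
for step in range(5):
    field = simulate_step(step)              # large array, memory only
    summaries.append(in_situ_reduce(field))  # tiny summary kept for output

# Only the summaries (a few numbers per step) would be stored,
# instead of five full arrays of one million values each.
print(len(summaries), sorted(summaries[0]))
```

In a real workflow the reduction step might also produce rendered images or compressed features, but the storage saving follows the same pattern: keep the derived product, discard the raw field.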
The director explains that his organization supports supercomputer simulations themselves, the in-situ analysis and visualization of data performed on the machine where the simulations run, and the preparation of data for post-processing analysis and visualization. “Basically, our institute is needed in order to support the success of large-scale simulations in accelerating scientific discovery,” Shoshani states. With the tools and research produced by team members, other scientists can glean the information they need from their experiments and simulations, focusing on their areas of interest instead of facing large volumes of data they cannot understand. The institute’s work makes it possible to understand the information in the data, which leads to new knowledge. “Supercomputers are not of much value if you can’t use tools to get the information,” Shoshani says.
The resources also help detect problems early in experiments. Researchers can use the institute’s tools to identify issues and halt unproductive simulations instead of wasting expensive supercomputer time.
The institute does not perform research and development activities. Its main priority is to apply existing technology tools and products in data management, analysis and visualization to various DOE applications running very large simulations on supercomputers. “We’re really in the business of taking things that we know work and applying them to this environment where we have faster and faster machines with more and more cores in them,” Shoshani states. Three of the world’s 10 fastest supercomputers reside at DOE facilities. Institute team members also focus on adapting tools to meet specific needs and on identifying needed technologies and extending tools to support them.
Shoshani says the biggest successes to date include tools that speed up computation, that index data for real-time search and that advance parallel visualization, revealing phenomena not understood previously. Examples of the last include visualizing magnetic reconnection in plasma physics, the process behind the Northern Lights, as well as tracking hurricane patterns and atmospheric rivers.
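The article does not describe how the real-time search indexes work. One technique commonly used for searching large scientific datasets is bitmap indexing over binned values; the toy sketch below illustrates that general idea only, and is not a description of the institute’s actual tools.

```python
import numpy as np

# Toy bitmap index over binned values (illustrative only; production
# indexes use compressed bitmaps and handle many bins and range types).
values = np.array([0.1, 0.7, 0.4, 0.9, 0.2, 0.6, 0.8, 0.3])
bins = np.array([0.0, 0.25, 0.5, 0.75, 1.0])

# Build one bitmap per bin: bit i is set if values[i] falls in that bin.
bin_ids = np.digitize(values, bins) - 1
bitmaps = {b: bin_ids == b for b in range(len(bins) - 1)}

# A range query such as "value >= 0.5" becomes an OR of whole bitmaps,
# touching only the index instead of rescanning the raw data.
hits = bitmaps[2] | bitmaps[3]
print(np.flatnonzero(hits))  # indices of matching values: [1 3 5 6]
```

Because the bitmaps are built once and queries reduce to bitwise operations, searches stay fast even as the underlying dataset grows, which is what makes real-time querying of simulation output feasible.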
Though the institute is only seven months old, the various team members already have produced more than 30 papers. Several toolkits are available as well, divided into categories based on their purposes. Shoshani says this rate of progress and publication is possible because of the institute’s large size and because the organization builds on 10 prior years of research conducted through the Scientific Discovery through Advanced Computing effort.