Modeling Reliability In Distributed Computer Networks
Researchers hunt for weaknesses in large node-based systems.
U.S. government computer scientists are studying how computer grids react to volatile conditions to understand how events such as virus attacks, sudden changes in workload and cyberattacks can affect linked groups of hundreds or thousands of geographically dispersed machines.
Managed by the National Institute of Standards and Technology (NIST), Gaithersburg, Maryland, the three-year project is developing software-based models to determine how vulnerable grid networks are to failure. According to NIST computer scientist Christopher Dabrowski, these models will allow organizations to identify and solve problems quickly as they deploy grid systems.
Computer grids are a means to harness the power of hundreds or thousands of computers efficiently for a single project. This computing method takes advantage of “down time,” when machines are not using their full processing power. The distributed nature of grid computing allows many systems to solve problems rapidly in fields such as engineering, genetics and financial services.
Grid computing differs from parallel processing, which typically involves multiple computers tied together at a single site using the same software. Dabrowski notes that grid computing uses larger, geographically dispersed systems featuring many processing nodes and may involve multiple organizations. This loose architecture creates more flexibility in terms of how nodes join or leave a grid. The number of users also may be larger than in a parallel-computing situation. “Users dynamically submit their jobs into a grid system, and you may have many jobs running in parallel on widely dispersed nodes,” he says.
Another difference between grid and parallel computing is that multiple administrators and oversight groups often control grid resources. This arrangement is more heterogeneous than the parallel-systems approach, which is centralized. Kevin Mills, NIST senior research scientist, explains that the size and dispersion of nodes in grid systems create a higher degree of uncertainty in the architecture than exists in parallel networks.
One example of a grid system currently in use is an effort by the Search for Extraterrestrial Intelligence (SETI) organization to analyze radio data from space. Participants in this program receive software so their computers can analyze data during down time rather than run a screen saver. “That application could have been used on a parallel computer, but I think the people who started the project had in mind that there were many more idle CPU [central processing unit] cycles available in the general world and that people would be willing to offer them,” Mills says.
A key advantage of a grid system is that many more nodes can focus on a problem than is possible with a typical parallel-computing solution. “You’re taking advantage of the idle time on a great number of remote computers. Potentially integrated, you could solve a task much more quickly than in a single- or multiple-site dedicated parallel system,” Dabrowski explains.
Although some preliminary research and small-scale coding are complete, Dabrowski notes that the program will not produce any substantial results until next year at the earliest. While the main focus is on the reliability and robustness of grid systems, he cautions that no new experimental grids will be created for the study. The NIST team instead will assess how existing grid systems respond to failures and how researchers can correct those failures to allow applications to run in a reasonable amount of time. “We’re not really pushing forward the basic capabilities of the technology. We are looking at the technology that’s being developed and creating metrics and tests to assess its potential,” he relates.
To assess how grids operate under various conditions, NIST researchers are developing programs to simulate medium- and large-scale grid systems. Dabrowski observes that an important part of understanding system failures is learning how grid behavior changes as systems grow in size. He anticipates that, in some situations, large grid systems may become unpredictable and produce undesirable results or even fail. “We want to examine the potential of this technology to see that situations like that don’t occur,” he says.
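How behavior can shift with scale and failure rate can be illustrated with a toy Monte Carlo sketch in Python. It is not one of the NIST models: the function name `rounds_to_finish`, the round-based scheduling and the independent per-task failure probability are all assumptions made purely for illustration.

```python
import random


def rounds_to_finish(num_nodes, num_tasks, failure_rate, rng, max_rounds=10_000):
    """Count scheduling rounds needed to complete num_tasks independent tasks.

    Hypothetical model: in each round every node takes at most one pending
    task, and each attempted task is lost with probability failure_rate and
    returns to the queue. Gives up after max_rounds if work never completes.
    """
    pending = num_tasks
    rounds = 0
    while pending > 0 and rounds < max_rounds:
        attempted = min(pending, num_nodes)
        # A task completes when the random draw is at or above the failure rate.
        completed = sum(1 for _ in range(attempted) if rng.random() >= failure_rate)
        pending -= completed
        rounds += 1
    return rounds


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so repeated runs are comparable
    for nodes in (10, 100, 1000):
        for rate in (0.0, 0.1, 0.2):
            print(nodes, rate, rounds_to_finish(nodes, 10_000, rate, rng))
```

Sweeping the node count and failure rate in this way is the kind of experiment a simulation makes cheap; a realistic model would also need correlated failures, heterogeneous nodes and real workloads, none of which this sketch attempts.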
Grid computing is still in its infancy, and many of the technology’s specifications are incomplete, Mills says. However, a great deal of interest and momentum exists for developing grids. “It’s our assessment, from the government customers we’ve talked to, that this is the direction that they are likely to pursue, and there is no waiting. It would be detrimental to wait until this technology rolls out to have some understanding of its behavior on a large scale,” he maintains.
Dabrowski notes that existing grid networks largely are operated by single organizations or affiliated groups. “You haven’t yet seen large-scale grids that involve many different organizations where users and service-providing nodes either randomly join or leave such grids,” he says.
Because this type of large-scale heterogeneity does not yet exist, specifications are being developed to enable flexible interoperability. These guidelines are necessary because future systems will be so large and potentially unpredictable that it may be very difficult to simulate their behavior, Dabrowski says.
The NIST effort also touches on the relationship between security and reliability. For example, a denial-of-service attack might affect the overall reliability of a system by rapidly spreading failures across a grid. Mills, however, does not differentiate between the two. “We just look at the range of the failure rates that you can sustain. You might argue that higher failure rates might be the fault of concerted attacks and lower rates would be the result of normal failures,” he says.
Another part of the research will determine acceptable failure rates for grid computing systems. Mills notes that even within acceptable failure rates of 10 or 20 percent, protocol design errors may cause a network to perform poorly—on a level comparable with much higher failure rates. “That’s the kind of thing we find and feed back to the standards bodies,” he says.
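One way to see how a design flaw can mimic a much higher failure rate is a back-of-the-envelope calculation. The Python sketch below assumes independent message losses and compares a hypothetical flawed design that sends a message once with a hypothetical sound design that allows several attempts. Neither describes a specific protocol examined by NIST, and the function names are illustrative only.

```python
def delivery_probability(failure_rate, attempts):
    """Probability that at least one of `attempts` independent sends succeeds."""
    return 1.0 - failure_rate ** attempts


def equivalent_failure_rate(failure_rate, flawed_attempts, sound_attempts):
    """Failure rate at which the sound design delivers no better than the
    flawed design does at `failure_rate` (assumes independent losses)."""
    undelivered = failure_rate ** flawed_attempts
    return undelivered ** (1.0 / sound_attempts)


if __name__ == "__main__":
    for rate in (0.1, 0.2):
        print(rate, delivery_probability(rate, 1), equivalent_failure_rate(rate, 1, 3))
```

Under these assumptions, a one-shot design at a 20 percent failure rate delivers about as reliably as a three-attempt design would at a failure rate of roughly 58 percent.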
At this early stage of the program, NIST researchers do not have many examples of failures in grid systems. One reason is that the networks now in use are relatively small and controlled by single organizations. Mills explains that the large-scale failures he predicts will not occur under these circumstances but may arise as grids increase in size and scale and as multiple user groups begin sharing resources.
His observations are based on problems seen in smaller system deployments in local area networks. Some of these difficulties occurred in service discovery protocols. These are key parts of grid applications that locate the resources needed to process a problem. Early research on these protocols focused on systems for local area networks such as Universal Plug and Play and the Jini service location protocol. “In the work we’ve done, we found cases where feature interference in the protocols resulted in unexpectedly low performance at low error rates,” he says.
For example, most of these protocols can notify users when a change takes place in the system. These applications are built to operate on protocols such as the Transmission Control Protocol (TCP), which is used to deliver a notification. But if an alert is not delivered, which can happen at low failure rates of up to 20 percent, recovery depends on other mechanisms in the protocol that assume the failure is long-lasting. These mechanisms do not activate for transient failures, which can lead to users never receiving the data.
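The interaction described above can be sketched in a few lines of Python. The model is deliberately simplified and hypothetical: `link_up` is a per-step record of whether messages get through, the service sends a single notification at `change_at` with no retry, and the user re-fetches full state only after an outage of at least `outage_threshold` consecutive steps. None of these names or rules is taken from an actual protocol specification.

```python
def user_receives_update(link_up, change_at, outage_threshold):
    """Return True if the user ends up holding the changed value.

    link_up: list of booleans, one per time step (True = messages get through).
    change_at: step at which the service changes and sends one notification.
    outage_threshold: consecutive failed steps after which the user treats the
        failure as long-lasting and re-fetches state once the link returns.
    """
    has_update = False
    changed = False
    consecutive_down = 0
    for step, up in enumerate(link_up):
        if step == change_at:
            changed = True
            if up:
                has_update = True  # the single notification is delivered
            # If the link is down, the notification is lost and never resent.
        if up:
            # Recovery runs only after an outage long enough to be noticed.
            if consecutive_down >= outage_threshold and changed:
                has_update = True
            consecutive_down = 0
        else:
            consecutive_down += 1
    return has_update


if __name__ == "__main__":
    transient = [True, True, False, True, True, True]
    long_outage = [True, True, False, False, False, True]
    print(user_receives_update(transient, 2, 3))    # False: brief loss, no recovery
    print(user_receives_update(long_outage, 2, 3))  # True: long outage triggers re-fetch
```

In this toy model the brief outage leaves the user with stale data while the longer one does not, which mirrors the counterintuitive result of poor performance at low failure rates.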
By the time the study is finished, Dabrowski hopes to have several simulation models that can be used by anyone developing a grid system. He notes that the models are standards-based and will simulate the known application workloads of a grid network. “What we’ll produce will be the models themselves, together with a set of metrics and tests that we will use to evaluate these simulated systems,” he says.
Although tests and models have their limitations, Dabrowski relates that they do allow users to simulate large-scale operational systems for much less than what it would cost to develop and implement a large testbed. These software models will allow system developers to test their grids, or they may be used to develop new assessment and analysis tools.
A number of commercial and government organizations are interested in grid computing, even though the technology is still being developed. Besides the SETI applications, Dabrowski notes that commercial firms are beginning to develop systems with hundreds or thousands of nodes to solve problems in the pharmaceutical industry and in various engineering and financial applications.
The U.S. Department of Energy is funding a grid computing system for major scientific applications such as simulating nuclear explosions, modeling various energy systems and studying combustion technologies. The department will use this network to analyze the massive amounts of data that will be produced by the European Organization for Nuclear Research’s Large Hadron Collider. Other government organizations funding initiatives include the National Science Foundation, NASA and the Defense Information Systems Agency, which is planning to build a service-oriented network architecture based on the technology.
Industry standards groups such as the Global Grid Forum are developing rules and protocols governing how resources are used and administered on these networks. These standards will allow services within a grid network to interoperate through directories. Dabrowski notes that other standards-making bodies, including the Enterprise Grid Alliance and the World Wide Web Consortium, will play an important part in managing the development of this new technology.