Partnership Promises to Prevent Cloud Computing Problems
Software developed by university researchers accurately predicts cloud computing issues before they occur, enhancing reliability, cutting costs, potentially improving cybersecurity and even saving lives on the battlefield.
Infrastructure-as-a-service clouds are prone to performance anomalies because of their complex nature. But researchers at North Carolina State University (NCSU) have developed software that monitors a wide array of system-level data in the cloud infrastructure—including memory usage, network traffic and power consumption—to define normal behavior for every virtual machine in the cloud, detect deviations and predict anomalies that could create problems for users.
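The general approach can be illustrated with a deliberately simple sketch: learn per-metric statistics from normal-only measurements, then flag any observation that strays too far from that baseline. This z-score detector is a hypothetical stand-in, not UBL's actual learning model; the metric names are invented for illustration.

```python
import statistics

def build_baseline(samples):
    """Learn per-metric mean and standard deviation from normal-only data."""
    baseline = {}
    for metric in samples[0]:
        values = [s[metric] for s in samples]
        baseline[metric] = (statistics.mean(values), statistics.stdev(values))
    return baseline

def is_anomalous(baseline, observation, threshold=3.0):
    """Flag any metric more than `threshold` standard deviations from its mean."""
    for metric, (mean, stdev) in baseline.items():
        if stdev > 0 and abs(observation[metric] - mean) / stdev > threshold:
            return True
    return False

# Train on normal-only samples of memory, network and power metrics.
normal = [{"mem_mb": 500 + i % 20, "net_kbps": 100 + i % 10, "watts": 80 + i % 5}
          for i in range(60)]
baseline = build_baseline(normal)
print(is_anomalous(baseline, {"mem_mb": 510, "net_kbps": 105, "watts": 82}))   # typical load
print(is_anomalous(baseline, {"mem_mb": 4000, "net_kbps": 105, "watts": 82}))  # memory spike
```

A real system would replace the z-score with a learned model and stream metrics continuously, but the workflow—baseline from normal data, then alarm on deviation—is the same.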
The software, known as Unsupervised Behavior Learning (UBL), has been shown to identify 98 percent of anomalies with only a 1.7 percent rate of false alarms. It has attracted funding from the U.S. Army Research Office, National Science Foundation, National Security Agency (NSA), Google and IBM.
“The benefits will be huge. First of all, it will definitely save a lot of money for the infrastructure providers,” declares Xiaohui “Helen” Gu, an NCSU associate professor of computer science. “If you don’t have any tools to help you predict or prevent or diagnose problems, everything is done manually. It typically takes a very long time to recover a system.”
For the military, which is now using cloud computing on the battlefield, the consequences of an outage can be especially severe. “As long as the government or the military uses large-scale computing infrastructures, these kinds of tools will be useful for them to make sure the operation is successful,” Gu points out. “Otherwise, if they are in a mission-critical task and the infrastructure is dumped and they don’t have any tools to recover the problem, that could be devastating. If you have a tool to predict and prevent problems, that could save lives.”
The NSA also sees value in the software. The agency is funding additional research to detect unauthorized access on a cloud infrastructure, Gu reports. “In the beginning, we mostly targeted performance anomalies like bugs in the software or hardware problems, which can cause your system to run slowly or to crash. Recently we have started to look at a rootkit attack on Android devices,” she says. “We do show that we can successfully detect those rootkit attacks,” Gu discloses, adding that the team will publish a research paper on the topic this month.
Google seems poised to use the software soon. “We are talking to Google about licensing our technology to them. They have hundreds of thousands of metrics they need to monitor, and they are very interested in using our model to identify what is causing a problem,” Gu reveals. “Our model is actually generic to whatever metrics you want to monitor. As long as it’s numbers or you can represent it as numbers, then our model can handle that.”
Google experienced a short but widespread outage on January 24 that had social media users all atwitter. The company explained the outage in a blog post. “Earlier today, most Google users who use logged-in services like Gmail, Google+, Calendar and Documents found they were unable to access those services for approximately 25 minutes,” wrote Ben Treynor, Google’s vice president, engineering. “For about 10 percent of users, the problem persisted for as much as 30 minutes longer.”
Treynor explained that a software bug was the culprit. “An internal system that generates configurations—essentially, information that tells other systems how to behave—encountered a software bug and generated an incorrect configuration. The incorrect configuration was sent to live services over the next 15 minutes, caused users’ requests for their data to be ignored, and those services, in turn, generated errors,” Treynor stated.
Google is not the only cloud provider that has experienced outages, and Gu points out that even short outages can have major consequences. “We have a lot of examples, not just from Amazon, but from Google, Microsoft, they all have their cloud infrastructures and they all have similar problems. For Google, they rely on their advertisement money, so if their service goes down, they lose their advertisement money. UBL definitely can save money for the companies,” she says.
The idea of using machine learning methods to detect and predict anomalies, faults and failures has been of great interest to the research community in recent years, Gu and her team wrote in a 2012 research paper, “UBL: Unsupervised Behavior Learning for Predicting Performance Anomalies in Virtualized Cloud Systems.” The different approaches can be broadly categorized as either supervised or unsupervised, the paper explains.
Supervised approaches rely on training data to teach the system to recognize normal and abnormal behavior, and that data must be properly labeled. Providing and labeling such data can be challenging. “Both cases are very hard to achieve in real-world systems because failures are rare, and most of the time when failures happen, you don’t have a chance to capture them. And not only do you have to capture the failure data, but you have to correctly label them, which is also very hard,” Gu explains. “If your labeling is wrong, your model definitely will be wrong.”
UBL, by contrast, uses an unsupervised approach, which requires less information. “With UBL, we only need training data for normal behavior, which is easy to get. We don’t require labeling the training data. We also don’t require failure data,” she declares.
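The unsupervised setup Gu describes can be sketched as a novelty detector trained only on unlabeled normal samples: anything sufficiently unlike every stored normal sample is suspect. This nearest-neighbor stand-in is far simpler than UBL's actual model and uses invented (cpu, net) feature pairs.

```python
def train(normal_samples):
    """Unsupervised training: store normal-only samples; no labels, no failure data."""
    return list(normal_samples)

def novelty_score(model, x):
    """Euclidean distance to the nearest normal sample; large means unlike anything seen."""
    return min(sum((a - b) ** 2 for a, b in zip(x, m)) ** 0.5 for m in model)

model = train([(0.5, 0.1), (0.6, 0.1), (0.5, 0.2)])  # (cpu, net) under normal load
print(novelty_score(model, (0.55, 0.15)))  # small: close to known-normal behavior
print(novelty_score(model, (0.99, 0.9)))   # large: previously unseen behavior
```

Because nothing is labeled and no failures are ever shown to the model, this style of detector can raise an alarm on failure modes it has never encountered, which is the property the article highlights.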
Additionally, UBL can detect anomalies the system has never seen before. “Since we don’t require failure data, and we don’t require labeled failure data, we can predict anything that deviates from the normal behavior. Then we will raise alarms saying there’s something wrong here,” Gu remarks. “The next step is to compare the anomaly samples with some normal samples to find out the difference between them. That’s how we find out what went wrong.”
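The diagnosis step Gu outlines—comparing an anomalous sample against normal ones to see which metrics diverge—can be sketched by ranking metrics by their relative deviation from the normal mean. The function and metric names here are hypothetical, not from UBL.

```python
def top_deviating_metrics(normal_samples, anomaly, k=2):
    """Rank metrics by how far the anomalous sample sits from the normal mean."""
    means = {name: sum(s[name] for s in normal_samples) / len(normal_samples)
             for name in anomaly}
    deviations = {name: abs(anomaly[name] - means[name]) / (means[name] or 1)
                  for name in anomaly}
    return sorted(deviations, key=deviations.get, reverse=True)[:k]

normal = [{"mem_mb": 500, "net_kbps": 100, "cpu_pct": 20},
          {"mem_mb": 520, "net_kbps": 110, "cpu_pct": 22}]
anomaly = {"mem_mb": 3000, "net_kbps": 105, "cpu_pct": 21}
print(top_deviating_metrics(normal, anomaly))  # memory leads the ranking
```

Pointing an operator at the one or two metrics that moved most is a cheap first cut at the "find out what went wrong" step, even before any deeper diagnosis runs.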
UBL may help overcome the reluctance some people still have about adopting cloud services. “Right now the cloud infrastructure is not really reliable, so that prevents a lot of users from adopting it,” she indicates, adding that small- and medium-sized businesses might be especially wary.
The program is lightweight, meaning it does not consume much of the cloud’s computing power. It can collect the initial data and define normal behavior much faster than existing approaches, Gu says. Once it is up and running, it adds less than 1 percent of central processing unit load and uses 16 megabytes of memory.
Gu describes UBL as a black box system, meaning it does not require the source code of the software being monitored, and it runs in production rather than in development. White box diagnostic tools, by contrast, are used by developers who have access to a program’s source code. “In the case of UBL, when you run the code in the cloud infrastructure, the cloud infrastructure does not have source code, or doesn’t have access to the source code of the application running,” she states. For example, if IBM runs software in the Amazon cloud, Amazon will not have the IBM source code. “Our system runs in a production environment, because what we want to predict are production failures, production problems. We don’t need the details of an application, but we can still predict the problems and do certain diagnoses,” she states.
NCSU researchers started the project about six years ago under an umbrella effort known as SysMD before attracting research grants from the Army Research Office and others. While she chooses not to reveal details, Gu indicates that further improvements are coming. “We need to do a lot of improvements. We have a lot of ideas,” she says.