Self-Inflicted System Malfunctions Threaten Information Assurance

July 1999
By Lt. Col. Glenn D. Watt, USAF

Expanded network connectivity increases risk of sharing mishaps.

While the security industry concentrates on protecting systems from external threats, a danger to information access is brewing from within organizations. The expansion of and growing reliance on networks are jeopardizing military information technology by exposing numerous sectors and even entire commands to errors that are introduced internally by a single entity.

Unlike command structures of the past, today’s U.S. Air Force base networks are as much a part of a combat system as the wiring in aircraft. The theater commander in chief uses the communications network to convey orders and other command and control information to field commanders in the same way a pilot employs technology to pass information to the flight control surfaces. Consequently, information assurance is as vital to command leaders as it is to aviators in the cockpit.

The Air Force is successfully tackling the challenge of identifying and deflecting external attacks by correlating data from thousands of sensors. However, devising a plan to manage a self-inflicted threat is equally critical for today’s enterprise networks. Commercial off-the-shelf hardware and software currently being used widely in some government sectors can produce unexpected network disasters that require recovery processes that are as complex as those needed after an information warfare attack.

Statistics from the Air Combat Command (ACC) at Langley Air Force Base, Virginia, reveal that the largest portion of network service nonavailability results from errors introduced by software changes rather than from hackers. In 1998, the majority of disruptions in service occurred as a result of unintentional errors. In almost every case, professional network operations and a disaster recovery plan, two items that could have prevented the problems, were missing.

ACC network operation and security center (NOSC) metrics provide additional insight into this phenomenon. In a detailed analysis of 45 days of outages, these events were tracked and correlated with malicious cyberattack activities. No correlation between the two events was found. A similar analysis performed with known software or hardware changes produced a 75 percent correlation. This information indicates that there is a greater need to focus attention on professional network design, installation, operation and recovery.
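The article does not say how the NOSC computed its correlation figures. One simple way to approximate such a measure is a time-window match: count the share of outages that began shortly after a known change. The sketch below assumes that approach. The function name, the 24-hour window and the timestamps are all illustrative assumptions, not ACC data or methods.

```python
from datetime import datetime, timedelta


def fraction_of_outages_near_changes(outages, changes, window=timedelta(hours=24)):
    """Return the share of outages that began within `window` after any change."""
    if not outages:
        return 0.0
    matched = sum(
        1
        for outage in outages
        if any(timedelta(0) <= outage - change <= window for change in changes)
    )
    return matched / len(outages)


# Made-up timestamps for illustration only.
changes = [
    datetime(1998, 3, 2, 9, 0),
    datetime(1998, 3, 10, 14, 0),
    datetime(1998, 3, 20, 8, 0),
]
outages = [
    datetime(1998, 3, 2, 11, 30),  # 2.5 hours after the first change
    datetime(1998, 3, 10, 20, 0),  # 6 hours after the second change
    datetime(1998, 3, 15, 3, 0),   # no change in the preceding 24 hours
    datetime(1998, 3, 20, 8, 45),  # 45 minutes after the third change
]

share = fraction_of_outages_near_changes(outages, changes)
print(f"{share:.0%} of outages followed a known change")
```

Running the same function against a log of suspected attack times instead of change times would give the comparison figure. A window-based match of this kind shows association only and says nothing about cause, so it serves best as a pointer to where investigation should start.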

Some recent examples illuminate the magnitude of the problem and the rationale for adjusting priorities.

In March 1998, ACC Internet access to sites was denied. The outage eventually affected the Air Force community at large. ACC’s computer support squadron notified the Defense Information Systems Agency’s nonsecure Internet protocol routing network operations center about the outage and started working on the problem. In addition to restricted Internet access, incoming and outgoing electronic (e-) mail from sites began queuing up at relays throughout the Air Force and at Internet e-mail hubs. All public folders under Microsoft Outlook were inaccessible because of an ACC-wide Microsoft Exchange problem with the public folder structure. Langley’s NOSC contacted the Microsoft Corporation and received a fix; however, it took three days to implement.

It was not an electronic adversary that had attacked Defense Department systems. NOSC traced the network failure to a person at Minot Air Force Base, North Dakota, who had system administration privileges. As this administrator upgraded the server’s public folders, he accidentally responded incorrectly to a question prompt, and the ownership of all folders throughout the ACC enterprise methodically changed.

The mishap stemmed from the way Microsoft Exchange servers replicate folder ownership and access in an enterprisewide system. Directory replication connectors between Exchange sites are permanent relationships. Frequent changes can lead to unintentional results. The enterprise software assumes no future reconnection of these Exchange sites; consequently, all public folders are homed to each individual site. This feature provides transparency of public folders to the Exchange users after site replication is complete and ensures continued access to the public folders.

The Minot situation arose when the base's sites reconnected and established directory replication with Langley. A folder with the same folder identification then existed at each base, and each copy claimed a different home site. The Exchange software tried to resolve the conflict. By design, Exchange assigned ownership of the public folders to the site that made the most recent modification to a public folder property.
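The resolution rule described here, in which the most recent modification wins, can be illustrated with a short sketch. This is not Exchange's code or data model. The `FolderReplica` class, its fields, the folder names and the integer timestamps are hypothetical and serve only to show how such a rule can hand every contested folder to a single site.

```python
from dataclasses import dataclass


@dataclass
class FolderReplica:
    folder_id: str      # identification shared by the conflicting copies
    home_site: str      # site that this copy claims as its home
    last_modified: int  # hypothetical timestamp of the latest property change


def resolve_home_sites(replicas):
    """For each folder_id, keep the home site of the most recently modified copy."""
    winners = {}
    for replica in replicas:
        current = winners.get(replica.folder_id)
        if current is None or replica.last_modified > current.last_modified:
            winners[replica.folder_id] = replica
    return {folder_id: r.home_site for folder_id, r in winners.items()}


# Illustrative data: one site has just touched every folder during an upgrade.
replicas = [
    FolderReplica("orders", "Langley", 100),
    FolderReplica("orders", "Minot", 250),
    FolderReplica("schedules", "Langley", 90),
    FolderReplica("schedules", "Minot", 300),
]

homes = resolve_home_sites(replicas)
print(homes)
```

In this made-up data, every folder ends up homed at the site that made the latest change, even though the other site held the copies first. A rule of this kind is deterministic and simple, but it takes no account of which site ought to own a folder.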

In April 1998, a Defense Message System (DMS) installation at Langley encountered a similar problem. Although the Minot staff shared its experience with the Langley installation team, the team overlooked the fact that a number of enterprise servers in ACC were still using X.400 services. As the team began to update the Exchange software, circular references started propagating throughout the enterprise. Within hours, the mail cache exceeded 4,000 messages, and the servers went down. For more than two days Langley personnel could not send or receive e-mail, and during the third and fourth days, e-mail was sporadic.

Although most of the problems were resolved by the end of the second day, an incident that occurred on the third day cast doubt on the selected course of action. Systems operators, lacking any formal disaster plans, started brainstorming new solutions to the immediate problem. One solution called for restoring the affected server from a backup taken before the Exchange upgrade. This proposed solution could have exacerbated the problem because Exchange information spreads throughout the system. A single-server restore would put the enterprise into a confused state. The restored central server would operate with the assumption that all other servers still had complementary software, when they would actually be running a newer version. At some point, the systems would attempt to synchronize—with unpredictable results. Fortunately, the plan was abandoned after a network expert voiced concern. Instead, ACC decided to continue to fix the current software after network specialists from Microsoft provided on-site assistance.

By the end of the fourth day, the problems with e-mail remained. ACC characterized the status as 10 to 20 percent degraded, but the right resources were on site to fix the problem. Improperly configured connectors caused looping errors; in addition, simply typing yes instead of no during one part of the process caused changes in the server addresses that also had an adverse impact.
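Looping errors of this kind arise when connector routes form a cycle, so that a message forwarded from one server eventually returns to it. As an illustration only, the sketch below models connectors as a directed graph and searches for a cycle before any change is deployed. The server names and the `routes` layout are hypothetical, and the article does not describe the tools ACC or Microsoft used to find the fault.

```python
def find_routing_loop(routes):
    """Return one cycle in the routing graph as a list of server names, or None."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state = {}
    path = []

    def visit(node):
        state[node] = GRAY
        path.append(node)
        for nxt in routes.get(node, []):
            nxt_state = state.get(nxt, WHITE)
            if nxt_state == GRAY:
                # nxt is already on the current path, so the path closes a loop.
                return path[path.index(nxt):] + [nxt]
            if nxt_state == WHITE:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        state[node] = BLACK
        return None

    for node in list(routes):
        if state.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# Hypothetical connector layout in which mail can circulate indefinitely.
looped = find_routing_loop(
    {"hub": ["site_a"], "site_a": ["x400_gateway"], "x400_gateway": ["hub"]}
)
# The same layout without the route back to the hub.
clean = find_routing_loop(
    {"hub": ["site_a"], "site_a": ["x400_gateway"], "x400_gateway": []}
)
print(looped, clean)
```

A check like this is inexpensive to run against a proposed configuration and fits naturally into the change-management review the article recommends.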

Some pre-existing conditions could have compounded the problems, but ACC estimates that the software change induced 90 percent of the situation at Langley. Correcting the problem took almost five days because ACC client-server systems are linked together in an enterprise environment, making diagnosis and repair more complicated.

Langley was the most complex site in the network. Over time, problems propagated because of the manner in which ACC linked the sites into a large enterprise. Deploying solutions took an equal amount of time. The installation teams followed similar procedures at other sites but did not experience the same problems.

Self-inflicted denial of service is not confined to military environments. According to experts, AT&T’s digital frame relay collapsed in April 1998, leaving thousands of companies without digital transmission capabilities for approximately 20 hours. During that time, e-mail, file transfers and financial transactions stopped. The technical explanation for the outage was a switch software bug that replicated itself via the network.

The density, complexity and interdependence of networks are growing, and occurrences of self-inflicted information assurance threats are growing in proportion to the number of network devices installed. Both of the military network failures described above stemmed from a cascading chain of circumstances triggered by a simple event: operator error.

Evolving networks into enterprises has created a dilemma: Should operation of the enterprise be outsourced to professional specialists or kept in house? Both options have dangers. Identifying and fixing the cause and effects of a failure can be complex. In some cases, because of unpredictable results, it can also be dangerous.

Although network weaknesses are not new, the magnitude, propagation speed and residual effects of today’s enterprise network failures are alarming. When integrity and availability of service are paramount, regaining control rapidly and effectively is the measure of success. The increased reliance on single-stream processes and the decrease in the popularity of system redundancy will result in a greater number of problems like those experienced at Air Force bases.

Countering the threat requires that several processes be adopted. A single organization, capable of detecting and resolving acts of hostility as well as preventing self-inflicted denial of service, must command and control enterprise networking resources. Centralized control provides a focal point for collecting, correlating and analyzing first tier network management metrics, compliance, threat warning/attack assessment and mission situational awareness. Full responsibility for change management is equally important to provide command authorities with fact-based recommendations to accept or deny network modifications that could affect operational availability.

Staying apprised of the operational health of the network via early warning signals can also reduce risks, just as warning of other types of disasters helps to limit the damage. Network operations and security centers correlate diverse indications and warnings through remote network monitoring and traditional communications methods and offer the best method to reduce the risk of self-inflicted denial of service. A team of trained network professionals with commercial network management systems can monitor each device and, in conjunction with configuration control, can significantly reduce the risk of a disaster.

When a failure is unavoidable, the extent of disruption caused by a self-inflicted denial of service will reflect the quality of the recovery plan that was in place before the accident occurred. Experts agree that an effective plan should be specific to minimize the number of independent decisions that must be made.

Trained on-site personnel should be the first to respond to the emergency. A disaster response team would assess the general extent of the damage, notify management authorities, identify losses in detail, begin the reconstitution plan and determine the critical resources that must be re-established immediately. A recovery team would assist in restoring systems in numerous ways—from rebooting computers to moving equipment to a designated hot site. This team could also coordinate efforts with other organizations affected by the same disaster. After normal operations resume, the cause of the problem should be evaluated, the recovery plan should be updated, individuals involved in the situation should be trained, and the revised recovery plan should be tested in a realistic, nondisruptive manner.

The solution to providing information assurance depends on professional management and operation of network centers along with a plan to reconstitute systems should the unthinkable happen. Professional management treats information as a corporate asset. The result is confidence and forward thinking that advances the availability and maturity of the entire enterprise.


Lt. Col. Glenn D. Watt, USAF, serves as the deputy chief of the network systems division, headquarters, Air Combat Command, Langley Air Force Base.