The semantic Web will connect data in new ways and lay the foundation for further advances.
It is the objective of the U.S. Navy’s Information Dominance Corps to manage a global network that delivers instant integration of military data across a number of separate specializations such as geographic, intelligence, logistics and manpower, as well as provide information about red or blue forces. The semantic Web will be the engine needed to power the effort.
These objectives create an unprecedented demand for the retrieval of unrelated data from sources that are diverse and not interoperable. Such data now is stored in files that have inconsistent coding. The existing files are organized in contract-mandated projects that answer only inquiries that are limited to their respective enclaves. For answers that combine weapons, geography or logistics, Information Dominance Corps (IDC) analysts must surf through several databases, which are neither synchronized nor compatible.
Currently, the IDC has to depend on human analysts to use judgment in the interpretation of scattered facts. That is not easy because the analysts have to deal with different vocabularies, undocumented data definitions and dissimilar formats. Therefore an enormous effort is expended in cross-referencing disparate data repositories and in reconciling data sources that describe the identical event but are coded differently. With the inclusion of tens of thousands of sensors and with the presence of thousands of computing devices in the global Navy/Marine Corps network, the number of analysts required to sift through all this data would exceed what is manageable and certainly what is affordable.
To overcome manpower limitations in the future, the IDC will have to resort to semantic Web technologies to assemble and correlate data that would support operating needs. Semantic methods are techniques that rely on extracting the meaning of data from its related context. Such context is obtained by appending to each original data source a long list of related statements. These statements are called data ontologies.
Ontologies are formal statements that describe a particular data element. The texts of ontology statements are annexed to their respective data in a standard format. In this way they become readable as computer-addressable data entries. As a result, all data files end up as strings of ontologies that are attached to their respective data sources, which reveals the logical relationships. This arrangement makes it possible for computers to search and retrieve relationships and connections to data sources. It connects the scattered dots of seemingly random military data. It reveals the hidden meaning of transactions.
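The idea of statements attached to data can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the subject-predicate-object form, every identifier (such as track:0417) and the helper names build_graph and find_path are invented here and do not come from any IDC system. The sketch shows how a computer can follow attached statements from one data element to another without a human reading either file.

```python
from collections import defaultdict, deque

# Hypothetical ontology statements, written as (subject, predicate, object).
# Every identifier below is invented for illustration.
STATEMENTS = [
    ("track:0417", "isA", "SurfaceContact"),
    ("track:0417", "observedBy", "sensor:radar-12"),
    ("sensor:radar-12", "mountedOn", "ship:alpha"),
    ("ship:alpha", "locatedIn", "area:north-sector"),
]


def build_graph(statements):
    """Group statements by subject so that relationships can be followed."""
    graph = defaultdict(list)
    for subject, predicate, obj in statements:
        graph[subject].append((predicate, obj))
    return graph


def find_path(graph, start, goal):
    """Breadth-first search along statements; returns the hops or None."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for predicate, obj in graph.get(node, []):
            if obj not in seen:
                seen.add(obj)
                queue.append((obj, path + [(node, predicate, obj)]))
    return None


if __name__ == "__main__":
    graph = build_graph(STATEMENTS)
    for hop in find_path(graph, "track:0417", "area:north-sector"):
        print(hop)
```

The search connects a radar track to a geographic area through three statements that could have originated in three different files, which is the sense in which the dots get connected.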
In a mature semantic Web, gigabytes of associated ontology statements may be devoted to describing only a few bytes of original data. The adoption of ontology-based semantics requires the construction of facilities that house huge amounts of computing and storage capacity. The handling of such enormous amounts of data requires data centers that possess economies of scale in capital cost while conserving energy that otherwise would swamp most of the available generating capacity. Such data centers can cost as much as $1 billion.
Ontologies can be generated automatically by browsing through logically related information in multiple databases searching for numerical information, but primarily for unformatted text that has been placed on disks in a narrative format. Indexing text by some sort of a numerical coding schema is not of much use. Indexing relies on pinpoint identification of each data element either from its numerical value or from words used as keywords. Index methods are precise, but cannot discover relationships that have not been tagged previously. They are useless in the case of foreign languages or with new vocabularies.
The difference between the index and the semantic methods is that data retrieved by index methods must depend on human intervention to extract knowledge from a huge number of possible choices. For semantic extractions, the available data is examined by computers and only then presented as a small number of results for further examination by human operators.
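A toy comparison may make the contrast concrete. The sketch below is an assumption-laden illustration, not a description of any fielded search system: the three documents, the BROADER table that stands in for an ontology and the function names are all hypothetical. The keyword search finds only records that contain the exact word, while the semantic search first expands the query through the stated relationships.

```python
# Hypothetical narrative records; contents are invented for illustration.
DOCUMENTS = {
    "doc1": "destroyer sighted near the strait",
    "doc2": "frigate refueling at port",
    "doc3": "weather report for the strait",
}

# Hypothetical one-level ontology: each term points to a broader concept.
BROADER = {"destroyer": "warship", "frigate": "warship"}


def keyword_search(term):
    """Index-style lookup: match only records that contain the exact word."""
    return sorted(doc for doc, text in DOCUMENTS.items() if term in text.split())


def semantic_search(concept):
    """Expand the concept into its narrower terms, then match any of them."""
    terms = {concept} | {t for t, broader in BROADER.items() if broader == concept}
    return sorted(
        doc for doc, text in DOCUMENTS.items() if terms & set(text.split())
    )


if __name__ == "__main__":
    print(keyword_search("warship"))   # no record uses the word itself
    print(semantic_search("warship"))  # records about destroyers and frigates
```

The word "warship" appears in none of the records, so the index method returns nothing, while the semantic method returns the two records about specific ship types. A real ontology would carry many levels and many kinds of relationships; this sketch uses a single level to keep the point visible.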
The purpose of the semantic Web is to make it possible for the IDC network to connect useful information from tens of thousands of databases automatically. The warfighters then can be shown what possible actions they could take. With the adoption of semantic methods, the IDC will not be looking through thousands of uncorrelated search results, as is the case right now. Instead, it will receive answers in the form of a few priority-ranked findings.
The IDC computing environment should consist of a distributed but highly redundant global network. Various nodes of this network should collect information from every platform that acts as a data collector, such as desktops, laptops, smart phones, battlefield texting communications, unmanned aircraft video images, satellite pictures and radar tracking. A selection from this data would become available to appropriate persons because the network would possess situational awareness about each warfighter.
The ultimate objective is to endow everyone with the capacity to compile, assess and exploit information in support of decisions. Only a semantic approach in which the computer network relates data to its local situation can deliver that outcome.
The semantic approach makes it possible for computers to “understand” what is dispersed among widely distributed files. Only machine-readable data can be used to sift through every file that could possibly reveal what otherwise is hidden. Only by means of automated software agents will IDC analysts be able to support information dominance.
Ultimately, the data collected by the IDC will require the recovery and storage of information from tens of thousands of connected devices. This data would be placed in petabytes worth of files, growing into exabytes in less than a decade. It would require very high reliability (100 percent, with automatic failover) when supporting combat. All of the data, in different data centers, would have to be accessible in less than 250 milliseconds for retrieval from multiple files. This would make IDC information universally discoverable and accessible while maintaining assured levels of security.
The IDC network requirements are demanding. They exceed, by a wide margin, the existing capacities. The initial operational capability would call for processing more than 100,000 transactions per second. The capacity for handling these transactions would have to grow exponentially with time because it would be carrying high-bandwidth graphics, images and color video. Such transmissions consume multiple megabytes of carrier capacity per transaction. Consequently the bandwidth to and from the IDC channels would have to be measured ultimately in terms of thousands of gigabytes per second.
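The arithmetic behind these figures can be checked with a short calculation. The transaction rate of 100,000 per second comes from the text; the payload sizes of 2 and 10 megabytes are assumptions chosen only to illustrate "multiple megabytes," and decimal units (1 gigabyte = 1,000 megabytes) are assumed as well.

```python
def required_bandwidth_gb_per_s(transactions_per_s, megabytes_per_transaction):
    """Aggregate bandwidth in gigabytes per second (1 GB = 1,000 MB assumed)."""
    return transactions_per_s * megabytes_per_transaction / 1000


if __name__ == "__main__":
    # 100,000 transactions per second is the figure given in the text;
    # the per-transaction payload sizes are illustrative assumptions.
    for payload_mb in (2, 10):
        rate = required_bandwidth_gb_per_s(100_000, payload_mb)
        print(f"{payload_mb} MB per transaction: {rate:,.0f} GB/s")
```

Under these assumptions the initial capability already implies hundreds of gigabytes per second, and a payload of 10 megabytes per transaction reaches 1,000 gigabytes per second before any growth in the transaction rate.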
After the receipt of the raw data into the IDC files, linked supercomputers would have to screen the inputs for further analysis. Software then would be deployed to preprocess inquiry patterns in order to identify standard queries so that typical questions can be answered without delay. One of the liabilities of semantic methods is the enormous amount of computation that is required to deliver useful results. The preprocessing workloads on the IDC supercomputers would vastly exceed what is needed for the handling of simple messages.
The projected size of IDC data files that support semantic processes is likely to exceed currently available space by a large multiplier. At maturity, it would require storing a stream of data totaling at least a thousand terabytes per hour or more than 20 petabytes per day, which is comparable to the processing load of the search engine Google. Google and IDC differ only in that the Navy requires higher system uptime to support warfare conditions.
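The storage figures can be cross-checked the same way. The rate of 1,000 terabytes per hour is taken from the text; the decimal unit conversions and the assumption that every byte is retained for a full year are simplifications made for this sketch.

```python
def petabytes_per_day(terabytes_per_hour):
    """Daily volume in petabytes (1 PB = 1,000 TB assumed)."""
    return terabytes_per_hour * 24 / 1000


def exabytes_per_year(terabytes_per_hour):
    """Yearly volume in exabytes if nothing is discarded (1 EB = 1,000 PB)."""
    return petabytes_per_day(terabytes_per_hour) * 365 / 1000


if __name__ == "__main__":
    print(f"{petabytes_per_day(1000):.0f} PB per day")
    print(f"{exabytes_per_year(1000):.2f} EB per year")
```

A thousand terabytes per hour works out to 24 petabytes per day, consistent with "more than 20 petabytes per day." If all of it were kept, the files would pass into the exabyte range within the first year at that rate.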
The key tools for constructing and using the semantic Web are the Extensible Markup Language (XML), the Resource Description Framework (RDF) and the Web Ontology Language (OWL). The management of these standards is under the guidance of the World Wide Web Consortium (W3C). The term semantic Web refers to the W3C’s vision of how data should be linked on the Web. Semantic Web technologies are methods that enable people to create data collections, build vocabularies and write rules for the handling of related data. These three techniques are now labeled as Web 3.0 solutions.
XML is the markup language for recording data for Web accessibility. It is the format in which the data is recorded.
RDF is the model for assuring data interchange on the Web. RDF facilitates data merging and correlation even if the underlying recording schemas differ. RDF supports learning about data recording patterns over time without requiring the data identification to be changed. RDF forms graph views of recorded information, which is useful in presenting easy-to-understand relationships among data sources.
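The merging property can be illustrated without any RDF software. The sketch below uses plain Python tuples in place of RDF triples, so it is not the API of any RDF library; the two sources, their property names and the identifiers are hypothetical. Because each statement stands on its own, two sources that use different vocabularies can be combined by a simple union, as long as they agree on the identifier of the thing described.

```python
# Two hypothetical sources that describe the same ship with different
# vocabularies. Plain tuples stand in for RDF triples in this sketch.
LOGISTICS = {
    ("ship:alpha", "fuelLevelPercent", "62"),
    ("ship:alpha", "homePort", "port:bravo"),
}
INTELLIGENCE = {
    ("ship:alpha", "lastReportedArea", "area:north-sector"),
    ("contact:77", "sightedIn", "area:north-sector"),
}


def merge(*sources):
    """Union the statement sets; no schema alignment step is required."""
    merged = set()
    for source in sources:
        merged |= source
    return merged


def describe(graph, subject):
    """Collect every property recorded for one subject across all sources."""
    return {predicate: obj for s, predicate, obj in graph if s == subject}


if __name__ == "__main__":
    graph = merge(LOGISTICS, INTELLIGENCE)
    print(describe(graph, "ship:alpha"))
```

Neither source had to change its recording schema for the combined view of ship:alpha to appear. In practice the shared identifiers would be agreed-upon Web addresses rather than the short labels used here.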
OWL is a family of languages for authoring ontologies. OWL would represent the knowledge about events and their respective relationships as they apply to IDC operations. OWL ontologies form an added layer of meaning on top of the existing Web service protocols. Although many of the OWL descriptions can be obtained by automatic means that use mathematical algorithms, ultimately it will take a human analyst to find the applicable IDC relationships. This can be done only if everyone shares a common vocabulary for describing shared knowledge for the IDC enterprise. However, commercial software packages already are available that support the formation of OWL-compliant semantic relationships, which should speed up the adoption of these methods.
After sufficient experience is accumulated by means of automatic data mining of transactions, many of the ontology templates can be reused so that the labor cost of maintaining the semantic Web can decrease. The IDC databases then can be organized as specialized Web services such as those that produce information for target selection. Such services then can fuse data from dozens of sensors and the latest geographic images as well as data about available weapons, and they could be deployed aboard ships.
The semantic Web should be viewed as the latest extension to the current Web. The semantic Web advances searching methods from inquiries that are based on structured data to producing results that answer uncorrelated questions even if they are in the form of colloquial sentences. The semantic Web therefore should be seen as an enhancement to the already existing methods that are available for accessing information over the Internet.
Semantic methods overcome the current limitations of separate and disjointed Web pages that cannot be collected readily for the assembly of enterprisewide information except through human intervention. The semantic Web advances the IDC from connecting Web pages by means of the analysts’ eyes to connecting the underlying data by means of computers. It advances IDC analysts from sifting through piles of computer listings to using computers to identify a few possible answers.
Data ontologies will become the method for applying semantic-based applications to IDC operations within the next decade. The enormous expansion of IDC data, especially with the sharing of sensor, logistic and personnel information, will make the semantic-based retrievals of information an absolute economic necessity.
Ultimately, ontologies will form the foundation on which other advanced methods, such as fuzzy logic, artificial intelligence, neural networks and heuristic searching, can be adopted. Those are reasons why the semantic Web should be seen as just one more, but very important, steppingstone in the evolution of computer-based reasoning, and one that cannot be delayed.
Paul A. Strassmann is a distinguished professor at