Data Mining Software Accelerates, Streamlines Research and Analysis

July 2000
By Henry S. Kenyon

Program targets word relationships in documents, provides concise summaries of requested material.

An information retrieval tool that is more powerful and accurate than standard search engines allows users to retrieve information in a few mouse clicks. The application combs through the “hits” produced by investigation programs to find documents based on specific word associations and creates a database from these results. By highlighting only specific paragraph-sized sections of text, the technology streamlines the research process, eliminating the task of scanning entire documents to find bits of relevant data.

The vast amounts of unstructured material available on the Internet and World Wide Web present a challenge for analysts and researchers. Although search engines can be fine-tuned to locate specific information, the result is often hundreds or thousands of pages of material. While the process is already both labor intensive and time consuming, the difficulty of these searches is compounded by the continuing proliferation of material appearing in unstructured hypertext and text files instead of easily identifiable relational databases.

The advanced thematic navigation system (ATHENS) is designed to bridge the performance expectation gap of standard search tools by reducing search time, increasing the precision of content retrieval, and allowing for the discovery of word relationships. Offered by Veridian Pacific-Sierra Research in Rosslyn, Virginia, the program mines information from unstructured open-source documents, hypertext markup language (HTML) and text, while seeking thematic associations between words and terms.

Most search engines take large piles of documents and apply algorithms to produce smaller piles, ATHENS Project Manager Kevin Grottle claims. However, even the relatively small document stacks still can be too large for practical use. By comparison, this tool allows the user to navigate down to a subdocument level. Using a library as a metaphor, for example, a search engine can locate a shelf of books on a specific topic, but ATHENS can find specific passages in those books, he says.

ATHENS originated as part of a Defense Advanced Research Projects Agency (DARPA) program called Genoa. The initial software technology was developed by the Alpha Research Corporation, and Veridian Pacific-Sierra purchased the basic beta package in April 1998. The program was attractive because it was a good way to pluck information quickly by allowing an analyst to find something in three or four clicks of a mouse, Grottle explains.

The program sits between the user and the search engine, eliminating the need to laboriously read through entire documents to find relevant passages, he says. Extensive DARPA testing found that the technology was three to seven magnitudes better at nugget retrieval than other software-based methods.

Originally a Linux-based system, ATHENS was ported to a Solaris-based system after its purchase, and key components were replaced with Java. Pieces of software were then developed to allow the program to be browser-launchable and to fit into Genoa, Grottle says. Other additions included an enhanced database, algorithms and query engines as well as a redefined look and feel to the graphical interface.

Unlike a traditional search engine, ATHENS requires some online setup time prior to conducting a search. This time is used to download uniform resource locator (URL) seeds, crawl the Web, and construct a database of results. The current practical limit for an ATHENS database is 100,000 documents. No users have yet approached that number, Grottle observes.

To speed and enhance research, the program can be loaded with a variety of search engines to seed the inquiry. Depending on the type of investigation, a software wizard can help build the database.

After each search, the program provides a list of related documents found. The system is capable of simultaneous searches on different engines. Once all the URLs are loaded into the tool, a crawler program is sent out to investigate each page and link.
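The seed-and-crawl step described above can be sketched in a few lines. This is only an illustration, not the ATHENS crawler: the in-memory TOY_WEB link table, the crawl function and the fetch_links callback are all hypothetical names, and the only figure taken from the article is the 100,000-document practical limit used as the default cap.

```python
from collections import deque

# Hypothetical stand-in for the Web: each URL maps to the links on that page.
TOY_WEB = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://c.example/", "http://d.example/"],
    "http://c.example/": ["http://a.example/"],
    "http://d.example/": [],
    "http://e.example/": ["http://a.example/"],
}


def toy_fetch_links(url):
    """Return the outgoing links of a page in the toy link table."""
    return TOY_WEB.get(url, [])


def crawl(seeds, fetch_links, max_documents=100_000):
    """Breadth-first crawl from the seed URLs, visiting each page once.

    max_documents defaults to the practical database limit quoted in the
    article; everything else here is an assumption made for illustration.
    """
    unique_seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    seen = set(unique_seeds)
    queue = deque(unique_seeds)
    visited = []
    while queue and len(visited) < max_documents:
        url = queue.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


pages = crawl(["http://a.example/"], toy_fetch_links)
```

In this toy table the page at e.example is never collected, because no seed links to it. That mirrors why the choice of seed URLs matters for what ends up in the database.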

This capability makes ATHENS an efficient data collection mechanism, Andrew Kerr, Veridian Pacific-Sierra’s product marketing manager, explains. The program can create a database of thousands of documents. These stored items can be sorted in a number of ways, for example, by the frequency of a specific term or in alphabetical order. An analyst can choose multiple search terms such as amendment and Republicans, for example, and query them, he says.
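The sorting and multiple-term querying Kerr describes can be illustrated with a small sketch. It assumes a plain word-count index; the function names, the tokenizer and the three sample documents are invented for this example and say nothing about how the ATHENS database is actually organized.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_index(documents):
    """Map each document title to a Counter of the terms it contains."""
    return {title: Counter(tokenize(body)) for title, body in documents.items()}


def sort_by_term_frequency(index, term):
    """List titles by how often the term appears, most frequent first."""
    term = term.lower()
    return sorted(index, key=lambda title: (-index[title][term], title))


def query_all_terms(index, terms):
    """Return the titles of documents that contain every query term."""
    wanted = [t.lower() for t in terms]
    return sorted(
        title for title, counts in index.items()
        if all(counts[t] > 0 for t in wanted)
    )


# Invented sample documents, for illustration only.
DOCS = {
    "senate-vote": "Republicans back the amendment. The amendment passed.",
    "budget-note": "The amendment stalled in committee.",
    "party-news": "Republicans met on Tuesday.",
}

index = build_index(DOCS)
by_frequency = sort_by_term_frequency(index, "amendment")
alphabetical = sorted(index)
both_terms = query_all_terms(index, ["amendment", "Republicans"])
```

Here by_frequency puts the document that mentions the term twice ahead of the one that mentions it once, and both_terms keeps only the document in which both query words appear.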

While search engines are useful, they cannot cover all of the data, Kerr notes. ATHENS addresses these shortcomings through its discovery capability. Because users do not always know what they are looking for, this navigation system, unlike other data retrieval software, allows a search to change direction. Central to this is the graphical user interface, which creates a navigation tree of terms.

The tree, which appears on screen as a series of linked tables, is a manifestation of the principal ideas behind the software—the concept of knowledge discovery and categorization by themes or stories. This navigation technique is combined with the graphics to allow on-the-fly search changes as more relevant data becomes available.

A key feature of the program is that it can locate and identify unique names or terms and prepare a query list of those terms for the user to look up. A query builder window shows how many times a particular word appears in the database. Stored terms can be listed in several ways. If the command is given for an alphabetical search, for example, ATHENS will display all variants of the word, including misspellings.

Clicking on specific words on the navigation tree retrieves specific passages for study. The software displays the data in sets, with the query terms highlighted against the surrounding text. Users can also retrieve individual source Web pages. Once a subject is selected and clicked, users can choose from various semantic regions to read. Roughly a paragraph in length, these specifically highlighted portions of a document feature the key term in the search, Kerr says. After these sections are selected, clicking on the HTML button displays a list of all of the passages from cached copies. The page listings also feature links back to the document’s site of origin, which produce a document page with the text section highlighted.

ATHENS defines a theme as a semantic association between two or more elements in a document—such as words or names. This association indicates that the words are functionally related to each other in the ideas they are conveying and occur in the same sentence or paragraph. Regions within a document where the words are semantically related are called stories.
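A rough way to picture themes and stories is to treat co-occurrence within a paragraph as a stand-in for semantic association. The sketch below does exactly that. It is a simplification adopted for illustration, not the ATHENS algorithm, and the function names and sample text are hypothetical.

```python
import re


def split_paragraphs(document):
    """Split a text into paragraphs on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]


def find_stories(document, terms):
    """Return the paragraphs in which every term occurs at least once.

    Paragraph-level co-occurrence is used here as a crude proxy for the
    semantic association the article describes.
    """
    wanted = {t.lower() for t in terms}
    stories = []
    for paragraph in split_paragraphs(document):
        words = set(re.findall(r"[a-z0-9']+", paragraph.lower()))
        if wanted <= words:
            stories.append(paragraph)
    return stories


def highlight(story, terms):
    """Wrap each occurrence of a query term in square brackets."""
    if not terms:
        return story
    alternatives = "|".join(re.escape(t) for t in terms)
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    return pattern.sub(lambda match: "[" + match.group(0) + "]", story)


# Invented sample text, for illustration only.
SAMPLE = (
    "The Senate debated the amendment on Monday.\n\n"
    "Republicans said the amendment would pass.\n\n"
    "Republicans also met with donors."
)

theme = ["amendment", "Republicans"]
stories = find_stories(SAMPLE, theme)
marked = [highlight(story, theme) for story in stories]
```

Only the middle paragraph qualifies as a story for this theme, since it is the only region where both words appear together.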

The software rapidly explores a number of themes, which lead to key stories. The thematic navigator window presents analysts with the tree structure of these stories arranged by key words. According to company officials, an individual theme, defined by connecting various words in the tree, represents all subdocument stories where the selected words are semantically associated with each other.
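One way to think about the navigation tree is as repeated narrowing: each word added to a theme shrinks the set of matching stories, and the remaining stories suggest which words to offer next. The sketch below assumes stories have already been reduced to sets of key words. The data, the function names and the ranking by story count are all assumptions for illustration, not a description of the thematic navigator's internals.

```python
from collections import Counter


def stories_for_theme(stories, theme):
    """Return the stories that contain every word in the theme."""
    wanted = set(theme)
    return [story for story in stories if wanted <= story]


def next_terms(stories, theme, limit=5):
    """Suggest words for the next tree level, ranked by story count.

    Ties are broken alphabetically so the ordering is deterministic.
    """
    wanted = set(theme)
    counts = Counter()
    for story in stories_for_theme(stories, theme):
        counts.update(story - wanted)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


# Invented key-word sets, one per story, for illustration only.
STORIES = [
    {"amendment", "republicans", "senate"},
    {"amendment", "senate", "vote"},
    {"republicans", "donors"},
    {"amendment", "republicans", "vote"},
]

first_level = next_terms(STORIES, ["amendment"])
second_level = next_terms(STORIES, ["amendment", "republicans"])
matching = stories_for_theme(STORIES, ["amendment", "republicans"])
```

Following the path from "amendment" to "republicans" leaves two stories in this toy data, and the suggested next words are recomputed from just those two, which is the kind of on-the-fly updating the article attributes to the tree.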

By following the word paths, an analyst can explore themes that are relevant to the desired information, Kerr says. Related themes are constantly updated as the tree is navigated, and a story display can be brought on screen at any time to explore interesting themes. Because the stories are subsets of full documents, ATHENS can display a larger region around the story for context or show the entire source HTML document with the story highlighted. The original source HTML location can also be requested.

The software is available on the U.S. Defense Department’s Intelink network. Kerr notes that Veridian Pacific-Sierra plans to begin with the government community before moving the product into commercial markets. ATHENS currently has more than one million users on Intelink, he says. The program has operated on Intelink since April 1999 and is accessible through the Pacific-Sierra Research home page. Kerr hopes to expand its use to other agencies.

The software itself is not very bandwidth intensive, but it does require memory, Grottle says. ATHENS also uses considerable amounts of swap space—portions of the hard drive used as temporary memory. The database component of the software is written in C++, though it could integrate into a relational database management system package such as Oracle. Data indexing currently is done through home-grown software; however, the advantage of going to a provider like Oracle would be the additional support muscle, he says.

A new version of ATHENS became available in June. According to Grottle, the updated software is divided into smaller pieces, enabling it to operate within a browser and not simply be launched by one. The current version of the program also uses a Java applet, which is not compatible with some government security software. An all-HTML version also is being developed to offer wider accessibility, Kerr says.