Document Exploitation Increases in Importance

August 2003
By Robert K. Ackerman

Analysts can archive, translate large amounts of vital facts quickly.

New technologies that increase the ability to process and enhance text documents are giving a badly needed boost to intelligence experts fighting terrorists and their weapons of mass destruction. Many of these technologies are being employed overseas on the battlegrounds of Afghanistan and Iraq as well as in the pursuit of terrorists in other countries.

Software advances have added to standard capabilities such as text and character recognition. Experts who must deal with both paper and digitized information now can combine the two into a single archive while simultaneously tagging key data. And, vital information can be automatically recognized in documents undergoing the scanning and archiving process.

These technologies also can aid document translations by recognizing key phrases that are sought by analysts. A document can be scrutinized closely by human experts or deferred for later examination in favor of more important information.

One document exploitation, or DOCEX, system currently is being used by both civil government and military personnel. Produced by Vredenburg, Reston, Virginia, this system builds on company proprietary technologies and government research, explains Doug Dearie, senior vice president and chief technology officer.

Dearie relates that DOCEX can help users process the large volume of documents that are being captured or encountered in military and intelligence operations around the world. This encompasses documents that are in multiple languages and formats, which can present significant technical challenges to archivists and analysts. The DOCEX system can manage the process and allow humans to add value to the product, he notes.

This process entails several key elements. First, documents must be captured regardless of format. Paper documents are scanned and entered into a manageable form using optical character recognition. Foreign language documents are analyzed by machine translation software for conversion into English. Computer indexing allows analysts to search a repository for specific documents or topics within those documents, depending on tagging procedures. And, a screening capability allows humans to identify important aspects of documents and to tag or annotate them.
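The stages described above can be pictured as a small pipeline. The following is only an illustrative sketch, not Vredenburg's implementation: the `Document` fields, the function names and the toy Spanish glossary are all hypothetical stand-ins for real OCR, translation and indexing components.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """One archived item; every field name here is hypothetical."""
    doc_id: str
    source_text: str            # text as captured (after OCR for paper)
    language: str
    english_text: str = ""      # filled in by the translation stage
    tags: set = field(default_factory=set)


def ocr(raw_scan: str) -> str:
    # Stand-in for an OCR engine: here the "scan" is already text.
    return raw_scan.strip()


def machine_translate(text: str, language: str) -> str:
    # Stand-in for a translation engine: a toy word-for-word glossary.
    glossary = {"es": {"mapa": "map", "del": "of-the", "puente": "bridge"}}
    words = glossary.get(language, {})
    return " ".join(words.get(w, w) for w in text.lower().split())


def ingest(doc_id: str, raw_scan: str, language: str) -> Document:
    # Capture, then OCR, then translate anything that is not English.
    text = ocr(raw_scan)
    if language == "en":
        english = text.lower()
    else:
        english = machine_translate(text, language)
    return Document(doc_id, text, language, english)


def build_index(docs):
    # Inverted index: English word -> set of document ids.
    index = {}
    for doc in docs:
        for word in doc.english_text.split():
            index.setdefault(word, set()).add(doc.doc_id)
    return index


docs = [
    ingest("d1", "  Mapa del puente ", "es"),
    ingest("d2", "Lunch menu", "en"),
]
index = build_index(docs)
docs[0].tags.add("infrastructure")   # screening: a human adds a tag
print(sorted(index["bridge"]))       # ids of documents that mention "bridge"
```

Keeping the original text alongside the English rendering means a search hit can always be traced back to what was actually captured.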

This system is not limited to paper documents, however. Many adversaries are maintaining vital records on computer hard drives or even data transfer disks. This digitized information can be entered into the same archive and given the same attributes as its paper counterparts.

“From a pocket letter from a prison detainee to documents coming off of computers in offices—the whole host of sources—[the document] has to be processed, organized, have value added to it and [then shared] with the right people,” Dearie notes.

This system had its origins in work performed by a team headed by Dr. Melissa Holland at the Army Research Laboratory (ARL), Adelphi, Maryland. Systems that Holland’s team built and fielded with the U.S. Army, such as Falcon, served as the genesis for Vredenburg’s DOCEX. Dearie relates that Falcon can scan a paper document for conversion to digital form and also translate it into English from a number of languages. The DOCEX system carries that capability to the next level by applying it to large quantities of documents and processing them into a common archive, regardless of types.

The company achieved this by combining the ARL-derived technology with its own commercial software product, HighView. Now in its fourth generation, HighView is a document, imaging and workflow management solution that can capture, process, archive and retrieve documents.

Dearie shares that the company has worked with the Army’s Intelligence and Security Command (INSCOM) to field the technology for several years. Using DOCEX is not new to Army personnel, but being able to use enhanced capabilities that now are available is new, he states. Now, the Army has an automated DOCEX discipline.

Larry Den, senior vice president of information technology at Vredenburg, relates that many of the large information repositories currently being uncovered are not in traditional languages. The lack of available linguists for these data increases the pressure for machine translation—especially because the material cannot be tagged for archiving until it is translated. “There is a real disconnect between the quantity of information we are finding and the number of people that can exploit those documents or understand what is in them,” Den states. “So, it is natural to say, ‘How can we get these documents into some language or form so that a greater quantity of people can understand them?’”

Den likens the technology that emerged from the ARL to flower petals without a hub. The company’s commercial product provided that hub to produce the DOCEX system. “We have put everything in one place that makes it very easy to use,” he says.

A key element of this system is that it can incorporate a number of different software-driven capabilities that address a customer’s particular needs. These may involve different types of machine translation or transcoding, for example. New capabilities can be plugged into the system easily, Den adds.
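One common way to make capabilities pluggable is a registry that maps each language to whatever engine currently handles it. The sketch below assumes that design purely for illustration; the article does not describe how DOCEX does it, and `TRANSLATORS`, `register_translator` and the placeholder language code "xx" are hypothetical.

```python
from typing import Callable, Dict

# Registry of translation engines keyed by language code (hypothetical design).
TRANSLATORS: Dict[str, Callable[[str], str]] = {}


def register_translator(language: str):
    """Decorator that plugs an engine into the registry."""
    def decorator(func: Callable[[str], str]) -> Callable[[str], str]:
        TRANSLATORS[language] = func
        return func
    return decorator


@register_translator("xx")
def toy_engine(text: str) -> str:
    # Placeholder engine: marks the text instead of translating it.
    return "[xx->en] " + text


def translate(text: str, language: str) -> str:
    engine = TRANSLATORS.get(language)
    if engine is None:
        return text  # no engine available: pass the text through untranslated
    return engine(text)


print(translate("sample text", "xx"))
```

Under this arrangement, swapping in a better engine for a language means registering a new function under the same key; the rest of the pipeline is unchanged.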

The company produces optical character recognition technologies, but it also incorporates other commercial products into its system. The firm keeps an eye on advances from other sources that can be added to the DOCEX system.

Dearie continues that the company already has added machine translation engines and languages that have become newly important, and more are coming. The company does not produce these translation engines, which can come from any of a number of different sources in the commercial arena or intelligence community. Instead, because machine translation products are constantly improving, the company incorporates new technologies into its system to add capabilities or improve performance.

Another area that might benefit from upgrades is analysis. After information is extracted from these documents, experts might want to apply link analysis or use visualization tools. These capabilities can be added to the system.

“We can add new components that weren’t even contemplated that might assist with the process,” Dearie emphasizes.

Combining machine translation with key word or phrase searching helps save vital time in searching for important documents. “Without a system like this, you are facing a roomful of documents written in a language that you don’t have or for which you have a limited number of linguists,” Dearie offers. “How do you consume that—how do you search through it? How do you decide what’s there? How do you find anything of value?”

Instead of having a linguist view documents one at a time, the system employs machine translation to sort through the collection and surface high-priority information. It not only removes unnecessary steps and reduces human input but also helps winnow the wheat from the chaff, Dearie says.

Den adds that current machine translation engines still have a way to go before they provide ideal translations, especially with the non-mainstream languages that now are areas of high interest to the intelligence community. A machine translation of a document in one of these languages is “virtually unreadable in terms of putting it into nice, clear English,” he says. However, key words or phrases do stand out, especially when the DOCEX system is keyed to recognize them.

“The word I would use is ‘triage,’” he continues. “If you have 20,000 documents and one person who knows that language and can translate them, then the old way would begin with page one and work its way through to the end. The hope would be that the important documents were early in the search instead of at the end.

“Now, however, you have a system that permits inputting them as fast as possible [and applying] some degree of translation to them. Then, we’ll do a query for the things we know are important. Then, those documents—even with the rudimentary translation—can [be sorted to] push out the important ones that merit a closer look. So, the valued linguists can be spending their time where we have an educated guess where they ought to be, instead of looking at the menu from the restaurant down the street,” Den declares.
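The triage Den describes can be approximated by counting how often terms of interest appear in each rough translation and reviewing the highest-scoring documents first. This is a minimal sketch under that assumption; the scoring rule, the `triage` function and the sample texts are invented for illustration.

```python
def triage(translations, watch_terms):
    """Rank documents by how many watch-term hits their rough translation has."""
    terms = {t.lower() for t in watch_terms}
    scored = []
    for doc_id, text in translations.items():
        words = text.lower().split()
        # Strip trailing punctuation so "night," still matches "night".
        hits = sum(1 for w in words if w.strip(".,;:!?") in terms)
        scored.append((hits, doc_id))
    # Most hits first; ties broken by document id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    # Documents with no hits are left out of the linguist's queue.
    return [doc_id for hits, doc_id in scored if hits > 0]


rough = {
    "doc-001": "menu soup bread price",
    "doc-002": "shipment arrive bridge night, shipment hidden",
    "doc-003": "meeting near bridge",
}
queue = triage(rough, ["shipment", "bridge"])
print(queue)
```

Even a poor translation is enough for this kind of ranking, because only the watch terms need to survive the translation step.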

Dearie elaborates. “If you know in advance that you are looking for a certain type of intelligence or a certain arena of information, then you can write pre-canned queries that look for that, and they can be run automatically at periodic intervals against all the documents that are in the system and that are arriving new in the system,” he adds. “As soon as the documents are scanned in and have been OCR’d [optical character recognized] and machine-translated and/or indexed, then these queries can be run against them.” These same queries also work on values that users add, such as tags and annotations.
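Pre-canned queries of this kind can be modeled as saved term sets that are re-run on a schedule and checked only against documents not yet examined. The class below is a hypothetical sketch of that idea, not the DOCEX query mechanism; the name `StandingQueries` and the all-terms-must-match rule are assumptions.

```python
class StandingQueries:
    """Saved queries re-run as documents arrive (hypothetical design)."""

    def __init__(self):
        self.queries = {}   # query name -> set of required terms
        self.matches = {}   # query name -> list of matching document ids
        self.seen = set()   # ids of documents already checked

    def add_query(self, name, terms):
        # Note: a query added later is not re-run against documents already seen.
        self.queries[name] = {t.lower() for t in terms}
        self.matches[name] = []

    def run(self, archive):
        """Check documents not yet seen; call this from a timer or scheduler."""
        for doc_id in sorted(archive):
            if doc_id in self.seen:
                continue
            words = set(archive[doc_id].lower().split())
            for name, terms in self.queries.items():
                if terms <= words:      # every query term must appear
                    self.matches[name].append(doc_id)
            self.seen.add(doc_id)


watch = StandingQueries()
watch.add_query("bridge-shipments", ["bridge", "shipment"])

archive = {"a": "shipment near bridge"}
watch.run(archive)                      # first periodic run

archive["b"] = "bridge repair"
archive["c"] = "Bridge shipment tonight"
watch.run(archive)                      # later run sees only the new arrivals
print(watch.matches["bridge-shipments"])
```

Tracking which documents have been seen keeps each periodic run proportional to the new arrivals rather than to the whole archive.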

The DOCEX translation is not set in stone, even after it is stored in a desired archive. A later human translation can be collocated with the document’s original image. Other aspects of state-of-the-art document management and workflow can be incorporated. Users can place metadata tags on individual scanned documents as well as highlight important parts of the documents and message them to other people.

This DOCEX system can be fielded on a single laptop or run in an organizational structure featuring hundreds of users. Capabilities can be scaled up or down, depending on needs, and the system is not platform-specific.

In addition to integrating government and commercial software capabilities, the company is working to develop its own system upgrades. One area of research aims to extract value from unstructured content, Dearie relates. This applies to all types of unstructured content, not just paper documents—audio and video, for example.

While ideal speaker-independent speech recognition still has many obstacles to overcome, some aspects offer promise. Related technologies that do not require full recognition, such as phonic indexing, can be applied.
