Cleansing Emerges as Trend In Data Warehouse Efforts
Stored information riddled with inaccuracies must be scrubbed of errors to be an effective tool.
Organizations that rely on large amounts of data are increasingly employing data cleansing techniques to ensure accuracy and efficiency by scrubbing data that has been polluted at the source or on its way to a data warehouse.
While experts agree that this is an emerging trend among data warehouse users, they say most businesses are not paying enough attention to the problems inherent in having impure data. Often, businesses operate with the assumption that their decisions are based on credible information, but many times this is not the case, experts say. It is not until businesses try to use the data that they realize the fallacies within their information warehouses.
This is not only the case in industry. Government integrates information vital to its operations as well, so data users in both arenas must be aware of impurities that can creep in when information is entered into a data warehouse. Common errors include a system misinterpreting the information, a lack of standardization and missing or buried data. All of these demand that data be cleansed before users rely on the information to make decisions, experts say.
John Ladley, a research fellow at the Meta Group, Stamford, Connecticut, says 80 percent of the effort in a data warehousing project goes into cleaning, extracting and loading data. Ladley, who is also vice president of Knowledge InterSpace Incorporated, Chicago, finds that data quality is still the second most prominent challenge involved in a warehouse project.
“Most organizations are plagued by a state of denial as to the level of quality they have,” says Larry P. English, president and principal of Information Impact International, Brentwood, Tennessee. From the management level down, people simply assume that the data is credible, English says. Ladley agrees, offering, “It is usually much worse than people think. Almost all organizations that have a system of some sort do not realize that there are problems with the data.”
Widely regarded as a founding father of the data warehouse, Bill Inmon says that data cleansing is a serious issue in data storage. “It’s after the user starts to use what’s out there that they start to scream,” he says.
Troubled data stems from many sources, including functional data that is created for a single, specific use and never designed for an enterprisewide solution. In addition, business practices are often altered, rendering historical data inaccurate. “The essence of a data warehouse is to contain data over time. That data, over time, will change,” Inmon states. Ladley agrees, “Organizations change their business processes, and that changes the way data relate to other data.”
The data cleansing market is estimated to be a $300 million a year business by the year 2000. Businesses often spend five to 10 times more money to correct their data after it is entered into the system than they would have if they had headed the problems off at the source, according to English. He advocates employing data cleansing efforts from the beginning of a warehouse project. All too often, he relates, businesses entirely abandon data warehouse efforts because the quality of their data is so poor that the illegitimate data cannot be used to help the business.
He refers to data cleansing as information “scrap and rework,” likening the act of scrubbing data to a manufacturing process in which quality is valued over production rates. Just as many manufacturers have found that quality determines the success of their products, so too will information users discover the business and strategic value of accurate information within a data warehouse, he adds.
Referring to what he calls total quality data management, or TQdM, English suggests that people measure the real cost of poor data quality to understand the requirement for accurate information. When businesses quantify the need for data accuracy, there is a greater understanding of why data integrity is so crucial. After they quantify the need, “people wake up to the reality that this is not just an inconvenience, but it is a drain of the profits of an organization,” English states.
Clean data depends on the reliability of the person entering the data, or the data source; the clarity of the data being requested; and the validation process used to flag potential inaccuracies. Often, businesses push data entry clerks to produce many records at a time. The incentive to enter great quantities of data quickly is counterproductive to data quality, English adds.
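For illustration only, the minimal sketch below shows what such entry-time flagging might look like in Python. The field names, formats and rules are hypothetical assumptions, not drawn from any vendor's product.

```python
import re
from datetime import datetime

# Minimal sketch of entry-time validation: flag records whose fields fail
# simple checks so they can be reviewed before reaching the warehouse.
# Field names and rules here are hypothetical illustrations.

def flag_record(record: dict) -> list:
    """Return a list of warnings for potentially inaccurate fields."""
    warnings = []
    if not record.get("customer_name", "").strip():
        warnings.append("customer_name is missing")
    zip_code = record.get("zip_code", "")
    if not re.fullmatch(r"\d{5}(-\d{4})?", zip_code):
        warnings.append(f"zip_code '{zip_code}' is not a valid U.S. ZIP code")
    try:
        datetime.strptime(record.get("order_date", ""), "%Y-%m-%d")
    except ValueError:
        warnings.append("order_date is not in YYYY-MM-DD format")
    return warnings

if __name__ == "__main__":
    suspect = {"customer_name": "  ", "zip_code": "2138", "order_date": "13/40/97"}
    for issue in flag_record(suspect):
        print("FLAG:", issue)
```

In practice, a flagged record would be routed back to the person or system that entered it rather than loaded into the warehouse as-is.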
Vality Technology Incorporated specializes in data cleansing. The Boston-based firm has developed software that investigates, transforms, standardizes, matches and integrates data from multiple sources. The product was created to help businesses make practical decisions with a mathematical approach to dealing with data problems.
The company’s Integrity software is a client-server-based data re-engineering product that features a “fuzzy matching” capability that Vice President of Product Strategy Stephen Brown defines as the ability to find related material and detect relationships between records despite file inconsistencies.
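Vality's actual algorithms are not described here, but the general idea of fuzzy matching, scoring how alike two records are despite spelling and formatting differences, can be sketched with Python's standard-library SequenceMatcher. The records, fields and threshold below are illustrative assumptions.

```python
from difflib import SequenceMatcher

# A rough sketch of "fuzzy matching": score how alike two records are even
# when spellings and formats differ. SequenceMatcher is a stand-in here;
# it is not the method used by any particular commercial product.

def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so formatting noise is ignored."""
    return " ".join(value.lower().split())

def similarity(a: dict, b: dict, fields: list) -> float:
    """Average per-field similarity ratio between two records (0.0 to 1.0)."""
    scores = [
        SequenceMatcher(None, normalize(a.get(f, "")), normalize(b.get(f, ""))).ratio()
        for f in fields
    ]
    return sum(scores) / len(scores)

rec1 = {"name": "Jon Smith",  "street": "123 Main St."}
rec2 = {"name": "John Smyth", "street": "123 Main Street"}

score = similarity(rec1, rec2, ["name", "street"])
print(f"similarity: {score:.2f}")
if score > 0.8:  # threshold is an arbitrary illustration
    print("records likely refer to the same customer")
```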
The software runs on either Windows NT or Windows 95 and on server environments including MVS, OS/390, HP-UX and Sun Solaris. Because the interface is the same across platforms, data can be moved from one server to another transparently, without altering the information.
Typically, Vality works with customers that have an average of 10 million to 20 million data records that require analysis. Some customers have as many as 90 million to 100 million records that need to be cleaned to ensure data integrity.
The company has worked on projects for Lockheed Martin, which was combining systems after merging with other firms. Vality was also called upon by the U.S. Navy, which is seeking to integrate stovepipe systems into a single data repository to support the warfighter. In yet another application of data analysis, the National Highway Traffic Safety Administration sought technology that could match police reports with ambulance reports to justify the need for wearing seat belts.
Brown says that the data engineering Vality offers through its Integrity software is “getting down to the true semantics of the data value.” This incorporates domain integrity as well as entity or logical key integrity. Domain integrity is the ability to understand a single instance of a value in its business sense. Entity integrity or logical key integrity is the matching that is done to understand where the needed values are located.
The company promotes a phased approach that consists of four steps. First, data investigation must be performed to determine the condition of the data. This helps data evaluators identify and correct problems before they enter the warehouse. The second step involves data transformation to determine what form the new data should take. The Integrity software provides conversion routines that assist in the transformation to newly developed standards.
Representing the company’s mathematical approach, the third phase involves the use of statistical matching algorithms that score linked items by how similar they are to each other. In the last phase the data is formatted and moved to the target destination.
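A schematic of that four-phase flow, using made-up records and deliberately simple rules rather than Vality's actual routines, might look like the following sketch.

```python
# Schematic of the four-phase approach described above, not vendor code:
# investigate the data, transform it to a standard form, score candidate
# matches, then format and load the result. All rules here are illustrative.

RAW = [
    {"name": "ACME Inc", "state": "MA "},
    {"name": "Acme, Incorporated", "state": "ma"},
]

def investigate(records):
    """Phase 1: profile the data, e.g. count missing or non-standard values."""
    return sum(1 for r in records
               if not r["name"] or r["state"] != r["state"].strip().upper())

def transform(records):
    """Phase 2: convert values to agreed standards (case, abbreviations)."""
    std = []
    for r in records:
        name = r["name"].upper().replace(", INCORPORATED", " INC").rstrip(".")
        std.append({"name": name, "state": r["state"].strip().upper()})
    return std

def match(pair):
    """Phase 3: score a pair for similarity; here, share of common name tokens."""
    a, b = pair
    tokens_a, tokens_b = set(a["name"].split()), set(b["name"].split())
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

def load(records, score, threshold=0.5):
    """Phase 4: format and move to the target; merge pairs scoring above threshold."""
    return [records[0]] if score >= threshold else records

print("non-standard records found:", investigate(RAW))
clean = transform(RAW)
print(load(clean, match(clean)))
```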
Vality officials maintain that data re-engineering and cleansing improve customer service and retention by ensuring accuracy in business decision making. Data integrity allows a company to maximize its sales and marketing opportunities and enables smart enterprisewide solutions. The effects of data contamination can be halted if a problem is caught early in the process.
Another company that specializes in data cleansing is Trillium Software, Billerica, Massachusetts. Trillium produces enterprisewide data quality management software that focuses on preventing data pollution. Data entry validation is just one of the methods the company offers as a way to cleanse data warehouses.
Without clean data, everything that is based on or results from the information loses credibility, says Trillium’s Director of Marketing Leonard Dubois. “Many businesses do not realize the condition of their data, or they believe the condition of their data is such that they do not need data quality. Sometimes they do not realize it until they try to put these large systems into place.”
Trillium’s approach to cleansing data involves investigating data quality, cleansing the existing data and preventing contaminated data from entering the warehouse to curb future errors.
Citibank and American Express use the company’s software to detect fraud. Incoming information can be matched against databases to assist the credit card companies with risk management procedures. Many state governments are using the software to ensure that services such as welfare are being delivered to the right people in the right amounts. The company has also tracked import and export transactions for the U.S. Customs Service.
The software’s cleansing functions allow real-time editing as data is being entered into the system, explains Dave Pietropaolo, national account manager, Trillium. In conjunction with the prevention ideology, the company’s product line includes an on-line editing feature that helps ensure that information is accurate as it is entered into a system. This can be particularly helpful in a situation where a service representative enters data directly into the system while speaking to a customer. “While you’re on the phone, lots of cleansing is happening,” Pietropaolo claims.
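As a simplified illustration of entry-time cleansing, and not Trillium's actual edit logic, the sketch below standardizes a phone number and a street address the moment they are typed; the formats and abbreviation table are assumptions.

```python
import re

# Simplified illustration of "on-line" cleansing: standardize and correct a
# value as it is entered, before it is stored. Rules below are hypothetical.

STREET_ABBREVIATIONS = {"STREET": "ST", "AVENUE": "AVE", "ROAD": "RD"}

def cleanse_phone(raw: str) -> str:
    """Keep only digits and format as a U.S. phone number when possible."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 10:
        return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
    return raw  # leave it alone so a reviewer can flag it later

def cleanse_street(raw: str) -> str:
    """Uppercase the street and apply standard abbreviations."""
    words = raw.upper().split()
    return " ".join(STREET_ABBREVIATIONS.get(w, w) for w in words)

# What a service representative might type during a call...
print(cleanse_phone("617.555.01 23"))   # -> (617) 555-0123
print(cleanse_street("12 elm street"))  # -> 12 ELM ST
```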
The software offers international data cleansing capabilities as well. It is currently being used in 25 countries and eight languages. Pietropaolo notes that this capability makes the software useful for law enforcement or intelligence agencies as well as other organizations that house information that crosses linguistic borders.
Trillium’s strategy for data management stresses the need for an enterprisewide approach. By understanding the data that exists across an enterprise, maintenance can be simplified, company officials maintain. Therein lies the financial benefit of assessing data quality from the outset for the entire business entity.
As data warehousing’s top guru Inmon says, “Today we have a more sophisticated view of what the data warehouse is.” As the understanding of data management and how it should be used evolves, businesses will likely latch onto the emerging trend of data cleansing to provide information integrity and to solidify their business decisions, he adds.
“The cost to American industry is enormous,” says Ladley, citing poor decisions and lost customers as just two results of allowing data accuracy to lapse.