If you’re ready to deploy a business intelligence reporting and analytics solution, then data quality is probably on your mind. Last week, we demystified master data management (MDM) and how it combines technology with consensus-building to improve and maintain data quality. Today let’s review another technology option for data quality initiatives: ETL.
Extract, transform, and load (ETL) tools are widely used to improve data quality. ETL tools first combine data from normally heterogeneous sources to a single repository (the extract phase), transform the data by cleansing and organizing it to optimize reporting and analysis, and then load the cleansed data into a data warehouse or other system, where end users can then rely on vetted information. When just starting out, many organizations simply use ETL to cleanse their data – like resetting dates or adding default values to empty fields. More advanced data cleansing can mean “deduping” duplicate customer records or parsing fields like “Full Name” into separate fields, “First Name” and “Last Name.”
Enterprise ETL tools are expensive, with Oracle’s Data Integrator costing $23,000 per processor, for example. There are open source and low-cost ETL tools out there, but they’re not generally suitable for enterprise-scale data quality initiatives. ETLTools.net has compiled a list of them as well as their limitations.
The advantages of ETL tools vary. Obviously SAP ETL tools offer tighter integration with SAP products; the same with Oracle. Some tools are faster than others to implement and learn, and some offer better support and documentation than others. Beyond advanced data cleansing support, it’s important to carefully consider the following list of items before purchasing any ETL tool:
82% of top performers implement a process for continuous data quality improvement.
– Gleanster Deep Dive: How Top Performers Improve Data Quality for Better Business Intelligence, January 2011
In recent posts, we’ve explored the consequences of poor data quality and also evaluated who should be responsible for maintaining good data within an organization. We’ve seen that there’s no quick fix for subpar data quality; rather, ensuring superior data requires a well-orchestrated team effort. A 2011 Gleanster benchmark report revealed that top performing companies understand that maintaining data quality is an ongoing discipline requiring constant attention. Organizations successful in the pursuit of better data have implemented strategies such as these to continuously improve their data:
1. Have a policy in place and take ownership
Organizations may hire a data quality manager as a dedicated resource and the first line of defense against bad data. The data quality manager governs data processes by ensuring that reliable information is loaded into the data warehouse and is responsible for data processes such as migration, manipulation and analysis. Additionally, some BI systems like arcplan allow authorized users to write back data directly to the data source. In this case it is the responsibility of that user to enter accurate information and not corrupt the system with erroneous data.
2. Enforce data quality at the source system
“Garbage in, garbage out” is the phrase to keep in mind. Making particular information mandatory in the source system is one way to go about maintaining data quality…
In my last post on the subject of Cloud Computing, I mentioned two ways to slice and dice data in the cloud — depend on query tools to extract data to a local database, or use Data Warehouses to support the transactional system in the cloud. Today, I’ll delve deeper into these two choices for culling meaningful trends and KPIs from data in the cloud.
Whether or not a transactional system is moved to the cloud, the data collected is still necessary for analytical processing. A transaction processing system is optimized to capture the specific transactions as effectively as possible. On the other hand, analytical processing data has to be optimized to allow detection of trends in Key Performance Indicators. Business Intelligence (BI) systems are usually built on the latter. When the transaction system is in-house, an Extract-Transform and Load (ETL) system can be written to automate the transformation of data from highly normalized transactional to denormalized analytical form.