If you're ready to deploy a business intelligence reporting and analytics solution, then data quality is probably on your mind. Last week, we demystified master data management (MDM) and how it combines technology with consensus-building to improve and maintain data quality. Today let's review another technology option for data quality initiatives: ETL.
Extract, transform, and load (ETL) tools are widely used to improve data quality. ETL tools first combine data from typically heterogeneous sources into a single repository (the extract phase), then cleanse and reorganize that data to optimize it for reporting and analysis (the transform phase), and finally load the cleansed data into a data warehouse or other system (the load phase), where end users can rely on vetted information. When just starting out, many organizations simply use ETL to cleanse their data – for example, resetting invalid dates or adding default values to empty fields. More advanced data cleansing can mean "deduping" duplicate customer records or parsing a field like "Full Name" into separate "First Name" and "Last Name" fields.
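To make the transform phase concrete, here is a minimal sketch of those cleansing rules in plain Python. The field names (`full_name`, `email`, `country`, `signup_date`), the default values, and the dedupe key are illustrative assumptions, not part of any particular ETL product:

```python
def cleanse(records):
    """Apply simple cleansing rules: default empty fields, reset bad dates,
    split full names, and dedupe on a normalized email key."""
    seen = set()
    cleaned = []
    for rec in records:
        # Add a default value to an empty field.
        if not rec.get("country"):
            rec["country"] = "Unknown"
        # Reset an obviously invalid date to a sentinel value.
        if rec.get("signup_date") == "0000-00-00":
            rec["signup_date"] = None
        # Parse "Full Name" into separate first/last name fields.
        full = rec.pop("full_name", "").strip()
        first, _, last = full.partition(" ")
        rec["first_name"], rec["last_name"] = first, last
        # Dedupe: keep only the first record per normalized email.
        key = rec.get("email", "").lower()
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(rec)
    return cleaned
```

In a real pipeline, rules like these would be configured inside the ETL tool rather than hand-coded, but the logic is the same: each record passes through a series of cleansing steps before it is loaded into the warehouse.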
Enterprise ETL tools are expensive, with Oracle's Data Integrator costing $23,000 per processor, for example. There are open source and low-cost ETL tools out there, but they're not generally suitable for enterprise-scale data quality initiatives. ETLTools.net has compiled a list of them as well as their limitations.
The advantages of individual ETL tools vary. SAP's ETL tools naturally offer tighter integration with SAP products, just as Oracle's do with Oracle's. Some tools are faster than others to implement and learn, and some offer better support and documentation. Beyond advanced data cleansing support, it's important to carefully consider the following before purchasing any ETL tool:
- Is there a free/low-cost trial available? This will help you determine the software's learning curve and limitations up-front.
- Speed and ease of development: how quickly can your team build, test, and maintain transformation jobs?
- Data formats: can the tool handle both structured and unstructured data (especially if you're looking to integrate social media data streams into your data warehouse)?
- Storage: are your development packages stored locally on your machine (meaning you can work on them offline and upload to the server when ready) or saved directly to the server?
- Support: is there an online community of developers you can count on? How many partners/resellers can provide additional support?
ETL tools do require dedicated resources, and the learning curve is usually steep. They can also only tell you if and when something is wrong with your data – fixing the problems is a manual process – and they do nothing to prevent bad data from re-entering your systems.
Basically, using an ETL tool is a quick way to jumpstart a data cleansing initiative. It can serve as an interim solution until you can move on to more complex processes (like MDM) in the future.
Ensuring good data quality is not a one-shot effort; it takes continuous process management to be successful. I would love to hear how our readers are managing their data quality initiatives. Often just getting started is the hardest part. Where are you in the process?