In my last post on the subject of Cloud Computing, I mentioned two ways to slice and dice data in the cloud -- depend on query tools to extract data to a local database, or use Data Warehouses to support the transactional system in the cloud. Today, I'll delve deeper into these two choices for culling meaningful trends and KPIs from data in the cloud.
Whether or not a transactional system is moved to the cloud, the data collected is still necessary for analytical processing. A transaction processing system is optimized to capture the specific transactions as effectively as possible. On the other hand, analytical processing data has to be optimized to allow detection of trends in Key Performance Indicators. Business Intelligence (BI) systems are usually built on the latter. When the transaction system is in-house, an Extract-Transform and Load (ETL) system can be written to automate the transformation of data from highly normalized transactional to denormalized analytical form.
Once the decision is made to move data collection to the cloud, the organization has to figure out how to accomplish ETL from data that is no longer in the form of an in-house DBMS. Frequently, the vendor of the SaaS software provides a query tool to allow collection and download of some data elements onto a local machine in the form of delimited text files. Once downloaded, these files have to be loaded into a DBMS for the BI application to properly work. At a minimum, this requires access by a power-user who is comfortable with extraction and loading of data. The query tool frees the power-user from dependence on the vendor and allows him/her to manipulate the data. However, since the data is now an abstraction in the cloud (and not a tangible DBMS), a query tool has to limit how and what can and cannot be extracted. This method is satisfactory when the volume and velocity of change are not very high. Because it is a manual and multi-step process, when data gets more complex and the number of transactions grows rapidly, a power-user (presumably with a day job) could make mistakes and this is where the query-based data extracts from the cloud could create unreliable analytical processing databases.
Additionally, if the query-based extracts cannot be automated, the process is not repeatable without human involvement – creating a requirement of training more than one power-user to mitigate a single point of failure. In my next entry, I'll discuss some possible methods to overcome these objections.