Abstract of MS Thesis - Nitin Kumar

Wednesday 11 April 2012 at 02:21 am.

An ETL process is used to extract data from various sources, transform it and load it into a Data Warehouse. In this thesis, we analyse an ETL flow and observe that only some of the dependencies in an ETL flow are essential, while the others merely represent the flow of data. For linear flows, we exploit the underlying dependency graph and develop a greedy heuristic technique to determine a reordering that significantly improves the quality of the flow. Rather than adopting a state-space search approach, we use the cost functions and selectivities to determine the best option at each position in a right-to-left manner. To deal with complex flows, we identify activities that can be transferred between the linear segments of the flow and position those activities appropriately. We then use the reorderings of the linear segments to obtain a cost-optimal, semantically equivalent flow for a given complex flow. We also present an efficient framework to generate ETL flows that can serve as a test suite. Experimental evaluation shows that, with the proposed techniques, ETL flows can be optimized better and with much less effort than with existing methods.
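
The right-to-left greedy idea for a linear flow can be illustrated with a minimal sketch. This is not the thesis's algorithm: the abstract does not state the selection criterion, so the sketch assumes the standard rank metric (selectivity - 1) / cost from predicate ordering, a per-row cost model, and a dependency set holding only the essential "must run before" pairs. All names here (Activity, flow_cost, greedy_reorder, must_precede) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Activity:
    name: str
    cost_per_row: float   # assumed cost model: cost is linear in input rows
    selectivity: float    # fraction of input rows passed on (1.0 = no filtering)


def flow_cost(order, input_rows):
    """Total cost of running the activities in the given order."""
    total, rows = 0.0, float(input_rows)
    for act in order:
        total += rows * act.cost_per_row
        rows *= act.selectivity
    return total


def rank(act):
    # Assumed criterion: low rank = cheap and selective, so it should run early.
    return (act.selectivity - 1.0) / act.cost_per_row


def greedy_reorder(activities, must_precede):
    """Fill positions from last to first.

    must_precede is a set of (a, b) name pairs meaning a must run before b.
    At each position, only activities with no unplaced successor are eligible,
    and the one with the highest rank is placed.
    """
    remaining = {a.name: a for a in activities}
    placed_from_right = []
    while remaining:
        candidates = [
            a for a in remaining.values()
            if not any(x == a.name and y in remaining for x, y in must_precede)
        ]
        if not candidates:
            raise ValueError("dependencies contain a cycle")
        chosen = max(candidates, key=rank)
        placed_from_right.append(chosen)
        del remaining[chosen.name]
    return placed_from_right[::-1]


# Hypothetical three-activity flow: only convert -> lookup is essential.
original = [
    Activity("convert", cost_per_row=5.0, selectivity=1.0),
    Activity("filter", cost_per_row=1.0, selectivity=0.25),
    Activity("lookup", cost_per_row=3.0, selectivity=1.0),
]
essential = {("convert", "lookup")}
reordered = greedy_reorder(original, essential)
print([a.name for a in reordered])
print(flow_cost(original, 1000), flow_cost(reordered, 1000))
```

In this toy flow the filter is free to move ahead of the conversion because no essential dependency ties them together, so the expensive activities see fewer rows. Each position is decided once, with no backtracking, which is what distinguishes this kind of greedy pass from a state-space search.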