Summary: In this article we look at the burgeoning field of Interactive Data Preparation: what it is, why it's becoming increasingly valuable, some common use cases, recent commercial activity and why industry trends will continue to fuel investment in this space.

Interactive Data Preparation is the process of visually interacting with one or more data sets in order to transform and clean dirty or incomplete data. The output of such “Data Wrangling” is often a scriptable set of actions or transforms that can be run against a larger body of data. The resulting data sets are frequently used as input to high-level visual reporting tools such as Tableau or QlikView.
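
To make the idea of a scriptable transform concrete, here is a minimal sketch using Python and pandas. The column names, example values and individual steps are hypothetical; an interactive tool would typically generate an equivalent recipe from the analyst's on-screen actions rather than requiring it to be hand-written.

```python
# A minimal sketch (pandas) of what a recorded, scriptable transform might look
# like. Column names and example values are hypothetical.
import pandas as pd

def apply_transform(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Normalise inconsistent casing and whitespace in a categorical column
    df["region"] = df["region"].str.strip().str.title()
    # Parse dates that arrived as free text; unparseable values become NaT
    df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
    # Coerce amounts to numbers; bad values become NaN rather than failing
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
    return df

# Developed interactively against a small sample...
sample = pd.DataFrame({
    "region": [" north ", "SOUTH", "North"],
    "order_date": ["2013-11-02", "not a date", "2013-12-01"],
    "amount": ["10.50", "12", "n/a"],
})
print(apply_transform(sample))
# ...then re-run, unchanged, against the full data set in a batch job.
```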

The process of restructuring data into a form suitable for analysis with visualisation tools can be complex and time-consuming. During this restructuring, analysts are often hampered by data quality issues that must be addressed along the way, such as misspelt values, duplicate records, missing data and outliers.
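
As an illustration, the sketch below (again Python and pandas, with hypothetical "city" and "price" columns) shows how each of the data quality issues mentioned above might be addressed programmatically once identified through visual inspection.

```python
# Illustrative fixes for the data quality issues mentioned above:
# misspellings, duplicates, missing values and outliers.
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Correct known misspellings with an explicit mapping
    df["city"] = df["city"].replace({"Sydeny": "Sydney", "Melborne": "Melbourne"})
    # Remove exact duplicate rows
    df = df.drop_duplicates()
    # Fill missing prices with the column median
    df["price"] = df["price"].fillna(df["price"].median())
    # Drop outliers more than three standard deviations from the mean
    mean, std = df["price"].mean(), df["price"].std()
    return df[(df["price"] - mean).abs() <= 3 * std]
```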

The complexity and time-consuming nature of Interactive Data Preparation result in two key problems: firstly, analysts spend up to 80% of their time developing transforms and only 20% analysing the transformed data; secondly, developing transforms requires significant help from skilled data scientists and has often been beyond the reach of less technical users. Simplifying this stage of the workflow empowers less technical users and makes more effective use of the data scientists within an organisation.

The amount of data being collected, stored and analysed is growing rapidly, in part because organisations are looking to derive greater value from previously untapped data. This is placing even greater pressure on data analytics transformation pipelines and highlighting existing inefficiencies. While the total cost of experimentation within organisations has fallen significantly, partly due to the adoption of highly virtualised environments and, more recently, public cloud providers commoditising compute and storage, the relative cost of data preparation has increased and become a bottleneck in the end-to-end process.

A Big Data report by McKinsey predicted a US shortfall of between 140,000 and 190,000 people with deep analytical skills by 2018, and a staggering shortage of 1.5 million managers with the skills needed to use data to make effective decisions. If organisations are to use data effectively, they will need to streamline each stage of their data analytics pipeline to make the best use of the resources available.

Collectively, these industry trends highlight a growing need to innovate within the Interactive Data Preparation space by providing a richer visual experience, greater abstraction from the underlying complex algorithms, intelligent recommendations, adaptive learning and greater visibility into data lineage.

Visually interacting with data to cleanse or transform it is nothing new, and Microsoft Excel is one of the most commonly used tools for the task. While Excel can process data sets of up to one million rows, which by size would accommodate a significant proportion of business analyst workloads, its support for automatically suggesting suitable transformations to users, and ranking those suggestions, is basic or simply does not exist.
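
To give a feel for what transformation suggestions could look like, here is a deliberately simplified toy sketch, not any vendor's actual algorithm: each candidate transformation is scored by the fraction of values in a column it can successfully interpret, and the highest-scoring suggestions would be surfaced to the user first.

```python
# A toy illustration (not any vendor's algorithm) of suggesting and ranking
# transformations: each candidate parser is scored by the fraction of values
# in a column it can interpret, and the best-scoring suggestions come first.
from datetime import datetime

def try_parse_date(value: str) -> bool:
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            pass
    return False

def try_parse_number(value: str) -> bool:
    try:
        float(value.replace(",", ""))
        return True
    except ValueError:
        return False

CANDIDATES = {"parse as date": try_parse_date, "parse as number": try_parse_number}

def rank_suggestions(column):
    scores = {
        name: sum(check(value) for value in column) / len(column)
        for name, check in CANDIDATES.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# "parse as date" scores 0.75 here, so it would be offered as the top suggestion.
print(rank_suggestions(["2013-11-02", "03/04/2013", "n/a", "2014-01-15"]))
```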

Another weakness of traditional visual data preparation tools is their lack of data lineage or data provenance. While tools such as Excel readily allow operations to be undone, they do not provide a history of the declarative steps taken in the current working transform. Such a history provides two key benefits: firstly, the user and potential collaborators can see clearly how a collection of data sets was transformed into a new data set; secondly, it allows the transform to be reused in an offline batch or real-time scenario.
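
The sketch below illustrates the idea in miniature, under the assumption of a simple hypothetical "recipe" object: each declarative step is recorded with a human-readable description, so the same object can both display the lineage of a transform and replay it against new data in a batch job.

```python
# A minimal sketch of recording declarative transform steps so that the same
# "recipe" provides both a readable lineage and a replayable batch transform.
# The Recipe class and the example steps below are hypothetical.
import pandas as pd

class Recipe:
    def __init__(self):
        self.steps = []  # list of (description, callable) pairs

    def add(self, description, fn):
        self.steps.append((description, fn))
        return self

    def lineage(self):
        # Human-readable history of the steps that make up the transform
        return [f"{i + 1}. {desc}" for i, (desc, _) in enumerate(self.steps)]

    def run(self, df: pd.DataFrame) -> pd.DataFrame:
        for _, fn in self.steps:
            df = fn(df)
        return df

recipe = (
    Recipe()
    .add("drop duplicate rows", lambda df: df.drop_duplicates())
    .add("remove rows with a missing id", lambda df: df.dropna(subset=["id"]))
)

print("\n".join(recipe.lineage()))        # visible lineage for collaborators
# cleaned = recipe.run(pd.read_csv(...))  # the same steps, replayed in batch
```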

Excel is, of course, a great tool for working with small-to-medium clean data sets; it has a comprehensive and stable set of inbuilt functions, not to mention excellent visualisation capabilities. However, it lacks a number of the features analysts need for Interactive Data Preparation at Big Data scale.

Until the end of 2013, there were few commercial solutions focussed on Interactive Data Preparation. However, two promising organisations, Paxata and Trifacta, had been busy developing products since their respective incorporations in the first half of 2012. Paxata and Trifacta came out of stealth mode and launched limited-access public betas in November 2013 and February 2014 respectively. While access to each organisation's beta program is currently limited, Trifacta's participation in Strata and a demonstration of Paxata's solution that I recently attended suggest that both organisations have advanced first releases and a solid foundation upon which to innovate further as we progress into 2014.

As strategic partnerships form in this space and market trends continue to drive innovation, I strongly suspect we will see a number of additional players emerge and compete to solve what will remain a growing problem for most data analysts.