Data provenance, also known as data lineage, refers to the process of recording the origin and transformation history of data throughout its lifetime. In this short article, we look at data lineage from the perspective of data analytics, with a particular focus on the large (Big Data) data sets common to analytics workloads.

Before we focus on data analytics, let us consider how data provenance is addressed by software development teams and why it is important. The distributed version control system Git is a popular tool that not only allows developers to collaborate on a shared codebase, but also provides a detailed history of each file and records who made (committed) each change. Git, like a number of other version control systems, was designed for collections of files that are small from a data analytics perspective. For example, if we look at the current limitations of github.com, files larger than 100 MB are rejected, and while there are currently no restrictions on the size of a repository, creating a repository containing several terabytes of data is clearly a case of forcing a square peg into a round hole.

Let us imagine that the software developers described above are creating applications to perform machine learning or other data analytics tasks. A typical environment may consist of application code that represents a model or algorithm, one or more flat files containing model coefficients, and input (and potentially output) data sets. While Git may be a good choice for providing data provenance for the application code and model coefficients, it is unlikely to be a good fit for large data sets. Storing data and code in separate repositories is not by itself a problem; the problems arise from the lack of data provenance across the various storage tiers and the lack of management tools to tie code and data together.

Let us consider the data provenance features of Amazon S3, a storage platform that offers strong data provenance at the individual object level. Amazon S3 uses buckets (containers) to store objects (files). Amazon S3 allows you to enable versioning on a bucket, thereby providing a history of each object. Combining Amazon S3 with AWS CloudTrail allows you to record not only each version of an object but also who created each version. Provided we are dealing with flat data files, we could also retrieve different versions and compare them with suitable tools to view the delta. Therefore, with a bit of work, we can enable data provenance on objects stored in an S3 bucket.
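As a minimal sketch of what this looks like in practice (using the boto3 Python SDK, with a hypothetical bucket and object key), the following enables versioning on a bucket, lists the recorded versions of an object, and retrieves an older version for comparison:

```python
import boto3

s3 = boto3.client("s3")
bucket = "example-analytics-data"  # hypothetical bucket name

# Enable versioning on the bucket so every overwrite creates a new version.
s3.put_bucket_versioning(
    Bucket=bucket,
    VersioningConfiguration={"Status": "Enabled"},
)

# List the recorded versions of a single object (hypothetical key).
versions = s3.list_object_versions(Bucket=bucket, Prefix="inputs/training.csv")
for v in versions.get("Versions", []):
    print(v["Key"], v["VersionId"], v["LastModified"], v["IsLatest"])

# Retrieve the oldest recorded version so it can be diffed against the latest.
old = s3.get_object(
    Bucket=bucket,
    Key="inputs/training.csv",
    VersionId=versions["Versions"][-1]["VersionId"],
)
old_body = old["Body"].read()
```

Versioning alone only answers "what changed"; pairing the bucket with CloudTrail data events is what records which principal wrote each version, giving us the "who".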

One common and useful feature of version control systems that we have overlooked so far is the ability to tag a repository, providing a shorthand for referring to a collection of files at a given point in time. While some might argue that copying objects into a different bucket or pseudo-folder achieves this goal, it is a poor substitute for Git's tagging. If we replace Amazon S3 with a traditional network filesystem, data provenance becomes an even greater challenge.
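To make the gap concrete, here is one way a "tag" could be approximated on a versioned bucket: writing a manifest that pins each object key to the version it had at tag time. This is only a sketch under assumed names (the bucket, prefix, and helper function are hypothetical), not an established tool, and it illustrates how much bookkeeping Git gives us for free:

```python
import json
import boto3

s3 = boto3.client("s3")
bucket = "example-analytics-data"  # hypothetical bucket name


def tag_prefix(tag_name: str, prefix: str) -> str:
    """Record the current VersionId of every object under a prefix
    in a JSON manifest, loosely analogous to a Git tag."""
    paginator = s3.get_paginator("list_object_versions")
    pinned = {}
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for v in page.get("Versions", []):
            if v["IsLatest"]:
                pinned[v["Key"]] = v["VersionId"]

    manifest_key = f"tags/{tag_name}.json"
    s3.put_object(
        Bucket=bucket,
        Key=manifest_key,
        Body=json.dumps(pinned, indent=2).encode("utf-8"),
    )
    return manifest_key


# Example: snapshot everything under inputs/ as "experiment-42".
tag_prefix("experiment-42", "inputs/")
```

Even this only pins object versions; it says nothing about who produced them or which transforms were applied, which is precisely the provenance that gets lost.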

The lack of data provenance tools for typical data analytics data sets causes a number of issues. It makes it difficult to repeat analyses with a high level of confidence without investing time and money in duplicating data. It also makes it difficult for groups to collaborate and share data when they are unsure of the data's true origin, what operations have occurred during its lifecycle, who performed them, what transforms were applied, and what data cleansing has taken place. Increasing transparency about what happens to data throughout its lifetime, and attributing its origin, will encourage not only the reuse of data but also the sharing of data between groups.

With the continued growth of data analytics and an increasing need to easily share large data sets that may change incrementally over time, we need to develop tools that simplify data provenance for such data sets.