Other attributes
Data lineage works to uncover the life cycle of data and show the complete data flow from the beginning to the end of the life cycle. This includes the process of understanding, recording, and visualizing data as it flows from data sources to data consumption and the transformations of the data along the way, which includes how the data was transformed, what changed, and why. The process allows companies to track errors in data processes, implement process changes with lower risk, perform system migrations with confidence, and combine data discovery with a comprehensive view of metadata in order to create a data mapping framework.
Data lineage is used to ensure that data comes from a trusted source, has been transformed correctly, and is loaded to the specified location. It plays an important role when strategic decisions depend on accurate information. If data processes are not tracked properly, validating the accuracy and consistency of the data becomes difficult, if not impossible, as well as costly and time-consuming. Data lineage works to solve these problems while increasing data accuracy and consistency. It also allows users to search upstream and downstream, from source to destination, to discover anomalies and correct them.
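The upstream and downstream search described above can be sketched as a traversal over a directed graph of lineage edges. This is a minimal illustration, not a reference implementation: the dataset names and the `EDGES` list are hypothetical, and a real lineage system would read its edges from a metadata store.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges: (source, target) means data flows source -> target.
EDGES = [
    ("crm.customers", "staging.customers"),
    ("erp.orders", "staging.orders"),
    ("staging.customers", "warehouse.sales"),
    ("staging.orders", "warehouse.sales"),
    ("warehouse.sales", "report.revenue"),
]


def build_index(edges):
    """Index the edges in both directions so either search is a simple lookup."""
    downstream, upstream = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        downstream[src].add(dst)
        upstream[dst].add(src)
    return downstream, upstream


def reachable(start, neighbors):
    """Breadth-first search returning every dataset reachable from start."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


DOWNSTREAM, UPSTREAM = build_index(EDGES)

# Everything the revenue report ultimately depends on (upstream search).
print(sorted(reachable("report.revenue", UPSTREAM)))
# Everything affected if staging.orders is wrong (downstream search).
print(sorted(reachable("staging.orders", DOWNSTREAM)))
```

Searching upstream from a suspicious report narrows down where an anomaly could have entered; searching downstream from a faulty dataset shows what else may need correcting.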
With the increase in information gathering from sources such as the internet, cloud computing, mobile devices, and the Internet of Things (IoT), both the amount of data generated and the amount accessible to users have increased. The cloud, especially, can complicate data governance, the collection of processes, roles, policies, standards, and metrics that ensure data is used effectively and efficiently, which is imperative for helping a business succeed. Data lineage can help organize all the data and give businesses a clear window into their data for fact checking and rapid access. This suggests that the growth of cloud data storage will make data lineage progressively more important for those data governance efforts.
For organizations that need visibility into how information moves through various workflow steps to ensure the quality of query results, business reports, business intelligence dashboards, and training sets, data lineage can help. This is enhanced when data engineers can track who changed the data and why, how something was updated, and which process was used. Furthermore, data lineage allows the provenance of data sources to be understood, which can offer a chance to:
- Evaluate the trustworthiness of data based on provenance
- Understand and correct sources of error
- Identify incorrect assumptions about data that may skew analysis
- Provide audit trails for data governance and regulatory purposes
- Ensure data flows are protected and not subject to tampering
- Identify and avoid data duplication to simplify operations
Data lineage can be used to establish traceability, which defines the connection between different source and target entities in a data warehouse. A data source is the point where data is first read or retrieved for processing, such as tables in a database, files on a storage system, or a message bus or queue. Source entities in a data lineage record tell users where data came from, while target entities, or data destinations, tell users where data is written after processing. Both source and target entities can be defined as systems that store data temporarily or permanently. Traceability can also help enterprises meet data governance objectives by:
- putting safeguards on sensitive data to meet compliance and auditability standards, such as GDPR legislation;
- enforcing access policies; and
- performing root cause analysis on low-quality data.
With the increasing number of regulations that require data lineage, the accurate tracking of reported data becomes more important. Data lineage systems can provide this function by generating accurate reports for where the data came from and how it got into the system.
Similar to governance and compliance concerns, data lineage can be important for the management of data access controls. An effective way to gather information about data lineage for data governance is by harnessing the operational metadata generated by operations in the data warehouse.
In the case of data problems or unexpected results in a project, data lineage can offer a chance to understand what and where something went wrong. Especially in the case of automated data lineage, it can remove hours of work often necessary to track the root causes of an error when such an analysis is done manually.
An impact analysis allows a business to look ahead and determine how changes could impact an organization. When changes are made incorrectly, they can result in delays, low-quality data, emergency measures, and extra work to correct the mistakes. With a unified data lineage, it can be easier to identify the impacts of changes through an entire environment, and changes can be propagated where applicable.
Data lineage that is ignored or mapped incorrectly can cause decision makers to lose faith in their reports and analytic models. Report developers, data scientists, and data citizens are better served by data they are confident in and trust. A complete understanding of the data, with a data lineage, can help develop that confidence and trust and lead to greater overall efficiency.
As data grows in quantity and complexity, it is increasingly consolidated from multiple sources into a single place. Alternatively, data virtualization offers a chance for the data to appear as if it is in a single place. Whether the data is in a data lake or a central repository, it is important to identify its original source, how it arrived at its current location, and where it is actually located if data virtualization is enabled.
Data lineage can be useful in a variety of contexts:
- Business users, who can use data lineage to validate report fields by confirming data sources and transformations and performing data discovery, combining data lineage with a data glossary or catalogue
- Data engineers, who can use data lineage to identify stale reports that could trigger upstream job failures and to troubleshoot data quality issues in datasets by tracing them back to their sources and fixing any upstream issues
- Data governance users, who can use data lineage to enforce data access control policies on data conflicts, such as updating a dataset by more than one process, removing unused or duplicated datasets for compliance and to reduce storage, identifying datasets and columns to improve storage performance and access scrutiny, monitoring data quality over time, and flagging data exfiltration when data is exported outside of the data warehouse
To implement a data lineage system, an enterprise needs to keep track of each process within the system where data is transformed or processed. This can require mapping the data assets that pass through each process, which makes it necessary to track tables, views, columns, and reports in databases. Following this, metadata needs to be collected after each data transformation, and information about each stage of the transformation needs to be stored in that metadata, which can then be used for lineage representation. Strategies for creating and using data lineage include:
This includes discovering and documenting where data exists in an enterprise, including through key business processes and the flow of data between these processes. Furthermore, understanding the technical lineage of data, or where data flows through underlying applications, services, and data stores, can be key to tracking where data moves and how it changes, and to tracking those changes in a repeatable and speedy manner. This can be important for ensuring the data remains up to date and relevant.
This includes understanding beyond the where and the how of data and extends to who is using the data, what the data means, when the data was captured, when the data is being used, and why that data is being stored and used.
Understanding the relationships between data through an enterprise can include understanding how data originates and moves between people, processes, services, and products. This information can be further conceptualized in terms of internal entities, such as departments within a business; external players, including buyers and sellers to the business; and the interactions between the internal and external entities.
As part of understanding the relationships of data, a data processing lineage can help users trace where a job has failed and which partitions were lost during the failure. Further, these systems can be used to visualize the results of jobs.
The query history lineage offers a history of when users query a data warehouse. Query lineage also allows data engineers to observe which filters and joins are used most frequently and then optimize the querying engines accordingly.
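As a rough sketch of how a query history could reveal the most frequent joins and filters, the following counts them across a small log. The queries in `QUERY_LOG` are hypothetical, and the regular expressions are a deliberate simplification: they do not handle subqueries, quoted identifiers, or other SQL dialect features that a real SQL parser would.

```python
import re
from collections import Counter

# Hypothetical query history captured from a data warehouse.
QUERY_LOG = [
    "SELECT * FROM orders o JOIN customers c ON o.cust_id = c.id WHERE o.region = 'EU'",
    "SELECT total FROM orders o JOIN customers c ON o.cust_id = c.id WHERE o.status = 'open'",
    "SELECT * FROM orders o JOIN products p ON o.prod_id = p.id WHERE o.region = 'US'",
]

# Simplifying assumption: the joined table is the word right after JOIN.
JOIN_RE = re.compile(r"\bJOIN\s+(\w+)", re.IGNORECASE)
# Simplifying assumption: a filter column is a name directly followed by an operator.
FILTER_RE = re.compile(r"([\w.]+)\s*(?:=|<|>|\bLIKE\b|\bIN\b)", re.IGNORECASE)


def joined_tables(sql):
    """Return the tables that appear after a JOIN keyword."""
    return JOIN_RE.findall(sql)


def filter_columns(sql):
    """Return the columns compared in the WHERE clause, if there is one."""
    parts = re.split(r"\bWHERE\b", sql, maxsplit=1, flags=re.IGNORECASE)
    return FILTER_RE.findall(parts[1]) if len(parts) == 2 else []


join_counts = Counter(t for q in QUERY_LOG for t in joined_tables(q))
filter_counts = Counter(c for q in QUERY_LOG for c in filter_columns(q))

print(join_counts.most_common())
print(filter_counts.most_common())
```

Counts like these could inform which join keys or filter columns are worth indexing or partitioning on.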
Data lake and warehouse access lineage can provide proper data governance for applications that work on a data lake or warehouse. Combined with query lineage and metadata, it can help visualize which users are trying to access non-authorized data and when, so administration teams can take action accordingly.
Maintaining a data lineage map manually creates chances for data lineage to fall out of date and introduce mistrust in the data. However, tools that can automate the process of data lineage tracking, especially if those tools offer a reverse tracing methodology and a baseline, can help an enterprise get a comprehensive and end-to-end lineage for their data, which can also be trustworthy. Such a tool can also offer automated metadata scanning and gathering for enhanced data lineage.
In data lineage practices, artificial intelligence (AI) and machine learning (ML) can be important tools. AI-powered discovery capabilities can streamline the process of identifying connected systems using metadata from ETL software and can describe lineage from customer applications without offering direct access to the metadata. Further, AI and ML can infer data lineage when it is impractical or impossible to do by other means.
This capability, also called AI-powered data similarity discovery or pattern-based lineage, allows users to infer data lineage by finding similar datasets across sources. AI and ML can also offer data relationship discovery, which can be an essential asset for impact analysis.
AI-powered data lineage helps users understand data flow relationships and "control" relationships, such as joins and logical-to-physical models. For example, deleting a column that is used in a join can impact a report that depends on that join. An AI-powered solution that infers joins can provide end-to-end lineage that enables a more complete impact analysis, even when these relationships are not documented.
Delivering data lineage requires more than delivering static metadata; lineage is more closely related to logic and includes instructions or code. That logic can be an SQL script, a database stored procedure, a job in a transformation tool, or a complex macro in an Excel sheet. Specifically, it can be anything that moves data from one place to another, transforms it, or modifies it. Accordingly, there are options for outlining, diagramming, and understanding that logic, with a variety of approaches to achieve data lineage.
This includes solutions that estimate lineage information without actually touching or looking at any code. They read metadata about tables, columns, reports, or other sources of data and work to profile that data. The system then uses this information to create lineage based on common patterns or similarities. Tables or columns with similar names and columns with similar data values are examples of such similarities. And, if there are a lot of similarities between two columns, they can be linked together in a data lineage diagram. Some will call this approach AI-based.
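A minimal sketch of this similarity-based approach follows, assuming two hypothetical tables given as column-name-to-values mappings. The equal weighting of name and value similarity and the 0.6 threshold are arbitrary choices made for illustration, not values taken from any particular tool.

```python
from difflib import SequenceMatcher

# Hypothetical profiled columns: column name -> sample of values.
SOURCE = {"cust_id": [1, 2, 3, 4], "signup_dt": ["2021-01-01", "2021-02-01"]}
TARGET = {"customer_id": [2, 3, 4, 5], "region": ["EU", "US"]}


def name_similarity(a, b):
    """Similarity of two column names, between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def value_overlap(xs, ys):
    """Jaccard overlap of the distinct values seen in two columns."""
    xs, ys = set(xs), set(ys)
    union = xs | ys
    return len(xs & ys) / len(union) if union else 0.0


def infer_links(source, target, threshold=0.6):
    """Propose lineage links between columns whose combined score is high enough."""
    links = []
    for s_name, s_vals in source.items():
        for t_name, t_vals in target.items():
            # Assumed weighting: names and values count equally.
            score = 0.5 * name_similarity(s_name, t_name) + 0.5 * value_overlap(s_vals, t_vals)
            if score >= threshold:
                links.append((s_name, t_name, round(score, 2)))
    return links


for source_col, target_col, score in infer_links(SOURCE, TARGET):
    print(f"{source_col} -> {target_col} (score {score})")
```

Because no transformation code is read, a link proposed this way is an estimate rather than confirmed lineage.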
The advantage of this approach is that it can help in cases where reading the logic hidden in programming code is impossible because the code is unavailable, proprietary, or inaccessible. It can also give an overall view of the data and watch the data as it moves.
The challenge of this approach is that it is not always accurate, and the impact on performance can be significant if a business works first and foremost with data. Data privacy can also be at risk with this approach. In addition, many details can be missed, such as the transformation logic, and the lineage is typically limited to the world of the database and ignores the application part of the environment.
In manual lineage, data lineage is, as the name implies, resolved manually, usually starting from the top by mapping and documenting the knowledge in people's heads. Interviews with application owners, data stewards, and data integration specialists will give users a fair amount of information about the movement of data through an organization. From here, lineage can be defined, usually in spreadsheets or other straightforward mapping mechanisms, to reflect what the subject matter experts have described. One downside to the approach is that information can be contradictory, or pieces can be missed if individuals are not properly consulted or interviewed. This can result in lineage that contains incorrect data and is unusable in real-world scenarios, with a resulting lack of trust in the lineage and the metadata.
In addition to interviews, a manual approach offers a chance to manually examine code, columns, and files. This can be tedious and requires a certain amount of skill and expertise, but it could provide a chance to catch errors in the code that may otherwise go uncaught. Further, despite the challenges of this approach, it cannot be sidelined, as it is often the initial approach used to gain insight into what is going on across an entire environment, and it is sometimes the only approach available when there is no code to read or no permission to access and profile the data, so that domain experts must be consulted.
As mentioned above, the manual process can miss individuals and the key pieces of process information or other related data those individuals possess. In the scenario of a manual code and data review, too few team members may have the expertise and skill required to make the exercise anything other than a tedious and long process. Due to the volume and complexity of much of the code, and the rate at which a given piece of code changes, the manual method also becomes unsustainable: manually managed data lineage falls out of sync with the actual data transfers in the environment, which again results in data that cannot be trusted.
The idea behind data tagging is that each piece of data being moved or transformed is tagged or labeled by a transformation engine that tracks that label from start to finish. This approach works best when there is a consistent transformation engine or a tool controlling every movement of the data. The approach is promising but often excludes anything that happens outside the walls of the selected engine or technology. Lineage reaches a dead end because the tags only exist in the closed system.
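The tagging idea can be illustrated with a toy transformation engine in which every value carries a set of origin labels. The `Tagged` class, the `read` and `transform` helpers, and the source names are all hypothetical; the sketch only shows how tags propagate inside the engine and are lost once data leaves it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tagged:
    """A value together with the labels of every source it was derived from."""
    value: float
    tags: frozenset


def read(source, value):
    """Label a value with its origin at the moment it enters the engine."""
    return Tagged(value, frozenset({source}))


def transform(fn, *inputs):
    """Apply fn to the raw values and carry the union of all input tags forward."""
    tags = frozenset().union(*(item.tags for item in inputs))
    return Tagged(fn(*(item.value for item in inputs)), tags)


price = read("erp.prices", 20.0)
quantity = read("shop.orders", 3)
revenue = transform(lambda p, q: p * q, price, quantity)
print(revenue.value, sorted(revenue.tags))

# Once the value leaves the engine as a plain number, the tags are gone:
# this is the dead end that the closed-system approach runs into.
exported = revenue.value
```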
Equally important is realizing the lineage is only there if the transformation logic is executed. In some systems, this method would not be an option because application developers and architects would not want to add formal data columns to the solution model at every touchpoint and for every transfer method applied along the way. One potential solution for these complexities and with the concept of tagging is blockchain, but it is not yet widespread enough to have an impact across the data lifecycle for most organizations.
Some departments have an all-in-one environment providing the necessary processing logic, lineage, and master data management. These tools are more common with newer big data and data lakes software. If this software is installed, it controls everything, including every data movement and every change in the data. This tool then tracks the lineage but remains exclusive to the controlled environment. The lineage built into such a system is therefore blind to transactions happening outside of it. And as new needs appear, along with new tools acquired to address those needs, gaps and dead ends in the lineage can appear.
A data lifecycle can be complex, heterogeneous, wild, and constantly evolving. The most effective way to manage all the lineage is to do it automatically. This means automatically, or programmatically, reading through all the logic and then understanding and reverse engineering it for complete end-to-end tracking. This requires a solution that understands the programming languages and tools used in an organization for data transformation and movement. Programming languages here can mean everything from graphical flow tools to Java, legacy solutions, XML solutions, and ETL reports, to name a few.
It can be difficult to build a solution sufficient to support a single language or tool. Increasing the overall challenge is the myriad of ways that tools and solutions support dynamic processing. An effective automated lineage solution has to account for input parameters, default values, and runtime information. To effectively automate the delivery of end-to-end lineage to an enterprise, all of these things have to be parsed.
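To illustrate why parameters and defaults matter, consider a job whose SQL is only fully known once its parameters are bound. The sketch below uses Python's `string.Template` as a stand-in for a tool's parameter syntax; the job text, parameter names, and default values are hypothetical.

```python
from string import Template

# Hypothetical job definition: the tables involved depend on parameter values.
JOB_SQL = Template("INSERT INTO ${schema}.sales SELECT * FROM ${schema}.staging_${run_date}")

# Hypothetical defaults declared alongside the job.
DEFAULTS = {"schema": "dev", "run_date": "latest"}


def resolve(template, defaults, runtime):
    """Bind parameters, letting runtime values override the declared defaults."""
    return template.substitute({**defaults, **runtime})


# The same job yields different lineage depending on how it was invoked.
print(resolve(JOB_SQL, DEFAULTS, {}))
print(resolve(JOB_SQL, DEFAULTS, {"schema": "prod", "run_date": "20240101"}))
```

A lineage tool that parsed only the unresolved template would not know which tables were actually read and written in a given run.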
There are, broadly speaking, two types of data lineage systems: active data lineage systems and passive data lineage systems. In an active system, developers program data pipelines to provide source and transformation information to the lineage system. The system acts as a repository for data lineage records and can provide access to that information through a visual interface or an API. These systems can be useful for non-query language-based data operations.
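One way to picture an active system is a pipeline step that reports its own inputs and output each time it runs. In this sketch the `lineage` decorator, the in-memory `LINEAGE_REPOSITORY` list, and the dataset names are hypothetical stand-ins for a real lineage service and its API.

```python
import functools

# Stand-in for the lineage system's record store.
LINEAGE_REPOSITORY = []


def lineage(inputs, output):
    """Decorator through which a pipeline step declares its sources and target."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            # Record lineage only after the step has actually run successfully.
            LINEAGE_REPOSITORY.append(
                {"inputs": list(inputs), "transform": func.__name__, "output": output}
            )
            return result
        return wrapper
    return decorator


@lineage(inputs=["files/raw_orders.csv"], output="warehouse.orders_clean")
def clean_orders(rows):
    """Drop rows that have no amount."""
    return [row for row in rows if row.get("amount") is not None]


cleaned = clean_orders([{"amount": 10}, {"amount": None}])
print(cleaned)
print(LINEAGE_REPOSITORY)
```

The cost of this design is the instrumentation itself: every step has to be annotated by a developer, which is the work a passive system tries to avoid.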
Passive data lineage systems are more suitable for SQL-like operations, as these systems allow lineage information to be identified by parsing the SQL statements captured from an operation log. This functionality can cut down on instrumentation work for analysts and developers. These systems are also more suitable for typical data warehouses, as they predominantly support SQL-based operations.
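A passive system can be sketched as a function that reads captured SQL statements and extracts the tables written and read. The statements in `OPERATION_LOG` are hypothetical, and the regular expressions are a simplification that ignores subqueries, common table expressions, and dialect differences; a production system would more likely rely on a full SQL parser.

```python
import re

# Hypothetical statements captured from a warehouse operation log.
OPERATION_LOG = [
    "INSERT INTO warehouse.sales SELECT o.id, c.region "
    "FROM staging.orders o JOIN staging.customers c ON o.cust_id = c.id",
    "CREATE TABLE report.revenue AS SELECT region, SUM(amount) "
    "FROM warehouse.sales GROUP BY region",
    "SELECT * FROM report.revenue",
]

# Simplifying assumptions: the written table follows INSERT INTO or CREATE TABLE,
# and every table that is read follows FROM or JOIN.
TARGET_RE = re.compile(r"\b(?:INSERT\s+INTO|CREATE\s+TABLE)\s+([\w.]+)", re.IGNORECASE)
SOURCE_RE = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)


def extract_lineage(statement):
    """Return the sources and target of a statement, or None if it writes nothing."""
    target = TARGET_RE.search(statement)
    if target is None:
        return None  # read-only statement, so no lineage edge is produced
    sources = sorted(set(SOURCE_RE.findall(statement)))
    return {"sources": sources, "target": target.group(1)}


for statement in OPERATION_LOG:
    print(extract_lineage(statement))
```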
There are various levels of granularity that can describe data lineage. The granularity is often determined by the use case for the data. For example, an enterprise using data lineage for governance may need less granularity than an enterprise wishing to have maximum fidelity in its data lineage. However, increased granularity comes with an increased cost for storage, which can, in turn, exceed the benefits of the details.
Data lineage granularity
In a data lineage system, the lineage data model is often represented as a tuple (I, T, O), where I is the source or set of inputs, T is the transformation or operation, and O is the sink or output.
The source (I) is the data entity that is the data source and the sink (O) is the target of the data. The source is often described with the attributes of type (also called kind) and identifier. A type is described so a user can identify the right sets of data readers to use, and an identifier describes the location to identify the source and destination in the data system described by the type or kind.
The transform instruction (T) records the processing steps used to manipulate the data source. The transform instruction varies by lineage granularity.
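The tuple of inputs, transform, and output can be written down as a small data structure. This is one possible encoding, assuming hypothetical class names (`Entity`, `LineageRecord`) and example identifiers; the transform is stored here as free text, although its actual form would vary with the lineage granularity.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Entity:
    """A source or sink, described by its type (kind) and an identifier."""
    kind: str        # tells a reader which data reader to use, e.g. "table" or "file"
    identifier: str  # location within the system named by kind


@dataclass(frozen=True)
class LineageRecord:
    """One lineage tuple: inputs (I), transform (T), and output (O)."""
    inputs: Tuple[Entity, ...]
    transform: str
    output: Entity

    def as_tuple(self):
        """Return the record in (I, T, O) order."""
        return (self.inputs, self.transform, self.output)


# Hypothetical record for a job that joins a table with a file.
record = LineageRecord(
    inputs=(Entity("table", "staging.orders"), Entity("file", "exports/customers.csv")),
    transform="join orders to customers on customer id",
    output=Entity("table", "warehouse.sales"),
)

sources, operation, sink = record.as_tuple()
print([entity.identifier for entity in sources], "->", sink.identifier)
```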
This is used to describe the data lineage tuple for a data warehouse system in combination with information components including reconcile time, transform job information, table-level lineage, and column-level lineage information.

