Databricks Inc. today is adding data lineage features to its Unity Catalog governance platform, a move that it says significantly expands data governance capabilities on the combined data warehouse and data lake that it calls a lakehouse.
Data lineage describes how data flows throughout an organization, giving customers the ability to see where lakehouse data came from, who created it and when, how it was modified over time and how it’s currently being used. The feature is now available in preview on Amazon Web Services Inc.’s cloud and Microsoft Corp.’s Azure.
The feature helps organizations cope with the growing volume and variety of data coming in from multiple sources by tracking how that data moves and changes, who has access to it and how it’s used. Databricks says it’s bringing an updated approach to the process and that adding the feature required modifying the core database engine to accommodate nonstandard use cases such as machine learning models.
“Understanding how data flows through the organization is fundamental to being able to trust your data,” said Joel Minnick, Databricks’ vice president of marketing. “We’re going back to the core principle of the Unity Catalog, which is not just trying to govern tables and files but also modern assets like dashboards, notebooks and models.”
Data lineage enables data management teams to see all downstream functions that are affected by data changes, including applications, dashboards, machine learning models and data sets, and to understand the severity of the impact so stakeholders can be notified. “The minute data comes into the lakehouse, we start to track it,” Minnick said. Metadata that travels with data elements, such as the author and creation date, is also imported.
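To illustrate the idea of downstream impact analysis, here is a minimal sketch in Python that treats lineage as a directed graph and walks it breadth-first from a changed asset. The asset names, the `LINEAGE` mapping and the `downstream_assets` function are hypothetical and chosen for illustration; this is not the Unity Catalog API, and Databricks has not described its implementation in this form.

```python
from collections import deque

# Hypothetical lineage edges: each asset maps to the assets that read from it.
# These names are illustrative only and are not Unity Catalog objects or APIs.
LINEAGE = {
    "raw.orders": ["clean.orders"],
    "clean.orders": ["sales_dashboard", "churn_model"],
    "churn_model": ["retention_report"],
}


def downstream_assets(graph, source):
    """Return every asset reachable from `source`, in breadth-first order."""
    seen = {source}
    queue = deque([source])
    affected = []
    while queue:
        current = queue.popleft()
        for child in graph.get(current, []):
            if child not in seen:  # guard against revisiting shared or cyclic paths
                seen.add(child)
                affected.append(child)
                queue.append(child)
    return affected


if __name__ == "__main__":
    # Lists the table, dashboard, model and report that depend on raw.orders.
    print(downstream_assets(LINEAGE, "raw.orders"))
```

In this sketch, a change to `raw.orders` would flag the cleaned table, the dashboard, the model and the report built on the model, which is the kind of list a team could use to decide whom to notify.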
The feature also helps organizations meet compliance rules by improving traceability, Databricks said. “We capture all the data we can see at a pretty fine-grained level of detail: who created it, what changes were made, when was it changed, what pipelines it was used in and who has access to it,” Minnick said. “Ultimately, if you share that data, we can also see who it is shared with.”
Data lineage enables data consumers such as data scientists, data engineers and data analysts to conduct context-aware analysis. Data stewards can see which data sets are no longer accessed or have become obsolete, so stale or unnecessary data can be removed to improve overall data quality.
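As a rough illustration of how a steward might flag unused data sets from access records, the following Python sketch filters on a last-access date. The `stale_datasets` function, the 90-day threshold and the sample data are assumptions made for this example; they do not reflect how Unity Catalog stores or exposes access information.

```python
from datetime import date, timedelta


def stale_datasets(last_access, today, max_idle_days=90):
    """Return names of data sets not read within `max_idle_days`, sorted by name.

    `last_access` maps a data set name to the date of its most recent read,
    or to None if it has never been read.
    """
    cutoff = today - timedelta(days=max_idle_days)
    return sorted(
        name
        for name, last_read in last_access.items()
        if last_read is None or last_read < cutoff
    )


# Hypothetical access records, for illustration only.
LAST_ACCESS = {
    "clean.orders": date(2022, 6, 1),
    "legacy.events": date(2021, 11, 15),
    "tmp.export": None,
}

if __name__ == "__main__":
    print(stale_datasets(LAST_ACCESS, today=date(2022, 6, 8)))
```

The threshold is a policy choice; a real review would likely also check lineage for downstream consumers before anything is removed.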
Key features of Unity Catalog include automated run-time lineage to capture all lineage generated in Databricks, which is more accurate and efficient than manual tagging. Information is captured for tables, views and columns to give a granular picture of upstream and downstream data flows. Additionally, lineage works across all languages supported by Databricks, including SQL, Python, R and Scala, as well as notebooks, workflows and dashboards.
Databricks aims to make the capability available across all the cloud platforms it supports, Minnick said.