Databricks adds data lineage feature to its catalog with support for non-traditional uses

Databricks Inc. today is adding data lineage features to its Unity Catalog governance platform, a move that it says significantly expands data governance capabilities on the hybrid of data warehouse and data lake that it calls a lakehouse.

Data lineage describes how data flows throughout an organization, giving customers the ability to see where lakehouse data came from, who created it and when, how it was modified over time and how it’s currently being used, among other details. The feature is now available for preview on the Amazon Web Services Inc. and Microsoft Corp.’s Azure clouds.

The feature helps organizations cope with the growing volume and variety of data coming in from multiple sources, and keep track of how that data moves and changes, who has access to it and how it’s used. Databricks says it’s bringing an updated approach to the process and that adding the feature required modifying the core database engine to accommodate nonstandard use cases such as machine learning models.

“Understanding how data flows through the organization is fundamental to being able to trust your data,” said Joel Minnick, Databricks’ vice president of marketing. “We’re going back to the core principle of the Unity Catalog, which is not just trying to govern tables and files but also modern assets like dashboards, notebooks and models.”

Lifecycle view

Data lineage enables data management teams to see all downstream functions that are affected by data changes, including applications, dashboards, machine learning models and data sets, and to understand the severity of the impact so stakeholders can be notified. “The minute data comes into the lakehouse, we start to track it,” Minnick said. Metadata that travels with data elements, such as the author and creation date, is also imported.
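To illustrate the idea of downstream impact, here is a minimal Python sketch that treats lineage as a directed graph and walks it breadth-first from a changed asset. It is not Databricks code and does not use any Unity Catalog API; the asset names, the LINEAGE mapping and the downstream_assets helper are all hypothetical and exist only for this example.

```python
from collections import deque

# Hypothetical lineage edges: each key feeds the assets listed as its value.
# These names are illustrative only and are not Unity Catalog objects or APIs.
LINEAGE = {
    "raw.orders": ["clean.orders"],
    "clean.orders": ["reports.daily_sales", "ml.churn_features"],
    "reports.daily_sales": ["dashboard.sales_overview"],
    "ml.churn_features": ["model.churn_predictor"],
}


def downstream_assets(lineage, source):
    """Return every asset reachable from `source`, in breadth-first order."""
    seen = set()
    order = []
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:  # skip assets already reached by another path
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


# Which assets could be affected if clean.orders changes?
affected = downstream_assets(LINEAGE, "clean.orders")
print(affected)
```

In this toy graph, a change to clean.orders reaches a report, a feature table, a dashboard and a model, which is the kind of list a data team would use to decide whom to notify.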

The feature also helps organizations meet compliance rules more easily because of improved traceability, Databricks said. “We capture all the data we can see at a pretty fine-grained level of detail: who created it, what changes were made, when was it changed, what pipelines it was used in and who has access to it,” Minnick said. “Ultimately, if you share that data, we can also see who it is shared with.”

Data lineage enables data consumers such as data scientists, data engineers and data analysts to conduct context-aware analysis. Data stewards can see which data sets are no longer accessed or have become obsolete so stale or unnecessary data can be removed to improve overall data quality.

Key features of Unity Catalog include automated run-time lineage to capture all lineage generated in Databricks, which is more accurate and efficient than manual tagging. Information is captured for tables, views and columns to give a granular picture of upstream and downstream data flows. Additionally, lineage works across all languages supported by Databricks, including SQL, Python, R and Scala, as well as notebooks, workflows and dashboards.

Databricks aims to make the capability available across all the cloud platforms it supports, Minnick said.

Photo: Robert Hof/SiliconANGLE
