Starburst adds a data catalog, high-speed indexing and Python support to its distributed query engine

Starburst Data Inc., which sells a commercial distribution of the Trino distributed SQL query engine, used its third annual Datanova conference today to announce updates that it says significantly speed the performance of its engine while reducing barriers to the ability of users to find data.

The company also announced a private preview of a line of low-code tools it is building for creating, sharing and curating data products as part of a distributed data mesh. A data mesh is an emerging concept that vests ownership of data in the people who create it, with data managed with the same care and attention as a product.

Trino, which is a fork of the open-source Presto distributed query engine, supports analytics across a distributed data fabric regardless of where the data is located. A new automated data catalog can search and discover data across sources in the company’s Starburst Galaxy cloud service. It automatically creates metadata from roles, user queries and other user actions such as adding a new dataset, the company said.

Schema Discovery can be run on the file systems of all three major cloud platform providers with new files available on demand as soon as they are added, said Vishal Singh, head of data products at Starburst. Files can be searched by such criteria as creation date, ownership and usage within the business, he said.

The catalog complements previously announced schema discovery and data privilege capabilities aimed at streamlining the extract/transform/load or ETL process. It can automatically add metadata such as data ownership details to make it easier for users to find and obtain permission to use data. The catalog can also be populated with information about the source of data and how it’s used by other applications at the schema, table and view levels.

Auto-populating catalog

Singh drew an analogy to what happens when a user creates a Google Doc. “The information about who owns the doc gets automatically populated and you can request permissions from that person to get access,” he said. “We are doing a similar concept where as soon as the user creates a table that user becomes the owner of the table and can grant privileges to give to other people or domains.”

The discovery, permission and catalog features are collectively intended to bring a cloud marketplace experience to the process of finding and using data products, Starburst said. “All that information is now being packaged up in a way that data engineers can expose it to data consumers and data consumers can find information without jumping through multiple hoops,” Singh said.

Starburst isn’t positioning the feature as a competitor to enterprise data catalogs and will integrate with the other major players through APIs, Singh said.

Native Python support

Starburst is also announcing that it has opened up the development environments for both its on-premises and cloud products to the Python programming language, a favorite of data scientists. Users can migrate workloads built in PySpark, the Python application program interface to the Apache Spark analytics framework, to Starburst and Trino without rewriting code.

Python support eliminates the need for developers to include SQL functions within their Python code, Singh said. “We can now use the Python function to generate the query for Trino,” said Singh, who estimated that nearly all of the company’s customers use at least some Python.
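The idea Singh describes can be sketched in a few lines. This is an illustrative example only, not Starburst's actual API: a hypothetical helper function composes the SQL that gets sent to Trino, so the Python code contains no hand-written SQL strings.

```python
# Illustrative sketch only; build_trino_query is a hypothetical helper,
# not part of Starburst's or Trino's actual Python tooling. It shows the
# general pattern of a Python function generating the query for the engine.
def build_trino_query(table, columns, where=None, limit=None):
    """Assemble a simple SELECT statement for Trino from Python values."""
    sql = f"SELECT {', '.join(columns)} FROM {table}"
    if where:
        sql += f" WHERE {where}"
    if limit is not None:
        sql += f" LIMIT {limit}"
    return sql

query = build_trino_query("sales.orders", ["order_id", "total"],
                          where="total > 100", limit=10)
print(query)
# SELECT order_id, total FROM sales.orders WHERE total > 100 LIMIT 10
```

The resulting string would then be submitted through whatever client connection the deployment uses; the point is that the developer stays in Python throughout.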

Finally, the company is adding smart indexing and caching to its products with a capability it calls Warp Speed. The feature, which will be generally available in the Starburst Enterprise on-premises product by end of February and is in a private preview stage in the Starburst Galaxy cloud, is claimed to accelerate queries up to sevenfold.

Warp Speed indexing autonomously identifies and caches the most-used or most-relevant data based on usage pattern analysis while the rest of the data is kept close to the source. That eliminates the need to manually select which data is kept in the data lake and which is optimized and cached. Multiple databases can function as one, eliminating the need to manually join different systems before query and analysis.
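Warp Speed's internals are proprietary, but the core idea of caching based on usage-pattern analysis can be sketched at a toy scale. The class below is a hypothetical illustration, assuming a simple access-frequency heuristic: count how often each column is touched by queries and cache only the hottest ones, leaving everything else at the source.

```python
from collections import Counter

# Illustrative sketch only: Warp Speed's real implementation is not public.
# This toy model caches the most frequently accessed columns up to a fixed
# capacity, based on observed query activity.
class UsageBasedCache:
    def __init__(self, capacity):
        self.capacity = capacity      # max number of columns to cache
        self.accesses = Counter()     # column name -> access count

    def record_access(self, column):
        """Note that a query touched this column."""
        self.accesses[column] += 1

    def columns_to_cache(self):
        """Pick the most frequently accessed columns, up to capacity."""
        return [col for col, _ in self.accesses.most_common(self.capacity)]

cache = UsageBasedCache(capacity=2)
for col in ["price", "price", "sku", "price", "sku", "region"]:
    cache.record_access(col)
print(cache.columns_to_cache())  # the two hottest columns: ['price', 'sku']
```

A production system would also weigh recency, data size and index selectivity, but the automation angle is the same: the user never hand-picks what to cache.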

The technology came from last year’s acquisition of data lake analytics accelerator Varada Ltd. “We’ve been working steadily since then to integrate that solution fully within our commercial offerings,” said Alison Huselid, senior vice president of product at Starburst.

“The new feature automatically chooses which data to index and to cache based on the workload patterns,” Huselid said. “Customers can turn this on and start to see a lot of performance improvements.” The feature is optional and best used on highly repeatable workloads, she added.

Photo: Wikimedia Commons
