Delta Sharing is an open protocol for securely sharing data across organizations in real time, completely independent of the platform on which the data resides.
Databricks made several announcements at this week’s Data + AI Summit. Top among them was the launch of a new open-source project called Delta Sharing, the world’s first open protocol for securely sharing data across organizations in real time, completely independent of the platform on which the data resides. Delta Sharing is included within the open-source Delta Lake project and supported by Databricks and a broad set of data providers, including NASDAQ, ICE, S&P, Precisely, Factset, Foursquare, SafeGraph, and software vendors like AWS, Microsoft, Google Cloud, and Tableau.
The solution takes aim at a common industry problem: data sharing has become critical to the digital economy, as enterprises want to exchange data easily and securely with their customers, partners, and suppliers, such as a retailer sharing timely inventory data with each of the brands it carries. However, data sharing solutions have historically been tied to a single vendor or commercial product, tethering data access to proprietary systems and limiting collaboration between organizations that use different platforms.
In a call with RTInsights, Joel Minnick, Vice President of Marketing at Databricks, explained the rationale behind Delta Sharing. “What you’ve got now is a proliferation of a bunch of silos of data sharing networks that let folks share some of the data with some of the people some of the time. And it’s been this way since the ’80s, yet we still see new entrants into the market all the time, standing up new proprietary data sharing networks.”
He continued: “Our heritage, our roots are always in open source. This feels like a problem that could be solved in a really effective way if we approached it from an open point of view.”
He noted that Delta Sharing solves two problems. First, it is a fully open, secure protocol for sharing data, which removes proprietary lock-in. Second, most of the data-sharing networks and tools available today were built for sharing structured data: that is all they govern, and most of the time the only interface they expose is SQL.
Minnick noted that the data customers want to share is increasingly unstructured. For example, businesses frequently want to share images, videos, dashboards, and machine learning models.
Delta Sharing was built from the outset to support data science as well, providing governance for unstructured data and exposing itself not just through SQL but through Python. It can therefore meet the needs of data engineers, data analysts, and data scientists.
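For a sense of what that Python access looks like, here is a minimal sketch using the open-source delta-sharing connector; the profile file name and the share, schema, and table names below are placeholders for whatever a data provider actually issues.

```python
import delta_sharing

# A profile file issued by the data provider contains the sharing
# server endpoint and a bearer token for authentication.
profile = "config.share"  # placeholder file name

# Discover everything the provider has shared with this recipient.
client = delta_sharing.SharingClient(profile)
for table in client.list_all_tables():
    print(table.share, table.schema, table.name)

# Load one shared table directly into a pandas DataFrame.
# URL format: <profile-file>#<share>.<schema>.<table>
df = delta_sharing.load_as_pandas("config.share#retail.inventory.stock_levels")
print(df.head())
```

Because the protocol is open, the same share can be consumed from pandas, Apache Spark, or any other client that implements it, with no dependency on the provider's platform.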
These points were emphasized at the announcement. “The top challenge for data providers today is making their data easily and broadly consumable. Managing dozens of different data delivery solutions to reach all user platforms is untenable. An open, interoperable standard for real-time data sharing will dramatically improve the experience for both data providers and data users,” said Matei Zaharia, Chief Technologist and Co-Founder of Databricks. “Delta Sharing will standardize how data is securely exchanged between enterprises regardless of which storage or computing platform they use, and we are thrilled to make this innovation open source.”
The bottom line is that Delta Sharing extends the applicability of the lakehouse architecture that organizations are rapidly adopting today, as it enables an open, simple, collaborative approach to data and AI within and now between organizations.
Enhanced data management
Also at the summit, Databricks announced two data management enhancements to its lakehouse platform. They include:
Delta Live Tables, which is a cloud service in the Databricks platform that makes ETL (extract, transform, and load) pipelines easy and reliable on Delta Lake, helping ensure data is clean and consistent when used for analytics and machine learning (see the pipeline sketch after this list).
Unity Catalog, which simplifies governance of data and AI across multiple cloud platforms. Unity Catalog is based on industry-standard ANSI SQL to streamline implementation and standardize governance across clouds (a grant example also follows below). It also integrates with existing data catalogs, letting organizations build on what they already have and establish a future-proof, centralized governance model without expensive migrations.
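As a rough sketch of what a Delta Live Tables pipeline can look like in Python (the dlt decorators only execute inside a DLT pipeline, and the storage path and table names here are hypothetical):

```python
import dlt
from pyspark.sql.functions import col

# Bronze table: ingest raw JSON order events from cloud storage.
# The spark session is provided by the pipeline runtime.
@dlt.table(comment="Raw order events ingested from cloud storage.")
def orders_raw():
    return spark.read.format("json").load("/data/orders/raw")  # hypothetical path

# Silver table: the expectation drops any row that fails the
# data-quality constraint, keeping downstream data clean.
@dlt.table(comment="Orders with valid amounts and non-null identifiers.")
@dlt.expect_or_drop("valid_amount", "amount > 0")
def orders_clean():
    return dlt.read("orders_raw").where(col("order_id").isNotNull())
```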
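And a hedged illustration of Unity Catalog's ANSI SQL governance model, run here through spark.sql from a notebook; the catalog, schema, table, and group names are placeholders:

```python
# Unity Catalog expresses permissions in standard ANSI SQL, so the
# same GRANT/REVOKE statements work regardless of cloud.
spark.sql("GRANT SELECT ON TABLE main.sales.transactions TO `analysts`")

# Access is revoked symmetrically.
spark.sql("REVOKE SELECT ON TABLE main.sales.transactions FROM `analysts`")
```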
Bringing data and machine learning together
The company also announced the expansion of its machine learning (ML) offering with the launch of Databricks Machine Learning, a new purpose-built platform that includes two new capabilities: Databricks AutoML to augment model creation without sacrificing control and transparency, and Databricks Feature Store to improve discoverability, governance, and reliability of model features.
With Databricks Machine Learning, new and existing ML capabilities on the Databricks Lakehouse Platform are organized into a collaborative, role-based product surface. It gives ML engineers everything they need to build, train, deploy, and manage ML models from experimentation to production, uniquely combining data and the full ML lifecycle.
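To give a flavor of the AutoML piece, here is a minimal sketch of launching a classification experiment with the databricks.automl Python API; the feature table and target column are hypothetical:

```python
from databricks import automl

# Launch an AutoML classification run. AutoML explores models and
# hyperparameters, logging every trial to MLflow and generating an
# editable notebook per trial, so nothing is a black box.
summary = automl.classify(
    dataset=spark.table("sales.churn_features"),  # hypothetical feature table
    target_col="churned",
    timeout_minutes=30,
)

# The best trial's generated notebook and trained model stay
# inspectable, preserving control and transparency.
print(summary.best_trial.notebook_url)
print(summary.best_trial.model_path)
```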
The Databricks Feature Store streamlines ML at scale with simplified feature sharing and discovery. Machine learning models are built using features, the attributes a model uses to make a decision. Feature Store lets data teams reuse features across different models, avoiding rework and feature duplication, which can save months when developing new models. Features are stored in Delta Lake’s open format and can be accessed through Delta Lake’s native APIs.
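A minimal sketch of how that sharing works with the Feature Store Python client; the table name, primary key, and the customer_features_df DataFrame are hypothetical:

```python
from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()

# Register a feature table keyed by customer_id. Features are written
# in Delta Lake's open format and become discoverable by other teams.
fs.create_table(
    name="features.customer_features",  # hypothetical database.table name
    primary_keys=["customer_id"],
    df=customer_features_df,            # hypothetical DataFrame of computed features
    description="Aggregated customer activity features.",
)

# Any team can later read the same features back for a new model
# instead of re-deriving them from scratch.
reused_df = fs.read_table(name="features.customer_features")
```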