Apache Hudi provides tools to ingest data into HDFS or cloud storage, and is designed to get data into the hands of users and analysts quickly.
At a busy, data-intensive enterprise such as Uber, the volumes of real-time data that need to move through its systems on a minute-by-minute basis reach epic proportions. This calls for a data lake extraordinaire, in which data can immediately be extracted and leveraged across a range of functions, from back-end business applications to front-end mobile apps. Uber depends on up-to-the-minute bookings and alerts as part of its appeal to customers, so its reliance on real-time data streaming platforms is off the charts. It has turned to Apache Hudi, an emerging platform that brings stream processing to big data, providing fresh data while being an order of magnitude more efficient than traditional batch processing.
I recently had the opportunity to moderate a webcast about Apache Hudi with Nishith Agarwal and Sivabalan Narayanan, both engineers with Uber and active members of the Hudi project management committee (PMC).
The Hudi data lake project was originally developed at Uber in 2016, open-sourced in 2017, and submitted to the Apache Incubator in January 2019. Apache Hudi data lake technology enables stream processing on top of Apache Hadoop-compatible cloud stores and distributed file systems. The solution provides tools to ingest data onto HDFS or cloud storage, as well as an incremental approach to resource-intensive ETL, Hive, or Spark jobs. It is designed to get data into the hands of users and analysts much more quickly.
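For readers who want a concrete picture of that ingestion path, the sketch below uses Hudi's Spark DataSource support to upsert a batch of records into a table on HDFS or a cloud store. The table name, field names, and paths are illustrative placeholders, not anything from Uber's actual setup.

```scala
// Minimal sketch: upserting a DataFrame into a Hudi table via the Spark DataSource API.
// Table name, record key, and paths below are hypothetical.
import org.apache.spark.sql.SaveMode

val trips = spark.read.json("/staging/trips/latest")   // placeholder input location

trips.write
  .format("hudi")
  .option("hoodie.table.name", "trips")
  // The record key and precombine field let Hudi upsert changed records in place
  .option("hoodie.datasource.write.recordkey.field", "trip_id")
  .option("hoodie.datasource.write.precombine.field", "event_ts")
  .option("hoodie.datasource.write.partitionpath.field", "city")
  .option("hoodie.datasource.write.operation", "upsert")
  .mode(SaveMode.Append)
  .save("hdfs:///data/hudi/trips")   // works the same against s3://, gs://, etc.
```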
At Uber, “Hudi powers many different use cases,” says Agarwal, noting that the company’s enterprise data lake is built on Hudi. “We have about 250 petabytes of data that’s managed by the data lake platform. The kinds of use cases that it enables are, for example, whenever you build machine learning pipelines. One of the challenges is, if data is changing upstream and I want to update my feature set, how do I update my feature set without actually reading the entire data and re-snapshotting it? That becomes a really costly process. For example, the data models for UberEats are massive, hundreds and hundreds of terabytes, and consuming that data becomes tricky. One of the ways where Hudi is being employed is to make all of this incremental, with all of these primitives.”
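The incremental primitive Agarwal describes is exposed as an incremental query type: instead of re-reading and re-snapshotting the whole table, a job asks Hudi only for records committed after a given instant. A minimal sketch, again with a hypothetical table path and commit time:

```scala
// Minimal sketch: pull only the records that changed after a given commit,
// rather than rescanning the full table. Instant time and path are placeholders.
val beginInstant = "20200601000000"   // commit time to start from (exclusive)

val changed = spark.read
  .format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", beginInstant)
  .load("hdfs:///data/hudi/trips")

// Only the changed records flow into the downstream feature-building job
changed.createOrReplaceTempView("trips_changed")
```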
Another use case is around managing earnings data, Agarwal continues. “As we go through all of the business use cases that Uber has, exposing different data to different customers and different users, how do we do that in an efficient way? How do you point out exactly where the data lies and then be able to expose this data again at the record level, indexing all of these things? Hudi helps immensely in those kinds of use cases.”
Going forward, Agarwal anticipates tighter integration with other streaming platforms such as Kafka. “Generally, Hudi will connect to Kafka directly and pull streams. Kafka Streams itself is also an execution framework, like Apache Flink, but has some custom semantics, and right now there is no support for running Hudi on Kafka Streams, but we are looking at providing connectors that may be able to do that.”
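Today, that direct pull from Kafka is usually wired up either through Hudi's DeltaStreamer utility, which tails a topic for new data, or with a Spark Structured Streaming job that writes into a Hudi table. The sketch below shows the latter pattern; the broker address, topic, schema, and paths are placeholder assumptions.

```scala
// Minimal sketch: streaming a Kafka topic into a Hudi table with Spark Structured Streaming.
// Broker, topic, schema, and paths are hypothetical.
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{LongType, StringType, StructType}

val schema = new StructType()
  .add("trip_id", StringType)
  .add("event_ts", LongType)
  .add("city", StringType)

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "trip-events")
  .load()
  .select(from_json(col("value").cast("string"), schema).as("v"))
  .select("v.*")

events.writeStream
  .format("hudi")
  .option("hoodie.table.name", "trips")
  .option("hoodie.datasource.write.recordkey.field", "trip_id")
  .option("hoodie.datasource.write.precombine.field", "event_ts")
  .option("hoodie.datasource.write.partitionpath.field", "city")
  .option("checkpointLocation", "/tmp/checkpoints/trips")
  .outputMode("append")
  .start("hdfs:///data/hudi/trips")
```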