Hadoop struggles as enterprises explore newer, simpler alternatives for near real-time data access.
A funny thing appears to be happening on the way to Big Data nirvana. The assumption was that massive data lakes based on instances of the open source Hadoop framework would dominate the IT landscape, both on-premises and in the cloud. However, as more data became inexpensively available in the cloud, many organizations now appear to be forgoing Hadoop in favor of other, less complex approaches.
A collection of open-source software utilities, Apache Hadoop provides a storage and processing framework for networked computers solving problems involving massive amounts of data and computation. It was introduced in 2006.
Today, providers of Hadoop-based platforms are clearly struggling. Following the merger of Cloudera and Hortonworks, the combined company reported a loss of $103 million for the first quarter on sales of $187.5 million and replaced its CEO. MapR Technologies announced it is shutting down operations while looking for a buyer.
On the plus side, however, IBM and Cloudera have announced an alliance under which IBM will sell and support offerings from Cloudera. Previously, IBM had a similar relationship with Hortonworks. Now that relationship is being extended to the combined entity, says Daniel Hernandez, vice president of data and artificial intelligence at IBM.
“We don’t see any signs of customers slowing down,” says Hernandez.
Near Real-Time Driving New Approaches
Customers are not only processing more data than ever on-premises and in the cloud; they are also increasingly trying to access that data in near real-time, notes Hernandez. That requirement is driving organizations to make more extensive use of Apache Spark, the in-memory computing framework that is commonly bundled with Hadoop distributions, he adds.
At the same time, customers are embracing a wider variety of Big Data platforms, ranging from MongoDB document databases to relational databases that have been extended to handle much larger volumes of data, Hernandez acknowledges. IBM makes all those offerings available alongside the Hadoop distribution from Cloudera.
Meanwhile, at its annual conference this week, MongoDB launched Atlas Data Lake, which allows end users to directly query object-based storage on the Amazon Web Services (AWS) public cloud. The offering is positioned as a direct competitor to other Big Data platforms, says Seong Park, vice president of product marketing and developer relations at MongoDB.
“This approach makes it easier to scale out,” Park says.
Moving Away from One Primary Platform
Data may be the “new oil,” but the number of platforms relied upon to process that data has never been greater. The biggest issue many organizations will confront in the months ahead is simply training their staffs to manage polyglot data platforms. Multiple copies of the same data are also likely to reside in platforms employed for different use cases depending, for example, on the latency requirements of the application.
Regardless of which vendor supports the underlying platform, platforms such as Hadoop are not going to fade away any time soon. Rather, the idea that there can be only one primary platform will yield to a much more nuanced approach that puts a premium on data management processes applied across a much wider variety of platforms.