Modern data warehouse design often involves new platforms that can deal with new sources of unstructured and real-time data, as well as use of Hadoop.
Most business analysts have spent their careers dealing exclusively with structured data in relational database management systems, but that’s no longer the case.
There are now streams of data coming from every direction: social media, IoT sensors, financial transactions, ecommerce, and web applications.
“We’re starting to see a lot of people modernize so they can capture streaming data,” said Phillp Russom, an industry analyst at TDWI, during a recent webinar. He described a “fork in the road” where business intelligence professionals now must deal with unstructured data.
To deal with streaming and unstructured data, businesses are acquiring a broader range of platforms, Russom said. “That doesn’t mean that your data warehouse platform is going away — not at all. It’s still there for the relational data, dimensional data, metrics, and KPIs,” he said. Rather, “we’re seeing people adding additional platforms into the extended warehouse environment.”
To Upgrade or Not
Often the new data warehousing platforms are upgrades to the existing database and integration software, and updates to everything around it such as reporting and analytics tools.
“For a lot of people modernization is more about getting caught up,” Russom said. “And if you’re far behind, getting caught up seems like an innovation.”
In other cases, however, a business may want to add more functionality through an advanced platform that offers complex event processing, streaming analytics (whether for IoT, financial, or web data); or natural language processing for text analytics (useful in sentiment analysis/opinion mining).
According to a TDWI best practices report, 42 percent of businesses plan to augment their existing data warehouse in the next three to four years, whereas 15 percent plan to replace the data warehouse entirely. That means the systems architecture of the average data warehouse will look very different in the future.
Respondents to the TDWI survey stated that data warehouse modernization is being pursued for a variety of reasons, but there was a clear trend towards real-time data and operational efficiency:
“A lot of people are going heavily into visualization — that’s their preferred analytic method,” Russom said. Other trends include SQL-based analytics, data mining, graph analytics, and NLP.
Russom also spoke of the use of data exploration, which is at an “all-time high,” and the use of “analytic sandboxes” where analysts and data scientists can perform ad-hoc queries on data. “Before you can get full business value from a decision-making standpoint out of new data sources you have to explore it,” he said.
A Massive Move to Hadoop
Users seem to be leaving older relational systems for newer ones based on racks, grids, or appliances, and in some cases may stay with the same vendor, according to TDWI. One trend is to keep the older relational DBMS and optimize it for enterprise reporting while using the older one for data exploration and analytics.
In a minority of cases, however, the data warehouse may be entirely ripped and replaced. TDWI found this was the method being pursued by nine to 15 percent of survey respondents.
“I’ve got mixed feelings about rip-and-replace because it’s so highly disruptive to the business,” Russom said. “And also it’s just disruptive to the other work of a data warehouse professional.”
For architectural purposes, many users have already integrated Hadoop clusters and tools into their data warehouse environments, according to the TDWI report. Hadoop’s massive and cheap storage off-loads older systems by taking responsibility for data staging, ELT pushdown, and archiving of detailed source data (retained for advanced analytics). Hadoop also serves as a massively parallel execution engine for a wide variety of set-based and algorithmic analytics methods.
While 78 percent of users do not have Hadoop in their data warehouse environment, that number will fall to less than 15 percent in the next three years as businesses deploy the platform for data warehousing, analytics, and integration.
Hadoop usually complements a data warehouse without replacing it, according to the report, but in some organizations Hadoop will eclipse the data warehouse as the primary platform. For example, several Internet firms have multiple Hadoop clusters each managing petabytes of data in support of numerous analytics applications that operate on Internet-scale data sets, and many utilities, telecommunications companies, and federal intelligence agencies are headed in that direction.
Just a few vendors that support Hadoop for data warehouse modernization include IBM; Pentaho; SAP, and SAS.
While Hadoop can be used for real-time analytics and operational intelligence – as well as an operational data store — another trend (not mentioned in the TDWI report) is use of Apache Spark in concert with Hadoop for even faster processing.
Priorities for Modern Data Warehouse Design
TDWI advises organizations to embrace change and make realignment with business goals the top two priorities.
“It’s unlikely the status quo of your current warehouse will serve you and your organization for much longer so you have to embrace change,” Russom said. If companies modernize data to the point where they didn’t have it before – such as real-time or unstructured data – “that actually enables them to innovate.”
However, TDWI also advises to make analytics a priority and to reduce the temptation for new platforms just because they are new and exotic.
“Before you adopt one of these shiny objects, be sure that it goes beyond the bling to also satisfy real world requirements, in a high-performance and cost-effective manner,” Russom said.