With IoT, AI, and machine learning initiatives, establishing an enterprise data lake has become critical. What are the important issues to consider?
Organizations are looking to leverage modern analytics such as AI, machine learning, and streaming analytics to gain a competitive advantage. To accomplish this, they must establish a physical or logical foundation that integrates the enterprise's disparate data along with the proliferating real-time streaming data, both IT (transactional) and OT (operational), which is arriving in ever greater volume and variety.
Cloud and hybrid data lakes are increasingly becoming the primary platform on which data architects can harness big data and enable analytics for data scientists, analysts and decision makers. As the speed of business accelerates and insights become increasingly perishable, the need for real-time integration with the data lake becomes critically important to business operations.
See also: Building a smart data lake while avoiding the “dump”
Recent research conducted by TDWI found that approximately one quarter (23%) of organizations surveyed already have a production data lake, and another quarter (24%) expect to have a data lake in production within one year. More enterprises are turning to data lakes – both on-premises and in the cloud – as the preferred repository for storing and processing data for analytics.
For effective data ingestion pipelines and successful data lake implementation, here are six guiding principles to follow.
#1: Architecture in motion
The architecture will likely include more than one data lake and must be adaptable to address changing requirements. For example, a data lake might start out on-premises with Hadoop and then be moved to the cloud or a hybrid platform based on object stores from Amazon Web Services, Microsoft Azure, or Google Cloud to complement on-premises components.
These platforms may also introduce new architectural patterns, such as the Lambda and Kappa architectures. The Lambda architecture combines a batch-processing layer (often based on MapReduce and Hive) with a "speed layer" (Apache Storm, Spark Streaming, etc.), along with change data capture (CDC) technology that minimizes latency and provides real-time data feeds that can also be incorporated into the batch layer. A Kappa architecture, by contrast, treats everything as a stream and requires integration across multiple streaming tools and streaming applications on top of the Hadoop infrastructure.
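To make the speed layer concrete, here is a minimal sketch in PySpark: Spark Structured Streaming consumes CDC events from a Kafka topic and lands them in the lake as Parquet, where a batch layer such as Hive can reprocess them later. The broker address, topic name, and storage paths are illustrative assumptions, not prescribed values.

```python
# A minimal sketch of a Lambda-style "speed layer": Spark Structured Streaming
# reads change events from a Kafka topic and lands them in the data lake as
# Parquet files for the batch layer to reprocess. Names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("speed-layer").getOrCreate()

cdc_events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # assumed broker address
    .option("subscribe", "orders.cdc")                   # assumed CDC topic
    .load()
    .select(col("key").cast("string"), col("value").cast("string"), "timestamp")
)

query = (
    cdc_events.writeStream
    .format("parquet")
    .option("path", "s3a://datalake/raw/orders/")             # assumed lake location
    .option("checkpointLocation", "s3a://datalake/checkpoints/orders/")
    .outputMode("append")
    .start()
)
query.awaitTermination()
```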
To best handle constantly changing technology and patterns, IT should design an agile architecture based on modularity. Keep in mind that key processes related to the data lake architecture include data ingestion, data streaming, change data capture, transformation, data preparation, and cataloging.
#2: Data in motion
For data lakes to support real-time analytics, the data ingestion capability must be designed to recognize different data types and multiple SLAs. Some data might only require batch or micro-batch processing, while other data requires stream-processing tools or frameworks to analyze data in motion. Some data sources were built to be streamed, such as IoT sensors and edge devices, but core transactional systems were not. Change data capture plays a vital role in creating data streams from transactional systems based on relational database management systems (RDBMS), mainframes, or complex applications such as SAP.
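As a simplified illustration of what a change stream from an RDBMS can look like downstream, the sketch below interprets Debezium-style change events. The "op"/"before"/"after" envelope is an assumption for illustration; other CDC tools use different formats.

```python
# A simplified sketch of applying Debezium-style CDC events from an RDBMS.
# Field names ("op", "before", "after") follow the Debezium convention;
# other CDC tools use different envelopes.
import json

def apply_change_event(raw_event: str, table_state: dict) -> None:
    """Apply a single insert/update/delete event to an in-memory table keyed by id."""
    event = json.loads(raw_event)
    op = event["op"]                 # "c" = create, "u" = update, "d" = delete
    if op in ("c", "u"):
        row = event["after"]
        table_state[row["id"]] = row
    elif op == "d":
        row = event["before"]
        table_state.pop(row["id"], None)

# Example: replay a handful of change events captured from a source table.
state = {}
events = [
    '{"op": "c", "before": null, "after": {"id": 1, "status": "new"}}',
    '{"op": "u", "before": {"id": 1, "status": "new"}, "after": {"id": 1, "status": "shipped"}}',
]
for e in events:
    apply_change_event(e, state)
print(state)   # {1: {'id': 1, 'status': 'shipped'}}
```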
To meet the architecture-in-motion principle described above, IT teams should look for the ability to support a range of technologies, such as Apache Kafka, Hortonworks DataFlow (HDF), Amazon Kinesis, Azure Event Hubs, or MapR Streams, as needed. Additionally, all replicated data needs to be moved securely, especially when sensitive data is being moved to a cloud-based data lake. Robust encryption and security controls are critical to meeting regulatory compliance, company policy, and end-user security requirements.
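A minimal sketch of the secure-movement point, assuming Kafka as the transport and the kafka-python client: the producer connects over TLS, so replicated records are encrypted in transit. The hostname, topic name, and certificate paths are placeholders.

```python
# A minimal sketch of moving replicated data over an encrypted channel,
# here with the kafka-python client and TLS. Broker hostname, topic name,
# and certificate paths are placeholders.
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="broker.example.com:9093",  # assumed TLS listener
    security_protocol="SSL",
    ssl_cafile="/etc/certs/ca.pem",               # CA that signed the broker cert
    ssl_certfile="/etc/certs/client.pem",         # client certificate (mutual TLS)
    ssl_keyfile="/etc/certs/client.key",
    value_serializer=lambda v: v.encode("utf-8"),
)

producer.send("cdc.customers", value='{"id": 42, "op": "u"}')
producer.flush()
```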
#3: Data structures matter
The data lake runs the risk of becoming a murky quagmire if there is no easy way for users to access and analyze this data. Applying technologies like Hive on top of Hadoop helps to provide a SQL-like query language that is supported by virtually all analytics tools. Ideally, an organization would provide both an operational data store (ODS) for traditional BI and reporting and a comprehensive historical data store (HDS) for advanced analytics.
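As a rough sketch of that idea, the example below (using PySpark with Hive support enabled) registers Parquet files landed in the lake as a Hive-compatible external table and queries it with SQL. The database, table, and path names are illustrative.

```python
# A brief sketch of exposing lake files through Hive-compatible SQL so that BI
# and analytics tools can query them. Table and path names are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("lake-sql")
    .enableHiveSupport()          # use the Hive metastore for table definitions
    .getOrCreate()
)

# Register the raw Parquet files landed by the ingestion pipeline as an
# external table in the metastore, then query it like any other SQL source.
spark.sql("CREATE DATABASE IF NOT EXISTS raw")
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS raw.orders (
        id BIGINT, status STRING, updated_at TIMESTAMP
    )
    STORED AS PARQUET
    LOCATION 's3a://datalake/raw/orders/'
""")

spark.sql("SELECT status, COUNT(*) FROM raw.orders GROUP BY status").show()
```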
Organizations need to think about the best approach to building and managing these stores, so they can deliver the agility needed by the business. Key questions include:
- How can we manage continuous data updates and merge those changes into Hive? (A sketch of one common approach appears after this list.)
- How can we implement this approach without manually scripting the transformations, while remaining resilient to changes in source data structures?
- How can we implement an automated approach?
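One common way to answer the first question, sketched below under simplifying assumptions (the historical table and the incoming CDC batch share the same schema, including an "op" column and an "updated_at" timestamp), is to keep only the latest version of each key, drop deleted keys, and rewrite the reconciled table.

```python
# A sketch of reconciling a batch of CDC changes into the historical store:
# keep only the newest version of each key and drop deletes. Table names,
# column names, and paths are illustrative assumptions.
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, row_number

spark = SparkSession.builder.appName("cdc-merge").enableHiveSupport().getOrCreate()

# Current historical store and the newly landed CDC rows; both are assumed
# to share the schema (id, status, updated_at, op).
base = spark.table("hds.orders")
changes = spark.read.parquet("s3a://datalake/raw/orders/")

# For each key, rank rows so the most recent change comes first.
latest_per_key = Window.partitionBy("id").orderBy(col("updated_at").desc())

merged = (
    base.unionByName(changes)
    .withColumn("rn", row_number().over(latest_per_key))
    .filter(col("rn") == 1)          # newest record per id wins
    .filter(col("op") != "d")        # drop keys whose latest change is a delete
    .drop("rn")
)

# Write to a new table rather than overwriting the table being read.
merged.write.mode("overwrite").saveAsTable("hds.orders_reconciled")
```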
Consider the skill sets of the IT team, estimate the resources required, and develop a plan to either fully staff the project or use a technology that can reduce the skill and resource requirements without compromising the ability to deliver.
#4: Scale matters
Data ingestion processes should minimize any impact on your core transactional systems, regardless of increasing data volumes and the diversity of target systems.
When organizations have hundreds or thousands of data sources, that volume affects implementation time, development resources, ingestion patterns, the IT environment, maintainability, operations, management, governance, and control. Organizations find that automation reduces time and staffing requirements and lets the team focus on scaling considerations and management methods. Too often, environmental issues spawn too many parallel workstreams and derail progress; laying the foundational tools and strategy first alleviates that problem.
Other best practices include implementing an efficient ingestion process, avoiding the installation of software agents on each source system, and using a centralized task and source management system.
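A toy sketch of that last point: a single source catalog drives ingestion centrally, so nothing is installed on the source systems themselves. The source names, connection strings, and dispatch logic are purely illustrative.

```python
# A toy sketch of centralized task and source management: one catalog of
# sources drives ingestion, with no agents installed on the sources.
# Connection strings and table names are placeholders.
SOURCE_CATALOG = [
    {"name": "erp_orders",   "type": "jdbc",  "url": "jdbc:postgresql://erp/prod",   "table": "orders"},
    {"name": "crm_contacts", "type": "jdbc",  "url": "jdbc:oracle:thin:@crm:1521/x", "table": "contacts"},
    {"name": "web_clicks",   "type": "kafka", "topic": "clickstream"},
]

def ingest(source: dict) -> None:
    """Dispatch each catalog entry to the appropriate, agentless ingestion path."""
    if source["type"] == "jdbc":
        print(f"pulling {source['table']} from {source['url']} via JDBC")
    elif source["type"] == "kafka":
        print(f"subscribing to topic {source['topic']}")

for src in SOURCE_CATALOG:
    ingest(src)
```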
#5: Breadth matters
Data architects must plan for many sources, many targets, and hybrid or varying architectures. The most successful approach is to standardize on one data ingestion tool that is agnostic to sources and targets and can meet both today's needs and tomorrow's. The solution should also be certified for the environments you plan to deploy to, to ensure interoperability.
#6: Depth matters
Whenever possible, organizations should adopt specialized technologies to integrate data from mainframe, SAP, cloud, and other complex environments.
For example, enabling analytics on SAP-sourced data in external platforms requires the ability to access data through both the application layer and the data layer, and to decode data from SAP pool and cluster tables, in order to provide both the right data and the metadata needed for analytics. The solution must handle this complex access and transformation based on deep knowledge of the SAP application portfolio.
Mainframe sources such as VSAM and IMS present similar challenges. Done right, no agents need to be installed on the mainframe and no additional processing is introduced on the server, while real-time change data capture and delivery are still provided.
By adhering to these six principles, enterprise IT organizations can more readily build an advanced cloud or hybrid architecture that supports both historical and real-time business analytics requirements. Advanced CDC-based data ingestion enables new sources to be onboarded quickly and real-time analytics and business value to be delivered from your data lake investments.