The challenges of integrating time-series data into data lakes can be overcome by using the right architecture and providing the appropriate metadata.
Technology companies such as Amazon and Microsoft offer data lakes: inexpensive, cloud-based storage for organizational data that holds substantial promise. The value of a data lake lies not just in storing data but in establishing a central location where users can easily access and use it. When data is democratized in this way, users can discover what data is available and define views or combinations of it for specific use cases. They can then make decisions and improvements based on that data, contribute to building a data-driven organization, and unlock the true value of the data lake.
However, organizations must take care that a data lake does not become a data swamp: a situation in which large amounts of data simply sit unused. This can happen when users need specialized development or technical skills just to reach and work with the data. In that case, the value of the data lake is never realized, nor does it meet organizational and user expectations.
Data lakes are becoming increasingly important to process industries, which capture and store immense amounts of sensor-generated time-series data. To fully leverage this data, many industrial manufacturing companies use advanced industrial analytics, the application of data science to raw data generated by manufacturing processes through both continuous sensors and discontinuous sampling, to gain crucial insight into their processes. To make data lakes work for time-series data, it is important to understand that the data cannot simply be dumped into the lake with the expectation that its value can be extracted later. Several challenges must therefore be addressed when integrating time-series data into data lakes.
1) Ensure you provide the required metadata for easy data ingestion into data lakes
The first integration challenge concerns data ingestion. There is no standard data lake tool, no single solution or platform that an organization can use to magically solve issues such as data mapping and correlation; each case must be examined individually. Typically, an organization adopting advanced industrial analytics wants to know how to access its time-series data for analysis. To ease ingestion, organizations must provide the required metadata: data lineage, data structure, data age, and other common attributes and properties that link the data together.
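As a concrete illustration, the minimal sketch below writes a simple metadata manifest alongside a batch of raw sensor data. The field names, values, and file path are hypothetical, not a formal standard; the point is that lineage, structure, and age travel with the data from the moment of ingestion so downstream tools can discover and link it.

```python
import json
from datetime import datetime, timezone

# Hypothetical metadata manifest for one batch of time-series data.
manifest = {
    "source": "plant-a/line-3/temperature-sensor-17",       # data lineage
    "ingested_at": datetime.now(timezone.utc).isoformat(),  # data age
    "schema": {                                             # data structure
        "timestamp": "ISO-8601 UTC",
        "value": "float, degrees Celsius",
        "quality": "int, vendor quality code",
    },
    "tags": {"site": "plant-a", "unit": "reactor-2"},       # linking attributes
}

# Store the manifest next to the raw file so it is discoverable in the lake.
with open("sensor-17-batch-0042.manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
```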
2) Ensure your industrial advanced analytics platform can connect to existing data lakes
The second integration challenge is knowing whether the advanced industrial analytics platform in use can connect to the existing data lake. Although there is no single standard or solution, many vendors and offerings share common features. One is a query abstraction layer: a tool or component in the organization's data lake that allows standard SQL queries to be written against the data. It also means that any tool with standard ODBC or JDBC connectivity can connect to the data lake.
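For example, here is a minimal sketch assuming the lake's query layer is exposed through an ODBC driver. The DSN, table, and column names are hypothetical; substitute whatever your own query engine exposes.

```python
import pyodbc  # third-party: pip install pyodbc

# Connect through the lake's ODBC data source (name is hypothetical).
conn = pyodbc.connect("DSN=DataLakeQueryLayer", autocommit=True)
cursor = conn.cursor()

# Standard SQL works regardless of how the underlying files are stored.
cursor.execute(
    """
    SELECT ts, sensor_id, value
    FROM sensor_readings
    WHERE ts >= ? AND sensor_id = ?
    """,
    ("2024-01-01", "temperature-sensor-17"),
)
for row in cursor.fetchmany(5):
    print(row.ts, row.sensor_id, row.value)

conn.close()
```

Because the connection is plain ODBC, the same query would work from any analytics tool that speaks ODBC or JDBC, which is precisely the value of the abstraction layer.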
3) Ensure proper performance of your data lakes
The third integration challenge is ensuring that the organization's query layer can extract data with acceptable performance. A data lake commonly uses inexpensive block storage with massive capacity; the downside of this storage type is that it is typically not the fastest to access. Users of advanced industrial analytics work with data interactively, so they expect the data to be where they need it and to arrive as quickly as possible when they need it.
Additionally, it is problematic when all of an organization's data sits in one huge file in the data lake, because this structure is highly inefficient for extracting data. There are several best practices for addressing this issue. One is to use columnar file formats, which allow users to read only the columns or properties needed for a specific case. Because the entire file does not have to be read, less data is loaded, resulting in faster response times. A second practice is partitioning: arranging data in folder-like structures by key properties, by time, or by a combination of the two, depending on the data, thus splitting it into much smaller files. This structure lets users drill down to specific data sets, which again means less data to transfer and much less time to process or query it. Both practices are illustrated in the sketch below.
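The following is a minimal sketch using Apache Parquet (a widely used columnar format) and the pyarrow library; the sample data, path, and partition keys are hypothetical. Writing partitions by date and sensor produces the folder-like structure described above, and the read step loads only the columns and partitions a specific analysis needs.

```python
import pandas as pd            # pip install pandas pyarrow
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical sensor readings.
df = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-01", "2024-01-02"],
    "sensor_id": ["s17", "s42", "s17"],
    "ts": pd.to_datetime(["2024-01-01 00:00", "2024-01-01 00:10",
                          "2024-01-02 00:00"]),
    "value": [21.4, 19.8, 22.1],
})

# Write one columnar dataset split into folder-like partitions:
#   lake/date=2024-01-01/sensor_id=s17/..., and so on.
pq.write_to_dataset(
    pa.Table.from_pandas(df),
    root_path="lake",
    partition_cols=["date", "sensor_id"],
)

# Read back only the columns and partitions this analysis needs;
# untouched columns and folders are never loaded from storage.
subset = pq.read_table(
    "lake",
    columns=["ts", "value"],
    filters=[("date", "=", "2024-01-01"), ("sensor_id", "=", "s17")],
)
print(subset.to_pandas())
```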
Concluding thoughts
Data lakes offer substantial benefits, especially to industrial process companies that capture immense amounts of sensor-generated time-series data. There are challenges to address when integrating time-series data into data lakes for advanced industrial analytics, but they can be overcome by setting up the appropriate data lake architecture and providing the appropriate metadata.