In data-driven companies, the project bottleneck is often the size and shape of your own data team. Here’s how Data-as-a-Service might help unclog it.
Modern IT organizations face an impossible challenge: the ratio of data consumers—BI users, data scientists, analysts, decision-makers—to capable data engineers is 100:1.
As a result, data consumers wait in line for their turn with IT, helpless without a data engineer to provision data for their particular needs. They end up under-utilized, and companies don’t move to the next insight as quickly as they would like.
Why does this happen? Today organizations have data stored in hundreds or thousands of silos, across a mix of technologies and formats. In addition, data consumers rely on a number of different tools to do their jobs effectively, including multiple BI tools, Python, R, SAS, Excel, and others. IT is left in the middle, fulfilling requests to source data from multiple systems, and provision it in a way that works for a given tool and group of users. Each day is a race to put out the next fire.
Data-as-a-Service is meant to address these challenges by allowing companies to leave data where it is already managed while providing fast access for data consumers, regardless of the tool they use. By building on open source standards and new paradigms for in-memory computing, Data-as-a-Service helps IT organizations empower data consumers to be more independent and self-sufficient while making data engineers more productive.
Data Lake or Data Swamp?
Data lakes are an agile, low-cost way for companies to store their data, but without the right tools, the data lake can grow stagnant and become a data swamp. Often, data lakes suffer or fail when there is no way to govern the data, no easy way for data consumers to access the data, and no clear goal for what it is supposed to achieve.
Over the past decade, companies have deployed data lakes into their technology stacks to solve the great challenge of too many silos of data. While data lakes make it easy to store the data “as is,” they leave the challenges of data quality, security, performance, and accessibility to software engineers to solve on a project-by-project basis. Data lakes simplify where to find the data, but they don’t help data consumers with how to access it.
The Workaround
To improve access for data consumers, many organizations turn to the tried-and-true relational database. They move the data from the data lake into one or more data warehouses or data marts, spending enormous time and effort on ETL scripts that transform the data to ensure its quality and integrity before loading it into the database.
As a result, companies find themselves struggling with the same problems that drove them to consider the data lake in the first place: data warehouses are expensive and complex to operate, they are challenging to scale, and the lead time to make the data available is too long. Worse, with this approach companies end up with yet another silo to manage, secure, and govern, which is the very problem they set out to solve.
A New Paradigm: Data-as-a-Service
Recently, a fundamentally different approach to the problem has emerged. Advances in hardware and in-memory computing now let software engineers tackle this old problem in a novel way.
First, Data-as-a-Service recognizes that organizations will never finish the job of consolidating all their data into a single system. Instead, Data-as-a-Service assumes data will exist in many different systems and formats, and it queries the data in situ. Instead of moving the data into a new silo, Data-as-a-Service solutions connect to the underlying databases, file systems, and object stores and query the data directly.
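To make the idea of querying in situ concrete, here is a minimal Python sketch. It is an illustration only, not how any particular Data-as-a-Service product works: the two sources (an in-memory SQLite database standing in for an operational database, and a CSV string standing in for a file in an object store), along with every table, column, and function name, are assumptions made up for the example. The point is that each source is scanned where it lives and the join happens in the query layer, with no copy loaded into a new silo.

```python
import csv
import io
import sqlite3

# Hypothetical source 1: a relational database (in-memory SQLite stand-in).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
db.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Acme"), (2, "Globex")])

# Hypothetical source 2: a CSV file in a file system or object store
# (a string buffer stands in for the real file).
orders_csv = io.StringIO("customer_id,amount\n1,120.0\n1,30.0\n2,75.5\n")


def scan_customers():
    """Read customer rows directly from the database."""
    for customer_id, name in db.execute("SELECT id, name FROM customers"):
        yield {"id": customer_id, "name": name}


def scan_orders():
    """Read order rows directly from the CSV file."""
    orders_csv.seek(0)
    for row in csv.DictReader(orders_csv):
        yield {"customer_id": int(row["customer_id"]), "amount": float(row["amount"])}


def total_by_customer():
    """Join the two sources in the query layer and total the order amounts."""
    totals = {}
    for order in scan_orders():
        key = order["customer_id"]
        totals[key] = totals.get(key, 0.0) + order["amount"]
    return {c["name"]: totals.get(c["id"], 0.0) for c in scan_customers()}


result = total_by_customer()
print(result)
```

A real system would push filters and aggregations down to each source rather than scanning everything, but the shape is the same: connectors per source, one query layer on top.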
Second, Data-as-a-Service is designed for data consumers. Instead of placing the burden entirely on IT, Data-as-a-Service provides a self-service experience for data consumers to easily search and discover datasets using a searchable data catalog; to easily preview data in order to verify its applicability to the task at hand; to curate new datasets by filtering, transforming, and joining different datasets together, without writing any code; to share their work with other individuals or teams, so people can build on the efforts of others; and to continue to use their favorite BI and data science tools, but more productively.
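As a rough illustration of the search-and-discover step, the sketch below ranks catalog entries by how often the query terms appear in their name, description, and tags. The `CatalogEntry` type, the `search` function, and the sample entries are all hypothetical; a production catalog would use a proper search index and richer metadata.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class CatalogEntry:
    """Hypothetical catalog record describing one dataset."""
    name: str
    description: str
    tags: Tuple[str, ...] = ()


def search(catalog: List[CatalogEntry], query: str) -> List[str]:
    """Return dataset names ranked by number of query-term matches."""
    terms = query.lower().split()
    scored = []
    for entry in catalog:
        haystack = " ".join((entry.name, entry.description, *entry.tags)).lower()
        score = sum(haystack.count(term) for term in terms)
        if score:
            scored.append((score, entry.name))
    # Highest score first; ties broken alphabetically by name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored]


catalog = [
    CatalogEntry("sales_orders", "One row per customer order", ("sales", "finance")),
    CatalogEntry("web_clicks", "Raw clickstream events", ("marketing",)),
    CatalogEntry("customer_churn", "Monthly churn flags per customer", ("sales",)),
]

hits = search(catalog, "customer sales")
print(hits)
```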
Third, Data-as-a-Service recognizes that it is essential to separate the logical and physical aspects of data. Data consumers should be able to name, describe, and categorize datasets on their own terms, independent of how the physical data is managed. In addition, optimizing the access to the physical data should be independent of the logical model used for organizing and accessing the data. To these ends, Data-as-a-Service provides the ability to accelerate and scale access to the data using modern columnar in-memory data structures and scale-out architecture, without being coupled to the limitations of a given physical source.
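One way to picture the logical/physical split is the sketch below. It assumes a hypothetical `LogicalDataset` class that carries the name and description consumers see, plus a loader for the physical source. Calling `accelerate()` materializes a columnar in-memory copy, and consumer code keeps calling `read()` unchanged. All names and data here are invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Columnar layout: column name -> list of values.
Columns = Dict[str, List]


@dataclass
class LogicalDataset:
    """Hypothetical logical dataset: what consumers name and query."""
    name: str
    description: str
    load_physical: Callable[[], Columns]   # how to reach the physical source
    accelerated: Optional[Columns] = None  # optional in-memory columnar copy

    def accelerate(self) -> None:
        """Materialize the data once into memory; the logical name is unchanged."""
        self.accelerated = self.load_physical()

    def read(self) -> Columns:
        """Serve from the accelerated copy if present, else from the source."""
        if self.accelerated is not None:
            return self.accelerated
        return self.load_physical()


physical_reads = 0  # counts trips to the (slow) physical source


def load_sales_from_source() -> Columns:
    global physical_reads
    physical_reads += 1
    return {"region": ["east", "west", "east"], "revenue": [100, 250, 50]}


sales = LogicalDataset(
    "sales_by_region", "Revenue per order, tagged by region", load_sales_from_source
)


def east_revenue(dataset: LogicalDataset) -> int:
    cols = dataset.read()
    return sum(rev for reg, rev in zip(cols["region"], cols["revenue"]) if reg == "east")


before = east_revenue(sales)       # goes to the physical source
sales.accelerate()                 # one more physical read to build the copy
after = east_revenue(sales)        # served from memory
after_again = east_revenue(sales)  # still served from memory
print(before, after, after_again, physical_reads)
```

The consumer-facing function `east_revenue` never changes; only the physical access path behind `read()` does, which is the independence the paragraph above describes.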
Fourth, Data-as-a-Service must address the security and governance goals of an organization, ensuring that users can access only the data they are entitled to. Data-as-a-Service must provide fine-grained controls, including row- and column-level access controls and masking of sensitive data, even when the underlying source does not provide these abilities.
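Row-level filtering, column-level visibility, and masking can be enforced in the access layer itself, which is why they can work even over sources that lack them. The sketch below is a simplified illustration under assumed names (`Policy`, `mask`, `apply_policy`, and the sample rows are all hypothetical), not a description of any specific product’s policy engine.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set

Row = Dict[str, object]


@dataclass
class Policy:
    """Hypothetical per-user policy applied by the access layer."""
    row_filter: Callable[[Row], bool]  # row-level control
    visible_columns: Set[str]          # column-level control
    masked_columns: Set[str] = field(default_factory=set)  # shown, but masked


def mask(value: object) -> str:
    """Hide all but the last four characters of a value."""
    text = str(value)
    return "*" * max(len(text) - 4, 0) + text[-4:]


def apply_policy(rows: List[Row], policy: Policy) -> List[Row]:
    """Return only the rows and columns the policy allows, masking as required."""
    result = []
    for row in rows:
        if not policy.row_filter(row):
            continue
        shaped = {}
        for column, value in row.items():
            if column not in policy.visible_columns:
                continue
            shaped[column] = mask(value) if column in policy.masked_columns else value
        result.append(shaped)
    return result


rows = [
    {"name": "Ana", "region": "EU", "ssn": "123456789", "salary": 90000},
    {"name": "Bo", "region": "US", "ssn": "987654321", "salary": 85000},
]

# An EU analyst sees only EU rows, never salary, and a masked identifier.
eu_analyst = Policy(
    row_filter=lambda r: r["region"] == "EU",
    visible_columns={"name", "region", "ssn"},
    masked_columns={"ssn"},
)

visible = apply_policy(rows, eu_analyst)
print(visible)
```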
Fifth and finally, Data-as-a-Service must be open source in order to remove vendor lock-in and to ensure that contributions from a broader community benefit all users. Data-as-a-Service sits in a critical part of the data technology stack, and a component that critical must be open source so that any company can benefit.
Conclusion
Ten years into the AWS era, companies love the “as-a-service” model. Infrastructure once required a multi-month lead time for new applications but is now available with the click of a button. Tools and applications have followed suit, and the idea of waiting weeks or months for IT to install and provision a new technology seems antiquated.
Yet when it comes to data, companies today still follow the equivalent of “racking and stacking” in order to provision a new dataset. Data-as-a-Service opens the door to making data more like the services used for provisioning infrastructure, tools, and applications.