The next generation of data analytics needs real-time analytic databases that combine CRUD with streams, delivering high concurrency and sub-second response times across billions of data points.
Using data to generate insights furthers almost any organization’s success, plain and simple. The benefits include getting the right products to the right people, leveling the playing field, understanding risks, helping people find their preferences, and many other compelling outcomes.
In the world of data analytics, some would say there have been three major shifts in how we find insights from data, and now we’re seeing a fourth.
The creation of CRUD
It all began with Codd’s creation of the relational model. Before that, hierarchical and network databases focused on automating legacy processes that had been done with pens, paper, and mechanical calculators. In 1970, IBM’s Dr. Ted Codd published “A Relational Model of Data for Large Shared Data Banks,” which started a new era for data. Relational databases became the basis of a data revolution in the 1980s and 1990s, giving us the tables of rows and columns that we use today.
Codd’s idea inspired another group at IBM to develop SQL, which made it much easier to get data in and out of databases. Many groups around the world began using SQL, and a new wave of relational databases came to be.
To put it simply, relational SQL is CRUD (Create, Read, Update, and Delete), and it was revolutionary in making large data sets practical at a time when compute and storage were very costly. CRUD helped lower these costs with a collection of tools for storing data more efficiently by breaking it into numerous smaller tables, a process Dr. Codd called normalization.
This made data management more complex, which meant more developer time spent working with data. But if a gigabyte of storage costs the same as five person-years of developer time, then spending developer time on that complexity is worth the price.
Normalized CRUD allowed a new level of complex questions to be asked of the data, such as “What are my most profitable products, and how have they changed over the last few quarters?”
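As a rough, hypothetical illustration of that idea (SQLite in memory, with invented products and orders tables rather than a schema from any real system), normalization keeps the data in small related tables, and plain CRUD-style SQL can then answer the profitability question above:

```python
import sqlite3

# A minimal sketch of a normalized schema: hypothetical tables, not a real-world design.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (product_id INTEGER PRIMARY KEY, name TEXT, unit_cost REAL);
CREATE TABLE orders   (order_id INTEGER PRIMARY KEY, product_id INTEGER,
                       quantity INTEGER, sale_price REAL, order_date TEXT);
""")
conn.executemany("INSERT INTO products VALUES (?, ?, ?)",
                 [(1, "widget", 2.0), (2, "gadget", 5.0)])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?, ?)",
                 [(1, 1, 10, 3.5, "2023-01-15"),
                  (2, 2,  4, 9.0, "2023-04-02"),
                  (3, 1,  7, 3.5, "2023-04-20")])

# "What are my most profitable products, and how have they changed over the quarters?"
rows = conn.execute("""
    SELECT p.name,
           strftime('%Y', o.order_date) || '-Q' ||
           ((CAST(strftime('%m', o.order_date) AS INTEGER) + 2) / 3) AS quarter,
           SUM(o.quantity * (o.sale_price - p.unit_cost))            AS profit
    FROM orders o JOIN products p ON p.product_id = o.product_id
    GROUP BY p.name, quarter
    ORDER BY profit DESC
""").fetchall()
for name, quarter, profit in rows:
    print(name, quarter, profit)
```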
Analytical databases need data to be stored in an analytics-friendly shape, with the data partially denormalized, meaning fewer, bigger tables. It was soon discovered that using the same dataset for both transactions and analytics made both work poorly, so developers began keeping a second copy of the data on a second installation of the database software.
Appliances that use CRUD
As analytics evolved, a new wave of appliances came about. These appliances used relational CRUD but incorporated new categories of software to extract data from transactional systems, reshape it into a different CRUD schema, and load it into analytic databases. Alongside them, business intelligence tools turned the data into pictures and reports that people could more easily use.
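A hedged sketch of that extract-transform-load pattern, reusing the hypothetical products/orders schema from the previous example: pull rows out of the transactional store, flatten them into a denormalized, analytics-friendly shape, and load them into the separate analytics copy.

```python
import sqlite3

def run_etl(transactional: sqlite3.Connection, analytics: sqlite3.Connection) -> None:
    """Sketch of one ETL run: extract from the transactional store (the hypothetical
    products/orders schema above), denormalize, and load into an analytics copy."""
    analytics.execute("""
        CREATE TABLE IF NOT EXISTS sales_fact (
            product_name TEXT, quarter TEXT, quantity INTEGER, profit REAL)
    """)
    # Extract + transform: join the normalized tables into one flat, analytics-friendly shape.
    rows = transactional.execute("""
        SELECT p.name,
               strftime('%Y', o.order_date) || '-Q' ||
               ((CAST(strftime('%m', o.order_date) AS INTEGER) + 2) / 3),
               o.quantity,
               o.quantity * (o.sale_price - p.unit_cost)
        FROM orders o JOIN products p ON p.product_id = o.product_id
    """).fetchall()
    # Load: write the denormalized rows into the second, analytics-only database.
    analytics.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)", rows)
    analytics.commit()
```

Here `transactional` would be the CRUD database serving the application, and `analytics` the second copy that business intelligence tools query.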
The Internet radically transformed the data ecosystem and increased the amount of data being created and used. In the ’90s, a “big application” might have meant 5,000 users and a 1TB data warehouse. By the early 2000s, “big applications” were social media giants supporting millions of users. It was soon found that pushing this much data through CRUD pipelines was costly and limited.
CRUD to cloud
A new era of analytics databases came about to deal with larger datasets. Many thought that these databases would change data warehousing and connect the new discoveries of the Internet with an outdated CRUDdy infrastructure. The Internet is what prompted the creation of the cloud, which completely changed the approach to data.
The cloud made unlimited cheap computing power possible, as well as affordable storage on demand. This enabled a redesign-and-rebuild approach to analytics.
On-premise applications were limited in capacity. Infrastructure and software licenses were expensive, and increasing capacity took time and money. On the cloud, compute can be added and removed on-demand, and storage is both durable and cheap. Suddenly, analytics was more scalable and less expensive than ever before. New ecosystems of cloud data warehouses, cloud data pipelines, cloud visualization tools, and cloud data governance redefined analytics.
Cloud computing inspired rapid growth in applications, giving the average business, not just internet giants, the ability to operate applications that support millions of users. But once again, pushing this much data through CRUDdy pipelines proved inefficient and costly.
Around 2010, data engineers were struggling to tackle this problem. They wanted to figure out how to have interactive conversations with high volumes of data. Data streams in from the Internet and other applications, so why not just analyze the data stream instead of applying it all to relational CRUD?
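As a minimal sketch of that idea (pure Python, with an invented event shape of timestamp plus page), an aggregate can be maintained directly on the stream, with no relational tables in the path:

```python
from collections import Counter, deque
from datetime import datetime, timedelta

# Minimal sketch: keep a rolling five-minute count per page directly on the stream,
# instead of inserting every event into a relational table first.
window = timedelta(minutes=5)
events = deque()     # (timestamp, page) pairs still inside the window
counts = Counter()   # running count per page within the window

def ingest(timestamp: datetime, page: str) -> None:
    """Fold one event into the rolling aggregate and evict anything that has expired."""
    events.append((timestamp, page))
    counts[page] += 1
    cutoff = timestamp - window
    while events and events[0][0] < cutoff:
        _, old_page = events.popleft()
        counts[old_page] -= 1

# Every arriving event updates the answer immediately; querying is just a lookup.
ingest(datetime(2023, 4, 2, 12, 0, 0), "/pricing")
ingest(datetime(2023, 4, 2, 12, 3, 0), "/docs")
print(counts.most_common(3))   # [('/pricing', 1), ('/docs', 1)]
```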
The need for a fourth shift
There is a growing need for a technology that uses both streaming and historical data: a database that answers questions from billions of data points with sub-second response times, pulling from data in streams and in historical datasets. Concurrency is also crucial, because hundreds of people may be asking questions at the same time. And it needs to be affordable, where cost equates to tangible value.
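One way to picture that requirement, as a hedged sketch rather than any particular product’s API: answer each question by combining the live stream window with the historical store in a single code path, so fresh and old data points are served together.

```python
import sqlite3
from collections import Counter

# Hedged sketch of the "fourth shift" idea: one question answered from both the
# live stream window and the historical store, in a single code path.
history = sqlite3.connect(":memory:")
history.execute("CREATE TABLE page_views_daily (page TEXT, views INTEGER)")
history.executemany("INSERT INTO page_views_daily VALUES (?, ?)",
                    [("/pricing", 120_000), ("/docs", 310_000)])

live_window = Counter({"/pricing": 42, "/docs": 17})   # maintained as in the stream sketch

def total_views(page: str) -> int:
    """Historical total plus whatever has arrived in the current stream window."""
    row = history.execute(
        "SELECT COALESCE(SUM(views), 0) FROM page_views_daily WHERE page = ?",
        (page,)).fetchone()
    return row[0] + live_window[page]

print(total_views("/pricing"))   # 120042: history plus the live window
```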
Storage and compute cost money, but in modern development, that cost is much less than developer time. Take into account a developer’s salary, benefits, equipment, and management, and it is easy to see that developer time comes at a much greater cost.
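To make that trade-off concrete with assumed, illustrative numbers (cloud object storage at a few cents per gigabyte-month, a fully loaded developer at roughly $150,000 a year), the arithmetic leans heavily toward saving developer time:

```python
# Back-of-the-envelope comparison with assumed, illustrative prices; adjust to taste.
storage_cost_per_gb_month = 0.03      # assumed cloud object-storage price, USD
developer_cost_per_year = 150_000.0   # assumed fully loaded salary + overhead, USD
developer_cost_per_hour = developer_cost_per_year / (52 * 40)

# Storing an extra terabyte for a year vs. one week of one developer's time:
extra_storage = 1_000 * storage_cost_per_gb_month * 12
one_dev_week = developer_cost_per_hour * 40

print(f"1 TB stored for a year: ${extra_storage:,.0f}")   # a few hundred dollars
print(f"One developer-week:     ${one_dev_week:,.0f}")    # close to three thousand
```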
From CRUD to modernity
From the beginnings of relational databases through data warehousing, there is now a need for modernity. The CRUD approach that has been the basis of data analytics until now shows why modern streaming data needs a different architecture to succeed. Real-time analytic databases that combine CRUD with streams for high concurrency and sub-second response times across billions of data points will ready developers for the next generation of data analytics. Developers need to recognize the importance of moving beyond CRUD and embracing modernity.