Picking the right database when there are so many available should not just be a case of using whatever was easy for your last project.
Companies today rely on their data. Their applications create it, they analyze it to find more opportunities, and it powers the experiences that customers want. No business today is saying, “I wish I had less data.”
Implementing the right database can make a huge difference to performance. This means knowing the right kind of database for your use case, as well as how the details of any deployment will affect performance over time. Looking at the small stuff – from which driver or language runtime version you use through to how you deploy – can have a significant impact on how many customers your application can support simultaneously, and on how you approach real-time data.
See also: Enabling Real-Time Applications with Change Data Capture
Picking the right database platform
There are plenty of options available today. According to the DB-Engines ranking site, there are 359 different databases that can be used in projects. Picking the right database can make a huge difference to performance over time, but so can setting up that implementation in the right way.
In practice, this means getting the right database design, the right data types, and the right indexes. It also means that the choice should not simply come down to whatever was easy for your last project: while it is possible to hack most database set-ups to achieve what you want, each database has different strengths and weaknesses that make it more or less viable for your current project.
For example, you may need the standard Atomicity, Consistency, Isolation, and Durability (ACID) characteristics for your transactions, which might point to relational databases like MySQL or PostgreSQL. Alternatively, you may need to handle larger volumes of data and scale out, which would point to NoSQL databases instead.
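To make the relational side of that trade-off concrete, here is a minimal sketch of leaning on ACID transactions from application code. It assumes a local MySQL instance, the mysql-connector-python driver, and a hypothetical InnoDB accounts table; the names and credentials are placeholders, and either both writes are committed together or neither is.

```python
# Minimal sketch: an atomic transfer that relies on the database's ACID
# guarantees. Assumes mysql-connector-python, a local MySQL instance and a
# hypothetical InnoDB "accounts" table; names and credentials are placeholders.
import mysql.connector

conn = mysql.connector.connect(
    host="localhost", user="app", password="secret", database="bank"
)
cur = conn.cursor()

try:
    # Both updates succeed or neither does - atomicity and durability come
    # from the database rather than application-level bookkeeping.
    cur.execute("UPDATE accounts SET balance = balance - %s WHERE id = %s", (100, 1))
    cur.execute("UPDATE accounts SET balance = balance + %s WHERE id = %s", (100, 2))
    conn.commit()
except mysql.connector.Error:
    conn.rollback()  # leave the data in its last consistent state
    raise
finally:
    cur.close()
    conn.close()
```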
You may want to prioritize developer speed, or you may need to handle specific data formats like JSON, which would suggest MongoDB. Alongside databases designed for typical application workloads, there are also specialist databases for graph data and time-series data that may be better suited to your use cases. Understanding these areas is essential if you want to pick the right database.
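For comparison, a document database treats JSON-shaped data as its native format. The sketch below is a rough illustration rather than a recommended setup: it assumes a local MongoDB instance, the pymongo driver, and a made-up events collection, and it stores and queries a nested document without defining a schema first.

```python
# Rough sketch: storing and querying JSON-style documents in MongoDB.
# Assumes a local MongoDB instance and the pymongo package; the database,
# collection and field names are made up for illustration.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
events = client["app"]["events"]

# Documents carry nested structure directly - no table definition needed first.
events.insert_one({
    "user_id": 42,
    "type": "checkout",
    "cart": {"items": 3, "total": 59.97},
    "tags": ["mobile", "promo"],
})

# Query on nested fields using the same JSON-like syntax.
for doc in events.find({"type": "checkout", "cart.items": {"$gte": 2}}):
    print(doc["user_id"], doc["cart"]["total"])

client.close()
```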
Another element of database design to consider is the workload growth and access patterns that your application will see. For some applications, you will be able to estimate growth over time with a fair degree of accuracy. Others will be less predictable because they depend on customer use and activity around a service – this may turn out to be lower than you originally thought, or much higher. For customer-facing applications, the temptation is always there to over-specify just in case, or to over-complicate the design in an attempt to future-proof the service. The better guidance is to keep things simple to start with and concentrate on what is needed today; you can always update the design to keep up with new query models or extended use when you need to.
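To make the design side more concrete, the sketch below shows the kind of small decision this involves: a hypothetical MySQL orders table with column types sized to the data and an index that matches the query the application actually runs. The table, columns, and connection details are placeholders, not a recommendation for any particular schema.

```python
# Sketch: matching data types and indexes to the queries an application runs.
# Assumes a local MySQL instance and mysql-connector-python; the table,
# columns and credentials are placeholders, not a recommended schema.
import mysql.connector

conn = mysql.connector.connect(
    host="localhost", user="app", password="secret", database="shop"
)
cur = conn.cursor()

# Types sized to the data: DECIMAL for money, DATETIME for timestamps,
# plus a composite index on the columns the application filters by.
cur.execute("""
    CREATE TABLE IF NOT EXISTS orders (
        id          BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        customer_id BIGINT UNSIGNED NOT NULL,
        total       DECIMAL(10, 2)  NOT NULL,
        created_at  DATETIME        NOT NULL,
        INDEX idx_customer_created (customer_id, created_at)
    )
""")

# The index above serves this query directly, so it stays fast as the table
# grows instead of scanning every row.
cur.execute(
    "SELECT id, total FROM orders WHERE customer_id = %s AND created_at >= %s",
    (42, "2024-01-01"),
)
print(cur.fetchall())

cur.close()
conn.close()
```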
Looking at the details
After you have selected your database, you should also look at how you implement it and keep it up to date. For example, you may select MySQL as your database of choice. However, how you pair your database instance with an application driver can affect throughput in queries per second and, therefore, how well your application performs.
Looking at MySQL and Python together, the version of Python running the MySQL Python Connector can affect performance. In testing, the same application running the connector under Python 3.9.7 performed significantly better than under Python 3.10.0: it delivered around 4,300 queries per second (QPS) on 3.9.7, but only around 2,900 QPS on 3.10.0, a drop of roughly a third in transaction throughput. Both figures were lower than using mysqlclient as an alternative driver, where both Python versions achieved around 4,750 QPS.
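Those figures came from benchmarking rather than anything exotic, and it is worth measuring the same thing in your own environment before taking any number at face value. The rough sketch below times a fixed batch of simple queries through whichever driver is installed and reports queries per second; the connection details are placeholders, and the absolute results will depend on your hardware, driver, and Python version.

```python
# Rough sketch: measuring queries per second through the installed driver.
# Run it under different Python versions, or swap in another driver such as
# mysqlclient, to compare throughput. Connection details are placeholders and
# the absolute numbers will vary with your environment.
import time
import mysql.connector

conn = mysql.connector.connect(
    host="localhost", user="app", password="secret", database="shop"
)
cur = conn.cursor()

N = 10_000
start = time.perf_counter()
for i in range(N):
    cur.execute("SELECT %s + 1", (i,))
    cur.fetchall()
elapsed = time.perf_counter() - start

print(f"{N / elapsed:,.0f} queries per second")

cur.close()
conn.close()
```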
What did this show? In theory, it demonstrates how a small decision like which application driver to use, and which runtime to run it on, can affect performance. In reality, many application developers do not look into the specifics of driver versions and performance over time. Either they do not think to track these areas, or they use a cloud-hosted service or Database as a Service option where these decisions are effectively out of their hands.
So why is this kind of detail important to know about? It shows how much a small decision like which driver to use can affect application performance and, more importantly, your ability to deliver the real-time experience that customers want relative to the budget you spend to achieve it.
Adding a fraction of a second of load time to a user transaction might not be noticeable on its own under normal circumstances. However, as the load on the database increases, the time to deliver a transaction goes up, which would push you to upgrade your instance or add another node at greater expense. If you are running your application as a cloud-native deployment, adding nodes is the default way to scale your service – it is cheap, but it is not free. That direct cost is money that could have been applied to other, more pressing concerns, and it can lead to further infrastructure costs as the data set expands.
For developers looking at how to build real-time applications, the temptation is there to hand everything over to a third party and let them take care of the situation. However, this can lead to much higher costs and a lack of control compared to running your own choice of database. Running it yourself does mean making choices around little things like language drivers, but those choices can have a huge impact on the performance of your application and the cost of the infrastructure involved. Moving to new instances and scaling by credit card in the cloud is a short-term solution that may be the right answer for right now; however, it is possible to deliver better performance at a lower cost by asking the right questions in the first place.
Application workloads evolve over time – they grow, they shrink, and they have to expand with new queries or use cases. Modern application design and approaches to infrastructure can support those needs, scaling quickly up and down to meet demand. However, these systems do not look out for hidden issues and bottlenecks. It remains our responsibility to look for deviations from what we expect to see around performance – we should always ask why something changed and dig in to find the answer.
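One lightweight way to do that, sketched below on the assumption of a MySQL backend and the mysql-connector-python driver, is to sample the server's own counters and compare the derived queries-per-second figure against the baseline you expect; a sustained deviation is a prompt to investigate rather than simply scale up. The threshold and interval here are placeholders you would tune to your own workload.

```python
# Sketch: watching MySQL throughput for deviations from an expected baseline.
# Assumes mysql-connector-python and a local MySQL instance; the baseline and
# sampling interval are placeholders to tune for your own workload.
import time
import mysql.connector

EXPECTED_QPS = 4_000      # placeholder baseline for this example
INTERVAL_SECONDS = 10

conn = mysql.connector.connect(host="localhost", user="app", password="secret")
cur = conn.cursor()

def questions() -> int:
    # 'Questions' is MySQL's cumulative count of statements executed.
    cur.execute("SHOW GLOBAL STATUS LIKE 'Questions'")
    return int(cur.fetchone()[1])

previous = questions()
while True:
    time.sleep(INTERVAL_SECONDS)
    current = questions()
    qps = (current - previous) / INTERVAL_SECONDS
    previous = current
    if qps < EXPECTED_QPS * 0.5:
        print(f"Throughput fell to {qps:,.0f} QPS - time to ask why")
```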
Just because we can use automation to mitigate our problems does not mean that they are dealt with; this approach may simply store up trouble and lead to far higher expenses over time. Instead, we have to take these issues into account when dealing with real-time data. The small stuff can make a huge difference.