3 Tips for Selecting the Right Database for Your App 

Perhaps you’re building a brand new application. Or maybe your current database isn’t working well. Choosing the right database for your application can be overwhelming, given all the choices available today.

Having used a variety of database vendors in production, I can easily say that there is no one right answer. So, to help you along with the decision-making process, I’m going to give you three tips for selecting the right database for your application:

Tip #1: It Isn’t a SQL vs. NoSQL Decision

There are countless articles on the pros and cons of SQL and NoSQL databases. While they may provide some insight into the differences, they miss many of the important factors of the decision-making process. Most importantly, we need to select the database that supports the right structure, size, and/or speed to meet the needs of our application.

Structure focuses on how you need to store and retrieve your data. Our applications deal with data in a variety of formats, so selecting the right database includes picking the right data structures for storing and retrieving data. If you select the wrong data structures for persisting your data, your application will require more development effort to work around their limitations and may not scale as a result.

Size is about the quantity of critical application data you need to store and retrieve. The amount of data you can handle before the database is negatively impacted may vary based on a combination of the data structure selected, the database’s ability to partition data across multiple filesystems and servers, and vendor-specific optimizations.

Speed and scale address the time it takes to service incoming reads and writes to your application. Some databases are designed to optimize read-heavy apps, while others are designed to support write-heavy solutions. Selecting a database that can handle your app’s I/O needs goes a long way toward a scalable architecture.

The important thing is to understand the needs of your application, from the structure of your data to its size and the read and write speeds you need. If you’re uncertain, some data modeling can help you map out what’s needed.

This leads us to my next tip:

Tip #2: Use Data Modeling to Guide Database Selection

Data modeling helps map your application’s features into the data structure you’ll need to implement them. Starting with a conceptual model, you can identify the entities, associated attributes, and entity relationships that you’ll need. As you work through the process, the type(s) of data structures you’ll need will become more apparent. You can then use these structural considerations to select the category of database that will serve your application best:

Relational: stores data into classifications (‘tables’), with each table consisting of one or more records (‘rows’) identified by a primary key. Tables may be related through their keys, allowing queries to join data from multiple tables together to access any/all required data. Relational databases require fixed schemas on a per-table basis that are enforced for each row in a table.

Document-oriented: stores structured information with any number of fields that may contain simple or complex values. Each document stored may have different fields, unlike SQL tables, which require fixed schemas. Some document stores support complex hierarchies of data through the use of embedded documents. Additionally, document stores offer extreme flexibility to developers, as fixed schemas do not need to be developed ahead of time. Search stores are often document-oriented databases that are optimized for data querying across one or more fields. Search-based data stores typically support additional features such as sorting by relevance and data faceting for drill-down capabilities.

Key/Value: Key/Value stores offer great simplicity in data storage, allowing for massive scalability of both reads and writes. Each entry pairs a unique key (“bob”) with a value (“555-555-1212”) and may be manipulated using the following operations: Add, Reassign (Update), Remove, and Read. Some storage engines offer additional data structure management within these simple operations.

Column-oriented: similar to relational, data is stored in both rows and columns. However, columns may contain multiple values, allowing data to be fetched by row or by column for highly optimized data retrieval.

Graph: graph stores focus on storing entities and the relationships between them. These stores are very useful in navigating between entities and querying for relationships between them to any depth — something that is difficult to do with traditional relational or document databases.
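
To make these categories concrete, here is a minimal sketch (in Python, with hypothetical entity and field names) of the same “user with phone numbers” data modeled three ways: relationally, as a document, and as key/value pairs:

```python
# Relational: fixed schema, one row per phone number, related by key.
CREATE_TABLES = """
CREATE TABLE users (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE phone_numbers (
    user_id INTEGER REFERENCES users(id),
    label   TEXT,
    number  TEXT
);
"""

# Document-oriented: the same entity as one nested document; each
# document may carry different fields, so no fixed schema is needed.
user_document = {
    "_id": 42,
    "name": "Bob",
    "phones": [
        {"label": "home", "number": "555-555-1212"},
        {"label": "work", "number": "555-555-3434"},
    ],
}

# Key/value: a single value under a unique key, manipulated only
# through add/reassign, read, and remove operations.
kv_store = {}
kv_store["bob"] = "555-555-1212"   # add / reassign (update)
phone = kv_store.get("bob")        # read
del kv_store["bob"]                # remove
```

The relational version spreads the entity across two fixed-schema tables joined by key, while the document version keeps the whole hierarchy in one flexible record; surfacing that trade-off is exactly what the modeling process is for.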

As you start to map your application’s features to data structures, consider the kinds of queries you will need to support for your UI or API. Some data structures make it easy to map single entities into and out of your application, but they may not support the ad hoc queries needed for more complex data retrieval and reporting.

A final note on data modeling: Don’t depend on things like database migrations and scaffolding generators to define your database structures. Data modeling will help you understand the data structures necessary to build your application. Use these tools to accelerate the implementation process, based on your database model.

Tip #3: You May Need More Than One Type of Database

During the modeling process, you may realize that the data structure you need cannot fully optimize certain queries. This may be due to complex search requirements, the need for robust reporting capabilities, caching, or the requirement for a data pipeline to accept and analyze incoming data. In these situations, more than one type of database may be required for your application.

When adopting more than one database, it’s important to select one database that will own a specific set of data. This database becomes the canonical database for those entities or for a specific context. Any additional databases that work with this same data may have a copy, but they will not be considered an owner of this data.

For example, we may decide that a relational database is the best fit for our application’s core data, but we also need to support a robust, faceted search. In this case, we may choose PostgreSQL or MySQL as the canonical data store for all our entities, and then use a document-oriented search store such as Elasticsearch to index our entities by specific fields and facets. Elasticsearch may also store some basic details about our entities, such as name and description, so that our search results are immediately useful. However, Elasticsearch does not own our entity data, and we do not query it for the latest details. Instead, we consider the relational database the canonical source for entity details and updates, and we keep Elasticsearch updated whenever data changes in the relational database.
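
A minimal sketch of that ownership pattern follows (the table, index, and field names are hypothetical, and the search call is modeled on the official Elasticsearch Python client): every write goes to the canonical relational store first, and the search index only ever receives a denormalized copy.

```python
import sqlite3
from elasticsearch import Elasticsearch  # illustrative; any search client works

db = sqlite3.connect("app.db")                 # canonical store: owns entities
es = Elasticsearch("http://localhost:9200")    # search store: holds copies

def update_product(product_id: int, name: str, description: str) -> None:
    # 1. Write to the canonical database first; it is the source of truth.
    db.execute(
        "UPDATE products SET name = ?, description = ? WHERE id = ?",
        (name, description, product_id),
    )
    db.commit()

    # 2. Push a copy of the searchable fields to the index. Elasticsearch
    #    can answer searches from this copy, but updates never start here.
    es.index(
        index="products",
        id=str(product_id),
        document={"name": name, "description": description},
    )
```

In a production system you would usually drive step 2 from a queue or change-data-capture feed rather than a dual write, so that a failure between the two steps cannot leave the index silently stale.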

It’s important to be thoughtful when adopting more than one database. Otherwise, your application may behave inconsistently and result in frustrated customers.

Putting it All Together

To summarize the process I use for selecting a database:

  1. Understand the data structure(s) you require, the amount of data you need to store and retrieve, and your speed and scaling requirements.
  2. Model your data to determine whether a relational, document, columnar, key/value, or graph database is most appropriate.
  3. During the modeling process, consider the ratio of reads to writes and the throughput you will require to satisfy them.
  4. Consider using multiple databases to manage data under different contexts/usage patterns.
  5. Always designate one canonical database to store and retrieve core data, with one or more additional databases supporting features such as searching, data pipeline processing, and caching.

Source: 3 Tips for Selecting the Right Database for Your App – DZone Database

re: Why Uber Engineering Switched From Postgres to MySQL

The Uber Engineering group has posted a really great blog post about their move from Postgres to MySQL. I mean that quite literally: it is a pleasure to read, especially since they went into such details as the on-disk format and its implications for their performance.

For fun, there is another great post from Uber, about moving from MySQL to Postgres, which also has interesting content.

Go ahead and read both, and we’ll talk when you are done. I want to compare their discussion to what we have been doing.

In general, Uber’s issues fall into several broad categories:

  • Secondary indexes cost on write
  • Replication format
  • The page cache vs. buffer pool
  • Connection handling

Secondary Indexes

Postgres maintains a secondary index that points directly to the data on disk, while MySQL’s secondary index has another level of indirection, pointing to the primary key instead. The diagrams in the original post (the Postgres tuple layout vs. the MySQL index layout) show the difference quite clearly.

I have to admit that this is the first time I have ever considered that this indirection might have any advantage. In most scenarios, it turns any scan on a secondary index into an O(N log N) operation, and that can really hurt performance. With Voron, we actually moved in 4.0 from keeping the primary key in the secondary index to keeping the on-disk position, because the performance benefit was so high.
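
To see where that cost comes from, here is a purely illustrative sketch: when a secondary-index match stores a direct disk location, resolving it is O(1) per row; when it stores a primary key, each row needs another O(log N) descent of the primary index, so a scan returning N rows costs O(N log N).

```python
import bisect

# Hypothetical primary index: sorted keys with their on-disk locations.
primary_keys = list(range(1_000_000))
row_locations = [pk * 4096 for pk in primary_keys]  # fake disk offsets

def resolve_direct(matches):
    # Postgres-style: the secondary index already stores the tuple's
    # disk location, so each match resolves in O(1).
    return [location for (_value, location) in matches]

def resolve_indirect(matches):
    # InnoDB-style: the secondary index stores the primary key, so each
    # match pays an extra O(log N) search of the primary index.
    locations = []
    for (_value, pk) in matches:
        i = bisect.bisect_left(primary_keys, pk)  # the extra B-tree descent
        locations.append(row_locations[i])
    return locations
```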

That said, a lot of the pain Uber is feeling has to do with the way Postgres has implemented MVCC. Because it writes a new version of the record on every update, it needs to update all indexes, all the time, and after a while it needs to do more work to remove the old version(s) of the record. In contrast, with Voron we don’t need to move the record (unless its size changed), and all other indexes can remain unchanged. We do that with copy-on-write and a page translation table, so while we have multiple copies of the same record, they are all in the same “place” logically; it is just the point of view that changes.
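
Here is a toy model of that idea (just the shape of the technique, not Voron’s actual implementation): an update copies the page to a new physical location and records the move in a translation table, so the logical page number, which is what everything else points at, never changes.

```python
# Toy copy-on-write pager: logical page numbers are stable, so index
# entries that reference a page stay valid across updates.
class Pager:
    def __init__(self):
        self.pages = {1: b"v1 of the record"}  # physical storage
        self.translation = {}                  # logical -> physical overrides

    def write(self, logical_page: int, data: bytes) -> None:
        physical = max(self.pages) + 1         # copy-on-write: fresh page
        self.pages[physical] = data
        self.translation[logical_page] = physical

    def read(self, logical_page: int) -> bytes:
        physical = self.translation.get(logical_page, logical_page)
        return self.pages[physical]

pager = Pager()
pager.write(1, b"v2 of the record")  # the old copy remains for older readers
assert pager.read(1) == b"v2 of the record"
```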

From my perspective, that was the simplest thing to implement, and we get to reap the benefit on multiple fronts because of this.

Replication Format

Postgres sends the WAL over the wire (simplified, but easier to explain), while MySQL sends commands. When we had to choose how to implement over-the-wire replication with Voron, we also sent the WAL. It is simple to understand, extremely robust, and we already had to write the code to do that. Doing replication with it also allows us to exercise this code routinely, instead of it only running during rare crash recovery.

However, sending the WAL has issues, because it modifies the data on disk directly, and a problem there can cause severe damage (data corruption, including taking down the whole database). It is also extremely sensitive to versioning issues, and it would be hard, if not impossible, to make sure that multiple versions can replicate to one another. It also means that any change to the on-disk format needs to be considered with distributed versioning in mind.

But what killed it for us was the fact that it is almost impossible to handle the scenario of replacing the master server automatically. In order to handle that, you need to be able to deterministically let the old server know that it is demoted and should accept no writes, and let the new server know that it can now accept writes and send its WAL onward. But if there is a period of time in which both can accept writes, then you can’t really merge the WALs, and trying to is going to be really hard. You can try using distributed consensus to run the WAL, but that is really expensive (about 400 writes/second in our benchmark, which is fine but not great, and it imposes a high latency requirement).

So it is better to have a replication format that is more resilient to concurrent divergent work.
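
The contrast is easy to see in a toy example (both payloads below are hypothetical shapes, not any database’s actual wire format): a physical WAL frame only makes sense against a byte-identical replica, while a logical command describes intent and can be replayed, reordered, or conflict-resolved.

```python
# Physical (WAL-style) payload: raw bytes at a page offset. Two servers
# that diverged cannot merge streams like this; the writes may land on
# pages whose contents no longer match.
wal_frame = {"page": 774, "offset": 128, "bytes": b"\x00\x2a..."}

# Logical (command-style) payload: describes the operation itself, so a
# replica can apply it regardless of its own on-disk layout, and
# divergent histories can be merged with conflict resolution.
command = {
    "op": "put",
    "collection": "users",
    "key": "users/1",
    "value": {"name": "Bob"},
}
```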

OS Page Cache Vs Buffer Pool

From the post:

Postgres allows the kernel to automatically cache recently accessed disk data via the page cache. … The problem with this design is that accessing data via the page cache is actually somewhat expensive compared to accessing RSS memory. To look up data from disk, the Postgres process issues lseek(2) and read(2) system calls to locate the data. Each of these system calls incurs a context switch, which is more expensive than accessing data from main memory. … By comparison, the InnoDB storage engine implements its own LRU in something it calls the InnoDB buffer pool. This is logically similar to the Linux page cache but implemented in userspace. While significantly more complicated than Postgres’s design…

So Postgres is relying on the OS Page Cache, while InnoDB implements its own. But the problem isn’t with relying on the OS Page Cache; the problem is how you rely on it. The way Postgres does that is by issuing (quite a lot of, it seems) system calls to read the data. And yes, that is expensive.

On the other hand, InnoDB needs to do the same work as the OS, with less information, and quite a bit of complex code, but it means that it doesn’t need to do so many system calls, and can be faster.

Voron, on the gripping hand, relies on the OS Page Cache to do the heavy lifting but generally issues very few system calls. That is because Voron memory-maps the data, so accessing it is usually just a pointer dereference; the OS Page Cache makes sure the relevant data is in memory, and everyone is happy. In fact, because we memory-map the data, we don’t have to manage buffers for system calls or copy the data; we can serve it directly. This ends up being the cheapest option by far.
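
For illustration, this is roughly what the memory-mapped approach looks like from the application’s side (a generic Python sketch with a hypothetical file name, not Voron itself):

```python
import mmap

# Map the data file once; afterwards, reads are ordinary memory accesses
# and the OS page cache decides what actually lives in RAM.
with open("data.bin", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

    header = mm[0:16]        # no lseek(2)/read(2) per access: a page
    record = mm[4096:8192]   # fault on first touch, then memory speed

    mm.close()
```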

Connection Handling

Spawning a process per connection is something that I haven’t really seen since the CGI days. It seems pretty harsh to me, but it is probably nice to be able to kill a connection with a kill -9, I guess. A thread per connection is also something that you don’t generally see. The common approach today, and what we do with RavenDB, is to have a pool of threads that each manage multiple connections at the same time, often interleaving the execution of different connections using async/await on the same thread for better performance.
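
A minimal sketch of that last model (generic asyncio, not RavenDB’s actual networking code): a single event-loop thread interleaves many connections, yielding at each await so other connections can make progress.

```python
import asyncio

async def handle_connection(reader, writer):
    # Many connections share one event-loop thread; every await is a
    # point where this connection yields the thread to the others.
    while data := await reader.readline():
        writer.write(b"echo: " + data)
        await writer.drain()
    writer.close()

async def main():
    server = await asyncio.start_server(handle_connection, "127.0.0.1", 8080)
    async with server:
        await server.serve_forever()

asyncio.run(main())
```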

Source: re: Why Uber Engineering Switched From Postgres to MySQL – DZone Database

Relational vs. NoSQL Databases for API Traffic 

API consumption drives the importance of API traffic data. Without insight into how your APIs are being consumed, you cannot provide analytics about your customers and their API usage.

API traffic data has a few defining characteristics: high frequency, large and variable payloads, loosely structured data, and sheer volume of records and objects. Persisting API traffic is important because most services have some sort of rate limiting and different billing tiers for customers based on their usage. There is usually a threshold for alerting and scaling, so if someone is making more API calls than they should, you can partition them. You also want to be able to provide analytics to your internal organization for the different APIs/resources you are exposing. If your customers are exposing an API, it is a good bet that they are integrating with other services as well, allowing them to slice and dice their data alongside the data they consume from your product.

There are also differences between typical API request and response payloads, and request vs. response data can change what your persistence strategy needs to be. Request payloads are usually moderate to large; about 80% of API requests are GETs, and POST requests can also be large. Response payloads are typically large as well. The data structures are nondeterministic, meaning that when you try to run analytics, the tables can grow large and the shape of the response data varies, which makes slicing and dicing it harder.
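
As a concrete (entirely hypothetical) example, one captured API call might be persisted as a small, variable-shape record, with a simple per-customer counter behind the rate-limit threshold:

```python
import time
from collections import defaultdict

# Hypothetical shape of one captured API call; the response portion
# varies per endpoint, which is what makes the structure nondeterministic.
traffic_record = {
    "ts": time.time(),
    "customer_id": "cust-123",
    "method": "GET",
    "path": "/v1/orders",
    "status": 200,
    "latency_ms": 42,
    "response_bytes": 18_304,
}

# Naive per-customer counter backing rate limiting and billing tiers.
calls_this_period = defaultdict(int)

def record_call(record: dict, limit: int = 10_000) -> bool:
    """Count the call; returns False once the customer is over quota."""
    calls_this_period[record["customer_id"]] += 1
    return calls_this_period[record["customer_id"]] <= limit
```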

Having established that persisting the data is the best option, most teams will look to a SQL database, since they are most likely already using one. The question is whether a NoSQL database should be considered instead. SQL, at more than 40 years old, is the primary interface for RDBMSs and, with both commercial and open-source implementations available, a strong contender.

NoSQL, on the other hand, has existed since the 1960s and was the primary storage mechanism before SQL gained popularity. It has regained traction in the last 10 years.

Comparing the two technologies below, the main factors in the decision are data growth, online versus archived data, search filter flexibility, search performance, and clustering and sharding. When deciding which is best for you, consider your current inbound API traffic, your data-retention policy, your estimated data growth, and whether your customers need heavy slicing and dicing.

How the two compare:

  • Data model: SQL organizes data in a tabular, relational structure; NoSQL offers different models (document, graph, key/value).
  • Schema: SQL requires a pre-defined schema; NoSQL allows dynamic schema definition.
  • Scaling: SQL is typically vertically scalable (higher-cost VMs); NoSQL is horizontally scalable (lower-cost VMs).
  • Query interface: SQL’s is powerful and standardized; NoSQL’s varies by provider.
  • Consistency: most SQL implementations are ACID-compliant; NoSQL stores make the trade-offs described by the CAP theorem (Consistency, Availability, Partition tolerance).

You can use a SQL datastore when you have:

  • Manageable data sizes.
  • Short time- or size-based retention policies.
  • Low usage frequency.
  • Lightweight analytics.
  • A need for a standardized query interface.

You can use a NoSQL datastore when (a minimal sketch follows this list):

  • Scale and volume are important.
  • Deep analytics are required.
  • Fast queries are paramount.
  • You can live with a non-standard query interface.
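
As a sketch of the NoSQL route (using pymongo for illustration; the collection and field names are hypothetical), variable-shape traffic records can be inserted as-is and sliced and diced inside the store:

```python
from pymongo import MongoClient  # illustrative; any document store works

db = MongoClient("mongodb://localhost:27017").analytics

# Persist one variable-shape traffic record; a new endpoint can add
# fields without a schema migration.
db.api_calls.insert_one({
    "customer_id": "cust-123",
    "method": "GET",
    "path": "/v1/orders",
    "status": 200,
    "response_bytes": 18_304,
})

# Aggregate in the store itself: calls per customer per path.
pipeline = [
    {"$group": {
        "_id": {"customer": "$customer_id", "path": "$path"},
        "calls": {"$sum": 1},
    }}
]
for row in db.api_calls.aggregate(pipeline):
    print(row["_id"], row["calls"])
```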

Source: Relational vs. NoSQL Databases for API Traffic – DZone Database