Database sharding – when and how we implement

Thursday, October 15, 2009 17:47
Posted in category Concept, Implementation

Shard your mindI started building an application for the education system and it will have to use huge amounts of data.
Some of the issues I have to address are size of the database, real-time reporting and speed of access to the database. Considering this aspects I started looking into database sharding so here are my first conclusions, a short presentation on what database sharding can do for you.

Solutions like should be planned before project starts in order to have an effortless scaling in time.

What is database sharding?
A decomposition of a database into multiple smaller units that can handle requests individually; A breaking into shards; A shard is a method of horizontal partitioning in a database or search engine.

What is database sharding beyond these definitions and how can it help us in real life when building online applications

When we plan an online application we have to think about it’s potential to grow. The growth translates, in general, into more requests to the application and huge resource consumption. From a certain point the resource consumption evolves exponentially, not linearly.
Even if today hardware is cheap and prices are going down in time there comes a time when investing in new, more powerful, hardware doesn’t bring too much performance and the cost per percent of performance increased is growing exponentially.

Where do problems appear more often?

  • web server: problems with accessing files or too many concurrent requests; you can deal with this kind of problems by using a CDN for file storage and by using balancers for web servers; there are a lot of solutions for this issue, both hardware and software, both cheap and expensive
  • database server: problems arise both at writing to database and reading from it; at a certain point even a replication model (few write servers and more read servers in replication, or whatever model suits you best) can be inefficient when problems are generated by huge tables and indexes.

How can database sharding help us in this case?

When using sharding as a solution for database horizontal partitioning we have to think about logical distribution of the database on multiple servers.
The structure of the database will be the same as for a normal architecture in most cases but on each server we will have smaller databases with fewer records.

Now let’s give an example to translate all that I wrote until now :)

Let’s consider we have an application that manages users and the messages exchanged between those users.

At start let’s consider we will only need 2 database servers.
The structure for the tables in the 2 databases will be the same, the only difference is that on server Alpha we will put the users that have and id that is even and on server Beta we will put the users that have an odd id.
On the Alpha server we will store also all the messages related to the Alpha users, both sent and received and on Beta we will do the same for Beta users.
If we do it this way we will double the number of messages in the entire system but we gain performance when doing readings from the database.

This is a simple solution for a simple problem. For more complex problems there are other solutions and I will write about those in the following lines.

One more thing about this example: how do we work with this database setup from the point of view of the developer?
Simple: all the requests to the database layer will have to contain the user id too and the database layer will decide, based on the user id, to which server it will send the request.

Moving on: types of database sharding implementations

choicesThere are many ways to implement database sharding for the application we want to develop.
Each one has its own advantages and disadvantages and it’s our mission to decide which one fits our project best.
Some ways of database sharding implementations can do us a lot of good but some can destroy our application so always plan smart.

  1. Vertical partitioning
    We split the database into logical entities that can live independently from others. Some entities will go on shard while others on different shards. We can also have a dedicated shard for one entity.
    Example: we have one shard for user profiles, one shard for messages and one shard for image gallery.
    In this way, when we want to show images we will send requests to the image shard. All we have to do is to maintain in all shards a reference to a central entity, in this case the user.
    This set-up works well in certain cases and it’s easy to implement but can’t handle individual growth of entities.
  2. Interval partitioning
    This set-up is best suited for financial applications and means that we choose the shard we write to/read from by checking in which interval a certain value fits.
    Example: all financial data for year 2008 is saved in shardX and for year 2009 in shardY.
  3. ID based partitioning
    In this case we copy the same database structure on all shards and we define the main entity of the application, for example the user.
    Considering we have n shards available we will have the following rule of writing / reading: for user id 1 we use shard 1, for user id 2 we use shard 2, for user id n we use shard n, for user id n+1 we use again shard 1 and so on.
    Such a solution is a good option when we can estimate the growth of the database.
    The good: all shards will be evenly loaded.
    The bad: is hard to increase the number of shards used and re-balance the shards.
  4. Directory based partitioning
    Think of a system where for each database request we also send to the database layer the entity identifier and the id (for example: entity – users, id – 567 or entity – images, id – 23).
    We also build a directory that relates entity + id to shard.
    Choosing such a system is the best solution when not knowing how your application will grow and when. It’s also the hardest to implement because for all database operations you have to update information’s in the directory.

When not to use database sharding

There are still a lot of situations when we are not forced to use database sharding, there a lot of other solutions we can use until we really need database sharding.
We can use sql cache, sql proxy, storing entire record sets with memory with memcache. Or we can use better hardware.

Such an example comes from the 37signals team, read here.

Before the end

Here is a nice presentation from NETLOG about database sharding, enjoy!

Related Posts

You can leave a response, or trackback from your own site.

7 Responses to “Database sharding – when and how we implement”

  1. seomoz says:

    May 2nd, 2010 at 7:16 am

    Hi, colleague! I love your blog, it’s so interesting! I think it’s pretty popular, isn’t it? I would like to invite you to my favorite Pay-Per-Click system, I believe you can earn with your blog a lot here. My crazy russian friend earns $3.000 per day here! Look, it doesn’t obligate you to anything http://klikvip.com/landings/en/landing2/index.php?aff=35357

  2. Jason Guglielmo says:

    May 16th, 2010 at 9:28 am

    I would like to thank you for the efforts you have made in writing this article. I am hoping the same best work from you in the future as well and i have start my own blog now, it’s about to domain name , thanks for your effort

  3. Peggy Saelens says:

    July 9th, 2010 at 4:42 am

    Thank you for another great article. Where else could anyone get that kind of information in such a perfect way of writing? I have a presentation next week, and I am on the look for such information.

  4. kampanie AdWords says:

    July 11th, 2010 at 12:49 am

    Great blog, I will add this blog to my favorites.

  5. Cherlyn Dekay says:

    July 11th, 2010 at 5:19 pm

    I would like to say, nice webpage. Im not sure if it has been addressed, however when using Firefox I can never get the entire page to load without refreshing alot of times. Could just be my modem.

  6. Madelyn Garlow says:

    July 12th, 2010 at 3:38 am

    Just saying thanks will not just be sufficient, for the tremendous lucidity in your writing. I will right away grab your rss feed to stay abreast of any updates.

  7. Pharmacy technician book says:

    July 12th, 2010 at 8:36 am

    What a great resource!

Leave a Reply