Partitioning a database into multiple shards is a good way to build a scalable enterprise application. This article describes our experience sharding our application, focusing on issues that most descriptions of sharding leave out.
Ours is an instant messaging application where people communicate with one another and store lists of their buddies, whose presence statuses are tracked.
A typical architecture for enterprise applications is a cluster of application servers running on top of a single database. User requests are load balanced across the cluster. Important state is stored only in the database. This is a reasonably scalable architecture because one can keep adding application server nodes as the load increases. There comes a point, however, when the database becomes the bottleneck. A solution to this is to spread data across a number of databases, known as shards. One shard stores some people’s data, while another shard stores other people’s data. The application is designed to go to the correct shard for a user’s request.
There are different ways to partition data. One is fixed hash key based partitioning, where a shard is treated as a hash bucket. A user’s shard is determined by hashing, say, their user id. The problem is that adding a shard changes the hash function, forcing existing data to be rehashed and migrated. A better approach is to use a dynamic directory that maps user id to shard. Such a directory could be stored in a single database. It receives far less load than the application database, so scaling it is not a pressing concern.
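To make the contrast concrete, here is a minimal sketch of the two schemes. All names are illustrative, and an in-memory map stands in for the small directory database.

```java
import java.util.HashMap;
import java.util.Map;

public class ShardLookup {

    // Fixed hash key partitioning: the shard is a pure function of the user id,
    // so changing the shard count changes the answer for most users.
    static int fixedHashShard(long userId, int shardCount) {
        return (Long.hashCode(userId) & 0x7fffffff) % shardCount;
    }

    // Dynamic directory: an explicit user id -> shard mapping. Here a plain map
    // stands in for the directory database.
    static final Map<Long, Integer> directory = new HashMap<>();

    static int directoryShard(long userId) {
        return directory.get(userId);
    }

    public static void main(String[] args) {
        long userId = 45L;
        System.out.println("4 shards: " + fixedHashShard(userId, 4));
        System.out.println("5 shards: " + fixedHashShard(userId, 5)); // user has moved

        // A directory entry survives the addition of a shard unchanged.
        directory.put(userId, fixedHashShard(userId, 4));
        System.out.println("directory: " + directoryShard(userId));
    }
}
```

With the fixed hash, growing the fleet from four to five shards silently moves users to different shards; with the directory, existing entries stay put and only new users need placing.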
Our object model was almost perfectly suited to sharding. It was mapped by Hibernate to the database. There was a user class at the centre, surrounded by a number of other classes. Each user object, with its subsidiaries, could live in one shard. There were a couple of classes with instance variables that referenced another user’s data; in the relational model these were foreign keys. We couldn’t have foreign keys across shards, so we replaced them with instance variables containing the other user’s primary key. That user’s data was then explicitly loaded by application code when required.
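A minimal sketch of that replacement, using hypothetical class names rather than the article’s actual model:

```java
import java.util.Map;

// Hypothetical names throughout; they are not the article's actual classes.
record User(long id, String name) {}

interface UserRepository {
    User findById(long userId); // routes to whichever shard owns that user
}

public class BuddyEntry {
    // Before sharding this could have been a mapped association to User, i.e. a
    // foreign key in the relational model. Across shards we keep only the key:
    private final long buddyUserId;

    BuddyEntry(long buddyUserId) {
        this.buddyUserId = buddyUserId;
    }

    // The application resolves the reference explicitly when required,
    // instead of relying on a database join.
    User loadBuddy(UserRepository repo) {
        return repo.findById(buddyUserId);
    }

    public static void main(String[] args) {
        // A map stands in for the other user's shard.
        Map<Long, User> shardData = Map.of(9L, new User(9L, "carol"));
        UserRepository repo = shardData::get;
        System.out.println(new BuddyEntry(9L).loadBuddy(repo).name());
    }
}
```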
We used the fixed hash key approach to partitioning our data.
Most operations required one or two shard accesses, one for the requesting user and one for the person they were communicating with. However, some operations required an indefinite number of database requests. These would limit scalability. One example is a person viewing their list of friends with their names and presence statuses. If we weren’t careful, there would be a database request to load each friend’s name and status. Each such request would go to its own shard. In the old un-sharded design, this didn’t happen because a single database request could perform a join that would return each friend’s details.
The solution was to store in memory a tiny record for every user, containing their name and presence status. We used memcached to keep this scalable as our data grew. Now we only needed to query memcached to produce a person’s friend list.
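The shape of that cache can be sketched as follows, with a ConcurrentHashMap standing in for memcached; the record and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class PresenceCache {
    // The tiny per-user record: just enough to render a friend list entry.
    record Presence(String name, String status) {}

    private final Map<Long, Presence> cache = new ConcurrentHashMap<>();

    void put(long userId, String name, String status) {
        cache.put(userId, new Presence(name, status));
    }

    // One pass over the cache replaces an indefinite number of shard queries.
    List<Presence> friendList(List<Long> friendIds) {
        List<Presence> result = new ArrayList<>();
        for (long id : friendIds) {
            Presence p = cache.get(id);
            if (p != null) result.add(p); // a miss would fall back to the owning shard
        }
        return result;
    }

    public static void main(String[] args) {
        PresenceCache c = new PresenceCache();
        c.put(1L, "alice", "online");
        c.put(2L, "bob", "away");
        System.out.println(c.friendList(List.of(1L, 2L)));
    }
}
```

With a real memcached client the loop would typically collapse into a single multi-get of all friend ids, keeping the operation at a fixed number of round trips.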
The main idea here is that operations that require a fixed number of shard accesses are fine, but those that require an indefinite number of shard accesses will limit scalability, so it is important to redesign them.
We configured the container to use two-phase commit in order to maintain the abstraction of a single application transaction that would be committed atomically to all shards or none, but not some.
PostgreSQL can detect and break deadlocks among a set of transactions in a single database, so our application was not initially designed to avoid deadlocks. Looking at PostgreSQL’s monitoring data in pg_stat_activity after deploying the new sharded application, we found that from time to time a number of transactions would be marked waiting, and others marked ‘<IDLE> in transaction’. We thought these were signs that our container was not ending transactions properly, but that wasn’t the case. They were signs of cross-shard deadlocks, which no single shard can detect. Cross-shard deadlocks can occur when two application level transactions, each consisting of two shard transactions, contend for the same objects from each shard.
In Table 1, Shard 1 only sees JTA1.ST1 and JTA2.ST1, which it doesn’t consider deadlocked against each other, so it makes the latter wait. Similarly Shard 2’s view is also local, and it makes JTA1.ST2 wait. The application level transactions, JTA1 and JTA2, are now deadlocked. It is only the application, and not PostgreSQL, that can detect this application level deadlock.
Our solution was to rewrite parts of our application to use standard deadlock avoidance techniques.
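One standard technique is to impose a single global order on resource acquisition, so that a wait cycle like the one in Table 1 cannot form. A minimal sketch, assuming shards are ordered by id:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class OrderedShardAccess {
    // The shard ids an application level transaction will touch, e.g. the
    // requester's shard and their correspondent's shard.
    static List<Integer> lockOrder(int... shardIds) {
        List<Integer> order = new ArrayList<>();
        for (int id : shardIds) order.add(id);
        // Every transaction acquires shards in the same canonical order,
        // so two transactions can never wait on each other in a cycle.
        order.sort(Comparator.naturalOrder());
        return order;
    }

    public static void main(String[] args) {
        // JTA1 touches shards (1, 2); JTA2 touches shards (2, 1). Both now
        // begin with shard 1, so one of them simply waits and then proceeds.
        System.out.println(lockOrder(1, 2));
        System.out.println(lockOrder(2, 1));
    }
}
```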
Usually objects in a hash structure, such as our shards, are keyed on a single attribute. For us, this attribute was the user id. One problem was that we needed lookups by other attributes too. One type of such an attribute was the auto-generated primary key of various tables. Another type was an alternate business key, such as a user’s phone number. How could we tell which shard to look in when searching for a user by these secondary keys?
The first problem was easy to solve by configuring each shard’s key generator with their own key spaces. That way primary keys were unique across shards, and the application knew the rule to compute a shard based on a key.
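The idea can be sketched as follows, with an assumed key-space width; in practice the ranges would be configured in each shard’s key generator.

```java
public class ShardedKeyGenerator {
    // Assumed width of each shard's key space; illustrative, not the real value.
    static final long KEYS_PER_SHARD = 1_000_000_000L;

    private final int shardId;
    private long next;

    ShardedKeyGenerator(int shardId) {
        this.shardId = shardId;
        this.next = shardId * KEYS_PER_SHARD; // start of this shard's key space
    }

    long nextKey() {
        return next++;
    }

    // Because the key spaces are disjoint, the owning shard can be computed
    // from any primary key without consulting a directory.
    static int shardForKey(long key) {
        return (int) (key / KEYS_PER_SHARD);
    }

    public static void main(String[] args) {
        ShardedKeyGenerator gen = new ShardedKeyGenerator(2);
        long key = gen.nextKey();
        System.out.println(key + " lives on shard " + shardForKey(key));
    }
}
```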
The second problem was trickier. Once a user has been mapped to their shard based on their user id, there isn’t a simple mathematical function that can now be applied to their phone number to tell their shard. This is a limitation of fixed hash key partitioning, and a good reason to use a dynamic directory approach instead.
In a dynamic directory, there would be a primary index mapping user id to shard number. Then there would be a secondary index mapping phone number to user id. To know a shard given a phone number, we would follow the path from the secondary index up to the primary index and then to the shard number. In general, there could be any number of secondary indices, one for each kind of lookup that an application performed. If the dynamic directory were stored in a database, with a table for each index, then the traversal from a secondary index to the primary index and from there to the shard number would simply be a join across tables. HiveDB takes this a step further and lets you annotate class attributes representing primary and secondary indices, and then builds the dynamic directory schema for you.
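An in-memory sketch of that two-level traversal; in the database version the two maps become tables and the traversal becomes the join described above.

```java
import java.util.HashMap;
import java.util.Map;

public class DynamicDirectory {
    private final Map<Long, Integer> userToShard = new HashMap<>(); // primary index
    private final Map<String, Long> phoneToUser = new HashMap<>();  // secondary index

    void register(long userId, String phone, int shard) {
        userToShard.put(userId, shard);
        phoneToUser.put(phone, userId);
    }

    int shardByUserId(long userId) {
        return userToShard.get(userId);
    }

    // Lookup by the alternate business key: secondary index -> primary index -> shard.
    int shardByPhone(String phone) {
        return shardByUserId(phoneToUser.get(phone));
    }

    public static void main(String[] args) {
        DynamicDirectory dir = new DynamicDirectory();
        dir.register(7L, "+1-555-0100", 3);
        System.out.println(dir.shardByPhone("+1-555-0100"));
    }
}
```

In SQL, with hypothetical table names, the same lookup might read `SELECT p.shard FROM phone_index ph JOIN user_index p ON ph.user_id = p.user_id WHERE ph.phone = ?`.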
Until our application moves to a dynamic directory, we perform lookups by phone number by querying each shard in turn. This is not scalable.
The way to scale up the database tier of an object-oriented application is to break the object model into clusters of objects that share nothing with each other; each cluster can then be stored on its own shard. When a cluster of objects must reference another cluster, perform the navigation in application code and not as a join in the database. Redesign operations that require accessing an indefinite number of shards. Sometimes this requires a small amount of denormalization; at other times it requires a second data source, such as a memcached cluster, which needs to scale less than the database because it holds less data.