Follow-up post to Hashing example applied in database partitioning
Why do we need consistent hashing?
To address the drawbacks of the static hashing technique. To list a few:
- tight coupling with the number of nodes in the distributed system makes it cumbersome to dynamically add/remove nodes
- adding/removing nodes forces us to change the hash function and redistribute the load across the new number of nodes by recalculating the hash for all data/requests (see the sketch after this list)
- this introduces downtime until the migration is complete and hurts system availability, which is not what we want in a distributed system
- if a node goes down, the load is distributed unevenly across the remaining nodes
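To make the first two drawbacks concrete, here is a minimal sketch of static (modulo-based) hashing; the MD5 choice and the key list are just assumptions for illustration:

```python
import hashlib

def static_node_for(key: str, num_nodes: int) -> int:
    # Static hashing: the node index depends directly on the node count.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_nodes

# Hypothetical partition keys (email addresses, as in the example below).
keys = ["alice@example.com", "bob@example.com", "carol@example.com",
        "dave@example.com", "erin@example.com", "frank@example.com"]

# Going from 3 nodes to 4 remaps most keys, forcing a large migration.
moved = sum(static_node_for(k, 3) != static_node_for(k, 4) for k in keys)
print(f"{moved} of {len(keys)} keys moved after adding one node")
```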
How does consistent hashing help?
Consistent hashing solves the problem of rehashing by providing a distribution scheme which does not directly depend on the number of servers.
Consistent Hashing is a distributed hashing scheme that operates independently of the number of servers or objects in a distributed hash table by assigning them a position on an abstract circle, or hash ring. This allows servers and objects to scale without affecting the overall system.
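As a rough sketch of this scheme (assuming an MD5-based hash and a sorted list as the ring; real implementations vary), servers and keys are hashed onto the same integer ring, and each key is served by the next server clockwise from its position:

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    # Position on the ring: a large integer derived from an MD5 digest.
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    """Minimal hash ring: servers and keys share one hash function."""

    def __init__(self, servers):
        self._positions = sorted((ring_hash(s), s) for s in servers)

    def server_for(self, key: str) -> str:
        # Walk clockwise: the first server at or after the key's position,
        # wrapping around to the start of the ring if we fall off the end.
        idx = bisect.bisect_left(self._positions, (ring_hash(key),))
        return self._positions[idx % len(self._positions)][1]

    def remove(self, server: str) -> None:
        self._positions = [(p, s) for p, s in self._positions if s != server]
```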
Let's take an example of consistent hashing:
- we have a distributed system serving requests from users
- we use email addresses as the partition key to distribute the load across the available servers
- servers are arranged in a circular fashion on a hash ring
- we use the same hash function f(h) to calculate the hash for servers as well as for incoming requests, placing each on the hash ring
- s1 handles requests from 2 users, s2 does the same, and s3 handles requests from only 1 user
- let's assume s1 goes down: the requests s1 was handling will now be handled by s2, the next server on the ring. There is no need for rehashing, and s3 remains unchanged
- this avoids rehashing, but the implementation is half-baked and causes uneven load (s2 now processes 4 requests vs s3 processing only 1); a runnable check follows this list
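Reusing the HashRing sketch above with hypothetical servers and users, we can verify that removing s1 only moves the users s1 was serving; the exact 2/2/1 split in the walkthrough depends on the actual hash values, so treat the counts as illustrative:

```python
ring = HashRing(["s1", "s2", "s3"])
users = ["u1@example.com", "u2@example.com", "u3@example.com",
         "u4@example.com", "u5@example.com"]

before = {u: ring.server_for(u) for u in users}
ring.remove("s1")                      # s1 goes down
after = {u: ring.server_for(u) for u in users}

for u in users:
    if before[u] != after[u]:
        # Only the users s1 was serving move; everyone else stays put.
        print(f"{u}: {before[u]} -> {after[u]}")
```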
Better implementation of consistent hashing
Let's extend the above example and improve the placement of servers on the hash ring:
- instead of using one hash function, we apply 3 different hash functions, f1(h), f2(h), and f3(h), to place each node at multiple points on the hash ring
- these become virtual nodes on the hash ring; underneath, it's still just one server placed at several virtual positions, so s1 appears as s11, s12, s13 (one per hash function), and likewise for s2 and s3
- we use only f1(h) to calculate the hash for users and place them on the hash ring, just like in the first approach
- s1 handles requests from only 1 user, while s2 and s3 handle requests from 2 users each
- let's assume s2 goes down, so s21, s22, and s23 all go down
- the requests that went to s22 will be handled by s32, and the requests that went to s23 will be handled by s12
- s1 now handles 2 requests and s3 handles 3, a much more even load than what we saw earlier without virtual nodes (a sketch of this variant follows this list)
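Here is a minimal sketch of the virtual-node variant. Instead of three truly distinct hash functions, it simulates f1, f2, f3 by hashing salted names (s1#0, s1#1, s1#2) with one base function, a common simplification; the server names and replica count are assumptions for illustration:

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    # Same ring hash as in the earlier sketch.
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class VirtualNodeRing:
    """Hash ring where each physical server owns several virtual nodes."""

    def __init__(self, servers, replicas=3):
        self._positions = []
        for s in servers:
            for i in range(replicas):
                # s1 becomes virtual nodes s1#0, s1#1, s1#2 on the ring;
                # each virtual node points back to the physical server.
                self._positions.append((ring_hash(f"{s}#{i}"), s))
        self._positions.sort()

    def server_for(self, key: str) -> str:
        idx = bisect.bisect_left(self._positions, (ring_hash(key),))
        return self._positions[idx % len(self._positions)][1]

    def remove(self, server: str) -> None:
        # Taking s2 down removes all of its virtual nodes at once; its
        # keys scatter to the clockwise neighbours of each virtual node,
        # instead of all landing on a single server.
        self._positions = [(p, s) for p, s in self._positions if s != server]
```

In practice, systems typically use far more than 3 virtual nodes per server (often 100 or more), so the load spreads out statistically even as servers come and go.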