Consistent Hashing 101

When working on distributed systems, we often have to distribute some kind of workload on different machines (nodes) of a cluster so we have to rely on a predictable and reliable key/value mapping algorithm.

If you’re not sure about what that means, just consider the following questions when working on a cluster with a lot of machines (nodes):

  • how could I make sure that all the data for a given user always gets delivered and processed on the same machine ?
  • how could I make sure that I store and query the same cache server for a given key ?
  • how do I split and distribute chunks of a large file across multiple storage machines and then make sure I can still access it through all those machines at once ?

A lot of technologies answer those kind of mapping questions by implementing a hashing based distribution of their workload (be it a distributed processing, file or cache).

Consistent hashing is one of those implementations and I felt like taking some time to explain it further.

Example use case

Here is my attempt to explain what is consistent hashing and why it is needed on distributed systems. To make things fun, we’ll take this simple use case:

  • I have 5 broken cars
  • There are 4 repair factories nearby
  • I have to implement a way to figure out where to send each car to get it fixed
  • I want to ensure that an even number of cars will get fixed on each factory


This gets down to two major questions to solve:

  • what is my selection criteria ? this will be my key.
  • what is the expected answer ? this is my value.

Static mapping

The first approach we could implement is to manually distribute the car based on their colour.

  • key = car’s colour
  • value = factory number

To implement this, you use what we usually call dictionaries on various languages : those are static data structures where you assign a value to a key.

We would then write a mapping of “car color” -> “factory n” and apply this simple rule to decide where to ship a broken car.

  "yellow": "factory 1",
  "orange": "factory 2",
  "red": "factory 3",
  "green": "factory 4",
  "black": "factory 1"

This way we could indeed distribute the car repairs, but we can already see that with that an uneven number of colours ends up in over provisioning the factory 1. But there’s worse:

What if I start getting only yellow broken cars ?

  • I would end up sending all of them to the factory 1 and the other factories would remain almost empty !


This is a serious limitation. We need a dynamic way to calculate the car distribution between the factories, for this we will use a hash algorithm !

Hash tables

A hash table is a data structure where we apply a hash function (algorithm) on the key to compute an index (pointer) into an array of buckets (values) from which we get the value.

MD5 gives very good hashing distribution and is widely available so this makes it a very good candidate for a hashing algorithm.

We can relate it like this to our example :

  • key = car’s plate number
  • hash function = md5
  • array of values = [factory 1, factory 2, factory 3, factory 4]

To find out where we send a car we just could do:

hash = md5(car plate number)
index = int(hash) % size_of(array)
index = 0 if index > size_of(array)
factory = array[index]

In python ? okay !

import md5

factories = [1, 2, 3, 4]

def get_factory(plate):
    hash = int(, 16)
    index = hash % len(factories)
    if index > len(factories):
        index = 0
    return factories[index]

>> 3

>> 3

Wow it’s amazingly simple right ? ūüôā

Now we have a way better car repair distribution !… until something bad happens:

What if a factory burns ?


Our algorithm is based on the number of available factories so removing a factory from our array means that we will redistribute a vast majority of the key mappings from our hash table !

Keep in mind that the more values (factories) you have in your array the worse this problem gets. In our case, given a car’s plate number we are sure that we wouldn’t be able to figure out where a vast majority of them were sent any more.

factories = [1, 2, 4]

>> 2 (was 3 before)

>> 1 (was 3 before)

Even worse is that when factory 3 gets repaired and back in my hash table, I will once again loose track of all my dispatched cars… What we need is a more consistent way of sorting this out.

Consistent hashing

The response to this kind of problem is to implement a consistent hashing algorithm. The goal of this technique is to limit the number of remapped keys when the hash table is resized.

This is possible by imagining our factories as a circle (ring) and the hash of our keys as points on the same circle. We would then select the next factory (value) by going through the circle, always on the same way, until we find a factory.


  • Red hashed plate number would go to factory 1
  • Blue hashed plate number would go to factory 3

This way, when a factory gets added or removed from the ring, we loose only a limited portion of the key/value mappings !

Of course on a real world example we would implement a ring with a lot more of slots by adding the same factories on the ring multiple times. This way the affected range of mappings would be smaller and the impact even more balanced !

For instance, uhashring being fully compatible and defaulting to ketama’s ring distribution, you get 160 points per node on the ring !

I hope I got this little example right and gave you some insight on this very interesting topic !

Consul on Gentoo Linux

As a clustering and distributed architecture enthusiast, I’m naturally interested in software providing neat ways to coordinate any kind of state/configuration/you-name-it over a large number of machines.

My quest, as many of you I guess, were so far limited to tools like zookeeper (packaged on my overlay but with almost no echo) and doozerd (last commit nearly 6 months ago) which both cover some of the goals listed above with more or less flavors and elegance (sorry guys, JAVA is NOT elegant to me).

I recently heard about consul, a new attempt to solve some of those problems in an interesting way while providing some rich fuctionnalities so I went on giving it a try and naturally started packaging it so others can too.

WTF is consul ?

consul logo

Consul is a few months’ old project (and already available on Gentoo !) from the guys making Vagrant. I especially like its datacenter centric architecture, intuitive deployment and its DNS + HTTP API query mecanisms. This sounds promising so far !

This is a descripion taken from the Hashicorp’s blog :

Consul is a solution for service discovery and configuration. Consul is completely distributed, highly available, and scales to thousands of nodes and services across multiple datacenters.

Some concrete problems Consul solves: finding the services applications need (database, queue, mail server, etc.), configuring services with key/value information such as enabling maintenance mode for a web application, and health checking services so that unhealthy services aren’t used. These are just a handful of important problems Consul addresses.

Consul solves the problem of service discovery and configuration. Built on top of a foundation of rigorous academic research, Consul keeps your data safe and works with the largest of infrastructures. Consul embraces modern practices and is friendly to existing DevOps tooling.

app-admin/consul ?

This is a RFC and interest call about the packaging and availability of consul for Gentoo Linux.

The latest version and live ebuilds are present in my overlay so if you are interested, please tell me (here, IRC, email, whatever) and I’ll consider adding it to the portage tree.

I want to test it !

Now that would be helpful to get some feedback about the usability of the current packaging. So far the ebuild features what I think should cover a lot of use cases :

  • full build from sources
  • customizable consul agent init script with reload, telemetry and graceful stop support
  • web UI built from sources and installation for easy deployment
# layman -a ultrabug
# emerge -av consul

Hope this interests some of you folks !

keepalived v1.2.11 & glusterfs v3.4.2

Quick post for two quick bumps related to clustering.


  • quite a lot of bug fixes and improvements
  • contains a backport for libgfapi support for integrating with NFS Ganesha
  • nfs/mount3: fix crash in subdir resolution


  • autoconf: better libnl3 detection
  • Fix memory allocation for MD5 digest
  • Quite some nice memory leak fixes on different components
  • vrrp: dont try to load ip_vs module when not needed
  • Pim van den Berg work on libipvs-2.6 to sync with libipvs from ipvsadm 1.27
  • vrrp: extend ip parser to support default and default6
  • vrrp: fix/extend gratuitous ARP handling (multiple people reported issues where MASTER didnt recover properly after outage due to no gratuitous ARP sent)
  • Multiple fixes to¬†genhash
  • vrrp: fix vrrp socket sync while leaving FAULT state (old old bug here)
  • Full changelog here

Tuning pacemaker for large clusters

We’ve been running quite a lot of production clusters using pacemaker/corosync for a while. Some of them are large, handling more than 200 resources across multiple nodes and we’ve exceeded some limits on pacemaker’s CIB size.

I thought I’d share how to tune your cluster to handle such a bunch of resources since there are some default limits on the IPC buffer size which can lead to problems when your resources (and thus CIB) grows too much.

Hitting the IPC limit

When running a large cluster you may hit the following problem :

error: crm_ipc_prepare: Could not compress the message into less than the configured ipc limit (51200 bytes).Set PCMK_ipc_buffer to a higher value (2071644 bytes suggested)

Evaluating the buffer size

Have a look at the size of your current CIB :

# cibadmin -Ql > cib.xml
# ls -l cib.xml
# bzip2 cib.xml
# ls -l cib.xml.bz2

The CIB is compressed on the wire using bzip2 so you have to compare the compressed cib.xml.bz2 with the IPC default buffer size of¬†51200 and you’ll find the sufficient¬†PCMK_ipc_buffer value for you (take more just to be safe).

Setting the environment variables

On Gentoo Linux, you’ll have to create the /etc/env.d/90pacemaker¬†file containing :

  • PCMK_ipc_buffer : you may need to increase this depending on your cluster size and needs
  • PCMK_ipc_type¬†: the shared-mem one is the default now, other values are¬†socket|posix|sysv

You will also need to set these env. vars in your¬†.bashrc so that the crm CLI doesn’t break¬†:

export PCMK_ipc_type=shared-mem
export PCMK_ipc_buffer=2071644


Finally, I wanted to let you know that the upcoming Pacemaker v1.1.11 should come with a feature which will allow the IPC layer to adjust the PCMK_ipc_buffer automagically !

Hopefully you shouldn’t need this blog post anymore pretty soon ūüôā

EDIT, Jan 16 2014

Following this blog post, I had a very interesting comment from @beekhof (lead dev of pacemaker)

beekhof> Ultrabug: regarding large clusters, the cib in 1.1.12 will be O(2) faster than 1.1.11.
Ultrabug> beekhof: that's great news mate ! when is it scheduled to be released ?
beekhof> 30th of Feb

Latest cluster releases

Now that I’m back I’ve bumped some of the sys-cluster packages. Users of keepalived will be interested in this since it was more than a year that upstream released a version.


This is a big and long awaited one. It features major enhancements, features and bug fixes. The changelog is pretty huge but here are some quick points which I particulary liked (biased view warning) :

  • Revisited the whole code to use posix declaration style
  • Boon Ang fixed comparison of primary IP addresses. If a router in the master state receives an advertisement with priority equal to the local priority, it must also compare the primary IP addresses (RFC 3768, section 6.4.3). The code to handle this was comparing two IP addresses with different byte-ordering, resulting in multiple routers in the master state. This patches resolves the problem by coverting the local primary IP address to network byte order for the comparison.
  • Henrique Mecking fixed memory leak in libipvs
  • Willy Tarreau and Ryan O’Hara add the ability to use VRRP over unicast. Unicast IP addresses may be specified for each VRRP instance with the ‘unicast_peer’ configuration keyword. When a VRRP instance has one or more unicast IP address defined, VRRP advertisements will be sent to each of those addresses. Unicast IP addresses may be either IPv4 or IPv6. If you are planing to use this option, ensure every ip addresses present in unicast_peer configuration block do not belong to the same router/box. Otherwise it will generate duplicate packet at reception point.


Many bug fixes with better performances for this release. This is quite impressive, good work upstream !


This one is about supporting live config reloading and fix high CPU usage when idle. See the release notes.

Soon to come

The resource-agents v3.9.6 and cluster-glue v1.0.12 should be released by their upstream pretty soon, stay tuned.

pacemaker v1.1.10 & corosync v2.3.1

More than 5 months since the last bump of pacemaker. I’m glad that @beekhof did release the final pacemaker-1.1.10 and that the officially stable corosync got bumped to 2.3.1.

The changelogs are quite heavy so I won’t go into details about them but they both have quite a nice bunch of bugfixes and compatibility features. That’s why I’m hoping we should soon be able to fix bug #429416 and drop corosync hard mask. Hopefully some users such as¬†@pvsa will give us some valuable feedback which will allow us to do it smoothly.


Using keepalived for a self-balancing cluster

Load balancing traffic between servers can sometimes lead to headaches depending on your topology and budget. Here I’ll discuss how to create a self load balanced cluster of web servers distributing HTTP requests between themselves and serving them at the same time. Yes, this means that you don’t need dedicated load balancers !

I will not go into the details on how to configure your kernel for ipvsadm etc since it’s already covered enough on the web but instead focus on the challenges and subtleties of achieving a load balancing based only on the realservers themselves. I expect you reader have a minimal knowledge of the terms and usage of ipvsadm and keepalived.

The setup

Let’s start with a scheme and some principles explaining our topology.

  • 3 web servers / realservers (you can do the same using 2)
  • Local subnet :
  • LVS forwarding method : DR (direct routing)
  • LVS scheduler : WRR (you can choose your own)
  • VIP :
  • Main interface for VIP : bond0


Let’s take a look at what happens as this will explain a lot of why we should configure the servers in a quite special way.

black arrow / serving

  1. the master server (the one who has the VIP) receives a HTTP port connection request
  2. the load balancing scheduler decides he’s the one who’ll serve this request
  3. the local web server handles the request and replies to the client

 blue arrow / direct routing / serving

  1. the master server receives a HTTP port connection request
  2. the load balancing scheduler decides the blue server should handle this request
  3. the HTTP packet is given to the blue server as-this (no modification is made on the packet)
  4. the blue server receives a packet whose destination IP is the VIP but he doesn’t hold the VIP (tricky part)
  5. the blue server’s web server handles the request and replies to the client

IP configuration

Almost all the tricky part lies in what needs to be done in order to solve the point #4 of the blue server example. Since we’re using direct routing, we need to configure all our servers so they accept packets directed to the VIP even if they don’t have it configured on their receiving interface.

The solution is to have the VIP configured on the loopback interface (lo) with a host scope on the keepalived BACKUP servers while it is configured on the main interface (bond0) on the keepalived MASTER server. This is what is usually done when you use pacemaker and ldirectord with IPAddr2 but keepalived does not handle this kind of configuration natively.

We’ll use the notify_master and notify_backup directives of keepalived.conf to handle this :

notify_master /etc/keepalived/
notify_backup /etc/keepalived/

We’ll discuss a few problems to fix before detailing those scripts.

The ARP problem

Now some of you wise readers will wonder about the ARP cache corruptions which will happen when multiple hosts claim to own the same IP address on the same subnet. Let’s fix this problem now then as the kernel does have a way of handling this properly. Basically we’ll ask the kernel not to advert the server’s MAC address for the VIP on certain conditions using the arp_ignore and arp_announce sysctl.

Add those lines on the sysctl.conf of your servers :

net.ipv4.conf.all.arp_ignore = 3
net.ipv4.conf.all.arp_announce = 2

Read more about those parameters for the detailed explanation of those values.

The IPVS synchronization problem

This is another problem arising from the fact that the load balancers are also acting as realservers. When keepalived starts, it spawns a synchronization process on the master and backup nodes so you load balancers’ IPVS tables stay in sync. This is needed for a fully transparent fail over as it keeps track of the sessions’ persistence so the clients don’t get rebalanced when the master goes down. Well, this is the limitation of our setup : clients’ HTTP sessions served by the master node will fail if he goes down. But note that the same will happen to the other nodes because we have to get rid of this synchronization to get our setup working. The reason is simple : IPVS table sync conflicts with the actual acceptance of the packet by our loopback set up VIP. Both mechanisms can’t coexist together, so you’d better use this setup for stateless (API?) HTTP servers or if you’re okay with this eventuality.

Final configuration


ip addr del dev lo
ipvsadm --restore < /tmp/keepalived.ipvs
  1. drop the VIP from the loopback interface (it will be setup by keepalived on the master interface)
  2. restore the IPVS configuration


ip addr add scope host dev lo
ipvsadm --save > /tmp/keepalived.ipvs
ipvsadm --clear
  1. add the VIP to the loopback interface, scope host
  2. keep a copy of the IPVS configuration, if we get to be master, we’ll need it back
  3. drop the IPVS local config so it doesn’t conflict with our own web serving


Even if it offers some serious benefits, remember the main limitation of this setup : if the master fails, all sessions of your web servers will be lost. So use it mostly for stateless stuff or if you’re okay with this. My setup and explanations may have some glitches, feel free to correct me if I’m wrong somewhere.

mongoDB and Pacemaker recent bumps

mongoDB 2.4.3

Yet another bugfix release, this new stable branch is surely one of the most quickly iterated I’ve ever seen. I guess we’ll wait a bit longer at work before migrating to 2.4.x.

pacemaker 1.1.10_rc1

This is the release of pacemaker we’ve been waiting¬†for, fixing among other things, the ACL problem which was introduced in 1.1.9. Andrew and others are working hard to get a proper 1.1.10 out soon, thanks guys.

Meanwhile, we (gentoo cluster herd) have been contacted by @Psi-Jack who has offered his help to follow and keep some of our precious clustering packages up to date, I wish our work together will benefit everyone !

All of this is live on portage, enjoy.

Follow-up on pacemaker v1.1.9 and updated pacemaker-gui

In my previous post¬†I talked about a permission problem introduced in pacemaker-1.1.9 which requires root to be a member of the haclient group. I’ve been helping @beekhof to investigate on this and I’m glad he found and fixed both the problem and a memory leak ! We’re still investigating on another issue but we should be seeing a new version bump pretty soon, thank you Andrew !

pacemaker-gui v2.1.2

One of my colleagues recently complained that pacemaker-gui-2.1.1 was not compatible with newer pacemaker releases (>=1.1.8) so he had to install pacemaker-1.1.7 if he wanted to benefit from the GUI. I contacted @gao-yan¬†from SUSE who’s the main upstream for this package and asked him for a tag bump. Here comes pacemaker-gui-2.1.2 which is compatible with all newer pacemaker releases ! Thanks again mate.

Pacemaker vulnerability and v1.1.9 release

A security vulnerability (CVE-2013-0281)¬†was found on pacemaker which permitted attackers to prevent your cluster from serving more CIB requests. Although this issue was quickly fixed by upstream, they didn’t add a new tag to pacemaker so I did ask Andrew Beekhof¬†for one so I could take care of bug #457572. Gentoo users, here comes pacemaker-1.1.9 !


While packaging and testing pacemaker-1.1.9, I ran into some weird permission issues which I debugged with @beekhof and @asalkeld (thx again guys). Turns out that when enabling ACL support on pacemaker, you now need to add root to the haclient group ! The reason is that pacemaker now uses shared memory IPC sockets from libqb to communicate with corosync (on /dev/shm/).

v1.1.9 changelog

  • corosync: Allow cman and corosync 2.0 nodes to use a name other than uname()
  • corosync: Use queues to avoid blocking when sending CPG messages
  • Drop per-user core directories
  • ipc: Compress messages that exceed the configured IPC message limit
  • ipc: Use queues to prevent slow clients from blocking the server
  • ipc: Use shared memory by default
  • lrmd: Support nagios¬†remote monitoring
  • lrmd: Pacemaker Remote Daemon for extending pacemaker functionality outside corosync cluster.
  • pengine: Check for master/slave resources that are not OCF agents
  • pengine: Support a ‘requires’ resource meta-attribute for controlling whether it needs quorum, fencing or nothing
  • pengine: Support for resource container
  • pengine: Support resources that require unfencing before start

Since the main focus of the bump was to fix a security issue, I didn’t add the new nagios feature to the ebuild. If you’re interested in it, just say so and I’ll do my best to add it asap.