Fencing Tokens and Distributed Locking

In this post I’m gonna share a concept called Fencing Tokens, which I learned from the book Designing Data-Intensive Applications.

Fencing tokens is a mechanism that is used to protect against faulty writes to storage systems that happen in distributed systems.

Let’s take an example. Suppose we have a storage resource (file, DB) and two nodes (A, B) that want write to this storage. Node A acquires the lock to write to this storage from a Lock Service. The Lock Service issues the lock based on a time-lease, which means that a node can have the lock only for fixed amount of time and then it expires. Now Node A has the lock and can start writing, however unfortunately it goes into a Garbage Collection (GC) cycle, at this point Node A has the lock, unresponsive and basically waiting for the GC cycle to finish.

While Node A is still waiting for the GC cycle to finish, the lock’s lease has expired. At this point Node B comes and decides to write to the storage and asks the Lock Service for a lock. The Lock Service gives Node B the lock since Node A’s lock lease has expired. Now Node B writes to the storage and it’s done with its work. However, the GC cycle of Node A is over and now it’s back to life. Since it believes that it still has the lock, then it simply writes to the storage and corrupts the data.

Now this is where Fencing Tokens comes into play to prevent this situation. The trick here is to make the Lock Service returns an incremental counter to indicate the sequence of assigning the locks.

For example, when Node A gets the lock from the Lock Service, it gets along with a it a counter (fencing token), for example 1. Now when Node A goes into a GC cycle then Node B also requests a lock and gets with it a fencing token of 2. Now Node B makes the write to the storage and it sends with the write request the fencing token. The storage keeps track of the last fencing token it received, in this case 2. When Node A is back to life and attempts to make the write to the storage, then the storage notices that the Node A sent a fencing token 1, which is lower than the most recent fencing token (2) and thus it rejects the write.

Leave a Reply