POSIX deadlock detection: A walk through linux kernel source

A note to readers: The following content is completely oriented for developers, especially in C. In case you find it difficult to understand certain terms, please help yourselves to google around and understand the concepts. I have tried including links wherever necessary in order to effectively interpret the article(see provided links towards the end). Last but not least, as per POSIX documentation[1], deadlock detection is guaranteed for locks between different processes and not between different threads inside a process or between threads belonging to different processes.

As you can see from the heading of this article, I am aiming at explaining the source code for detecting POSIX deadlocks. Before that let me quickly brief the concept of deadlock in operating systems. A deadlock is a situation where two processes sharing the same resource are preventing each other from accessing the resource which will result in an undefined wait for those resources. Following are the four necessary conditions(Coffman conditions)[2] which leads to a deadlock scenario:

* Mutual exclusion
* Hold and Wait
* No pre-emption
* Circular wait

It is therefore sufficient to discover any one of the above mentioned conditions to detect deadlocks. According to linux kernel sources we see that the fourth condition, circular wait, has been selected for deadlock detection. Note that they have used the term ‘deadlock detection’ where system determines that a deadlock exists and do the necessary identification of the processes and resources involved. Two other terms namely deadlock prevention and deadlock avoidance[3] are also related but different in nature. For the first time, it was really confusing for me to understand how these 3 terms differ in action. Later on better explanations helped me to use these terms in a more confident way. I don’t intend to cover that part here as I move forward towards the actual source code.

Where in kernel?
Good question. But rather easy to answer. Let me put it straight, the whole logic is included under fs/locks.c[4]. It would we good if you have a local git clone of linux kernel source(with ctags installed and populated) to browse through different function definitions as and when you feel to dig in deeper.

Let’s get serious
posix_lock_file( ) is the function used to apply POSIX style locks to a file which has the following arguments

* file for which the lock is to be applied
* original lock to be applied
* conflicting lock to be returned, if found

int posix_lock_file(struct file *filp, struct file_lock *fl, struct file_lock *conflock);

This function will invoke posix_lock_inode( ) with similar number of arguments and thereby passing inode[5] of the file instead of file pointer as follows:
return posix_lock_inode(file_inode(filp), fl, conflock);

The following walk through is based on ABBCCA circular wait example. You can always expand this to any number of iterations as the logic remains same. Consider the following sequence of exclusive lock request events with respect to some file foo:

Deadlock Detection - Event Sequence [ABBCCA]

When P1’s lock request reaches posix_lock_inode( ), it fetches the lock context based on the inode which will contain the list of currently held POSIX locks for that particular inode:

ctx = locks_get_lock_context(inode, request->fl_type);

Note that along with inode we are passing the requested lock type to differentiate F_SETLK and F_SETLKW from F_UNLCK which is treated in a separate manner. Filling the file_lock_context structure is followed by conditional creation of two new file_lock structures new_fl and new_fl2 in advance for further calculations and to avoid race scenarios. Next is the most important section where we perform the conflict check and deadlock detection for lock types F_SETLK and F_SETLKW.

if (request->fl_type != F_UNLCK) {
         list_for_each_entry(fl, &ctx->flc_posix, fl_list) {
                 if (!posix_locks_conflict(request, fl))
                         continue;
                 if (conflock)
                         locks_copy_conflock(conflock, fl);

                error = -EAGAIN;
                if (!(request->fl_flags & FL_SLEEP))
                         goto out;

                error = -EDEADLK;
                spin_lock(&blocked_lock_lock);
                if (likely(!posix_locks_deadlock(request, fl))) {
                         error = FILE_LOCK_DEFERRED;
                         __locks_insert_block(fl, request);
                }
                spin_unlock(&blocked_lock_lock);
                goto out;
         }
}

P1’s new lock request is compared to every other POSIX locks which were fetched from inode to ctx pointer. Looping over the list of locks is done using list_for_each_entry[6] preprocessor macro which iterates over list(doubly linked list[7]) of given type. In each iteration current lock structure is pointed by fl. Assuming P1’s lock request as the beginning since the list is empty, the lock is granted and is added to list of lock for that inode. Now when P2 requests a new lock, it finds P1’s lock in the list. At first locks_conflict check is made to ensure that the current byte range lock request doesn’t overlap with another for different lock owners. For more details see the function definition for posix_locks_conflict( ). Even though lock owners are different they don’t overlap each other which will return 0 making the if condition taking the true branch. And thus it also granted appending it to the list. Lock request from P3 will have to check conflict with both locks from P1 and P2. Here also lack of conflict will result in granting the lock and its subsequent addition to the inode lock list.

The second lock request from P1 conflicts existing P2’s lock(byte ranges overlap and have different owners) and posix_locks_conflict( ) returns 1. As I mentioned above, conflicting fl found is copied to conflock structure whose intention is to do the same. FL_SLEEP flag corresponds to F_SETLKW to indicate a blocking lock. Next comes the check for deadlock. posix_locks_deadlock( ) eventually traverses the global blocked_hash table. But as of now blocked_hash table is empty due to lack of blocking locks. So it will return 0 making the likely[8] if condition to take true branch where it adds the lock to global blocked_hash with some additional operations as we can see from its definition as follows:

static void __locks_insert_block(struct file_lock *blocker, struct file_lock *waiter)
{
         BUG_ON(!list_empty(&waiter->fl_block));
         waiter->fl_next = blocker;
         list_add_tail(&waiter->fl_block, &blocker->fl_block);
         if (IS_POSIX(blocker) && !IS_OFDLCK(blocker))
                locks_insert_global_blocked(waiter);
}

blocker is the existing conflicting lock and waiter the requested lock. fl_block is a pointer to circular linked list[9] of blocked processes. fl_next is a pointer to singly linked list of locks for this inode. This is the first time we are receiving lock request for this particular range. So the fl_block list should be empty for waiter or else this is a duplicate request. This check is included inside BUG_ON[10] macro to ensure better safety. In order to remember on which lock waiter was blocking on, we assign blocker to waiter->fl_next. Along with that we insert waiter into blocker’s blocked_list. Global blocked_hash table addition is performed only after making sure that blocker is of POSIX type and not an open file description lock. The next lock request from P2 again conflicts existing P3’s lock and posix_locks_conflict( ) returns 1 and copies it to conflock. As of now we know that deadlock doesn’t exists(because there is no circular wait as we only have up to ABBC) but let’s see how posix_locks_deadlock( ) determines the same.

static int posix_locks_deadlock(struct file_lock *caller_fl, struct file_lock *block_fl)
{
        int i = 0;

        lockdep_assert_held(&blocked_lock_lock);

        if (IS_OFDLCK(caller_fl))
                return 0;

        while ((block_fl = what_owner_is_waiting_for(block_fl))) {
                if (i++ > MAX_DEADLK_ITERATIONS)
                        return 0;
               if (posix_same_owner(caller_fl, block_fl))
                       return 1;
        }
        return 0;
}

caller_fl is P2’s lock request and block_fl is P3’s lock. If caller_fl is of F_OFDLCK type then we return 0 immediately. Following while loop terminates when we can’t find a lock that the owner of current block_fl is blocking on. Let analyse the function what_owner_is_waiting_for( ):

static struct file_lock *what_owner_is_waiting_for(struct file_lock *block_fl)
{
         struct file_lock *fl;

        hash_for_each_possible(blocked_hash, fl, fl_link, posix_owner_key(block_fl)) {
                if (posix_same_owner(fl, block_fl))
                        return fl->fl_next;
        }
        return NULL;
}

In our first iteration block_fl is the lock held by P3. Now we check through global blocked_hash for an entry with owner matching to P3. Here we have only one entry from P1 which is blocked on P2’s lock. So owners mismatch and loop exits returning NULL to posix_locks_deadlock( ) and eventually returns 0 to likely if condition inside posix_lock_inode( ) thereby calls __locks_insert_block( ) with P2’s lock request.

The problem comes when P3 requests a lock conflicting P1’s lock to form a ABBCCA circular wait sequence resulting in a deadlock. posix_locks_conflict( ) returns 1 saying that lock overlap and comes from different owners. P1’s lock is copied to conflock and execution reaches poisx_locks_deadlock( ). caller_fl is P3’s lock request and block_fl is P1’s lock. In first iteration inside while loop we could find a blocking lock entry from P1(waiting for lock held by P2) via what_owner_is_waiting_for( ). It returns fl->fl_next i.e, lock held by P2 replacing block_fl. But inside while loop owner(caller_fl) and owner(block_fl) mismatch and we repeat the same procedure again by passing P2 as block_fl via what_owner_is_waiting_for( ). Again we could find a blocking entry from P2(waiting for lock held by P3). It returns fl->fl_next i.e, lock held by P3 replacing block_fl. Now owner(caller_fl) and owner(block_fl) matches and we return 1 to likely if condition inside posix_lock_inode( ) which returns error equal to EDEADLK.

Woo….hoo and finally we detected the deadlock by discovering the fourth necessary condition i.e, circular wait. ABBCCA example can be expanded to any number of circular loops like ABBCCDDA, ABBCCDDEEA, ABBCCDDEEFFA etc. But there is a limitation in linux kernel. Number of such iterations is limited upto 10. In the function definition for psoix_locks_deadlock() you can see number of deadlock iterations being compared to MAX_DEADLK_ITERATIONS(#define MAX_DEADLK_ITERATIONS 10).

The example mentioned in this article deals with deadlock detection during access to different ranges on same file only. Moreover, deadlocks can even occur in cases where different processes act on conflicting byte ranges of different files. See the following sequence diagram for one such example:

Deadlock Detection - Event Sequence [ABBCCA]

Use the numbered links to help yourself understand some unexplained concepts/terms. Please feel free to add your comments/thoughts on this article.

2 thoughts on “POSIX deadlock detection: A walk through linux kernel source

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s