A First-Principles-Driven Life
05 Dec 2008

[PAPER] Non-Blocking Algorithms

1. Introduction

1.1. Non-Blocking Algorithm Introduction

In computer science, non-blocking synchronization ensures that threads competing for a shared resource do not have their execution indefinitely postponed by mutual exclusion. A non-blocking algorithm is lock-free if there is guaranteed system-wide progress; wait-free if there is also guaranteed per-thread progress1.

This simple diagram shows more clearly the idea:

    +------------------------------+--------------------------------+
    |                              |                                |
    |      Mutual Exclusion        |  Non-Blocking                  |
    |                              |                                |
    |                              |    +-------------------------+ |
    |          |     |             |    |                         | |
    |          |  H  |             |    |   Lock-Freedom          | |
    |          |  y  |             |    |                         | |
    |          |  b  |             |    |    +------------------+ | |
    | Blocking |  r  | Busy-Waiting|    |    |                  | | |
    |          |  i  |  (Polling)  |    |    |   Wait-Freedom   | | |
    |          |  d  |             |    |    |                  | | |
    |          |  s  |             |    |    +------------------+ | |
    |          |     |             |    |                         | |
    |                              |    +-------------------------+ |
    |                              |                                |
    +------------------------------+--------------------------------+

Using mutual exclusion is the most instinctive and easiest way to protect data from corruption in concurrent multi-threaded programming, which simply creates a critical region around the operations of data structure. It works fine most of the situations, but in some situations, lock is just too expensive or even not available. One example is for some ISR(Interrupt Service Routines), if it disables the interruption, then the scheduler will not be run until they open the interruption again, if it also waits for some resources which is hold by some task, then it will never available because scheduler stops. Another example is in RTOS, higher priority wait for the lower priority task to release the resource, then it causes priority-inversion. So, non-blocking algorithms are born to solve these problem without using mutual exclusions.

Valois published his most cited paper talking about lock-free data structure at 19952. But unfortunately, there are some mistakes with the memory management method introduced by Valois3.

But algorithms from Valois needs special memory management, and it's just too complicated to use. Michael and Scott published their own algorithms based on the idea of cooperative task from Valois4.

There's another more detailed notes on lock-free and wait-free algorithms area from Internet5.

2. Fixed-size Slots Management

2.1. Problem

Provide a non-blocking algorithm that maintains an array of fixed-size slots with the following interface:

  • void InitSlots(void)

    Slots initialization.

  • SLOTID GetSlot(void)

    Exclusively get an id of one free slot, and prevent other user to get this slot.

  • void PutSlot(SLOTID id)

    Reclaim a slot returned by GetSlot, and make it free to get again.

  • DATA* DecodeSlotID(SLOTID id)

    Decode a id of slot, and return the memory address of that slot.

2.2. Primitive CAS

Atomic operation CAS(compare-and-swap) atomically compares the contents of a memory location to a given value and, if they are the same, modifies the contents of that memory location to a given new value6. The following algorithm assume that CAS is available on the target hardware.

2.3. First Algorithm (with mistakes)

typedef struct
{
    int next;
    DATA data;
}Node;

Node Slots[MAX_SLOT_NUM];

int head;

void InitSlots(void)
{
    int i;

    for (i = 0; i < MAX_SLOT_NUM - 1; ++i)
    {
        Slots[i].next = i+1;
    }
    Slots[i].next = NULL_ID;
    head = 0;
}

int GetSlot(void)
{
    int oldhead;
    int next;

    do {
        oldhead  = head;
        if (oldhead == NULL_ID)
        {
            return NULL_ID;
        }
        else
        {
            next = Slots[oldhead].next;
        }
    } while(!CAS(&head, oldhead, next));

    return oldhead;
}

void PutSlot(int id)
{
    int oldhead;

    do {
        oldhead = head;
        Slots[id].next = oldhead;
    } while(!CAS(&head, oldhead, id));
}

DATA* DecodeSlotID(int id)
{
  return &Slots[id].data;
}

The idea of this algorithm is to use CAS to check when modify the head, if it's still the old value. This is commonly called read-modify-write. But this arises the well-known ABA problem.

2.4. ABA Problem

The above algorithm has a subtle problem, it assumes that if the id didn't change, then the list remains the same also. But it's very common to happen that other tasks takes head and head.next and then returns the head, now the head.next actually changed. This problem is known as ABA problem7.

There are several ways to solve it. Valois gave a methodology of memory management which tracks use count of pointers8. This way assures that a pointer possessing by some one will never be allocated until no one has a copy of the pointer, thus avoiding the ABA problem to happen. Michael and Scott publishes their fixes on Valois's memory management mistakes9.

Another way is to use pointer tag, which adds an extra "tag" bits to the pointer. The "tag" usually increments itself on every copy operation. Because of this, the next compare-and-swap will fail, even if the addresses are the same, because the tag bits will not match. This does not completely solve the problem, as the tag bits will eventually wrap around, but helps to avoid it.

2.5. Use Pointer Tag to Avoid ABA Problem

typedef union
{
    /** to write */
    uint_t Value;

    /** to write */
    struct
    {
        /** to write */
        uhalfint_t Counter;
        /** to write */
        uhalfint_t Index;
    } Data;
} SLOTID;

Type "uhalfintt" is half length of uintt, uintt is unsigned integer type. The "Counter" here is the "tag" of the pointer.

Now the algorithm looks like this:

typedef struct
{
    SLOTID next;
    DATA data;
}Node;

Node Slots[MAX_SLOT_NUM];

SLOTID head;

static inline
SLOTID NewSLOTID(uhalfint_t index)
{
    SLOTID id;

    id.Data.Counter = 0;
    id.Data.Index = index;

    return id;
}

static inline
bool SLOTID_CAS(SLOTID *id, SLOTID oldid, SLOTID newid)
{
    /* Increae the counter to avoid ABA problem */
    ++newid.Data.Counter;

    return CAS(&id->Value,  oldid.Value, newid.Value);
}

void InitSlots(void)
{
    int i;

    for (i = 0; i < MAX_SLOT_NUM - 1; ++i)
    {
        Slots[i].next = NewSLOTID(i+1);
    }
    Slots[i].next = NewSLOTID(NULL_ID);
    head = NewSLOTID(0);
}

SLOTID GetSlot(void)
{
    SLOTID oldhead;
    SLOTID next;

    do {
        oldhead  = head;
        if (oldhead == NULL_ID)
        {
            return NULL_ID;
        }
        else
        {
            next = Slots[oldhead.Data.Index].next;
        }
    } while(!SLOTID_CAS(&head, oldhead, next));

    return oldhead;
}

void PutSlot(SLOTID id)
{
    SLOTID oldhead;

    do {
        oldhead = head;
        Slots[id.Data.Index].next = oldhead;
    } while(!SLOTID_CAS(&head, oldhead, id));
}

DATA* DecodeSlotID(SLOTID id)
{
  return &Slots[id.Data.Index].data;
}

The key algorithm is the SLOTIDCAS operation: every time it's going to change the SLOTID, it uses SLOTIDCAS, which increase the Counter then CAS. This makes the ABA like ABA'. The index can be the same, but the counter is unlikely the same, the wider range of Counter is, the lesser possibility ABA will happen. On a 32-bit machine, range of Counter is [0..216].

2.6. Wider CAS

The problem of packing counter and index into a integer is obvious: the limitation number of array elements on a 32-bit machine is 216. And the counter limitation is 216, after that it wraps to 0. 216 is not a big enough value to soothe the skeptics, so on some architecture wider CAS is provided. Wider CAS is different from Multi CAS, the former can CAS on an adjacent memory fields thus be called wider, but the latter can CAS on unrelated memory fields.

On later inter x86 processor, it provides an instruction called CMPXCHG8B, which compare-and-swap-8-bytes10. By this instruction, we can operate on a normal memory pointer instead of a memory pointer, and its "tag" which has a range as large as 232.

CAS2 is a realistic existence of Multi CAS, but only on some Motorola 680x0 processors.

2.7. Load-Link and Store-Conditional(LL/SC)

In computer science, load-link (LL, also known as "load-linked" or "load and reserve") and store-conditional (SC) are a pair of instructions that together implement a lock-free atomic read-modify-write operation.

Load-link returns the current value of a memory location. A subsequent store-conditional to the same memory location will store a new value only if no updates have occurred to that location since the load-link. If any updates have occurred, the store-conditional is guaranteed to fail, even if the value read by the load-link has since been restored. As such, an LL/SC pair is stronger than a read followed by a compare-and-swap (CAS), which will not detect updates if the old value has been restored.11

LL/SC can finally make skeptics happy, it doesn't just make ABA problem look like ABA', but solve it in another say. The algorithm with LL/SC can be:

typedef struct
{
    int next;
    DATA data;
}Node;

Node Slots[MAX_SLOT_NUM];

int head;

void InitSlots(void)
{
    int i;

    for (i = 0; i < MAX_SLOT_NUM - 1; ++i)
    {
        Slots[i].next = i+1;
    }
    Slots[i].next = NULL_ID;
    head = 0;
}

int GetSlot(void)
{
    int oldhead;
    int next;

    do {
        oldhead  = LL(&head);
        if (oldhead == NULL_ID)
        {
            return NULL_ID;
        }
        else
        {
            next = Slots[oldhead].next;
        }
    } while(!SC(&head, next));

    return oldhead;
}

void PutSlot(int id)
{
    int oldhead;

    do {
        oldhead = LL(&head);
        Slots[id].next = oldhead;
    } while(!SC(&head, id));
}

DATA* DecodeSlotID(int id)
{
  return &Slots[id].data;
}

To use the above algorithm, the users have to also use LL/SC to load and store the slot id because of the ABA problem. But since realistic LL/SC implementations all have limitations, actually LL/SC version algorithm is more difficult to use than the CAS version.

2.8. Realistic Limitations

Real implementations of LL/SC can be found on Alpha, PowerPC12, MIPS, ARMv6(or above). But they're all Weak LL/SC, SC can fail even if there's no update between LL and corresponding SC, for example:

  • The CPU can only reserves a memory region at a time.
  • A context switching between them can cause the SC fail.

The first limitation can cause a lot of ideal algorithms based on LL/SC fail. So CAS version algorithm is still preferable.

2.9. Conclusion

Wait-Free Fix-size Slots Management is a simple version of non-blocking memory management implementation, it only manage fix-size slots. But it already make a lot of Non-Blocking algorithm common problems and tricks emerging. Later we'll see a implementation of Wait-Free Fifo Queue based on this Slots Management.

3. FIFO Queue

3.1. Problem

Provide a non-blocking algorithm that provides a FIFO Queue with the following interface:

  • void InitFIFO(void)

    FIFO queue initialization.

  • bool Enqueue(DATA data)

    Push a data into the FIFO queue. Return TRUE if success, return FALSE only if the FIFO is full.

  • bool Dequeue(DATA* pdata)

    Pull a data from the FIFO queue. Return TRUE if success, return FALSE only if the FIFO is empty.

3.2. Link-List Structure

The algorithm introduced here is based on a link-list structure13. It's a single link-list with head and tail pointers. It looks like:

   +-------+     +-------+     +-------+     +-------+
   |       |     |       |     |       |     |       |
   |      -+---->|      -+---->|      -+---->|      -+---->NULL
   |       |     |       |     |       |     |       |
   +---^---+     +-------+     +-------+     +---^---+
       |                                         |
       |                                         |
       |                                         |
      HEAD                                      TAIL

New data is pushed to the tail, and old data is pulled from head.

3.3. Dequeue Data, first thought

To dequeue a data, first read the HEAD value, and use CAS to swap the HEAD pointer to it's next pointer. And the old HEAD value is the data that dequeued. To avoid ABA problem14, we should use SLOTIDCAS which described in the "Fix-size Slots Management"15.

   +-------+     +-------+     +-------+     +-------+
   |       |     |       |     |       |     |       |
   |      -+---->|      -+---->|      -+---->|      -+----> NULL
   |       |     |       |     |       |     |       |
   +-------+     +---^---+     +-------+     +---^---+
                     |                           |
                     |                           |
                     |                           |
                    HEAD                        TAIL

3.4. Enqueue Data

Enqueue operation is more complicated than dequeue operation, because it has to change two state: the next pointer of the tail slot, and the tail pointer. If we have MCAS(Multi CAS) or at least Double CAS16, then the problem is easy to solve. But since DCAS is not widely available in computer system, so a so-called cooperative method is introduced firstly by Valois17.

In cooperative method, enqueue operation first allocates a new slot, and sets next pointer of tail slot to the new slot. Then set tail pointer to new slot. All theses set operation use SLOTIDCAS, so the second step can fail, if someone others enters in, and set the tail pointer. Which leaves the FIFO in a intermediate state:

   +-------+     +-------+     +-------+     +-------+
   |       |     |       |     |       |     |       |
   |      -+---->|      -+---->|      -+---->|      -+---->NULL
   |       |     |       |     |       |     |       |
   +---^---+     +-------+     +---^---+     +-------+
       |                           |
       |                           |
       |                           |
      HEAD                        TAIL

To complete the intermediate state:

  1. The next enqueue operation should try to move the TAIL forward, until the TAIL at the end, the enqueue operation should not change the TAIL slot next pointer.
  2. The next dequeue operation should also try to move the TAIL forward when the head meets the tail pointer.

3.5. Algorithm

This algorithm uses CAS version of "Fix-size Slots Management"15.

typedef struct
{
    DATA data;
    SLOTID next;
} Node;

SLOTID head;
SLOTID tail;

void InitFIFO(void)
{
    Node *sentinel;

    /* This algorithm relies on the fix-size slots management */
    InitSlots();

    /* Sentinel slots */
    head = tail = GetSlot();
    sentinel = DecodeSlotID(head);
    sentinel->next = NULL_ID;
}

/*
  Origin Algorithm:
  E01: node = new node()
  E02: node->value = value
  E03: node->next.ptr = NULL
  E04: loop
  E05:   tail = Q->Tail
  E06:   next = tail.ptr->next
  E07:   if tail == Q->Tail
  E08:     if next.ptr == NULL
  E09:       if CAS(&tail.ptr->next, next, <node, next.count+1>)
  E10:         break
  E11:       endif
  E12:     else
  E13:       CAS(&Q->Tail, tail, <next.ptr, tail.count+1>)
  E14:     endif
  E15:   endif
  E16: endloop
  E17: CAS(&Q->Tail, tail, <node, tail.count+1>)
*/
bool Enqueue(DATA data)
{
    SLOTID newid;
    Node *newnode;
    SLOTID oldtail;
    Node *oldtailnode;
    SLOTID nextid;
    Node *nextnode;

    /* E01 */
    newid = GetSlot();
    newnode = DecodeSlotID(newid);
    if (newnode == NULL) return FALSE;
    /* E02 */
    newnode->data = data;
    /* E03 */
    newnode->next = NULL_ID;
    /* E04 */
    do {
        /* E05 */
        oldtail = tail;
        oldtailnode = DecodeSlotID(oldtail);
        /* E06 */
        nextid = oldtailnode->next;
        nextnode = DecodeSlotID(nextid);
        /* E07 */
        if (oldtail == tail)
        {
            /* E08 */
            if (nextnode == NULL)
            {
                /* E09 */
                if (SLOTID_CAS(&oldtailnode->next, next, newid))
                {
                    /* E10 */
                    break;
                /* E11 */
                }
            }
            /* E12 */
            else
            {
                /* E13 */
                SLOTID_CAS(&tail, oldtail, next);
            /* E14 */
            }
        /* E15 */
        }
    /* E16 */
    } while(1);
    /* E17 */
    SLOTID_CAS(&tail, oldtail, newid);

    return TRUE;
}

/*
  Origin Algorithm:
  D01:  loop
  D02:    head = Q->Head
  D03:    tail = Q->Tail
  D04:    next = head->next
  D05:    if head == Q->Head
  D06:      if head.ptr == tail.ptr
  D07:        if next.ptr == NULL
  D08:          return FALSE
  D09:        endif
  D10:        CAS(&Q->Tail, tail, <next.ptr, tail.count+1>)
  D11:      else
  D12:        *pvalue = next.ptr->value
  D13:        if CAS(&Q->Head, head, <next.ptr, head.count+1>)
  D14:          break
  D15:        endif
  D16:      endif
  D17:    endif
  D18:  endloop
  D19:  free(head.ptr)
  D20:  return TRUE
*/
bool Dequeue(DATA *pdata)
{
    SLOTID oldhead
    Node *oldheadnode;
    SLOTID oldtail;
    Node *oldtailnode;
    SLOTID next;
    Node *nextnode;

    /* D01 */
    do {
        /* D02 */
        oldhead = head;
        oldheadnode = DecodeSlotID(oldhead);
        /* D03 */
        oldtail = tail;
        oldtailnode = DecodeSlotID(oldtail);
        /* D04 */
        next = oldhead->next;
        nextnode = DecodeSlotID(next);
        /* D05 */
        if (oldhead == head)
        {
            /* D06 */
            if (headnode == tailnode)
            {
                /* D07 */
                if (nextnode == NULL)
                {
                    /* D08 */
                    return FALSE;
                /* D09 */
                }
                /* D10 */
                SLOTID_CAS(&tail, oldtail, next);
            }
            /* D11 */
            else
            {
                /* D12 */
                *pdata = nextnode->data;
                /* D13 */
                if (SLOTID_CAS(&head, oldhead, next)
                {
                    /* D14 */
                    break;
                /* D15 */
                }
            /* *D16 /
            }
        /* D17 */
        }
    /* D18 */
    }
    /* D19 */
    PutSlot(oldhead);
    /* D20 */
    return TRUE;
}

In enqueue, E13 is the cooperative operation.

In dequeue, D10 is the cooperative operation.

3.6. Array Based Algorithm

Valois also introduced a array based Non-Blocking FIFO algorithm18, but it's not feasible on real machine because it requires CAS on unaligned memory region.

3.7. Conclusion

This article introduces a link-listed based Non-Blocking algorithm for FIFO queue. It's based on the "Fix-size Slots Management"[2], and feasible on the platforms that support CAS.


Footnotes:

2

J.D. Valois, Lock-Free Data Structures PhD Thesis, Rensselaer Polytechnic Institute, Department of Computer Science, 1995.

3

Maged M. Michael, Michael L. Scott Correction of a Memory Management Method for Lock-Free Data Structures, Department of Computer Science University of Rochester, 1995.

4

Maged M. Michael Michael L. Scott, Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms,, Department of Computer Science University of Rochester, 1995.

5

Some notes on lock-free and wait-free algorithms http://www.audiomulch.com/~rossb/code/lockfree

6

Compare-and-swap from Wikipedia http://en.wikipedia.org/wiki/Compare-and-swap

7

ABA problem from Wikipedia http://en.wikipedia.org/wiki/ABA_problem

8

J.D. Valois, Lock-Free Data Structures PhD Thesis, Rensselaer Polytechnic Institute, Department of Computer Science, 1995.

9

Maged M. Michael, Michael L. Scott Correction of a Memory Management Method for Lock-Free Data Structures, Department of Computer Science University of Rochester, 1995.

10

"Intel 64 and IA-32 Architectures Software Developer's Manual Volume 2A: Instruction Set Reference, A-M" (PDF). Retrieved on 2007-12-15.

11

Load-Link/Store-Conditional from Wikipedia http://en.wikipedia.org/wiki/Load-Link/Store-Conditional

12

"Power ISA Version 2.05". Power.org (2007-10-23). Retrieved on 2007-12-18.

13

Maged M. Michael Michael L. Scott, Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms,, Department of Computer Science University of Rochester, 1995.

14

ABA problem from Wikipedia http://en.wikipedia.org/wiki/ABA_problem

15

ZC Miao, Non-Blocking Algorithm - Fix-size Slots Management, revision 2008-12-04

16

Double Compare And Swap from Wikipedia http://en.wikipedia.org/wiki/Double_Compare_And_Swap

17

J.D. Valois, Lock-Free Data Structures PhD Thesis, Rensselaer Polytechnic Institute, Department of Computer Science, 1995.

18

J.D. Valois, Implementing Lock-Free Queues

Tags: algorithm non-blocking
Creative Commons License
A First-Principles-Driven Life by Miao, ZhiCheng is licensed under a Creative Commons Attribution 3.0 License.