[PAPER] Non-Blocking Algorithms
1. Introduction
1.1. Non-Blocking Algorithm Introduction
In computer science, non-blocking synchronization ensures that threads competing for a shared resource do not have their execution indefinitely postponed by mutual exclusion. A non-blocking algorithm is lock-free if there is guaranteed system-wide progress; wait-free if there is also guaranteed per-thread progress1.
This simple diagram shows more clearly the idea:
+------------------------------+--------------------------------+ | | | | Mutual Exclusion | Non-Blocking | | | | | | +-------------------------+ | | | | | | | | | | H | | | Lock-Freedom | | | | y | | | | | | | b | | | +------------------+ | | | Blocking | r | Busy-Waiting| | | | | | | | i | (Polling) | | | Wait-Freedom | | | | | d | | | | | | | | | s | | | +------------------+ | | | | | | | | | | | +-------------------------+ | | | | +------------------------------+--------------------------------+
Using mutual exclusion is the most instinctive and easiest way to protect data from corruption in concurrent multi-threaded programming, which simply creates a critical region around the operations of data structure. It works fine most of the situations, but in some situations, lock is just too expensive or even not available. One example is for some ISR(Interrupt Service Routines), if it disables the interruption, then the scheduler will not be run until they open the interruption again, if it also waits for some resources which is hold by some task, then it will never available because scheduler stops. Another example is in RTOS, higher priority wait for the lower priority task to release the resource, then it causes priority-inversion. So, non-blocking algorithms are born to solve these problem without using mutual exclusions.
Valois published his most cited paper talking about lock-free data structure at 19952. But unfortunately, there are some mistakes with the memory management method introduced by Valois3.
But algorithms from Valois needs special memory management, and it's just too complicated to use. Michael and Scott published their own algorithms based on the idea of cooperative task from Valois4.
There's another more detailed notes on lock-free and wait-free algorithms area from Internet5.
2. Fixed-size Slots Management
2.1. Problem
Provide a non-blocking algorithm that maintains an array of fixed-size slots with the following interface:
void InitSlots(void)
Slots initialization.
SLOTID GetSlot(void)
Exclusively get an id of one free slot, and prevent other user to get this slot.
void PutSlot(SLOTID id)
Reclaim a slot returned by GetSlot, and make it free to get again.
DATA* DecodeSlotID(SLOTID id)
Decode a id of slot, and return the memory address of that slot.
2.2. Primitive CAS
Atomic operation CAS(compare-and-swap) atomically compares the contents of a memory location to a given value and, if they are the same, modifies the contents of that memory location to a given new value6. The following algorithm assume that CAS is available on the target hardware.
2.3. First Algorithm (with mistakes)
typedef struct { int next; DATA data; }Node; Node Slots[MAX_SLOT_NUM]; int head; void InitSlots(void) { int i; for (i = 0; i < MAX_SLOT_NUM - 1; ++i) { Slots[i].next = i+1; } Slots[i].next = NULL_ID; head = 0; } int GetSlot(void) { int oldhead; int next; do { oldhead = head; if (oldhead == NULL_ID) { return NULL_ID; } else { next = Slots[oldhead].next; } } while(!CAS(&head, oldhead, next)); return oldhead; } void PutSlot(int id) { int oldhead; do { oldhead = head; Slots[id].next = oldhead; } while(!CAS(&head, oldhead, id)); } DATA* DecodeSlotID(int id) { return &Slots[id].data; }
The idea of this algorithm is to use CAS to check when modify the head, if it's still the old value. This is commonly called read-modify-write. But this arises the well-known ABA problem.
2.4. ABA Problem
The above algorithm has a subtle problem, it assumes that if the id didn't change, then the list remains the same also. But it's very common to happen that other tasks takes head and head.next and then returns the head, now the head.next actually changed. This problem is known as ABA problem7.
There are several ways to solve it. Valois gave a methodology of memory management which tracks use count of pointers8. This way assures that a pointer possessing by some one will never be allocated until no one has a copy of the pointer, thus avoiding the ABA problem to happen. Michael and Scott publishes their fixes on Valois's memory management mistakes9.
Another way is to use pointer tag, which adds an extra "tag" bits to the pointer. The "tag" usually increments itself on every copy operation. Because of this, the next compare-and-swap will fail, even if the addresses are the same, because the tag bits will not match. This does not completely solve the problem, as the tag bits will eventually wrap around, but helps to avoid it.
2.5. Use Pointer Tag to Avoid ABA Problem
typedef union { /** to write */ uint_t Value; /** to write */ struct { /** to write */ uhalfint_t Counter; /** to write */ uhalfint_t Index; } Data; } SLOTID;
Type "uhalfintt" is half length of uintt, uintt is unsigned integer type. The "Counter" here is the "tag" of the pointer.
Now the algorithm looks like this:
typedef struct { SLOTID next; DATA data; }Node; Node Slots[MAX_SLOT_NUM]; SLOTID head; static inline SLOTID NewSLOTID(uhalfint_t index) { SLOTID id; id.Data.Counter = 0; id.Data.Index = index; return id; } static inline bool SLOTID_CAS(SLOTID *id, SLOTID oldid, SLOTID newid) { /* Increae the counter to avoid ABA problem */ ++newid.Data.Counter; return CAS(&id->Value, oldid.Value, newid.Value); } void InitSlots(void) { int i; for (i = 0; i < MAX_SLOT_NUM - 1; ++i) { Slots[i].next = NewSLOTID(i+1); } Slots[i].next = NewSLOTID(NULL_ID); head = NewSLOTID(0); } SLOTID GetSlot(void) { SLOTID oldhead; SLOTID next; do { oldhead = head; if (oldhead == NULL_ID) { return NULL_ID; } else { next = Slots[oldhead.Data.Index].next; } } while(!SLOTID_CAS(&head, oldhead, next)); return oldhead; } void PutSlot(SLOTID id) { SLOTID oldhead; do { oldhead = head; Slots[id.Data.Index].next = oldhead; } while(!SLOTID_CAS(&head, oldhead, id)); } DATA* DecodeSlotID(SLOTID id) { return &Slots[id.Data.Index].data; }
The key algorithm is the SLOTIDCAS operation: every time it's going to change the SLOTID, it uses SLOTIDCAS, which increase the Counter then CAS. This makes the ABA like ABA'. The index can be the same, but the counter is unlikely the same, the wider range of Counter is, the lesser possibility ABA will happen. On a 32-bit machine, range of Counter is [0..216].
2.6. Wider CAS
The problem of packing counter and index into a integer is obvious: the limitation number of array elements on a 32-bit machine is 216. And the counter limitation is 216, after that it wraps to 0. 216 is not a big enough value to soothe the skeptics, so on some architecture wider CAS is provided. Wider CAS is different from Multi CAS, the former can CAS on an adjacent memory fields thus be called wider, but the latter can CAS on unrelated memory fields.
On later inter x86 processor, it provides an instruction called CMPXCHG8B, which compare-and-swap-8-bytes10. By this instruction, we can operate on a normal memory pointer instead of a memory pointer, and its "tag" which has a range as large as 232.
CAS2 is a realistic existence of Multi CAS, but only on some Motorola 680x0 processors.
2.7. Load-Link and Store-Conditional(LL/SC)
In computer science, load-link (LL, also known as "load-linked" or "load and reserve") and store-conditional (SC) are a pair of instructions that together implement a lock-free atomic read-modify-write operation.
Load-link returns the current value of a memory location. A subsequent store-conditional to the same memory location will store a new value only if no updates have occurred to that location since the load-link. If any updates have occurred, the store-conditional is guaranteed to fail, even if the value read by the load-link has since been restored. As such, an LL/SC pair is stronger than a read followed by a compare-and-swap (CAS), which will not detect updates if the old value has been restored.11
LL/SC can finally make skeptics happy, it doesn't just make ABA problem look like ABA', but solve it in another say. The algorithm with LL/SC can be:
typedef struct { int next; DATA data; }Node; Node Slots[MAX_SLOT_NUM]; int head; void InitSlots(void) { int i; for (i = 0; i < MAX_SLOT_NUM - 1; ++i) { Slots[i].next = i+1; } Slots[i].next = NULL_ID; head = 0; } int GetSlot(void) { int oldhead; int next; do { oldhead = LL(&head); if (oldhead == NULL_ID) { return NULL_ID; } else { next = Slots[oldhead].next; } } while(!SC(&head, next)); return oldhead; } void PutSlot(int id) { int oldhead; do { oldhead = LL(&head); Slots[id].next = oldhead; } while(!SC(&head, id)); } DATA* DecodeSlotID(int id) { return &Slots[id].data; }
To use the above algorithm, the users have to also use LL/SC to load and store the slot id because of the ABA problem. But since realistic LL/SC implementations all have limitations, actually LL/SC version algorithm is more difficult to use than the CAS version.
2.8. Realistic Limitations
Real implementations of LL/SC can be found on Alpha, PowerPC12, MIPS, ARMv6(or above). But they're all Weak LL/SC, SC can fail even if there's no update between LL and corresponding SC, for example:
- The CPU can only reserves a memory region at a time.
- A context switching between them can cause the SC fail.
The first limitation can cause a lot of ideal algorithms based on LL/SC fail. So CAS version algorithm is still preferable.
2.9. Conclusion
Wait-Free Fix-size Slots Management is a simple version of non-blocking memory management implementation, it only manage fix-size slots. But it already make a lot of Non-Blocking algorithm common problems and tricks emerging. Later we'll see a implementation of Wait-Free Fifo Queue based on this Slots Management.
3. FIFO Queue
3.1. Problem
Provide a non-blocking algorithm that provides a FIFO Queue with the following interface:
void InitFIFO(void)
FIFO queue initialization.
bool Enqueue(DATA data)
Push a data into the FIFO queue. Return TRUE if success, return FALSE only if the FIFO is full.
bool Dequeue(DATA* pdata)
Pull a data from the FIFO queue. Return TRUE if success, return FALSE only if the FIFO is empty.
3.2. Link-List Structure
The algorithm introduced here is based on a link-list structure13. It's a single link-list with head and tail pointers. It looks like:
+-------+ +-------+ +-------+ +-------+ | | | | | | | | | -+---->| -+---->| -+---->| -+---->NULL | | | | | | | | +---^---+ +-------+ +-------+ +---^---+ | | | | | | HEAD TAIL
New data is pushed to the tail, and old data is pulled from head.
3.3. Dequeue Data, first thought
To dequeue a data, first read the HEAD value, and use CAS to swap the HEAD pointer to it's next pointer. And the old HEAD value is the data that dequeued. To avoid ABA problem14, we should use SLOTIDCAS which described in the "Fix-size Slots Management"15.
+-------+ +-------+ +-------+ +-------+ | | | | | | | | | -+---->| -+---->| -+---->| -+----> NULL | | | | | | | | +-------+ +---^---+ +-------+ +---^---+ | | | | | | HEAD TAIL
3.4. Enqueue Data
Enqueue operation is more complicated than dequeue operation, because it has to change two state: the next pointer of the tail slot, and the tail pointer. If we have MCAS(Multi CAS) or at least Double CAS16, then the problem is easy to solve. But since DCAS is not widely available in computer system, so a so-called cooperative method is introduced firstly by Valois17.
In cooperative method, enqueue operation first allocates a new slot, and sets next pointer of tail slot to the new slot. Then set tail pointer to new slot. All theses set operation use SLOTIDCAS, so the second step can fail, if someone others enters in, and set the tail pointer. Which leaves the FIFO in a intermediate state:
+-------+ +-------+ +-------+ +-------+ | | | | | | | | | -+---->| -+---->| -+---->| -+---->NULL | | | | | | | | +---^---+ +-------+ +---^---+ +-------+ | | | | | | HEAD TAIL
To complete the intermediate state:
- The next enqueue operation should try to move the TAIL forward, until the TAIL at the end, the enqueue operation should not change the TAIL slot next pointer.
- The next dequeue operation should also try to move the TAIL forward when the head meets the tail pointer.
3.5. Algorithm
This algorithm uses CAS version of "Fix-size Slots Management"15.
typedef struct { DATA data; SLOTID next; } Node; SLOTID head; SLOTID tail; void InitFIFO(void) { Node *sentinel; /* This algorithm relies on the fix-size slots management */ InitSlots(); /* Sentinel slots */ head = tail = GetSlot(); sentinel = DecodeSlotID(head); sentinel->next = NULL_ID; } /* Origin Algorithm: E01: node = new node() E02: node->value = value E03: node->next.ptr = NULL E04: loop E05: tail = Q->Tail E06: next = tail.ptr->next E07: if tail == Q->Tail E08: if next.ptr == NULL E09: if CAS(&tail.ptr->next, next, <node, next.count+1>) E10: break E11: endif E12: else E13: CAS(&Q->Tail, tail, <next.ptr, tail.count+1>) E14: endif E15: endif E16: endloop E17: CAS(&Q->Tail, tail, <node, tail.count+1>) */ bool Enqueue(DATA data) { SLOTID newid; Node *newnode; SLOTID oldtail; Node *oldtailnode; SLOTID nextid; Node *nextnode; /* E01 */ newid = GetSlot(); newnode = DecodeSlotID(newid); if (newnode == NULL) return FALSE; /* E02 */ newnode->data = data; /* E03 */ newnode->next = NULL_ID; /* E04 */ do { /* E05 */ oldtail = tail; oldtailnode = DecodeSlotID(oldtail); /* E06 */ nextid = oldtailnode->next; nextnode = DecodeSlotID(nextid); /* E07 */ if (oldtail == tail) { /* E08 */ if (nextnode == NULL) { /* E09 */ if (SLOTID_CAS(&oldtailnode->next, next, newid)) { /* E10 */ break; /* E11 */ } } /* E12 */ else { /* E13 */ SLOTID_CAS(&tail, oldtail, next); /* E14 */ } /* E15 */ } /* E16 */ } while(1); /* E17 */ SLOTID_CAS(&tail, oldtail, newid); return TRUE; } /* Origin Algorithm: D01: loop D02: head = Q->Head D03: tail = Q->Tail D04: next = head->next D05: if head == Q->Head D06: if head.ptr == tail.ptr D07: if next.ptr == NULL D08: return FALSE D09: endif D10: CAS(&Q->Tail, tail, <next.ptr, tail.count+1>) D11: else D12: *pvalue = next.ptr->value D13: if CAS(&Q->Head, head, <next.ptr, head.count+1>) D14: break D15: endif D16: endif D17: endif D18: endloop D19: free(head.ptr) D20: return TRUE */ bool Dequeue(DATA *pdata) { SLOTID oldhead Node *oldheadnode; SLOTID oldtail; Node *oldtailnode; SLOTID next; Node *nextnode; /* D01 */ do { /* D02 */ oldhead = head; oldheadnode = DecodeSlotID(oldhead); /* D03 */ oldtail = tail; oldtailnode = DecodeSlotID(oldtail); /* D04 */ next = oldhead->next; nextnode = DecodeSlotID(next); /* D05 */ if (oldhead == head) { /* D06 */ if (headnode == tailnode) { /* D07 */ if (nextnode == NULL) { /* D08 */ return FALSE; /* D09 */ } /* D10 */ SLOTID_CAS(&tail, oldtail, next); } /* D11 */ else { /* D12 */ *pdata = nextnode->data; /* D13 */ if (SLOTID_CAS(&head, oldhead, next) { /* D14 */ break; /* D15 */ } /* *D16 / } /* D17 */ } /* D18 */ } /* D19 */ PutSlot(oldhead); /* D20 */ return TRUE; }
In enqueue, E13 is the cooperative operation.
In dequeue, D10 is the cooperative operation.
3.6. Array Based Algorithm
Valois also introduced a array based Non-Blocking FIFO algorithm18, but it's not feasible on real machine because it requires CAS on unaligned memory region.
3.7. Conclusion
This article introduces a link-listed based Non-Blocking algorithm for FIFO queue. It's based on the "Fix-size Slots Management"[2], and feasible on the platforms that support CAS.
Footnotes:
Non-blocking synchronization from Wikipedia http://en.wikipedia.org/wiki/Lock-free_and_wait-free_algorithms#Wait-freedom
J.D. Valois, Lock-Free Data Structures PhD Thesis, Rensselaer Polytechnic Institute, Department of Computer Science, 1995.
Maged M. Michael, Michael L. Scott Correction of a Memory Management Method for Lock-Free Data Structures, Department of Computer Science University of Rochester, 1995.
Maged M. Michael Michael L. Scott, Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms,, Department of Computer Science University of Rochester, 1995.
Some notes on lock-free and wait-free algorithms http://www.audiomulch.com/~rossb/code/lockfree
Compare-and-swap from Wikipedia http://en.wikipedia.org/wiki/Compare-and-swap
ABA problem from Wikipedia http://en.wikipedia.org/wiki/ABA_problem
J.D. Valois, Lock-Free Data Structures PhD Thesis, Rensselaer Polytechnic Institute, Department of Computer Science, 1995.
Maged M. Michael, Michael L. Scott Correction of a Memory Management Method for Lock-Free Data Structures, Department of Computer Science University of Rochester, 1995.
"Intel 64 and IA-32 Architectures Software Developer's Manual Volume 2A: Instruction Set Reference, A-M" (PDF). Retrieved on 2007-12-15.
Load-Link/Store-Conditional from Wikipedia http://en.wikipedia.org/wiki/Load-Link/Store-Conditional
"Power ISA Version 2.05". Power.org (2007-10-23). Retrieved on 2007-12-18.
Maged M. Michael Michael L. Scott, Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms,, Department of Computer Science University of Rochester, 1995.
ABA problem from Wikipedia http://en.wikipedia.org/wiki/ABA_problem
ZC Miao, Non-Blocking Algorithm - Fix-size Slots Management, revision 2008-12-04
Double Compare And Swap from Wikipedia http://en.wikipedia.org/wiki/Double_Compare_And_Swap
J.D. Valois, Lock-Free Data Structures PhD Thesis, Rensselaer Polytechnic Institute, Department of Computer Science, 1995.
J.D. Valois, Implementing Lock-Free Queues