Perceiving The Effects Of Cache Coherency In Spin Locks
The "test-and-test-and-set" algorithm is a fairly well-understood solution to reducing latency in spin locks. In particular, the algorithm is often employed to reduce the latency exhibited in the similar "test-and-set" algorithm. Understanding how and why a seeming-superfluous read results in lower latency is rather interesting.
An astute reader may point out that the "test(-and-test)?-and-set" algorithms are somewhat deprecated, as they were identified early on as having a finite consensus number. Indeed, what I'm actually presenting is more of a "test-and-compare-and-swap" (hence, TCAS) solution, but no such algorithm name seems to exist.
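To make the distinction concrete, here is a minimal sketch of the two locking strategies using java.util.concurrent.atomic.AtomicInteger rather than Unsafe; the class and method names are purely illustrative, and the actual benchmarked code appears later in the article.

import java.util.concurrent.atomic.AtomicInteger;

// Minimal sketch of the two strategies; illustrative only, not the benchmarked code.
final class SpinLockSketch {
    private final AtomicInteger locked = new AtomicInteger(0);

    // "test-and-set": blindly attempt the atomic write on every iteration.
    void lockTestAndSet() {
        while (!locked.compareAndSet(0, 1)) {
            // spin and retry
        }
    }

    // TCAS: only attempt the atomic write once a plain read shows the lock as free.
    void lockTestTestAndSet() {
        for (;;) {
            if (locked.get() == 0 && locked.compareAndSet(0, 1)) {
                return;
            }
        }
    }

    void unlock() {
        locked.set(0);
    }
}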
Results
In a somewhat unconventional approach, I've decided to present the results before offering an explanation. If you are familiar with the algorithm, the results should come as no huge surprise: the TCAS approach made a fairly sizable difference under increased contention. All tests were run 20 times on a consistently loaded 4-core (8 logical cores with Hyper-Threading) i7.
% taskset -c 0,2 java
Benchmark        Mode  Threads  Count  Sec  Mean (ns/op)  Mean error (ns/op)
baseline         avgt        2    200    1         0.726               0.003
testAndSet       avgt        2    200    1       132.459               0.650
testTestAndSet   avgt        2    200    1       160.533               3.013
% taskset -c 0,2,4,6 java
Benchmark        Mode  Threads  Count  Sec  Mean (ns/op)  Mean error (ns/op)
baseline         avgt        4    200    1         1.445               0.001
testAndSet       avgt        4    200    1       379.067               8.485
testTestAndSet   avgt        4    200    1       229.402               1.221
% taskset -c 0,1,2,3,4,6 java
Benchmark        Mode  Threads  Count  Sec  Mean (ns/op)  Mean error (ns/op)
baseline         avgt        6    200    1         1.151               0.009
testAndSet       avgt        6    200    1       830.886               4.929
testTestAndSet   avgt        6    200    1       565.265               4.067
The results are much as expected. Under lower contention, the TCAS approach added a slight penalty (the extra read is pure overhead when the lock is rarely contended), whereas under increased load, TCAS decreased the per-operation time by 30-40%. While not shown above, under less-controlled circumstances (overloading the CPU), the decrease exceeded 50%. Keep in mind that the controls on these benchmarks are somewhat lax, and the results should not be considered scientific.
The Gist
The OpenJDK JMH library was used to perform the micro-benchmarks. In general, the two tests used basic spin locks, one using a regular CAS (testAndSet) and one using TCAS (testTestAndSet).
import java.lang.reflect.Field;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

import sun.misc.Unsafe;

@Fork(10)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class TestTestAndSetTest {

    private static final Unsafe unsafe;
    private static final long offsetLocked;

    static {
        try {
            // Obtain the Unsafe instance reflectively and cache the field offset of "locked".
            final Field theUnsafe = Unsafe.class.getDeclaredField("theUnsafe");
            theUnsafe.setAccessible(true);
            unsafe = (Unsafe) theUnsafe.get(null);
            offsetLocked = unsafe.objectFieldOffset(Test.class.getDeclaredField("locked"));
        } catch (final Exception e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    @GenerateMicroBenchmark
    public void baseline(final Test test) {
        test.doNothing();
    }

    @GenerateMicroBenchmark
    public void testTestAndSet(final Test test) {
        test.testTestAndSet();
    }

    @GenerateMicroBenchmark
    public void testAndSet(final Test test) {
        test.testAndSet();
    }

    @State(Scope.Benchmark)
    public static class Test {
        private volatile int locked;
        private long a1;

        @Setup
        public void setup() {
            a1 = (long) (Math.random() * Long.MAX_VALUE);
            locked = 0;
        }

        public void doNothing() {
        }

        public void testTestAndSet() {
            for (;;) {
                // Only attempt the atomic write once a plain read shows the lock as free.
                if (locked == 0 && unsafe.compareAndSwapInt(this, offsetLocked, 0, 1)) {
                    a1++;
                    locked = 0;
                    break;
                }
            }
        }

        public void testAndSet() {
            for (;;) {
                // Blindly attempt the atomic write on every iteration.
                if (unsafe.compareAndSwapInt(this, offsetLocked, 0, 1)) {
                    a1++;
                    locked = 0;
                    break;
                }
            }
        }
    }
}
A Brief Analysis
While it may initially seem counterintuitive that an extra load leads to lower latency, there are two fairly understandable reasons: the CPU's cache coherency protocol and contended writes.
Cache Coherency
Numerous other websites and resources detail MESI and other cache coherency protocols in great depth, so I will not do so here. Instead, I will attempt to summarise: cache coherency enables multi-core machines (whether SMP or NUMA) to keep memory that resides in per-processor caches consistent. MESI defines a state machine that "tags" each cache line with one of four states: modified, exclusive, shared, or invalid. These states determine how the CPU must handle reads and writes to that line.
Suppose a processor is attempting to acquire the semaphore: it must first inspect its cache to determine whether the semaphore's cache line is also present in other processors' caches (the "shared" state). If so, the acquiring processor must first issue an "invalidate" message on the bus, and all other processors must comply, marking their copies "invalid". The acquiring processor then performs the write and tags its copy of the cache line "modified". In a simple implementation, the write is also sent to main memory (or a shared cache). On a subsequent read of an "invalid" cache line, a processor must reload the line from memory (or the shared cache) at a cost (say, 7-10 cycles) substantially higher than a local read (roughly 2-3 cycles).
This behaviour is colloquially referred to as "ping-ponging", as ownership of the cache line bounces between processors. If, instead, each processor only wrote when it could make meaningful progress (that is, when the lock is visibly unlocked), all processors could hold the cache line in the "shared" state while spinning. Such behaviour reduces bus traffic, allowing the critical section to execute quickly.
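As a rough mental model of the transitions described above (and emphatically not a faithful hardware simulation), consider the following sketch, in which every name is illustrative:

// Toy model of the MESI behaviour described above; illustrative only.
enum LineState { MODIFIED, EXCLUSIVE, SHARED, INVALID }

final class Core {
    LineState line = LineState.SHARED;

    // A read of a non-INVALID line is cheap and generates no bus traffic,
    // which is why spinning on a plain read is comparatively harmless.
    boolean readIsCheap() {
        return line != LineState.INVALID;
    }

    // A write must first invalidate every other copy of the line, forcing the other
    // cores to re-fetch it later -- the "ping-pong" when everyone writes eagerly.
    void write(final Core[] others) {
        for (final Core other : others) {
            other.line = LineState.INVALID;
        }
        line = LineState.MODIFIED;
    }
}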
Contended Writes
On x86 (the system on which these benchmarks were run), the Unsafe.compareAndSwapInt method used in both benchmarked methods maps to LOCK CMPXCHG (prefix + opcode), an atomic CAS. Under reasonably low contention, the operation rarely fails, as the window between the preceding read and the subsequent write is short. As contention increases, however, a proportionally larger share of writes fail (only one thread can "win" per distinct read). The TCAS implementation reduces the number of unnecessary writes to a minimum: it only attempts the CAS when the lock is evidently unlocked.
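The proportional growth of failed CAS attempts is easy to observe directly. The sketch below (the thread count, iteration count, and all names are my own choices, not taken from the benchmark) simply counts how often compareAndSet loses the race:

import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.LongAdder;

// Rough sketch: count how often a CAS attempt fails under contention.
public final class CasFailureDemo {
    private static final AtomicInteger lock = new AtomicInteger(0);
    private static final LongAdder failures = new LongAdder();

    public static void main(final String[] args) throws InterruptedException {
        final int threads = 4;                      // illustrative; vary to change contention
        final Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int n = 0; n < 100_000; n++) {
                    while (!lock.compareAndSet(0, 1)) {
                        failures.increment();       // another thread "won" this round
                    }
                    lock.set(0);                    // release immediately
                }
            });
            workers[i].start();
        }
        for (final Thread worker : workers) {
            worker.join();
        }
        System.out.println("failed CAS attempts: " + failures.sum());
    }
}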
While this explanation is tightly coupled to cache coherency in that read and write semantics differ, certain architectures present additional concerns. As mentioned, x86 performs a LOCK CMPXCHG; of importance is the LOCK prefix, which causes the cache line (or, historically, the entire bus) to be locked against both reads and writes by all other processors. While the line is locked, no other processor can make progress on operations against it (keep in mind, however, that the lock lasts only for the duration of a single instruction). This cache line lock may stall pipelines, possibly lengthening the critical section if the semaphore cannot be released quickly.
An Interesting Note About Locality
One noteworthy detail about the provided example is the locality of the semaphore and the counter. There is a good likelihood (but no guarantee) that the semaphore and the counter will sit within the same cache line in the example above. As such, whichever processor successfully acquires the semaphore will also have the counter in its L1 cache (likely in the "modified" state). In experiments, when the locked and a1 fields were forced onto different cache lines (by adding padding between the two fields), the results changed, requiring nearly twice the contention before TCAS showed any benefit.
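A hedged sketch of one way to add such padding follows; the p0..p7 fields are my own, and note that the JVM is free to reorder fields, so manual padding of this sort is a best-effort trick rather than a guarantee (JDK 8's sun.misc.Contended annotation is the more reliable route).

// Sketch of manual padding to push 'locked' and 'a1' onto separate cache lines,
// assuming 64-byte lines. Field layout is ultimately up to the JVM.
final class PaddedTest {
    volatile int locked;
    long p0, p1, p2, p3, p4, p5, p6, p7;   // ~64 bytes of padding between the fields
    long a1;
}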
Considerations
The academic and research communities have published some excellent analysis on CAS contention. Solutions include write back-offs and partitioning. While the example above could theoretically benefit from such enhancements, such experimentation is well beyond the scope of this article (but quite possibly the subject of a future article).
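As a flavour of what a write back-off might look like, here is a minimal sketch of a TCAS loop with a bounded, randomised exponential back-off; the constants are arbitrary and untuned:

import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.LockSupport;

// Minimal sketch of a TCAS spin lock with bounded exponential back-off.
// Constants are arbitrary; shown only to illustrate the idea.
final class BackoffSpinLock {
    private final AtomicInteger locked = new AtomicInteger(0);

    void lock() {
        int backoffNanos = 64;                      // starting back-off window
        for (;;) {
            if (locked.get() == 0 && locked.compareAndSet(0, 1)) {
                return;
            }
            // Failed attempt: pause for a randomised, growing interval before retrying,
            // which reduces the number of contending writes on the cache line.
            LockSupport.parkNanos(ThreadLocalRandom.current().nextInt(backoffNanos));
            backoffNanos = Math.min(backoffNanos << 1, 1 << 16);
        }
    }

    void unlock() {
        locked.set(0);
    }
}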