Wednesday, April 17, 2013

Going Atomic with Locks

You might have read about non-blocking algorithms, how the synchronized keyword is so 2012 and how things are much faster if you use atomics. Then, you might have written your own atomic-based locks and discovered that things are no better.

I certainly have. But thanks to this fine book [1] I've learned not only how to make them faster but also why they're faster.

Test-and-set Locks

Let's take a typical Java lock:


    private ReentrantLock lock = new ReentrantLock();

    @Override
    public void unlock() {
        lock.unlock();
    }

    @Override
    public void lock() {
        lock.lock();
    }

and let's have 100 threads contending for the lock to increment an int 50,000 times. Naively, I thought this would be efficient:



    private final AtomicBoolean state = new AtomicBoolean(false);

    public void lock() {
        // Spin, repeatedly trying to flip the flag from false to true atomically.
        while (state.getAndSet(true)) { }
    }

    public void unlock() {
        state.set(false);
    }


It was an order of magnitude slower. The questions are: why is it so much slower, and how do we improve it?
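For what it's worth, a harness to drive such a comparison might look roughly like the sketch below. The class and method names (LockTimer, SpinLock, time) are illustrative choices of mine, not the post's original benchmark code; only the shape of the test (many threads hammering one counter through a lock) comes from the description above.

    // A minimal sketch, not the original com.henryp.test.concurrent.spin code.
    public class LockTimer {

        public interface SpinLock {
            void lock();
            void unlock();
        }

        private static int counter = 0;

        public static long time(final SpinLock lock, int threads, final int increments)
                throws InterruptedException {
            counter = 0;
            Thread[] workers = new Thread[threads];
            long start = System.currentTimeMillis();
            for (int i = 0; i < threads; i++) {
                workers[i] = new Thread(new Runnable() {
                    @Override
                    public void run() {
                        for (int j = 0; j < increments; j++) {
                            lock.lock();
                            try {
                                counter++;      // the shared int the lock protects
                            } finally {
                                lock.unlock();
                            }
                        }
                    }
                });
                workers[i].start();
            }
            for (Thread worker : workers) {
                worker.join();                  // stop the clock only when every thread is done
            }
            return System.currentTimeMillis() - start;
        }
    }

Each lock implementation is then passed to the same time() method, so the only thing that varies between runs is the locking strategy.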


Herlihy and Shavit [1] call this naive spin lock a TASLock (a test-and-set lock). They describe why it's so slow:

"For simplicity, we consider a typical multiprocessor architecture in which processors communicate by a shared broadcast medium called a bus (like a tiny Ethernet). Both the processors and the memory controller can broadcast on the bus, but only one processor (or memory) can broadcast on the bus at the a time. All processors (and memory) can listen. Today, bus-based architecture are common because they are easy to build, although they scale poorly to large numbers of processors.


"Each processor has a cache, a small high-speed memory where the processor keeps data likely to be of interest. A memory access typically requires orders of magnitude more machine cycles than a cache access... When a processor reads from an address in memory, it first checks whether that address and its contents are present in the cache. If so, then the processor has a cache hit, and can load the value immediately. If not, the the processor has a cache miss, and must find the data either in the memory or in another processor's cache. The processor then broadcasts the address on the bus. The other processors snoop on the bus. If one processor has that address in its cache, then it responds by broadcasting the address and value. If no processor has that address, then the memory responds with the value at theat address.

"Each getAndSet() call is broadcast on the bus. Because all threads must use the bus to communicate with memory, these getAndSet() calls delay all threads, even those not waiting for the lock. Even worse, the getAndSet() call forces other processors to discard their own cached copies of the lock, so every spinning thread encounters a cache miss almost every time, and must use the bus to fetch the new, but unchanged value. Adding insult to injury, when the thread holding the lock tries to release it, it may be delayed because the bus in monopolized by the spinners." [1]

Using the Linux perf command shows the number of bus cycles and cache-misses:


[henryp@corsair Performance]$ perf stat -e cycles,bus-cycles,cache-misses,L1-dcache-load-misses,L1-dcache-store-misses,LLC-load-misses,LLC-store-misses  java -cp bin/ com.henryp.test.concurrent.spin.LockMain

 Performance counter stats for 'java -cp bin/ com.henryp.test.concurrent.spin.LockMain':

   564,815,071,755 cycles                    #    0.000 GHz                     [57.17%]
    16,620,454,218 bus-cycles                                                   [57.18%]
           832,399 cache-misses                                                 [57.19%]
       421,992,129 L1-dcache-load-misses                                        [57.17%]
        14,457,236 L1-dcache-store-misses                                       [57.14%]
           247,642 LLC-load-misses                                              [57.11%]
           529,972 LLC-store-misses                                             [57.14%]

      10.681204987 seconds time elapsed



where LLC is the Last Level Cache (L3 on my machine). A miss in this cache means going all the way out to main memory (RAM).

We'll compare these figures to the other strategies.

Test-and-test-and-set Locks

Herlihy and Shavit propose an improvement that looks something like this:


    private final AtomicBoolean state = new AtomicBoolean(false);

    @Override
    public void lock() {
        while (true) {
            while (state.get()) { }
            if (!state.getAndSet(true))
                return;
        }
    }

    @Override
    public void unlock() {
        state.set(false);
    }


It is somewhat better, but not by much (about 25%). The authors call this a TTASLock (a test-and-test-and-set lock):

"The first time thread B reads the lock it takes a cache miss, forcing B to block while the value is loaded into B's cache. As long as A holds the lock, B repeatedly rereads the value, but hits in the cache every time. B thus produces no bus traffic, and does not slow down other threads' memory accesses. Moreover, a thread that releases a lock is not delayed by threads spinning on that lock.

"The situation deteriorates, however, when the lock is released. The lock holder releases the lock by writing false to the lock variable, which immediately invalidates the spinners' cached copies. Each one takes a cache miss, rereads the new value, and they all (more-or-less simultaneously) call getAndSet() to acquire the lock. The first to succeed invalidates the others, who must then reread the value, causing a storm of bus traffic. Eventually, the threads settled down once again to local spinning.

"The notion of local spinning, where threads repeatedly reread cached values instead of repeatedly using the bus, is an important principle critical to the design of efficient spin locks." [1]

With this strategy, perf gives figures like:


   412,122,456,391 cycles                    #    0.000 GHz                     [57.19%]
    12,132,558,770 bus-cycles                                                   [57.15%]
           783,803 cache-misses                                                 [57.15%]
       128,832,596 L1-dcache-load-misses                                        [57.14%]
        11,947,461 L1-dcache-store-misses                                       [57.16%]
           256,900 LLC-load-misses                                              [57.18%]
           560,395 LLC-store-misses                                             [57.17%]

       7.813546393 seconds time elapsed


So, although the LLC figures are comparable, there is less bus activity and there are fewer L1 misses, just as Herlihy and Shavit suggested.

Back off!

The solution is to implement a back-off strategy (since vying for a highly contended lock may be a waste of time). The code the authors suggest looks a lot like the TTASLock, something like this:


    private final AtomicBoolean state = new AtomicBoolean(false);

    private final Random random = new Random();

    // limit and maxDelay were not shown in the original snippet; these values are illustrative.
    private int limit = 1;            // current back-off bound (ms), doubled on each spin
    private final int maxDelay = 100; // cap on the back-off bound (ms)

    @Override
    public void lock() {
        while (true) {
            while (state.get()) { backOff(); }
            if (!state.getAndSet(true))
                return;
        }
    }

    @Override
    public void unlock() {
        state.set(false);
    }

    public void backOff() {
        int delay = random.nextInt(limit);
        limit = Math.min(maxDelay, 2 * (limit == 0 ? 1 : limit));
        try {
            Thread.sleep(delay);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status for callers
        }
    }
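One design point worth noting: because limit lives in a field, the back-off interval keeps growing across acquisitions for the lifetime of the lock object. A variation, sketched by me rather than taken from the post, keeps the back-off state local to each lock() call so the delay starts small again each time the lock is contended:

    @Override
    public void lock() {
        int localLimit = 1;                       // per-acquisition back-off bound (ms)
        while (true) {
            while (state.get()) {
                int delay = random.nextInt(localLimit);
                localLimit = Math.min(maxDelay, 2 * localLimit);
                try {
                    Thread.sleep(delay);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
            if (!state.getAndSet(true))
                return;
        }
    }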



The back-off strategy actually makes it faster than Java's built-in locks. The results look something like this:

JavaReentrantLock took 459 ms
TTASLockWithBackOff took 321 ms
TTASLock took 8932 ms
TASLock took 12600 ms

Repeatedly running the tests indicates that the TTASLock with back-off is the fastest, by a margin of about 25%.

Perf gives figures for this strategy like:

       900,772,793 cycles                    #    0.000 GHz                     [62.65%]
        56,529,667 bus-cycles                                                   [64.05%]
           468,664 cache-misses                                                 [62.47%]
         9,728,368 L1-dcache-load-misses                                        [60.17%]
         6,039,764 L1-dcache-store-misses                                       [57.58%]
           173,197 LLC-load-misses                                              [57.86%]
           573,249 LLC-store-misses                                             [60.75%]
       0.465015535 seconds time elapsed

Now the differences from the plain spin locks are orders of magnitude.

Quick Addendum

I initially tried to use Valgrind to analyse what was going on but saw very little difference between the figures (about 4% fewer cache-misses between the best and worst locks). But reading the documentation explains why:



"Valgrind serialises execution so that only one (kernel) thread is running at a time. This approach avoids the horrible implementation problems of implementing a truly multithreaded version of Valgrind, but it does mean that threaded apps never use more than one CPU simultaneously, even if you have a multiprocessor or multicore machine." [2]


[1] Maurice Herlihy and Nir Shavit, The Art of Multiprocessor Programming.

[2] http://valgrind.org/docs/manual/manual-core.html
