Tuesday, May 21, 2013

Atomics and Speeds

Just to compare strategies for writing to arrays, I put together some code where one thread writes a non-zero value to an array while another reads through the array, spinning on any elements that are still zero. There is no locking, so this should be fast.

The four different approaches are:

1. Writing to a simple int[] array.
2. Writing to a simple int[] array but occasionally reading/writing a volatile field
3. Writing to a simple int[] array but always reading/writing a volatile field
4. Using AtomicIntegerArray.set(...)
5. Using AtomicIntegerArray.lazySet(...)

The test:

1. Starts the producer thread
2. Starts the consumer thread
3. Joins on the producer thread
4. Joins on the consumer thread but times out after 10 seconds. If the reader thread has not finished after this time, we regard the test in error.

The same framework code is used for all approaches. After a few runs, the results look like:


Approach Mean (seconds) Standard Deviation (seconds)
AtomicIntegerArray.set(...) 131.2 3.6
Normal Array Access With Constant Volatile Read/Write 96.6 5.1
AtomicIntegerArray.lazySet(...) 42.3 6.0
Normal Array Access 37.4 4.7
Normal Array Access With Occasional Volatile Read/Write 6.3 3.8

So, we can quickly see that the set(...) method is the slowest way of writing to an array on my machine.

Normal array access looks reasonably fast but it is wrong! Occasionally, you will see the timeout in test step #4 being triggered as there is no guarantee that (in the absence of memory semantics) that the reader thread will see what the writer thread puts in the array.

The winner by far is the strategy to read/write to the array as normal but occasionally have the array writer thread write to some volatile field and the reader thread sometimes read from it. How often is a matter of tuning and what your needs are.

So, in summary:

  1. Avoid reading/writing from arrays with multiple threads with no further memory semantics (and making the reference to the array volatile is not sufficient). You might find one thread never sees the data another thread has put into the array.
  2. If you want the fastest exchange of data and can tolerate it being somewhat stale, occasionally use memory semantics.


Sunday, May 19, 2013

More Atomics

This week, I went to Nick Zeeb's talk at Skills Matter where I pestered him about atomics.

He was talking about the latest new technology from those clever boys and girls down at LMAX, the Coalescing Ring Buffer.

[First, one interesting aside: the mod operator (%) is expensive. Therefore, when you can, choose a value to mod against that is a power of 2 (let's call this value X). When it comes to mod-ing a variable against it, AND that variable with the value X-1. This is much faster.]

My interest in the system was piqued by their use of AtomicReferenceArray, a class that is not much discussed yet is pretty critical to some multithreaded apps. He confirmed my experience that this class is slow and you should avoid it unless you really have to.

How to increase the efficiency of AtomicReferenceArray

AtomicReferenceArray is slow as each call to set flushes the cache. Nick recommended the use of the method lazySet. The JavaDocs merely says this method "eventually sets the element at position i to the given value." Nick expanded on this and said that this method offers a write to memory in a reasonable time but you wouldn't want to use it if the value were critical and had to be written immediately (the Coalescing Ring Buffer uses them for the non-critical purging of values).

What's more, lazySet offers ordering semantics (ie, everything visible to that thread before the call to lazySet is available to other threads after).

Moran Tzafrir of Tel Aviv University points out another interesting optimization that would avoid atomics altogether:

"When you do code optimization, you might remove the “volatile” keyword in very specific situation. You have global variables G1…Gn, and always you update G1…Gn together. And Gn is updated last.  In this case you might consider “volatile” only to Gn.  (Note this does not guaranty atomic/transactional update to G1…Gn, it just guaranty that after you write to Gn all preceding updates are visible to the other threads)... Gn should be read first. (On the read scenario)" [1]

Basically, if you batch updates, you would not have to use a relatively expensive AtomicReferenceArray.set(...) every time, but just make the variable written last and read first volatile.

Linux specific testing

Nick also emphasised the need for testing your suppositions. He mentions these JVM parameters in the slides but I'll put them here for easy access:

Setting the frequency of your CPU:

sudo cpufreq-set -c 0 -f 1700000

Run the tests with:

sudo nice -n -20 taskset -c 1,3 java -XX:+PrintCompilation -XX:+PrintGC -server -XX:CompileThreshold=100000 -Xmx1g -Xms1g ...

[1] https://groups.google.com/forum/?fromgroups=#!searchin/art-of-multiprocessor-programming/moran$20tzafrir/art-of-multiprocessor-programming/3y20qh7ooK0/z_lmxMVqk48J





Monday, May 6, 2013

Cuckoo Hashing

I was taking a look at a library called Kryo that is being used on a project I'm working on. It promises to be a much faster way of de/serializing objects.

Trying to discover it's secrets, I came across a class called com.esotericsoftware.kryo.util.ObjectMap. It's an implementation of a map that claims to be very fast for get, containsKey and remove although it maybe slower for put.

The way it does this is by using a technique called Cuckoo hashing. Here, we have an array of keys and an array of values. Upon insertion, we calculate a hash for our key and see if the value slot is free. If it is, we insert the value and we're done. If not, we may repeat this process for N more times using a different hash. Finally, failing this, we check a fixed size of memory called the stash. I'll return to this in a moment.

If we have failed to find any space for our mapping to go, an old key/value mapping is evicted to make way for our new key/value pair. This is where there reference to cuckoos comes from (cuckoo chicks eject their foster siblings from the nest).

The old key/value pair is put into the stash I mentioned earlier. This is a fixed-size area of memory. Only when this is exhausted do the arrays resize.

Incidentally, because the stash is fixed-size, this algorithm qualifies as an in-place algorithm.

Running tests using both Kryo's ObjectMap and Java's HashMap each initialized with space for 10 000 slots and running the experiment 10 000 times gave these results (all times in milliseconds).

Kryo ObjectMap Java HashMap
Number of Elements Put Get Remove Put Get Remove
2000 636 96 126 379 295 353
4000 1372 401 704 716 569 709
8000 3384 1763 1628 2184 1269 1585

True to their word, the author appears to have implemented a very fast map when it come to get and remove. The only caveat is that as the data structure approaches any where near capacity, it slows down.

As an aside, the stash is searched using a technique called "linear probing" or "open addressing". Here "we just add the key to an [array] index but in linear probing, we put it in position i if that's free and if not look at position i+1, i+2... It works well if the size of the array is significantly bigger than the number of keys" [1].

[1] Prof Bob Sedgewick, Coursera.

The Racy Single-Check Idiom

I saw this idiom in some code and looked up what exactly it meant. From Joshua Bloch's Effective Java:

"If you don't care whether every thread recalculates the value of a field, and the type of the field is a primitive other than long or double, then you may choose to remove the volatile modifier from the field declaration in the single-check idiom [calculating the value of a volatile field if it's null]. This variant is known as the racy single-check idiom. It speeds up field access on some architectures, at the expense of additional initializations (up to one per thread that accesses the field). This is definitely an exotic technique, not for everyday use. It is, however, used by String instances to cache their hash codes."

So, it's a race condition that doesn't matter to the results but may increase performance.

Be warned: Jeremy Manson points out a gotcha with this idiom on his blog. The trick, he points out, is that you must read the non-volatile (but shared) reference only once. If you need to read it more than once, you must act upon a reference that points to that shared memory (that act of using a temporary variable meets the reading-the-reference-only-once condition. You can, however, use the temporary variable as often as you like).

The reason this is a problem is that multiple reads may be re-ordered as is the wont in the world of the Java Memory Model. This re-ordering can cause incorrect values to be calculated.

The use of a temporary variable also makes the code (slightly) more efficient as there are fewer main memory reads.

It is used in many places in the Java source code's collection classes (a quick peek produced ConcurrentHashMap.values(), ConcurrentSkipListMap.keySet() etc) where the authors have avoided making fields volatile. Typically, you see code like:


    public NavigableSet navigableKeySet() {
        KeySet ks = keySet;
        return (ks != null) ? ks : (keySet = new KeySet(this));
    }


and you may wonder why they used a temporary variable. It seems harmless but pointless. Really, they are avoiding Manson's gotcha.

Manson describes the pathological case (that is, when not using a temporary variable) as unlikely to happen. Indeed, I couldn't get it to happen on my computer when I tried to deliberately cause it. But the problem is very real. He should know, he wrote the JMM spec.


Saturday, May 4, 2013

Double Trouble

I'm writing a maths library in my spare time for fun. I've only got around to matrices but already I am seeing interesting behaviour.

Primarily, my library needs to be fast so I use primitives. My first matrix implementation uses doubles and I'm testing performance by calculating cross products.

I'm using the Caliper microbenchmarking framework to performance test my code with methods that look a little like:

    public int timeMutable2DDoubleMatrixCross(int reps) {

        Mutable2DDoubleMatrix accumulator = null;
        for (int i = 0 ; i < reps ; i++) {
            accumulator = mutable2DDoubleMatrix.cross(mutable2DDoubleMatrix);
        }
        return accumulator == null ? 0 : accumulator.getWidth();
    }

On my machine*, I get a result of  1.299ms (standard deviation of 0.011ms) to run an iteration of my code.

But you know when I wrote other implementations for other primitive types, there was an awful lot of copy, paste and make a minor change going on. Bearing in mind the usual autoboxing caveats I've mentioned before, what if I just changed the interface to use Double but the implementation and the call site were using primitive doubles? Maybe then I could make the code more generic with no loss of performance.

No joy - we still get the overhead of autoboxing. Average call time is now 4.041ms (standard deviation of 0.040ms). The JVM is not going to be fooled.

Ah, well. I was half expecting that and it's still better than using BigDecimal (that was taking 68.354ms per call.)

But what was really interesting is that double was faster than long (typically 1.403ms per call, 0.013ms std dev). That blew out of the water my plan to use inflated longs and then right rotate them to give approximate answers, echoing Kronecker's assertion that "God made natural numbers; all else is the work of man".


This might be because floating point calculations are passed to the Floating Point Unit. "In most computers, floating point arithmetic is usually much slower than integer arithmetic, though on the Intel Pentium it is usually faster because the integer unit was not given the same care as the floating point unit." [1]

As ever, your mileage may vary (YMMV).

* A 1.8 GHz Intel Core i7 running Mac OS X 10.7.5 and using java version 1.6.0_41.

[1] http://mindprod.com/jgloss/floatingpoint.html

Additional reading:
 - interesting read on the history of floating points (http://www.intel.com/standards/floatingpoint.pdf).
 - stats for the i7 processor (http://elrond.informatik.tu-freiberg.de/papers/WorldComp2012/PDP2833.pdf)


Wednesday, April 17, 2013

Going Atomic with Locks

You might have read about non-blocking algorithms, how the synchronized keyword is so 2012 and how things are much faster if you use atomics. Then, you might have written your own atomic-based locks and discovered that things are no better.

I certainly have. But thanks to this fine book I've learned not only to make them faster but why they're faster.

Test-and-set Locks

Let's take a typical Java lock:


    private ReentrantLock lock = new ReentrantLock();

    @Override
    public void unlock() {
        lock.unlock();
    }

    @Override
    public void lock() {
        lock.lock();
    }

and let's have 100 threads trying to attain the lock to increment an int 50000 time. Naively, I thought this would be efficient:



    private final AtomicBoolean state = new AtomicBoolean(false);
    
    public void lock() {
        while (state.getAndSet(true)) { }
    }


    public void unlock() {
        state.set(false);
    }


It was an order of magnitude slower. The questions are: why is it much slower and how do we improve it?


Herlihy and Shavit [1] call this a TASLock (a test-and-set lock). They describe why it's so slow:

"For simplicity, we consider a typical multiprocessor architecture in which processors communicate by a shared broadcast medium called a bus (like a tiny Ethernet). Both the processors and the memory controller can broadcast on the bus, but only one processor (or memory) can broadcast on the bus at the a time. All processors (and memory) can listen. Today, bus-based architecture are common because they are easy to build, although they scale poorly to large numbers of processors.


"Each processor has a cache, a small high-speed memory where the processor keeps data likely to be of interest. A memory access typically requires orders of magnitude more machine cycles than a cache access... When a processor reads from an address in memory, it first checks whether that address and its contents are present in the cache. If so, then the processor has a cache hit, and can load the value immediately. If not, the the processor has a cache miss, and must find the data either in the memory or in another processor's cache. The processor then broadcasts the address on the bus. The other processors snoop on the bus. If one processor has that address in its cache, then it responds by broadcasting the address and value. If no processor has that address, then the memory responds with the value at theat address.

"Each getAndSet() call is broadcast on the bus. Because all threads must use the bus to communicate with memory, these getAndSet() calls delay all threads, even those not waiting for the lock. Even worse, the getAndSet() call forces other processors to discard their own cached copies of the lock, so every spinning thread encounters a cache miss almost every time, and must use the bus to fetch the new, but unchanged value. Adding insult to injury, when the thread holding the lock tries to release it, it may be delayed because the bus in monopolized by the spinners." [1]

Using the Linux perf command shows the number of bus cycles and cache-misses:


[henryp@corsair Performance]$ perf stat -e cycles,bus-cycles,cache-misses,L1-dcache-load-misses,L1-dcache-store-misses,LLC-load-misses,LLC-store-misses  java -cp bin/ com.henryp.test.concurrent.spin.LockMain

 Performance counter stats for 'java -cp bin/ com.henryp.test.concurrent.spin.LockMain':

   564,815,071,755 cycles                    #    0.000 GHz                     [57.17%]
    16,620,454,218 bus-cycles                                                   [57.18%]
           832,399 cache-misses                                                 [57.19%]
       421,992,129 L1-dcache-misses                                             [57.17%]
        14,457,236 L1-dcache-misses                                             [57.14%]
           247,642 LLC-misses                                                   [57.11%]
           529,972 LLC-misses                                                   [57.14%]

      10.681204987 seconds time elapsed



where LLC is the Last Level Cache (L3 on my machine). Misses to this cache mean a call to RAM.

We'll compare these figures to the other strategies.

Test-and-test-and-set Locks

Herlihy and Shavit propose an improvement that looks something like this:


    private final AtomicBoolean state = new AtomicBoolean(false);
    
    @Override
    public void lock() {
        while (true) {
            while (state.get()) { };
            if (!state.getAndSet(true))
                return;
        }
    }    

    @Override
    public void unlock() {
        state.set(false);
    }


It is somewhat better but not by much (about 25%). The authors call this a TTASLock (test-and-test-and-set lock)

"The first time thread B reads the lock it takes a cache miss, forcing B to block while the value is loaded into B's cache. As long as A holds the lock, B repeatedly rereads the value, but hits in the cache every time. B thus produces no bus traffic, and does not slow down other threads' memory accesses. Moreover, a thread that releases a lock is not delayed by threads spinning on that lock.

"The situation deteriorates, however, when the lock is released. The lock holder releases the lock by writing false to the lock variable, which immediately invalidates the spinners' cached copies. Each one takes a cache miss, rereads the new value, and they all (more-or-less simultaneously) call getAndSet() to acquire the lock. The first to succeed invalidates the others, who must then reread the value, causing a storm of bus traffic. Eventually, the threads settled down once again to local spinning.

"The notion of local spinning, where threads repeatedly reread cached values instead of repeatedly using the bus, is an important principle critical to the design of efficient spin locks." [1]

With this strategy, perf gives figures like:


   412,122,456,391 cycles                    #    0.000 GHz                     [57.19%]
    12,132,558,770 bus-cycles                                                   [57.15%]
           783,803 cache-misses                                                 [57.15%]
       128,832,596 L1-dcache-misses                                             [57.14%]
        11,947,461 L1-dcache-misses                                             [57.16%]
           256,900 LLC-misses                                                   [57.18%]
           560,395 LLC-misses                                                   [57.17%]

       7.813546393 seconds time elapsed


So, although the LLC figures are comparable, there is less bus activity and fewer L1 misses as Herlihy and Shavit suggested.

Back off!

The solution is to implement a back off strategy (since vying for a highly contended lock may be a waste of time). The code the authors suggest looks a lot like the TTASLock, something like this:


    private final AtomicBoolean state = new AtomicBoolean(false);
    
    @Override
    public void lock() {
        while (true) {
            while (state.get()) { backOff(); };
            if (!state.getAndSet(true))
                return;
        }
    }

    @Override
    public void unlock() {
        state.set(false);
    }

    

    private final Random random = new Random();

    public void backOff() throws InterruptedException {
        int delay = random.nextInt(limit);
        limit = Math.min(maxDelay, 2 * (limit == 0 ? 1 : limit));
        Thread.sleep(delay);
    }



This actually makes it faster than Java's built in locks. The results look something like this:

JavaReentrantLock took 459 ms
TTASLockWithBackOff took 321 ms
TTASLock took 8932 ms
TASLock took 12600 ms

Repeatedly running the tests indicate that TTASLock with a backoff is the fastest by a margin of about 25%.

Perf gives figures for this strategy like:

       900,772,793 cycles                    #    0.000 GHz                     [62.65%]
        56,529,667 bus-cycles                                                   [64.05%]
           468,664 cache-misses                                                 [62.47%]
         9,728,368 L1-dcache-misses                                             [60.17%]
         6,039,764 L1-dcache-misses                                             [57.58%]
           173,197 LLC-misses                                                   [57.86%]
           573,249 LLC-misses                                                   [60.75%]
       0.465015535 seconds time elapsed

Now the differences are orders of magnitude. 

Quick Addendum

I initially tried to use Valgrind to analyse what was going on but saw very little difference between the figures (about 4% fewer cache-misses between the best and worst locks). But reading the documentation explains why:



"Valgrind serialises execution so that only one (kernel) thread is running at a time. This approach avoids the horrible implementation problems of implementing a truly multithreaded version of Valgrind, but it does mean that threaded apps never use more than one CPU simultaneously, even if you have a multiprocessor or multicore machine." [2]


[1] The Art of Multiprocessor Programming.

[2] http://valgrind.org/docs/manual/manual-core.html

Tuesday, April 9, 2013

Dumpster Diving in the JVM

If you're monitoring your application, you might notice the maximum amount of heap memory changing with time. How can this be? Surely the maximum is, well, the maximum and not a moving target?

The JavaDocs for the MemoryUsage defines the maximums as:

"the maximum amount of memory (in bytes) that can be used for memory management. Its value may be undefined. The maximum amount of memory may change over time if defined. The amount of used and committed memory will always be less than or equal to max if max is defined. A memory allocation may fail if it attempts to increase the used memory such that used > committed even if used <= max would still be true (for example, when the system is low on virtual memory)."

In Linux, it's easy to reserve more memory than you can use. For instance:


#include <sys/mman.h>
.
.
size_t sillyAmountOfMemory = 1024 * 1024 * 1024 * 1024; // over 1 terabyte!
printf("About to map %lld bytes\n", sillyAmountOfMemory);
addr = mmap(NULL, sillyAmountOfMemory, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
.
.

Meanwhile, in a shell:

[phenry@localhost MyMemoryTestsC]$ ps aux | head -1 ; ps aux | grep MyMemoryTest
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
phenry   11905  0.0  0.0 1050452  236 pts/0    S    13:43   0:00 /home/phenry/workspaceGanymedeC/MyMemoryTestsC/Debug/MyMemoryTests

Notice the VSZ value - that's the virtual size of our application:

[phenry@localhost MyMemoryTestsC]$ man ps | grep -A2 VSZ
.
.
       vsz         VSZ       virtual memory size of the process in KiB (1024-byte units). Device mappings are currently excluded; this is subject to change.


This is much more physical memory than I have on this machine:


[phenry@localhost MyMemoryTestsC]$ cat /proc/meminfo | head -1
MemTotal:        1026144 kB


Back to our Java process. Depending on the Garbage Collector being used, the maximum heap size may change to optimize performance. Oracle says:

"The statistics such as average pause time kept by the collector are updated at the end of each collection. The tests to determine if the goals have been met are then made and any needed adjustments to the size of a generation is made. The exception is that explicit garbage collections (e.g., calls to System.gc()) are ignored in terms of keeping statistics and making adjustments to the sizes of generations...

"If the maximum pause time goal is not being met, the size of only one generation is shrunk at a time. If the pause times of both generations are above the goal, the size of the generation with the larger pause time is shrunk first.

"If the throughput goal is not being met, the sizes of both generations are increased."

On a beefier machine, I can see the Parallel Scavenging collector being used. Oracle says of this type of collector:

"The parallel scavenge collector is similar to the parallel copying collector, and collects young generation garbage. The collector is targeted towards large young generation heaps and to scale with more CPUs. It works very well with large young generation heap sizes that are in gigabytes, like 12GB to 80GB or more, and scales very well with increase in CPUs, 8 CPUs or more. It is designed to maximize throughput in enterprise environments where plenty of memory and processing power is available.

"The parallel scavenge collector is again stop-the-world, and is designed to keep the pause down. The degree of parallelism can again be controlled. In addition, the collector has an adaptive tuning policy that can be turned on to optimize the collection. It balances the heap layout by resizing, Eden, Survivor spaces and old generation sizes to minimize the time spent in the collection. Since the heap layout is different for this collector, with large young generations, and smaller older generations, a new feature called "promotion undo" prevents old generation out-of-memory exceptions by allowing the parallel collector to finish the young generation collection."

(As an aside, there have been enhancements in JDK 7:

"The Parallel Scavenger garbage collector has been extended to take advantage of machines with NUMA (Non Uniform Memory Access) architecture. Most modern computers are based on NUMA architecture, in which it takes a different amount of time to access different parts of memory. Typically, every processor in the system has a local memory that provides low access latency and high bandwidth, and remote memory that is considerably slower to access.

"In the Java HotSpot Virtual Machine, the NUMA-aware allocator has been implemented to take advantage of such systems and provide automatic memory placement optimizations for Java applications. The allocator controls the eden space of the young generation of the heap, where most of the new objects are created. The allocator divides the space into regions each of which is placed in the memory of a specific node. The allocator relies on a hypothesis that a thread that allocates the object will be the most likely to use the object. To ensure the fastest access to the new object, the allocator places it in the region local to the allocating thread. The regions can be dynamically resized to reflect the allocation rate of the application threads running on different nodes. That makes it possible to increase performance even of single-threaded applications. In addition, "from" and "to" survivor spaces of the young generation, the old generation, and the permanent generation have page interleaving turned on for them. This ensures that all threads have equal access latencies to these spaces on average." [1])

Now, we connect to our Java process from another Java process and examine the first's MBeans with something like this:


        HashMap map = new HashMap();
        JMXConnector c = JMXConnectorFactory.newJMXConnector(createConnectionURL(host, port), map);
        c.connect();
        Object o = c.getMBeanServerConnection().getAttribute(new ObjectName("java.lang:type=Memory"), "HeapMemoryUsage");
        CompositeData cd = (CompositeData) o;

        Object max = cd.get("max");
.
.
.


This is after we have changed the C++ code in hotspot/src/share/vm/services/management.cpp thus:

// Returns a java/lang/management/MemoryUsage object representing
// the memory usage for the heap or non-heap memory.
JVM_ENTRY(jobject, jmm_GetMemoryUsage(JNIEnv* env, jboolean heap))
.
.
.
        printf("PH: MemoryPool %s has %llu bytes\n", pool->name(), u.max_size());
.
.
        printf("PH: so the total is %llu bytes\n", total_max);


We see output from the JVM that looks like:

PH: MemoryPool PS Survivor Space has 65536 bytes
PH: PSGenerationPool::get_memory_usage: size 1431699456: 
PH: MemoryPool PS Old Gen has 14316601344 bytes
PH: so the total is 21473656832 bytes

when we hit it with the above MBean code. The JVM being monitored has the command line switch -Xmx20g.

Now, when this monitored process starts consuming large amounts of memory, our Java code that is monitoring it prints out:

Tue Apr 09 22:59:42 BST 2013: committed = 18403360768 (17550 mb, 17972032kb), max = 19088801792 (18204 mb, 18641408kb)
Tue Apr 09 22:59:43 BST 2013: committed = 20726546432 (19766 mb, 20240768kb), max = 20726546432 (19766 mb, 20240768kb)

(notice how the maximum changes) and the output from the JVM being monitored indicates a change:

PH: PSYoungGen::resize_spaces
.
.
.
PH: MemoryPool PS Survivor Space has 1179648 bytes
PH: PSGenerationPool::get_memory_usage: size 1431699456: 
PH: MemoryPool PS Old Gen has 14316601344 bytes
PH: so the total is 21473722368 bytes

So the maximum is only the maximum until the next one :-)