Thursday, May 30, 2013

Oracle Coherence Crib Sheet #1

I've just started working on a new project. The software handles some 39 million transactions per day and must store 3GB per day (it already stores 4TB). Essentially, it's a repository of trade and market data that must handle 8 million saves, 26 million loads and nearly a million queries every day.

One of the core technologies is Oracle's Coherence, an in-memory data grid. We run it across 8 different boxes with 5 instances on each (plus a contingency environment). Since I am new to Coherence, I've been taking notes that I am publishing here (with extensive references to Oracle Coherence 3.5 even though we're using 3.7).

Cache Topologies

Near Cache - "A near cache is a hybrid, two-tier caching topology that uses a combination of a local, size-limited cache in the front tier, and a partitioned cache in the back tier to achieve the best of both worlds: the zero-latency read access of a replicated cache and the linear scalability of a partitioned cache". Has invalidation strategies: none, present, all, auto.

Continuous Query Cache - "very similar to a near cache [but] ... populates its front cache based on a query as soon as it is created; ... registers a query-based listener with the back cache, which means that its contents change dynamically as the data in the back cache changes... Basically CQC allows you to have a live dynamic view of the filtered subset of data in a partitioned cache." There's a sketch of creating one after these definitions.

Replicated - "Simply replicates all data to all cluster nodes".

Optimistic - Like replicated only with no concurrency control.

Partitioned - "Uses a divide and conquer approach... [it] is reminiscent of a distributed hash map. Each node in the cluster becomes responsible for a subset of cache partitions (buckets), and the Partitioned Cache service uses an entry key to determine which partition (bucket) to store the cache entry in".

Local - totally contained within a particular cluster node.

Remote - Any out-of-process cache accessed by a Coherence*Extend client. All cache requests are sent to a Coherence proxy where they are delegated to one of the other Coherence cache types (Replicated, Optimistic, Partitioned).
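
To make the near cache and continuous query cache ideas concrete, here is a minimal sketch of grabbing a NamedCache and wrapping it in a CQC. The cache name ("trades") and the Trade class are made up for this example, and the actual topology behind the cache name lives in the cache configuration, not in the code.

    import java.io.Serializable;

    import com.tangosol.net.CacheFactory;
    import com.tangosol.net.NamedCache;
    import com.tangosol.net.cache.ContinuousQueryCache;
    import com.tangosol.util.filter.EqualsFilter;

    public class TopologyExample {

        // A hypothetical domain class; it must be serializable to live in the grid.
        public static class Trade implements Serializable {
            private final String symbol;
            private final int quantity;

            public Trade(String symbol, int quantity) {
                this.symbol = symbol;
                this.quantity = quantity;
            }

            public String getSymbol() { return symbol; }
            public int getQuantity() { return quantity; }
        }

        public static void main(String[] args) {
            // The topology (near, replicated, partitioned...) behind the name "trades"
            // is defined in the cache configuration, not here.
            NamedCache trades = CacheFactory.getCache("trades");
            trades.put(1L, new Trade("VOD.L", 100));

            // A continuous query cache: a live, local view of the subset of the back
            // cache matching the filter, kept up to date as the back cache changes.
            ContinuousQueryCache vodafoneTrades =
                    new ContinuousQueryCache(trades, new EqualsFilter("getSymbol", "VOD.L"));

            System.out.println("Matching trades: " + vodafoneTrades.size());
            CacheFactory.shutdown();
        }
    }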

Backing Maps

These are "where cache data within the cluster is actually stored."
  • Local cache - "a backing map for replicated and partitioned caches and as a front cache for near and continuous query caches. It stores all the data on the heap, which means that it provides by far the fastest access speed, both read and write compared to other backing map implementations."
  • External backing map - "allows you to store cache items off-heap, thus allowing far greater storage capacity, at the cost of somewhat-to-significantly worse performance. There are several pluggable storage strategies... These strategies are implemented as storage managers." These include NIO Memory Manager, NIO File Manager, Berkeley DB Store Manager and "a wrapper storage manager that allows you to make write operations asynchronous for any of the store managers listed earlier".
  • Paged external backing map - "very similar to the external backing map... The big difference between the two is that a paged external backing map uses paging to optimize LRU eviction".
  • Overflow backing map - "a composite backing map with two tiers: a fast, size-limited, in-memory front tier, and a slower, but potentially much larger back tier on a disk."
  • Read-write backing map - "another composite backing map implementation... the read-write backing map has a single internal cache (usually a local cache) and either a cache loader, which allows it to load data from the external data source on cache misses, or a cache store, which also provides the ability to update data in the external data store on cache puts."
  • Partitioned backing map - "contains one backing map instance for each cache partition, which allows you to scale the cache size simply by increasing the number of partitions for a given cache".

Some important classes and interfaces

PortableObject - "The PortableObject interface is implemented by Java classes that can self-serialize and deserialize their state to and from a POF data stream."
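
As a minimal sketch (not taken from our codebase), here is that hypothetical Trade class again, this time self-serializing via POF. The POF indexes are illustrative and would have to match the type's registration in the POF configuration.

    import java.io.IOException;

    import com.tangosol.io.pof.PofReader;
    import com.tangosol.io.pof.PofWriter;
    import com.tangosol.io.pof.PortableObject;

    public class Trade implements PortableObject {

        private String symbol;
        private int quantity;

        public Trade() {
            // POF deserialization needs a public no-argument constructor.
        }

        public Trade(String symbol, int quantity) {
            this.symbol = symbol;
            this.quantity = quantity;
        }

        public String getSymbol() { return symbol; }
        public int getQuantity() { return quantity; }

        public void readExternal(PofReader in) throws IOException {
            symbol = in.readString(0);     // indexes must match writeExternal...
            quantity = in.readInt(1);
        }

        public void writeExternal(PofWriter out) throws IOException {
            out.writeString(0, symbol);    // ...and the POF configuration
            out.writeInt(1, quantity);
        }
    }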

NamedCache - "A NamedCache is a Map that holds resources shared among members of a cluster. These resources are expected to be managed in memory, and are typically composed of data that are also stored persistently in a database, or data that have been assembled or calculated at some significant cost, thus these resources are referred to as cached."

InvocableMap - "An InvocableMap is a Map against which both entry-targeted processing and aggregating operations can be invoked. While a traditional model for working with a Map is to have an operation access and mutate the Map directly through its API, the InvocableMap allows that model of operation to be inverted such that the operations against the Map contents are executed by (and thus within the localized context of) a Map. This is particularly useful in a distributed environment, because it enables the processing to be moved to the location at which the entries-to-be-processed are being managed, thus providing efficiency by localization of processing."
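
Here is a hedged sketch of that inversion: a hypothetical entry processor that doubles the quantity of matching trades where they live in the grid, rather than dragging them back to the client, mutating them and putting them back. The Trade class and cache name are the made-up ones from the sketches above; in a POF-enabled cluster the processor would implement PortableObject rather than plain Serializable.

    import java.io.Serializable;

    import com.tangosol.net.CacheFactory;
    import com.tangosol.net.NamedCache;
    import com.tangosol.util.InvocableMap;
    import com.tangosol.util.filter.EqualsFilter;
    import com.tangosol.util.processor.AbstractProcessor;

    public class DoubleQuantityProcessor extends AbstractProcessor implements Serializable {

        public Object process(InvocableMap.Entry entry) {
            Trade trade = (Trade) entry.getValue();
            // Replace the value in place, on the member that owns this entry.
            entry.setValue(new Trade(trade.getSymbol(), trade.getQuantity() * 2));
            return trade.getQuantity();   // whatever is returned goes back to the caller
        }

        public static void main(String[] args) {
            NamedCache trades = CacheFactory.getCache("trades");
            // The processing runs on the members that own the matching entries,
            // not on the node making this call.
            trades.invokeAll(new EqualsFilter("getSymbol", "VOD.L"), new DoubleQuantityProcessor());
            CacheFactory.shutdown();
        }
    }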

A recurring message in the JavaDocs:

Note: When using the Coherence Enterprise Edition or Grid Edition, the Partitioned Cache implements the InvocableMap interface by partitioning and localizing the invocations, resulting in extremely high throughput and low latency. When using Coherence Standard Edition, the InvocableMap processes the invocations on the originating node, typically resulting in higher network, memory and CPU utilization, which translates to lower performance, and particularly when processing large data sets.


Tuesday, May 21, 2013

Atomics and Speeds

Just to compare strategies for writing to arrays, I put together some code where one thread writes a non-zero value to an array while another reads through the array, spinning on any elements that are still zero. There is no locking, so this should be fast.

The five different approaches are:

1. Writing to a simple int[] array.
2. Writing to a simple int[] array but occasionally reading/writing a volatile field
3. Writing to a simple int[] array but always reading/writing a volatile field
4. Using AtomicIntegerArray.set(...)
5. Using AtomicIntegerArray.lazySet(...)

The test:

1. Starts the producer thread
2. Starts the consumer thread
3. Joins on the producer thread
4. Joins on the consumer thread but times out after 10 seconds. If the reader thread has not finished by then, we regard the test as having failed.
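
For reference, here is a rough sketch (not the original benchmark code) of the harness for the "occasional volatile read/write" approach; the class name, array size and flush interval are all made up.

    import java.util.concurrent.TimeUnit;

    public class SpinTest {

        private static final int SIZE = 10000000;
        private static final int FLUSH_INTERVAL = 1024;   // how often to touch the volatile; a tuning knob

        private final int[] array = new int[SIZE];
        private volatile int dirty;                       // exists purely for its memory semantics

        private final Thread producer = new Thread(new Runnable() {
            public void run() {
                for (int i = 0; i < SIZE; i++) {
                    array[i] = i + 1;                     // any non-zero value
                    if (i % FLUSH_INTERVAL == 0) {
                        dirty = i;                        // the volatile write publishes everything before it
                    }
                }
                dirty = SIZE;                             // final publish
            }
        });

        private final Thread consumer = new Thread(new Runnable() {
            public void run() {
                for (int i = 0; i < SIZE; i++) {
                    while (array[i] == 0 && dirty >= 0) {
                        // Spin. The volatile read of 'dirty' pairs with the producer's
                        // volatile write, so the plain writes to the array become visible.
                    }
                }
            }
        });

        public void runTest() throws InterruptedException {
            long start = System.nanoTime();
            consumer.setDaemon(true);                     // let the JVM exit if the reader never finishes
            producer.start();
            consumer.start();
            producer.join();
            consumer.join(TimeUnit.SECONDS.toMillis(10)); // give up after 10 seconds
            if (consumer.isAlive()) {
                throw new IllegalStateException("The reader never saw all the writes - test failed");
            }
            System.out.println("Took " + TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start) + "ms");
        }

        public static void main(String[] args) throws InterruptedException {
            new SpinTest().runTest();
        }
    }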

The same framework code is used for all approaches. After a few runs, the results look like:


Approach                                                   Mean (seconds)   Standard Deviation (seconds)
AtomicIntegerArray.set(...)                                    131.2              3.6
Normal Array Access With Constant Volatile Read/Write           96.6              5.1
AtomicIntegerArray.lazySet(...)                                 42.3              6.0
Normal Array Access                                             37.4              4.7
Normal Array Access With Occasional Volatile Read/Write          6.3              3.8

So, we can quickly see that the set(...) method is the slowest way of writing to an array on my machine.

Normal array access looks reasonably fast but it is wrong! Occasionally, you will see the timeout in test step #4 being triggered, as there is no guarantee (in the absence of memory semantics) that the reader thread will ever see what the writer thread puts in the array.

The winner by far is the strategy of reading and writing the array as normal but having the writer thread occasionally write to a volatile field and the reader thread occasionally read from it. How often is a matter of tuning and depends on your needs.

So, in summary:

  1. Avoid having multiple threads read and write an array with no further memory semantics (making the reference to the array volatile is not sufficient). You might find that one thread never sees the data another thread has put into the array.
  2. If you want the fastest exchange of data and can tolerate it being somewhat stale, use memory semantics only occasionally.


Sunday, May 19, 2013

More Atomics

This week, I went to Nick Zeeb's talk at Skills Matter where I pestered him about atomics.

He was talking about the latest technology from those clever boys and girls down at LMAX, the Coalescing Ring Buffer.

[First, one interesting aside: the mod operator (%) is expensive. Therefore, when you can, choose a value to mod against that is a power of 2 (let's call this value X). When it comes to mod-ing a variable against it, AND that variable with the value X-1. This is much faster.]
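
A tiny, hypothetical sanity check of that trick (the class name is made up): for non-negative values, AND-ing with X-1 gives the same answer as % X whenever X is a power of two.

    public class ModVersusMask {
        public static void main(String[] args) {
            final int size = 1024;                 // a power of two
            for (int i = 0; i < 100000; i++) {
                int slowIndex = i % size;          // the (relatively expensive) remainder operator
                int fastIndex = i & (size - 1);    // a single AND
                if (slowIndex != fastIndex) {
                    throw new AssertionError("the trick only holds for powers of two");
                }
            }
            System.out.println("i % size == i & (size - 1) for all non-negative i");
        }
    }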

My interest in the system was piqued by their use of AtomicReferenceArray, a class that is not much discussed yet is pretty critical to some multithreaded apps. He confirmed my experience that this class is slow and you should avoid it unless you really have to.

How to increase the efficiency of AtomicReferenceArray

AtomicReferenceArray is slow as each call to set flushes the cache. Nick recommended the method lazySet instead. The JavaDocs merely say this method "eventually sets the element at position i to the given value." Nick expanded on this and said that lazySet offers a write to memory in a reasonable time, but you wouldn't want to use it if the value were critical and had to be written immediately (the Coalescing Ring Buffer uses it for the non-critical purging of values).

What's more, lazySet offers ordering semantics (i.e. everything visible to the writing thread before the call to lazySet is visible to other threads once they have seen that write).
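
A trivial sketch of the two calls side by side (not code from the Coalescing Ring Buffer):

    import java.util.concurrent.atomic.AtomicIntegerArray;

    public class LazySetExample {
        public static void main(String[] args) {
            AtomicIntegerArray slots = new AtomicIntegerArray(8);
            slots.set(0, 42);       // a full volatile write: visible to other threads immediately
            slots.lazySet(1, 43);   // an ordered store: cheaper, guaranteed visible only "eventually"
            System.out.println(slots.get(0) + ", " + slots.get(1));
        }
    }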

Moran Tzafrir of Tel Aviv University points out another interesting optimization that would avoid atomics altogether:

"When you do code optimization, you might remove the “volatile” keyword in very specific situation. You have global variables G1…Gn, and always you update G1…Gn together. And Gn is updated last.  In this case you might consider “volatile” only to Gn.  (Note this does not guaranty atomic/transactional update to G1…Gn, it just guaranty that after you write to Gn all preceding updates are visible to the other threads)... Gn should be read first. (On the read scenario)" [1]

Basically, if you batch updates, you would not have to use a relatively expensive AtomicReferenceArray.set(...) every time, but just make the variable written last and read first volatile.
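
A minimal sketch of that pattern with made-up names, assuming a single writer thread: the plain fields are written before a single volatile field and read after it, which gives visibility but, as the quote says, not atomicity.

    public class BatchedPublication {

        private long bid;              // G1 - a plain field
        private long ask;              // G2 - a plain field
        private volatile long version; // Gn - written last, read first

        // Called by the single writer thread.
        public void update(long newBid, long newAsk) {
            bid = newBid;              // plain writes first...
            ask = newAsk;
            version = version + 1;     // ...then the volatile write publishes them
        }

        // Called by reader threads.
        public long[] read() {
            long v = version;          // volatile read first...
            // ...then the plain reads see at least the writes made before that version was published
            return new long[] { v, bid, ask };
        }
    }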

Linux-specific testing

Nick also emphasised the need to test your suppositions. He mentions these commands and JVM parameters in his slides, but I'll put them here for easy access:

Setting the frequency of your CPU:

sudo cpufreq-set -c 0 -f 1700000

Run the tests with:

sudo nice -n -20 taskset -c 1,3 java -XX:+PrintCompilation -XX:+PrintGC -server -XX:CompileThreshold=100000 -Xmx1g -Xms1g ...

[1] https://groups.google.com/forum/?fromgroups=#!searchin/art-of-multiprocessor-programming/moran$20tzafrir/art-of-multiprocessor-programming/3y20qh7ooK0/z_lmxMVqk48J


Monday, May 6, 2013

Cuckoo Hashing

I was taking a look at a library called Kryo that is being used on a project I'm working on. It promises to be a much faster way of de/serializing objects.

Trying to discover its secrets, I came across a class called com.esotericsoftware.kryo.util.ObjectMap. It's an implementation of a map that claims to be very fast for get, containsKey and remove, although it may be slower for put.

The way it does this is by using a technique called cuckoo hashing. Here, we have an array of keys and an array of values. Upon insertion, we calculate a hash for our key and see if the corresponding slot is free. If it is, we insert the value and we're done. If not, we repeat this process up to N more times, using a different hash function each time. Finally, failing this, we check a fixed-size area of memory called the stash. I'll return to this in a moment.

If we have failed to find any space for our mapping to go, an old key/value mapping is evicted to make way for our new key/value pair. This is where the reference to cuckoos comes from (cuckoo chicks eject their foster siblings from the nest).

The old key/value pair is put into the stash I mentioned earlier. This is a fixed-size area of memory. Only when this is exhausted do the arrays resize.

Incidentally, because the stash is fixed-size, this algorithm qualifies as an in-place algorithm.
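
For illustration only, here is a simplified sketch of cuckoo-style insertion and lookup with a fixed-size stash. It uses the classic two-table, two-hash variant rather than Kryo's single-array, multi-hash ObjectMap, and all of the names and sizes are made up.

    import java.util.ArrayList;
    import java.util.List;

    public class CuckooSketch<K, V> {

        private static final int CAPACITY = 16;        // a power of two, so we can index with a mask
        private static final int MAX_EVICTIONS = 8;    // bound on the eviction chain
        private static final int STASH_CAPACITY = 4;   // a full stash would force a resize in a real map

        private final Object[] keys1 = new Object[CAPACITY];
        private final Object[] values1 = new Object[CAPACITY];
        private final Object[] keys2 = new Object[CAPACITY];
        private final Object[] values2 = new Object[CAPACITY];
        private final List<Object[]> stash = new ArrayList<Object[]>();

        private int h1(Object key) { return key.hashCode() & (CAPACITY - 1); }
        private int h2(Object key) { return (key.hashCode() >>> 16) & (CAPACITY - 1); }

        public void put(K key, V value) {
            Object k = key, v = value;
            for (int i = 0; i < MAX_EVICTIONS; i++) {
                int i1 = h1(k);
                if (keys1[i1] == null || keys1[i1].equals(k)) {    // first slot free (or same key): done
                    keys1[i1] = k; values1[i1] = v;
                    return;
                }
                int i2 = h2(k);
                if (keys2[i2] == null || keys2[i2].equals(k)) {    // second slot free (or same key): done
                    keys2[i2] = k; values2[i2] = v;
                    return;
                }
                // Both slots taken: evict the occupant of the first slot (the cuckoo step)
                // and go around again, trying to re-home the evicted pair.
                Object evictedKey = keys1[i1], evictedValue = values1[i1];
                keys1[i1] = k; values1[i1] = v;
                k = evictedKey; v = evictedValue;
            }
            // The eviction chain went on too long: park the homeless pair in the stash.
            if (stash.size() < STASH_CAPACITY) {
                stash.add(new Object[] { k, v });
            } else {
                throw new IllegalStateException("stash full - a real implementation would resize here");
            }
        }

        @SuppressWarnings("unchecked")
        public V get(K key) {
            int i1 = h1(key);
            if (key.equals(keys1[i1])) return (V) values1[i1];
            int i2 = h2(key);
            if (key.equals(keys2[i2])) return (V) values2[i2];
            for (Object[] entry : stash) {                  // a linear scan of the small, fixed-size stash
                if (key.equals(entry[0])) return (V) entry[1];
            }
            return null;                                    // at most two probes plus the stash
        }
    }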

Running tests using both Kryo's ObjectMap and Java's HashMap, each initialized with space for 10 000 slots, and repeating the experiment 10 000 times gave these results (all times in milliseconds):

                     Kryo ObjectMap               Java HashMap
Number of Elements   Put     Get     Remove       Put     Get     Remove
 2000                 636      96       126        379     295      353
 4000                1372     401       704        716     569      709
 8000                3384    1763      1628       2184    1269     1585

True to their word, the authors appear to have implemented a very fast map when it comes to get and remove. The only caveat is that as the data structure gets anywhere near capacity, it slows down.

As an aside, the stash is searched using a technique called "linear probing" or "open addressing". Here "we just add the key to an [array] index but in linear probing, we put it in position i if that's free and if not look at position i+1, i+2... It works well if the size of the array is significantly bigger than the number of keys" [1].

[1] Prof Bob Sedgewick, Coursera.
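
For illustration (not Kryo's stash-search code), a minimal sketch of that linear probing lookup over parallel key/value arrays; the arrays are assumed to be sized to a power of two and larger than the number of stored entries.

    public class LinearProbe {

        static Object get(Object[] keys, Object[] values, Object key) {
            int mask = keys.length - 1;
            int i = key.hashCode() & mask;      // start at the hashed position
            while (keys[i] != null) {           // an empty slot means the key is absent
                if (keys[i].equals(key)) {
                    return values[i];
                }
                i = (i + 1) & mask;             // otherwise try i+1, i+2, ... wrapping around
            }
            return null;
        }
    }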

The Racy Single-Check Idiom

I saw this idiom in some code and looked up what exactly it meant. From Joshua Bloch's Effective Java:

"If you don't care whether every thread recalculates the value of a field, and the type of the field is a primitive other than long or double, then you may choose to remove the volatile modifier from the field declaration in the single-check idiom [calculating the value of a volatile field if it's null]. This variant is known as the racy single-check idiom. It speeds up field access on some architectures, at the expense of additional initializations (up to one per thread that accesses the field). This is definitely an exotic technique, not for everyday use. It is, however, used by String instances to cache their hash codes."

So, it's a race condition that doesn't matter to the results but may increase performance.

Be warned: Jeremy Manson points out a gotcha with this idiom on his blog. The trick, he points out, is that you must read the non-volatile (but shared) field only once. If you need its value more than once, copy it into a local variable and act on that: the single copy satisfies the read-the-field-only-once condition, and you can then use the local variable as often as you like.

The reason this is a problem is that multiple reads may be re-ordered as is the wont in the world of the Java Memory Model. This re-ordering can cause incorrect values to be calculated.

The use of a temporary variable also makes the code (slightly) more efficient as there are fewer main memory reads.

It is used in many places in the Java source code's collection classes (a quick peek produced ConcurrentHashMap.values(), ConcurrentSkipListMap.keySet() etc) where the authors have avoided making fields volatile. Typically, you see code like:


    public NavigableSet navigableKeySet() {
        KeySet ks = keySet;   // read the shared, non-volatile field exactly once into a local
        return (ks != null) ? ks : (keySet = new KeySet(this));
    }


and you may wonder why they used a temporary variable. It seems harmless but pointless. Really, they are avoiding Manson's gotcha.

Manson describes the pathological case (that is, when not using a temporary variable) as unlikely to happen. Indeed, I couldn't get it to happen on my computer when I tried to deliberately cause it. But the problem is very real. He should know; he helped write the JMM spec.


Saturday, May 4, 2013

Double Trouble

I'm writing a maths library in my spare time for fun. I've only got around to matrices but already I am seeing interesting behaviour.

Primarily, my library needs to be fast so I use primitives. My first matrix implementation uses doubles and I'm testing performance by calculating cross products.

I'm using the Caliper microbenchmarking framework to performance test my code with methods that look a little like:

    public int timeMutable2DDoubleMatrixCross(int reps) {
        Mutable2DDoubleMatrix accumulator = null;
        for (int i = 0; i < reps; i++) {
            accumulator = mutable2DDoubleMatrix.cross(mutable2DDoubleMatrix);
        }
        // Return something derived from the result so the JIT cannot eliminate the loop.
        return accumulator == null ? 0 : accumulator.getWidth();
    }

On my machine*, I get a result of 1.299ms (standard deviation of 0.011ms) to run an iteration of my code.

But when I wrote implementations for the other primitive types, there was an awful lot of copy, paste and make-a-minor-change going on. Bearing in mind the usual autoboxing caveats I've mentioned before, what if I just changed the interface to use Double while the implementation and the call site kept using primitive doubles? Maybe then I could make the code more generic with no loss of performance.

No joy - we still get the overhead of autoboxing. Average call time is now 4.041ms (standard deviation of 0.040ms). The JVM is not going to be fooled.
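
A cut-down, hypothetical illustration of why (the names below are made up): the interface traffics in Double, so every value returned through it is boxed, even though the implementation stores primitive doubles.

    interface Matrix {
        Double get(int row, int col);           // boxed on every call through the interface
        int getWidth();
    }

    class PrimitiveBackedMatrix implements Matrix {
        private final double[] elements;        // primitives inside the implementation
        private final int width;

        PrimitiveBackedMatrix(int width, int height) {
            this.width = width;
            this.elements = new double[width * height];
        }

        public int getWidth() {
            return width;
        }

        public Double get(int row, int col) {
            return elements[row * width + col]; // autoboxing happens here: double -> Double
        }
    }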

Ah, well. I was half expecting that, and it's still better than using BigDecimal (which was taking 68.354ms per call).

But what was really interesting is that double was faster than long (typically 1.403ms per call, 0.013ms std dev). That blew my plan out of the water: I had intended to use inflated longs and then right-shift them to give approximate answers, echoing Kronecker's assertion that "God made natural numbers; all else is the work of man".


This might be because floating point calculations are passed to the Floating Point Unit. "In most computers, floating point arithmetic is usually much slower than integer arithmetic, though on the Intel Pentium it is usually faster because the integer unit was not given the same care as the floating point unit." [1]

As ever, your mileage may vary (YMMV).

* A 1.8 GHz Intel Core i7 running Mac OS X 10.7.5 and using java version 1.6.0_41.

[1] http://mindprod.com/jgloss/floatingpoint.html

Additional reading:
 - interesting read on the history of floating points (http://www.intel.com/standards/floatingpoint.pdf).
 - stats for the i7 processor (http://elrond.informatik.tu-freiberg.de/papers/WorldComp2012/PDP2833.pdf)