Saturday, December 21, 2013

FastLocks in the JVM - part 2

I was wondering why AtomicIntegerArrays etc. were slow or, to put it another way, why other approaches (notably using synchronized) are so fast. Now, I'm not a machine code guru nor a JVM coder, so please take some of this post with a pinch of salt. Here are my findings.

Let's write some Java code that plugs into the same simple framework I mentioned in the previous post by extending our AbstractReaderRunnable class, but this time let the reader access a plain old int[] array via a synchronized method (and similarly for the writer).
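Concretely, the subclass just delegates to synchronized get and set methods around a plain int[]. Something like this sketch (the class names here are illustrative rather than lifted from my actual test code, and the abstract base classes appear further down this post):

// Illustrative only: a shared holder whose synchronized methods guard a plain int[].
// ARRAY_SIZE is the constant shared by the reader and writer tests.
class SynchronizedIntArray {

    private final int[] array = new int[ARRAY_SIZE];

    public synchronized int get(int i) {
        return array[i];
    }

    public synchronized void set(int i, int value) {
        array[i] = value;
    }
}

// The reader subclass simply delegates to it (the writer's setElement() is the mirror image).
class SynchronizedReaderRunnable extends AbstractReaderRunnable {

    private final SynchronizedIntArray array;

    SynchronizedReaderRunnable(SynchronizedIntArray array) {
        this.array = array;
    }

    @Override
    protected int elementAtIndex() {
        return array.get(index);
    }
}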

There's nothing clever in the Java code, but this is how the machine code generated by HotSpot looks:

.
.
08a    andq    R10, #7 # long
08e    cmpq    R10, #5
092    jne     B20  P=0.000001 C=-1.000000
.
.

(where R10 points to memory in the stack).

Here, we're checking the 3 lowest-order bits (the #7 mask) of R10. I believe the most significant of these three bits indicates whether biased locking is enabled for an object of this class, and "as long as an object is unlocked, the last two bits have the value 01." [1] "These bits are commonly referred to as the object sync bits." [3]
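As a rough Java illustration (just a sketch of the check, not anything from the JVM source; markWord stands in for the header word held in R10):

// Sketch of the two-instruction check above.
static boolean hasBiasedUnlockedPattern(long markWord) {
    long syncBits = markWord & 0b111; // andq R10, #7 - keep the three low-order sync bits
    return syncBits == 0b101;         // cmpq R10, #5 - biased-locking bit set, lock bits 01 (unlocked)
}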

This analysis seems to be confirmed by Dave Dice, who wrote much of this, when he describes "the operation of stack-locking[:] The inlined enter code emitted by the JIT will first fetch and examine the mark word. (That code first checks to see if the object is biased)... If the mark word is neutral as determined by the low-order bits of the mark, we'll try to use stack locking. First, in anticipation of a successful compare-and-swap (CAS), the code will store the just-fetched mark value into the on-frame word that was allocated by the JIT and is associated with that particular bytecode offset. Next, the inline enter code will attempt to CAS the address of that on-frame location over the mark word. If successful, we've locked the object." [6]

So, we're checking whether the object allows biased locking and is currently unlocked. If it doesn't match that pattern (note that the branch is jne, jump-if-not-equal), we jump to:

1af   B20: # B7 B21 <- B5 B19  Freq: 0.351352
1af    leaq    R11, [rsp + #48] # box lock
1b4    fastlock RBP,R11,RAX,R10
205    je     B7  P=0.999999 C=-1.000000

Fastlock is not an x86 instruction but inlined JVM code that performs the locking [2]. I've not been able to find exactly what this refers to in the OpenJDK source code, so I had to resort to Googling.

"Synchronization on Java objects is done using a fast lock mechanism using light-weight lock records (referred to as fast locks) in most cases and only done using a real lock mechanism (referred to as inflated locks) when needed. The premise behind this implementation is that contention on Java locks are rare. Hence, there is no need to associate an inflated lock with the object until contention occurs. Using an inflated lock tends to be slower than just using a fast lock.

"When a thread attempts to lock the object, it uses atomic operations to check and set the sync bits as well as the header word. If the sync bits are CVM_LOCKSTATE_UNLOCKED, then there is no contention on this object yet. Within the same atomic operation, the header word will be replaced with a pointer to a fast lock record and the sync bits are set to CVM_LOCKSTATE_LOCKED ... If the same thread attempts to re-enter the lock on this object, it will find that the sync bits are already set to CVM_LOCKSTATE_LOCKED, and check to see if it is the owner of the fast lock. Since the current thread does own this fast lock, it simply increments the reentry count in the fast lock record and proceed with code execution." [3]

So, no expensive lock instructions in the uncontended case - unlike in AtomicIntegerArrays.
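To make that concrete, here's a toy model of the fast-lock path described in [3]. This is illustrative Java only, not JVM code; the first acquisition still costs a single CAS, but re-entry by the owning thread costs no atomics at all:

import java.util.concurrent.atomic.AtomicReference;

// A toy model of the uncontended path described in [3] - illustrative only, not JVM code.
class ToyFastLock {

    // Stands in for the "fast lock record" that lives on the owning thread's stack.
    static final class FastLockRecord {
        final Thread owner = Thread.currentThread();
        int reentryCount;
    }

    // Stands in for the header word: null plays the role of CVM_LOCKSTATE_UNLOCKED,
    // a FastLockRecord reference plays the role of CVM_LOCKSTATE_LOCKED.
    private final AtomicReference<FastLockRecord> header = new AtomicReference<>();

    // Returns false on contention, which is where the real JVM would inflate the lock.
    boolean fastLock() {
        FastLockRecord current = header.get();
        if (current == null) {
            // Uncontended: one CAS swings the header from "unlocked" to a record pointer.
            return header.compareAndSet(null, new FastLockRecord());
        }
        if (current.owner == Thread.currentThread()) {
            // Re-entry by the owner: no atomics at all, just bump the count.
            current.reentryCount++;
            return true;
        }
        return false; // another thread owns it: contention, so the real JVM inflates the lock
    }
}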

"If a different thread attempts to acquire the lock on this object, it will check and see that it is not the owner of the fast lock record. This is considered a contention case which will trigger the inflation of the lock." [3] "Deflation occurs at all stop-the-world safepoints" which occur frequently. There are 3 safepoints in the JITed read method alone.

In the inflated fast-path, "there's no need to block or wake threads (which requires 1 atomic for an enter-exit pair)." [6] ("The slow-path implementation is in native C++ code while the fast-path is emitted by the JITs.") The ObjectMonitor class (in objectMonitor.hpp) has a linked list of ObjectWaiter objects, each of which "serves as a 'proxy' or surrogate thread", and SpinCallbackArguments (in objectMonitor.cpp) that allow Adaptive Spinning [7], which uses a spin-then-block strategy based on measured success. This is currently the limit of my knowledge on how the JVM manages high throughput in this area.
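I don't know HotSpot's adaptive spinning internals, but the general shape of a spin-then-block acquire is something like this naive sketch (HotSpot additionally adapts the spin count per monitor based on recent success [7]):

import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.locks.LockSupport;

// A naive spin-then-block lock - only the general shape, nothing like HotSpot's actual code.
class SpinThenBlockLock {

    private final AtomicReference<Thread> owner = new AtomicReference<>();
    private static final int SPIN_LIMIT = 1000; // fixed here; HotSpot adapts this per monitor [7]

    void lock() {
        Thread me = Thread.currentThread();
        for (int i = 0; i < SPIN_LIMIT; i++) {
            if (owner.compareAndSet(null, me)) {
                return;                            // acquired while spinning: nobody was blocked or woken
            }
        }
        while (!owner.compareAndSet(null, me)) {
            LockSupport.parkNanos(1_000_000L);     // give up and block briefly - the expensive path
        }
    }

    void unlock() {
        owner.set(null);
    }
}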

So, why care?

Dice describes the JVM's locking as "considerably better (lower latency and better scalability) than native pthreads synchronization implementations." [6] So, why isn't it used everywhere?

Well, as ever, there is a trade-off. It's described as "optimized for system-wide throughput at the expense of short-term thread-specific fairness" [4] and "HotSpot favors throughput [and] favor[s] recently-run threads" [5].

We can see this quite easily. Given a writer class that looks a little like this:

public abstract class AbstractWriterRunnable implements Runnable {
    
    protected int index = 0;   // next element to write
    protected int lap = 1;     // non-zero value written by subclasses so the reader can spot progress
    private long duration;     // wall-clock time for the whole run

    @Override
    public void run() {
        long start = System.currentTimeMillis();
        while (index < ARRAY_SIZE) {   // ARRAY_SIZE is a constant shared with the reader
            setElementAndIncrementCounters();
        }
        duration = System.currentTimeMillis() - start;
    }

    private void setElementAndIncrementCounters() {
        setElement();
        index++;
    }
    
    protected abstract void setElement();
}

that runs in one thread and a reader class that looks like this:

public abstract class AbstractReaderRunnable implements Runnable, Detailed {
    
    protected int index = 0;            // next element to read
    protected int lap = 0;              // last value read at index
    private long duration;              // wall-clock time for the whole run
    private int doNotCompileMeAway;     // consumed so HotSpot can't eliminate the read loop

    @Override
    public void run() {
        long start = System.currentTimeMillis();
        while (index < ARRAY_SIZE) {
            lap = elementAtIndex();
            while (lap == 0) {          // spin until the writer has set this element
                lap = elementAtIndex();
                doNotCompileMeAway |= lap;
            }
            index++;
        }
        duration = System.currentTimeMillis() - start;
    }
    
    protected abstract int elementAtIndex();
.
.

that runs in another thread, we can implement setElement and elementAtIndex in subclasses, one pair to read and write to an array in synchronized methods and another to get and set on an AtomicIntegerArray.
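The AtomicIntegerArray pair, for instance, looks something like this (again, the class names are illustrative; the synchronized pair just delegates to the synchronized get/set methods sketched earlier):

import java.util.concurrent.atomic.AtomicIntegerArray;

// Illustrative subclasses for the AtomicIntegerArray variant.
class AtomicReaderRunnable extends AbstractReaderRunnable {

    private final AtomicIntegerArray array;

    AtomicReaderRunnable(AtomicIntegerArray array) {
        this.array = array;
    }

    @Override
    protected int elementAtIndex() {
        return array.get(index);   // volatile read
    }
}

class AtomicWriterRunnable extends AbstractWriterRunnable {

    private final AtomicIntegerArray array;

    AtomicWriterRunnable(AtomicIntegerArray array) {
        this.array = array;
    }

    @Override
    protected void setElement() {
        array.set(index, lap);     // volatile write
    }
}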

No Free Meal

The results are very interesting. After a first run whose results are thrown away (as ever) to let the JVM warm up, they consistently look something like this:

Synchronized Array Access
read thread took 9550ms, writer thread took 6516ms

AtomicIntegerArray Access
read thread took 74693ms, writer thread took 74693ms

Synchronized array access is consistently better across all my runs, but notice the difference between the times for the read and write threads - the synchronized reader finishes 3 seconds behind its writer, while the AtomicIntegerArray reader is hot on its writer's heels.

Presumably, this is an example of synchronized access sacrificing "short-term thread-specific fairness" for increased throughput.


[1] Synchronization and Object Locking.
[2] "Deep dive into assembly code from Java" - Kohsuke Kawaguchi's blog.
[3] CDC HotSpot Implementation.
[4] Synchronization in Java SE 6 (HotSpot).
[5] Synchronization.
[6] "Lets say you're interested in using HotSpot as a vehicle for synchronization research" - Dave Dice's blog.
[7] Java SE 6 Performance White Paper.

