Wednesday, October 22, 2014

A Lock for all Weathers?


Continuing with the lock comparison code here, I've modified it to test the timings of all the locks for different permutations of readers and writers and compared the results.

I've tested the different locks with 1 to 16 readers and 1 to 16 writers on a 16 core (with hyper-threading) E5-2687W chip. With the generated matrix, I used R to graph the results (code at the bottom of the post).

Note, duration is measured in milliseconds and is scaled by log 10.

Synchronized blocks

> showPlot("SYNCHRONIZED")



Perhaps surprisingly, with synchronized blocks, more writers mean the target number is reached more quickly. Reader threads (which are also synchronized) only make a huge difference to the overall duration when there are few writers. Then they can make a difference of an order of magnitude.
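For reference, the synchronized variant being timed amounts to something like this (a sketch with my own names, not the benchmark suite's exact code): writer threads call increment() until the target is reached while reader threads spin on get().

public class SynchronizedCounter {
    private long value;

    public synchronized void increment() {
        value++;              // writers serialize on the monitor
    }

    public synchronized long get() {
        return value;         // readers take the same monitor, so they contend with writers
    }
}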

JDK 8 Adders

> showPlot("ADDER")


As you might expect from the way adders work, the threads don't seem to interfere with each other. The more threads that write, the faster the target number is reached, irrespective of reader threads. But even when they're slow, they're fast (relative to other locks).

Atomics

> showPlot("ATOMIC")



Atomic times are all tightly within an order of magnitude of each other. Like synchronized, with a smaller number of writer threads, a smaller number of reader threads makes a difference. However, the effect is by no means as pronounced.
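The atomic variant amounts to something like this (again a sketch, not the suite's exact class):

import java.util.concurrent.atomic.AtomicLong;

public class AtomicCounter {
    private final AtomicLong value = new AtomicLong();

    public void increment() {
        value.incrementAndGet();   // a lock-free compare-and-swap loop under the hood
    }

    public long get() {
        return value.get();        // effectively a volatile read
    }
}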


Java Read-Write Locks

> showPlot("RWLOCK")



Read-write locks are also sensitive to the number of reader threads at low numbers of writer threads; it can make a difference of an order of magnitude. Best and worst times are also similar to synchronized blocks.
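For reference, the read-write variant looks roughly like this (a sketch; many readers may hold the read lock at once, but the write lock is exclusive):

import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ReadWriteLockCounter {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private long value;

    public void increment() {
        lock.writeLock().lock();    // exclusive: blocks both readers and writers
        try {
            value++;
        } finally {
            lock.writeLock().unlock();
        }
    }

    public long get() {
        lock.readLock().lock();     // shared: readers only block writers
        try {
            return value;
        } finally {
            lock.readLock().unlock();
        }
    }
}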


JDK 8 StampedLocks with an Optimistic Approach

> showPlot("OPTIMISTIC_CORRECT")


I've modified the original code to retry optimistic locks in the event the lock was not attained, which I think makes for a fairer comparison.

This strategy is relatively insensitive to the number of threads when compared to other locking techniques, but it's not the fastest.
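Roughly, the retrying optimistic read looks like this (a sketch of the strategy rather than the suite's exact code): the read costs nothing unless a writer intervened, in which case it simply tries again.

import java.util.concurrent.locks.StampedLock;

public class OptimisticStampedCounter {
    private final StampedLock lock = new StampedLock();
    private long value;

    public void increment() {
        long stamp = lock.writeLock();          // writers still take the exclusive lock
        try {
            value++;
        } finally {
            lock.unlockWrite(stamp);
        }
    }

    public long get() {
        long stamp;
        long result;
        do {
            stamp  = lock.tryOptimisticRead();  // no lock taken, just a stamp
            result = value;
        } while (!lock.validate(stamp));        // a writer got in; retry
        return result;
    }
}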

Volatile

> showPlot("VOLATILE")



NB: this is not to be trusted too much. Adding a value to a volatile long is not an atomic operation in Java, so this metric is somewhat misleading. However, if all you care about is speed, knock yourself out.

Having said that, this technique is fastest when there is only one writer - in this particular case, you'll get the correct result. Fastest of all is one reader, one writer. This is not that much slower than a dirty read/write.
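The volatile version is just this (a sketch), which is why the result is only reliable with a single writer:

public class VolatileCounter {
    private volatile long value;

    public void increment() {
        value++;        // NOT atomic: a read, an add and a write, so concurrent updates can be lost
    }

    public long get() {
        return value;   // a volatile read always sees the most recently written value
    }
}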

Dirty

> showPlot("DIRTY")


I've just added this for grins. As long as you don't mind dirty data, it's fine.

StampedLock

> showPlot("STAMPED")


StampedLock again, but this time using a pessimistic locking strategy. I'm actually missing some data at a low number of writer threads and a high number of reader threads, as the test decided it took too long and aborted! Suffice it to say, it's fast for anything but a permutation of a low number of writer threads with a large number of reader threads.
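The pessimistic variant always takes the read lock rather than validating an optimistic stamp; the write side is the same as before (again, a sketch rather than the suite's exact code):

import java.util.concurrent.locks.StampedLock;

public class PessimisticStampedCounter {
    private final StampedLock lock = new StampedLock();
    private long value;

    public void increment() {
        long stamp = lock.writeLock();
        try {
            value++;
        } finally {
            lock.unlockWrite(stamp);
        }
    }

    public long get() {
        long stamp = lock.readLock();   // blocks whenever a writer holds the lock
        try {
            return value;
        } finally {
            lock.unlockRead(stamp);
        }
    }
}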


R Scripting

These graphics were generated with R. The code is below for review.

The raw results are a CSV file that looks like:

Lock_type, Number_of_Readers, Number_of_Writers, Duration_in_ms

allLocksMatrix.R:

.libPaths("/home/henryp/R/x86_64-unknown-linux-gnu-library/3.1/")
require("rgl")

timingsData            <- read.csv("/home/henryp/Code/R/JavaLocks/all.csv", header = T)
names                  <- levels(timingsData$name)

toMatrix <- function (counterName) {
  counter.data             <- toRaw(counterName)
  counter.dim              <- sqrt(length(counter.data[,1]))
  counter.matrix           <- matrix(0, nrow=counter.dim, ncol=counter.dim, byrow=T) 
  for (i in 1:counter.dim) {
    for (j in 1:counter.dim) { 
      counter.matrix[i,j] = log10(counter.data[((i - 1)*counter.dim) + j, 4])
    } 
  }
  return (counter.matrix)
}

toRaw <- function(counterName) {
  counter.data             <- subset(timingsData, name == counterName) 
  return (counter.data)
}


rotate <- function(x) t(apply(x, 2, rev))

showPlot <- function (counterName) {
  myMatrix <- toMatrix(counterName) # rotate(rotate(rotate(toMatrix(counterName))))
  
  nbcol = 100
  color = rev(rainbow(nbcol, start = 0/6, end = 4/6))
  zcol  = cut(myMatrix, nbcol)
  
  open3d()
  persp3d(seq(1, length(myMatrix[,1])), 
          seq(1, length(myMatrix[1,])), 
          myMatrix, 
          col=color[zcol],
          xlab="",
          ylab="",
          zlab="",
          axes=FALSE
  )

  axes3d(c('x++','z++', 'y++')) 
  box3d()
  mtext3d("Writers",edge="x++",line=3,las=2,cex=1.1)
  mtext3d("Readers",edge="y++",line=3,las=2,cex=1.1)
  mtext3d("duration", edge="z++",line=2,las=2,cex=1.1)

  title3d(counterName, col='red', line=4)
}


Monday, October 20, 2014

New to Java 8: Adders


I was reading this blog and played with the suite that tests how fast reads and writes to a long are, and I came across a class new to JDK 1.8: LongAdder.

(The test suite has a problem: the class that uses optimistic locking with a StampedLock ignores whether it actually attained the lock, but I'll deal with that elsewhere.)

It's quite a cunning idea. If there is contention on the field containing the long value, the thread that fails to attain the lock creates a new memory location (a Striped64.Cell object) and adds its value there. When it comes to seeing the results of many threads mutating the value, the reader thread sums not just the base value but also all the values in the additional memory locations. Thus contention is minimized.
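Usage is much the same as an AtomicLong; a minimal example (my own, not from the test suite):

import java.util.concurrent.atomic.LongAdder;

public class AdderExample {
    public static void main(String[] args) throws InterruptedException {
        LongAdder adder = new LongAdder();

        Runnable writer = () -> {
            for (int i = 0; i < 1_000_000; i++) {
                adder.increment();          // contended updates spill into per-thread cells
            }
        };

        Thread t1 = new Thread(writer);
        Thread t2 = new Thread(writer);
        t1.start();
        t2.start();
        t1.join();
        t2.join();

        System.out.println(adder.sum());    // sum() folds the base value and all the cells together
    }
}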

There is a caveat when it comes to adding up all the additional values. From the JavaDoc:

The returned value is NOT an atomic snapshot; invocation in the absence of concurrent updates returns an accurate result, but concurrent updates that occur while the sum is being calculated might not be incorporated.

Anyway, I updated the suite mentioned above, adding a modified optimistic locking class that retries in the event of not attaining the lock (OPTIMISTIC_CORRECT in the diagram below). The log output looks like this:

name,threads,duration
ADDER,16,782
ADDER,16,810
ADDER,16,769
ADDER,16,792
ADDER,16,771
ADDER,16,793
ADDER,16,735
ADDER,16,777
ADDER,16,768
ADDER,16,849
ATOMIC,16,514
.
.

and wrote some R code to process it and give me a graphic of all the different techniques:

.libPaths("/home/henryp/R/x86_64-unknown-linux-gnu-library/3.1/")
require("epade")
data            <- read.csv("/home/henryp/Code/R/JavaLocks/firstPastPost1W15R.csv")

groupBy         <- list(data$name, data$threads)
aggregatedMeans <-aggregate(data,
                            by=groupBy,
                            FUN=mean,
                            na.rm=TRUE)
aggregatedSds   <- aggregate(data,
                             by=groupBy,
                             FUN=sd,
                             na.rm=TRUE)

meansAndSds     <- cbind(aggregatedMeans[,c(1,5)],
                         aggregatedSds[,5])

ordered         <- meansAndSds[order(meansAndSds[,2]),]

bar3d.ade(data.matrix(ordered[,2],
                      rownames.force = NA),
          main="10000000 reads of a montonically incrementing long",
          y=ordered[,1],
          zticks=ordered[,1],
          xticks="1 writer, 15 readers",
          ylab="Duration (ms)")

The means and standard deviations look like this:

> ordered
             Group.1 duration aggregatedSds[, 5]
3              DIRTY    211.9           11.95780
2             ATOMIC    594.2           71.89777
9           VOLATILE    641.0           72.22188
1              ADDER    708.7           73.71122
5 OPTIMISTIC_CORRECT   1379.3          141.92334
4         OPTIMISTIC   7518.0         1392.87967
8       SYNCHRONIZED   8682.4         1507.51283
6             RWLOCK  10515.4          665.83251
7            STAMPED  21859.3         5076.10047

or, graphically:


Note: this has one writing thread and 15 reading threads on a 16 core (using hyperthreading) CPU. DIRTY means there was no locking, the fastest but least accurate technique. Duration (the vertical axis) is in milliseconds. I don't know why this axis is not labelled when I save the image as it is displayed in RStudio.

So, ADDER is indeed fast but not as fast as ATOMIC (which is also more accurate) on this computer and with this read/write profile. I'll try other numbers of readers and writers later.


Monday, October 13, 2014

R Tip of the Day


If I'd known this a month ago, I'd have saved myself a lot of time. I was pulling GC data from a JVM's logs but my pattern matching wasn't perfect so I was getting a lot of dirty data (NA in R).

To clean these from a vector, you can do something similar to this:

data      <- c( 1, 2, 3, NA, 5, 6 ) # data with some dirty elements
badData   <- is.na(data)            # elements are TRUE or FALSE
cleanData <- data[!badData]

Similarly, if you have two-dimensional (or more) data, you can drop every tuple in which even one element is dirty. For example:

x <- c( 1,  2,  3,  NA, 5,  6 )
y <- c( 10, 20, NA, 40, 50, 60 )
cleaned <- complete.cases(x, y)

will clean all data where either x or y (or both) is dirty:

> x[cleaned]
[1] 1 2 5 6
> y[cleaned]
[1] 10 20 50 60


What exactly triggers a Garbage Collection?


After playing with the OpenJDK C++ code, I found a good resource of documentation here. But this is a log of my own journey through the OpenJDK code that was prompted by workmates asking: what exactly is it that prompts a garbage collection?

Since we were using the parallel scavenge GC, I had a look in the code at ParallelScavengeHeap::mem_allocate. (We've tried G1 but failed to get good performance out of it. After a few hours, all target collection times were violated and over time it became slower and slower).

First, we try to allocate the memory from the young generation:

  HeapWord* result = young_gen()->allocate(size);

Hopefully, this is successful. If it's not, then we try it again but before we do, we grab a lock:

      MutexLocker ml(Heap_lock);

MutexLocker is a subclass of StackObj; indeed, the way we instantiate it in the above line of C++ code shows that this is going to be allocated on the stack. When the stack data is popped, the lock will automatically unlock itself. There is no need for something like Java's try/finally block here because our thread will execute the destructor of this object, which looks like:

  ~MutexLocker() {
    _mutex->unlock();
  }
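In Java, the same idiom has to be spelled out explicitly. A rough equivalent (a sketch; HeapLockExample and heapLock are my own stand-ins, not JVM code):

import java.util.concurrent.locks.ReentrantLock;

public class HeapLockExample {
    private final ReentrantLock heapLock = new ReentrantLock();  // stand-in for Heap_lock

    void allocateUnderLock() {
        heapLock.lock();
        try {
            // ... the second allocation attempt would go here ...
        } finally {
            heapLock.unlock();   // what the C++ destructor does for free as the stack unwinds
        }
    }
}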

If this second attempt to allocate space in the young generation fails, we'll try to allocate space in the old generation:

      result = mem_allocate_old_gen(size);

This method is interesting: first it checks whether the size required is greater than half the capacity of Eden and, if so, tries to allocate in the old generation; if that succeeds, job done and the method returns. If not, it checks whether a "death march" is taking place. From the comments:

A "death march" is a series of ultra-slow allocations in which a full gc is done before each allocation, and after the full gc the allocation still cannot be satisfied from the young gen.

If there is such a death march taking place, it will do the following 64 times: it will try to allocate the memory in the Old Generation, first by not expanding the heap, then (if forced) by seizing a lock and expanding the heap through all sorts of tricks (see PSOldGen::expand(size_t bytes) for more details) and even possibly putting the thread to sleep for a while.

If a GC is needed, the thread might stall and try the whole process again. If the thread thinks a JNI thread is an issue, it may fail.

If all this still does not provide enough space, the thread requests a GC. It cannot do this itself as only VM threads are allowed to GC, not application threads. So, it asks the VM thread to GC and puts itself to sleep (VMThread::execute). The VM thread then tries to allocate memory in the Young Generation just like the application thread did. Failing that, it tries to do a full collection (see ParallelScavengeHeap::failed_mem_allocate which calls ParallelScavengeHeap::do_full_collection), first without clearing the soft references and then, if that doesn't work, clearing them. A full collection can be expensive. Typically it was OK, but with our 4GB heaps it was not unusual to see them last a couple of seconds.

If this fails, it's back into the loop until a given number of attempts has been made.

If all this fails, out of memory errors are thrown and the heap may be dumped depending on the JVM's arguments.


Java and Scala: Covariance and Contravariance


There is a long-standing bug in the Java type system:

        SubClass[]   subArray   = new SubClass[1];
        SuperClass[] superArray = subArray;
        superArray[0]           = new AnotherSubClass(); // throws java.lang.ArrayStoreException

Scala avoids this quite nicely by using the Array class, whose type parameter is invariant (it's defined as final class Array[T] ...). Note that it has neither + to indicate T is covariant nor - to indicate T is contravariant.

Java programmers may understand co- and contravariance in terms of method calls. Here, method parameters are contravariant and return types are covariant. What do I mean? Well, take this code:

    public SuperClass getThis(SuperClass x) { ... }

It says it returns a SuperClass but it can actually be overridden with this:

    @Override
    public SubClass getThis(SuperClass x) { ... }

(where not surprisingly SubClass is a subclass of SuperClass). So, the return type is covariant, that is, it can be more specific than its definition.

Contravariance in Java is rarer (there's an example at the bottom), but it means that parameter types can be superclasses and the code will still compile.

Generics in Java are invariant (neither co- nor contravariant). That is:

        List<SubClass>    mySubs   = new ArrayList<>();
        List<SuperClass>  mySupers = mySubs; // does not compile

But in Scala, this:

    val subList                                   = List[MySubClass]()
    val superButReallySubList: List[MySuperClass] = subList // allowed

is acceptable because List is defined as:

type List[+A] = scala.collection.immutable.List[A]

That is, the list is covariant (because of the +).

Interestingly, Scala gets around the problems of Java's covariant arrays by changing the inferred type of the list when an element of a supertype is added.

  val subListWithSuperElem                      = subList :+ new MySuperClass
  subListWithSuperElem :+ new MyOtherSubClass // this is legal because subListWithSuperElem is not the same type as subList
  printAllSubs(subListWithSuperElem)          // but this is illegal
.
.

  def printAllSubs(x : List[MySubClass]) = {
    x.foreach(x => { print(x + " ") } )
  }

Now it gets complicated :) Let's take these four functions that are permutations of functions taking a sub- or super-type and returning a sub- or super-type:

    def superToSub  (x: MySuperClass) : MySubClass    = { new MySubClass()   }
    def subToSub    (x: MySubClass)   : MySubClass    = { new MySubClass()   }
    def superToSuper(x: MySuperClass) : MySuperClass  = { new MySuperClass() }
    def subToSuper  (x: MySubClass)   : MySuperClass  = { new MySuperClass() }

Then we want to pass them to these functions:

  def fnTakesSuperReturnsSub  (x :(MySuperClass)  => MySubClass)   = {...}
  def fnTakesSubReturnsSuper  (x: (MySubClass)    => MySuperClass) = {...}
  def fnTakesSuperReturnsSuper(x: (MySuperClass)  => MySuperClass) = {...}
  def fnTakesSubReturnsSub    (x: (MySubClass)    => MySubClass)   = {...}

Some combinations are allowed, some are not. 

                         Type of called function's parameter
                         (Super)->Sub   (Super)->Super   (Sub)->Sub   (Sub)->Super
Argument  (Super)->Sub         Yes            Yes            Yes          Yes
to        (Super)->Super       No             Yes             No          Yes
Function  (Sub)->Super         No             No              No          Yes
          (Sub)->Sub           No             No             Yes          Yes


The rule is:

When passing function A to function B, A's parameter type is contravariant with respect to the parameter type that B declares, and A's return type is covariant. In other words, you can be more relaxed about the argument and more precise about the returned type.

Java 8's new Functional Programming style allows us to create similar functions, for example:

        Function<SubClass, SuperClass> subToSuper =
                (SubClass subclass) -> { return new SuperClass(); };

        Function<SubClass, SubClass> subToSub =
                (SubClass subclass) -> { return subclass; };

        Function<SuperClass, SuperClass> superToSuper =
                (SuperClass superClass) -> { return superClass; };

        Function<SuperClass, SubClass> superToSub =
                (SuperClass superClass) -> { return new SubClass(); };

Note that Java 8 adds syntactic sugar that makes it look a little like Scala, but really it's just creating instances of java.util.function.Function, and the methods we might call may look like this:

    private void doSuperToSub(Function<SuperClass, SubClass> superToSub) { }

    private void doSuperToSuper(Function<SuperClass, SuperClass> superToSuper) { }

    private void doSubToSub(Function<SubClass, SubClass> subToSub) { }

    private void doSubToSuper(Function<SubClass, SuperClass> subToSuper) { }

But this time, since Java generics by default are invariant, there is a one-to-one mapping between the 4 functions and 4 methods above. For instance:

        doSubToSuper(subToSuper); // works!
//        doSubToSuper(subToSub); // doesn't compile

If you want Scala-like flexibility in the parameter type, you have to start getting clever with generics. For instance, you can do:

    doAnyToSuperToSub(subToSub);
    doAnyToSuperToSub(superToSub);

if we define this method such:

    private <T extends SuperClass> void doAnyToSuperToSub(Function<T, SubClass> anySuperToSub) { }

Similarly, if you want Scala-like flexibility in the returned type, you'll need:

        doSubToAnySuper(subToSuper);
        doSubToAnySuper(subToSub);

calling something like:

    private void doSubToAnySuper(Function<SubClass, ? super SubClass> fn) {}

But wildcards cause all sorts of confusion and it's starting to look messy.

Saturday, October 11, 2014

Scala Crib Sheet #4

More Scala for busy Java programmers

Visibility

private - Outer classes cannot see the private methods in inner classes. "Java would permit both accesses" [1]

protected - "In Scala, a protected member is only accessible from subclasses of the class in which it is defined. In Java, such accesses are also possible from classes in the same package." [1]

You can also qualify the access modifier keywords with a scope. For example:

package com.phenry.scala.subpackage

class MyVisibility {
  protected[MyVisibility] def printSomething = { 
    println("Something has been printed :)")
  }
}

object MyVisibility {
  def main(args : Array[String]) = {
    def app  = new MyVisibility()
    app.printSomething
  }
}

Note that, in this case, the protected method can only be called from the class MyVisibility or its companion object, as protected[MyVisibility] is basically the same as private.

But we can also qualify with packages. Let's change that listing slightly:

  protected[subpackage] def printSomething = { ... }
.
.
object MyVisibilityXXX {
  def main(args : Array[String]) = {
    def app  = new MyVisibility()
    app.printSomething // < compiles!
  }
}

Now anything in the package subpackage can access our function, including our object with the new name MyVisibilityXXX.

The qualified visibility applies to anything in packages nested beneath the named scope. For instance:

package society {
  package professional {
    import society.professional.elite.Illuminati
    class Executive {
      private[professional] var workDetails = null

      def help(illuminati : Illuminati): Unit = {
        // DOES NOT COMPILE:
        println(illuminati.topSecret) // <-- WRONG
      }
    }
    package elite {
      class Illuminati {
        protected[elite] val topSecret = null;

        def help(executive : Executive): Unit = {
          println(executive.workDetails)
        }
      }
    }
  }
}

So, an Illuminati can see an Executive's workDetails by virtue of being one package further down. However, an Executive cannot see an Illuminati's top secrets.

This is different to Java where the hierarchy of packages makes no difference to the access modifiers.

Scope can even be restricted to this object (see here).

But Scala is clever enough that, if we create another package called elite in a different part of the namespace tree, it won't give it access to the first:

package society {
  package professional {
.
.
    package other {
      package elite {
        class OtherIlluminati {
          def otherHelp(illuminati : Illuminati) = {
            // "Symbol topSecret is inaccessible from this place"
            println(illuminati.topSecret) // <-- WRONG. does not compile
          }
        }
      }
    }
    package elite {
      class Illuminati {
        protected[elite] val topSecret = null;
.
.


Shadowing

If the val keyword means a reference is immutable, why does Scala have a final keyword? Well, fields can be shadowed in Scala like so:

class MyVisibility {
  val shadowMe : String = "super class String"

  def printShadowMe() = { println(shadowMe) }
}

class MyVisibilitySubClass extends MyVisibility {
  override val shadowMe : String = "subclass's String"
}

object MyVisibilityXXX {
  def main(args : Array[String]) = {
    def appSubclass = new MyVisibilitySubClass()
    appSubclass.printShadowMe() // subclass's String
  }
}

Using final would prevent this shadowing.


Types, passing methods and partial application

Say we have two functions that encode data and we expect them to produce the same output for the same input. The test then becomes an ideal candidate for passing functions; that is, we can write a test function that takes either function, applies the input and checks the output. What does the type of this function's parameter look like?

Say, this is one of the methods we want to pass:

  def canBePartiallyApplied(count : Int)(f : Int => Int) : Unit = {...}

to a function. The called function would look something like this:

  def fnTakingFnThatCanBePartiallyApplied(f : (Int, (Int) => Int) => Unit) = {...}

and we'd call it so:

  fnTakingFnThatCanBePartiallyApplied(app.canBePartiallyApplied(_)(_)) 

So, the testing function takes a function that has parameters of Int and another function that takes an Int and returns an Int. Phew.

If we partially apply our function, like so:

    def partiallyApplied = canBePartiallyApplied(3)(_)

We could no longer pass partiallyApplied to our test method. Its type is not the same: we've dropped the first Int type, leaving just an argument that is a function taking an Int and returning an Int. The kind of function we could pass this to would look something like:

    def fnTakingPartiallyApplied(f : (Int => Int) => Unit) = {...}


[1] Programming in Scala, Odersky.

Wednesday, October 1, 2014

Unsafe Fences


The mysterious sun.misc.Unsafe class in JDK 1.8 has some recently added methods: loadFence(), storeFence() and fullFence(). The documentation in Unsafe.java is sparse, telling us only that they "ensure lack of re-ordering of" loads, stores, and loads/stores before the fence "with loads or stores after the fence."
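Getting hold of these methods takes the usual reflection trick, since Unsafe isn't meant to be obtained directly. A minimal sketch:

import java.lang.reflect.Field;

import sun.misc.Unsafe;

public class FenceExample {
    private static final Unsafe UNSAFE = unsafe();

    private static Unsafe unsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");  // not public API, hence reflection
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new AssertionError(e);
        }
    }

    public static void main(String[] args) {
        UNSAFE.loadFence();   // loads before this point are not re-ordered with loads/stores after it
        UNSAFE.storeFence();  // likewise for stores before this point
        UNSAFE.fullFence();   // neither loads nor stores may cross it
    }
}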

These load-, store- and full-fence native methods are defined in unsafe.cpp in the OpenJDK source code and just defer to OrderAccess::acquire(), ::release() and ::fence() respectively.

OrderAccess::release() doesn't do a great deal:

inline void OrderAccess::release() {
  // Avoid hitting the same cache-line from
  // different threads.
  volatile jint local_dummy = 0;
}

It is inlined, so it's as if it were copy-and-pasted into all parts of the code that call it. When I put something similar into a (non-inlined) piece of C++ and disassemble it on x86_64 with objdump -d, it looks like this:

00000000000003dd <_Z19OrderAccess_releasev>:
 3dd: 55                   push   %rbp
 3de: 48 89 e5             mov    %rsp,%rbp
 3e1: c7 45 fc 00 00 00 00 movl   $0x0,-0x4(%rbp)
 3e8: 5d                   pop    %rbp
 3e9: c3                   retq   

The movl line is what writes 0 to a location on the stack (all the other lines just set up and tear down the stack frame). So, I can only conclude that the volatile in the release() method is there to stop the compiler re-ordering things, as nothing special is happening at the hardware level. This appears to be what the OpenJDK source code suggests:

"Note: as of 6973570, we have replaced the originally static "dummy" field (see above) by a volatile store to the stack. All of the versions of the compilers that we currently use (SunStudio, gcc and VC++) respect the semantics of volatile here. If you build HotSpot using other compilers, you may need to verify that no compiler reordering occurs across the sequence point represented by the volatile access."

The only place in the JDK's source that I can see a call to loadFence() is in StampedLock. What this call to loadFence() appears to be avoiding is a write to a volatile variable to ensure a LoadLoad memory barrier (that is, the data loaded in the second read is at least as new as the data loaded in the first). Why this LoadLoad is necessary is explained at this link here.

The LoadLoad is desired after reading a non-volatile variable but before reading a volatile variable. But no memory barrier is otherwise issued in a Normal Store/Volatile Load combination. So, one way to provide it is to write to a volatile variable. Unfortunately, this is expensive.

[Aside: the semantics of primitive memory fences operate between instructions. It does not make sense to associate a fence with a single instruction.]

An acquire fence is a LoadLoad and a LoadStore (see here for a good explanation). From OpenJDK's orderAccess.hpp, the comments tell us what this acquire does:

"Execution by a processor of acquire makes the effect of all memory accesses issued by it subsequent to the acquire visible to all processors *after* the acquire completes.  The effect of prior memory accesses issued by it *may* be made visible *after* the acquire.  I.e., prior memory accesses may float below the acquire, but subsequent ones may not float above it."

So, acquire issues the LoadLoad we are looking for (plus a LoadStore, which appears not to be needed in this case) and avoids an expensive write to a volatile.

A note on x86 architecture

LoadLoad, LoadStore and StoreStore seem to be no-ops on x86. This is what Doug Lea's JMM Cookbook says and, looking at the OpenJDK code, acquire() and release() do nothing but write a 0 to the stack (they do, however, use volatile, so the compiler should not do any re-ordering). So, calls to these methods should have no effect on this architecture even if they're needed on others.

The method fence() does, however, have an effect on x86 as it uses a lock-prefixed assembly instruction.

Empirically, this seems to be the case. Using JMH, I put a simple class together that benchmarks the times for these three calls. Each method is called and nothing more:

Benchmark                               Mode  Samples  Score  Score error  Units
c.p.m.m.FenceBenchmarks.acquireFence    avgt        5  0.338        0.021  ns/op
c.p.m.m.FenceBenchmarks.fence           avgt        5  6.343        0.090  ns/op
c.p.m.m.FenceBenchmarks.releaseFence    avgt        5  0.340        0.013  ns/op

The results show fence() to be an order of magnitude slower than the other two methods.
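For completeness, the benchmark class would have looked something like this. This is my reconstruction rather than the original, but it matches the method names in the results above:

import java.lang.reflect.Field;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;

import sun.misc.Unsafe;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class FenceBenchmarks {

    private static final Unsafe UNSAFE = unsafe();

    private static Unsafe unsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");  // same reflection trick as before
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new AssertionError(e);
        }
    }

    @Benchmark
    public void acquireFence() {
        UNSAFE.loadFence();   // backed by OrderAccess::acquire()
    }

    @Benchmark
    public void releaseFence() {
        UNSAFE.storeFence();  // backed by OrderAccess::release()
    }

    @Benchmark
    public void fence() {
        UNSAFE.fullFence();   // backed by OrderAccess::fence(), the lock'd instruction
    }
}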