Tuesday, August 19, 2014

Low Level Locks


Deep down in the JVM you'll find something like this in Atomic::cmpxchg:

  __asm__ volatile ("lock; 1: cmpxchgl %1,(%3)"
                    : "=a" (exchange_value)
                    : "r" (exchange_value), "a" (compare_value), "r" (dest)
                    : "cc", "memory");
  return exchange_value;

(Note: I've slightly modified this for reasons of clarity to ignore the case where the code might be running on a CPU that is not a multi-processor).

This code is hit often when the JVM tries to lock something. It is one of its many techniques to arbitrate contention (what happens in the event of contention I'll leave to another post).

So, what's it doing? It's basically doing what Java programmers would know from the AtomicXXX classes using low-level assembler. This is how GCC makes explicit assembler code. A good reference is here but briefly:
  • asm (or alternatively __asm__) are keywords that allow the programmer to directly write assembler. Volatile means don't move our code as part of an optimization. It must execute where we placed it.
  • The contents of the parentheses are delimited by colons and their meanings are ASSEMBLY_CODE:OUTPUT:INPUT:CLOBBERED.
  • The CLOBBERED list tells the compiler what we have fiddled with (other than input and output) and that it should bear this mind before and after executing it. In this case, we're telling it we fiddled with the memory and the condition flags ("cc").
  • The INPUT and OUTPUT fields map our C variables to particular CPU registers. In this case, we're saying use any registers for the values input via exchange_value and dest because "r" means "any register". For the input in compare_value and output in exchange_value we're saying explicitly use the EAX register ("a" and "=a" respectively. The equals sign means it is a write-only output.
Why the EAX register is used for input and output is given in Intel's Software Developer's Manual:

"The CMPXCHG instruction requires three operands: a source operand in a register, another source operand in the EAX register, and a destination operand. If the values contained in the destination operand and the EAX register are equal, the destination operand is replaced with the value of the other source operand (the value not in the EAX register). Otherwise, the original value of the destination operand is loaded in the EAX register." [1]

If you put this code in a simple .c file and compile it with:

$ gcc -S YOUR_FILE.c

you'll see the raw assembly output to YOUR_FILE.s confirms the above.

[1] Intel® 64 and IA-32 Architectures Software Developer’s Manual Combined Volumes: 1, 2A, 2B, 2C, 3A, 3B, and 3C


Monday, August 18, 2014

R


R is a great tool for plotting performance graphs. Every developer who worries about the performance profile of his application should learn it. Here are some miscellaneous scripts I've been using for the last few months.

But first...

A Crash Course in R

R is Modular

You might find that the standard installation does not have all the modules you need. Any R GUI will help you download the ones you want. But to use them, you need something like this:

.libPaths("YOUR_MODULE_DIRECTORY")
require("MODULE_NAME")

Think of this as setting the classpath and then using import in your Java files.

Loading the data

Take this example:

timings_data <- read.table("C:/Users/henphi/db_saves.txt")
timings      <- c(timings_data$V15)
times        <- timings_data$V1

The first line reads the raw data that I saved in a file.

The second line takes the 15th column of data (think of it as the awk command in Unix) and puts it into a vector. Like awk, R treats the first column as index 1, not zero. There is a 0th column but this is the line number. We put the  data into a vector using the c() command.

The third lines is just like the second but takes just the 1st ($V1) column. Here, I chose not to use the c() command but that's OK. In both cases it doesn't matter as they are both vectors already. In the paragraph above, I just wanted to introduce the standard way of making vectors in R.

Coercing the Data

We've already mentioned one data structure - vectors that are created using the c() command. Another is a list that can be created using (unsurprisingly) the list() command.

In this example, I am taking the variables defined in the section above and manipulating the types to make something I can plot.

data <- structure (
    list(
        time = structure(
            as.POSIXct(strptime(times, '%H:%M:%S')),
            class = c("POSIXct", "POSIXt"),
            tzone = ""
        ),
        variable = timings
    ),
    .Names    = c("time", "variable"),
    row.names = c(NA, -10L),
    class     = "data.frame"
)

Here, I have made a data structure of a type called a data frame. "Although a data set of 50 employees with 4 variables per worker has the look and feel of a 50x4 matrix, it does not qualify as such in R because it mixes types. Instead of a matrix we use a data frame. A data frame in R is a list with each component of the list being a vector corresponding to a column in our 'matrix' of data... Typically, data frames are created by reading in a data set from a file". [1]

As part of this structure, we have a list that has two entries that go by the arbitrary names of time and variable. Variable is simple, it's just the timings variable that we loaded from a file earlier.

The element of this list, time, is also a list but we have coerced the string representations of our times to a type that represents a date. This is done by the strptime function that converts the string in our data file to a standard type using regexp. Once in the standard format, it is then converted to a Unix time-from-epoch using as.POSIXct.

(Since the data file only has time of day entries in it, R will assume they belong to the day on which the line runs and give it "today's" date. That's OK as I know my data is only within an hour but I don't care which hour it was.)

The .Names element defines the names of the data objects within this structure. They would be used as default in the graphs we generated but we'll override them later.

A real world example

Let's say we want to plot the Garbage Collection times of our app. A couple of command line switches will help.

  • -XX:+PrintGCApplicationStoppedTime which GC guru, Gil Tene, considers "is probably the only number in GC logs that you should sort-of-semi-believe."
  • -XX:+PrintGCDateStamps which makes the logs much easier to deal with than time-since-epoch.
  • -Xloggc:/YOUR/FILE/HERE to make the JVM output log stats in the first place.

I clean this file up just to grab the application pause times:

$ grep "Total time for which application threads were stopped" jstringserver.gc.log > jstringserver_cleaned.gc.log 

such that the slightly cleaned GC logs look like this:

$ more /home/henryp/Documents/Blogs/jstringserver_cleaned.gc.log
2014-08-18T15:17:29.824+0100: 1.096: Total time for which application threads were stopped: 0.0004190 seconds
2014-08-18T15:17:32.895+0100: 4.166: Total time for which application threads were stopped: 0.0001930 seconds
2014-08-18T15:18:40.902+0100: 72.173: Total time for which application threads were stopped: 0.0001650 seconds
2014-08-18T15:18:43.903+0100: 75.174: Total time for which application threads were stopped: 0.0000450 seconds
.
.

Now, my R script looks like this:

gc_data <- read.table("/home/henryp/Documents/Blogs/jstringserver_cleaned.gc.log")

data <- structure (
  list(
    time = structure(
      as.POSIXct(strptime(gc_data$V1, '%Y-%m-%dT%H:%M:%S')),
      class = c("POSIXct", "POSIXt"),
      tzone = ""
    ),
    variable = gc_data$V11
  ),
  .Names    = c("time", "variable"),
  class     = "data.frame"
)

plot(
  data,
  xaxt="n",
  type="h",
  main="GC times for JStringServer",
  ylab="GC Pause Time (ms)",
  xlim=c(as.POSIXct(strptime("15:20", "%H:%M")),
         as.POSIXct(strptime("15:30", "%H:%M")))
)

axis.POSIXct(side=1, at=cut(data$time, "min"), format="%H:%M")

Where I have zoomed-in on a period of time I am interested in, namely from 15:20 to 15:30 and axis.POSIXct just put's the time ticks on the x-axis after we disabled x-axis marking in the plot() command with xaxt="n".

The generated graph looks like:


Note that if you prefer, you can set log="y" as an argument in the plot command and have the y-axis scaled logarithmically.


[1] The Art of R Programming.

Saturday, August 16, 2014

Performance of RSA vs. DES


Choice of encryption algorithm matters. "In software, DES is about 100 times faster than RSA. These numbers may change slightly as technology changes, but RSA will never approach the speed of symmetric algorithms". [1]

DES can be cracked with brute force in hours but for our application that needs to encrypt a nonce to check the integrity of the conversational state, that's good enough security. So, it looked like it was worth investigating a change of algorithm.

"DES is a symmetric algorithm: The same algorithm and key are used for both encryption and decryption" [1]. AS a result, the means of getting a key are slightly different to getting an RSA private/public key pair:

    KeyGenerator keyGenerator   = KeyGenerator.getInstance("DES");
    SecretKey    secretKey      = keyGenerator.generateKey();

But using the cipher is pretty similar:

    Cipher cipher        = Cipher.getInstance("DES/ECB/PKCS5Padding");
    cipher.init(ENCRYPT_MODE, secretKey);
    byte[] encrypted     = cipher.doFinal("hello world".getBytes());

The performance improvements were dramatic, even better than Schneier's estimate. To measure them, I used the Java Microbenchmark Harness. To allow your project to use JMH, you need Maven and a pom.xml that looks a little like this below (my version of Eclipse - Kepler SR2 - auto-generated it for me but my work computer couldn't seem to as it didn't have the relevant Maven archetype so I'll include my pom.xml explicitly here):

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.phenry</groupId>
    <artifactId>microbenchmarks</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <packaging>jar</packaging>

    <name>Auto-generated JMH benchmark</name>

    <prerequisites>
        <maven>3.0</maven>
    </prerequisites>

    <dependencies>
        <dependency>
            <groupId>org.openjdk.jmh</groupId>
            <artifactId>jmh-core</artifactId>
            <version>${jmh.version}</version>
        </dependency>
        <dependency>
            <groupId>org.openjdk.jmh</groupId>
            <artifactId>jmh-generator-annprocess</artifactId>
            <version>${jmh.version}</version>
            <scope>provided</scope>
        </dependency>
   <dependency>
     <groupId>junit</groupId>
     <artifactId>junit</artifactId>
     <version>4.8.1</version>
     <scope>test</scope>
   </dependency>
    </dependencies>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <jmh.version>0.9.4</jmh.version>
    </properties>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.1</version>
                <configuration>
                    <compilerVersion>1.6</compilerVersion>
                    <source>1.6</source>
                    <target>1.6</target>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>2.2</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <finalName>benchmarks</finalName>
                            <transformers>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                    <mainClass>org.openjdk.jmh.Main</mainClass>
                                </transformer>
                            </transformers>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
        <pluginManagement>
            <plugins>
                <plugin>
                    <artifactId>maven-clean-plugin</artifactId>
                    <version>2.5</version>
                </plugin>
                <plugin>
                    <artifactId>maven-deploy-plugin</artifactId>
                    <version>2.8.1</version>
                </plugin>
                <plugin>
                    <artifactId>maven-install-plugin</artifactId>
                    <version>2.5.1</version>
                </plugin>
                <plugin>
                    <artifactId>maven-jar-plugin</artifactId>
                    <version>2.4</version>
                </plugin>
                <plugin>
                    <artifactId>maven-javadoc-plugin</artifactId>
                    <version>2.9.1</version>
                </plugin>
                <plugin>
                    <artifactId>maven-resources-plugin</artifactId>
                    <version>2.6</version>
                </plugin>
                <plugin>
                    <artifactId>maven-site-plugin</artifactId>
                    <version>3.3</version>
                </plugin>
                <plugin>
                    <artifactId>maven-source-plugin</artifactId>
                    <version>2.2.1</version>
                </plugin>
                <plugin>
                    <artifactId>maven-surefire-plugin</artifactId>
                    <version>2.17</version>
                </plugin>
            </plugins>
        </pluginManagement>
    </build>

</project>

Your code now needs a minimum of the @Benchmark annotation like this:

    @Benchmark
    public byte[] decrypt() throws Exception {
        cipher.init(DECRYPT_MODE, secretKey);
        return cipher.doFinal(encrypted);
    }

You don't need to return anything but I did for the sake of the my unit tests.

There are many ways to run your benchmark but the one I quite like looks something like this:

$ mvn clean install && java -jar target/benchmarks.jar ".*CryptoBenchmarks.*" -wi 4 -i 20 -f 1 -bm avgt -tu ns

where wi is the number of warm-up iterations, i is the number of actual iterations for the test, f is the number of times we fork for a particular iteration and tu is time units (here it is nanoseconds). The options can be found in the CommandLineOptions class. The -bm switch is the benchmark mode and options can be found in the Mode class where here we're asking it to measure the average times.

Note: you'll need to add a main method to your classes.

The output for my comparison of RSA and DES looked like this:

Benchmark                                                      Mode   Samples        Score  Score error    Units
c.p.m.RsaCryptoBenchmarks.decryptionUsingPrivateKey            avgt        20  6257593.175   197497.265    ns/op
c.p.m.RsaCryptoBenchmarks.encryptionUsingPublicKey             avgt        20   203438.819     2428.467    ns/op
c.p.m.DesCryptoBenchmarks.decrypt                              avgt        20      672.128       40.558    ns/op
c.p.m.DesCryptoBenchmarks.encrypt                              avgt        20      731.335       57.101    ns/op


So, encryption is roughly 1000 times more expensive in RSA and decryption 10 000.

[1] Applied Cryptography, Bruce Schneier.

Monday, August 11, 2014

The Performance of RSA Encryption


When encryption takes up a more of your round trip time than servicing the request itself, it's time to look at optimizing it.

I'm using a home-compiled version of OpenJDK that's running code that looks like this:

import static javax.crypto.Cipher.ENCRYPT_MODE;
        
        KeyPair             keyPair             = keyPairGenerator.genKeyPair();
        PrivateKey          privateKey          = keyPair.getPrivate();
        PublicKey           publicKey           = keyPair.getPublic();
        Cipher              cipher              = Cipher.getInstance(algorithm);
        
        // Note that privateKey is of type sun.security.rsa.RSAPrivateCrtKeyImpl
        
        cipher.init(ENCRYPT_MODE, privateKey);
        byte[] input = "hello world".getBytes();
        byte[] output = cipher.doFinal(input);

Stepping through the code

Thread [main] (Suspended)
RSACore.rsa(byte[], RSAPrivateKey) line: 101
RSACipher.doFinal() line: 352
RSACipher.engineDoFinal(byte[], int, int) line: 389
Cipher.doFinal(byte[]) line: 2121
.
.

I come to this in sun.security.rsa.RSACore:

    public static byte[] rsa(byte[] msg, RSAPrivateKey key)
            throws BadPaddingException {
        if (key instanceof RSAPrivateCrtKey) {
            return crtCrypt(msg, (RSAPrivateCrtKey)key);
        } else {
            return crypt(msg, key.getModulus(), key.getPrivateExponent());
        }
    }

If I wanted the other branch to be taken, I could do something like this:

    RSAPrivateKey       rsaPrivateKey   = (RSAPrivateKey)privateKey;
    RSAPrivateKeySpec   keySpec         = new RSAPrivateKeySpec(
        rsaPrivateKey.getModulus(), 
        rsaPrivateKey.getPrivateExponent());
    PrivateKey          slowPrivateKey  =
        KeyFactory.getInstance("RSA").generatePrivate(keySpec);

and call a Cipher that has been initialized with a private key of class sun.security.rsa.RSAPrivateKeyImpl rather than RSAPrivateCrtKeyImpl. Then when we get to the crypt method, we see this:

        BigInteger c = m.modPow(exp, n);

The BigInteger method modPow is very expensive since exp and n are enormous numbers. In fact, as I run jstack against my JVM, I see half a dozen threads in this code.

Conclusion

Because RSAPrivateCrtKey carries much more information than the interface RSAPrivateKey demands, less expensive calculations appear to be needed. Therefore, encryption is considerably faster using keys of this class than RSAPrivateKeyImpl.

When I benchmarked the two on a small string ("hello world") using JMH I found RSAPrivateCrtKey over three times faster when encrypting.


Saturday, August 9, 2014

RSA Encryption Explained


RSA is a common encryption algorithm that I wanted to understand. Bruce Schneier's excellent Applied Cryptography gives a good explanation but I found a few steps missing. Building on his derivation, I am going flesh it out.

What you need to know

Vocabulary
  • Relatively prime (or mutually prime or coprime): two integers are relatively prime when their only common divisor is 1.
  • Congruent: if two integers are have the same value when mod-ed with a third integer, they are congruent.
Notation

There is a lot of modulus arithmetic in cryptography and the notation looks like this:

a ≡ b (mod c)

This should not be read in the fashion of a programmer, like a = b mod c (or a = b % c in most languages). Rather, the mod applies to the whole equation so you should read it more like:

(a ≡ b) (mod c)

Modulus Arithmetic Formulas

You also need to know some formulas.

[f1]    a p - 1 ≡ 1 (mod p)  iff p is prime

This is Fermat's little theorem and the best proof I found is here.

General Maths Formulas

As any school child can tell you:

[f2]    abc = (ab)c

and:

[f3]    ab + c = ab ac

Also, the result of a mod is just the remainder of a division so it should be obvious:

[f4]    a mod b = c a = kb + c

where k is just some integer.

The Derivation

Choose two large prime numbers, p and q. Let n be the product of the two.

[d1]    n = pq

Why they must be prime will become apparent but it involves [f1] above.

Choose any integer e that is relatively prime to n.

Then, calculate d such that:

(d2)    ed  1 (mod ((p-1)(q-1)))

And that's it. Numbers n an e make the public key and d is the private key.

Now, let's take a message, m. If it's too big, we can break it down and if it's too small we can pad it. But that's not of interest at the moment, so let's just say m is a number we want to encrypt.

Encryption looks like this:

[d3]    c  me (mod n)

where c is the encrypted result.

Decryption looks like this:

[d4]    m  cd (mod n)

Easy, right?

Yes, but why does it work?

Why is equation [d4] true?

Ok, substitute c in [d4] with c in [d3] and use [f2] to trivially simplify:

[d5]    m  med (mod n)

now substitute [d1] into this:

[d6]    m  med (mod pq)

Now, re-express [d2] in the format of [f4]:

[d7]    ed = k ((p-1)(q-1) + 1

and substitute [d7] for ed in [d6]:

[d8]    m  m(k(p-1)(q-1) + 1) (mod pq)

Using [f3], this is the same as:

[d9]    m  m . m k(p-1)(q-1) (mod pq)

which, because of f3 is the same as:

[d10]   m  m . mk(p-1) m(q-1) (mod pq)

Now, let's invoke [f1] for (q-1):

[d11]   m  m . (mq-1)m(p-1) (mod pq)
           m . 1k     m(p-1) (mod pq)
           m .        m(p-1) (mod pq)


and again for (p-1):

[d12]   m  m (mod pq)

which is obviously true, so what we started with [d4] must also be true.

Saturday, August 2, 2014

Heartbleed and the JVM


The Heartbleed bug that affected OpenSSL earlier this year was a failure to do a proper bounds check. It allowed the client to define how much memory it was assigned in the server. The server then might have only filled some of this memory but all of its contents were returned to the client. The problem in a language like C is that this memory might be dirty. That is, it may contain other peoples' discarded data.

The patch was to check the bounds so no superfluous data was returned to the client. But why was the data not clean in the first place?

Basically, it's an optimization. Clearing the content of memory is expensive.

Java is much more secure than C. If you grab some off-heap memory, it cleans it before giving it to you. If you call ByteBuffer.allocateDirect(int) you are given an instance of a packge protected class, DirectByteBuffer. In its constructor you see:

            base = unsafe.allocateMemory(size);

where unsafe is an object of class sun.misc.Unsafe. The method allocateMemory takes us to unsafe.cpp in OpenJDK where we see:

  void* x = os::malloc(sz, mtInternal);
  if (x == NULL) {
    THROW_0(vmSymbols::java_lang_OutOfMemoryError());
  }
  //Copy::fill_to_words((HeapWord*)x, sz / HeapWordSize);
  return addr_to_java(x);

This doesn't clean the memory (that's done later). But as an interesting aside, set -XX:+PrintMalloc as a command line parameter and you'll see the JVM output lines like:

os::malloc 536870912 bytes --> 0x00007f1ce7fff028

Anyway, the cleaning of the memory comes here, a few lines later in DirectByteBuffer:

        unsafe.setMemory(base, size, (byte) 0);

which is a call to:

UNSAFE_ENTRY(void, Unsafe_SetMemory(JNIEnv *env, jobject unsafe, jlong addr, jlong size, jbyte value))
  UnsafeWrapper("Unsafe_SetMemory");
  size_t sz = (size_t)size;
  if (sz != (julong)size || size < 0) {
    THROW(vmSymbols::java_lang_IllegalArgumentException());
  }
  char* p = (char*) addr_from_java(addr);
  Copy::fill_to_memory_atomic(p, sz, value);
UNSAFE_END

Despite the mention of filling atomically, this call to the Copy class looks no more complicated than something like this:

    for (uintptr_t off = 0; off < size; off += sizeof(jlong)) {
      *(jlong*)(dst + off) = fill;
    }

The consequence of this security is a loss of performance:

"It's a well known fact that direct bytebuffer allocation is much slower than non direct byte buffers.

"If you think about it when you allocate a non direct byte buffer then it basically just needs to dereference a pointer to some memory on the Java heap, which is very quick. But for a direct buffer, it has to malloc real memory from the OS, and do a bunch of other house keeping.

"So yes, if you're using direct byte buffers it pays to re-use them."

https://community.jboss.org/thread/82231


Saturday, July 26, 2014

UDP and Full Buffers


Oracle's Coherence has a very nice reporting tool. If you don't have it running by default, you can connect via JConsole and turn it on. Give it a directory and it will print cluster statistics to it every minute.

The reason we need these stats is that on a quiet system, the percentage of packets dropped on some nodes can be terrible. I've been looking at why.

Oracle Coherence nodes talk to each other via UDP. Unlike TCP, nothing is ensuring our packets reach their destination. To counter this, Coherence builds its own protocol on top of UDP that resends packets if they have not been acknowledged or if the sender receives a NACK.

On a decent network, why would our packets not reach their destination? Well, one reason is that the receiving buffer is full. We've set our buffers at 2MB (per Oracle's advice) so this is unlikely but possible. If an application receives UDP packets faster than it can process them, packets will be dropped.

However, if you're running on Linux, you can easily see if this is the case.

henryp@phillsdell:~$ netstat -su 
.
.
Udp:
    3196 packets received
    9 packets to unknown port received.
    0 packet receive errors
    2770 packets sent

It's the packet receive errors count that is of interest to us. Although the packet is discarded, Linux will at least make a note of its loss.

The mystery continues as we are seeing nothing like the packet loss here that would explain the number of Coherence re-sends. 

Our cluster is saying that the number of resends ("when there is no ACK received within a timeout period" [1]) and the number resent early ("a packet is resent ahead of schedule when there is a NACK indicating that the packet has not been received" [1]) is high.

These resends appear necessary as the excess figure ("the total number of packet retransmissions which were later proven unnecessary" [1]) was low.

Note that you would not necessarily expect the publish success rate to correlate with receive success rate. They measure unrelated metrics. "Publisher success rate is a ratio of the number of packets successfully delivered in a first attempt to the total number of sent packets" [1]. Whereas the receiver rate is measures unnecessary redelivery. "Failure count is incremented when a re-delivery of previously received packet is detected. It could be caused by either very high inbound network latency or lost ACK packets." [2]

Our network guys assure us that our Cisco kit is not reporting anywhere near enough lost UDP packets to explain Coherence's need to resend packets, neither does the OS when reporting lost packets on the NICs. 

What's more, put a load through the system does not change the absolute number of resends - therefore as a percentage of overall packets, this pathological metric drops. This is also the case when we put a lot of load on the network with the datagram tester that comes with Coherence.

We are perplexed.

[1] Coherence docs.

[2] Coherence docs.