Agile Java Man: May 2014

Monday, May 26, 2014

Debugging the JVM

This is very nice. I've built my own JDK on Linux and am stepping through the C++ code as if it were a normal Java debug.

To do this:

1. Get the source code with:

hg clone http://hg.openjdk.java.net/jdk9/dev 9dev

2. Build with something like:

$ bash ./configure --with-debug-level=slowdebug --enable-debug-symbols ZIP_DEBUGINFO_FILES=0
$ make all CONF=linux-x86_64-normal-server-slowdebug ZIP_DEBUGINFO_FILES=0

or whichever CONF is appropriate to your system. This last step can take quite a long time (about 10 minutes).

The ZIP_DEBUGINFO_FILES is important (see here and here). This compiles the symbols into the artifact that allow debugging.

[Note: sometimes, I saw errors like:

gmake[2]: *** [/home/henryp/Code/OpenJDK/9dev/build/linux-x86_64-normal-server-slowdebug/jdk/demo/jpda/examples.jar] Error 1
gmake[2]: *** Waiting for unfinished jobs....
Note: Some input files use or override a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
Note: Some input files use unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.
gmake[1]: *** [demos] Error 2
make: *** [demos-only] Error 2

which were worrying. In the event of this, I ran make again and it worked. Hmm.]

3. Install NetBeans 8.x. Although I am normally an Eclipse man, NetBeans is frankly better for C/C++ development.

4. Create a new C/C++ application in the hotspot directory.

5. Choose a Java main class to run by right clicking on the project properties and in the "Run" Category. Choose the class to run using the artifact that was built, eg something like:

"${OUTPUT_PATH}" -classpath /home/henryp/Workspaces/workspaceKepler/_MyTests/bin com.phenry.memory.ObjectCreatorMain

6. Insert your breakpoints wherever you wish and press the debug button. Note, if you are remote debugging and your system has been "hardened" for security reasons, you may have some trouble. Follow the instructions here (and the subsequent link) to turn off the security.

A Caveat

Finally, when debugging, you will be using Linux tools. I had problems on an Ubuntu 12.04 system as the tools were not up-to-date (the debugger acts erratically when debugging multi-threaded code). On my Fedora system, everything was fine. The versions I was using were:

[henryp@corsair 9dev]$ gdb -version
GNU gdb (GDB) Fedora (7.4.50.20120120-54.fc17)
.
.
[henryp@corsair 9dev]$ gcc --version
gcc (GCC) 4.7.2 20120921 (Red Hat 4.7.2-2)
.
.

Queues and Statistical Quirks

We use Oracle's Coherence Incubator in our project, in particular the Push Replication pattern to move documents from the primary site to a warm, disaster recovery site located 50 miles away. The Push Replication pattern in turn depends on the Messaging Pattern. This involves having in-memory queues of objects that are waiting to be published.

The trouble is, the distribution across these queues is not even. As a result, some queues can be full (and close to causing an OutOfMemoryError thus forcing us to reject client requests) and others empty. We looked at the distribution function but it seems to be more a quirk of probability.

Take this Scala code to demonstrate what I am talking about. Let's have a given number of buckets that have X documents randomly distributed over them each iteration. Let's consume a fixed number from each bucket each iteration (the batch size). If there are less than the batch size in the bucket, we just take them all and that's that. We then let this run for an arbitrary number of iterations.

The simulation code looks like this:

package com.phenry.stats

import scala.util.Random

object PublisherDistributionMain {

def main(args : Array[String]) = {
val numBuckets = args(0).toInt
val numIterations = args(1).toInt
val numNewDocs = args(2).toInt
val batchSize = args(3).toInt
var buckets = new Array[Int](numBuckets)

for (i <- 0 to numIterations) {
populate(buckets, numNewDocs)
buckets = depopulate(buckets, batchSize)
}
print(buckets)
}

def print(buckets : Array[Int]) : Unit = {
val largest = max(buckets)

buckets.map(_ * 50 / largest).foldLeft(0)((acc, elem) => {
println("%2d : ".format(acc) + "#" * elem)
acc + 1
})

val stats = new MeanAndStdDeviation
println("\nmean = %.3f, std dev = %.3f".format(stats.mean(buckets), stats.stdDev(buckets)))
}

def max(buckets : Array[Int]) : Int = {
buckets.reduce((x, y) => if (x > y) x else y)
}

def depopulate(buckets : Array[Int], batchSize : Int) : Array[Int] = {
buckets.map(x => if (x < batchSize) 0 else (x - batchSize))
}

def populate(buckets : Array[Int], numNewDocs : Int) : Unit = {
val random = new Random
for (i <- 0 to numNewDocs) {
val bucket = random.nextInt(buckets.length)
buckets(bucket) = buckets(bucket) + 1
}
}
}

With a simple stats class (MeanAndStdDeviation) looking like this:

package com.phenry.stats

class MeanAndStdDeviation {

implicit def integer2Double(x : Integer) = x.doubleValue()
implicit def integer2DoubleList(x : List[Int]) = x.map(i => i.doubleValue())
implicit def integer2DoubleList(x : Array[Int]) = x.map(i => i.doubleValue())
implicit def double2Integer(x : Double) = x.intValue
implicit def double2IntegerList(x : List[Double]) = x.map(d => d.intValue)

val mean = (x : Array[Int]) => sum(x) / x.length

def sum(results : List[Double]) : Double = {
results.foldRight(0d)((x, y) => x + y)
}

def stdDev(results : Array[Int]) : Double = {
val theMean = mean(results);
var sumOfSquaredDiffs = 0d
for (aResult <- results) sumOfSquaredDiffs += math.pow((aResult - theMean), 2)
math.sqrt(sumOfSquaredDiffs / results.length)
}

def sum(results : Array[Double]) : Double = {
results.reduce((x, y) => x + y)
}

.
.

(Scala really is nice for writing succinct functions...)

The results after 100 000 iterations look like this:

0 : ################
1 : ###############################
2 : ###############
3 : ##############
4 : #################
5 : ####
6 : ##########
7 : ######################
8 : ##############################
9 : ##################
10 : ##############
11 : ############
12 : ################
13 : ##############
14 : ##############################
15 : ####
16 : ##################################################
17 : ################
18 : ###################
19 : #############

that is, very lumpy. Bucket #15 has a fraction of what bucket #16 has, for instance.

As a result, our system will reject users' requests complaining it has not enough space despite the fact that many nodes in the cluster have plenty of memory.

Time to write a new consumer algorithm...

Monday, May 12, 2014

How do you distinguish a slow application from a slow network?

We've been given new hardware on which to run our application. To make sure it will be OK, we ran our performance tests against it. To our surprise, they sometimes did poorly despite the new CPUs being much better. Using jstack showed at any particular time, nothing much was going on.

First we looked at playing with OS parameters but we had no luck. Then we looked at esoteric things like traffic shaping but found nothing.

So, time to crack open Wireshark. We took a 60s sample of network traffic when things were going well as a baseline then sampled for 60s during slowness.

Although the application's functionality was fine during both samples, the TCP/IP traffic showed big differences. The slow traffic had lots of re-transmissions and "TCP ACKed lost segment" messages in the log.

To filter them in Wireshark, use:

tcp.analysis.lost_segment || tcp.analysis.retransmission

(Incidentally, another interesting filter is:

frame.time_delta > 0.1

which shows packets that had a gap between them of 0.1s - which is ages in the traffic world. Another is

tcp.window_size == 0

which shows your buffers filling up)

In our 60s capture on the client side, there were some 2 million frames captured. Of these, some 83 000 we re-transmissions or lost segments. This is 4% of our traffic and seems enough to slow our connections noticeably.

We've been assured by the hardware guys everything is fine. But we do share a VLAN (that is, we share physical networks with other teams) and that may be exhausted. This piece of kit is a Cisco Nexus 5k.

The mystery continues.

Saturday, May 10, 2014

CPU affinity and Performance

The Problem

I have a single-threaded application (my play-project, JStringServer, running BatchedServerMain) that is very CPU and IO heavy and I want to improve its performance and stop it from jittering.

By running htop, I notice that it jumps from CPU to CPU. One moment it's on CPU 6...

the next, it's on CPU 11:

The number of serviced clients per second is all over the place (a mean of 9523 per second and a standard deviation of 864. Compare this with the results below).

The Solution

You might have read a lot about CPU affinity on the blogosphere (such as HighScalability and Mechanical Sympathy). Jumping between CPUs can be expensive and one tactic is to avoid it. Could it help me?

I've heard of proposals to allow this to be done within Java but there is nothing I know of that allows you to do it yet so I will have to do it via the OS.

CPU Affinity

You can pin processes to given CPUs quite easily on Linux (see this article for instructions). But which do we pin it to? Not all are the same...

I pin my process to each CPU and take timings by hitting the process from another box. These are the results I get measuring the number of clients serviced each second:

0 : mean = 10244.200 std dev = 47.298
1 : mean = 10253.600 std dev = 52.063
2 : mean = 10245.400 std dev = 48.202
3 : mean = 10204.800 std dev = 47.422
4 : mean = 10265.000 std dev = 54.902
5 : mean = 7807.400 std dev = 41.105
6 : mean = 10217.400 std dev = 51.623
7 : mean = 10274.000 std dev = 46.656
8 : mean = 10238.600 std dev = 62.919
9 : mean = 10290.600 std dev = 49.318
10 : mean = 10209.400 std dev = 52.044
11 : mean = 10194.800 std dev = 54.377
12 : mean = 10243.400 std dev = 52.321
13 : mean = 8874.000 std dev = 43.959
14 : mean = 9849.200 std dev = 49.794
15 : mean = 10352.000 std dev = 55.194

Hmm, three CPUs are slow, but especially CPU #5.

Let's look at my system. I'm using an Intel Xeon processor (E5-2687W) which has 8 cores and 16 hardware threads.

[henryp@corsair ~]$ numactl -H
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 size: 65480 MB
node 0 free: 60931 MB
node distances:
node 0
0: 10

This shows I have only one CPU and memory nodes so I don't have to worry about cross-socket NUMA problems (This is a good article about looking at NUMA and hardware).

OK, so what are all my CPUs in my socket doing?

[henryp@corsair 9dev]$ cat /proc/interrupts
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 CPU8 CPU9 CPU10 CPU11 CPU12 CPU13 CPU14 CPU15
0: 171 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 IO-APIC-edge timer
1: 3644 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 IO-APIC-edge i8042
4: 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 IO-APIC-edge
8: 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 IO-APIC-edge rtc0
9: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 IO-APIC-fasteoi acpi
12: 132544 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 IO-APIC-edge i8042
17: 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 IO-APIC-fasteoi firewire_ohci
23: 61 0 0 0 0 0 21 0 0 0 0 0 0 0 0 0 IO-APIC-fasteoi ehci_hcd:usb1, ehci_hcd:usb2
27: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 IO-APIC-fasteoi ath9k
32: 207183 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 IO-APIC-fasteoi nvidia
36: 179 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 IO-APIC-fasteoi snd_hda_intel
64: 27439 25657 81300 0 6288 32731 5618 45423 0 0 0 0 0 0 0 0 PCI-MSI-edge ahci
65: 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge ahci
66: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge xhci_hcd
67: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge xhci_hcd
68: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge xhci_hcd
69: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge xhci_hcd
70: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge xhci_hcd
71: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge xhci_hcd
72: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge xhci_hcd
73: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge xhci_hcd
74: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge xhci_hcd
75: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge xhci_hcd
76: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge xhci_hcd
77: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge xhci_hcd
78: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge xhci_hcd
79: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge xhci_hcd
80: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge xhci_hcd
81: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge xhci_hcd
82: 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge mei
83: 1576 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge em1
84: 757 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge snd_hda_intel
85: 14 0 0 0 0 424447 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge p7p1-rx-0
86: 19 0 0 0 0 0 216701 0 0 0 0 0 0 0 0 0 PCI-MSI-edge p7p1-tx-0
87: 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge p7p1
NMI: 332 254 231 207 211 310 218 219 403 303 194 228 280 382 329 281 Non-maskable interrupts
LOC: 418851 409978 386151 360594 358830 438636 351885 362193 292543 243019 179805 188224 223758 283426 257005 225531 Local timer interrupts
SPU: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Spurious interrupts
PMI: 332 254 231 207 211 310 218 219 403 303 194 228 280 382 329 281 Performance monitoring interrupts
IWI: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 IRQ work interrupts
RTR: 15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 APIC ICR read retries
RES: 1069726 92676 20752 8650 5003 4081 3523 2631 2234 2171 2467 1658 1974 1873 1849 1668 Rescheduling interrupts
CAL: 53667 72109 69200 68746 62293 82125 72462 71374 49909 51974 52292 54965 55904 52059 54509 54112 Function call interrupts
TLB: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 TLB shootdowns
TRM: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Thermal event interrupts
THR: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Threshold APIC interrupts
MCE: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Machine check exceptions
MCP: 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 Machine check polls
ERR: 0
MIS: 0

(Read this page to see how best to read this output).

CPU #5 seems to be handling an awful lot of interrupts from the network interface controller [NIC] for p7p1 (the name of my interface that the client is hitting).

"Unfortunately the default is to let all processors handle interrupts with the result that the overall performance is not optimal in particular on multi-core systems. This is because modern NICs can have multiple RX queues that work perfectly if cache coherency is respected. This means that interrupts for ethX-RX queue Y must be sent to 1 (one) core or at most to a core and its Hyper-Threaded (HT) companion. If multiple processors handle the same RX queue, this invalidates the cache and performance gets worse. For this reason IRQ balancing is the key for performance." [1]

OK, let's balance that puppy.

[root@corsair ~]# echo 15 > /proc/irq/85/smp_affinity ; echo 15 > /proc/irq/85/smp_affinity_list

This works for a few seconds and I see the same throughput levels as the other CPUs but then my Fedora 17 reverts it! How annoying. It doesn't do that on my Ubuntu 12.04 (admittedly on different hardware).

So, I use Tuna to see what's going on. This is how it looks before the test:

And during:

We can see CPU #5 is handling a lot of kernel interrupts.

Conclusion

It seems that this particular CPU is busy handling network IO and making my process perform badly if I pin it to it. This is probably a reason for the jitter as the kernel sometimes schedules my process to run on it.

But at the moment, I don't know how to stop the kernel from disabling network interrupts on this CPU. So, I'll just avoid it in the future.

[1] http://www.ntop.org/pf_ring/irq-balancing/

Agile Java Man