Monday, June 30, 2014

Quick Wins with the Garbage Collector


The problem

Recently, we upgraded our boxes from Linux 2.6.16-0.87.1-smp running on 16-processor Xeon E5620s to Linux 2.6.32-358.32.3.el6.x86_64 running on 24-processor Xeon E5-2620s (all processor counts include hyper-threading).

Unfortunately, our GC times became horrendous. Exactly what this was due to remains a mystery, but we were seeing events like this in the GC logs:

.
.
19948.548 [GC [PSYoungGen: 1386848K -> 5888K (1390080K)] 3481117K->2105461K(4186624K), 0.0306870 secs] [Times: user=0.18, sys=0.00, real =0.03 secs]
20077.869 [GC [PSYoungGen: 1387264K -> 5984K (1389568K)] 3486837K->2210813K(4186112K), 9.4264420 secs] [Times: user=4.05, sys=52.99, real =9.43 secs]
20131.745 [GC [PSYoungGen: 1387360K -> 5792K (1390080K)] 3492189K->2115622K(4186624K), 0.0271300 secs] [Times: user=0.16, sys=0.01, real =0.03 secs]
.
.

Whoa! What happened there? Nearly 10 seconds of GC pause times when all the others are fractions of a second?

Also, look where the time was spent - in kernel space (sys=52.99).

AlwaysPreTouch

Given that a lot of time is being spent in the kernel, it makes sense to look at flags that interact with it. One such flag is -XX:+AlwaysPreTouch. This touches every page of the heap at start-up so that it is all loaded into memory immediately. That is, there is an upfront cost.
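For example, a JVM started with something like the following (the heap sizes and jar name here are purely illustrative) will commit its whole heap before doing any real work:

java -Xms1024m -Xmx1024m -XX:+AlwaysPreTouch -jar myServer.jar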

If you run your JVM with this flag, you'll see the process taking its full amount of memory when you run the top command. That is, without it, you'll see:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                          
 6960 henryp    20   0 3312m  19m  10m S   16  0.3   0:00.97 java      

but with it:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                          
 6923 henryp    20   0 3312m 1.0g  10m S   16 14.0   0:02.40 java                                                                            

That is, the RESident space has filled to its assigned 1024m immediately (VIRTual space does not correspond to physical memory and can be ignored).            

Pre-touching pages

The JVM arg -XX:+AlwaysPreTouch causes this code in mutableSpace.cpp to execute when initializing MutableSpace:

void MutableSpace::pretouch_pages(MemRegion mr) {
  for (volatile char *p = (char*)mr.start(); p < (char*)mr.end(); p += os::vm_page_size()) {
    char t = *p; *p = t;
  }
}

(You can also find something similar in VirtualSpace::expand_by)

I'm not a C++ guru, but this appears to do nothing more than read a byte from each page of the region and write it straight back. Why does it do this? Well, by touching each memory page, the kernel is obliged to load it. Presumably, the pointer is volatile just in case the compiler thinks that this is pointless and optimizes it away.
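As a rough illustration of the same idea in Java (this is just a sketch, not what the JVM does internally; the 4K page size is an assumption, and since Java zeroes newly allocated memory anyway the pages here may already have been faulted in):

import java.nio.ByteBuffer;

public class PreTouchSketch {
    // Assumed page size; the real value comes from the OS (os::vm_page_size() in the hotspot code above)
    private static final int ASSUMED_PAGE_SIZE = 4096;

    public static void main(String[] args) {
        ByteBuffer buffer = ByteBuffer.allocateDirect(512 * 1024 * 1024); // 512MB off-heap buffer
        long start = System.nanoTime();
        // Touch one byte per page: read it and write it straight back, like the hotspot loop
        for (int i = 0; i < buffer.capacity(); i += ASSUMED_PAGE_SIZE) {
            buffer.put(i, buffer.get(i));
        }
        long tookMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("Touched " + (buffer.capacity() / ASSUMED_PAGE_SIZE)
                + " pages in " + tookMs + " ms");
    }
}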

This link talks about how lazily loading the pages can be more expensive when a lot of the pages end up being written to. With AlwaysPreTouch, that hit is taken when the JVM starts up rather than later, when the collector first writes to those pages.

The Garbage Collector and Network Performance


Our network woes continue.

Coherence has an MBean (Coherence:type=Node,nodeId=XXX) that measures intra-cluster packet loss using its own proprietary protocol, TCMP. In our old environment, this showed a 99.9+ per cent success rate. In our new environment, it was hovering around 50%.
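You can read this figure programmatically too. Here is a minimal JMX sketch (the service URL, the node id and the PublisherSuccessRate attribute name are assumptions from memory; browse the Node MBean in JConsole to confirm what your version exposes):

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class PacketLossCheck {
    public static void main(String[] args) throws Exception {
        // Assumed address of a JVM exposing the Coherence MBeans over RMI
        JMXServiceURL url =
            new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            // nodeId=1 and the attribute name are illustrative assumptions
            ObjectName node = new ObjectName("Coherence:type=Node,nodeId=1");
            Object successRate = mbs.getAttribute(node, "PublisherSuccessRate");
            System.out.println("Publisher success rate: " + successRate);
        }
    }
}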

Coherence also ships with a datagram test tool that measures packet loss within a network. Running this tool in our new environment showed no problems. So, it seemed the problem wasn't the network itself.

The only way I could reproduce the problem using this Datagram Tool was to set its buffers to unrealistically small sizes. But we were using Coherence's defaults. What was the problem?

Well, thinking about it, what could cause the buffers to overflow? One answer is that nobody was emptying them. This could be because of the garbage collector.

"Large inbound buffers can help insulate the Coherence network layer from JVM pauses that are caused by the Java Garbage Collector. While the JVM is paused, Coherence cannot dequeue packets from any inbound socket. If the pause is long enough to cause the packet buffer to overflow, the packet reception is delayed as the originating node must detect the packet loss and retransmit the packet(s)." [1]

Indeed, our garbage collection times are much higher on the new, improved hardware. How odd. (This is the subject of another post).

Rather than measure these packet loss statistics dynamically through JConsole, you can also have Coherence print a report every so often. You can do this via the MBean with the object name Coherence:type=Reporter,nodeId=YYY where YYY is your node of choice. Configuration is fairly self-explanatory and it gives you the chance to plot performance over time.

[1] Coherence Documentation.

Sunday, June 15, 2014

Lies, Damned Lies and Performance Statistics


A new way of measuring performance

I've introduced HdrHistogram to my pet project, JStringServer. The advantage of HdrHistogram is that it keeps the results in constant memory with constant access time. The only cost is some approximation of the results, and since I'm only interested in means, standard deviations etc., I can live with that.

HdrHistogram works by keeping an array of "buckets", each of which represents a range of values. Any given value that falls into a particular bucket merely increments that bucket's counter. This is much more memory efficient than, say, Apache's JMeter, which keeps every reading in memory.

The ranges of these buckets grow exponentially (well, there are sub-buckets, but let's keep this simple), so all the outliers are collected in just one or two buckets. This is very memory efficient as each bucket is just a Java long.
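For illustration, recording latencies with it looks something like this (a minimal sketch; the constructor arguments - highest trackable value and number of significant digits - are just plausible choices, not what JStringServer actually uses):

import org.HdrHistogram.Histogram;

public class LatencySketch {
    public static void main(String[] args) {
        // Track values up to 1 hour (in ms) with 3 significant digits of precision
        Histogram histogram = new Histogram(3_600_000L, 3);

        for (int i = 0; i < 100_000; i++) {
            long start = System.nanoTime();
            doSomeWork();                                  // hypothetical unit of work
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            histogram.recordValue(elapsedMs);              // constant time, constant memory
        }

        System.out.printf("mean = %.3f ms, std dev = %.3f ms, 99.9%%ile = %d ms%n",
                histogram.getMean(),
                histogram.getStdDeviation(),
                histogram.getValueAtPercentile(99.9));
    }

    private static void doSomeWork() {
        // stand-in for the real work being timed
    }
}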

Means, standard deviations and all that

Most performance tools like Apache's JMeter will give you a mean and a standard deviation. You might have even heard that 68% of results are within a standard deviation of the mean, 95% within 2 standard deviations, 99.7% within 3 etc (see here for the rule). But this is only true for a normal distribution (the typical, bell-shaped curve).

The time taken by a given measurement, independent of all others, will typically conform to a Poisson distribution. (See Mathematical Methods in the Physical Sciences by Mary Boas for a clear derivation of this formula from first principles.)
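For reference, a Poisson distribution with mean λ gives the probability of seeing exactly k events as:

P(k; λ) = λ^k e^(-λ) / k!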

Results

Typically, I see:

Initiated 412759 calls. Calls per second = 10275. number of errors at client side = 0. Average call time = 10ms
ThreadLocalStopWatch [name=read, totalCallsServiced=412659, max time = 630ms, Mean = 6.125236575477573, Min = 0, standard deviation 8.263969133274433, Histogram with stepsize = 1. Approximate values: 791, 1822, 3524, 15650, 143582, 307867, 243841, 81151, 18447, 4716, 1395, 493, 314, 195, 106, 97, 82, 66, 35, 10, 6, 7, 10, 7, 12, 14, 8, 7, 4, 2]
ThreadLocalStopWatch [name=write, totalCallsServiced=412719, max time = 26ms, Mean = 0.08618696982692825, Min = 0, standard deviation 0.327317028847167, Histogram with stepsize = 1. Approximate values: 411831, 33604, 771, 238, 80, 32, 17, 8, 1, 1, 2, 4, 5, 3, 0, 0, 1, 1, 1, 1, 0, 0, 1, 2, 5, 8, 4, 0, 0, 0]
ThreadLocalStopWatch [name=connect, totalCallsServiced=412720, max time = 3005ms, Mean = 4.17902209730568, Min = 0, standard deviation 65.54450133531354, Histogram with stepsize = 1. Approximate values: 408721, 117052, 2056, 852, 282, 114, 57, 30, 15, 12, 10, 10, 15, 10, 3, 5, 4, 3, 2, 1, 3, 4, 2, 0, 1, 1, 0, 1, 1, 0]
ThreadLocalStopWatch [name=total, totalCallsServiced=410682, max time = 66808064ns, Mean = 6295825.973268021, Min = 854104, standard deviation 1089344.7561186429, Histogram with stepsize = 1333333. Approximate values: 145, 837, 1596, 25512, 273990, 93021, 12099, 2283, 520, 273, 126, 77, 80, 55, 11, 9, 5, 9, 10, 11, 15, 5, 3, 2, 3, 3, 0, 0, 4, 1, 2]

For something that approximates a Poisson distribution, we'd expect the mean and standard deviation to be about the same. Since this is not true for the total time, perhaps this is not a Poisson distribution. The results for reading data do have these two values roughly the same, so let's look at those.

Let's see if the distribution of the values conforms to the Poisson distribution or even the Normal (aka Gaussian). In the R language:

require(graphics)

# Histogram of times. Each interval is 1ms starting from 1ms
xread <- c(791, 1822, 3524, 15650, 143582, 307867, 243841, 81151, 18447, 4716, 1395, 493, 314, 195, 106, 97, 82, 66, 35, 10, 6, 7, 10, 7, 12, 14, 8, 7, 4, 2)
readMean = 6.125236575477573
readSd   = 8.263969133274433
readNum  = 412659

par(mfrow = c(3, 1)) # we're looking at 3 graphs

n      = readNum
x      = xread
mean   = readMean
sd     = readSd

ac <- barplot(x, main="Actual Distribution", axes=TRUE) 
axis(1, at=ac, labels=TRUE)

i <- 1:length(x)

expectedPoisson = n * dpois(i, lambda=mean)
expectedNorm    = n * dnorm(i, mean=mean, sd=sd)

nm <- barplot(expectedNorm, main="Expected Normal Distribution")
axis(1, at=nm, labels=TRUE)

pn <- barplot(expectedPoisson, main="Expected Poisson Distribution")
axis(1, at=pn, labels=TRUE)

chisq.test(x, p=expectedPoisson, rescale.p=TRUE)
chisq.test(x, p=expectedNorm,    rescale.p=TRUE)

The last two lines test whether the actual distribution conforms to a Poisson (given a lambda value) or a Normal (given the mean and the standard deviation). They do this using the chi-squared test (basically, a comparison of the squared differences between what we expect and what we actually get).
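The statistic being computed is essentially:

χ² = Σ (observed - expected)² / expected

summed over all the buckets.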

For these particular results, the p-values from the chi-squared tests tell me they conform to neither a Poisson nor a Normal distribution (in fact, they could barely be further from them). Looking at the graphs, you can see this by eye:



So, what went wrong? Well, the Poisson distribution relies on events being independent of each other which is clearly not the case when there is contention for resources.

The Central Limit Theorem

(Aside):

The means of samples drawn from a distribution form a normal distribution, given enough samples, even if the original distribution is not itself normal.

Let's look at plotting the Poisson distribution in R:

mu=1; sigma=10   # mean, standard deviation
n=5000           # Number of iterations
xbar=rep(0,n)    # Holds the results of the iterations

for (i in 1:n) { 

  xbar[i]=mean(
    rpois(5000, mu) # num. random variables = 5000, mu = means
    ) 
}

par(mfrow = c(2, 1))


# Plot a typical Poisson distribution

hist(rpois(n, mu),prob=TRUE,breaks=15)

# Plot the results of n Poisson distributions

hist(xbar,prob=TRUE) #,xlim=c(70,130),ylim=c(0,0.1))

Gives this result:

A typical Poisson distribution and the mean of a Poisson over many iterations

This is quite nice as it means that if you run performance tests repeatedly, your means will approach a Normal Distribution, giving you some predictive power over them.


Monday, May 26, 2014

Debugging the JVM


This is very nice. I've built my own JDK on Linux and am stepping through the C++ code as if it were a normal Java debug session.



To do this:

1. Get the source code with:

hg clone http://hg.openjdk.java.net/jdk9/dev 9dev

2. Build with something like:

$ bash ./configure --with-debug-level=slowdebug --enable-debug-symbols ZIP_DEBUGINFO_FILES=0
$ make all CONF=linux-x86_64-normal-server-slowdebug ZIP_DEBUGINFO_FILES=0

or whichever CONF is appropriate to your system. This last step can take quite a long time (about 10 minutes).

The ZIP_DEBUGINFO_FILES setting is important (see here and here): it stops the debug symbols from being zipped up, leaving them in the built artifacts in a form the debugger can use.

[Note: sometimes, I saw errors like:

gmake[2]: *** [/home/henryp/Code/OpenJDK/9dev/build/linux-x86_64-normal-server-slowdebug/jdk/demo/jpda/examples.jar] Error 1
gmake[2]: *** Waiting for unfinished jobs....
Note: Some input files use or override a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
Note: Some input files use unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.
gmake[1]: *** [demos] Error 2
make: *** [demos-only] Error 2

which were worrying. When this happened, I just ran make again and it worked. Hmm.]

3. Install NetBeans 8.x. Although I am normally an Eclipse man, NetBeans is frankly better for C/C++ development.

4. Create a new C/C++ application in the hotspot directory. 

5. Choose a Java main class to run: right-click the project, open Properties and go to the "Run" category. Choose the class to run using the artifact that was built, eg something like:

"${OUTPUT_PATH}" -classpath /home/henryp/Workspaces/workspaceKepler/_MyTests/bin com.phenry.memory.ObjectCreatorMain

6. Insert your breakpoints wherever you wish and press the debug button. Note, if you are remote debugging and your system has been "hardened" for security reasons, you may have some trouble. Follow the instructions here (and the subsequent link) to turn off the security.

A Caveat

Finally, when debugging, you will be using Linux tools. I had problems on an Ubuntu 12.04 system as the tools were not up-to-date (the debugger acts erratically when debugging multi-threaded code). On my Fedora system, everything was fine. The versions I was using were:

[henryp@corsair 9dev]$ gdb -version
GNU gdb (GDB) Fedora (7.4.50.20120120-54.fc17)
.
.
[henryp@corsair 9dev]$ gcc --version
gcc (GCC) 4.7.2 20120921 (Red Hat 4.7.2-2)
.
.


Queues and Statistical Quirks


We use Oracle's Coherence Incubator in our project, in particular the Push Replication pattern to move documents from the primary site to a warm, disaster recovery site located 50 miles away. The Push Replication pattern in turn depends on the Messaging Pattern. This involves having in-memory queues of objects that are waiting to be published.

The trouble is, the distribution of documents across these queues is not even. As a result, some queues can be full (and close to causing an OutOfMemoryError, thus forcing us to reject client requests) while others are empty. We looked at the distribution function, but it seems to be more a quirk of probability.

Take this Scala code to demonstrate what I am talking about. Let's have a given number of buckets over which X documents are randomly distributed each iteration. Let's consume a fixed number from each bucket each iteration (the batch size). If there are fewer than the batch size in the bucket, we just take them all and that's that. We then let this run for an arbitrary number of iterations.

The simulation code looks like this:

package com.phenry.stats

import scala.util.Random

object PublisherDistributionMain {
    
    def main(args : Array[String]) = {
        val numBuckets    = args(0).toInt
        val numIterations = args(1).toInt
        val numNewDocs    = args(2).toInt
        val batchSize     = args(3).toInt
        var buckets       = new Array[Int](numBuckets)
        
        for (i <- 0 until numIterations) {
          populate(buckets, numNewDocs)
          buckets = depopulate(buckets, batchSize)
        }
        print(buckets)
    }
    
    def print(buckets : Array[Int]) : Unit = {
      val largest = max(buckets)

      buckets.map(_ * 50 / largest).foldLeft(0)((acc, elem) => {
        println("%2d : ".format(acc) + "#" * elem)
        acc + 1
      })
      
      val stats = new MeanAndStdDeviation
      println("\nmean = %.3f, std dev = %.3f".format(stats.mean(buckets), stats.stdDev(buckets)))
    }

    def max(buckets : Array[Int]) : Int = {
      buckets.reduce((x, y) => if (x > y) x else y)
    }
    
    def depopulate(buckets : Array[Int], batchSize : Int) : Array[Int] = {
      buckets.map(x => if (x < batchSize) 0 else (x - batchSize))
    }
    
    def populate(buckets : Array[Int], numNewDocs : Int) : Unit = {
      val random = new Random
      for (i <- 0 until numNewDocs) {
        val bucket = random.nextInt(buckets.length)
        buckets(bucket) = buckets(bucket) + 1
      }
    }
}

With a simple stats class (MeanAndStdDeviation) looking like this:

package com.phenry.stats

class MeanAndStdDeviation {
  
  implicit def integer2Double(x : Integer)          = x.doubleValue()
  implicit def integer2DoubleList(x : List[Int])    = x.map(i => i.doubleValue())
  implicit def integer2DoubleList(x : Array[Int])   = x.map(i => i.doubleValue())
  implicit def double2Integer(x : Double)           = x.intValue
  implicit def double2IntegerList(x : List[Double]) = x.map(d => d.intValue)
      
  val mean      = (x : Array[Int])      => sum(x) / x.length
  
  def sum(results : List[Double]) : Double = {
    results.foldRight(0d)((x, y) => x + y)
  }
    
  def stdDev(results : Array[Int]) : Double = { 
    val theMean = mean(results);
    var sumOfSquaredDiffs = 0d
    for (aResult <- results) sumOfSquaredDiffs += math.pow((aResult - theMean), 2)
    math.sqrt(sumOfSquaredDiffs / results.length)
  }
  
  def sum(results : Array[Double]) : Double = {
    results.reduce((x, y) => x + y)
  }
  
.
.

(Scala really is nice for writing succinct functions...)

The results after 100 000 iterations look like this:

 0 : ################
 1 : ###############################
 2 : ###############
 3 : ##############
 4 : #################
 5 : ####
 6 : ##########
 7 : ######################
 8 : ##############################
 9 : ##################
10 : ##############
11 : ############
12 : ################
13 : ##############
14 : ##############################
15 : ####
16 : ##################################################
17 : ################
18 : ###################
19 : #############

that is, very lumpy. Bucket #15 has a fraction of what bucket #16 has, for instance.

As a result, our system will reject users' requests, complaining that it does not have enough space, despite the fact that many nodes in the cluster have plenty of memory.

Time to write a new consumer algorithm...

Monday, May 12, 2014

How do you distinguish a slow application from a slow network?


We've been given new hardware on which to run our application. To make sure it would be OK, we ran our performance tests against it. To our surprise, they sometimes did poorly despite the new CPUs being much better. Using jstack showed that, at any particular time, nothing much was going on.

First we looked at playing with OS parameters but we had no luck. Then we looked at esoteric things like traffic shaping but found nothing.

So, time to crack open Wireshark. We took a 60s sample of network traffic when things were going well as a baseline, then sampled for another 60s during the slowness.

Although the application's functionality was fine during both samples, the TCP/IP traffic showed big differences. The slow traffic had lots of re-transmissions and "TCP ACKed lost segment" messages in the log.

To filter them in Wireshark, use:

tcp.analysis.lost_segment || tcp.analysis.retransmission

(Incidentally, another interesting filter is:

frame.time_delta > 0.1

which shows packets that had a gap of more than 0.1s between them - which is ages in network terms. Another is

tcp.window_size == 0

which shows your buffers filling up)

In our 60s capture on the client side, there were some 2 million frames captured. Of these, some 83 000 were re-transmissions or lost segments. That is about 4% of our traffic and seems enough to slow our connections noticeably.

We've been assured by the hardware guys that everything is fine. But we do share a VLAN (that is, we share physical networks with other teams) and that shared capacity may be exhausted. The piece of kit in question is a Cisco Nexus 5k.

The mystery continues.


Saturday, May 10, 2014

CPU affinity and Performance

The Problem

I have a single-threaded application (my play-project, JStringServer, running BatchedServerMain) that is very CPU and IO heavy and I want to improve its performance and stop it from jittering.

By running htop, I notice that it jumps from CPU to CPU. One moment it's on CPU 6...


the next, it's on CPU 11:



The number of serviced clients per second is all over the place (a mean of 9523 per second and a standard deviation of 864; compare this with the results below).

The Solution

You might have read a lot about CPU affinity on the blogosphere (such as HighScalability and Mechanical Sympathy). Jumping between CPUs can be expensive and one tactic is to avoid it. Could it help me?

I've heard of proposals to allow this to be done from within Java, but there is nothing I know of that lets you do it yet, so I will have to do it via the OS.

CPU Affinity

You can pin processes to given CPUs quite easily on Linux (see this article for instructions). But which CPU do we pin it to? Not all of them are the same...
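For example, launching the process pinned to CPU 3 looks something like this (MyMainClass stands in for whatever you normally run):

taskset -c 3 java MyMainClass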

I pin my process to each CPU in turn and take timings by hitting the process from another box. These are the results I get, measuring the number of clients serviced each second:

0 : mean =  10244.200 std dev =     47.298
1 : mean =  10253.600 std dev =     52.063
2 : mean =  10245.400 std dev =     48.202
3 : mean =  10204.800 std dev =     47.422
4 : mean =  10265.000 std dev =     54.902
5 : mean =   7807.400 std dev =     41.105
6 : mean =  10217.400 std dev =     51.623
7 : mean =  10274.000 std dev =     46.656
8 : mean =  10238.600 std dev =     62.919
9 : mean =  10290.600 std dev =     49.318
10 : mean =  10209.400 std dev =     52.044
11 : mean =  10194.800 std dev =     54.377
12 : mean =  10243.400 std dev =     52.321
13 : mean =   8874.000 std dev =     43.959
14 : mean =   9849.200 std dev =     49.794
15 : mean =  10352.000 std dev =     55.194

Hmm, three CPUs are slow, but especially CPU #5.

Let's look at my system. I'm using an Intel Xeon processor (E5-2687W) which has 8 cores and 16 hardware threads.

[henryp@corsair ~]$ numactl -H
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 size: 65480 MB
node 0 free: 60931 MB
node distances:
node   0
  0:  10

This shows I have only one CPU and memory node, so I don't have to worry about cross-socket NUMA problems. (This is a good article about looking at NUMA and hardware.)

OK, so what are all my CPUs in my socket doing?

[henryp@corsair 9dev]$ cat /proc/interrupts
            CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7       CPU8       CPU9       CPU10      CPU11      CPU12      CPU13      CPU14      CPU15      
   0:        171          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   IO-APIC-edge      timer
   1:       3644          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   IO-APIC-edge      i8042
   4:          2          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   IO-APIC-edge    
   8:          1          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   IO-APIC-edge      rtc0
   9:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   IO-APIC-fasteoi   acpi
  12:     132544          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   IO-APIC-edge      i8042
  17:          1          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   IO-APIC-fasteoi   firewire_ohci
  23:         61          0          0          0          0          0         21          0          0          0          0          0          0          0          0          0   IO-APIC-fasteoi   ehci_hcd:usb1, ehci_hcd:usb2
  27:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   IO-APIC-fasteoi   ath9k
  32:     207183          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   IO-APIC-fasteoi   nvidia
  36:        179          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   IO-APIC-fasteoi   snd_hda_intel
  64:      27439      25657      81300          0       6288      32731       5618      45423          0          0          0          0          0          0          0          0   PCI-MSI-edge      ahci
  65:          5          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   PCI-MSI-edge      ahci
  66:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   PCI-MSI-edge      xhci_hcd
  67:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   PCI-MSI-edge      xhci_hcd
  68:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   PCI-MSI-edge      xhci_hcd
  69:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   PCI-MSI-edge      xhci_hcd
  70:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   PCI-MSI-edge      xhci_hcd
  71:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   PCI-MSI-edge      xhci_hcd
  72:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   PCI-MSI-edge      xhci_hcd
  73:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   PCI-MSI-edge      xhci_hcd
  74:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   PCI-MSI-edge      xhci_hcd
  75:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   PCI-MSI-edge      xhci_hcd
  76:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   PCI-MSI-edge      xhci_hcd
  77:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   PCI-MSI-edge      xhci_hcd
  78:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   PCI-MSI-edge      xhci_hcd
  79:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   PCI-MSI-edge      xhci_hcd
  80:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   PCI-MSI-edge      xhci_hcd
  81:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   PCI-MSI-edge      xhci_hcd
  82:         10          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   PCI-MSI-edge      mei
  83:       1576          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   PCI-MSI-edge      em1
  84:        757          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   PCI-MSI-edge      snd_hda_intel
  85:         14          0          0          0          0     424447          0          0          0          0          0          0          0          0          0          0   PCI-MSI-edge      p7p1-rx-0
  86:         19          0          0          0          0          0     216701          0          0          0          0          0          0          0          0          0   PCI-MSI-edge      p7p1-tx-0
  87:          2          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   PCI-MSI-edge      p7p1
 NMI:        332        254        231        207        211        310        218        219        403        303        194        228        280        382        329        281   Non-maskable interrupts
 LOC:     418851     409978     386151     360594     358830     438636     351885     362193     292543     243019     179805     188224     223758     283426     257005     225531   Local timer interrupts
 SPU:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Spurious interrupts
 PMI:        332        254        231        207        211        310        218        219        403        303        194        228        280        382        329        281   Performance monitoring interrupts
 IWI:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   IRQ work interrupts
 RTR:         15          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   APIC ICR read retries
 RES:    1069726      92676      20752       8650       5003       4081       3523       2631       2234       2171       2467       1658       1974       1873       1849       1668   Rescheduling interrupts
 CAL:      53667      72109      69200      68746      62293      82125      72462      71374      49909      51974      52292      54965      55904      52059      54509      54112   Function call interrupts
 TLB:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   TLB shootdowns
 TRM:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Thermal event interrupts
 THR:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Threshold APIC interrupts
 MCE:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Machine check exceptions
 MCP:         12         12         12         12         12         12         12         12         12         12         12         12         12         12         12         12   Machine check polls
 ERR:          0
 MIS:          0

(Read this page to see how best to read this output).

CPU #5 seems to be handling an awful lot of interrupts from the network interface controller [NIC] for p7p1 (the name of my interface that the client is hitting).

"Unfortunately the default is to let all processors handle interrupts with the result that the overall performance is not optimal in particular on multi-core systems. This is because modern NICs can have multiple RX queues that work perfectly if cache coherency is respected. This means that interrupts for ethX-RX queue Y must be sent to 1 (one) core or at most to a core and its Hyper-Threaded (HT) companion. If multiple processors handle the same RX queue, this invalidates the cache and performance gets worse. For this reason IRQ balancing is the key for performance." [1]

OK, let's balance that puppy.

[root@corsair ~]# echo 15 > /proc/irq/85/smp_affinity ; echo 15 > /proc/irq/85/smp_affinity_list 

This works for a few seconds - I see the same throughput levels as on the other CPUs - but then my Fedora 17 reverts it! How annoying. It doesn't do that on my Ubuntu 12.04 box (admittedly on different hardware).

So, I use Tuna to see what's going on. This is how it looks before the test:


And during:


We can see CPU #5 is handling a lot of kernel interrupts.

Conclusion

It seems that this particular CPU is busy handling network IO, which makes my process perform badly if I pin it there. This is probably also a reason for the jitter, as the kernel sometimes schedules my process to run on it anyway.

But at the moment, I don't know how to permanently stop the kernel from handling network interrupts on this CPU (my change keeps getting reverted). So, I'll just avoid that CPU in the future.

[1] http://www.ntop.org/pf_ring/irq-balancing/