Sunday, November 24, 2013

Bufferbloat: less is more

You would have thought that increasing buffer sizes was always a good thing, right? Wrong.

You would have thought that reducing load on a system would always make it faster, right? Also wrong.

When stress testing our code in an Oracle lab in Edinburgh, we noticed that increasing the load on the system increased throughput. Independently, on totally different software (nothing in common other than being written in Java, with some of it running on Linux), I saw the same thing on my home network.

In both cases, a large network buffer size and low load was the problem. At home, I saw this:

Initiated 7855 calls. Calls per second = 846. number of errors at client side = 0. Average call time = 81ms
Initiated 9399 calls. Calls per second = 772. number of errors at client side = 0. Average call time = 89ms
Initiated 10815 calls. Calls per second = 708. number of errors at client side = 0. Average call time = 96ms
.
.

and so on, until I started a second machine hitting the same single-threaded process, whereupon performance shot up:

Initiated 18913 calls. Calls per second = 771. number of errors at client side = 0. Average call time = 107ms
Initiated 21268 calls. Calls per second = 1177. number of errors at client side = 0. Average call time = 105ms
Initiated 24502 calls. Calls per second = 1617. number of errors at client side = 0. Average call time = 99ms
Initiated 29802 calls. Calls per second = 2650. number of errors at client side = 0. Average call time = 88ms
Initiated 34192 calls. Calls per second = 2195. number of errors at client side = 0. Average call time = 82ms
Initiated 39558 calls. Calls per second = 2683. number of errors at client side = 0. Average call time = 77ms

How odd - more load on the server means better throughput.

I was reading about bufferbloat on various websites, including Jim Gettys's excellent blog [1], where he writes extensively on the topic. He says:

"... bloat occurs in multiple places in an OS stack (and applications!). If your OS TCP implementation fills transmit queues more than needed, full queues will cause the RTT to increase, etc. , causing TCP to misbehave."

Inspired by this, I added the following to my code:

        // assuming: import static java.net.StandardSocketOptions.SO_RCVBUF;
        serverSocketChannel.setOption(SO_RCVBUF, 4096);

before binding the channel to an address, and the problem went away (the default value for this option was about 128KB on my Linux box).
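For context, this is roughly the shape of the whole setup (a minimal sketch, not the original code; the port number is illustrative). The point is that the option is set before bind, so the small window is what gets advertised from the handshake onwards:

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.net.StandardSocketOptions;
    import java.nio.channels.ServerSocketChannel;

    public class SmallBufferServer {
        static ServerSocketChannel openWithSmallReceiveBuffer(int port) throws IOException {
            ServerSocketChannel serverSocketChannel = ServerSocketChannel.open();
            // deliberately small receive buffer: keeps the advertised window, and so the queue, short
            serverSocketChannel.setOption(StandardSocketOptions.SO_RCVBUF, 4096);
            serverSocketChannel.bind(new InetSocketAddress(port));
            return serverSocketChannel;
        }
    }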

Note that although this looks like a very small number, there is no fear of a buffer overrun.

"The TCP socket received buffer cannot overflow because the peer is not allowed is not allowed to send data beyond the advertised window. This is TCP's flow control" [2].

Curious to see why reducing the buffer size helps, I tried sizes of 512, 1024, 2048 and so on up to 65536 bytes while running

sudo tcpdump -nn -i p7p1 '(tcp[13] & 0xc0 != 0)'

which according to [3] should show me when the network experiences congestion: the filter matches packets with the ECN-related CWR or ECE flags set (p7p1 is the name of my network interface, by the way).

The first value of SO_RCVBUF at which poor initial performance was encountered was 8192 bytes. Interestingly, as soon as the second client started hitting the server, tcpdump started spewing output like:

17:54:28.620932 IP 192.168.1.91.59406 > 192.168.1.94.8888: Flags [.W], seq 133960115:133961563, ack 2988954847, win 33304, options [nop,nop,TS val 620089208 ecr 15423967], length 1448
17:54:28.621036 IP 192.168.1.91.59407 > 192.168.1.94.8888: Flags [.W], seq 4115302724:4115303748, ack 2823779942, win 33304, options [nop,nop,TS val 620089208 ecr 15423967], length 1024
17:54:28.623174 IP 192.168.1.65.51628 > 192.168.1.94.8888: Flags [.W], seq 1180366676:1180367700, ack 1925192901, win 8688, options [nop,nop,TS val 425774544 ecr 15423967], length 1024
17:54:28.911140 IP 192.168.1.91.56440 > 192.168.1.94.8888: Flags [.W], seq 2890777132:2890778156, ack 4156581585, win 33304, options [nop,nop,TS val 620089211 ecr 15424257], length 1024

What can we make of this? Well, it appears that the bigger the buffer, the longer a packet can stay in the receiver's queue, as Gettys informs us [1]. The longer it stays in the queue, the longer the round trip time (RTT). The longer the RTT, the worse the sender thinks the congestion is, as it doesn't differentiate between time lost on the network and time stuck in a bloated, stupid FIFO queue (the RTT feeds into the congestion control calculations [4]).

Given a small buffer, the receiver will, at a much lower threshold, tell the sender not to transmit any more packets [2]. Thus the queue is smaller and less time is spent in it. As a result, the RTT is low and the sender believes the network to be congestion-free and is inclined to send more data.

Given a larger buffer but greater competition for it (from the second client), the available space in the buffer is reduced, so from the client's point of view things look very similar to the scenario described in the previous paragraph.

It appears that the Linux community are wise to this and have taken countermeasures [5].

[1] JG's Ramblings.
[2] Unix Network Programming, Stevens et al., pp. 58, 207.
[3] Wikipedia.
[4] RFC 5681.
[5] TCP Small Queues.

To Nagle or Not

Some TCP properties Java programmers have control over, others they don't. One of them is Socket.setTcpNoDelay, which controls Nagle's algorithm (note: calling setTcpNoDelay(true) turns the algorithm off). Basically, Nagle's algorithm tells the OS to batch small outgoing packets rather than send each one immediately.
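For reference, this is all it takes to flip the flag on a client socket (a sketch; the address is just the server from my test):

    Socket socket = new Socket("192.168.1.94", 8888); // java.net.Socket
    socket.setTcpNoDelay(true);   // true  = turn Nagle's algorithm OFF: small writes are sent immediately
    socket.setTcpNoDelay(false);  // false = leave Nagle's algorithm ON: the OS may coalesce small writes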

When you should turn it on or off depends very much on what you are trying to do [1]. Jetty sets no delay to true (that is, turns the algorithm off):

Phillips-MacBook-Air:jetty-all-8.1.9.v20130131 phenry$ grep -r setTcpNoDelay .
./org/eclipse/jetty/client/SelectConnector.java:            channel.socket().setTcpNoDelay(true);
./org/eclipse/jetty/client/SocketConnector.java:        socket.setTcpNoDelay(true);
./org/eclipse/jetty/server/AbstractConnector.java:            socket.setTcpNoDelay(true);
./org/eclipse/jetty/server/handler/ConnectHandler.java:            channel.socket().setTcpNoDelay(true);
./org/eclipse/jetty/websocket/WebSocketClient.java:        channel.socket().setTcpNoDelay(true);

Playing around with my own simple server, I experimented with this value. I set up a single thread on a 16-core Linux box, using Java NIO, to service requests from two MacBooks, each running 100 threads that use normal blocking IO and send 10 010 bytes of data per call (the server replies with a mere 2 bytes).

Setting the algorithm on or off on the server made no discernible difference. Not surprising, as the 2-byte reply is (probably) going to travel in a single packet anyway. But calling socket.setTcpNoDelay(false) on the clients showed a marked improvement. Using a MacBook Pro (2.66GHz, Intel Core 2 Duo) as the client, the results looked like:

socket.setTcpNoDelay(true)

Mean calls/second:      3884
Standard Deviation:     283
Average call time (ms): 29

socket.setTcpNoDelay(false)

Mean calls/second:      5060
Standard Deviation:     75
Average call time (ms): 20

Your mileage may vary.

The big difference was the time it took to call SocketChannel.connect(...). This dropped from 20 to 13 ms.

As an aside, you can see Linux's network buffers filling up with something like:

[henryp@corsair ~]$ cat /proc/net/tcp | grep -i 22b8 # where 22b8 is port 8888 on which I am listening
   2: 5E01A8C0:22B8 00000000:0000 0A 00000000:00000012 02:0000000D 00000000  1000        0 1786995 2 ffff880f999f4600 99 0 0 10 -1                   
   3: 5E01A8C0:22B8 4101A8C0:F475 03 00000000:00000000 01:00000062 00000000  1000        0 0 2 ffff880f8b4fce80                                      
   4: 5E01A8C0:22B8 4101A8C0:F476 03 00000000:00000000 01:00000062 00000000  1000        0 0 2 ffff880f8b4fcf00 
.
.
  76: 5E01A8C0:22B8 5B01A8C0:D3A5 01 00000000:0000171A 00:00000000 00000000  1000        0 4035262 1 ffff880fc3ff8700 20 3 12 10 -1                  
  77: 5E01A8C0:22B8 5B01A8C0:D3AD 01 00000000:0000271A 00:00000000 00000000     0        0 0 1 ffff880fc3ffb100 20 3 12 10 -1                        
  78: 5E01A8C0:22B8 5B01A8C0:D3B4 01 00000000:00000800 00:00000000 00000000     0        0 0 1 ffff880e1216bf00 20 3 12 10 -1                        
  79: 5E01A8C0:22B8 5B01A8C0:D3A8 01 00000000:0000271A 00:00000000 00000000     0        0 0 1 ffff880fc3ff9500 20 3 12 10 -1                        
  80: 5E01A8C0:22B8 5B01A8C0:D3AC 01 00000000:0000271A 00:00000000 00000000     0        0 0 1 ffff880fc3ffe200 20 3 12 10 -1                        
  81: 5E01A8C0:22B8 5B01A8C0:D3B3 01 00000000:0000271A 00:00000000 00000000     0        0 0 1 ffff880e12169c00 20 3 12 10 -1                        
  82: 5E01A8C0:22B8 4101A8C0:F118 01 00000000:00000000 00:00000000 00000000  1000        0 4033066 1 ffff880e1216d400 20 3 0 10 -1 

Note that 271A is hex for 10010 - the size of our payload - sitting in the receive queue column.

[1] ExtraHop blog.

Sunday, November 10, 2013

A Comparison of Simple Locking Strategies

I've been playing with locks again and made a comparison between 5 naive implementations. Putting aside arguments about how valuable microbenchmarks are, the results are interesting.

In each case, there were as many threads as CPUs on my machine (a 16 core Linux box) and all each thread does is acquire the lock, increment a shared counter and release the lock, over and over until the counter reaches a target value.

Each test was run twice with the first result thrown away to allow the JVM to warm up. There was also a 4 second pause before each run to avoid biased locking issues. Averages were taken over 10 runs of 2 000 000 iterations.
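A timing harness along these lines would do the job (my sketch rather than the original code; each strategy below is wrapped in a Runnable that increments the shared counter until it reaches the target):

    static long timeOneRun(Runnable strategy, int nThreads) throws InterruptedException {
        Thread.sleep(4000);                        // sidestep biased-locking start-up effects
        Thread[] threads = new Thread[nThreads];
        for (int i = 0; i < nThreads; i++) {
            threads[i] = new Thread(strategy);
        }
        long start = System.currentTimeMillis();
        for (Thread t : threads) t.start();
        for (Thread t : threads) t.join();
        return System.currentTimeMillis() - start; // first timed run thrown away; average the next 10
    }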

These are the five strategies:

SynchronizedIncrementer

Perhaps the simplest implementation is:

            synchronized (this) {
                counter++;
            }

a pessimistic lock approach.

AtomicIncrementer

Next is an optimistic lock approach using an AtomicInteger:

            boolean set = false;
            while (!set) {
                Integer expected = counter.get();
                set = counter.compareAndSet(expected, new Integer(expected + 1));
            }

LockingIncrementer

Next is using Java's concurrent classes:

            lock.lock();
            try {
                counter++;
            } finally {
                lock.unlock();
            }

SpinLock

The next spins until a flag allows it to proceed.

            while (!atomicBoolean.compareAndSet(false, true)) {
                Thread.yield();
            }
            
            counter++;
            int myCounter = counter;
            
            if (!atomicBoolean.compareAndSet(true, false)) {
                throw new IllegalStateException();
            }

AtomicReferenceIncrementer

This uses an AtomicReference that holds an Integer and is otherwise similar to the AtomicIncrementer.
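Something like this (my sketch, not the original code; counter is a java.util.concurrent.atomic.AtomicReference):

        AtomicReference<Integer> counter = new AtomicReference<Integer>(0);

        boolean set = false;
        while (!set) {
            Integer expected = counter.get();
            set = counter.compareAndSet(expected, Integer.valueOf(expected + 1));
        }

Note that compareAndSet on an AtomicReference compares by reference, which works here because expected is the exact object we just read from the reference.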

The Results

There isn't a huge amount between the strategies in this particular test:

                              Mean (ms) Standard Deviation
                              ========  ==================
AtomicIncrementer             2170      149.626535
SynchronizedIncrementer       2475      53.230630 
AtomicReferenceIncrementer    3319      275.069809
SpinLock                      3519      713.212030
LockingIncrementer            3690      244.545292

On my hardware at least, the optimistic AtomicInteger approach is fastest with the synchronized block offering the most predictable performance. However, there is not much between them.

The interesting thing is that if you run the same test with just one thread, the results typically look like this:

Time took 7ms  for [SynchronizedIncrementer, counter = 2000000]
Time took 20ms for [AtomicIncrementer, counter = 2000000]
Time took 21ms for [AtomicReferenceIncrementer, counter = 2000000]
Time took 23ms for [SpinLock, counter = 2000000]
Time took 29ms for [LockingIncrementer, counter = 2000000]

Much faster and it's doing exactly the same amount of work!

Conclusion

Avoid multi-threading if you can help it. Use it only when it demonstrably speeds things up. Even then, try to architect your system so there is no contention in the first place.

Saturday, November 9, 2013

Journeys in Networks

After stress testing my pet project, JStringServer, I initially thought I'd made a big boo-boo as the performance was appalling. However, it turned out that my home router was not up to the job. So, on a friend's recommendation, I bought a TP-Link 5-Port Gigabit Desktop Switch. Performance was better but not that great. A quick Google showed I needed Cat 6 cables to make full use of it. D'oh.

OK, so after a few trips to my local electronics store, I set up a newish 1.8GHz i7 MacBook Air (with a USB network adaptor) and an old 2009 MacBook Pro to hammer my 16-core Linux desktop running my Java NIO server code.

The strategy JStringServer was using was one-thread-does-everything (code here). That is, one thread listens to the selector associated with the server socket, registers any clients who have connected with a second selector dedicated to clients, checks this second selector for any activity and services them. Although htop showed this thread to be very busy, the rest of the system was doing next to nothing.
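In outline, the loop looks something like this (a much-simplified sketch, not JStringServer's actual code; it replies after any read rather than assembling the whole request):

    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.nio.channels.ServerSocketChannel;
    import java.nio.channels.SocketChannel;
    import java.util.Iterator;

    public class SingleThreadedServerSketch {
        public static void main(String[] args) throws Exception {
            Selector acceptSelector = Selector.open();   // watches the listening socket
            Selector clientSelector = Selector.open();   // watches connected clients

            ServerSocketChannel server = ServerSocketChannel.open();
            server.configureBlocking(false);
            server.bind(new InetSocketAddress(8888));
            server.register(acceptSelector, SelectionKey.OP_ACCEPT);

            ByteBuffer buffer = ByteBuffer.allocate(16 * 1024);
            while (true) {
                // 1. move any newly connected clients onto the client selector
                if (acceptSelector.selectNow() > 0) {
                    Iterator<SelectionKey> keys = acceptSelector.selectedKeys().iterator();
                    while (keys.hasNext()) {
                        keys.next();
                        keys.remove();
                        SocketChannel client = server.accept();
                        if (client != null) {
                            client.configureBlocking(false);
                            client.register(clientSelector, SelectionKey.OP_READ);
                        }
                    }
                }
                // 2. service any clients that have sent us data
                if (clientSelector.selectNow() > 0) {
                    Iterator<SelectionKey> keys = clientSelector.selectedKeys().iterator();
                    while (keys.hasNext()) {
                        SelectionKey key = keys.next();
                        keys.remove();
                        SocketChannel client = (SocketChannel) key.channel();
                        buffer.clear();
                        if (client.read(buffer) < 0) {          // client has gone away
                            key.cancel();
                            client.close();
                        } else {                                // real code would assemble the full 10 010 bytes first
                            client.write(ByteBuffer.wrap(new byte[] { 'O', 'K' }));
                        }
                    }
                }
            }
        }
    }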

The clients were averaging about 6 000 calls per second between them. Now, with a back-of-a-beer-mat calculation, a payload of about 10 000 bytes (ignoring the 2 bytes returned from the server) and 6 000 calls per second, the network was carrying something like 480 megabits/second (10 000 * 6 000 * 8 / 1 000 000 ≈ 480 Mbit/s). Not bad, but why not better?

TcpDump

Since JStringServer currently uses just TCP, it turns out that there is a lot of overhead on the network acknowledging the packets the client sends to the server.

If we run tcpdump and capture its output thus:

$ sudo tcpdump -nn host 192.168.1.94 and 192.168.1.65 -i p7p1 > tcpdump_jstringserver_2machines_normalUse.txt 

we see that almost as many packets are coming back from the server (192.168.1.94) as are going to it:

$ grep -c "192.168.1.94.8888 >" tcpdump_jstringserver_2machines_normalUse.txt 
1996027
$ grep -c "> 192.168.1.94.8888" tcpdump_jstringserver_2machines_normalUse.txt 
2005298

So, the figure of 480 megabits/second seems to be as good as we're going to get on this particular hardware using TCP (2 * 480 Mbit/s is roughly the 1 gigabit limit).

The return packets that carry the acknowledgement can also carry data [1]. These show up in tcpdump as [P.], where P stands for a PUSH of data and '.' represents an acknowledgement [2]. But since, in this particular example, our server replies with very terse responses to very verbose requests, this doesn't save us much. A lot of packets are wasted just acknowledging:

$ grep -c -P "192.168.1.94.8888 \>.* \[\.\], ack \\d\\d" tcpdump_jstringserver_2machines_normalUse.txt 
1427585

That's about 70% of all traffic from the server to the client.

Another problem with TCP is that the handshake uses a lot of packets (as a percentage of the total packets used in a connection).

For SYN:

$ grep -c " \[S\]" tcpdump_jstringserver_2machines_normalUse.txt 
120675

For SYN-ACK:

$ grep -c " \[S\.\]" tcpdump_jstringserver_2machines_normalUse.txt 
118371

and for ACK (handshake only):

$ grep -c -P "\[\.\], ack 1," tcpdump_jstringserver_2machines_normalUse.txt 
113403

That totals 17% of the total traffic. In this particular example, connection pooling would solve this.

[1] Does TCP send a SYN-ACK on every packet or only on the first connection? StackOverflow.
[2] tcpdump man pages.


Sunday, November 3, 2013

The other side of the Fence

Memory barriers and memory fences are often mentioned in the literature on the Java Memory Model, but what exactly are they?

Firstly, they are synonyms:

"To prevent the reordering of operations resulting from write buffering, modern architectures provide a special memory barrier instruction (sometimes called a memory fence) that forces outstanding operations to take effect. It is the programmers responsibility to know where to insert a memory barrier.... Not surprisingly, memory barriers are expensive, about as expensive as an atomic compareAndSet() instruction... In fact, synchronization instructions such as getAndSet() or compareAndSet() described in earlier chapters include a memory barrier on many architectures, as do reads and writes to volatile fields." [1]

It's interesting that compareAndSet is regarded as expensive, since most Java programmers I know seem to think it is more efficient (although this is not borne out on the hardware I've been playing with, where it appears to be comparable to using synchronized blocks). This could be why Java allows you to change these values without incurring the full ordering costs (see weakCompareAndSet and lazySet in the API JavaDocs).
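For example (an illustrative fragment using java.util.concurrent.atomic.AtomicLong), the relaxed variants look like this:

        AtomicLong counter = new AtomicLong();
        counter.set(1L);                                     // ordinary volatile write: full ordering guarantees
        counter.lazySet(2L);                                 // cheaper store: other threads will see it eventually, no full fence
        boolean swapped = counter.weakCompareAndSet(2L, 3L); // may fail spuriously and imposes no ordering guarantees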

Secondly, not all memory barriers are the same. Doug Lea categorises them [2] and says:

"A property of memory barriers that takes some getting used to is that they apply BETWEEN memory accesses. Despite the names given for barrier instructions on some processors, the right/best barrier to use depends on the kinds of accesses it separates. [The Java abstractions] map pretty well to specific instructions (sometimes no-ops) on existing processors:"

Of course, one way to massively improve performance is not to contend for a shared value in the first place. Marc Brooker presents some nice evidence [3] where parallelizing code dramatically slows it down because of all the contention. He also gives a good demonstration of interpreting the sometimes esoteric results from perf stat. This might not come as a surprise to readers of Martin Thompson's blog [4], where he advocates the Single Writer Principle.

Further Reading

The Fences class JavaDocs (documentation only, not planned for any release).

Doug Lea chatting about the JMM and what falls outside of it here.

[1] The Art of Multiprocessor Programming.

[2] Doug Lea's The JSR-133 Cookbook for Compiler Writers.

[3] Marc Brooker's blog.

[4] Martin Thompson's blog.

Saturday, November 2, 2013

Grid Testing in a single JVM!

This week, I have been playing with a great, free, open source project called LittleGrid. You can run a whole Coherence cluster in one JVM, stopping and starting members with a single method call to emulate failover. This makes running tests as part of a continuous build process very easy and very nice.

It does all this cleverness by having a different class loader for each instance. This can cause some confusion when you see messages that basically say: ClassCastException: cannot cast class YourClass to class YourClass. Huh? Well, of course, a class is identified by its class loader as well as its fully qualified name.

You can get around this by reflectively instantiating a helper class using a cluster member's class loader, as in the sketch below. This is how we configured a mock Spring framework for all the cluster members.
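A minimal sketch of that trick (the helper class name is made up, and how you obtain the member's class loader depends on your LittleGrid setup):

    import java.lang.reflect.Method;

    public class MemberHelperBootstrap {
        /** Instantiates a helper inside the member's class loader and calls a no-arg
            configuration method on it, all reflectively, so we never cast across class loaders. */
        public static Object configure(ClassLoader memberClassLoader) throws Exception {
            Class<?> helperClass = Class.forName("com.example.MockSpringConfigurer", true, memberClassLoader);
            Object helper = helperClass.getDeclaredConstructor().newInstance();
            Method configureMethod = helperClass.getMethod("configure");
            configureMethod.invoke(helper);
            return helper;
        }
    }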

Since I am relatively new to Coherence, it was gratifying to sanity check some of its features. For instance, in Coherence you can add a map entry using a normal put:

import com.tangosol.net.CacheFactory;
import com.tangosol.net.NamedCache;
.
.
.
    NamedCache cache = CacheFactory.getCache(CACHE_NAME);
    cache.put(key, value);

Or you could add something by invoking an Entry Processor (Coherence's equivalent of a database's stored procedure):

    EntryProcessor entryProcessor = new MyEntryProcessor(key, value); 
    Object         returned       = cache.invoke(key, entryProcessor); 

where my entry processor looks something like this:

class MyEntryProcessor implements Serializable, EntryProcessor {
.
.
.
    public Object process(Entry entry) { 
        BackingMapManagerContext    context     = getContext(entry); 
        Map         myCache     = context.getBackingMap(CACHE_NAME); 
        Binary                      binaryKey   = (Binary) context.getKeyToInternalConverter().convert(myKey); 
        Binary                      binaryValue = (Binary) context.getValueToInternalConverter().convert(myValue); 
        myCache.put(binaryKey, binaryValue); 
        return null;
    }

    protected BackingMapManagerContext getContext(Entry entry) {
        BinaryEntry                 binaryEntry = (BinaryEntry) entry;
        BackingMapManagerContext    context     = binaryEntry.getContext();
        return context;
    }
.
.
.

By judicious use of breakpoints, I can show that the thread that executes the entry processor blocks the put method call.

This is important in our project as we have code that extends com.tangosol.net.cache.LocalCache and overrides the put method to add some magic sauce. This is a bit nasty as it's not a good separation of concerns and we're looking at refactoring it out. But there was a concern that the two threads might introduce a race condition. Thankfully, it appears they cannot.

[A cleaner design might have been to use listeners on the cache but, in the early days of our using Coherence, the team didn't know which threads executed these listeners.

"A backing map listener ... is nothing more than a class that implements the MapListener interface. [T]hey are executed on the cache service thread which imposes a certain set of requirements and limitations on them.

"For one, just like entry processors, backing map listeners are not allowed to make a re-entrant call into the cache service that they are part of. That means that you cannot access from them any cache that belongs to the same cache service.

"Second, because they are executed synchronously on a cache service thread, it is of paramount importance that you do not do anything time consuming within the even handler... If you need to do anything that might take longer, you need to delegate it to Invocation Service, Work Manager or an external system.

"Finally, because backing map listeners are essentially the same mechanism that is used internally for backup of cache entries, the MapEvent instance they receive are not quite what you would expect and calls to getKey, getOldValue and getNewValue will return values in internal, serialized binary format."

- From Oracle Coherence 3.5].

Testing failover is much easier in LittleGrid:

int memberId = ...
ClusterMember clusterMember = memberGroup.getClusterMember(memberId);
clusterMember.shutdown();

which also gives us an opportunity to see data jumping from the backup store into the LocalCache. By breakpointing the overridden put method, you can see that this is how a node adds the data it was backing up to its own cache.

One last note: I'm currently working in investment banking and we have the resources to pay for Coherence Enterprise edition. However, we're quite happy with the free version and have been getting good performance out of it. As a result, the tests we're running in our Continuous Integration environment are pretty much representative of what we see in prod.