Saturday, July 21, 2012

Two Java Gotchas that Keep Tripping me

I have a principle that if I make the same mistake twice, I put it on my blog to remind myself. The two that tripped me up recently are:

1. Java Regex

If you try to search for some text that's broken over two lines, you need to tell java.util.regex.Pattern. For instance, given the text:


This is a piece of text
with multiple lines and this text:
string to find

Then you might expect a regular expression looking for the pattern text:.string (note the wildcard after the colon) to match across the end of the second line and the beginning of the third. This is not the default behaviour. In fact, this test fails:


        String toFind = "text:.string";
        String toSearch = "This is a piece of text\nwith multiple lines and this text:\nstring to find\n";
        Pattern pattern = Pattern.compile(toFind);
        Matcher matcher = pattern.matcher(toSearch);
        Assert.assertTrue(String.format("Can't find [%s] in [%s]", toFind, toSearch), matcher.find());

You need to enable DOTALL mode with:

        Pattern pattern = Pattern.compile(toFind, Pattern.DOTALL);

Now, your regex will match.
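Putting both together, here is a small self-contained sketch (class name mine) showing the same search failing without the flag and succeeding with it:

```java
import java.util.regex.Pattern;

public class DotAllDemo {
    public static void main(String[] args) {
        String toSearch = "This is a piece of text\nwith multiple lines and this text:\nstring to find\n";
        // Without DOTALL, '.' refuses to match the newline, so the search fails
        System.out.println(Pattern.compile("text:.string").matcher(toSearch).find());
        // With DOTALL, '.' matches any character, including '\n'
        System.out.println(Pattern.compile("text:.string", Pattern.DOTALL).matcher(toSearch).find());
    }
}
```

This prints false then true. An equivalent alternative is the inline flag (?s) at the start of the pattern.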

2. XML Namespace and Java Parsers

I wasted most of a day this week wondering what was wrong with my XML when I tried to validate it against what looked like a perfectly good piece of XSD. It seems that the default behaviour of a parser is to ignore namespaces. I don't know why. Presumably it's due to historical reasons. Anyway, you need to add:

aSAXParserFactory.setNamespaceAware(true);

to your code (see the JavaDocs).
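For the record, a minimal sketch (class name mine) showing the default and the fix:

```java
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

public class NamespaceAwareParser {
    public static void main(String[] args) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        // Namespace awareness is off by default, for backwards compatibility
        System.out.println(factory.isNamespaceAware());
        factory.setNamespaceAware(true);
        SAXParser parser = factory.newSAXParser();
        System.out.println(parser.isNamespaceAware());
    }
}
```

This prints false then true. Any schema validation wired up to the parser will only see namespace-qualified element names once the flag is set.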

It was a colleague who told me about this. When I asked how he knew that, he said I had told him a few months before. Hence this blog entry.

Sunday, July 15, 2012

ThreadPoolExecutor and RTFM

Take a look at the constructors for ThreadPoolExecutor. All of them take a maximumPoolSize argument. If you thought this was necessarily the maximum number of threads the pool will run, you'd be wrong.

To be fair, the documentation does say this might not necessarily be the case. But although we should have RTFM'd, we were in a hurry.

Basically, the pool size is partly determined by the type of BlockingQueue you pass to the constructor. The rule is that if the BlockingQueue in question can take another task, the executor will save the task for later rather than start a new thread.

So, if you pass the constructor a LinkedBlockingQueue that was in turn instantiated without a limit, no new threads are created beyond the core, no matter what the maximum size of the pool. A conservative programmer might instantiate the thread pool with a core size of 1 and a much larger maximum pool size. But if he passes in an unbounded queue then it is all to no avail. The executor will never be multi-threaded.
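A quick sketch demonstrates this (class name and sizes are mine). A hundred tasks are submitted, all blocked on a latch, yet the pool never grows past its core size of one because offer() on the unbounded queue always succeeds:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class UnboundedQueueDemo {
    public static void main(String[] args) {
        // Core pool size 1, maximum 10 - but the maximum is never reached
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
                1, 10, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<Runnable>());
        CountDownLatch latch = new CountDownLatch(1);
        for (int i = 0; i < 100; i++) {
            executor.execute(() -> {
                try { latch.await(); } catch (InterruptedException ignored) { }
            });
        }
        // 99 tasks sit in the queue; only the single core thread exists
        System.out.println(executor.getPoolSize());
        latch.countDown();
        executor.shutdown();
    }
}
```

This prints 1.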

Furthermore, if you pass an ArrayBlockingQueue (which is by its nature bounded) to a ThreadPoolExecutor with a core pool size of 1 and a much bigger maximum pool size, the executor won't become multi-threaded until the number of tasks waiting to be executed exceeds the size of the ArrayBlockingQueue.

And if you use a SynchronousQueue (where "each insert operation must wait for a corresponding remove operation by another thread"), don't be surprised when the executor throws a RejectedExecutionException as soon as the pool is at its maximum size and no thread is free to take the work.
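This too is easy to reproduce (class name and sizes are mine). With a SynchronousQueue and a maximum pool size of 1, the second task has nowhere to go: the queue's offer() fails because the only worker thread is busy, and the pool is already at its maximum:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class SynchronousQueueDemo {
    public static void main(String[] args) {
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
                1, 1, 60, TimeUnit.SECONDS, new SynchronousQueue<Runnable>());
        CountDownLatch latch = new CountDownLatch(1);
        Runnable blocker = () -> {
            try { latch.await(); } catch (InterruptedException ignored) { }
        };
        executor.execute(blocker); // taken directly by the single core thread
        try {
            executor.execute(blocker); // no thread free, offer() fails, pool at max
        } catch (RejectedExecutionException e) {
            System.out.println("rejected");
        }
        latch.countDown();
        executor.shutdown();
    }
}
```

This prints "rejected".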

How the executor differentiates between the various queue types happens in ThreadPoolExecutor.execute:


    public void execute(Runnable command) {
        if (command == null)
            throw new NullPointerException();
        for (;;) {
            if (runState != RUNNING) {
                reject(command);
                return;
            }
            if (poolSize < corePoolSize && addIfUnderCorePoolSize(command))
                return;
            if (workQueue.offer(command))
                return;
            Runnable r = addIfUnderMaximumPoolSize(command);
            if (r == command)
                return;
            if (r == null) {
                reject(command);
                return;
            }
            // else retry
        }
    }


Queue.offer returns true if it was possible to add something (the happy path of this method). Whether it succeeds is determined by the type of queue and how it was instantiated, as mentioned above. Only if the task could not be queued does the ThreadPoolExecutor code attempt to start another thread.
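The offer() semantics can be seen in isolation, without any executor (class name mine):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class OfferDemo {
    public static void main(String[] args) {
        Runnable task = () -> { };
        BlockingQueue<Runnable> bounded = new ArrayBlockingQueue<>(1);
        System.out.println(bounded.offer(task)); // true - the queue had room
        System.out.println(bounded.offer(task)); // false - full, so the executor would try a new thread
        BlockingQueue<Runnable> unbounded = new LinkedBlockingQueue<>();
        System.out.println(unbounded.offer(task)); // always true - the pool never grows past its core
    }
}
```

This prints true, false, true: exactly the branch ThreadPoolExecutor.execute takes when deciding whether to start another thread.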

There were two other things that confused us. The first was the timeout in ThreadPoolExecutor's method:

invokeAll(Collection<? extends Callable<T>> tasks, long timeout, TimeUnit unit)

This refers to the total time it will wait, not the time for each Callable. During this time, it polls the futures (by calling get()) to see if they have anything to return. If it times out while polling, it calls cancel() on all the outstanding futures.

The final gotcha we experienced was that isDone() on all the futures returned from invokeAll returns true, even for the cancelled ones. However, a call to get() on a cancelled future throws a CancellationException. Not what we expected.
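Both surprises in one sketch (class name and timings are mine; the sleep is deliberately far longer than the timeout):

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CancellationException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class InvokeAllTimeoutDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Callable<String> slow = () -> { Thread.sleep(10_000); return "done"; };
        // 100ms is the budget for the whole collection, not per Callable
        List<Future<String>> futures =
                executor.invokeAll(Arrays.asList(slow, slow), 100, TimeUnit.MILLISECONDS);
        for (Future<String> future : futures) {
            System.out.println(future.isDone()); // true, even though it was cancelled
            try {
                future.get();
            } catch (CancellationException e) {
                System.out.println("cancelled");
            }
        }
        executor.shutdownNow();
    }
}
```

This prints true/cancelled for each future: "done" counts cancellation as a form of completion.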

The long and short of it is that although these Java classes are beautifully crafted, they are not without their somewhat counterintuitive peculiarities.

Saturday, July 7, 2012

Why System Calls are Slow

Java is an abstraction around the underlying operating system. Normally, you wouldn't have to worry about what calls the JVM makes to the OS on your behalf. But these system calls can be expensive, as the girls and boys down at LMAX know. They've put together a framework that tries to minimize them.

As with any other process, Linux allows you to see what underlying operating system calls the JVM is making with the strace command. When threads vie for synchronized blocks, for instance, you'll see output like:


futex(0xb7651344, FUTEX_WAIT_PRIVATE, 1, {0, 999854465}) = 0

where the futex system call is asking for the operating system to arbitrate the contention.

When a call is made the following steps are executed (from the excellent The Linux Programming Interface):
  1. Although your C code may look like you're calling a system call as if it were any other function, it actually "invokes a wrapper function in the C library".
  2. "Arguments are passed to the wrapper via the stack [as with any C function call], but the kernel expects them in specific registers. The wrapper function copies the arguments to these registers".
  3. "Since all system calls enter the kernel in the same way, the kernel needs some method of identifying the system call. To permit this, the wrapper function copies the system call number into a specific register (%eax)".
  4. The "machine instruction (int 0x80), which causes the processor to switch from user mode to kernel mode and execute code pointed to by location 0x80" is run.
  5. In response to this, "the kernel invokes its system_call() routine". This is covered in the next section.
  6. "If the return value of the system call service routine indicated an error, the wrapper function sets the global variable errno ... using this value. The wrapper function then returns to the caller, providing an integer return value indicating the success or failure of the system call".
Step #5 mentions the "system_call() routine". This does the following:
  1. "Saves the register values onto the kernel stack".
  2. "Checks the validity of the system call number."
  3. "Invokes the appropriate system call routine, which is found by using the system call number to index a table of all system call services (the kernel variable sys_call_table). If the system call service routine has any arguments, it first checks their validity". The list of kernel calls and their memory addresses can be found at /proc/kallsyms. [1]
  4. Restores registers from the stack and places the return value on the stack.
  5. Drops back to user mode and returns to the wrapper function.
If this looks like a lot of work, it's because it is. This is why system calls are largely avoided in frameworks like the Disruptor, where even the smallest delay matters.

[1] The Ksplice Blog.

How do I remember thee? Let me count the ways...

With apologies to the poetess Elizabeth Barrett Browning, this post is about all the ways a Java process can hold on to memory.

When people talk about how much memory a Java process has, most people only think of heap size. But there's more.

Firstly, there's obviously the stack. When we create a new thread and then start it, memory usage increases. (Merely instantiating a thread - or indeed any object - in Java does not in itself use any additional system memory.)

Secondly, there are direct byte buffers.

Thirdly, there are memory mapped files and there also may be OS specific means of sharing data between processes.

1. The Heap

The memory usage of all processes shares similar characteristics, and the concept of heap space is not unique to the JVM.

In C, allocating memory is usually done with the malloc call. Note that in Linux, this is not a kernel call but a library call. It can be found in the glibc library. If you're using the GNU C compiler, this is the code that will be executed. But you can use any memory manager you like. You can even write your own.

"A memory manager is a set of routines that takes care of the dirty work of getting your program memory for you. Most memory managers have two basic functions - allocate and deallocate. Whenever you need a certain amount of memory, you can simply tell allocate how much you need, and it will give you back an address to the memory. When you’re done with it, you tell deallocate that you are through with it. allocate will then be able to reuse the memory. This pattern of memory management is called dynamic memory allocation. This minimizes the number of "holes" in your memory, making sure that you are making the best use of it you can. The pool of memory used by memory managers is commonly referred to as the heap.

"The way memory managers work is that they keep track of where the system break is, and where the memory that you have allocated is. They mark each block of memory in the heap as being used or unused. When you request memory, the memory manager checks to see if there are any unused blocks of the appropriate size. If not, it calls the brk system call to request more memory. When you free memory it marks the block as unused so that future requests can retrieve it." - Programming from the Ground Up.

This is what malloc does.

"When a process needs memory, some room is created by moving the upper bound of the heap forward, using the brk() or sbrk() system calls. Because a system call is expensive in terms of CPU usage, a better strategy is to call brk() to grab a large chunk of memory and then split it as needed to get smaller chunks. This is exactly what malloc() does. It aggregates a lot of smaller malloc() requests into fewer large brk() calls. Doing so yields a significant performance improvement. " - Linux Journal.

So, creating an object in Java does not in itself appear to demand any memory from the OS. That is, using Linux's strace on the process doesn't show any system calls to brk or mmap. Nor does the pmap command indicate that any more memory is being used (from the point of view of the OS).


2. The Stack

However, upon starting a Java Thread, the pmap command shows increased memory usage - about 312k on my system. Running pmap before and after the JVM calls Thread.start() and diffing the results shows this:


00eb8000    312K rwx--    [ stack ]


3. Direct Byte Buffers

"Operating systems perform I/O operations on memory areas. These memory areas, as far as the operating system is concerned, are contiguous sequences of bytes.  It's no surprise then that only byte buffers are eligible to participate in I/O operations. Also recall that the operating system will directly access the address space of the process, in this case the JVM process, to transfer the data. This means that memory areas that are targets of I/O operations must be contiguous sequences of bytes. In the JVM, an array of bytes may not be stored contiguously in memory, or the Garbage Collector could move it at any time. Arrays are objects in Java, and the way data is stored inside that object could vary from one JVM implementation to another.

"For this reason, the notion of a direct buffer was introduced. Direct buffers are intended for interaction with channels and native I/O routines." - Java NIO.

Direct buffer memory is allocated with the call:

ByteBuffer.allocateDirect(capacity);

If you keep allocating memory like this, you won't see any changes to the memory usage as reported by calls to:

Runtime.getRuntime().freeMemory();

However, you will be using memory. In Linux, you can prove this by comparing the /proc/PID/smaps file between calls (where PID is the ID of the process allocating the memory). Typically diff will show something like:


< b49fc000-b4e00000 rw-p 00000000 00:00 0
< Size:               4112 kB
< Rss:                4112 kB
< Pss:                4112 kB
---
> b48fb000-b4e00000 rw-p 00000000 00:00 0
> Size:               5140 kB
> Rss:                5140 kB
> Pss:                5140 kB
824,825c824,825
< Private_Dirty:      4112 kB
< Referenced:         4112 kB
---
> Private_Dirty:      5140 kB
> Referenced:         5140 kB



These addresses correspond to the kernel calls the JVM is making. From strace:


mmap2(NULL, 1052672, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb49fc000
.
.
.
mmap2(NULL, 1052672, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb48fb000


This is when I am allocating 1MiB of memory at a time. Why this is slightly more than 1MiB is currently a mystery to me.
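For completeness, a sketch of the allocation itself (class name mine). Nothing here touches the Java heap; the memory behind the buffer comes from those mmap2 calls:

```java
import java.nio.ByteBuffer;

public class DirectBufferDemo {
    public static void main(String[] args) {
        // 1MiB allocated outside the Java heap (via mmap on Linux)
        ByteBuffer buffer = ByteBuffer.allocateDirect(1024 * 1024);
        System.out.println(buffer.isDirect());
        System.out.println(buffer.capacity());
        // Runtime.getRuntime().freeMemory() will be broadly unchanged,
        // because the buffer's storage lives off-heap
    }
}
```

This prints true then 1048576.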

4. Memory Mapped Files

"Calling map( ) on a FileChannel creates a virtual memory mapping backed by a disk file and wraps a MappedByteBuffer object around that virtual memory space. " - Java NIO.


This allows your JVM to use much more memory than the machine physically has. Peter Lawrey on his blog demonstrates "8,000,000,000,000 bytes or ~7.3 TB in virtual memory, in a Java process! This works because it only allocates or pages in the pages which you use. So while the file size is almost 8 TB, the actual disk space and memory used is 4 GB."
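A minimal sketch of the mapping (class name mine; 100MiB is an arbitrary illustrative size, well short of Peter Lawrey's terabytes):

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;

public class MmapDemo {
    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("mmap-demo", ".dat");
        file.toFile().deleteOnExit();
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "rw");
             FileChannel channel = raf.getChannel()) {
            // Map a 100MiB region; the file is extended, but physical pages
            // are only allocated as they are touched
            MappedByteBuffer map =
                    channel.map(FileChannel.MapMode.READ_WRITE, 0, 100 * 1024 * 1024);
            map.put(0, (byte) 42);
            System.out.println(map.get(0));
        }
    }
}
```

This prints 42. Watching the process with pmap while it runs shows the whole region in virtual memory, but the resident set only grows as bytes are written.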


5. OS Specific Tempfs


There is RamDisk-like functionality on most Linux distros called tmpfs (note: most Unix-like systems have this). You can use this shared memory out-of-the-box by reading/writing to a virtual filesystem at /dev/shm.


The advantages of using this memory are outlined in Michael Kerrisk's The Linux Programming Interface:

"Linux also support the notion of virtual file systems that reside in memory... file operations are much faster since no disk access is involved.



"This file system has kernel persistence--the shared memory objects that it contains will persist even if no process currently has them open, but they will be lost if the system is shut down". 


[Disclaimer: I am a Java engineer by trade but have an interest in Linux. I have tried my best to be accurate when talking about Linux but cannot guarantee what I write is error free]