Saturday, October 30, 2010

Debugging Production Environments

Have you ever been in the position where you have to look after a legacy app that has insufficient logging and non-deterministic problems? What's more, it's running in production and that rare non-deterministic bug has just occurred. How do you diagnose it?

This task faced me last week. Looking at the code on my machine, I just couldn't see how the problem could manifest itself. "After all, there is only one instance of this object in the entire system, right?" I found myself saying. "It must be a bizarre and rare threading issue!"

So, I ran this on the production Linux box (where PID is the Java process ID):

$JAVA_HOME/bin/jmap -histo:live PID | grep OUR_CLASS

and saw something like this (I've added the column headings for clarity):

 num     #instances         #bytes  class name
----------------------------------------------
9030:             2            200  OUR_CLASS

So, there were 2 instances of our class! All my recent investigation of odd threading issues was for naught!

This is one reason why the factory pattern is so nice - you know exactly where the objects are being instantiated (tip: make the class's constructor package-private and put the factory in the same package, with no other classes in that package). As an added benefit, it makes writing test code easier.
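As a sketch of that tip (all class and package names here are hypothetical, not from the legacy app): the constructor is package-private, so the factory is the only possible instantiation site.

```java
// Hypothetical names. Both classes live alone in the same package,
// e.g. com.example.widgets, so nothing outside can call the constructor.
class Widget {
    private final String name;

    Widget(String name) {            // package-private constructor
        this.name = name;
    }

    String name() {
        return name;
    }
}

class WidgetFactory {
    private WidgetFactory() {}       // no instances of the factory either

    static Widget create(String name) {
        return new Widget(name);     // the one and only instantiation site
    }
}
```

Searching for callers of WidgetFactory.create() then tells you exactly how many instances can exist (reflection aside).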

Still, this isn't going to help you if some smart alec decides to instantiate an object of this class reflectively. But such a practice should be strongly discouraged.

Testing and embracing change

My client can only be described as "Agile curious" rather than gung-ho disciples of our lord, Kent Beck. So, it was nice to see an email circulated on the company-wide technical mailing lists asking for help in writing automated regression tests. The author asked about the wisdom of serializing the output of all tests into XML and comparing them between runs.

I almost choked on my coffee. I worked on a project that did this last year and it was not pretty. So, I emailed him describing how using serialized objects in tests had worked out for us:

1. It was fragile. Even minor refactoring of the Java objects would break the tests.

2. It didn’t scale. As the number of XML files that needed to be maintained increased, productivity declined. (At one point, we were maintaining about 20 megs of serialization XML.)

3. Java developers tend not to like maintaining large XML files as their Java code changes - so some refactoring of the XML was, shall we say, less than diligent...

4. Serializing Sets (or any other collection where the order in which the elements were serialized was non-deterministic) could break the tests. For a time, some developers were hacking the problem domain classes to make the order deterministic just so the tests would pass.

To this last point, he replied that XMLUnit could handle this non-determinism. However, I think the first three points were more than enough to frighten me off doing that again.
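To illustrate point 4 (class and method names here are mine, not from the original project): a HashSet's iteration order is unspecified, so any test that compares its serialized form between runs is brittle, whereas a sorted set serializes deterministically.

```java
import java.util.Set;
import java.util.TreeSet;

class SetOrderDemo {
    // Naive "serialization": join the elements in iteration order.
    static String serialize(Set<String> set) {
        return String.join(",", set);
    }

    public static void main(String[] args) {
        // A HashSet's iteration order is unspecified and can vary between
        // JVMs and runs, so comparing its serialized form is fragile.
        // A TreeSet iterates in sorted order, so it is stable:
        Set<String> sorted = new TreeSet<>();
        sorted.add("banana");
        sorted.add("apple");
        sorted.add("cherry");
        System.out.println(serialize(sorted)); // always apple,banana,cherry
    }
}
```

Normalising the order in the test harness (as XMLUnit does) is kinder than warping the domain classes to iterate deterministically.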

More volatility

Wondering why one of our legacy apps seemed to be having threading issues, I continued digging into the recesses of the Java Memory Model.

Volatile is a fascinating keyword since many excellent Java programmers don't appreciate its finer points. An engineer I much admire told me that he only used it for boolean primitives when telling threads to stop running. Fair enough, but there is more to it than that. Here are some of its finer points:

1. What if we're not using volatile primitives? What if we're using, say, an array? Can I be sure that setting an element of the array is a volatile write too and will be seen by other threads? Apparently not, according to Google's Jeremy Manson in this tutorial (about 40 minutes into the video) where he says: "there is no way to make the elements of an array volatile... the reference to the array is volatile not the elements of the array itself".

2. Furthermore, the performance of volatile is generally pretty good. "On an x86 there is no cost in reading a volatile variable. There is a cost to writing to a volatile variable but not reading... as of 2007".

3. On some hardware, 64-bit read/writes are not atomic but "writes and reads of volatile long and double values are always atomic" (Java Language Specification 17.7).

4. The order of the lines of your code that a JVM executes can also be changed by volatile (you did know the order of execution may differ from how you wrote your code, right?). Brian Goetz explains:

"There is no guarantee that operations in one thread will be performed in the order given by the program, as long as the reordering is not detectable within that thread - even if the reordering is apparent to other threads"

(Java Concurrency in Practice, p34).

However, when we introduce volatile:

"Under the old memory model, accesses to volatile variables could not be reordered with each other, but they could be reordered with nonvolatile variable accesses. [...]

Under the new memory model, it is still true that volatile variables cannot be reordered with each other. The difference is that it is now no longer so easy to reorder normal field accesses around them... In effect, because the new memory model places stricter constraints on reordering of volatile field accesses with other field accesses, volatile or not, anything that was visible to thread A when it writes to volatile field f becomes visible to thread B when it reads f."

(JSR-133 FAQ)


Incidentally, this does mean that code that is fine on JDK 1.5+ might not be fine running on an older JVM. If you really want to write-once-run-anywhere Java, you should assume you're writing for the old JMM and make volatile all variables for which the order of execution of read/write operations is important.
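Back on point 1, the standard library does offer a way out: the java.util.concurrent.atomic array classes give volatile read/write semantics per element. A minimal sketch (the class and method names are mine):

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

class VolatileArraySketch {
    // volatile applies to the array *reference*, not its elements:
    static volatile int[] plain = new int[8];

    // AtomicIntegerArray gives volatile semantics for each element:
    static final AtomicIntegerArray atomic = new AtomicIntegerArray(8);

    static void publish(int index, int value) {
        plain[index] = value;     // NOT a volatile write; other threads
                                  // may not see it promptly
        atomic.set(index, value); // volatile write; a subsequent get()
                                  // in any thread sees it
    }
}
```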

So, did all of this help me to fix the buggy legacy app? No. Every time I have cast a suspicious eye at the finer points of the JMM, it has actually been a much more mundane bug that was the culprit. In this case, it turns out that somebody, somewhere was reflectively calling one of our methods that starts threads - a call that is very hard to find with most IDEs.

Evil.

Wednesday, October 27, 2010

A little volatile

Time to refresh my memory: what is the volatile modifier for?

The answer given by most Java programmers I've interviewed recently is that it makes the value the reference points to visible to all threads. This is true but not the whole story.

When I first came to appreciate the Java Memory Model, I was surprised, like almost everybody else, to see that the value of a single field may be different for two different threads. This is not something peculiar to Java. Any hardware that conforms to a von Neumann architecture (which is pretty much every common processor) can run faster if it keeps values in its CPU's registers and caches rather than in main memory (RAM).

Making a reference volatile means that writes to it will be published to main memory (accessible by all threads) rather than lingering in a CPU register or cache (visible only to that thread). But it also means:

1. Its published state is safe

While an object is being constructed, other threads may already have access to it and could reference it before the instantiating thread has left the constructor. If the object is not immutable, this can lead to inconsistent data. Publishing the reference via volatile is one way to eliminate this.

To ensure safe publication, Java code needs to do at least one of the following:
  • initialize the object from a static initializer
  • store the reference in a final field
  • store the reference in a volatile field
  • store the reference in a java.util.concurrent.atomic.AtomicReference
  • guard the reference with an appropriate lock
(See Brian Goetz's excellent Java Concurrency in Practice, p52, for more information).
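The volatile option from that list might look like this minimal sketch (ConfigHolder and Config are hypothetical names):

```java
// Hypothetical sketch of safe publication via a volatile field.
class ConfigHolder {
    static class Config {
        final int timeoutMillis;

        Config(int timeoutMillis) {
            this.timeoutMillis = timeoutMillis;
        }
    }

    // The write to this volatile field happens-after the constructor
    // completes, so no reader can see a half-built Config.
    private volatile Config config;

    void publish(int timeoutMillis) {
        config = new Config(timeoutMillis);
    }

    Config get() {
        return config;   // null until the first publish()
    }
}
```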

2. All other variables are flushed to main memory

"When thread A writes to a volatile variable and subsequently thread B reads that same variable, the value of all variables that were visible to A prior to writing to the volatile variable become visible to B after reading the volatile variable."
(Java Concurrency in Practise, p38)
(Java Concurrency in Practice, p38)
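That guarantee is what makes the common flag-and-payload pattern safe: only the flag needs to be volatile. A sketch (the names are mine):

```java
class VisibilityPiggyback {
    private int payload;             // deliberately NOT volatile
    private volatile boolean ready;  // the only volatile field

    void writer() {
        payload = 42;                // ordinary write...
        ready = true;                // ...made visible by this volatile write
    }

    Integer read() {
        if (ready) {                 // volatile read
            return payload;          // guaranteed to see 42, per the quote above
        }
        return null;                 // not published yet
    }
}
```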


The reason I am re-reading Mr Goetz's excellent book is that I have been asked to diagnose a threading issue in some legacy code where I came across the Double-Checked Locking idiom (see this link for what it is and why it's pathological). It's been a long time since I saw DCL code or even the Singleton pattern (singletons are hard to test and, in this day of dependency injection, largely redundant). But something I read that I had forgotten was:

"To ensure that all threads see the most up-to-date values of shared mutable variables, the reading and writing threads must synchronize on a common lock"
(ibid, p37, emphasis mine).

DCL is fixed if one uses volatile, according to Jeremy Manson, co-author of JSR-133 and of the parts of the JLS that deal with threads and synchronization. Since the old code does not use volatile, I suspect this may be the cause of our problem. However, now I need to prove it - not easy when you're dealing with Heisenbugs.
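For reference, Manson's fix amounts to making the field volatile. A sketch of the repaired idiom (the class name is hypothetical), which is only correct on the JDK 1.5+ memory model:

```java
// Double-Checked Locking repaired with volatile (requires the 1.5+ JMM).
class SafeDcl {
    private static volatile SafeDcl instance;  // volatile is the crucial fix

    private SafeDcl() {}

    static SafeDcl getInstance() {
        SafeDcl result = instance;             // one volatile read on the fast path
        if (result == null) {
            synchronized (SafeDcl.class) {
                result = instance;             // re-check under the lock
                if (result == null) {
                    instance = result = new SafeDcl();
                }
            }
        }
        return result;
    }
}
```

That said, the lazy-holder idiom (a static inner class holding the instance) achieves the same laziness without volatile or explicit locking, and is usually preferable.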