Saturday, September 7, 2013

Anatomy of a file handle leak

Production issues are nasty. But when the same non-deterministic state happens in the performance test environment, it's a gift from the gods. You can now examine it in a controlled environment.

The back story: occasionally, a prod server running Jetty 6.1.24 on a 1.6.13 JVM would complain that it had run out of file handles. Using lsof, we could see that over a thousand handles were indeed open and that they appeared to belong to TCP ports.

Using netstat, we could see what these ports were doing:

mds@gbl04215[Skye]:~> netstat -nap 2>/dev/null | head -2 
Active Internet connections (servers and established) 
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name 
mds@gbl04215[Skye]:~> netstat -nap 2>/dev/null | grep 10302 | more 
tcp        0      0 127.0.0.1:32007         0.0.0.0:*               LISTEN      10302/java 
tcp        0      0 :::48746                :::*                    LISTEN      10302/java 
tcp        0      0 :::1099                 :::*                    LISTEN      10302/java 
tcp        0      0 128.162.27.126:8112     :::*                    LISTEN      10302/java 
tcp        0      0 :::19601                :::*                    LISTEN      10302/java 
tcp        0      0 :::41624                :::*                    LISTEN      10302/java 
tcp        0      0 :::10008                :::*                    LISTEN      10302/java 
tcp        0      0 :::19001                :::*                    LISTEN      10302/java 
tcp        0      0 :::56639                :::*                    LISTEN      10302/java 
tcp     1224      0 128.162.27.126:19001    128.164.34.13:47367     CLOSE_WAIT  10302/java 
tcp     1223      0 128.162.27.126:19001    128.164.34.13:41988     CLOSE_WAIT  10302/java 
tcp     1224      0 128.162.27.126:19001    128.164.34.11:56322     CLOSE_WAIT  10302/java 
tcp     1224      0 128.162.27.126:19001    128.164.34.11:49410     CLOSE_WAIT  10302/java 
tcp     1224      0 128.162.27.126:19001    128.164.34.11:55043     CLOSE_WAIT  10302/java 
tcp      815      0 128.162.27.126:19001    133.13.143.184:57011    ESTABLISHED 10302/java 
tcp     1224      0 128.162.27.126:19001    128.164.34.13:48896     CLOSE_WAIT  10302/java 
tcp     1769      0 128.162.27.126:19001    128.164.34.11:45830     CLOSE_WAIT  10302/java 
tcp     1768      0 128.162.27.126:19001    128.164.34.11:53254     CLOSE_WAIT  10302/java 
tcp     1770      0 128.162.27.126:19001    128.164.34.11:43783     CLOSE_WAIT  10302/java 
tcp     1224      0 128.162.27.126:19001    128.164.34.11:49927     CLOSE_WAIT  10302/java 
tcp     1224      0 128.162.27.126:19001    128.164.34.11:44040     CLOSE_WAIT  10302/java 
tcp     1769      0 128.162.27.126:19001    128.164.34.11:45321     CLOSE_WAIT  10302/java 
tcp     1770      0 128.162.27.126:19001    128.164.34.13:44300     CLOSE_WAIT  10302/java 
tcp     1768      0 128.162.27.126:19001    128.164.34.13:45066     CLOSE_WAIT  10302/java 
tcp     1769      0 128.162.27.126:19001    128.164.34.13:47114     CLOSE_WAIT  10302/java 
tcp     1224      0 128.162.27.126:19001    128.164.34.13:40970     CLOSE_WAIT  10302/java 
tcp     1224      0 128.162.27.126:19001    128.164.34.13:39434     CLOSE_WAIT  10302/java
.
.


(where our Java PID is 10302).

Hmm, OK. Well, some interesting things to note. The first is that netstat tells you more than the state of the port. As the man page says, the second column (Recv-Q) shows how many bytes of received data are backing up on the socket, unread by the application.

The second point to note is that the local port, that is the port of our server (19001), is the port we're using for HTTP via Jetty. 

The third is the CLOSE_WAIT state. This means that the client has closed the connection (by sending a FIN packet) and that our server's TCP stack has acknowledged it (by returning an ACK packet), but that our application has not yet told the OS to close its end of the socket (see diagram).
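This failure mode is easy to reproduce in miniature. Here's a hypothetical loopback sketch (Java 7 NIO API, made-up class name): the client closes its end, the kernel ACKs the FIN, and until the server calls close() on the accepted channel, that socket sits in CLOSE_WAIT.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class CloseWaitDemo {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Bind an ephemeral loopback port and connect a client to it.
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress("127.0.0.1", 0));
        SocketChannel client = SocketChannel.open(server.getLocalAddress());
        SocketChannel accepted = server.accept();   // blocking accept

        client.close();                             // client sends FIN; the server's OS ACKs it
        Thread.sleep(100);                          // give the FIN time to arrive

        // The server-side socket is now in CLOSE_WAIT until the application
        // closes it. read() returning -1 is the end-of-stream signal; ignoring
        // it and never calling accepted.close() is exactly the leak pattern.
        int n = accepted.read(ByteBuffer.allocate(16));
        System.out.println("read returned " + n);

        accepted.close();                           // the call a leaking app never makes
        server.close();
    }
}
```

If you comment out accepted.close() and pause the JVM, netstat on the machine should show the leaked socket parked in CLOSE_WAIT.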


Putting this information together, I dumped the JVM's memory to a file and examined its contents.



Picking a SocketChannel at random from those lingering in memory, I took the port associated with it and checked whether it was one of the ports stuck in the CLOSE_WAIT state. It was, and so were all the others I looked at.

So, was Jetty not closing ports? Googling didn't find anybody complaining about this, but I did find a JVM bug report that looks awfully similar.

More research is needed, but this looks suspiciously like the culprit.

Sunday, September 1, 2013

More NIO Tidbits

A miscellaneous list of interesting things I've found playing with my own NIO code (on a Mac OS X Lion 10.7.5):

Selector.select() can return a non-zero count, and selectedKeys() a non-empty set, even if the client has disconnected. They merely indicate that there is something to do. Maybe data has been buffered and needs reading.

If the state of the client has changed since the last call to Selector.select(), this may not be reflected until the next call.

Similarly, SocketChannel.read(...) can happily return a value greater than 0 and fill your buffer even if the client has long since disconnected. But it can also return -1 for a disconnected client even while a call to its SelectionKey.isReadable() returns true.

So, SelectionKey.isReadable() can return true even though the client has disconnected (even if it has not written anything) but it can also return false while the client is connected (but has not yet written anything)!

If the client has disconnected, the socket on the server side can still be open and sitting in the CLOSE_WAIT state. Writing to the associated SocketChannel will not throw an error, but unsurprisingly the socket may no longer show up in a netstat after this has been done.

The Socket associated with the SocketChannel may still return true when its isConnected method is called irrespective of whether the client has disconnected. It is returning the state of this socket, not the socket on the client.
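Putting those tidbits together: the only reliable disconnect signal is read() returning -1, at which point you should cancel the key and close the channel. A minimal loopback sketch of that pattern (Java 7 NIO, hypothetical class name):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class DisconnectDemo {
    public static void main(String[] args) throws IOException, InterruptedException {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress("127.0.0.1", 0));

        SocketChannel client = SocketChannel.open(server.getLocalAddress());
        SocketChannel accepted = server.accept();
        accepted.configureBlocking(false);
        accepted.register(selector, SelectionKey.OP_READ);

        client.close();          // peer disconnects without ever writing a byte
        Thread.sleep(100);

        selector.select(1000);   // the key is reported anyway: "something to do"
        SelectionKey key = selector.selectedKeys().iterator().next();
        System.out.println("isReadable: " + key.isReadable());

        // read() is what actually tells us the peer has gone
        int n = ((SocketChannel) key.channel()).read(ByteBuffer.allocate(16));
        System.out.println("read: " + n);

        key.cancel();
        key.channel().close();   // closing here is what keeps CLOSE_WAIT sockets from piling up
        server.close();
        selector.close();
    }
}
```

Note that isReadable() reports true here even though the client never wrote anything, which is exactly the behaviour described above.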

Level and Edge Triggered IO

There are "two models for readiness notification for a file descriptor.

  • Level-triggered notification: A file descriptor is considered to be ready if it is possible to perform an I/O system call without blocking.
  • Edge-triggered notification: Notification is provided if there is an I/O activity (e.g., new input) on a file descriptor since it was last monitored." [1]
(As a mnemonic, think of the 'E' in edge-triggered also standing for Event).

It's interesting to note that the two are isomorphic. If you were to have edge-triggered notifications and then kept a collection of them until used, you'd have level-triggered notification. Similarly, if you observed changes in the collection of level-triggered notifications, you'd have edge-triggered notifications. [2]

Anyway, in Java, "the Selector API requires a level-triggered polling mechanism. Edge-triggered interfaces require a much tighter coupling with the I/O methods." [3]

This edge-triggering can be seen in the standard read methods of Java I/O classes, which tell you that data was received when the return value is greater than zero.
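The level-triggered nature of the Selector is easy to observe: leave data unread and every call to select() reports the key ready again; drain it and the readiness goes away. A small loopback sketch (hypothetical class name, assumed timings):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;

public class LevelTriggeredDemo {
    public static void main(String[] args) throws IOException, InterruptedException {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress("127.0.0.1", 0));
        SocketChannel client = SocketChannel.open(server.getLocalAddress());
        SocketChannel accepted = server.accept();
        accepted.configureBlocking(false);
        accepted.register(selector, SelectionKey.OP_READ);

        client.write(ByteBuffer.wrap("hello".getBytes(StandardCharsets.US_ASCII)));
        Thread.sleep(100); // let the bytes arrive on the server side

        // Level-triggered: while unread bytes sit in the receive buffer, every
        // select() reports the key as ready again, whether or not we act on it.
        for (int i = 0; i < 3; i++) {
            int ready = selector.select(1000);
            System.out.println("select " + i + " ready: " + ready);
            selector.selectedKeys().clear(); // clearing the set does not clear readiness
        }

        int drained = accepted.read(ByteBuffer.allocate(64));
        System.out.println("drained: " + drained);
        System.out.println("after drain: " + selector.selectNow()); // nothing left to report

        client.close(); accepted.close(); server.close(); selector.close();
    }
}
```

An edge-triggered interface (epoll's EPOLLET, say) would have reported the arrival once; here each select() re-reports it until the buffer is drained.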


[1] The Linux Programming Interface, Michael Kerrisk
[2] http://spectral.mscs.mu.edu/netbook/chapter18.html
[3] http://mail.openjdk.java.net/pipermail/nio-dev/2008-December/000329.html