Sunday, October 13, 2013

Perceptrons

I've been playing around a little with artificial intelligence. I've put a little project together called JNeural. This is a summary of what I have learned.

Perceptrons

In my little test project, I've built a perceptron, a simple piece of code that can be used for supervised learning. The basic algorithm is:

  1. Take a vector of inputs.
  2. Find the dot product of the inputs and a vector of weights.
  3. Feed this result into an activation function. Your answer will come out of the other side.
The simplest activation function returns a boolean given by evaluating: is the dot product greater than a threshold?
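In code, the whole thing is only a few lines. Here's a minimal sketch of steps 1-3 (the class and method names are illustrative, not necessarily JNeural's actual API):

    // A minimal sketch of steps 1-3; names are illustrative, not JNeural's actual API.
    public class SimplePerceptron {
        private final double[] weights;   // one weight per input
        private final double threshold;   // the activation threshold

        public SimplePerceptron(double[] weights, double threshold) {
            this.weights = weights;
            this.threshold = threshold;
        }

        // Step 2: the dot product of the inputs and the weights.
        private double dotProduct(double[] inputs) {
            double sum = 0.0;
            for (int i = 0; i < inputs.length; i++) {
                sum += inputs[i] * weights[i];
            }
            return sum;
        }

        // Step 3: the simplest activation function - is the dot product greater than the threshold?
        public boolean fire(double[] inputs) {
            return dotProduct(inputs) > threshold;
        }
    }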

Teaching a Perceptron

The secret sauce is the weight vector. This is calculated by repeatedly giving the algorithm input vectors and asking it to compare its results with an expected answer. It goes like this:
  1. Given the input/weight dot product, calculate the delta between this result and the expected value.
  2. If the result falls on the same side of the threshold as the expected value, we got the answer right and we're done.
  3. Otherwise, we got the answer wrong. So, let's calculate the error and adjust all the weights by it.
We do this either for a fixed number of steps or until the deviation across all the training samples is within a certain margin.
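As a sketch, the standard perceptron update described by the steps above looks something like this (again, illustrative names, building on the SimplePerceptron sketch; the learning rate is the constant discussed in the next section):

    // A sketch of the training step for the SimplePerceptron above.
    public void train(double[] inputs, double expected, double learningRate) {
        double actual = fire(inputs) ? 1.0 : 0.0;
        double error  = expected - actual;           // the delta from the expected value
        if (error != 0.0) {                          // we got the answer wrong...
            for (int i = 0; i < weights.length; i++) {
                // ...so adjust every weight by the error, scaled by the learning rate
                weights[i] += learningRate * error * inputs[i];
            }
        }
    }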

Almost there...

This is not the whole story. There are two constants that we need to include. 

The first is the learning rate. This scales the error correction by a constant factor. It's needed because otherwise the correction granularity might be too great; that is, we keep overshooting the mark. However, if it's too small, the algorithm takes much longer to learn.

The second is the bias. To see why it's needed, imagine a problem space made of Cartesian co-ordinates where the question is "does a given point map to 1 or 0?". With an input of {0, 0}, no matter what the weights are, the dot product (and therefore the output) will always be 0.
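One common way to handle this (an assumption about the implementation, not necessarily how JNeural does it) is to fold the bias in as an extra input that is fixed at 1 with its own learnable weight:

    // The bias as an extra input fixed at 1 with its own weight; it rescues inputs like {0, 0}.
    private double dotProductWithBias(double[] inputs, double biasWeight) {
        double sum = biasWeight * 1.0;
        for (int i = 0; i < inputs.length; i++) {
            sum += inputs[i] * weights[i];
        }
        return sum;
    }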


Linearly Separable

Single perceptrons cannot solve problems that are not linearly separable. What does this mean?

Linearly separable problems are those where a single straight line can divide the problem space into the two classes of answer.

A less abstract problem space that is linearly separable is the truth table for AND:

    x y | x AND y
    0 0 |    0
    0 1 |    0
    1 0 |    0
    1 1 |    1

Plotting these four points, we can draw a straight line delineating the single true answer from the three false ones.

XOR is not linearly separable:

    x y | x XOR y
    0 0 |    0
    0 1 |    1
    1 0 |    1
    1 1 |    0

No single straight line can separate the true answers, (0, 1) and (1, 0), from the false ones, (0, 0) and (1, 1). Perceptrons working in concert can solve this problem, but a single perceptron by itself cannot.
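To make "in concert" concrete, here's a hand-crafted sketch of my own (weights chosen by hand, no training involved) showing how two threshold units feeding a third compute XOR:

    // Two perceptron-style units feeding a third compute XOR with hand-picked weights.
    static int step(double x) { return x > 0 ? 1 : 0; }   // the simple threshold activation

    static int xor(int x, int y) {
        int or  = step(x + y - 0.5);     // fires for (0,1), (1,0) and (1,1)
        int and = step(x + y - 1.5);     // fires only for (1,1)
        return step(or - and - 0.5);     // fires when OR does but AND does not
    }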

References

The Nature of Code, Daniel Shiffman (available free online).

Information Theory, Inference, and Learning Algorithms, David MacKay (available free online).

Wikipedia.

Saturday, October 12, 2013

Stacks of JVMs

The JVM uses a stack-based architecture [1]. When stack-based architectures evaluate expressions, they treat them like reverse Polish notation. Take, for instance:

(a + b * c) / (a + d * c - e)

This becomes in reverse Polish:

a b c * + a d c * + e - /

All this says is: push each variable onto the stack and, whenever we see an operator, pop its operands, apply the operator, and push the result back onto the stack.
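A sketch of that evaluation loop in plain Java (assuming single-letter variables whose values are supplied in a map; this illustrates the idea, it is not how the JVM itself is written):

    // Evaluates a reverse Polish expression such as "a b c * + a d c * + e - /".
    // Requires java.util.ArrayDeque, Deque and Map.
    static double evaluateRpn(String expression, Map<String, Double> variables) {
        Deque<Double> stack = new ArrayDeque<>();
        for (String token : expression.split("\\s+")) {
            switch (token) {
                case "+": { double b = stack.pop(), a = stack.pop(); stack.push(a + b); break; }
                case "-": { double b = stack.pop(), a = stack.pop(); stack.push(a - b); break; }
                case "*": { double b = stack.pop(), a = stack.pop(); stack.push(a * b); break; }
                case "/": { double b = stack.pop(), a = stack.pop(); stack.push(a / b); break; }
                default:  stack.push(variables.get(token));   // push the variable's value
            }
        }
        return stack.pop();   // the final value left on the stack is the result
    }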

The downside of stack-based architectures is the potential to push and pop more than is necessary. In the example above, we push a and c twice. The upside is that the instruction set is simple and dense, which can increase efficiency.

Of course, the code may be JITed in which case the native machine's architecture is used.

[1] Prof David Wentzlaff of Princeton University via Coursera.

Depth- and Breadth-first searches

I had a beer with the awfully clever Sean Scarff who was a games programmer in the 80s. He mentioned that he and his peers were writing a lot of algorithms while ignorant of the academic work in the field. In particular, when they were solving mazes, they imagined a smell emanating from the end of the maze and the closer they were the stronger the scent. This is breadth-first searching.

"Breath-first search and depth-first search both visit all the nodes in a graph, but their manner of doing so is dramatically different... Breadth-first search amounts to an army of searchers fanning out to cover the territory; depth-first search search corresponds to a single searcher probing unknown territory as deeply as possible, retreating only when hitting dead ends." - Algorithms in C++, Sedgewick.

Because of this, breadth-first search is easier to parallelise but may take up more memory (see other pros and cons here).
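The difference really comes down to the data structure holding the frontier: a FIFO queue gives breadth-first, a LIFO stack gives depth-first. A rough sketch of my own (assuming an adjacency-list graph keyed by node id):

    // Visits every node reachable from start; the frontier's discipline decides the order.
    // Requires java.util.*.
    static List<Integer> search(Map<Integer, List<Integer>> graph, int start, boolean breadthFirst) {
        Deque<Integer> frontier = new ArrayDeque<>();
        frontier.add(start);
        Set<Integer> visited = new LinkedHashSet<>();
        while (!frontier.isEmpty()) {
            // FIFO (pollFirst) fans out like the army of searchers; LIFO (pollLast) probes deeply
            int node = breadthFirst ? frontier.pollFirst() : frontier.pollLast();
            if (visited.add(node) && graph.containsKey(node)) {
                frontier.addAll(graph.get(node));
            }
        }
        return new ArrayList<>(visited);   // nodes in the order they were visited
    }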

Sean is involved in computer networks these days where the algorithm still applies. Funny how algorithms do that.

Finally, click here for a cartoon that's both amusing and entertaining.

Sunday, October 6, 2013

A first look at Java 8's Lambdas

Cameron McKenzie has written an article on Java 8's new Lambdas so I thought I'd take a look at the new JDK. It's available here but note you may have IDE problems while the Java community decide on the exact syntax. I had to upgrade my IntelliJ to 12.1.5 to get the new syntax recognised.

McKenzie says that using Lambdas produces more efficient code but in my own tests, I haven't seen this. I took his code, only slightly modified it to remove System.out.println and put it into a Caliper test such that it looked like this:

    public int timeLambdas(int reps) {
        final Set<Group>   groups  = initGroups(reps);
        final List<Person> persons = initPersons(groups, reps);

        Stream<Group> sorted = persons.stream()
                .filter(p -> p.getAge() < reps)
                .map(p -> p.getGroup())
                .distinct()
                .sorted(comparing(g -> g.getSize()));

        int doNotJit = sorted.map(g -> g.getSize()).reduce(0, (i, j) -> i |= j);
        return doNotJit;
    }

(Note: I had to map/reduce the data otherwise none of the filtering, mapping and dupe-removal actually gets called. The stream operations are lazy and the code will not execute until asked to actually do something. I was very confused for an hour as I scratched my head wondering how on Earth they had made it so 'efficient'...).
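A quick way to see the laziness for yourself (a little demo of my own; peek is only there to show when elements actually flow through the pipeline):

    // Nothing is printed when the pipeline is built; elements only flow on the terminal operation.
    static void lazinessDemo() {
        Stream<String> pipeline = Stream.of("a", "b", "c")
                .peek(s -> System.out.println("processing " + s))   // not printed yet...
                .filter(s -> !s.isEmpty());

        System.out.println("pipeline built - nothing has been processed");
        // ...only a terminal operation like collect pulls the data through the pipeline
        List<String> result = pipeline.collect(Collectors.toList());
    }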

Similarly, I had to make a slight modification to the traditional way of Java coding:

    public int timeTraditional(int reps) {
        final Set<Group>   myGroups = initGroups(reps);
        final List<Person> persons  = initPersons(myGroups, reps);

        Set<Group> groups = new HashSet<>();
        for (Person p : persons) {
            if (p.getAge() < reps)
                groups.add(p.getGroup());
        }
        List<Group> sorted = new ArrayList<>(groups);
        Collections.sort(sorted, new Comparator<Group>() {
            public int compare(Group a, Group b) {
                return Integer.compare(a.getSize(), b.getSize());
            }
        });
        int doNotJit = 1;
        for (Group g : sorted) {
            doNotJit |= g.getSize();
        }
        return doNotJit;
    }

So, running these in Caliper produced:

     0% Scenario{vm=/Users/phenry/Tools/JVMs/jdk1.8.0/bin/java, trial=0, benchmark=Lambdas} 753.72 ns; σ=98.39 ns @ 10 trials
     50% Scenario{vm=/Users/phenry/Tools/JVMs/jdk1.8.0/bin/java, trial=0, benchmark=Traditional} 530.16 ns; σ=60.72 ns @ 10 trials

     benchmark  ns linear runtime
     Lambdas 754 ==============================
     Traditional 530 ===================== 

Hmm, that's somewhat disappointing.

With a little bit of effort, I made the traditional approach even more efficient. I don't yet know enough about the new Lambdas to do the same for them but I suspect that there is more autoboxing going on as I mentioned before in other attempts to bring functional programming to the Java language. After all, the reduce method on the Stream interface (the gateway drug to Java 8's functional style) looks like this:

    T reduce(T identity, BinaryOperator<T> accumulator);

(Note how that BinaryOperator<T> is mapped to (i, j) -> ... )

That T represents a reference type, yet in my use of it in reduce(0, (i, j) -> i |= j), it's a primitive int (0) that has to be autoboxed. Maybe they'll optimise this in the future.
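One thing I haven't benchmarked here but which should sidestep the boxing, assuming the primitive stream specialisations are in the build you're using, is to replace the last line of timeLambdas with an IntStream:

    // Mirroring the Lambdas benchmark above, but the sizes are never boxed into Integers.
    int doNotJit = sorted
            .mapToInt(g -> g.getSize())      // IntStream from here on - no autoboxing
            .reduce(0, (i, j) -> i | j);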

Saturday, October 5, 2013

Miscellaneous Coherence Notes


Replication vs. Partitioned Caches

From the (slightly old) Oracle Coherence 3.5:

Partitioned caches use divide and conquer: unlike replicated caches, no single node holds the entire data set. This makes them better suited to systems that are write-heavy since "the Replicated Cache service needs to distribute the operation to all the nodes in the cluster and receive confirmation from them that the operation was completed successfully".

The downside of partitioned caches is that access is more expensive. "It is important to note that if the requested piece of data is not managed locally, it will always take only one additional network call to get it. Another thing to consider is that the objects in a partitioned cache are always stored in a serialized binary form. This means that every read request will have to deserialize the object, introducing additional latency."

"You can also increase the number of backup copies to more than one, but this is not recommended by Oracle and is rarely done in practice. The reason for this is that it guards against a very unlikely scenario - that two or more physical machines will fail at exactly the same time.

"The chances are that you will either lose a single physical machine, in the case of hardware failure, or a whole cluster within a data center, in the case of catastrophic failure. A single backup copy, on a different physical box, is all you need in the former case, while in the latter no number of backup copies will be enough - you will need to implement much broader disaster recovery solution and guard against it by implementing cluster-to-cluster replication across multiple data centers." Oracle's Push Replication pattern is what we're currently using at the moment.

Memory Issues

What you see is not necessarily what you get.

"If your data set is 1 GB and you have one backup copy for each object, you need at least 2GB of RAM.

"The cache indexes you create will need some memory.

The dangers of OOMEs are acute:

"You need to leave enough free space... or a frequent garbage collection will likely bring everything to a standstill, or worse yet, you will run out of memory and most likely bring the whole cluster down - when one node fails in a low-memory situation, it will likely have a domino effect on other nodes as they try to accommodate more data than they can handle".

I can testify that this is indeed the case as I have seen it before.

The book claims that too much memory can also be a problem. I can't vouch for this yet as we're only just moving to larger memory usage and it remains to be seen where the sweet spot is.

"Once the heap grows over a certain size (2GB at the moment for most JVMs), the garbage collection pause can become too long to be tolerated by users of an application, and possibly long enough that Coherence will assume the node is unavailable and kick it out of the cluster...

"Because of this, the recommended heap size for Coherence nodes is typically in the range of 1 to 2 GB, with 1GB usually being the optimal size that ensures that garbage collection pauses are short enough to be unnoticeable."

The book was written in 2009 and GC technology may have improved so YMMV.

Nodes leaving the cluster

One advantage of using a Coherence*Extend client is that its membership of the cluster is not as important.

"When a node fails, the Partitioned Cache service will notify all the other nodes to promote backup copies of the data that the failed node had primary responsibility for, and to create new backup copies on different nodes.

"When the failed node recovers, or a new node joins the cluster, the Partitioned Cache service will fail back some of the data to it by repartitioning the cluster and asking all of the existing members to move some of their data to the new node".



When is a connection connected?

More production issues with Jetty, this time hit by this bug - a deadlock in the web server.

While diagnosing similar bugs, I wondered when a connection is associated with the Java server process. This is not as simple a question as it sounds. The client can make a connection to a server socket but Linux won't recognise the connection as belonging to the process until it is accepted.

Take this code:

SocketChannel channel = ((ServerSocketChannel)key.channel()).accept(); 

Now put a breakpoint on this line, run it and make a connection to it like:

mds@gbl04214[Skye]:~> !telnet 
telnet 128.162.27.126 10110 
Trying 128.162.27.126... 
Connected to 128.162.27.126. 
Escape character is '^]'. 
This is a load of bunkum 


You'll see the breakpoint gets hit.
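For context, that accept line typically sits inside a selector loop something like the following (a minimal sketch of my own, not the actual production code):

    // A minimal, illustrative selector loop; until accept() runs, the OS does not
    // associate the incoming connection with this process.
    // Requires java.net.InetSocketAddress and java.nio.channels.*.
    public static void serve(int port) throws Exception {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(port));
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);

        while (true) {
            selector.select();
            for (SelectionKey key : selector.selectedKeys()) {
                if (key.isAcceptable()) {
                    // the breakpoint in the text sits on this line
                    SocketChannel channel = ((ServerSocketChannel) key.channel()).accept();
                    channel.configureBlocking(false);
                    channel.register(selector, SelectionKey.OP_READ);
                }
            }
            selector.selectedKeys().clear();
        }
    }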

> netstat -nap 2>/dev/null | grep 10110 
tcp        0      0 128.162.27.126:10110    :::*                    LISTEN      5711/java 
tcp       26      0 128.162.27.126:10110    128.162.27.125:55658    ESTABLISHED - 


(Note how netstat reports 26 bytes in the receive buffer ready to be read: my message "This is a load of bunkum" plus telnet's trailing CR/LF.)

Step over the breakpoint (i.e. we've now called accept) and we see:

> netstat -nap 2>/dev/null | grep 10110 
tcp        0      0 128.162.27.126:10110    :::*                    LISTEN      5711/java 
tcp       26      0 128.162.27.126:10110    128.162.27.125:55658    ESTABLISHED 5711/java 


Now the OS associates the connection with our process.

So, if your server goes mad and stops accepting connections, netstat may make it look as if the backed-up connections are not reaching your process. This is wrong: they are established and sitting in the kernel's accept queue, waiting for your accept call.

As an addendum, if you do the same thing with netcat rather than telnet, notice that the socket state is CLOSE_WAIT rather than ESTABLISHED. Netcat has terminated and sent a FIN packet, but the socket won't leave CLOSE_WAIT until our application closes its side of the connection.

> echo this is rubbish | netcat  128.162.27.126 10110 

> netstat -nap 2>/dev/null | grep 10110 
tcp        0      0 128.162.27.126:10110    :::*                    LISTEN      5711/java 
tcp       17      0 128.162.27.126:10110    128.162.27.125:49854    CLOSE_WAIT  - 

 Still, there is data to read even if the client no longer exists.