Sunday, March 26, 2017

You have nothing to lose but your (Markov) chains!


I've been playing around with Markov Chain Monte Carlo, so I decided to take a Coursera course that deals with it (Statistical Mechanics allows you to plot configurations of objects that are as probable as if you had taken a snapshot of a classical Newtonian system).

"Starting from a given initialdistribution, a Markov chain defines a sequence of probability distributions over states of the system. For any Markov chain satisfying certain mathematical conditions, the distribution at successive states converges to a single distribution. This single distribution is called the stationary distribution... Much of the art of designing an MH algorithm is in choosing the proposal distribution." PPP (p343).

"The Metropolis algorithm takes a random walk in a probability space."

"The only difference between Metropolis and Metropolis-Hastings is that the first ones always sample from a symmetric distribution and thus they don't have the Hastings correction ratio." - see here

What is a Markov chain?

"If we think of Xt as a state X at time t and invoke the following dependency condition on each state:

Pr(Xt+1 = xt+1 | Xt = xt, Xt-1 = xt-1, ..., X0 = x0) = Pr(Xt+1 = xt+1 | Xt = xt)

then the stochastic process is known as a Markov chain" (from here)

What is a Monte Carlo simulation?

"The Monte Carlo method is a statistical - almost experimental - approach to computing integrals using random positions, called samples, whose distribution is carefully chosen." [Krauth]

In the two Coursera examples, pi is calculated by:

  1. throwing stones into a square and observing the fraction that land inside a circle that just fits in the square (direct sampling).
  2. wandering around the space throwing stones and observing how high the pile is at each grid reference (Markov chain sampling). Note that there are interesting restrictions when a stone is tossed over the boundary of the space. Both are sketched below.
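
Here is a minimal sketch of both estimators (my own reconstruction, not the course's code; the starting point (1, 1) and the step size delta are arbitrary choices, and I'm assuming the reject-and-stay rule for stones tossed over the boundary):

import random

def direct_pi(n_trials):
    # Direct sampling: throw stones uniformly into the square [-1, 1] x [-1, 1]
    hits = 0
    for _ in range(n_trials):
        x, y = random.uniform(-1.0, 1.0), random.uniform(-1.0, 1.0)
        if x * x + y * y < 1.0:  # inside the inscribed unit circle
            hits += 1
    return 4.0 * hits / n_trials

def markov_pi(n_trials, delta=0.1):
    # Markov-chain sampling: take small random steps around the square.
    # A step that would leave the square is rejected and we stay put,
    # but the repeated position still gets a stone added to its pile.
    x, y = 1.0, 1.0
    hits = 0
    for _ in range(n_trials):
        dx, dy = random.uniform(-delta, delta), random.uniform(-delta, delta)
        if abs(x + dx) < 1.0 and abs(y + dy) < 1.0:
            x, y = x + dx, y + dy
        if x * x + y * y < 1.0:
            hits += 1
    return 4.0 * hits / n_trials

print(direct_pi(10 ** 6))  # roughly 3.14
print(markov_pi(10 ** 6))  # roughly 3.14, but from correlated samples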

Note that both approaches approximate an integral in multiple dimensions, namely:

∫_{-1}^{1} ∫_{-1}^{1} p(x, y) O(x, y) dx dy
___________________________________________

    ∫_{-1}^{1} ∫_{-1}^{1} p(x, y) dx dy

where p is the probability of being at point (x, y) and O is the observed size of the pile of stones at (x, y).
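
For the pi example, p(x, y) is uniform over the square and O(x, y) is the indicator function of the unit circle (1 inside, 0 outside), so the ratio collapses to the area of the circle divided by the area of the square, that is π/4. That is why multiplying the observed hit ratio by 4 recovers π.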

Detailed balance condition

This is a sufficient (though not necessary) condition for a Markov chain to converge to the desired stationary distribution. The argument goes like this:

Imagine moving around a 3x3 grid exploring all the squares. If we are on the square in the top right corner (call it 'a') and we can only move horizontally and vertically, then the probabilities for the next move are obviously:

p(a → a) + p(a → b) + p(a → c) = 1                (1)

(where 'b' and 'c' are the squares immediately to the left of and immediately below 'a')

Similarly, I can only get to 'a' via 'a', 'b' and 'c', so:

π_a · 1 = π_b p(b → a) + π_c p(c → a) + π_a p(a → a)   (2)

Substituting equation (1) into (2) and cancelling the π_a p(a → a) term on both sides gives:

π_a p(a → b) + π_a p(a → c) = π_b p(b → a) + π_c p(c → a)

There are many ways to satisfy this global balance condition, but one is to demand that the first term on the left-hand side equals the first term on the right-hand side, and likewise for the second terms.

This is the detailed balance condition:

π_a p(a → b) = π_b p(b → a)
π_a p(a → c) = π_c p(c → a)

where π_x is the probability of being on square x.

If we want to sweep out all the configurations equally, we want:

π_a = π_b = π_c

(This is not true in the Metropolis algorithm, where Metropolis et al. proposed a different function for π_x.)

This implies p(a → b) = p(b → a).

What's interesting is that "the Monte Carlo algorithm is nothing but a matrix, the transfer matrix of transition probabilities p(a → b) for moving from the site a to the site b." In this 3x3 game, the matrix is 9x9 (one row for each of the 9 squares we can move from, one column for each of the 9 squares we can move to).

The transfer matrix

In Python, it looks like this:

from numpy import dot, matrix, set_printoptions
from numpy.linalg import eig

m = matrix('2  1  0  1  0  0  0  0  0;'
           '1  1  1  0  1  0  0  0  0;'
           '0  1  2  0  0  1  0  0  0;'
           '1  0  0  1  1  0  1  0  0;'
           '0  1  0  1  0  1  0  1  0;'
           '0  0  1  0  1  1  0  0  1;'
           '0  0  0  1  0  0  2  1  0;'
           '0  0  0  0  1  0  1  1  1;'
           '0  0  0  0  0  1  0  1  2') / 4.0
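
A couple of quick sanity checks on this matrix (reusing m from the snippet above): each row of transition probabilities must sum to 1, and since the stationary distribution we want here is uniform, detailed balance reduces to the matrix being symmetric.

from numpy import allclose

# each row of transition probabilities sums to 1
assert allclose(m.sum(axis=1), 1.0)
# with uniform π, detailed balance π_a p(a → b) = π_b p(b → a) reduces to symmetry
assert allclose(m, m.T)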

This conforms to a mathematical 'design pattern': chase the problem onto known territory and deal with it there using tools designed by other people.

Where it gets cool is when we take the eigen-decomposition (a.k.a spectral decomposition) of this matrix. This says that for a square matrix, A, we can express it as:

A = Q Λ Q^{-1}

where the columns of Q are the eigenvectors and Λ is a diagonal matrix whose entries are the eigenvalues (a.k.a. the spectrum of A).

In Python, we can find Q and Λ by doing this:

eigenvalues, eigenvectors = eig(m)
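
We can check the decomposition by reassembling m from its parts (a quick check of my own, using the names from the snippet above):

from numpy import allclose, diag
from numpy.linalg import inv

Q = eigenvectors
Lambda = diag(eigenvalues)
assert allclose(dot(dot(Q, Lambda), inv(Q)), m)  # A = Q Λ Q^{-1}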

And why is this interesting? Because if we keep multiplying the transfer matrix with the vector representing the probabilities of which square we're currently in, it converges to the stationary distribution.

How do we prove this? Well, let's say we multiply A by itself x times:

A^x = (Q Λ Q^{-1})^x = Q Λ (Q^{-1} Q) Λ (Q^{-1} Q) ... (Q^{-1} Q) Λ Q^{-1} = Q Λ^x Q^{-1}

so, because each Q^{-1} Q pair in the middle collapses to the identity, we only have to raise the eigenvalues to the power x. That makes the calculation much more efficient.
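
For example, raising m to the 10th power directly agrees with raising only the eigenvalues to the 10th power (reusing Q and the other names from the check above):

from numpy import allclose, diag, dot
from numpy.linalg import inv, matrix_power

# Q Λ^x Q^{-1} only needs the 9 diagonal entries raised to the power x
assert allclose(matrix_power(m, 10),
                dot(dot(Q, diag(eigenvalues ** 10)), inv(Q)))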

What are the eigenvalues?

set_printoptions(precision=2)
set_printoptions(suppress=True)
print(eigenvalues)

[-0.5   0.    1.    0.75  0.5   0.25 -0.    0.25  0.75]

Note that all but one eigenvalue has an absolute value less than 1.0, so when each is raised to a large enough power, every value other than the third will vanish.

Now, let's look at the eigenvectors (Q):

print(eigenvectors)

[[ 0.17 -0.41 -0.33  0.58  0.5   0.33  0.22  0.04  0.11]
 [-0.33  0.41 -0.33  0.29  0.   -0.17  0.12 -0.52 -0.23]
 [ 0.17  0.   -0.33 -0.   -0.5   0.33 -0.34  0.04 -0.57]
 [-0.33  0.41 -0.33  0.29 -0.   -0.17 -0.57  0.48  0.34]
 [ 0.67 -0.   -0.33  0.   -0.   -0.67  0.   -0.08 -0.  ]
 [-0.33 -0.41 -0.33 -0.29  0.   -0.17  0.57  0.48 -0.34]
 [ 0.17  0.   -0.33 -0.   -0.5   0.33  0.34  0.04  0.57]
 [-0.33 -0.41 -0.33 -0.29  0.   -0.17 -0.12 -0.52  0.23]
 [ 0.17  0.41 -0.33 -0.58  0.5   0.33 -0.22  0.04 -0.11]]

and note that the third column (corresponding to the third eigenvalue, 1.0) is uniform. All the other columns will disappear as their corresponding eigenvalues, when raised to a sufficiently high power, tend to 0. And that's why, after a sufficient number of matrix multiplications, we'll reach equilibrium.
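
In fact, we can read the stationary distribution straight off that third column by normalising it to sum to 1 (a small sketch using the names above):

# pick the column whose eigenvalue is 1 (the largest one)
i = eigenvalues.argmax()
stationary = eigenvectors[:, i] / eigenvectors[:, i].sum()
print(stationary.T)  # every entry is 1/9, i.e. about 0.11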

Let's see it in action by repeatedly squaring the matrix (ten squarings raise it to the 1024th power):

for i in range(10):
    m = dot(m, m)

print(m)

[[ 0.11  0.11  0.11  0.11  0.11  0.11  0.11  0.11  0.11]
 [ 0.11  0.11  0.11  0.11  0.11  0.11  0.11  0.11  0.11]
 [ 0.11  0.11  0.11  0.11  0.11  0.11  0.11  0.11  0.11]
 [ 0.11  0.11  0.11  0.11  0.11  0.11  0.11  0.11  0.11]
 [ 0.11  0.11  0.11  0.11  0.11  0.11  0.11  0.11  0.11]
 [ 0.11  0.11  0.11  0.11  0.11  0.11  0.11  0.11  0.11]
 [ 0.11  0.11  0.11  0.11  0.11  0.11  0.11  0.11  0.11]
 [ 0.11  0.11  0.11  0.11  0.11  0.11  0.11  0.11  0.11]
 [ 0.11  0.11  0.11  0.11  0.11  0.11  0.11  0.11  0.11]]

(Ir)Reducible matrices

We could easily extend our 3x3 box into a 6x6 box and make the transfer matrix such that only the top left hand 3x3 square and the bottom right hand 3x3 square were possible territories. If there were no means of transitioning from one territory to the other, this would violate one of the conditions a Markov chain must satisfy to converge to a single stationary distribution. If we ever found ourselves in one quadrant, we'd know with 100% certainty that we'd never visited, nor ever would visit, the other.

"One of the two mathematically rigorous conditions, for a Monte Carlo Markov chain algorithm is that this pulling apart is not possible, that would be irreducible."

One way of looking at this is to picture the matrix as a graph. Usually we picture a graph as an adjacency matrix, but going the other way around, we'd see this matrix as a graph with two disjoint sets of vertices. In linear algebra jargon, the matrix is reducible into these two sets. In graph theory jargon, we'd say the graph is not strongly connected. They are equivalent concepts.
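
To make that concrete: a reducible chain betrays itself in the spectrum, having more than one eigenvalue equal to 1 (one per disjoint territory), so there is no single stationary distribution. A tiny made-up example (two disjoint two-state chains glued into one 4x4 transfer matrix, not the 6x6 box itself):

from numpy import isclose, matrix
from numpy.linalg import eigvals

# no transitions between the top-left and bottom-right 2x2 blocks
reducible = matrix('0.5 0.5 0.0 0.0;'
                   '0.5 0.5 0.0 0.0;'
                   '0.0 0.0 0.5 0.5;'
                   '0.0 0.0 0.5 0.5')
print(sum(isclose(eigvals(reducible), 1.0)))  # prints 2, not 1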


Friday, March 10, 2017

Hadoop and HBase in an integration test


It's always very nice to have your integration tests run out-of-the-box as soon as you check the code out of Git. No fiddling with your environment, just click and go.

Unfortunately, Hadoop uses some native Windows libraries and for those unfortunate enough to develop on Windows, this is a pain - particularly if you have a large, distributed team. So, this is a nice hack I've used a couple of times to get the integration tests to run on developers' laptops without any manual configuration.

First, check the binaries hadoop.dll and winutils.exe into your source repository. They're not very big (about 40KB). Then you need to set the system properties:

System.setProperty("java.library.path", BINARY_DIRECTORY)
System.setProperty("hadoop.home.dir",   BINARY_DIRECTORY)

where BINARY_DIRECTORY is the absolute path name of the directory the binaries live in.

Then, before running Hadoop code, you need this dirty hack that uses reflection to assure Hadoop that everything is OK:

val classLoader = this.getClass.getClassLoader

// Append BINARY_DIRECTORY to the class loader's private list of native library paths
val usrPathsField = classOf[ClassLoader].getDeclaredField("usr_paths")
usrPathsField.setAccessible(true)
val usrPath    = usrPathsField.get(classLoader).asInstanceOf[Array[String]]
val newUsrPath = new Array[String](usrPath.length + 1)
System.arraycopy(usrPath, 0, newUsrPath, 0, usrPath.length)
newUsrPath(usrPath.length) = BINARY_DIRECTORY
usrPathsField.set(classLoader, newUsrPath)

// Convince Hadoop that its file systems have already been loaded...
val fileSystemsLoadedField = classOf[org.apache.hadoop.fs.FileSystem].getDeclaredField("FILE_SYSTEMS_LOADED")
fileSystemsLoadedField.setAccessible(true)
fileSystemsLoadedField.setBoolean(null, true)

// ... and that the native code isn't available, so it falls back to the pure-Java implementations
val nativeCodeLoadedField = classOf[org.apache.hadoop.util.NativeCodeLoader].getDeclaredField("nativeCodeLoaded")
nativeCodeLoadedField.setAccessible(true)
nativeCodeLoadedField.set(null, false)


(This is Scala but it translates to Java easily enough)

It's ugly and it's dirty but it does the job. It basically tricks Hadoop into thinking everything is OK. It's good enough for tests but don't do it in production. Also, it's not at all necessary for *Nix systems.

Having done this, you can start running the integration tests against HDFS, which is backed by your own hard disk. Again, you'd never do this in production (what's the point?) but it really helps with tests.

You do this with the following code (assuming you now want HBase to live on top of Hadoop):

val configuration = org.apache.hadoop.hbase.HBaseConfiguration.create()
configuration.set("fs.file.impl", classOf[org.apache.hadoop.fs.LocalFileSystem].getName)
configuration.set("fs.hdfs.impl", classOf[org.apache.hadoop.hdfs.DistributedFileSystem].getName)
val utility = new org.apache.hadoop.hbase.HBaseTestingUtility(configuration)
utility.startMiniCluster()

And you're good to go. HBase and HDFS both working on a single machine.