Saturday, October 8, 2016

Maths and the Principle of Least Surprise

Well, it surprised me.

While Java's maximum and minimum values of Integers really do represent the highest and lowest int values, the same is not true of Doubles. Here, the maximum and minimum values represent the largest and smallest values of the magnitude of the numbers. That is, both values are positive even though Doubles are signed. The nomenclature is confusing but it makes sense (see here). Unlike Integers, Doubles are symmetric so the smallest number you can represent is -Double.MAX_VALUE.

Just when you get used to that, you find that Scala does it differently: Double.MinValue really is the smallest value a double can hold. That is,

Double.MinValue == -java.lang.Double.MAX_VALUE

It's debatable which is the most surprising.

Friday, October 7, 2016

HBase Troubleshooting

Nasty little problem. Connecting to HBase kept timing out after lots of errors that looked like:

client.RpcRetryingCaller: Call exception, tries=17, retries=35, retryTime=208879ms, msg row '' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=HOST,60020,1475135661112, seqNum=0

while logging in the HBase shell seemed to be OK.

HBase uses Zookeeper to keep a record of its regions. So, first I tried checking that:

> echo ruok | nc ZOOKEEPER_HOST 2181

Running a repair didn't seem to help either.

Time to look at the HBase transaction logs. They kept repeating every second or so something like:

master.splitLogManager: total tasks =3 unassigned = 0 tasks={ /hbase-secure/splitWAL/WALs/HOST,600201474532814834-splitting ... status = in_progess ...

WALs are write ahead logs. "HBase data updates are stored in a place in memory called memstore for fast write. In the event of a region server failure, the contents of the memstore are lost because they have not been saved to disk yet. To prevent data loss in such a scenario, the updates are persisted in a WAL file before they are stored in the memstore".

These WALs appeared to be constantly splitting. So, a quick look at the WALs directory  was in order. It looked something like this:

drwr-x-r-x - hbase app-hdfs     0 2016-07-29 /apps/hbase/xxx/WALs/MACHINE_NAME,60020,1474532814834-splitting
drwr-x-r-x - hbase app-hdfs     0 2016-09-29 /apps/hbase/xxx/WALs/MACHINE_NAME,60020,1474532435621
drwr-x-r-x - hbase app-hdfs     0 2016-09-30 /apps/hbase/xxx/WALs/MACHINE_NAME,60020,1474532365837
drwr-x-r-x - hbase app-hdfs     0 2016-07-28 /apps/hbase/xxx/WALs/MACHINE_NAME,60020,1474532463823-splitting

Splitting is natural but shouldn't take too long (a split file from months ago is definitely a bad sign).

We were lucky that this was a test environment so we could happily delete these splitting files and restart HBase (YMMV. Think long and hard about doing that in a production environment...) But the problem went away for us.

Saturday, September 24, 2016

Probabilistic Programming - Part 1

Probabilistic programming is an exciting, new-ish area that is eating a lot of my spare time.

"Probabilistic Programming and Bayesian Methods for Hackers" is an excellent book I am playing with. Not only is it free (see here), you can download it and read it as an IPython book (with the command ipython notebook) and tinker with the equations (use shift-enter to execute them).

As you might have gathered, it's Python-based but there are moves to integrate the library it uses extensively (PyMC) with Spark.

In the Java/Scala world, Figaro is making in-roads. In this first part post, I'm making notes on the Python implementation. In part two, I'm hoping to implement a Scala/Figaro version.

Bayes again

Bayesian probability requires a different way of thinking.

If frequentist and Bayesian inference were programming functions, with inputs being statistical problems, then the two would be different in what they return to the user. The frequentist inference function would return a number, representing an estimate (typically a summary statistic like the sample average etc.), whereas the Bayesian function would return probabilities
For example, in our debugging problem above, calling the frequentist function with the argument "My code passed all tests; is my code bug-free?" would return a YES. On the other hand, asking our Bayesian function "Often my code has bugs. My code passed all tests; is my code bug-free?" would return something very different: probabilities of YES and NO. The function might return: 
YES, with probability 0.8; NO, with probability 0.2 
This is very different from the answer the frequentist function returned. Notice that the Bayesian function accepted an additional argument:  "Often my code has bugs". This parameter is the prior. By including the prior parameter, we are telling the Bayesian function to include our belief about the situation.
The formula for Bayesian probability goes like this:

P(A|X) = P(X|A) P(A)

Which reads in English as: the probability of A given X is equal to the probability of X give A multiplied by the probability of A by itself and divided by the probability of X by itself.

P(A|X)is called our posterior and P(A) is called our prior.

Probability distributions refresher

Probability distributions can be discrete (integers) or continuous (real numbers). The distribution of a discrete variable is called a probability mass function, a continuous variable has a probability density function.

One (of many) discrete functions is the Poisson distribution that dates back centuries and was derived to estimate the probabilities of cavalry horses losing a shoe in (or some other independent event). It looks like:

P(Z = k) = λk e

and this has the convenient property that its expected value is λ (or, in maths speak, E[Z|λ]λ).

So, this looks as good a distribution as any for our prior. The trouble is, what is λ? Aye, there's the rub. λ can be anything (an integer or any real number), so to model it, we can choose an exponential distribution. It is continuous and it has some convenience. The probability density function looks like:

f(z|λ) = λ e-λz ,  z >= 0

and the convenient fact is that E[Z|λ] = λ-1.


The task is to deduce when (if at all) somebody's behavior changed given a list of the number of text messages they sent over a certain time period. To do this, the book uses PyMC, a Python library.

The workhorse of the code is the generation of simulated data:

def lambda_(tau=tau, lambda_1=lambda_1, lambda_2=lambda_2):
    out = np.zeros(n_count_data)
    out[:tau] = lambda_1  # lambda before tau is lambda1
    out[tau:] = lambda_2  # lambda after (and including) tau is lambda2
    return out

which simply creates an array where the elements are one of two values. All the values before tau are lambda_1 and the rest are lambda_2. Simples.

"Inside the deterministic decorator [@pm.deterministic], the Stochastic variables passed in behave like scalars or Numpy arrays (if multivariable), and not like Stochastic variables."

By putting some print statements in, you can see that this function gets hit thousands of times when we feed it into the Poisson function that is given to a Model function that is passed to a MCMC function (Monte Carlo Markov Chain). Phew! The details can be found in the freely downloadable book and I'll explain more when I show the Figaro implementation.

"MCMC returns samples from the posterior distribution, not the distribution itself... MCMC performs a task similar to repeatedly asking "How likely is this pebble I found to be from the mountain I am searching for?", and completes its task by returning thousands of accepted pebbles in hopes of reconstructing the original mountain. In MCMC and PyMC lingo, the returned sequence of "pebbles" are the samples, cumulatively called the traces."

And it's these traces that (when graphed) can help us to see when it is most likely that behavior changed. I'll go into more detail with the Scala implementation in part 2.

Saturday, September 3, 2016

Yet more maths and graphs - Lagrange Multiplier

Here's a little maths trick called the Lagrange Multiplier that helps in finding the maximum/minimum of a function f(x, y) given a constant that also uses x and y. We "form the function

F(x, y) = f(x, y) + λ Φ(x, y) 

and set the two partial derivatives of F equal to zero. Then solve these two equations and the equation Φ(x, y) = constant for the three unknowns x, y and λ".

A nice proof can be found in the essential undergraduate text, Mathematical Method in the Physical Sciences. Let's say we're trying to find the max/min of the function f(x, y). Obviously, at this point.

df = δf dx + δf dy = 0                               (1)
     δx      δy

and say we've been given a function that is a constant Φ(x, y). Once again, it should be obvious that:

dΦ = δΦ dx + δΦ dy = 0                               (2)
     δx      δy

since Φ(x, y) is a constant.

Now, multiply the second equation by λ and add it to the first.

(δf + λ δΦ )dx + (δf + λ δΦ)dy = 0                   (3)
 δx     δx        δy     δy

Let's take that second term. Since λ is arbitrary, let's choose it to be such that:

δf + λ δΦ = 0                                        (4)
δy     δy

(That is, we've decided

λ = - δf  / δΦ                                       (5)
              δy /  δy

But it's not necessary to worry about that as we're not particularly interested in the value of λ. We just use it to solve the equations).

Similarly, we do the same for the first term of (3):

δf + λ δΦ = 0                                        (6)
δx     δx

Now, with equations (4) and (5) and (6) we have enough information to find the values of x and y at the position of the min/max point.

Why am I writing this post? Because we're going to need the Lagrange Multiplier shortly in partitioning a graph.

Friday, September 2, 2016

Miscellaneous HBase tips

I'm new(ish) to HBase so here are a few notes I've jotted down over the last few months. Basically, Spark is great at what it does, but if you need to look something up while processing an element, you're best relying on another tool. HBase has been our choice.

HBase is not like a normal database

You can't do a simple select count(*)... You need to do something like:

$ hbase org.apache.hadoop.hbase.mapreduce.RowCounter

See here for more information.

The HBase shell is not even SQL like. If you want to limit the number of rows returned, the syntax looks like:

hbase> scan 'test-table', {'LIMIT' => 5}

In "Hadoop: the Definitive Guide" there is a great "Whirlwind Tour of the Data Model". What follows is an even more condensed precis.
"Table cells - the intersection of rows and column coordinates - are versioned. By default, their version is a timestamp... A cell's contents is an uninterpreted array of bytes.
"Table rows are sorted by ... the table's primary key... All table accesses are via the primary key
"Row columns are grouped into column families. All column family members have a common prefix.
"Physically, all column family members are stored together on the filesystem [therefore] it is advised that all column family members have the same general access patterns and size characteristics.
"Tables are automatically partitioned horizontally by HBase into regions. Initially, a table comprises of a single region but as the size of the region grows, it splits...Regions are the units that get distributed over an HBase cluster.
"Row updates are atomic no matter how many row columns constitute the row-level transaction." 
Unsurprisingly for a solution based on Hadoop, HBase shares a similar architecture. The "HBase master node orchestrat[es] a cluster of one or more regionserver slaves." Unlike Hadoop, "HBase depends on ZooKeeper ... as the authority on cluster state".

"At a minimum, when bootstrapping a client connection to an HBase cluster, the client must be passed the location of the ZooKeeper ensemble. Thereafter, the client navigates the ZooKeeper hierarchy to learn cluster attributes such as server locations." For the purposes of creating a Spark artifact, this is simply a matter of adding the host machine's hbase-site.xml at the top level.


Find the load the cluster is under with:

echo "status 'detailed'" | hbase shell 2>/dev/null | grep requestsPerSecond | perl -pe s/,.*//g

With this, I've seen nodes in my cluster happily reach 60 000 requests per second, which is most pleasing.

However, the load over the nodes is not terribly evenly distributed. One way to tackle this problem is salting. I did actually preempt this problem by reversing the Strings that were my keys. Since each key has a prefix taken from a fairly small set, I was expecting them to form "lumps" of data. However, I then create the HBase table with something like:

hbaseAdmin.createTable(hTableDescriptor, toBytes('0'), toBytes('z'), 20)

(where 20 is my number of regions).

However, this assumes that the text (even when reversed) that I am using as keys is evenly distributed over the alphanumerics (it's not as it's English words rather than random text). So, I still have some lumpiness.

Another optimization is to define the regions at table creation time (see the HBaseAdmin.createTable method that takes a startKey and endKey). This is to mitigate the problem that when a table is created, there is only one region. "The reason HBase creates only one region for the table is that it cannot possibly know how to create the split points within the row key space. Making such decisions is based highly on the distribution of the keys in your data." (from here). "Since pre-splitting will ensure that the initial load is more evenly distributed throughout the cluster, you should always consider using it if you know your key distribution beforehand."

Tuesday, August 30, 2016

Refresher on Eigenvalues and Eigenvectors

A lot of graph analysis strongly relies on linear algebra so here's a quick refresher on stuff you'll have studied in your undergraduate days.

In the mathematics of graphs, you'll see the mention of eigenvalues and eigenvectors. The general description of them (as in that Wikipedia link) focuses on the rotation, translation and shearing of images. Although absolutely correct, it takes the emphasis away from their most interesting quality: we're solving sets of equations with them.

A set of equations that are formed of constants and variables that are never raised to a power greater than one are called linear equations.


Let's take a straight-forward set of simultaneous equations:

2x-3y+ 8z=4

We can represent the coefficients as a matrix:


Now, we can produce a row reduced matrix by employing any combination of:

  • interchanging two rows
  • multiplying a row by a non-zero constant
  • adding one row to another

to give us:


where the the entire matrix we'll call matrix A but the purple subset we'll call matrix M. The rank (that is, the number of non-zero rows after all row reduction) is greater in A than M indicating that the equations have no solution (since no values of 0x, 0y and 0z can equal the -20 in the last row). It is inconsistent.

The rule here is:
  1. rank M < rank A: no solution
  2. rank M = rank A = number of unknowns: one solution
  3. rank M = rank A = R < n: R unknowns can be expressed in terms of (n - R) unknowns.
An example:

2x-y- 5z=2

when row reduced becomes:


Since this is option #3, and R = 2 and n  = 3, it's not surprising that we can express the original equations in one unknown (z, say)

x = 3 + 2z
y = 4 - z

which with a bit of Python, Matlib and some help from here like this

import matplotlib as mpl
from mpl_toolkits.mplot3d import Axes3D
import numpy as np
import matplotlib.pyplot as plt

mpl.rcParams['legend.fontsize'] = 10
fig = plt.figure()
ax = fig.gca(projection='3d')
theta = np.linspace(-4 * np.pi, 4 * np.pi, 100)
z = np.linspace(-10, 10, 10)
x = 3 + (2 * z)
y = 4 - z
ax.plot(x, y, z, label='x = 3 + 2z; y = 4 - z')

tells us that there are infinite solution - that is, anything on this line:


Like the use of images to demonstrate the properties of matrices, determinants are often described geometrically in the way they define volumes. "For an n x n matrix with columns, the value of det A is the signed volume of the parallelepiped defined by the vectors" [2] This is similar to calculating the cross product. In fact, the triple product that uses both the cross product and the dot product is how to find the volume in 3 dimensions.

But, this is only part of the story. Determinants are interesting because of what they tell us about the matrix.

Calculating determinants is tedious. Thankfully, since my undergraduate days, there have been many libraries written to do it for me (floating-point caveats aside). Now, we just run something like:

import numpy as np

identity = np.eye(3)
print "det(I3) = ", np.linalg.det(identity)

which tells me the (not too surprising) fact that the determinant of the identity matrix (I) over field ℜ3 is 1.0.

More interestingly:

det AB = det BA = (det A) x (det B)

Why is this interesting? Well, if A is a matrix (and matrices can be treated as functions) that has an inverse:

A A-1 = I
det (A A-1) = det(A) x det(A-1) = det I = 1

If the product of terms equals 1, no term can be 0. So, for an inverse matrix to exist, a prerequisite is that the matrix's determinant must be non-zero.

Homogeneous Equations

When the constants on the right hand side of all linear equations in the set are 0, we call them homogeneous equations.

Unlike the equations at the top of this post, "homogeneous equations are never inconsistent; they always have the solution 'all unknowns = 0' (often called the 'trivial solution'). If the number of independent equations (that is, the rank of the matrix) is the same as the number of unknowns, this is the only solution." [1] That is, there are too many contradictory equations (see here for a nice demo).

"If the rank of the matrix is less than the number of unknowns, there are infinitely many solutions." [1] That is, we can only express unknowns in terms of other unknowns as there are no constants (try it!). However, the infinite solutions describe a line, a plane, a hyper-plane (depending on the number of dimensions in which we're working) and we can take unit-vectors over these spaces.

"A system of n homogeneous equations in n unknowns has solutions other than the trivial solution if and only if the determinant of the coefficients is zero" [1]

Eigen- vectors and values

Which finally brings us to eigen- vectors and values. What they are is simple. Again in geometric terms, we can consider them as vectors whose direction does not change under a transformation from a matrix but whose magnitude might. Why they are interesting is more subtle. They come in handy in problems as disparate as Latent Semantic Analysis where we want to know which documents contain which 'concepts' to determining if a graph is acyclic.

Let's take their definition:

M v = λ v

(where M is a matrix, v an eigenvector and λ an eigenvalue - there may be more than one).

Not all matrices have eigen-vectors and -values (over the field on which the matrix acts - see here for more information). Again from a geometric point of view, if we imagine the matrix to represent a rotation, no vector stays pointing in the same direction (note: there may still be solutions but the eigenpairs will be made of complex numbers).

Not all eigenvectors are orthogonal. Generally, they are not although they are for symmetric matrices.

[1] Mathematical Method in the Physical Sciences - Boas.
[2] Coding the Matrix - Klein

Saturday, August 20, 2016

Louvain Modularity and Graph Modularity

Looking at ways to solve my problem of agglomerative clusters when trying to de-dupe data, I turned to community detection algorithms.

Louvain Modularity is a heuristic that tells you about communities in a network (implementations found here). The original paper ("Fast unfolding of communities in large networks") can be found here. The first equations comes without much of an introduction so I scuttled off to find where it came from. The natural choice was a brilliant book by the equation's creator Networks: An Introduction by Mark Newman*.

Newman points out the difficulty in coming up with an algorithm that finds communities. If you want to minimise the number of edges crossing a boundary between two groups, then a naive algorithm will cheerfully put no nodes in one group and all N nodes in the other. So, you might want to tell the algorithm to weight the results by multiplying them by the product of the numbers in each group. This is heading in the right direction as a 0-N split will be the least favourable. In fact, with a bit of simple calculus, you can find that the maximum is (N/2). This is nice but arbitrary.

So, one metric we could use is the difference between the actual number of edges connecting nodes of the same class with the expected number.

The actual number of between classes ci and cj is given by:
Σδ(ci,cj) =½ΣAij δ(ci,cj)

Where δ(ci,cj) is the Kronecker Delta (which is simply 1 if i == j else 0), A is our adjacency matrix and the ½ comes because we don't want to double-count all the edges in the adjacency matrix.

For the expected number, imagine node i that has degree ki. If the graph has m edges, there are obviously 2m ends. Therefore, for any outgoing edge from i, the chances that the other side is node j is kj/2m and for all outgoing edges from i, the chance is kikj/2m.

Add them all up, and the expected number is:

½Σ(ki kj / 2m) δ(ci,cj)

where, again, the ½ prevents us from double-counting.

Now, we take the difference and divide by m (since we interested in fractions not absolute numbers) and we get:

Q = (1/2m)Σ(Aij - ki kj / 2m) δ(ci,cj)

which is the equation in the paper.

* I really can't recommend this book enough. If you've got an undergraduate degree in something mathematically related, you'll find the equations no problem and they are presented in an easy-going manner.