Monday, September 18, 2017

Spark Partitions


Some miscellaneous Spark/GraphX partitioning notes that I have made.

Co-partitioned and co-located are not the same thing

"Even if the same partitioner is used for an RDD that needs to be partitioned within the current job as was used to partition another RDD that is needed in the current job but whose partitioning was persisted in a prior job, then the RDDs will still be co-partitioned but won't necessarily be co-located."

"There are two things here. One is data co-partition, and then data co-location.

"As long as you specify the same hash partitioner, a1 and b1 will be co-partitioned (i.e. partitioned the same way). The partitions might not be co-located, so your job will still incur network traffic, but you do reduce a whole round of shuffle.

"If a1 and b1 are put into memory by different jobs, there is no guarantee their partitions will be co-located, even though they are co-partitioned.

"What if you add a node to the cluster, or a node dies? Or what if a node is busy and can never be run anything - are you going to block the entire program forever so you could put a partition on that node?

"You can still write your program in a way that does mostly local joins. And even if the joins need to fetch blocks from remote nodes, the joins still avoid an extra round of shuffle, which is much more expensive than just fetching data."

From a Google Groups discussion (here).

Re/partitioning

Partitioning graphs present another interesting challenge. Edge cuts (ie, partitioning vertices) are a viable but generally inefficient way to partition a graph.

Most graphs obey a power law, for example something like Zipf's law if you're processing natural language. ("Zipf's law states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table." from Wikipedia). Consequently, the partitioning of vertices will be 'lumpy'.

GraphX allows you to chose your partitioning strategy. It offers these:

From "Spark GraphX in Action", p214.
You can find my experience of EdgePartition2D here.

"Introducing a .repartition will increase the amount of work that the Spark engine has to do, but the benefit can significantly outweigh that cost." (from here). In this example, the data was "lumpy" causing a single executor to do most of the work.

I've also seen repartitioning on a given field used to optimize joins. See my previous post where repartitiong a Dataframe by field X and writing it to a file was used to make joining on X more efficient.

Joins

"join has the same semantics as in sql join, e.g. every row after the join is (k, v1, v2), where v1 is from rdd1, and v2 is from rdd2.

cogroup just groups values of the same key together, and every row looks like (k, seq1, seq2), where seq1 contains all the values having the same key from rdd1."

(From another Google Groups discussion).

Note that Datasets have slightly different semantics. The outer-joins of RDDs can return RDDs that contain Options. Dataset's join returns another Dataset whose rows can be accessed as usual while its @experimental joinWith returns a Dataset[(T, U)]. Note that T and U can be null. No nice Options for us here.

Sorting

"In Apache Hadoop, the grouping is done in two places - the partitioner, which routes map outputs to reduce tasks, and the grouping comparator, which groups data within a reduce task.  Both of these are pluggable per-job.  The sorting is pluggable by setting the output key comparator...

"The order of the values within a reduce function call, however, is typically unspecified and can vary between runs. Secondary sort is a technique that allows the MapReduce programmer to control the order that the values show up within a reduce function call." (from here). This can then help the next stage.

Sorting also helped us when writing Parquet files to HDFS.

An interesting trick to improve sort performance can be found here: "Sorting in general has good cache hit rate due to the sequential scan access pattern. Sorting a list of pointers, however, has a poor cache hit rate because each comparison operation requires dereferencing two pointers that point to randomly located records in memory. So how do we improve the cache locality of sorting? A very simple approach is to store the sort key of each record side by side with the pointer. For example, if the sort key is a 64-bit integer, then we use 128-bit (64-bit pointer and 64-bit key) to store each record in the pointers array."

DAGs

You may see the DAG in the GUI not being as linear as you'd expect. It is Spark's prerogative to do this.  "Stages that are not interdependent may be submitted to the cluster for execution in parallel: this maximizes the parallelization capability on the cluster. So if operations in our dataflow can happen simultaneously we will expect to see multiple stages launched" (from here).


Tuesday, September 12, 2017

Neural Networks are just Matrices


DeepLearning4J brings neural nets to Java programmers. They suggest in the getting started section to run the XorExample. This is a neural net that, given XOR inputs and outputs, learns its logic. This is non-trivial for a simple neural net (see here) as the true and false values are not linearly separable in a single XOR matrix. DL4J provides a way of making more complicated neural nets but hides a lot of detail.

Matrices

The network in XorExample "consists in 2 input-neurons, 1 hidden-layer with 4 hidden-neurons, and 2 output-neurons... the first fires for false, the second fires for true".

But instead of talking about neurons, it's much easier to think of this neural net as matrices (at least if you're familiar with simple linear algebra).

So, imagine this neural net of just:

  • A 4 x 2 matrix of the inputs ("features").
  • A hidden layer that is a 2 x 4 matrix 
  • A layer that is a 4 x 2 matrix that yields the output.

We need to apply functions to these matrices before we multiply, but that's essentially it.

Multi-layer Neural Nets in a few lines of Python

To do this, I'm not going to use TensorFlow or other frameworks dedicated to neural nets. I'll just use Numpy as basically syntactic sugar around manipulating arrays of arrays.

Also, I'll present a more intuitive approach to the maths. A more thorough analysis can be found on Brian Dohansky's blog here.

Finally, I tried to do this in Scala using the Breeze linear algebra library but it was just so much easier to do it in Python and Numpy as it ran in a fraction of the time it took Scala to even compile.

The code

As mentioned, we need Numpy.

import numpy as np

Now, let's take the input to a XOR operation (our first matrix):

features = np.matrix('0.0, 0.0;'
                     '1.0, 0.0;'
                     '0.0, 1.0;'
                     '1.0, 1.0')

and the expected output:

labels = np.matrix('1.0, 0.0;'
                   '0.0, 1.0;'
                   '0.0, 1.0;'
                   '1.0, 0.0')

(Remember that that this is not trying to be a truth table. Instead, the first column indicates that the output is true if it's 1 and and the second column indicates it's false if it's 1).

We need those other 2 matrices. It doesn't matter what their values are initially as we'll correct them whatever they are. So, let's chose some random values but in a well-defined shape:

weightsLayer1 = np.random.rand(2, 4)
weightsLayer2 = np.random.rand(4, 2)

We also need the gradient of the weights and biases. We could have subsumed the biases into our matrices - that's mathematically equivalent - but the DeepLearning4J example doesn't do this so we won't either.  The weights and biases for the first and second layers respectively are:

_0_W = np.zeros([2, 4])
_0_b = np.zeros([1, 4])
_1_W = np.zeros([4, 2])
_1_b = np.zeros([1, 2])

Finally, we need a step value and a batch size:

learning_rate = 0.1
mini_batch_size = np.shape(features)[0]

Now we can do our first multiplication (or forward propagation):

    s0 = features * weightsLayer1
    s0 += _0_b   

But we need to apply a function to this (an activation function). In this example, we'll use a sigmoid function. It has a simple derivative that we'll need later.

def f_sigmoid(X):
        return 1 / (1 + np.exp(-X))

So, now we can apply the sigmoid activation function to the first layer:

sigmoided = f_sigmoid(s0)

This we'll feed into the next layer with another matrix multiplication:

    s1 = sigmoided * weightsLayer2
    s1 += _1_b

No we need our second activation function to apply to this matrix. It's the softmax function that basically normalizes to 1 all the values in each row allowing us to treat each row as a probability distribution. It looks like this (as stolen from Brian):

def f_softmax(X):
    Z = np.sum(np.exp(X), axis=1)
    Z = Z.reshape(Z.shape[0], 1)
    return np.exp(X) / Z

We apply this to the output from the previous layer and find the delta with what the output should be:

softmaxed = f_softmax(s1)
delta = softmaxed - labels

We calculate the delta weighted according to the weights of this layer, Transposing appropriately:

epsilonNext = (weightsLayer2 * delta.T).T

Now, it's a good job that sigmoid function has an easy derivative. It looks like this:

dLdz = np.multiply(sigmoided, (1-sigmoided)) 

where, note, this is an element-wise multiplication, not a matrix multiplication. Similarly, we calculate the back propagation (which "allows the information from the cost to then flow backward through the network in order to compute the gradient" [1]) using the same Numpy operation:

backprop = np.multiply(dLdz, epsilonNext)

Intuitively, you can think of this as each element of the weighted delta only being multiplied by the gradient of the function at that element.

Note:
"The term back-propagation is often misunderstood as meaning the whole learning algorithm for multi layer neural networks. Actually, back-propagation refers only to the method for computing the gradient, while another algorithm such as stochastic gradient descent, is used to perform learning using the gradient." [1]
Anyway, we can now update the gradients and the weights:

    _0_W = (features.T * backprop) + _0_W
    _1_W = (sigmoided.T * delta) + _1_W
    _0_b = _0_b - (learning_rate * np.sum(backprop, 0))
    _1_b = _1_b - (learning_rate * np.sum(delta, 0))
    weightsLayer1 = weightsLayer1 - (learning_rate * _0_W)
    weightsLayer2 = weightsLayer2 - (learning_rate * _1_W)

Note that the biases are derived from the columnar sums of the backprop and delta matrices.

Now, repeat this some 500 times and the output looks like:

print softmaxed

[[  1.00000000e+00   3.16080274e-36]
 [  1.50696539e-52   1.00000000e+00]
 [  1.64329041e-30   1.00000000e+00]
 [  1.00000000e+00   8.51240114e-40]]

which for all intents and purposes is the same as labels. QED.

Some things we ignored

For brevity, I didn't address the score (that tells us how close we are to our desired values).

Also, any attempt at regularization (that attempts to avoid over-fitting) was ignored. The DL4J XorExample set the L1 (New York taxi distance) and L2 (the Euclidean distance) to zero so we're true to the original there.

[1] Deep Learning (Goodfellow et al)

Sunday, September 10, 2017

Givens Rotations



The problem

If A is a matrix whose columns are {v1v2v3, ... vn }, then any point in its span can be described with an appropriate choice of x thus:
x = 
...
v1v2v3
...
x1
x2
x3
where x1, x2 ... etc are the multipliers we want of a given dimension summed over all the vectors, vi.

What this means is that any vector, b, can be expressed as A x. Equivalently, if we solve |b - A x| = 0 we're solving the least squares problem, beloved of loss functions in machine learning. "The advantage of an algorithm for least squares is that it can be used even when the matrix-vector equation has no solution" [1].

One algorithm: Gram Schmidt

The approach used in "Coding the Matrix" is the Gram Schmidt method. The circuitous route goes like this:
  1. We want to find the closest point between a vector that we'll call b and a space defined by the span of vectors {v1v2v3, ... vn}. We will assume that vi is orthogonal to vj for all i ≠ j. If they're not orthogonal, it's not hard to turn them into vectors that are but that's another problem.
  2. Given a k-dimensional vector space, V, the vector b can be expressed as the sum of two vectors - one that is in V and one that is orthogonal to V. In Klein's notation, this says b = b||V + b⊥V. "The point in V closest to b is b||V and the distance is ||b⊥V|| ... it suffices to find b⊥V, for we can then obtain b||V." [1].
  3. Given the orthogonal set of vectors{v1v2v3, ... vn}, b⊥V equals b with all the projections, <b, vi>, shaved off. This creates a recursive algorithm that looks like:

    bi = bi-1 - <bi-1, vi> / <vi, vi>

    where b0 = b. For convenience, we say σi = = bi-1 - <bi-1, vi> / <vi, vi>
  4. Now b can be expressed as

    b = σ0v0 + ... + σk-1vk-1 + b⊥V

    which just like the matrix at the top of this post can be expressed like:
    b = 
    ......
    v0...vk-1b⊥V
    ......
    σ0
    .
    .
    σk-1
    1
Similarly, if {v1v2v3, ... vn} are not orthogonal, we can invoke the same recursive algorithm to generate the original vectors from a linear combination of mutually orthogonal vectors, {v1*, v2*, v3*, ... vn*}. It would look like this:
A = 
......
v*1v*2...v*n
......

1σ12.........σ1n
01σ23......σ2n
001σ34...σ3n
......
....1σn-1
000001





QR-factorization

QR-factorization is reducing an m x n matix A into two matrices, Q and R where Q is an m x n column orthogonal (ie, the columns are orthogonal and have a unit norm) matrix and R is a triangular matrix [1]. In other words, it looks like the matrix above where R is the upper triangular matrix.

Another algorithm: Givens Rotation

Now, there is another way of achieving the same end. We can rotate the matrix such that a particular cell becomes zero. Keep doing this and we get the upper triangular matrix as above. This is called a Givens rotation. Let's call each rotation Gi then:

R = Gk Gk-1... G1  A

But note that each rotation is of the form:

10..........
01..........
......
..c.-s.
......
..s.c.
......
000001

where c2+s2=1, the c values are on the diagonal and all the other diagonals are 1. Or, to put it more succinctly in Python:

def givensRotation(A, i, j):
    G = np.eye(len(A))
    aij = A[i, j]
    ajj = A[j, j]
    r = ((aij ** 2) + (ajj ** 2)) ** 0.5
    c = ajj / r
    s = - aij / r
    G[j, i] = -s
    G[j, j] = c
    G[i, i] = c
    G[i, j] = s
    return G

Rearranging that equation for R, we get:

A = G1-1 G2-1 ... Gk-1-1 Gk-1  R

but as luck would have it, these G matrices are their own inverses (since c2+s2=1. Try it.). So:

A = G1 G2 ... Gk-1 Gk  R

and all we need to do is keep the original G matrices in the right order and setting

Q = G1 G2 ... Gk-1 Gk

we get:

A = Q R

which is our QR decomposition. Booyah.

Given a triangle represented in Breeze as:

    val m = DenseMatrix.zeros[Double](rows = 3, cols = 3)
    m(0, ::) := DenseVector(1.0, 2.0, 3.0).t
    m(1, ::) := DenseVector(-6.0, 5.0, -4.0).t
    m(2, ::) := DenseVector(-7.0, 1.0, 9.0).t

we can transform it with Givens rotations to give us:
Red is original, blue is reflection

Distributed Givens Rotation using Spark

I tried this in my own toy linear algebra library for Spark, Algernon. Performance currently sucks but as a proof-of-concept it shows it can be done. I avoid Spark's matrix multiplication code for efficiency. Since the G matrix is mostly 1s in the diagonal and this is effectively a NoOp in matrix multiplication for that particular row, we can write:

    val mathOps = implicitly[Numeric[T]]
    import session.sqlContext.implicits._
    import mathOps.mkNumericOps
    ithRow.joinWith(jthRow, '_1 ("j") === '_2 ("j"), "outer").flatMap { case (x, y) =>

      def calculate(x: MatrixCell[T], y: MatrixCell[T], _s: T) =
        {if (x == null) mathOps.zero else (x.x * c)} + {if (y == null) mathOps.zero else (y.x * _s)}

      val newIth = if (x == null) Seq() else Seq(MatrixCell(x.i, x.j, calculate(x, y, s)))
      val newJth = if (y == null) Seq() else Seq(MatrixCell(y.i, y.j, calculate(x, y, -s)))

      newIth ++ newJth
    }

Here, we're only interested in the i-th and j-th rows, multiplying them with the c and s values we have already calculated.

[1] Coding the Matrix, Klein