Spark's DIMSUM algorithm might have a limit to how much data it can process, but the code for it came from Twitter, who process gazillions of tweets every day. How do you apply it to huge amounts of data?

Well, one way is to break the data down into smaller chunks, and one way to do that without losing its meaningfulness is to use locality-sensitive hashing (LSH). The best (but rather verbose) description I've found of LSH is here, a chapter from a free book on data mining.

(Note: Ullman uses the term "characteristic matrix" in the document differently from how perhaps a mathematician would use it. A characteristic matrix in maths is λ**I** - **A**, where **A** is an n x n matrix, **I** is the n x n identity matrix and λ is a scalar. Solving the characteristic equation, |λ**I** - **A**| = 0, gives you the eigenvalues (at most n real roots) and allows you to calculate the eigenvectors, **v**, by substituting the discovered eigenvalues into the homogeneous system of equations, (λ**I** - **A**)**v** = 0.)

To summarise, the key points follow.

__Jaccard Similarity__

First, we define something called the Jaccard similarity between two sets. This is simply:

s = (number of elements shared between the sets) / (total number of distinct elements across both sets)

In other words, the size of the intersection divided by the size of the union.

Next, we convert the two sets into a matrix where each column represents a set and each row represents an element that appears in at least one of the sets. The cell value where they meet simply represents whether that element is in that set (a value of 1) or not (a value of 0). This is what Ullman calls the characteristic matrix.
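As a minimal sketch of the two definitions above, assuming the sets are small enough to hold in memory. The function names `jaccard` and `characteristic_matrix` and the example sets are my own, purely for illustration:

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty sets are treated as identical
    return len(a & b) / len(a | b)


def characteristic_matrix(sets):
    """Rows are elements (sorted for a stable order), columns are sets.

    A cell is 1 if that row's element is in that column's set, else 0.
    """
    elements = sorted(set().union(*sets))
    matrix = [[1 if e in s else 0 for s in sets] for e in elements]
    return elements, matrix


s1 = {"a", "d"}
s2 = {"a", "c", "d"}
elements, matrix = characteristic_matrix([s1, s2])
for element, row in zip(elements, matrix):
    print(element, row)
print("similarity:", jaccard(s1, s2))
```

Here the two sets share 2 of 3 distinct elements, so the similarity is 2/3.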

__Minhashing__

Next, we define a function that gives us the row of the first element in a given column that has the value 1 (going from top to bottom), after the rows of the original matrix have been reordered randomly. We call this minhashing.
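A minimal sketch of minhashing a single column under one row ordering. The `minhash` helper and the way the permutation is represented (a list of original row indices in their shuffled order) are assumptions made for this example:

```python
import random


def minhash(column, permutation):
    """Position, in the permuted order, of the first row where the column holds a 1.

    permutation[i] is the index of the original row that sits at
    position i after shuffling.
    """
    for position, original_row in enumerate(permutation):
        if column[original_row] == 1:
            return position
    return None  # the column is all zeros (an empty set)


column = [0, 1, 0, 1, 1]            # the set contains elements 1, 3 and 4
identity = [0, 1, 2, 3, 4]          # rows left in their original order
print(minhash(column, identity))    # the first 1 is at row 1

rng = random.Random(42)             # seeded so the run is repeatable
shuffled = identity[:]
rng.shuffle(shuffled)
print(shuffled, minhash(column, shuffled))
```

What matters later is not the value itself but whether two columns get the same value under the same ordering.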

__The Minhash and Jaccard relationship__

The probability that two columns have the same minhash result is exactly their Jaccard similarity. The proof of this follows.

Given two columns, say they both have a 1 in x rows, exactly one of them has a 1 in y rows, and we ignore the remaining rows (where both are 0). Because the rows are in random order, the first row from the top in which either column has a 1 is equally likely to be any of these x + y rows, and the two minhashes agree only if it is one of the x rows where both have a 1. The probability of that is, unsurprisingly:

x / (x + y)

which is exactly the Jaccard similarity, s.
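The claim can be checked empirically. The sketch below, with made-up columns and a hypothetical `first_one` helper, shuffles the rows many times and counts how often the two minhashes agree:

```python
import random


def first_one(column, order):
    """First position in `order` whose row holds a 1 in `column`."""
    return next(pos for pos, row in enumerate(order) if column[row] == 1)


# x = 2 rows where both columns are 1, y = 3 rows where exactly one is,
# and 3 rows where both are 0 (ignored), so s = 2 / (2 + 3) = 0.4.
col_a = [1, 1, 1, 0, 0, 0, 0, 0]
col_b = [1, 1, 0, 1, 1, 0, 0, 0]

rng = random.Random(0)  # seeded so the run is repeatable
rows = list(range(len(col_a)))
trials = 20000
agree = 0
for _ in range(trials):
    rng.shuffle(rows)
    if first_one(col_a, rows) == first_one(col_b, rows):
        agree += 1

estimate = agree / trials
print("estimated probability:", round(estimate, 3), "Jaccard similarity:", 0.4)
```

The estimate should land close to 0.4, and it gets closer as the number of trials grows.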

__Minhash Signatures__

Now, let's make another matrix with a column per set (as before) but this time each row represents one of n random functions. Each function simply takes a row index from the original matrix and returns a deterministic value. The cell values are all set to positive infinity but that will change.

For each function, we use the index of the row in the original matrix to generate a result. For each set that has a non-zero value in that row of the original matrix, we place the function's value in the corresponding cell of our new matrix *if it is less than what is already there*.

If there are m sets, this results in an n x m matrix that is our (much reduced) signature.
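The procedure above can be sketched as follows. The function name `minhash_signatures`, the example matrix and the two hash functions are all illustrative assumptions; any functions that scatter row indices reasonably well would do:

```python
import math


def minhash_signatures(matrix, hash_functions):
    """Build the signature matrix: one row per hash function, one column per set.

    matrix[row][col] is the 0/1 characteristic matrix; each hash
    function maps a row index to an integer.
    """
    n_sets = len(matrix[0])
    signature = [[math.inf] * n_sets for _ in hash_functions]
    for row_index, row in enumerate(matrix):
        hashed = [h(row_index) for h in hash_functions]
        for col, cell in enumerate(row):
            if cell:
                for i, value in enumerate(hashed):
                    # keep the smallest hash value seen so far for this set
                    if value < signature[i][col]:
                        signature[i][col] = value
    return signature


# Example: 5 rows (elements), 4 columns (sets), 2 hash functions
matrix = [
    [1, 0, 0, 1],
    [0, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
]
hash_functions = [lambda r: (r + 1) % 5, lambda r: (3 * r + 1) % 5]
signature = minhash_signatures(matrix, hash_functions)
for sig_row in signature:
    print(sig_row)
```

For this input the signature comes out as [1, 3, 0, 1] for the first function and [0, 2, 0, 0] for the second: a 2 x 4 matrix standing in for the 5 x 4 original.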

__Banding__

Now let's break our signature matrix into b bands of r rows each.

For a pair of sets with Jaccard similarity s, the probability that all r rows within a band have the same value is s^r.

The probability that at least one row is not the same is (1 - s^r).

The probability that all b bands have at least one row that is not the same is (1 - s^r)^b.

Therefore, the probability that at least one of the b bands has all r rows the same is 1 - (1 - s^r)^b.

This is the probability that the pair of sets becomes a candidate pair, that is, a pair that may be similar.
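A minimal sketch of banding, assuming a signature matrix stored as a list of rows. The names `candidate_pairs` and `candidate_probability` are hypothetical, and using the band's tuple of values directly as a bucket key is one simple choice among several:

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(signature, bands, rows_per_band):
    """Pairs of column indices that agree on every row of at least one band."""
    n_sets = len(signature[0])
    candidates = set()
    for band in range(bands):
        start = band * rows_per_band
        buckets = defaultdict(list)
        for col in range(n_sets):
            # the slice of this column's signature inside the band is the bucket key
            key = tuple(signature[r][col] for r in range(start, start + rows_per_band))
            buckets[key].append(col)
        for cols in buckets.values():
            candidates.update(combinations(cols, 2))
    return candidates


def candidate_probability(s, r, b):
    """Probability of becoming a candidate pair: 1 - (1 - s^r)^b."""
    return 1 - (1 - s ** r) ** b


signature = [
    [1, 1, 4],
    [2, 2, 5],
    [3, 0, 3],
    [7, 9, 7],
]
pairs = candidate_pairs(signature, bands=2, rows_per_band=2)
print(pairs)
print(candidate_probability(0.8, r=5, b=20))
```

Columns 0 and 1 agree in the first band and columns 0 and 2 agree in the second, so both pairs become candidates. With r = 5 and b = 20, a pair with similarity 0.8 is almost certain (about 0.9996) to be picked up.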

The nice thing about this is that it describes an S-shaped curve in s, whatever the values of r and b. This is good because the curve spends most of its range either close to 0 or close to 1: a pair is either clearly unlikely or clearly likely to be a candidate, with only a narrow range of similarities in between.

For b = 10, the probabilities for r = [2, 4, 6, 8] look like this (the plot is generated by the code at the end of this post):

__Distance functions__

There are many types of distance measurements. Euclidean distance is the one we're taught most in school but there are others. The rules are that for any distance function, d, that gives the distance between x and y, the following must hold:

d(x, y) >= 0

d(x, y) = 0 iff x == y

d(x, y) = d(y, x) [they're symmetric]

d(x, y) <= d(x, z) + d(z, y) [the triangle inequality]

Nothing too clever there. They're just common sense if you were navigating around town. The thing to bear in mind is that they apply to more than just the 3D space in which we live.
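As an example of a distance that has nothing to do with physical space, here is a sketch that checks the four rules for the Jaccard distance (1 minus the Jaccard similarity) on a handful of small sets. The sample sets are arbitrary and a check like this is an illustration, not a proof:

```python
from itertools import product


def jaccard_distance(a, b):
    """1 minus the Jaccard similarity; two empty sets are taken to be distance 0 apart."""
    union = a | b
    if not union:
        return 0.0
    return 1 - len(a & b) / len(union)


points = [frozenset(), frozenset("a"), frozenset("ab"), frozenset("bc"), frozenset("abc")]
for x, y, z in product(points, repeat=3):
    d_xy = jaccard_distance(x, y)
    assert d_xy >= 0                                   # non-negative
    assert (d_xy == 0) == (x == y)                     # zero iff identical
    assert d_xy == jaccard_distance(y, x)              # symmetric
    # triangle inequality (small tolerance for floating point)
    assert d_xy <= jaccard_distance(x, z) + jaccard_distance(z, y) + 1e-12
print("all four rules hold on this sample")
```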

__Family of functions__

We define a family of functions as a collection of functions that say "yes, make x and y a candidate pair," or "no, do not make x and y a candidate pair unless some other function concludes we should do so." Minhashing is one such family.

Furthermore, we define the term *(d1, d2, p1, p2)-sensitive* for such families to describe the curve they generate. We require

d1 < d2

p1 > p2

where d1 and d2 are values of a given distance function d(x, y); p1 is the minimum probability that a function from the family makes x and y a candidate pair when d(x, y) <= d1, and p2 is the maximum probability that it does so when d(x, y) >= d2.

This gives us a nice short-hand way to describe a complicated graph.

__Amplifying the effect__

It is desirable to maximize the steepness of the S-curve to reduce the number of false negatives and false positives. We can do this by building a second family of functions out of the first. We could demand that all the functions in a group agree for two points (x, y); that is an AND operator. Or we could demand that at least one of them agrees; that is an OR operator. Each changes the curve in a different way: AND pushes every probability down and OR pushes every probability up, so which we use, and in what order we combine them, depends on what we want.

This is analogous to the banding (see above) and the maths is the same, only this time instead of s we're dealing with p, the probability that a function from the family gives the same answer for the two points.
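A small worked example of the arithmetic, assuming we start from a (0.2, 0.6, 0.8, 0.4)-sensitive family and apply a 4-way AND followed by a 4-way OR. The function names and the choice of 4 for both steps are illustrative:

```python
def and_construction(p, r):
    """All r functions must agree: p becomes p^r."""
    return p ** r


def or_construction(p, b):
    """At least one of b functions must agree: p becomes 1 - (1 - p)^b."""
    return 1 - (1 - p) ** b


def and_then_or(p, r, b):
    """AND over r functions, then OR over b such groups: 1 - (1 - p^r)^b."""
    return or_construction(and_construction(p, r), b)


p1, p2 = 0.8, 0.4
new_p1 = and_then_or(p1, 4, 4)
new_p2 = and_then_or(p2, 4, 4)
print(round(new_p1, 4), round(new_p2, 4))
```

The high probability rises from 0.8 to about 0.8785 while the low one falls from 0.4 to about 0.0985, so the gap between them widens and the curve gets steeper, at the cost of evaluating 16 functions instead of 1.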

The downside to this is that we increase the demand on our CPUs. In large datasets, a small increase in CPU time per item is magnified in proportion to the amount of data we have to run the algorithm over, so it's worth bearing this in mind.

__Code__

I stole the Python code from here to generate the graphs and modified it slightly to give the code below:

    from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection on older matplotlib
    from matplotlib.collections import PolyCollection
    from matplotlib.colors import colorConverter
    import matplotlib.pyplot as plt
    import numpy as np

    # from http://matplotlib.org/mpl_toolkits/mplot3d/tutorial.html#line-plots

    fig = plt.figure()
    # fig.gca(projection='3d') no longer works on recent matplotlib versions
    ax = fig.add_subplot(111, projection='3d')


    def cc(arg):
        # translucent RGBA version of a named colour
        return colorConverter.to_rgba(arg, alpha=0.6)


    def p(s, r):
        # probability that two sets with similarity s become a candidate pair
        b = 10  # number of bands
        return 1 - ((1 - s**r) ** b)


    xs = np.arange(0, 1, 0.04)
    verts = []
    zs = [2.0, 4.0, 6.0, 8.0]  # the values of r, one curve per value

    for z in zs:
        # a list rather than map(), so it can be indexed on Python 3
        ys = [p(x, z) for x in xs]
        # pin both ends to zero so each polygon closes on the axis
        ys[0], ys[-1] = 0, 0
        verts.append(list(zip(xs, ys)))

    poly = PolyCollection(verts, facecolors=[cc('r'), cc('g'), cc('b'), cc('y')])
    poly.set_alpha(0.7)
    ax.add_collection3d(poly, zs=zs, zdir='y')

    ax.set_xlabel('s')
    ax.set_xlim3d(0, 1)
    ax.set_ylabel('r')
    ax.set_ylim3d(min(zs), max(zs))
    ax.set_zlabel('p')
    ax.set_zlim3d(0, 1)

    plt.show()