Friday, April 13, 2018

She's a model

...but is she looking good? (With apologies to Kraftwerk).

"No single algorithm performs best or worst. This is wisdom known to machine learning practitioners, but difficult to grasp for beginners in the field. There is no silver bullet and you must test a suite of algorithms on a given dataset to see what works best." (MachineLearningMastery).

This blog goes on to quote from a report that compares models and data sets: "no one ML algorithm performs best across all 165 datasets. For example, there are 9 datasets for which Multinomial NB performs as well as or better than Gradient Tree Boosting, despite being the overall worst- and best-ranked algorithms, respectively".

Another comparison is made here as a data scientist reproduces part of her PhD using modern tools. Interestingly, Naive Bayes this time is a close second to SVMs (a result echoed here when using Python).

For my part, I took the "20 Newsgroups" data set and fed it into Knime's Document Classification Example. (I like this data set as it's close to my proprietary data set). With almost no change, my results for the "subjects" part of the data was:

Decision Tree78%
Naive Bayes70%

Boosting generally reduced all models by 5-10% in accuracy. So did removing just the punctuation rather than the more sophisticated massaging the data as in the example.

Interestingly, the results on the full corpus (not just subject text) was only about half as good. This could be that Knime could not store everything in memory.

Note that this Knime example only uses the top words in the corpus as the TF-IDF matrix would be far too big to fit into memory otherwise. Here, Spark has the advantage of being able to easily process the full matrix (Naive Bayes for instance scores about 85% there). So, these results should constitute a minimum of what we can achieve in Spark.

In Spark, I just first cleared the text of all punctuation with the code in this StackOverflow suggestion, that is:

s.replaceAll("""[\p{Punct}&&[^.]]""", "")

and then ran the ML pipeline of [Tokenizer, StopWordRemover, NGram, IDF, HashingTF, Normalizer]. The results looked like:

NaiveBayes85.1%Very fast (1 minute or less)
MultilayerPerceptronClassifier80.6%Took about 30 minutes with layer sizes (262144, 100, 80, 20); 262144 is the (uncompressed) TFIDF vector sizes
RandomForestClassifier79.6%For numTrees=190, maxDepth=20 but after SVD with n=400
Logistic Regression76.1%SVD with n=400
DeepLearning4J on Spark72.9%After 227 epochs taking 6 hours on 15 executors with 1 core each
RandomForestClassifier53.7%For numTrees=190, maxDepth=30
RandomForestClassifier48%For numTrees=1000
GBTRegressor-"Note: GBTs do not yet support multiclass classification"
LinearSVC-"LinearSVC only supports binary classification."

The table is a little unfair as I spent a disproportionate amount of time tuning the models. And, as ever, YMMV (your mileage may vary). These are the results for my data, yours will probably look very different.

Interestingly, creating a hand-made ensemble of NaiveBayes, MultilayerPerceptronClassifier and RandomForestClassifier didn't improve matters. The result on these 3 models trained on the same data and voting on the test data gave an accuracy of 81.0%.

Finally, there were two algorithms that I've mentioned before that were not part of this work but I'll include them for completeness:


Ensemble Cast

So, taking five Spark models (LinearSVC, NaiveBayes, MultilayerPerceptron, RandomForestClassifier and LogisticRegression), we can take the results and using joinWith and map,  weave the DataFrames together and let them vote on which category a given subject should be in.

Unfortunately, this roll-your-own bagging did not provide significantly better results. The overall accuracy at 86.029% was a mere 0.022% better than the best stand alone model.

Wednesday, April 11, 2018

Python crib sheet #1

Some notes I've been making as I learn Python.


The file is "automatically executed by Python the first time a package or subpackage is loaded. This permits whatever package initialization you may desire. Python requires that a directory contain an file before it can be recognized as a package. This prevents directories containing miscellaneous Python code from being accidentally imported as if they defined a package.

"The second point is probably the more important. For many packages, you won’t need to put anything in the package’s file—just make sure an empty file is present." [1]

"Note that there’s no recursive importing of names with a from ... import * statement." [1]

"A two-level hierarchy should be able to effectively handle all but a few of the rest. As written in the Zen of Python, by Tim Peters, “Flat is better than nested.”" [1]


In general, single underscores are a convention, double has a meaning (name mangling).
"Names, in a class, with a leading underscore are simply to indicate to other programmers that the attribute or method is intended to be private. However, nothing special is done with the name itself.

"Any identifier of the form __spam (at least two leading underscores, at most one trailing underscore) is textually replaced with _classname__spam, where classname is the current class name with leading underscore(s) stripped...  Name mangling is intended to give classes an easy way to define “private” instance variables and methods, without having to worry about instance variables defined by derived classes, or mucking with instance variables by code outside the class. Note that the mangling rules are designed mostly to avoid accidents; it still is possible for a determined soul to access or modify a variable that is considered private." (StackOverflow)

A special example is the __getitem__ special method attribute. "A solution is to use the __getitem__ special method attribute, which you can define in any user-defined class, to enable instances of that class to respond to list access syntax and semantics." [1]


"Python offers four kinds of numbers: integers, floats, complex numbers, and Booleans.

"An integer constant is written as an integer—0, –11, +33, 123456—and has unlimited range, restricted only by the resources of your machine.

"A float can be written with a decimal point or using scientific notation: 3.14, –2E-8, 2.718281828. The precision of these values is governed by the underlying machine but is typically equal to double (64-bit) types in C.

"Complex numbers are probably of limited interest and are discussed separately later in the section.  Booleans are either True or False and behave identically to 1 and 0 except for their string
representations." [1, p40]

Pythonic manipulation of lists

For flatten, see this StackOverflow answer.

flat_list = [item for sublist in l for item in sublist]

which means:

for sublist in l:
    for item in sublist:

where l is the structure we want to flatten.

Note that the += you'd find in Scala is not the same in Python. From StackOverflow.
+ is closer in meaning to extend than to append... the methods work in-place: extend is actually like += - in fact, it has exactly the same behavior as += except that it can accept any iterable, while += can only take another list.
Given a value x that we want i times, a repeating list can be created with [x] * i.


"The pass statement does nothing particular but can act as a placeholder" (StackOverflow). This makes the code compile and appears to act like a placeholder like ??? in Scala.

[1] The Quick Python Book, Second Edition

Monday, April 9, 2018

Cross validation in Spark

What it is

"k-fold cross-validation relies on quarantining subsets of the training data during the learning process... k-fold CV begins by randomly splitting the data into k disjoint subsets, called folds (typical choices for k are 5, 10, or 20). For each fold, a model is trained on all the data except the data from that fold and is subsequently used to generate predictions for the data from that fold.  After all k-folds are cycled through, the predictions for each fold are aggregated and compared to the true target variable to assess accuracy" [1]


Spark's  "CrossValidator begins by splitting the dataset into a set of folds which are used as separate training and test datasets. E.g., with k=3 folds, CrossValidator will generate 3 (training, test) dataset pairs, each of which uses 2/3 of the data for training and 1/3 for testing" (from the documentation).


val nb        = new NaiveBayes("nb")
val pipeline  = new Pipeline().setStages(Array(tokenizer, remover, ngram, hashingTF, idf, nb))
val evaluator = new MulticlassClassificationEvaluator().setLabelCol(LABEL).setPredictionCol("prediciton").setMetricName("accuracy")
val paramGrid = new ParamGridBuilder().addGrid(nb.smoothing, Array(100.0, 10.0, 1.0, 0.1, 0.01, 0.001)).addGrid(idf.minDocFreq, Array(1, 2, 4, 8, 16, 32)).build()
val cv        = new CrossValidator().setEstimator(pipeline).setEvaluator(evaluator).setEstimatorParamMaps(paramGrid).setNumFolds(5)
val fitted    =
val metrics   = fitted.avgMetrics

where tokenizer, remover, ngram, hashingTF and idf are instances of Spark's Tokenizer, StopWordRemover, NGram, HashingTF and IDF .

Running this on the Subject text of the 20 Newsgroup data set yielded the optimized hyperparameters of 1 document for a word to be significant and a smoothing value of 0.1 for regularization leading to 77.1% accuracy.

Running this on all the text of the 20 Newsgroup data yielded values of 10.0 for smoothing and 4 for minDocFreq giving an optimized accuracy of nearly 88%.

Those results in tabular form:

subject only0.1177.1%
all text10.0487.9%

Interestingly, the range over the results for all smoothing hyperparameters was typically less than 6% but the range of results over all minDocFreq was as much as 60%. For this data and this model at least the rather unexceptional conclusion is that you can increase accuracy more from improving feature engineering than model tuning.

(Note: NGram.n was set to 2. After some more CV, I found it was best leaving it as 1. Then, the "subject only" accuracy was 85.1% and the "all text" accuracy was 89.4%).


Happily, Spark has parallel cross validation as of 2.3.0. See TrainValidationSplit.setParallelism(...) - it has a @Since("2.3.0"). This should improve performance. Using 10 executors with 30gb of memory and 2 cores each, CV on the full data set could take 20 minutes or so.


For logs on what TrainValidationSplit is doing, run:

scala> sc.setLogLevel("DEBUG")

This can be irritating so change it back to ERROR when you're done.

[1] Real World Machine Learning (sample here).

Friday, April 6, 2018

The Trouble with Hashing

If you use Spark's HashingTF, you avoid needing an external store of key/values that complicates the code but there are downsides. Collisions aside, the matrix it generates is overly wide and sparse. This can cause problems for some machine learning algorithms.

There are a number of clever mathematical dimensionality reduction techniques but since HashingTF makes the matrix unnecessarily sparse, you can compact it trivially with:

import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.{udf, col}

def toIndices(i: Int)(r: Row): Array[Int] = r.getAs[SparseVector](0).indices

def compacting(replacements: Map[Int, Int])(v: SparseVector): SparseVector = { 
  val indexVals  =
  val newIndices = { case (i, x) => (replacements(i), x) }.sortBy(_._1)
  new SparseVector(replacements.size,,

def compactingUdf(replacements: Map[Int, Int]): UserDefinedFunction = {
  val compactingFn = compacting(replacements) _

(You can find this code in my GitHub repository).

Now, you can compact your matrix. First, collect the non-zero indices:

val toIndicesFn: Row => TraversableOnce[Int] = toIndices(0)
val allIndices =

and then replace the old indices

val cUdf        = compactingUdf(allIndices)
val compactedDF = df.withColumn(YOUR_COLUMN_NAME, cUdf(col(YOUR_COLUMN_NAME)))

Now your matrix is considerably more narrow and less sparse.

For example, using this technique decreases the time taken by Spark's RandomForestClassifier from over an hour to a few minutes. The results were still comparable in my example (23% vs 25%) but the sparse matrix actually reduces the number of nodes from 1000 to 821. With the compacted matrix, I can increase the number of trees to nearly 3000 before seeing the same warning.

Bigger is not necessarily better in Neural Nets

Using again just the subjects from the "20 Newsgroups" public domain data, we can predict which message belongs to which newsgroup using a TensorFlow neural net. (Code for this blog post can be found here).

But neural networks are so flexible, how do I know which architecture to use? So, naively, I chose a modest size network with 2 hidden layers of size 100 and 20. My accuracy looked like this:
2 Layer ANN. Red dots are training accuracy, blue are testing and the blue line is accuracy for all data.
Pretty bad, roughly the monkey-score.

Here's some ML training advice from Stanford University.
Typical learning curve for high variance [has a] large gap between training and test error. 
Typical learning curve for high bias [has a] small gap between training and test error [and] even the training error is unacceptably high.
So, it looks like my neural net is exhibiting high bias since there is only 0.3% difference between training and test accuracy and they are both unacceptably low (accuracy = 1 - error).

The same source recommends:

  • What fixes high variance: "Try getting more training examples [and] try a smaller set of features"
  • What fixes high bias: "try a larger set of features"

But before I started playing with the features, I read this: "From the description of your problem it seems that a feedforward network with just a single hidden layer in should be able to do the trick" (from this answer on StackOverflow with some more useful hints at improving things). And indeed it worked:
ANN with 1 hidden layer. Red dots are training accuracy, blue are testing and the blue line is accuracy for all data.
Much better! The training accuracy is plateaus at about 98%, the test accuracy about 75% and when running the model on all data it achieves an accuracy of about 87%.

Why the improvement? "Essentially, your first hidden layers learn much slower than last. Normally your network should learn the correct weights. Here, however, most likely the weights in the first layer change very little and the error propagates to the next layers. It's so large that the subsequent layers can't possibly correct for it. Check the weights." (from StackOverflow)

But now my model appears to be exhibiting high variance according to the advice above from Stanford. Further improvement is a future exercise.

The Loss Values

It's interesting to note that the value from the loss function drops precipitously even in the pathological case. A cost function might calculate a better score for predictions that are closer to the true value but if they're still wrong, the accuracy will not change. For instance, if you increase your exam grade from an F to a D, a loss function might indeed reflect that you got better but if the pass grade is C, you still failed. Ergo, a better loss value with exactly the same accuracy. See this StackOverflow answer.

"Lower cost function error [does] not means better accuracy. The error of the cost function represents how well your model is learning/able to learn with respect to your training examples... It can show very low learning curves error but when you actually testing the results (Accuracy) you are getting wrong detections." (from StackOverflow)

Friday, March 23, 2018

Hands on with Gradient Boosting

Boosting is a meta-algorithm, that is "an algorithm that exists to manipulate some other algorithm".

It goes like this (from [1]):

  1. Draw a random subset of training samples d1 without replacement from the training set D to train a weak learner C1.
  2. Draw second random training sample d2 without replacement from D and 50% of the samples that were misclassified by C1. Give this to C2.
  3. Find the training sample d3 from D for which C1 and C2 disagree. Give this to C3.
  4. Find with majority voting the answers from C1, Cand C3.
Bagging vs. Boosting

"Bagging is a simple ensembling technique in which we build many independent predictors/models/learners and combine them using some model averaging techniques. (e.g. weighted average, majority vote or normal average)...

"Boosting is an ensemble technique in which the predictors are not made independently, but sequentially." (from Prince Grover)

So, as a mnemonic, think booSting is Sequential and bAgging is pArrallel.

Any time, any place anywhere

Because boosting is a meta algorithm, it can be used on lots of classifiers in Knime.

For instance, I used Knime's Palladian nodes to classify the "20 Newsgroups" data set. This algorithm extensively uses n-grams. With a min/max n-gram size of 3/10, it gave an overall accuracy of 88.3% in classification.

Although Palladian's classifier nodes are being used, they could be any classifier like Naive Bayes
So, I boosted the results 10 times... And only got 85.9%. Boosting 100 times gave 78.5%. Hmm, what gives?

This StackOverflow post gives a hint. "In general, boosting error can increase with the number of iterations, specifically when the data is noisy (e.g. mislabeled cases)... Basically, boosting can 'focus' on correctly predicting cases that contain misinformation, and in the process, deteriorate the average performance on other cases that are more substantive.”

The post also has an interesting chat between data scientists about boosting and overfitting. “I think it's interesting that you've rarely seen gradient boosting overfit. Over the four or so years that I've been using it, I've seen the opposite -- too many trees leads to overfitting”.

Indeed, Raschka writes that boosting algorithms are known for the "tendency to overfit the training data".

Features! Features! Features!

What seemed to have the most impact on Palladian's algorithm was the choice of min/max n-grams. A value of 15/15 gave only 67% accuracy comparing poorly to the 88% of 3/10.

[1] Python Machine Learning, Sebastian Raschka

Tuesday, March 20, 2018

Fighting Orcs

Orc has an amazing ability to compress data. A Parquet file of 1.1tb shrank to 15.7gb when saved as Orc although note that the Orc file is compressed. "The default compression format for ORC is set to snappy" (StackOverflow). This indeed seems to be the case as you can see snappy in the name of files in the HDFS directory.

But how fast is querying it?

Well, there was an initial pause when first reading the file with:

val orcFile =

It seemed that the driver was doing all the work and all 60 executors I had asked my shell to employ were doing nothing. Using jstack showed that the driver was trying to infer the schema (see o.a.s.s.e.d.DataSource.getOrInferFileFormatSchema). Adding the schema would mitigate this.

Now, let's group by a field called dax.


(There are 36 different values for dax over nearly 890 million rows with 292 columns).

This groupBy took about 50s when the underlying format was Parquet and 37s when it was Orc (not including the perfectly avoidable large read time) so not bad. How does Orc do it? It has 'indexes'. Sort of. "The term 'index' is rather inappropriate. Basically it's just min/max information persisted in the stripe footer at write time, then used at read time for skipping all stripes that are clearly not meeting the WHERE requirements, drastically reducing I/O in some cases" (StackOverflow).

In fact, you can see this metadata using the Hive command line:

hive --orcfiledump HDFS_DIR_NAME

Can we make it even faster? I thought sorting and repartitioning might make a difference so I executed this:


Note, cacheing the original Dataframe would cause OOMEs for reasons unknown to me at the moment. And without repartitioning, the third (of four) stages would take so long I had to kill the job. Repartitioning helped but I did notice a huge amount of shuffle (100s of gbs when the original file itself was only 16gb).

Also, I had to keep increasing the drivers' memory until I stopped getting OOMEs in the last stage. For reasons that I don't currently understand, the memory had to increase to 20gb before the job finished. Even then, it took 2.5 hours on 30 executors with 5 cores each and the resulting directory took 146gb of disk space.

However, the speed of a query was staggeringly fast. This:

partitionedOrc.where('dax === 201501).groupBy('dax).agg(count('dax)).show(10)

took a mere 2s. By comparison, the same query on the unpartitioned and unsorted Orc file took about 32s.

This speed is comparable to Apache Impala which "can be awesome for small ad-hoc queries" (StackOverflow).

Note that Impala, being "highly memory intensive (MPP), it is not a good fit for tasks that require heavy data operations like joins etc., as you just can't fit everything into the memory. This is where Hive is a better fit" (ibid).

Note that this speed increase appears to be available only when querying purely on that which is partitioned. For instance, let's take another field, mandant, and run a slightly modified version of the query above:

partitionedOrc.where('mandant === 201501).groupBy('dax).agg(count('dax)).show(10)

This takes 32s as well.