Monday, March 5, 2018

DeepLearning4J in a Spark cluster


Running DL4J on Spark was not too hard, but there were some non-obvious gotchas.

Firstly, my tests ran on a Windows machine, but I needed to add dependencies to get the code running on our Linux cluster, where I was getting "no openblas in java.library.path" errors. This link helped me, and my dependencies now look like this:

    <dependency> <!-- DL4J's Spark integration (distributed training) -->
      <groupId>org.deeplearning4j</groupId>
      <artifactId>dl4j-spark_${scala.compat.version}</artifactId>
      <version>${deeplearning4j.version}_spark_2</version>
    </dependency>
    <dependency> <!-- core DL4J: networks, layers, configuration -->
      <groupId>org.deeplearning4j</groupId>
      <artifactId>deeplearning4j-core</artifactId>
      <version>${deeplearning4j.version}</version>
    </dependency>
    <dependency> <!-- remember to run spark with conf "spark.kryo.registrator=org.nd4j.Nd4jRegistrator" -->
      <groupId>org.nd4j</groupId>
      <artifactId>nd4j-kryo_${scala.compat.version}</artifactId>
      <version>${nd4j.version}</version>
    </dependency>
    <dependency> <!-- bundles the native binaries (including OpenBLAS) for all supported platforms -->
      <groupId>org.nd4j</groupId>
      <artifactId>nd4j-native-platform</artifactId>
      <version>${nd4j.version}</version>
    </dependency>
    <dependency> <!-- the CPU backend for ND4J -->
      <groupId>org.nd4j</groupId>
      <artifactId>nd4j-native</artifactId>
      <version>${nd4j.version}</version>
    </dependency>

with

    <deeplearning4j.version>0.9.1</deeplearning4j.version>
    <nd4j.version>0.9.1</nd4j.version>

Secondly, you need to add --conf "spark.kryo.registrator=org.nd4j.Nd4jRegistrator" to the CLI when you start a Spark shell.
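
For example (if your cluster doesn't already default to Kryo, you'll also need to set spark.serializer, since the registrator only takes effect under the Kryo serializer):

    spark-shell \
      --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
      --conf "spark.kryo.registrator=org.nd4j.Nd4jRegistrator"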

Thirdly, I randomly grabbed a recurrent neural network from here just to test my code. I found it immensely memory-hungry: I needed to give my driver 20 GB and my executors 30 GB with only one core per executor to avoid occasional errors in the Spark stages.
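
For reference, those settings correspond to spark-submit flags like these (the class and JAR names are placeholders for your own application):

    spark-submit \
      --driver-memory 20g \
      --executor-memory 30g \
      --executor-cores 1 \
      --conf "spark.kryo.registrator=org.nd4j.Nd4jRegistrator" \
      --class com.example.TrainRnn \
      my-dl4j-app.jar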

Even then, I couldn't measure the accuracy of my neural net because of this issue. Apparently it's fixed in the SNAPSHOT, but then there are issues with the platform JARs not being up to date. I asked about this on the DeepLearning4J Gitter channel, archived here. (The team also helpfully told me to use org.deeplearning4j.eval.Evaluation with an argument to avoid the bug.)
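
A sketch of that workaround, assuming a trained MultiLayerNetwork and a test-set DataSetIterator (the method and variable names here are mine, not the team's):

    import org.deeplearning4j.eval.Evaluation
    import org.deeplearning4j.nn.multilayer.MultiLayerNetwork
    import org.nd4j.linalg.dataset.api.iterator.DataSetIterator

    def evaluate(net: MultiLayerNetwork, testData: DataSetIterator, numClasses: Int): Evaluation = {
      // passing the number of classes explicitly is the suggested workaround,
      // rather than relying on the no-arg Evaluation() constructor
      val eval = new Evaluation(numClasses)
      while (testData.hasNext) {
        val ds = testData.next()
        val output = net.output(ds.getFeatureMatrix) // network predictions
        eval.eval(ds.getLabels, output)              // compare to the true labels
      }
      eval
    }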

Finally, I started getting results, but not before one last gotcha, this time in Spark: use RDD.sample rather than take to get hold of your test data, since you want a nice distribution over all categories. With this, I started getting more sensible answers when evaluating my results.
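
In code, the difference is roughly this (the RDD, fraction, and seed are all illustrative):

    import org.apache.spark.rdd.RDD
    import org.nd4j.linalg.dataset.DataSet

    def testSet(dataRdd: RDD[DataSet]): RDD[DataSet] =
      // sample draws across all partitions, roughly preserving the category
      // mix; take(n) just grabs the first n elements, which can easily all
      // belong to one category
      dataRdd.sample(withReplacement = false, fraction = 0.1, seed = 42L)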
