Kotlin for Apache Spark: One Step Closer to Your Production Cluster

We have released Preview 2 of Kotlin for Apache Spark. First of all, we’d like to thank the community for providing us with all their feedback and even some pull requests! Now let’s take a look at what we have implemented since Preview 1.

Scala 2.11 and Spark 2.4.1+ support

The main change in this preview is the introduction of Spark 2.4 and Scala 2.11 support. This means that you can now run jobs written in Kotlin in your production environment.

The syntax remains the same as in the Apache Spark 3.0-compatible version, but the installation procedure differs a bit. You can now use sbt to add the correct dependency, for example:

libraryDependencies += "org.jetbrains.kotlinx.spark" %% "kotlin-spark-api-2.4" % "1.0.0-preview2"

Of course, for other build systems, such as Maven, we would use the standard declaration:

<dependency>
  <groupId>org.jetbrains.kotlinx.spark</groupId>
  <artifactId>kotlin-spark-api-2.4_2.11</artifactId>
  <version>1.0.0-preview2</version>
</dependency>

And for Gradle we would use:

implementation 'org.jetbrains.kotlinx.spark:kotlin-spark-api-2.4_2.11:1.0.0-preview2'

Spark 3 currently supports only Scala 2.12, so there is no need to specify the Scala version for the Spark 3 artifact.

You can read more about dependencies here.

Support for the custom SparkSessionBuilder

Imagine you have a company-wide SparkSession.Builder that is set up to work with your YARN/Mesos/etc. cluster. You don’t want to recreate it in each job, and the “withSpark” function doesn’t give you enough flexibility. Before Preview 2, you had only one option: rewriting the whole builder from Scala to Kotlin. Now you have another one, because the “withSpark” function accepts a SparkSession.Builder as an argument:

val builder = // obtain your builder here
withSpark(builder, logLevel = DEBUG) {
  // your Spark session is initialized with the correct config here
}
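
As a rough, self-contained sketch of how this fits together (the builder settings below are placeholders, and we assume the SparkLogLevel enum and the dsOf helper from the API, so details may differ slightly in your setup):

import org.apache.spark.sql.SparkSession
import org.jetbrains.kotlinx.spark.api.*

fun main() {
    // A stand-in for your company-wide builder; these settings are placeholders
    val builder = SparkSession.builder()
        .appName("my-kotlin-job")
        .master("local[*]") // in production this would point at your YARN/Mesos cluster
        .config("spark.some.company.setting", "value")

    // The session inside the block is created from this builder's configuration
    withSpark(builder, logLevel = SparkLogLevel.DEBUG) {
        dsOf(1, 2, 3).show()
    }
}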

Broadcast variable support

Sometimes we need to share data between executor nodes, for example, to provide them with dictionary-like lookup data without shipping a copy to each executor individually. Apache Spark lets you do this with broadcast variables. You could previously achieve this using the “encoder” function, but since we are aiming to provide an API experience that is as consistent as possible with the Scala API, we’ve added explicit support for broadcast variables. For example:

// Broadcast data should implement Serializable
data class SomeClass(val a: IntArray, val b: Int) : Serializable
// ... some other code snipped ...
val receivedBroadcast = broadcast.value // the broadcast value is available here

We’d like to thank Jolan Rensen for the contribution.
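
For a fuller picture, here is a rough sketch of creating and reading a broadcast variable. It assumes the broadcast helper is exposed on the session object as spark.broadcast(...) inside the withSpark block, so the exact entry point may differ:

import java.io.Serializable
import org.jetbrains.kotlinx.spark.api.*

// Broadcast data should implement Serializable
data class Lookup(val factors: IntArray, val offset: Int) : Serializable

fun main() = withSpark {
    // Create the broadcast variable once on the driver
    // (assumption: the helper lives on the session as spark.broadcast)
    val lookup = spark.broadcast(Lookup(intArrayOf(10, 20, 30), 5))

    // Read the broadcast value on the executors via .value
    dsOf(0, 1, 2)
        .map { lookup.value.factors[it] + lookup.value.offset }
        .show()
}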

Primitive array support for Spark 3

If you use arrays of primitives in your data pipeline, you may be glad to know that they now work in Kotlin for Apache Spark. Arrays of data (or Java bean) classes were supported in Preview 1, but support for primitive arrays has only just been added in Preview 2. Please note that this support works correctly only with the dedicated Kotlin primitive array types, such as IntArray, BooleanArray, and LongArray. In other cases, we suggest using lists or other collection types.
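
As an illustration, with the Spark 3 variant of the API (the data class and values here are invented; withSpark, dsOf, and map are the helpers from the API):

import org.jetbrains.kotlinx.spark.api.*

// A data class with a Kotlin primitive array field (IntArray)
data class Measurement(val id: Int, val samples: IntArray)

fun main() = withSpark {
    dsOf(
        Measurement(1, intArrayOf(1, 2, 3)),
        Measurement(2, intArrayOf(4, 5))
    )
        .map { it.samples.sum() } // primitive array fields are now encoded correctly
        .show()
}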

Future work

Now that Spark 2.4 is supported, our targets for Preview 3 will be mainly to extend the Column API and add more operations on columns. Once Apache Spark with Scala 2.13 support is available, we will implement support for Scala 2.13, as well.

The features introduced in Preview 2 make now a great time to give it a try! The official repository and the quick start guide are excellent places to start.