Auto Added by WPeMatico

Introducing Kotlin for Apache Spark Preview

Apache Spark is an open-source unified analytics engine for large-scale distributed data processing. Over the last few years, it has become one of the most popular tools used for processing large amounts of data. It covers a wide range of tasks – from data batch processing and simple ETL (Extract/Transform/Load) to streaming and machine learning.

Due to Kotlin’s interoperability with Java, Kotlin developers can already work with Apache Spark via Java API. This way, however, they cannot use Kotlin to its full potential, and the general experience is far from smooth.

Today, we are happy to share the first preview of the Kotlin API for Apache Spark. This project adds a missing layer of compatibility between Kotlin and Apache Spark. It allows you to write idiomatic Kotlin code using familiar language features such as data classes and lambda expressions.

Kotlin for Apache Spark also extends the existing APIs with a few nice features.

withSpark and withCached functions

withSpark

is a simple and elegant way to work with SparkSession that will automatically take care of calling

spark.stop()

at the end of the block for you.
You can pass parameters to it that may be required to run Spark, such as master location, log level, or app name. It also comes with a convenient set of defaults for running Spark locally.

Here’s a classic example of counting occurrences of letters in lines:

val logFile = "a/path/to/logFile.txt" 
withSpark(master = "yarn", logLevel = SparkLogLevel.DEBUG){
   spark.read().textFile(logFile).withCached {
       val numAs = filter { it.contains("a") }.count()
       val numBs = filter { it.contains("b") }.count()
       println("Lines with a: $numAs, lines with b: $numBs")
   }
}

Another useful function in the example above is

withCached

. In other APIs, if you want to fork computations into several paths, but compute things only once, you would call the ‘cache’ method. However, this quickly becomes difficult to track and you have to remember to unpersist the cached data. Otherwise, you risk taking up more memory than intended or even breaking things altogether.

withCached

takes care of tracking and unpersisting for you.

Null safety

Kotlin for Spark adds

leftJoin

,

rightJoin

, and other aliases to the existing methods, however, these are null safe by design.


fun main() {

    data class Coordinate(val lon: Double, val lat: Double)
    data class City(val name: String, val coordinate: Coordinate)
    data class CityPopulation(val city: String, val population: Long)

    withSpark(appName = "Find biggest cities to visit") {
        val citiesWithCoordinates = dsOf(
                City("Moscow", Coordinate(37.6155600, 55.7522200)),
                // ...
        )

        val populations = dsOf(
                CityPopulation("Moscow", 11_503_501L),
                // ...
        )
        citiesWithCoordinates.rightJoin(populations, citiesWithCoordinates.col("name") `==` populations.col("city"))
                .filter { (_, citiesPopulation) ->
                    citiesPopulation.population > 15_000_000L
                }
                .map { (city, _) ->
                    // A city may potentially be null in this right join!!!
                    city?.coordinate
                }
                .filterNotNull()
                .show()
    }
}

Note the

city?.coordinate

line in the example above. A city may potentially be null in this right join. This would’ve caused a

NullPointerException

in other JVM Spark APIs, and it would’ve been rather difficult to debug the source of the problem.
Kotlin for Apache Spark takes care of null safety for you and you can conveniently filter out null results.

What’s supported

This initial version of Kotlin for Apache Spark supports Apache Spark 3.0 with the core compiled against Scala 2.12.

The API covers all the methods needed for creating self-contained Spark applications best suited for batch ETL.

Getting started with Kotlin for Apache Spark

To help you quickly get started with Kotlin for Apache Spark, we have prepared a Quick Start Guide that will help you set up the environment, correctly define dependencies for your project, and run your first self-contained Spark application written in Kotlin.

What’s next

We understand that it takes a while to upgrade any existing framework to a newer version, and Spark is no exception. That is why in the next update we are going to add support for the earlier Spark versions: 2.4.2 – 2.4.6.

We are also working on the Kotlin Spark shell so that you can enjoy working with your data in an interactive manner, and perform exploratory data analysis with it.

Currently, Spark Streaming and Spark MLlib are not covered by this API, but we will be closely listening to your feedback and will address it in our roadmap accordingly.

In the future, we hope to see Kotlin join the official Apache Spark project as a first-class citizen. We believe that it can add value both for Kotlin, and for the Spark community. That is why we have opened a Spark Project Improvement Proposal: Kotlin support for Apache Spark. We encourage you to voice your opinions and join the discussion.

Go ahead and try Kotlin for Apache Spark and let us know what you think!

Continue Reading Introducing Kotlin for Apache Spark Preview

End of content

No more pages to load