Kotlin API for Apache Spark 1.0 Released

The Kotlin API for Apache Spark is now widely available. This is the first stable release of the API that we consider to be feature-complete with respect to the user experience and compatibility with core Spark APIs.

Get on Maven Central

Let’s take a look at the new features this release brings to the API.

Typed select and sort

The Scala API has a typed select method that returns Datasets of Tuples. Sometimes using them can be more idiomatic or convenient than using the map function. Here’s what the syntax for this method looks like:

case class TestData(id: Long, name: String, url: String)
// ds is of type Dataset[TestData]
val result: Dataset[Tuple2[String, Long]] = 
        ds.select($"name".as[String], $"id".as[Long])

Sometimes obtaining just a tuple is really convenient, but this method has a drawback: you have to select each column by name and explicitly provide its type. This can lead to errors that are hard to track down in long pipelines.

We’re trying to address this issue at least partially in our extension to the Scala API. Consider the following Kotlin code:

data class TestData(val id: Long, val name: String, val url: String)
// ds is of type Dataset<TestData>
val result: Dataset<Arity2<String, Long>> = 
        ds.selectTyped(TestData::name, TestData::id)

The result is the same, but the call is entirely type-safe: we don’t use any strings or casts, and both the column names and their types are obtained via reflection.

We have also added a similarly reflective syntax to the sort function.
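
Here is a minimal sketch of the reflective sort, assuming it accepts the same property references as selectTyped; ds is the Dataset<TestData> from the example above:

val sorted: Dataset<TestData> = ds.sort(TestData::id, TestData::name)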

In Scala, the typed select supports arities of up to five, and we decided to stay as consistent with the Scala API as possible. We also think that using tuples with arities above five is a sign that something is going wrong: perhaps it would be better to extract a new domain object or, conversely, to work with untyped datasets.

More column functions

The Scala API is very rich in functions that can be called on columns. We cannot make our API identical to it because of the limitations of Kotlin: for example, overriding class members with extensions is forbidden, and the Dataset class is not extensible. But we can at least use infix functions and backticked names to implement operator-like functions.

Here are the operator-like functions that we currently support:

  • ==
  • !=
  • eq / `===`
  • neq / `=!=`
  • -col(...)
  • !col(...)
  • gt
  • lt
  • geq
  • leq
  • or
  • and / `&&`
  • +
  • -
  • *
  • /
  • %

Luckily, very few of these functions require backticks, and those that do can be autocompleted, so you don’t have to type the backticks yourself.
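
As a hedged illustration, here is how a few of these infix helpers might be combined in a filter; ds is the Dataset<TestData> from the earlier examples, and the literal values are purely illustrative:

// Filter rows using the infix column helpers instead of Scala's symbolic operators
val filtered: Dataset<TestData> = ds.filter(
        (ds.col("id") gt 100L) and (ds.col("name") neq "unknown")
)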

More KeyValueGroupedDataset wrapper functions

We initially designed the API so that anyone could call any function that requires an Encoder simply by using the magic encoder() function, which generates everything automagically. This gave our users some flexibility, and it also allowed us not to implement all the functions that the Dataset API offers. But we would ultimately like to provide our users with the best developer experience possible. This is why we’ve implemented the necessary wrappers over KeyValueGroupedDataset and added support for the following functions:

  • cogroup
  • flatMapGroupsWithState
  • mapGroupsWithState
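
To give a flavor of how these wrappers are used, here is a hedged sketch that groups a dataset by key and aggregates each group. It uses groupByKey and mapGroups rather than the stateful variants, the wrapper signatures are assumed to mirror the underlying Spark API, and no hand-written encoders are involved:

// Count how many rows (e.g. urls) each name has
val urlsPerName: Dataset<Pair<String, Int>> = ds
        .groupByKey { it.name }
        .mapGroups { name, rows -> name to rows.asSequence().count() }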

Support for Scala TupleN classes

Several functions in the Spark API, such as select and joinWith, return Datasets of Tuples. Before this release, users had to provide an encoder for tuples manually:

// An explicit tuple encoder has to be created and passed to map
val encoder = Encoders.tuple(Encoders.STRING(), Encoders.INT())
ds
    .select(ds.col("a").`as`<String>(), ds.col("b").`as`<Int>())
    .map({ Tuple2(it._1(), it._2() + 1) }, encoder)

The more we work with tuples, the more encoders we need, which adds verbosity and forces us to come up with ever more names for new encoders.

After this change, the code becomes as simple as any ordinary Kotlin code:

ds
    .select(ds.col("a").`as`<String>(), ds.col("b").`as`<Int>())
    .map { Tuple2(it._1(), it._2() + 1) } // no explicit encoder is needed here

You no longer need to provide explicit encoders or pass lambdas inside argument lists.
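
The same holds for joinWith, which also produces a Dataset of Tuple2 values. Here is a hedged sketch in which ds2 is assumed to be another Dataset<TestData>; thanks to the built-in Tuple encoders, the joined result can be processed further without naming a single encoder:

// Join the two datasets on id and keep working with the resulting tuples directly
val joined: Dataset<Tuple2<TestData, TestData>> =
        ds.joinWith(ds2, ds.col("id") eq ds2.col("id"))
val pairedNames: Dataset<Pair<String, String>> =
        joined.map { it._1().name to it._2().name }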

Support for date and time types

Working with dates and times is an important part of many data engineering workflows. For Spark 3.0, we had default encoders registered for Date and Timestamp, but inside data structures we supported only LocalDate and Instant, which is obviously not enough. We now have full support for LocalDate, Date, Timestamp, and Instant both as top-level entities of dataframes and as fields inside of structures.

We have also added support for Date and Timestamp as fields inside of structures for Spark 2.
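
As a hedged, Spark 3 sketch of what this enables, the data class below mixes Instant and LocalDate fields. The Visit class, its field names, and the sample rows are purely illustrative, and the snippet assumes the API’s withSpark and dsOf helpers are available via the usual import:

import org.jetbrains.kotlinx.spark.api.*
import java.time.Instant
import java.time.LocalDate

// Date/time values both at the top level of the dataset and as fields of a data class
data class Visit(val userId: Long, val visitedAt: Instant, val day: LocalDate)

fun main() {
    withSpark {
        dsOf(
            Visit(1L, Instant.now(), LocalDate.now()),
            Visit(2L, Instant.now(), LocalDate.now().minusDays(1))
        ).show()
    }
}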

Support for maps encoded as tuples

There is a well-known practice of encoding maps as tuples. For example, rather than storing an entity’s ID and name in a map, it is fairly common to store them as two columns, in a structure like Dataset<Pair<Long, String>> (which is how relational databases usually work).

We are aware of this, and we’ve decided to add support for working with such datasets in the same way you work with maps. We have added the functions takeKeys and takeValues to Dataset<Tuple2<T1, T2>>, Dataset<Pair<T1, T2>>, and Dataset<Arity2<T1, T2>>.
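
A hedged sketch of these helpers in action, with an illustrative dataset (dsOf is assumed to be in scope, e.g. inside a withSpark block):

// A two-column dataset treated like a map
val idToName: Dataset<Pair<Long, String>> = dsOf(1L to "first", 2L to "second")
val keys: Dataset<Long> = idToName.takeKeys()
val values: Dataset<String> = idToName.takeValues()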

Conclusion

We want to say a huge “thank you” to Jolan Rensen, who helped us tremendously by offering feedback, assisting with the implementation of features, and fixing bugs in this release. He worked with our project while writing his thesis, and we’re happy that we can help him with his brilliant work. If you want to read more about Jolan, please visit his site.

If you want to read more about the details of the new release, please check out the changelog.

As usual, the latest release is available on Maven Central, and we would love to get your feedback.

Big Data Tools EAP Is Now Also Available for DataGrip and PyCharm

At the end of last year, we announced a preview of the IntelliJ IDEA Ultimate plugin that integrated Apache Zeppelin notebooks into the IDE. At the same time we shared our roadmap, in which we promised to support more tools for working with Big Data. Since then, the plugin team has been working hard and has extended the plugin with support for Apache Spark, Apache Hadoop’s HDFS, AWS S3, Google Cloud Storage, and Parquet files.

Because the plugin originally started with Scala support in Zeppelin notebooks, it was reasonable for it to be available only for IntelliJ IDEA Ultimate. Now that the plugin supports a much wider set of scenarios and tools, the time has come to make it available for other IDEs too. With that in mind, we are excited to announce that Big Data Tools is now also available for DataGrip and PyCharm Professional.

Why DataGrip and PyCharm? Big Data Tools is one of the first JetBrains plugins that aims to solve problems involving both code and data. Since it offers tools for working with data, it’s logical to make it available to DataGrip users, extending their capabilities when it comes to distributed file storage systems and columnar file formats. At the same time, PyCharm users who work with PySpark, or with data in general, will benefit from having the plugin available in their IDE.

It’s important to highlight that Big Data Tools is still in the EAP stage and has some limitations. The most notable one, for now, is that the version of the plugin for PyCharm and DataGrip offers all the features available in IntelliJ IDEA except Zeppelin notebooks. Adding Zeppelin notebook support is on our roadmap, and we hope to deliver it soon.

The current feature set includes:

  • A file browser for distributed file storage systems, such as AWS S3, HDFS, and GCS (support for other cloud storage, e.g. Microsoft Azure, is coming soon). With this browser, you can navigate folders, preview files, and manage them: create, copy, rename, delete, upload, and download.

  • A viewer for columnar file formats, such as Parquet (support for other formats, e.g. Avro and ORC, is coming soon too).

  • A monitoring console for Spark clusters. With this console, you can browse cluster nodes, Spark jobs, their stages, and tasks.

Please note that the plugin is currently available for IDEs with version numbers 2020.1 or higher.

Additional information on the plugin can be found in the plugin repository.

Documentation for the plugin is now available for both DataGrip and PyCharm.

The easiest way to install the plugin is to open the IDE’s Plugin settings, click Marketplace, search for “Big Data Tools”, install the plugin, and then restart the IDE.

Feel free to try the plugin, share your feedback, and spread the word!

The JetBrains team
The Drive to Develop

Big Data Tools EAP 6: Google Cloud Storage, Proxy, Kerberos, and Parquet Improvements

It’s been a while since our last update, but today we’re excited to give you a new EAP build. Originally, we planned to work exclusively on bug fixes and stability improvements in this build. However, we couldn’t resist adding a completely new feature that has been on our roadmap for some time: integration with Google Cloud Storage.

Using the Google Cloud Storage integration is similar to working with AWS S3. Once you’ve added a Google Cloud Storage bucket configuration in the Big Data Tools Connections settings, you’ll see the bucket and its contents in the Big Data Tools tool window.

Here’s what the configuration page looks like:

[Screenshot: the Google Cloud Storage connection configuration page]

You have to specify the path to your credentials JSON file, choose a bucket, and optionally set a prefix if you’d like to work with a specific subfolder.

Once the bucket is configured, you’ll see the files and folders hierarchically in the Big Data Tools tool window:

[Screenshot: GCS files and folders in the Big Data Tools tool window]

The context menu provides the same actions that are available for AWS S3. You can copy, move, and rename your files and folders, download them to your local disk, and open them for a preview. For a preview, the IDE downloads only a chunk of the file, which is very handy when you’d like to preview a large file, e.g. a Parquet or CSV file.

Speaking of Parquet support, we’ve made several improvements. First, we’ve fixed some edge cases in which it didn’t work. Second, we’ve reworked the table header so that it properly displays column names and lets you sort the rows by any column:

[Screenshot: the reworked Parquet viewer with sortable columns]

Last but not least, we’ve added actions that let you copy the selected values, columns, or rows, or dump the whole document to the clipboard or a .CSV file.

The connection configuration for Spark and Zeppelin now supports HTTP proxies. You can configure a proxy for any of the connections in the Big Data Tools Connections settings:

[Screenshot: HTTP proxy settings for a connection]

We hope this will make it easier for you to connect to Spark and Zeppelin in your secure environments.

Speaking of security, the plugin now also lets you configure and use Kerberos authentication for connecting to your Spark server:

[Screenshot: Kerberos authentication settings for a Spark connection]

For more details on how to use the HTTP proxy and Kerberos, please see the updated documentation.

Those are all the major improvements in this update. The full list of changes (bug fixes and minor improvements) can be found in the release notes.

With all that said, we’d like to ask you to try the new version of the plugin and share your feedback and bug reports with us.

If you have an idea for a cool feature that the Big Data Tools plugin could add in the future, please share it here in the comments or in the bug tracker, use this feedback form, or sound off in our Slack workspace. Thanks a lot!

The Big Data Tools team
The Drive to Develop

Update on Big Data Tools Plugin: Spark, HDFS, Parquet and More

It’s been a while since our last update. If you remember, last year we announced IntelliJ IDEA’s integration with Apache Zeppelin and S3, as well as an experimental integration with Apache Spark. The Spark integration was only available in the unstable update channel. But we have great news: today we’re releasing a new version of the plugin that finally makes Spark support publicly available. It also adds support for HDFS and Parquet.

Spark Monitoring

Now that the Spark integration is available in the public update, let us quickly catch you up on what it can do for you.

To monitor your Spark jobs, all you have to do now is go to the Big Data Tools Connections settings and add the URL of your Spark History Server.

Once you’ve done that, close the settings and open the Spark tool window in the bottom right of the IDE’s window. The Spark tool window displays the list of completed and running Spark applications (on the Applications tab, which is collapsed by default), as well as the list of jobs, their stages, and tasks.

By clicking the Executors tab, you’ll see information about the active and inactive executors.

At the moment, the SQL tab shows a list of recent queries, but it doesn’t yet include the actual SQL. Additionally, if you are using Kerberos with Spark, the IDE might not allow you to connect to the server. We’re working on fixing this in one of the next updates. If you use Kerberos, please let us know so we can prioritize this task over the others.

HDFS

Similar to the S3 support that we introduced in December, the plugin now allows you to connect to your HDFS servers to explore and manage your files from the IDE. To enable this feature, just go to the Big Data Tools Connections settings and add an HDFS configuration.

Currently, you have to specify the root path and the way to connect to the server: either Configuration Files Directory or Explicit URI.

Once you’ve configured HDFS servers, you’ll see them appear in the Big Data Tools tool window (next to your Apache Zeppelin notebooks and S3 buckets, if you’ve configured any, of course).

The Big Data Tools tool window displays the files and folders that are stored on the configured servers. As is the case for S3, CSV and Parquet files in HDFS can be expanded in the tree to show their file schemas. The context menu invoked on any file or folder provides a variety of actions.

These options allow you to manage files, copy them to your local machine, or preview them in the editor. Previewing allows you to see the first chunk of the file content without fully copying it to your machine.

Parquet

As mentioned above, this update introduces initial support for Parquet files. Now you can open any Parquet file in the IDE and view its content as a table.

When opening Parquet files, the plugin only displays the first portion rather than the entirety of the content. This is especially useful when you work with very large files.

Note that, just as with Spark, you need direct access to the servers in order to access the files. This means that if your servers are behind an SSH tunnel, you currently have to establish the tunnel yourself. If you experience any issues or inconveniences when accessing your files, please let us know; otherwise, we might not learn about specific scenarios that aren’t supported yet. The sooner you provide your feedback, the better!

That’s it for today. As you might have noticed, up until now we’ve published our updates in the Scala blog, and this is the first update published in the IntelliJ IDEA blog. We’re doing this because the plugin no longer merely offers Apache Zeppelin and Scala support; instead, it integrates a much wider variety of tools for working with big data.

To see the complete list of bug fixes in this update, please refer to the release notes.

And last but not least, in case you need help on how to use any feature of the plugin, make sure to check out the documentation. Still need help? Please don’t hesitate to leave us a message either here in the comments or on Twitter.

P.S.: Because the plugin is still in an early stage of development, its many integrations may not support the whole variety of scenarios. This is why, at this point in time, we’re heavily relying on your feedback. In the event you see that an important user scenario (e.g. a certain authorization type, or some other specifics) is not supported, please let us know here in the comments, in the issue tracker, or in our feedback survey.
