
Big Data Tools EAP Is Now Also Available for DataGrip and PyCharm

At the end of last year, we announced a preview of the IntelliJ IDEA Ultimate plugin that integrated Apache Zeppelin notebooks into the IDE. At the same time, we shared our roadmap, in which we promised to support more tools for working with Big Data. Since then, the plugin team has been working hard and has extended the plugin with support for Apache Spark, Apache Hadoop’s HDFS, AWS S3, Google Cloud Storage, and Parquet files.

Because the plugin originally started with Scala support in Zeppelin notebooks, it was reasonable for it to be available only for IntelliJ IDEA Ultimate. Now that the plugin supports a much wider set of scenarios and tools, the time has come to make it available for other IDEs too. With that, we are excited to announce that Big Data Tools is now also available for DataGrip and PyCharm Professional.

Why DataGrip and PyCharm? Big Data Tools is one of the first JetBrains plugins that aims to solve problems involving both code and data. Since the plugin offers tools for working with data, we think it’s logical to make it available to DataGrip users, as it will extend what they can do with distributed file storage systems and columnar file formats. At the same time, PyCharm users who use PySpark or otherwise work with data will benefit from having this plugin available in their IDE.

It’s important to highlight that Big Data Tools is still in EAP and has some limitations. The most important one, for now, is that the version of the plugin for PyCharm and DataGrip offers all the features available in IntelliJ IDEA except Zeppelin notebooks. Adding Zeppelin notebook support is on our roadmap, and we hope to have it ready soon.

The current feature set includes:

  • A file browser for distributed file storage systems, such as AWS S3, HDFS, and GCS (support for other cloud storage, such as Microsoft Azure, is coming soon). With this browser, you can browse folders and files, preview files, and manage them: create, copy, rename, delete, upload, and download (a small programmatic sketch of the S3 case follows this list).

  • A viewer for columnar file formats, such as Parquet (support for other formats, such as Avro and ORC, is coming soon too).

  • A monitoring console for Spark clusters. With this console, you can browse cluster nodes, Spark jobs, their stages, and tasks.
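
To make the file browser item above concrete: the plugin does all of this through its UI, but a rough equivalent of browsing an S3 bucket in code, using boto3, might look like the following sketch. The bucket name and prefix are placeholders, not anything the plugin requires.

```python
import boto3

# Hypothetical bucket and prefix; the plugin's file browser performs
# these operations for you through the IDE's UI.
s3 = boto3.client("s3")

response = s3.list_objects_v2(Bucket="my-data-bucket", Prefix="events/2020/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```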

Please note that the plugin is currently available for IDEs of version 2020.1 or higher.

Additional information on the plugin can be found in the plugin repository.

Documentation for the plugin is now available for both DataGrip and PyCharm.

The easiest way to install the plugin is to open the IDE’s plugin settings, click Marketplace, search for “Big Data Tools”, install the plugin, and restart the IDE.

Feel free to try the plugin, share your feedback, and spread the word!

The JetBrains team
The Drive to Develop


Big Data Tools EAP 6: Google Cloud Storage, Proxy, Kerberos, and Parquet Improvements

It’s been a while since our last update, but today we’re excited to give you a new EAP build. Originally, we planned to work exclusively on bug fixes and stability improvements in this build. However, we couldn’t resist adding a completely new feature that has been on our roadmap for some time: integration with Google Cloud Storage.

Using the Google Cloud Storage integration is similar to working with AWS S3. Once you’ve added a Google Cloud Storage bucket configuration in the Big Data Tools Connections settings, you’ll see the bucket and its contents in the Big Data Tools tool window.

Here’s what the configuration page looks like:

[Screenshot: the Google Cloud Storage connection configuration page]

You have to specify the path to your credentials JSON file, choose a bucket, and optionally a prefix if you’d like to work with a specific subfolder.
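
If you’d like to script the same kind of access, here’s a minimal sketch using the google-cloud-storage client library; the credentials path, bucket name, and prefix below stand in for whatever you configure in the plugin.

```python
from google.cloud import storage

# Hypothetical values mirroring the three settings above:
# a credentials JSON file, a bucket, and an optional prefix ("subfolder").
client = storage.Client.from_service_account_json("/path/to/credentials.json")

for blob in client.list_blobs("my-bucket", prefix="reports/"):
    print(blob.name, blob.size)
```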

Once the bucket is configured, you’ll see the files and folders hierarchically in the Big Data Tools tool window:

[Screenshot: a GCS bucket’s files and folders in the Big Data Tools tool window]

The context menu provides the same actions that work for AWS S3. You can copy, move, and rename your files and folders, download them to your local disk, and open them for a preview. In the case of a preview, the IDE downloads only a chunk of the file, which is very handy if you’d like to preview a large file, e.g. a Parquet or CSV file.
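
If you want the same chunked behavior in a script, here’s a hedged sketch using a ranged download with the google-cloud-storage client; the bucket and object names are placeholders, and this is only an approximation of what the IDE does internally.

```python
from google.cloud import storage

client = storage.Client.from_service_account_json("/path/to/credentials.json")
blob = client.bucket("my-bucket").blob("logs/huge-file.csv")

# Download only the first 64 KiB; start and end are inclusive byte offsets.
head = blob.download_as_bytes(start=0, end=64 * 1024 - 1)
print(head.decode("utf-8", errors="replace"))
```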

Speaking of Parquet support, we’ve made certain improvements. First, we’ve fixed some edge cases in which it didn’t work. Second, we’ve reworked the header so that it properly displays the column headers and lets you sort the rows by any column:

[Screenshot: the Parquet viewer with sortable column headers]

Last but not least, we’ve added actions that let you copy the selected values, columns, or rows, or dump the whole document to the clipboard or a .csv file.
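
As a rough programmatic counterpart to the viewer’s sorting and CSV dump, here’s a sketch using pyarrow and pandas; the file and column names are invented for illustration.

```python
import pyarrow.parquet as pq

# Load the Parquet file into a pandas DataFrame (file and column names
# are made up for this example).
df = pq.read_table("events.parquet").to_pandas()

# Sort the rows by a column, as clicking a header in the viewer does.
df = df.sort_values("timestamp")

# Dump the whole document to a .csv file.
df.to_csv("events.csv", index=False)
```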

The connection configuration for Spark and Zeppelin now supports HTTP proxies. You can configure a proxy for any of the connections in the Big Data Tools Connections settings:

[Screenshot: HTTP proxy settings for a connection]

We hope this will make it easier for you to connect to Spark and Zeppelin in your secure environments.

Speaking of security, the plugin now also allows you to configure Kerberos authentication for connecting to your Spark server:

[Screenshot: Kerberos authentication settings for a Spark connection]

For more details on how to use the HTTP proxy and Kerberos, please see the updated documentation.

Those are all the major improvements in this update. The full list of changes (bug fixes and minor improvements) can be found in the release notes.

With all that said, we’d like to ask you to try the new version of the plugin and share your feedback and bug reports with us.

If you have an idea for a cool feature that the Big Data Tools plugin could add in the future, please share it in the comments here or in the bug tracker, use this feedback form, or sound off in our Slack workspace. Thanks a lot!

The Big Data Tools team
The Drive to Develop


Update on Big Data Tools Plugin: Spark, HDFS, Parquet and More

It’s been a while since our last update. If you remember, last year we announced IntelliJ IDEA’s integration with Apache Zeppelin and S3, as well as an experimental integration with Apache Spark. The latter was only available in the unstable update channel. But we have great news: today we’re releasing a new version of the plugin that finally makes Spark support publicly available. It also adds support for HDFS and Parquet.

Spark Monitoring

Now that the Spark integration is available in the public update, let us quickly catch you up on what it can do for you.

To monitor your Spark jobs, all you have to do is go to the Big Data Tools Connections settings and add the URL of your Spark History Server.

Once you’ve done that, close the settings and open the Spark tool window in the bottom right of the IDE’s window. The Spark tool window displays the list of completed and running Spark applications (on the Applications tab, which is collapsed by default), as well as the list of jobs with their stages and tasks.
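
The tool window is backed by your Spark History Server, which exposes the same information over its REST API. If you’d like to query that API directly, a minimal sketch with the requests library could look like this; the host is a placeholder (18080 is the History Server’s default port).

```python
import requests

# Hypothetical History Server address.
base = "http://spark-history.example.com:18080/api/v1"

for app in requests.get(f"{base}/applications", timeout=10).json():
    print(app["id"], app["name"])
    # Each application also exposes its jobs, stages, and executors.
    jobs = requests.get(f"{base}/applications/{app['id']}/jobs", timeout=10).json()
    for job in jobs:
        print("  job", job["jobId"], job["status"])
```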

By clicking the Executors tab, you’ll see information about the active and inactive executors.

At the moment, the SQL tab shows a list of recent queries, but it doesn’t yet include the actual SQL. Additionally, if you are using Kerberos with Spark, the IDE might not allow you to connect to the server. We’re working on fixing this in one of the next updates. If you use Kerberos, please let us know so we can prioritize this task over the others.

HDFS

Similar to the S3 support that we introduced in December, the plugin now allows you to connect to your HDFS servers to explore and manage your files from the IDE. To enable this feature, just go to the Big Data Tools Connections settings and add an HDFS configuration.

Currently, you have to specify the root path and the way to connect to the server: either Configuration Files Directory or Explicit URI.
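
As a rough script-level analogue of the Explicit URI option, here’s a hedged sketch using pyarrow; it assumes a local Hadoop client (libhdfs) is available, and the namenode URI and path are placeholders.

```python
from pyarrow import fs

# Hypothetical namenode URI, in the spirit of the "Explicit URI" setting.
hdfs = fs.HadoopFileSystem.from_uri("hdfs://namenode.example.com:8020/")

# List the entries under a root path, as the tool window tree does.
for info in hdfs.get_file_info(fs.FileSelector("/data", recursive=False)):
    print(info.path, info.type, info.size)
```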

Once you’ve configured HDFS servers, you’ll see them appear in the Big Data Tools tool window (next to your Apache Zeppelin notebooks and S3 buckets, if you’ve configured any, of course).

The Big Data Tools tool window displays the files and folders that are stored on the configured servers. As is the case for S3, the CSV and Parquet files in HDFS can be expanded in the tree to show their file schemas. The context menu invoked on any file or folder provides a variety of actions.

These options allow you to manage files, copy them to your local machine, or preview them in the editor. Previewing allows you to see the first chunk of the file content without fully copying it to your machine.
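
If you’d like the same kind of chunked preview in a script, here’s a hedged sketch with pyarrow, reusing the same hypothetical namenode and a made-up file path.

```python
from pyarrow import fs

hdfs = fs.HadoopFileSystem.from_uri("hdfs://namenode.example.com:8020/")

# Read only the first 64 KiB instead of copying the whole file locally.
with hdfs.open_input_stream("/data/part-00000.csv") as stream:
    head = stream.read(64 * 1024)
print(head.decode("utf-8", errors="replace"))
```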

Parquet

As mentioned above, this update introduces initial support for Parquet files. Now you can open any Parquet file in the IDE and view its content as a table.

When opening Parquet files, the plugin displays only the first portion of the content rather than the entire file. This is especially useful when you work with very large files.
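
For comparison, reading just the first rows of a Parquet file in code can be done with pyarrow’s batched reader; the file name and batch size below are illustrative.

```python
import pyarrow.parquet as pq

# Hypothetical file name; iter_batches yields row batches lazily, so only
# the first portion of the file is materialized.
parquet_file = pq.ParquetFile("very-large.parquet")
first_batch = next(parquet_file.iter_batches(batch_size=1000))

print(first_batch.to_pandas())
```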

Note that, just as with Spark, you need direct network access to the servers in order to access the files. This means that if your servers are behind an SSH tunnel, you currently have to establish the tunnel yourself. If you experience any issues or inconveniences when accessing your files, please let us know; otherwise, we might never find out about specific scenarios that aren’t yet supported. The sooner you provide your feedback, the better!

That’s it for today. As you might have noticed, up until now we’ve published our updates in the Scala blog, and this is the first update published in the IntelliJ IDEA blog. We’re doing this because the plugin no longer merely offers Apache Zeppelin and Scala support; instead, it integrates a much wider variety of tools for working with big data.

To see the complete list of bug fixes in this update, please refer to the release notes.

And last but not least, if you need help using any feature of the plugin, make sure to check out the documentation. Still need help? Please don’t hesitate to leave us a message, either here in the comments or on Twitter.

P.S.: Because the plugin is still at an early stage of development, its many integrations may not support the whole variety of scenarios. This is why, at this point, we’re relying heavily on your feedback. If you see that an important user scenario (e.g. a certain authorization type, or some other specifics) is not supported, please let us know here in the comments, in the issue tracker, or via our feedback survey.

