Related to NDS-1021.

Background

We are currently exploring how to support integration between the Workbench system and Spark clusters.  This means the ability to authenticate to a Spark cluster and to run and monitor jobs on it remotely.  There are existing integrations with the Jupyter and Zeppelin notebook frameworks as well as RStudio.  A simple proof of concept would be to demonstrate running Zeppelin or Jupyter notebooks (or both) in Workbench, connected to a remote Spark cluster.

Zeppelin vs. Jupyter vs. RStudio vs. Cloud9

Zeppelin is an Apache web-based notebook application for data-driven, interactive analytics, similar to Jupyter. Zeppelin supports both single-user and multi-user installations. The two seem to support very similar features, with different strengths and weaknesses.  This might be another compelling case for the Workbench system – we don't care whether you use Zeppelin notebooks or Jupyter notebooks.

Zeppelin/Spark integration is built in, so there are no additional modules or plugins to install.  Jupyter requires plugins. RStudio apparently also supports Spark integration via sparklyr, including connections through Livy.

It looks like we should be able to launch jobs from either single-user Jupyter or Zeppelin notebook containers on a remote Spark cluster from within Workbench via Livy.  The devil is in the details, but there are examples of both.
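
To make "via Livy" concrete, here is a minimal sketch of what the Zeppelin Livy interpreter and Jupyter's sparkmagic essentially do under the hood: talk to Livy's REST API to create an interactive session and run a PySpark statement on the remote cluster. The Livy URL, polling intervals, and lack of error handling are illustrative assumptions, not part of any existing Workbench setup.

```python
import time
import requests

# Assumed Livy endpoint; in a real deployment this would point at the
# cluster's Livy server (default port 8998).
LIVY_URL = "http://livy.example.org:8998"
HEADERS = {"Content-Type": "application/json"}

# 1. Create an interactive PySpark session on the remote cluster.
session = requests.post(f"{LIVY_URL}/sessions",
                        json={"kind": "pyspark"}, headers=HEADERS).json()
session_url = f"{LIVY_URL}/sessions/{session['id']}"

# 2. Wait for the session to become idle (i.e., ready to accept code).
while requests.get(session_url, headers=HEADERS).json()["state"] != "idle":
    time.sleep(2)

# 3. Submit a Spark statement; Livy runs it on the cluster and holds the result.
stmt = requests.post(f"{session_url}/statements",
                     json={"code": "sc.parallelize(range(100)).sum()"},
                     headers=HEADERS).json()
stmt_url = f"{session_url}/statements/{stmt['id']}"

# 4. Poll until the statement finishes, then print whatever Livy captured.
while True:
    result = requests.get(stmt_url, headers=HEADERS).json()
    if result["state"] in ("available", "error"):
        print(result.get("output"))
        break
    time.sleep(2)

# 5. Tear down the remote session.
requests.delete(session_url, headers=HEADERS)
```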

See also:

Livy vs. no Livy

As noted in NDS-1013, we have the ability to connect remotely to Spark via YARN or Livy.  YARN integration seems to assume that you are on the same network as the cluster, with the cluster configuration available locally. Livy is designed specifically to support remote execution. The Livy REST API server does appear to be required for both Zeppelin and Jupyter when submitting jobs remotely.
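
To make the contrast concrete: a direct YARN submission typically needs the Hadoop/Spark client configuration (e.g. HADOOP_CONF_DIR) available on the submitting host, while Livy only needs the server's URL. The sketch below submits a batch job through Livy's /batches endpoint; the endpoint URL, application path, and arguments are hypothetical.

```python
import time
import requests

# Assumed Livy endpoint; only this URL is needed on the client side.
# No Hadoop/YARN configuration files are required locally.
LIVY_URL = "http://livy.example.org:8998"

# Submit a batch job. The application path is resolved on the cluster side
# (HDFS or a path visible to the Livy server), not on this client.
batch = requests.post(
    f"{LIVY_URL}/batches",
    json={
        "file": "hdfs:///jobs/wordcount.py",    # hypothetical application
        "args": ["hdfs:///data/input.txt"],     # hypothetical input
        "conf": {"spark.executor.memory": "2g"},
    },
    headers={"Content-Type": "application/json"},
).json()

# Poll the batch until it reaches a terminal state.
batch_url = f"{LIVY_URL}/batches/{batch['id']}"
while True:
    state = requests.get(batch_url).json()["state"]
    print("batch state:", state)
    if state in ("success", "dead", "killed"):
        break
    time.sleep(5)
```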

See also:

Jupyter/Spark integration

There are plenty of existing examples demonstrating Jupyter integration with Spark.

There is an existing Jupyter Spark Docker image, along with Jupyter Spark magics for submitting code to a remote cluster.
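
As a rough sketch of what the Jupyter side could look like, the cells below use the sparkmagic extension, which talks to Livy behind the scenes. The magics themselves (%load_ext sparkmagic.magics, %manage_spark, %%spark) come from the sparkmagic project, but the session and endpoint details are assumptions and would normally be supplied through sparkmagic's configuration file or its session-management widget.

```python
# Cell 1: load the sparkmagic extension inside the Jupyter notebook container.
%load_ext sparkmagic.magics

# Cell 2: open sparkmagic's session-management widget to register a Livy
# endpoint (the remote cluster) and start a PySpark session.
%manage_spark
```

```python
%%spark
# Cell 3: this cell is shipped to Livy and executed on the remote cluster,
# where `spark` and `sc` are already defined for the session.
df = spark.range(1000)
print(df.count())
```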

See also Kevin's Freund case in the SC17 Demo.

Big picture

We're starting to define a possible bigger picture for the integration of Workbench and Spark clusters:

[Diagram: wb-spark (proposed Workbench/Spark integration)]

A few notes:

  • Similar to the TERRA-REF/ROGER/HPC case, Workbench would provide nearby, web-based access to Spark resources.  
  • We are platform agnostic – we can support Zeppelin, Jupyter, RStudio, Cloud9, or whatever else – focusing on user needs.
  • Like the ROGER case, ideally we can support shared authentication and filesystems – the user has a single identity that can be used to SSH into the edge node or to run applications in Workbench.
  • Ideally, the user could transfer data into the system via Workbench, easily initiating transfers from remote systems to the Spark cluster.
  • We have two example datasets – the NBI data, imported from Mongo, and the NCSA Genomics case.
  • Still many details to work out.