NDS-1021
Background
We are currently exploring how to support integration between the Workbench system and Spark. In practice, this means the ability to authenticate against a remote Spark cluster and then run and monitor jobs on it. There are existing integrations with the Jupyter and Zeppelin notebook frameworks. A simple proof of concept would be to demonstrate running Zeppelin or Jupyter notebooks (or both) in Workbench, connecting to a remote Spark cluster.
Zeppelin vs. Jupyter
Zeppelin is an Apache web-based notebook application, similar to Jupyter. Zeppelin supports both single-user and multi-user installations. The two offer broadly similar features, with different strengths and weaknesses. This might be another compelling case for the Workbench system: we don't care whether you use Zeppelin notebooks or Jupyter notebooks.
Zeppelin's Spark integration is built in, so there are no additional modules or plugins to install.
See also:
- https://dwhsys.com/2017/03/25/apache-zeppelin-vs-jupyter-notebook/
- https://www.linkedin.com/pulse/comprehensive-comparison-jupyter-vs-zeppelin-hoc-q-phan-mba-
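Because the Spark interpreter ships with Zeppelin, pointing a Zeppelin container at a remote cluster is mostly configuration. A minimal sketch of `conf/zeppelin-env.sh`, assuming a reachable Spark master (the hostname and install path below are assumptions, not tested values):

```shell
# conf/zeppelin-env.sh -- sketch only; hostname and SPARK_HOME are assumptions
export MASTER=spark://spark-master.example.com:7077   # remote Spark master URL
export SPARK_HOME=/opt/spark                          # Spark install inside the container
```

The same properties can alternatively be set per-interpreter in Zeppelin's interpreter settings UI rather than via environment variables.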
Livy
The Livy REST API server appears to be required by both Zeppelin and Jupyter when submitting jobs to a remote cluster.
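Submitting work through Livy comes down to a small set of REST calls: `POST /sessions` to start an interactive session, then `POST /sessions/{id}/statements` to run code in it. A minimal sketch in Python, assuming a Livy server at `livy-server:8998` (the hostname is an assumption; 8998 is Livy's default port):

```python
LIVY_URL = "http://livy-server:8998"  # assumed hostname; 8998 is Livy's default port

def session_payload(kind="pyspark", conf=None):
    """Build the JSON body for POST /sessions (create an interactive session)."""
    body = {"kind": kind}          # "pyspark", "spark", or "sparkr"
    if conf:
        body["conf"] = conf        # extra Spark conf, e.g. {"spark.executor.memory": "2g"}
    return body

def statement_payload(code):
    """Build the JSON body for POST /sessions/{id}/statements (run code in a session)."""
    return {"code": code}

# With the `requests` library installed, the round trip would look like (not executed here):
#   import requests
#   session = requests.post(f"{LIVY_URL}/sessions", json=session_payload()).json()
#   requests.post(f"{LIVY_URL}/sessions/{session['id']}/statements",
#                 json=statement_payload("sc.parallelize(range(100)).count()"))
```

Statement results are then polled via `GET /sessions/{id}/statements/{statement_id}` until the state is `available`.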
Jupyter/Spark integration
There are plenty of existing examples demonstrating Jupyter integration with Spark.
- Using Jupyter on Apache Spark: Step-by-Step with a Terabyte of Reddit Data
- Install Jupyter notebook on your computer and connect to Apache Spark on HDInsight
There is an existing Jupyter/Spark Docker image, along with the sparkmagic extension for Jupyter.
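On the Jupyter side, sparkmagic talks to the cluster through Livy rather than to Spark directly, so wiring it up is mostly a config file. A sketch of `~/.sparkmagic/config.json`, assuming a Livy endpoint at `livy-server:8998` (the hostname is an assumption):

```json
{
  "kernel_python_credentials": {
    "username": "",
    "password": "",
    "url": "http://livy-server:8998",
    "auth": "None"
  }
}
```

With this in place, the `%%spark` magics (or the dedicated PySpark kernel) route notebook cells through Livy to the remote cluster.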
Conclusions
It looks like we should be able to launch jobs on a remote Spark cluster from either single-user Jupyter or Zeppelin notebook containers within Workbench, via Livy. The devil is in the details, but there are working examples of both.