NDS-1021
Background
We are currently exploring how to support integration between the Workbench system and Spark. In practice, this means the ability to authenticate against a remote Spark cluster and then run and monitor jobs on it. There are existing integrations with the Jupyter and Zeppelin notebook frameworks. A simple proof of concept would be to demonstrate running Zeppelin or Jupyter notebooks (or both) in Workbench, connecting to a remote Spark cluster.
Zeppelin vs. Jupyter
Zeppelin is an Apache web-based notebook application, similar to Jupyter. Zeppelin supports both single-user and multi-user installations. The two offer broadly similar features, with different strengths and weaknesses. This might be another compelling case for the Workbench system: we don't care whether you use Zeppelin notebooks or Jupyter notebooks.
Zeppelin's Spark integration is built in, so there are no additional modules or plugins to install.
See also:
- https://dwhsys.com/2017/03/25/apache-zeppelin-vs-jupyter-notebook/
- https://www.linkedin.com/pulse/comprehensive-comparison-jupyter-vs-zeppelin-hoc-q-phan-mba-
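Because the Spark interpreter ships with Zeppelin, pointing a Zeppelin container at a remote cluster is mostly configuration. A minimal sketch of `conf/zeppelin-env.sh`, assuming a reachable Spark master (the hostname and install path below are assumptions, not tested values):

```shell
# conf/zeppelin-env.sh -- sketch only; hostname and SPARK_HOME are assumptions
export MASTER=spark://spark-master.example.com:7077   # remote Spark master URL
export SPARK_HOME=/opt/spark                          # Spark install inside the container
```

The same properties can alternatively be set per-interpreter in Zeppelin's interpreter settings UI rather than via environment variables.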
Livy
The Livy REST API server appears to be required by both Zeppelin and Jupyter when submitting jobs to a remote cluster.
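Submitting work through Livy comes down to a small set of REST calls: `POST /sessions` to start an interactive session, then `POST /sessions/{id}/statements` to run code in it. A minimal sketch in Python, assuming a Livy server at `livy-server:8998` (the hostname is an assumption; 8998 is Livy's default port):

```python
LIVY_URL = "http://livy-server:8998"  # assumed hostname; 8998 is Livy's default port

def session_payload(kind="pyspark", conf=None):
    """Build the JSON body for POST /sessions (create an interactive session)."""
    body = {"kind": kind}          # "pyspark", "spark", or "sparkr"
    if conf:
        body["conf"] = conf        # extra Spark conf, e.g. {"spark.executor.memory": "2g"}
    return body

def statement_payload(code):
    """Build the JSON body for POST /sessions/{id}/statements (run code in a session)."""
    return {"code": code}

# With the `requests` library installed, the round trip would look like (not executed here):
#   import requests
#   session = requests.post(f"{LIVY_URL}/sessions", json=session_payload()).json()
#   requests.post(f"{LIVY_URL}/sessions/{session['id']}/statements",
#                 json=statement_payload("sc.parallelize(range(100)).count()"))
```

Statement results are then polled via `GET /sessions/{id}/statements/{statement_id}` until the state is `available`.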
Jupyter/Spark integration
There are plenty of existing examples demonstrating Jupyter integration with Spark.
- Using Jupyter on Apache Spark: Step-by-Step with a Terabyte of Reddit Data
- Install Jupyter notebook on your computer and connect to Apache Spark on HDInsight
There is an existing Jupyter/Spark Docker image, along with the sparkmagic extension for Jupyter.
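On the Jupyter side, sparkmagic talks to the cluster through Livy rather than to Spark directly, so wiring it up is mostly a config file. A sketch of `~/.sparkmagic/config.json`, assuming a Livy endpoint at `livy-server:8998` (the hostname is an assumption):

```json
{
  "kernel_python_credentials": {
    "username": "",
    "password": "",
    "url": "http://livy-server:8998",
    "auth": "None"
  }
}
```

With this in place, the `%%spark` magics (or the dedicated PySpark kernel) route notebook cells through Livy to the remote cluster.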
Conclusions
It looks like we should be able to launch jobs on a remote Spark cluster from either single-user Jupyter or Zeppelin notebook containers within Workbench, via Livy. The devil is in the details, but there are working examples of both.