Related to: [Jira issue]
Background
We are currently exploring how to support integration between the Workbench system and Spark clusters: the ability to authenticate against, run, and monitor jobs on a Spark cluster remotely. Integrations already exist for the Jupyter and Zeppelin notebook frameworks as well as RStudio. A simple proof of concept would be to demonstrate Zeppelin or Jupyter notebooks (or both) running in Workbench and connecting to a remote Spark cluster.
Zeppelin vs. Jupyter vs. RStudio vs. Cloud9
Zeppelin is an Apache notebook application for data-driven, interactive analytics, similar to Jupyter. Zeppelin supports both single- and multi-user installations. The two support very similar feature sets, with different strengths and weaknesses. This may be another compelling case for the Workbench system: we don't care whether you use Zeppelin notebooks or Jupyter notebooks.
Zeppelin's Spark integration is built in, so there are no additional modules or plugins to install. Jupyter requires a plugin. RStudio apparently also supports Spark integration via sparklyr, including connections through Livy.
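To illustrate the built-in integration, here is a minimal sketch of a Zeppelin note paragraph. The `%pyspark` interpreter directive and the injected `spark` session are standard Zeppelin behavior; the DataFrame code itself is purely illustrative.

```python
%pyspark
# Zeppelin's built-in Spark interpreter injects `sc` and `spark` into the
# paragraph automatically; no session setup or plugin installation is needed.
df = spark.range(0, 1000)
print("row count:", df.count())
```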
It looks like we should be able to launch jobs on a remote Spark cluster from either single-user Jupyter or Zeppelin notebook containers within Workbench via Livy. The devil is in the details, but there are examples of both.
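As a concrete sketch of the Livy path, the snippet below drives Livy's REST API from Python to create a PySpark session and run a single statement. The Livy host and the submitted code are hypothetical placeholders.

```python
import time
import requests

# Hypothetical Livy endpoint on the Spark cluster's edge node.
LIVY = "http://livy.example.org:8998"

# 1. Create an interactive PySpark session.
session = requests.post(f"{LIVY}/sessions", json={"kind": "pyspark"}).json()
session_url = f"{LIVY}/sessions/{session['id']}"

# 2. Wait until the session is ready to accept statements.
while requests.get(session_url).json()["state"] != "idle":
    time.sleep(2)

# 3. Submit a statement (the code itself is just an illustration).
stmt = requests.post(
    f"{session_url}/statements",
    json={"code": "sc.parallelize(range(100)).sum()"},
).json()

# 4. Poll for the result.
stmt_url = f"{session_url}/statements/{stmt['id']}"
while True:
    result = requests.get(stmt_url).json()
    if result["state"] == "available":
        print(result["output"])
        break
    time.sleep(1)
```

Because everything goes over HTTP, the same pattern works from a Zeppelin note, a Jupyter kernel, or any Workbench container that can reach the Livy endpoint.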
See also:
- https://dwhsys.com/2017/03/25/apache-zeppelin-vs-jupyter-notebook/
- https://www.linkedin.com/pulse/comprehensive-comparison-jupyter-vs-zeppelin-hoc-q-phan-mba-
Livy vs. no Livy
As noted in [Jira issue]:
...
Jupyter/Spark integration
There are plenty of existing examples demonstrating Jupyter integration with Spark.
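For example, one common non-Livy pattern connects a notebook directly to the cluster, assuming `pyspark` is installed in the notebook container and the cluster's standalone master is reachable (the host below is a placeholder):

```python
# A direct (non-Livy) connection from a Jupyter notebook to a remote Spark
# cluster. The master URL is a placeholder for the actual cluster endpoint.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://spark-master.example.org:7077")  # hypothetical master
    .appName("workbench-jupyter-demo")
    .getOrCreate()
)

df = spark.range(0, 1000)
print("row count:", df.count())

spark.stop()
```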
...
See also Kevin's Freund case in the SC17 Demo.
Conclusions
...
Big picture
We're starting to define a possible bigger picture for the integration of Workbench and Spark clusters:
[Gliffy diagram]
A few notes:
- Similar to the TERRA-REF/ROGER/HPC case, Workbench would provide nearby, web-based access to Spark resources.
- We are platform agnostic (Zeppelin, Jupyter, RStudio, Cloud9, whatever), focusing on user needs.
- Like the ROGER case, ideally we can support shared authentication and filesystems: the user has a single identity that can be used to SSH into the edge node or to run applications in Workbench.
- Ideally, the user could transfer data into the system via Workbench, easily initiating transfers from remote systems to the Spark cluster.
- We have two example datasets: the NBI data, imported from MongoDB (see the sketch after this list), and the NCSA Genomics case.
- Still many details to work out.
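As a rough sketch of the NBI/Mongo case, the snippet below loads a MongoDB collection into a Spark DataFrame using the MongoDB Spark connector. The connection URI, database and collection names, and connector version are hypothetical placeholders.

```python
# Sketch of loading the NBI data from MongoDB into Spark via the
# MongoDB Spark connector (mongo-spark-connector). The URI, database, and
# collection names below are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("nbi-mongo-import")
    .config("spark.mongodb.input.uri",
            "mongodb://mongo.example.org:27017/nbi.bridges")  # hypothetical
    .config("spark.jars.packages",
            "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")
    .getOrCreate()
)

# Read the collection as a DataFrame and inspect it.
df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
df.printSchema()
print("documents:", df.count())
```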