Thoughts on generalizing what we currently call "Labs Workbench" into a general platform that can support multiple distinct use cases.

Potential Use Cases

NDS Labs Workbench

This is the existing primary use case: a service for the NDS community to explore, develop, share, and test a variety of data management tools.

Education and training

One of the clearest proven uses of the platform is for education and training purposes. Labs Workbench was used for:

Each environment is unique, but there are a few basic requirements:

Scalable analysis environment

We can also envision the platform working as a replacement for the TERRA-REF toolserver or as a DataDNS analysis environment. In this case, the requirements include:

 

Platform for the development and deployment of research data portals

Another use case, really a re-purposing of the platform, is to support the development and deployment of research data portals – aka the Zuhone case. In this case, we would use something like Workbench to develop and test services, with the ability to "push" or "publish" them, although the publishing model is still a bit unclear.

Requirements include:

Other: Working with Whole Tale

Whole Tale will also support launching Jupyter and R notebooks, but is more focused on 1) bringing data to the container and 2) reproducibility. Datasets are registered via Girder. Authentication is handled via Globus auth. The user's home directory will be backed by iRODS. WT will be deployed at NCSA and TACC. Containers will be time-constrained and launched at the site with the right data. A key component is the Data Management System, which handles caching data locally and exposing it via a FUSE filesystem to the container (and therefore handling permissions). They are hoping to leverage Labs Workbench – or at least Kubernetes – for container management.

Other: CyVerse (work in progress)

Another case coming out of the Phenome conference is the possibility of using Workbench to provide R/Jupyter support for CyVerse:

"I am very interested in setting up the CyVerse DataStore with iRODS on the Workbench. CyVerse has been talking for months about integrating Jupyter and RStudio into our ecosystem. The Labs workbench appears to be just the sort of thing we (or at least, I) need."

The CyVerse Data Store supports iRODS iCommands, FUSE, or an API (http://www.cyverse.org/data-store). We can envision several approaches: 1) Workbench mounts the CyVerse data directly; 2) Workbench mounts data via iRODS; 3) Workbench retrieves data via API.
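As a rough illustration of approach 2, a container could pull data over iRODS using the python-irodsclient library. This is a minimal sketch only; the host, zone, credentials, and path below are illustrative assumptions, not a tested configuration.

    # Minimal sketch of accessing the CyVerse Data Store via iRODS from a
    # container, assuming python-irodsclient. Host, zone, user, and path are
    # illustrative assumptions.
    from irods.session import iRODSSession

    with iRODSSession(host="data.cyverse.org", port=1247,
                      user="anonymous", password="", zone="iplant") as session:
        # Fetch a data object and stream part of its contents
        obj = session.data_objects.get("/iplant/home/shared/example/data.csv")
        with obj.open("r") as f:
            print(f.read(1024))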

Requirements might include:

Other: Collaborative Development Cloud (work in progress)

One issue that has come up recently in KnowEnG UI development is the need for TLS-protected development instances with basic auth in front. Since we offer a slew of development environments with built-in TLS and basic auth, this seemed like a natural fit.

We also offer Jenkins CI. ARI has already set this up for some of the KnowEnG folks, but we could help other similar teams gain experience with setting up their own CI, and even with testing applications that they develop from within Labs. I played around over the weekend and discovered that there are also several GitLab and Atlassian suite (JIRA + Confluence) images floating around that might be usable from within Labs.

Given the above, we have the potential to offer the following powerful combination of tools for any team of collaborating developers:

True, you could outsource any one of these (Atlassian provides the first three), but Labs is the only place I can think of where you could get them all! (wink)

Pros:

Cons:

Other: Workflow Orchestration (work in progress)

See 

Another need that has come up on the KnowEnG project is the ability to run a cluster of compute resources for scheduling analysis jobs. These jobs come in the form of a DAG (directed acyclic graph) and are effectively Docker containers with dependencies. Since the API server already contains much of the logic to talk with etcd and Kubernetes, it might not be so difficult to extend Workbench to run these types of analysis jobs.

Our "spec" architecture is already set up to handle running dependent containers and ensuring that they are running before continuing on to the next containers in the chain. If we were to add a flag (i.e. type == "job") to the specs, that could signal to the API server to run a job, instead of a service/rc/ingress, and to wait for the job to be "Completed" before running the next dependency.

I created a simple example of a Job spec YAML on raw Kubernetes just to see how a multi-container job would run and be scheduled. Apparently multiple Jobs can be scheduled at once, containing multiple containers. Each container within the Job will run sequentially (in the order listed in the spec).
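For reference, a minimal sketch of what the API server might do for a "job"-typed spec, using the kubernetes Python client: create the Job and block until it reports success before launching the next node in the DAG. The names, image, and polling interval are placeholders.

    # Sketch: create a Kubernetes Job and wait for completion before moving on
    # to the next dependency. Names and image are placeholders.
    import time
    from kubernetes import client, config

    config.load_kube_config()            # or config.load_incluster_config()
    batch = client.BatchV1Api()

    job = client.V1Job(
        api_version="batch/v1", kind="Job",
        metadata=client.V1ObjectMeta(name="example-job"),
        spec=client.V1JobSpec(
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    restart_policy="Never",
                    containers=[client.V1Container(
                        name="step1", image="busybox",
                        command=["sh", "-c", "echo running step1"])]))))

    batch.create_namespaced_job(namespace="default", body=job)

    # Poll until the Job reports success (a real implementation would also
    # handle failures and timeouts), then continue to its dependents.
    while True:
        status = batch.read_namespaced_job(name="example-job",
                                           namespace="default").status
        if status.succeeded:
            break
        time.sleep(5)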

I still need to ask for a real-life example of both a simple and a complex DAG to gather more details and create a more realistic prototype. We had previously discussed investigating Kubernetes to handle the scheduling, but we decided to look into BD2K's cwltoil framework instead.

Pros:

Cons:

Current features/components

Deployment (OpenStack)

We currently have two methods of deploying the Labs Workbench service: 1) ndslabs-startup (single node) and 2) deploy-tools (multi-node OpenStack).

The ndslabs-startup tool provides a set of scripts to deploy NDS Labs services to a single VM. This is intended primarily for development and testing. The deployment is incomplete (no shared storage, NRPE, LMA, or backup), but adding these services would be a minor effort. Minikube was considered as an option, but it is problematic when running on a VM in OpenStack and might require additional investigation.

The deploy-tools image provides a set of Ansible plays designed specifically to support the provisioning and deployment of a Kubernetes cluster on OpenStack, with hard dependencies on CoreOS and GlusterFS. It's unclear whether this can be replaced by openstack-heat. Deploy-tools has 3 parts: 1) OpenStack provisioning, 2) Kubernetes install, and 3) Labs components install. The OpenStack provisioning uses the OpenStack API and Ansible support to provision instances and volumes. The Kubernetes install is based on the contrib/ansible community tools with very minor local modifications. The Labs components install primarily deploys Kubernetes objects.

For commercial cloud providers, we cannot use our deployment process. Fortunately, these services already have the ability to provision Kubernetes clusters: AWS, Azure, and GCE.

CoreOS (Operating system)

The deploy-tools image assumes that you are deploying CoreOS instances. This choice is arbitrary, but there are many assumptions in the deploy-tools component that are bound to the OS choice. Different providers make different OS decisions: Kubernetes seems to lean toward Fedora and Debian, GCE itself uses Debian, Azure uses Ubuntu, etc. This may not be important if we can rely on the Kubernetes deployments provided by each commercial cloud provider.

Docker (Container)

The Labs Workbench system assumes Docker, but there are other container options. Kubernetes also supports rkt. This is something we've discussed but never explored.

Orchestration (Kubernetes)

Labs Workbench relies heavily on Kubernetes itself. The API server integrates directly with the Kubernetes API. Of all the basic requirements, this seems to be the one least likely to change.

Gluster FS (Storage)

Labs Workbench uses a custom GlusterFS solution for shared storage. A single Gluster volume is provisioned (4 GFS servers) and mounted to each host. The shared volume is accessed by containers via hostPath.
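Roughly, the hostPath pattern looks like the following, expressed with the kubernetes Python client. The paths, image, and names are illustrative, not the actual Workbench configuration.

    # Rough sketch of the hostPath pattern: the Gluster volume is mounted on
    # every host, and containers bind-mount a subdirectory of it. Paths and
    # names are illustrative assumptions.
    from kubernetes import client

    shared = client.V1Volume(
        name="shared",
        host_path=client.V1HostPathVolumeSource(path="/var/glfs/global"))

    container = client.V1Container(
        name="jupyter", image="jupyter/minimal-notebook",
        volume_mounts=[client.V1VolumeMount(name="shared",
                                            mount_path="/home/jovyan/work")])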

This approach was necessary due to the lack of support for persistent volume claims on OpenStack. For commercial cloud providers, we'll need to re-think this approach. We could use a single volume claim (one giant shared disk), a volume claim per user, or a volume claim per application. There are benefits and weaknesses to each of these approaches. For example, with a cloud provider you don't want to pay for a giant provisioned disk that sits mostly unused; the per-account approach may be better.
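For the per-user (or per-account) alternative, the API server could request a claim at account creation and let the provider's storage class handle provisioning. A hedged sketch follows; the size, access mode, namespace, and naming convention are assumptions for illustration only.

    # Hedged sketch of a per-user persistent volume claim; size, access mode,
    # namespace, and naming are illustrative assumptions.
    from kubernetes import client, config

    config.load_kube_config()
    core = client.CoreV1Api()

    pvc = client.V1PersistentVolumeClaim(
        api_version="v1", kind="PersistentVolumeClaim",
        metadata=client.V1ObjectMeta(name="home-jdoe"),
        spec=client.V1PersistentVolumeClaimSpec(
            access_modes=["ReadWriteOnce"],
            resources=client.V1ResourceRequirements(
                requests={"storage": "10Gi"})))

    core.create_namespaced_persistent_volume_claim(namespace="jdoe", body=pvc)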

Other storage includes mounted volumes for /var/lib/docker and /var/lib/kubelet.

Dedicated etcd

We no longer rely on the Kubernetes etcd service; we provide our own instance that runs within the cluster.

SMTP Server / Relay

We now provide an in-cluster SMTP relay that can be configured to use Google credentials. This makes it very simple to use your Google credentials to send verification / approval / support e-mails.
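A minimal sketch of how a component might send mail through the in-cluster relay is shown below, using only the standard library. The relay hostname and addresses are illustrative assumptions, not the actual configuration.

    # Minimal sketch: send a verification e-mail through the in-cluster SMTP
    # relay. The relay hostname and addresses are illustrative assumptions.
    import smtplib
    from email.mime.text import MIMEText

    msg = MIMEText("Click the link below to verify your account.")
    msg["Subject"] = "Workbench account verification"
    msg["From"] = "support@example.org"
    msg["To"] = "newuser@example.org"

    with smtplib.SMTP("smtp.default.svc.cluster.local", 25) as relay:
        relay.send_message(msg)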

REST API Server/CLI

Labs Workbench provides a thin REST interface over Kubernetes. Basic operations include authentication, account management (register, approve, deny, delete), service management (add/update/remove), application instance management (add/update/remove/start/stop/logs), and console access. The primary purpose of the REST API is to support the Angular Web UI. The API server depends on the Kubernetes API, etcd, Gluster (for shared volume support), and SMTP.
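To give a feel for the shape of the API, here is a hedged sketch of a client interaction. The base URL, endpoint paths, and payload fields are hypothetical, for illustration only; they are not the documented API.

    # Hypothetical sketch of a REST client interaction; endpoint paths and
    # payload fields are illustrative, not the documented API.
    import requests

    base = "https://www.workbench.nds.org/api"    # hypothetical base URL

    # Authenticate and reuse the returned token for subsequent calls
    resp = requests.post(f"{base}/authenticate",
                         json={"username": "jdoe", "password": "secret"})
    token = resp.json()["token"]
    headers = {"Authorization": f"Bearer {token}"}

    # List available services from the application catalog
    services = requests.get(f"{base}/services", headers=headers).json()
    print([s["key"] for s in services])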

Web UI

The Web UI is a monolithic AngularJS application that interfaces with the REST API.

Application Catalog

Labs Workbench supports custom application catalogs via GitHub. Eventually, it may be nice to provide a more user-friendly method for adding and removing services.

Ingress Controller

Labs Workbench relies on the Kubernetes contrib Nginx ingress controller (reverse proxy) to provide access to running services, including authentication. We've made only minor modifications to some of the configuration options.

We know that GCE uses a version of the Nginx controller, but it's unclear whether it's the same as the version we use.  
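For context, the per-application ingress rule pattern looks roughly like the following, expressed with the current networking/v1 API (the contrib controller we deploy predates this). The hostname, service name, and namespace are illustrative.

    # Rough sketch of a per-application ingress rule; hostname, service name,
    # and namespace are illustrative assumptions.
    from kubernetes import client, config

    config.load_kube_config()
    ingress = client.V1Ingress(
        api_version="networking.k8s.io/v1", kind="Ingress",
        metadata=client.V1ObjectMeta(name="jupyter-jdoe"),
        spec=client.V1IngressSpec(rules=[client.V1IngressRule(
            host="jupyter-jdoe.workbench.nds.org",
            http=client.V1HTTPIngressRuleValue(paths=[client.V1HTTPIngressPath(
                path="/", path_type="Prefix",
                backend=client.V1IngressBackend(
                    service=client.V1IngressServiceBackend(
                        name="jupyter-jdoe",
                        port=client.V1ServiceBackendPort(number=80))))]))]))

    client.NetworkingV1Api().create_namespaced_ingress(namespace="jdoe",
                                                       body=ingress)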

Wildcard DNS and TLS

Labs Workbench relies on wildcard DNS (*.workbench.nds.org) to provide access to running services. For security purposes, this also requires a wildcard TLS certificate (revocable, 1 year).
For short-term deployments, TLS can be disabled (DNS is still required). It's unclear how this relates to commercial cloud providers.

Backup

A backup container is provided to back up Gluster volumes, etcd, and Kubernetes configs. This is tightly coupled to the Workbench architecture. The backup server is hosted at SDSC. We should be able to generalize this solution if needed.

Monitoring (Nagios/Qualys)

A Nagios NRPE image is provided to support monitoring of instances, with some Kubernetes support. We also use the contrib addons (Grafana, etc.), deployed as standard services.

Commercial cloud providers provide their own monitoring tools, e.g., GCE Monitoring.

Docker cache

The Labs Workbench system deployed via deploy-tools includes a local Docker cache to minimize network traffic from image pulls.

Private Docker registry

The Labs Workbench system deployed via deploy-tools includes a private Docker registry to share images privately within your cluster without needing to push them to Docker Hub.

Automated Testing 

The Angular Web UI includes a facility for executing automated Selenium smoke tests.
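The general shape of such a smoke test is roughly as follows; the URL and the title check are illustrative, not the actual test suite.

    # Rough sketch of a Selenium smoke test; the URL and title check are
    # illustrative assumptions, not the actual test suite.
    from selenium import webdriver

    driver = webdriver.Firefox()
    try:
        driver.get("https://www.workbench.nds.org")
        assert "Workbench" in driver.title   # basic "did the UI load" check
    finally:
        driver.quit()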

 

What would need to change?

Other thoughts