...

Gliffy Diagram: sc16-box-diagram

Components we already have available:

  • NCSA "Tool Manager" demonstrated at NDSC6 
    • Angular UI over a very simple Python/Flask REST API. 
    • The REST API lets you get a list of supported tools and post/put/delete instances of running tools (see the sketch after this list). 
    • Fronted with a basic NGINX proxy that routes traffic to the running container based on ID (e.g., http://toolserver/containerId/)
    • Data is retrieved via HTTP GET using repository-specific APIs. Only Clowder and Dataverse are supported.
    • Docker containers are managed via system calls (docker executable)
  • WholeTale/ytHub/tmpnb:
    • Girder with yt extension to support launching tmpnb notebooks
    • tmpnb proxy and custom notebook server (volman). The yt.hub team has extended Jupyter tmpnb to support volume mounts and has created FUSE mounts for Girder.
  • PRAGMA PID service
    • Demonstrated at NDSC6; appears to allow attaching arbitrary metadata to a registered PID. 
  • Analysis
    • For the Dark Sky dataset, we can use the notebook demonstrated by the yt team.
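
To make the Tool Manager pieces above concrete, here is a minimal sketch of how a client might drive that REST API. The endpoint paths (/tools, /instances), payload fields, and base URL are assumptions for illustration, not the actual NDSC6 API.

    # Hypothetical client for the Tool Manager REST API described above.
    # Endpoint paths, payload fields, and the base URL are assumptions.
    import requests

    TOOL_MANAGER = "http://toolserver/api"

    # List the supported tools (e.g., jupyter, rstudio)
    tools = requests.get(f"{TOOL_MANAGER}/tools").json()
    print(tools)

    # Launch an instance of a tool against a dataset in a supported repository
    instance = requests.post(f"{TOOL_MANAGER}/instances", json={
        "tool": "jupyter",
        "dataset": "doi:10.5072/FK2/EXAMPLE",   # hypothetical identifier
        "key": "REPOSITORY_API_KEY",            # Clowder/Dataverse API key
    }).json()

    # The NGINX proxy then routes http://toolserver/<containerId>/ to the container
    print(instance["id"], instance.get("url"))

    # Tear the instance down when finished
    requests.delete(f"{TOOL_MANAGER}/instances/{instance['id']}")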

...

  • Copy MHD dataset to storage condo
  • Create Docker-enabled hosts with access to each dataset (e.g., NFS) at SDSC, possibly in the yt DXL project, and in the NDS Labs project for MHD
  • Decide whether to use/extend the existing Tool Manager, Girder/yt/tmpnb, or Jupyter tmpnb (or something else)
  • Define strategy for managing containers at each site
    • Simple:  "ssh docker run -v" or use the Docker API
    • Harder:  Use Kubernetes or Docker Swarm for container orchestration.  For example, launch a jupyter container on a node with label "sdsc"
  • Implement the resolution/registry
    • Ability to register a data URL with some associated metadata (see the example record after this list).
    • Metadata would include the site (SDSC, NCSA) and volume mount information for the dataset.
    • The PRAGMA PID service looks possible at first glance, but may be too complex for what we're trying to do.  It requires handle.net integration.
  • Implement bookmarklet or browser extension:  There was discussion of providing some bookmarklet javascript to link a data DOI/PID to the "tool manager" service
  • Authentication: 
    • TBD – how do we control who gets access, or is it open to the public?
    • In the case of Clowder/Dataverse, all API requests include an API key
  • Analysis:
    • Need to get notebooks/code to demonstrate how to work with the MHD and Norman data.
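
As a sketch of what a registry entry for the resolution/registry might contain (all field names here are illustrative assumptions, not a defined schema):

    {
      "identifier": "doi:10.5072/FK2/EXAMPLE",
      "site": "SDSC",
      "data_url": "http://example.org/path/to/dataset",
      "mount": {
        "type": "nfs",
        "source": "nfs.sdsc.example.org:/export/darksky",
        "container_path": "/data"
      },
      "agent_url": "http://agent.sdsc.example.org/api"
    }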

...

Example case for resolution (not a real dataset for SC16)

Component Options

We will need to select one from each of the following categories.


Looking at the above diagram, we see four categories of services:

  • Repository: remote repository (e.g., Globus Publish, Dataverse, etc)
  • Resolver: Given a dataset identifier, return metadata about it (e.g., location) and possibly launch the notebook. A "Data DNS"
  • Agent: Services installed at each site to enable launching notebooks (e.g., proxy, notebook launcher, docker, etc)
  • Data: Actual mounted data at each site

All combinations are possible, although some combinations will likely be easier to accomplish than others.  

Open question: We've discussed the idea of federated services versus a centralized "resolver" service. See the "Federation Options" section for details.

"Repository" - User Frontend

 Options:

    1. User installs a bookmarklet. This may be restricted in modern browsers; more research is necessary.
      • Pros
        • browser-agnostic
      • Cons
        • probably lots of learning involved here
        • user must seek out and install this
        • no notion of authentication
    2. user installs browser extension
      • Pros
        • more secure than bookmarklets... I guess?
      • Cons
        • probably lots of learning involved here
        • user must seek out and install this
        • no notion of authentication (see below)
        • browser-specific (we would need to develop and maintain one for each browser)
    3. developer(s) add a link to the repo UI that leads to the existing ToolManager UI landing page, as in the NDSC6 demo
      • Pros
        • user does not need to install anything special on their local machine to launch tools
        • most repos inherently have a notion of "user" whose username and/or email we can use to identify tools launched by this user
      • Cons
        • repo UI developers who want to integrate with us need to add a line to their source
          • Dataverse, Clowder, Globus Publish, etc

...

 

"Resolver" - API endpoint to resolve

...

identifiers (e.g., DOI, URN, URL) to notebook URLs

Options:

Open question: federation (see below) - is this centralized or decentralized?

    1. Extend the existing NCSA ToolManager to add a /lookup endpoint - this would very simply serve a JSON file from disk (see the sketch after this list)
      • Pros
        • Easy to set up and modify as we need to
      • Cons
        • Likely not a long-term solution, but simple enough to accomplish in the short-term
    2. Girder + yt: add the identifier to dataset metadata and use the mongo_search function to resolve it 
      • Pros
        • Well-documented, extensible API, with existing notions of file, resource, and user management
      • Cons
        • likely overkill for this system, as we don't need any of the file management capabilities for resolving
        • language barriers in modifying Girder - python + javascript (raw? nodejs?)
    3. Build REST API over etcd?
      • Pros
        • Familiar - this is how the ndslabs API server works, so we can possibly leverage Craig's etcd.go
      • Cons
        • it might be quite a bit of work to build up a new API around etcd
    4. PRAGMA PID service?
      • Pros
        • sounds remarkably similar to what we're trying to accomplish here
        • supports a wide variety of different handle types (and repos?)
      • Cons
        • may be too complex to accomplish in the short term
        • unfamiliar code base / languages
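
To make option 1 concrete, here is a minimal sketch of a /lookup endpoint that serves resolution records from a JSON file on disk. The route, query parameter, file name, and record fields are assumptions.

    # Minimal sketch of a /lookup endpoint backed by a JSON file on disk.
    # Route, query parameter, file name, and record fields are assumptions.
    import json
    from flask import Flask, abort, jsonify, request

    app = Flask(__name__)

    # e.g., {"doi:10.5072/FK2/EXAMPLE": {"site": "SDSC", "mount": "/export/darksky"}}
    with open("registry.json") as f:
        REGISTRY = json.load(f)

    @app.route("/lookup")
    def lookup():
        identifier = request.args.get("id")
        record = REGISTRY.get(identifier)
        if record is None:
            abort(404)
        return jsonify(record)

    if __name__ == "__main__":
        app.run(port=8080)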

 

 

...

        • Has specific notion of a PID that may be too restrictive.
        • Won't support multiple location resolution?

 

"Agent" - launches containers alongside the data on a Docker-enabled host

Options

    1. Use the existing ToolManager?
      • Pros
        • already parameterized to launch multiple tools (jupyter and rstudio)
      • Cons
        • no notion of "user" or authentication
    2. Girder/tmpnb?
      • Pros
        • notebooks automatically time out after a given period
        • inherited notion of "user"
      • Cons
        • can currently only launch a single image type (jupyter)
        • inherited notion of "user" may present an interesting auth problem - how do we share these accounts between sites?
        • Girderisms: need to pass around "folderIds" or resolve dataset identifiers to folders.
    3. Kubernetes / Docker Swarm?
      • Pros
        • familiar - this is how the ndslabs API server works, so we can possibly leverage Craig's kube.go
        • orchestration keeps containers alive if possible when anything goes wrong
      • Cons
        • may be too complex to accomplish in the short term
    4. docker -H? (remote Docker API; see the sketch after this list)
      • Pros
        • zero setup necessary, just need Docker installed and the port open
      • Cons
        • HIGHLY insecure - would require some form of authentication (see below)
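
For the docker -H / Docker API option, a rough sketch using the Docker SDK for Python; the host address, image, and mount paths are assumptions, and in practice the remote daemon would need TLS or an SSH tunnel rather than an open port.

    # Sketch of launching a notebook container on a remote Docker host via the
    # Docker API (the docker -H option). Host, image, and paths are assumptions.
    import docker

    client = docker.DockerClient(base_url="tcp://docker-host.sdsc.example.org:2375")

    container = client.containers.run(
        "jupyter/scipy-notebook",                        # assumed notebook image
        detach=True,
        volumes={"/export/darksky": {"bind": "/data", "mode": "ro"}},
        ports={"8888/tcp": None},                        # let Docker pick a host port
        labels={"launched-by": "toolmanager", "site": "sdsc"},
    )
    print(container.id)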

 

"Data" - large datasets need to be mountable on a Docker-enabled host

    1. NFS?
    2. GFS?
    3. S3?
    4. other options?

Federation options

  1. Centralized
    1. New sites register with the central API server as they come online (i.e., POST to /metadata)
      1. POSTed metadata should include all URLs, DOIs, and other necessary info (see the sketch after this list)
    2. The central API server (Resolver) receives all requests, resolves identifiers to sites that have registered, and delegates launches to that site's Agent
    3. All datasets are registered with this central service, which is responsible for resolving identifiers to locations and launching notebooks at those locations
    • Pros
      • synchronization / authentication (see below) may be slightly easier to solve (only one user)
    • Cons
      • single point of failure
  2. Federated
    1. Each site has its own local stack but registers with a federation server for id → location resolution
    • Pros
      • Users at each site can access data directly/launch notebooks without federation server
      • Can use existing instances, such as hub.yt
  3. Decentralized
    1. New sites register with each other
    2. Any API server receives a request and can resolve and delegate to the appropriate Agent
    • Pros
      • no single point of failure
    • Cons
      • synchronization (see below) is still an open question (is this an open broadcast? a handshake? do we keep a record of nearest neighbors?)
      • authentication (see below) and sharing credentials between sites becomes a more complex problem
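
As a sketch of the registration step under either model (the per-site "federate" piece mentioned below): on startup, each site POSTs its metadata to the resolver. The URL and field names are assumptions.

    # Sketch of a site registering itself with the resolver on startup.
    # The /metadata URL and all field names are assumptions.
    import requests

    SITE_METADATA = {
        "site": "SDSC",
        "agent_url": "http://agent.sdsc.example.org/api",
        "datasets": [
            {"identifier": "doi:10.5072/FK2/EXAMPLE", "mount": "/export/darksky"},
        ],
    }

    resp = requests.post("http://resolver.example.org/metadata", json=SITE_METADATA)
    resp.raise_for_status()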

 

Gliffy Diagram: sc16-central-feder8

Additional notes for this diagram:

  • Centralized 
    • Assuming that we use Girder as-is, the centralized model requires mounting each dataset filesystem via NFS/SSHFS for the initial metadata "ingest". This is only temporary and does not ingest the actual file data, but is awkward.
    • We would need to use or extend the Girder API to support the remote repository request – resolving the dataset identifier (DOI, URL, URN) to the Girder folderId to get the notebook.
    • Requires a user account on the Girder instance to launch notebooks at each site
    • Solves the Whole Tale problem of running remote docker containers.
  • Federated:
    • New "Data DNS" component to handle registration and resolution of IDs to sites
    • New "Federate" component at each site is needed to post data to the federation/Data DNS service
    • Assumes local user accounts at each site – which means users can access the datasets without the federation server, but also means that there are unique user accounts at each site. Using tmpnb, we can't have a single guest user, since there's one notebook per user?
    • In this model, we could use the hub.yt infrastructure as-is, with the addition of the "federat8" component.  No need to copy or mount the DarkSky dataset.
    • Doesn't solve the Whole Tale problem of running remote docker containers

Synchronization options

  1. Sites push their status to the resolver API
    • Assumption: failures are retried after a reasonable period
    • Pros
      • Updates happen in real-time (no delay except network latency)
    • Cons
      • Congestion if many sites come online at precisely the same second
      • More work for whatever we choose as the scheduler / orchestration system - a site missing a scheduled push means we may need to pull it out of rotation
  2. Resolver API service polls each site's status (see the sketch after this list)
    • Assumption: failures are silent, and retried on the next poll interval
    • Pros
      • We will know explicitly when sites are no longer available for launching tools
    • Cons
      • Time delay between polls means we could have stale data
      • Threading nightmare - this is either one short-lived thread per site, or one giant thread looping through all sites
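
A minimal sketch of the polling option (the "one giant thread looping through all sites" variant); the /status endpoint and the poll interval are assumptions.

    # Sketch of option 2: the resolver polls each registered site's status.
    # The /status endpoint and poll interval are assumptions.
    import time
    import requests

    SITES = {
        "SDSC": "http://agent.sdsc.example.org/api",
        "NCSA": "http://agent.ncsa.example.org/api",
    }
    POLL_INTERVAL = 60  # seconds; data can be stale for at most one interval

    available = {}
    while True:
        for site, url in SITES.items():
            try:
                requests.get(f"{url}/status", timeout=5).raise_for_status()
                available[site] = True
            except requests.RequestException:
                available[site] = False   # silent failure; retried on the next poll
        time.sleep(POLL_INTERVAL)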

...

  1. Build some kind of quasi-auth scheme (similar to ndslabs) on top of the existing ToolManager (see the sketch after this list)
  2. Inherit Girder's auth scheme and solve the problem of sharing these "users" between sites
  3. Create a "guest" user at each site and use that to launch tools from remote sources
    • NOTE: tmpnb only allows one notebook per user (per folder?), so anyone launching remotely would be sharing a notebook
    • this is undesirable, as ideally each request would launch a separate instance
    • lingering question: how do we get you back to the notebook if you lose the link? how do we know which notebook is yours?
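
To illustrate option 1, a minimal sketch of a quasi-auth layer over the existing ToolManager: a shared token checked on each request. The header name, token handling, and Flask wiring are assumptions, not the current ToolManager code.

    # Sketch of a quasi-auth scheme layered on the ToolManager (option 1).
    # Header name and token handling are assumptions for illustration.
    from functools import wraps
    from flask import Flask, abort, jsonify, request

    app = Flask(__name__)
    VALID_TOKENS = {"demo-token-issued-out-of-band"}

    def require_token(view):
        @wraps(view)
        def wrapper(*args, **kwargs):
            if request.headers.get("X-Auth-Token") not in VALID_TOKENS:
                abort(401)
            return view(*args, **kwargs)
        return wrapper

    @app.route("/instances", methods=["POST"])
    @require_token
    def launch_instance():
        # ... existing launch logic would go here ...
        return jsonify({"status": "launched"})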

Inclinations: SC16 Demo

  • Transfer (if necessary) each dataset to existing cloud architecture - in progress?
  • Discover mount points for each large dataset within existing cloud architecture - in progress?
  • Spin up a Docker-enabled host and mount nearby datasets (NFS, direct mount, etc.) - in progress?
  • Federated model
    • Using docker-compose, bring up the provided girder-dev environment on each Docker host - pending
    • Develop the "resolver" REST API to
      • Receive site metadata - in progress
      • Delegate tmpnb requests to remote Girder instances using the existing /notebook API endpoint (see the sketch after this list)
    • Add authentication (wrap the existing ToolManager in a simple auth mechanism):
      • We simply need to collect an e-mail address (identity) to run things on the user's behalf
      • Could we import existing users from Girder using their API? Probably not, due to security
      • We could call back to Girder when sites push their metadata (assuming this can be done as Girder comes online)
    • Extend the existing ToolManager UI to list the collections in the connected Girder instance
      • Add a "Launch Notebook" button next to each dataset where no notebook is running
      • Add a "Stop Notebook" button next to each dataset where a notebook has been launched
    • Modify girder-dev to POST site metadata on startup (feder8) - in progress
    • Run a centralized resolver (ToolManager) instance on Nebula for the purposes of the demo
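
As a sketch of the delegation step above: the central resolver/ToolManager would forward a launch request to the remote Girder instance's /notebook endpoint. The exact path, parameters, and token handling are assumptions to be checked against the ythub API.

    # Sketch of delegating a notebook launch to a remote Girder/ythub instance.
    # The /notebook path, parameters, and token handling are assumptions.
    import requests

    GIRDER_API = "https://hub.yt/api/v1"            # remote Girder instance
    GIRDER_TOKEN = "TOKEN-FOR-DEMO-OR-GUEST-USER"   # obtained out of band
    FOLDER_ID = "FOLDER-ID-FROM-RESOLVER"           # resolved from the dataset identifier

    resp = requests.post(
        f"{GIRDER_API}/notebook",
        params={"folderId": FOLDER_ID},
        headers={"Girder-Token": GIRDER_TOKEN},
    )
    resp.raise_for_status()
    print(resp.json().get("url"))                   # proxied tmpnb notebook URL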

    ...