...

 

Component Options

 

We will need to select one option from each of the following categories.

 

All combinations are possible, although some combinations will likely be easier to accomplish than others.

 

 

 

  1. "Repository" - User Frontend
    1. user installs bookmarklet (this may be restricted in modern browsers; more research is necessary)
    2. user installs browser extension
      • Pros
        • possibly more secure than bookmarklets (unverified)
      • Cons
        • probably lots of learning involved here
        • user must seek out and install this
        • browser-specific (we would need to develop and maintain one for each browser)
    3. developer(s) add a link to repo UI which leads to the existing ToolManager UI landing page, as in the NDSC6 demo
      • Pros
        • user does not need to install anything special on their local machine to launch tools
      • Cons
        • repo UI developers who want to integrate with us need to add one line to their source
          • Dataverse, Clowder, Globus Publish, etc
  2. "Resolver" - API endpoint to resolve DOIs to tmpnb proxy URLs
    1. Serve a JSON file from disk? (this is more or less how the existing ToolManager works)
      • Pros
        • Easy to set up and modify as we need to
      • Cons
        • Likely not a long-term solution, but simple enough to accomplish in the short term
    2. Girder?
      • Pros
        • Well-documented, extensible API, with existing notions of file, resource, and user management
      • Cons
        • likely overkill for this system, as we don't need any of the file management capabilities for resolving
    3. etcd?
      • Pros
        • familiar - this is how the ndslabs API server works, so we can possibly leverage Craig's etcd.go
      • Cons
        • it might be quite a bit of work to build up a new API around etcd
    4. PRAGMA PID service?
      • Pros
        • sounds remarkably similar to what we're trying to accomplish here
        • supports a wide variety of different handle types (and repos?)
      • Cons
        • may be too complex to accomplish in the short term
        • unfamiliar code base / languages
  3. "Agent" - launches containers alongside the data on a Docker-enabled host
    1. existing ToolManager?
      • Pros
        • already parameterized to launch multiple tools (jupyter and rstudio)
      • Cons
        • no notion of "user" or authentication
    2. Girder/tmpnb?
      • Pros
        • notebooks automatically time out after a given period
      • Cons
        • can currently launch only a single image type (jupyter)
    3. Kubernetes / Docker Swarm?
      • Pros
        • familiar - this is how the ndslabs API server works, so we can possibly leverage Craig's kube.go
        • orchestration keeps containers alive if possible when anything goes wrong
      • Cons
        • may be too complex to accomplish in the short term
    4. docker -H?
      • Pros
        • zero setup necessary, just need Docker installed and the port open
      • Cons
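        • HIGHLY insecure: an open, unauthenticated Docker port gives anyone who can reach it root-equivalent control of the host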
  4. "Data" - large datasets need to be mountable on a Docker-enabled host
    1. NFS?
    2. GFS?
    3. other options?
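
To make the simplest Resolver option (serving a JSON file from disk) more concrete, here is a minimal sketch. It assumes a hypothetical file layout in which each key is a DOI and each value is an object describing where the data lives. The field names (`site`, `proxy_url`), the function names, and the placeholder DOI used in the usage note are illustrative assumptions, not the existing ToolManager format.

```python
import json


def normalize_doi(doi):
    """Strip an optional 'doi:' prefix and lowercase (DOI names are case-insensitive)."""
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:]
    return doi.lower()


def load_registry(path):
    """Read the DOI mapping from a JSON file on disk, normalizing its keys."""
    with open(path, encoding="utf-8") as handle:
        raw = json.load(handle)
    return {normalize_doi(doi): entry for doi, entry in raw.items()}


def resolve(doi, registry):
    """Return the entry registered for a DOI, or None if the DOI is unknown."""
    return registry.get(normalize_doi(doi))
```

With a file containing `{"10.5072/EXAMPLE-1": {"site": "site-a", "proxy_url": "http://site-a.example/tmpnb"}}`, calling `resolve("doi:10.5072/example-1", load_registry(path))` would return that entry. Swapping the file for Girder or etcd later would only need to change `load_registry`.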

 

Federation Options

 

  1. Centralized
    1. New sites register with central API server as they come online (i.e. POST to /metadata)
      1. POSTed metadata should include all URLs, DOIs, and other necessary info
    2. Central API server (Resolver) receives all requests, resolves DOIs to sites that have registered, and delegates jobs to the Agent
  2. Decentralized
    1. New sites register with each other (is this a broadcast? handshake? how to handle synchronization?)
    2. Any API server receives request and can resolve and delegate to the appropriate Agent
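
The centralized flow can be sketched in memory, leaving HTTP aside. The class below stands in for the central API server: `register` handles the body a site would POST to /metadata, and `delegate` picks the Agent for a requested DOI. The class name and the metadata fields (`site`, `agent_url`, `dois`) are assumptions made for illustration; the actual payload has not been defined yet.

```python
class CentralResolver:
    """In-memory stand-in for the central API server (hypothetical design)."""

    REQUIRED_FIELDS = {"site", "agent_url", "dois"}

    def __init__(self):
        self._agent_by_doi = {}

    def register(self, metadata):
        """Record which Agent serves each DOI listed by a newly registered site."""
        missing = self.REQUIRED_FIELDS - set(metadata)
        if missing:
            raise ValueError(f"missing fields: {sorted(missing)}")
        for doi in metadata["dois"]:
            # A later registration for the same DOI replaces the earlier one.
            self._agent_by_doi[doi.lower()] = metadata["agent_url"]

    def delegate(self, doi):
        """Return the URL of the Agent that should launch a tool for this DOI."""
        agent_url = self._agent_by_doi.get(doi.lower())
        if agent_url is None:
            raise KeyError(f"no registered site holds {doi}")
        return agent_url
```

In the decentralized option, every site would hold its own copy of this mapping, which is where the open synchronization questions come from.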

 

Synchronization Options

 

  1. Sites push their status to the API
    • Assumption: failures are retried after a reasonable period
    • Pros
      • Updates happen in real-time (no delay except network latency)
    • Cons
      • Congestion if many sites come online at precisely the same second
  2. API polls for each site's status
    • Assumption: failures are silent, and retried on the next poll interval
    • Pros
      • ???
    • Cons
      • Time delay between polls means we could be desynchronized
      • Not scalable - this is either one thread per site, or one giant thread looping through all sites
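
For the push option, the retry assumption could look roughly like the sketch below. It adds randomized exponential backoff, which is not part of the notes above but is a common way to ease the congestion concern when many sites come online together. `send` is a hypothetical callable that delivers the status to the API and raises `OSError` (for example `ConnectionError`) on failure; all names and default values are illustrative.

```python
import random
import time


def push_with_retry(send, status, max_attempts=5, base_delay=1.0,
                    sleep=time.sleep, jitter=random.random):
    """Call send(status) until it succeeds or the attempts run out.

    Returns the number of attempts used; re-raises the last error
    if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            send(status)
            return attempt
        except OSError:
            if attempt == max_attempts:
                raise
            # Double the wait after each failure and scale it by a random
            # factor in [0.5, 1.5) so sites do not all retry in lockstep.
            delay = base_delay * 2 ** (attempt - 1) * (0.5 + jitter())
            sleep(delay)
```

The `sleep` and `jitter` parameters exist only so the waiting can be replaced in tests; a site would normally call `push_with_retry(send, status)` with the defaults.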


Storyboard for Demo Presentation

...