...

A dataset has a Globus Publish landing page: https://publish.globus.org/jspui/handle/ITEM/113
This dataset has the Globus transfer URL:
- https://www.globus.org/app/transfer?origin_id=82f1b5c6-6e9b-11e5-ba47-22000b92c6ec&origin_path=/unpublished/publication_113/
This would map to the following path on Nebula:
- /scratch/mdf/publication_113
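
The mapping above could be sketched in Python roughly as follows. This is only an illustration: it assumes every dataset sits directly under /scratch/mdf on Nebula and that the last component of `origin_path` is the directory name. The function name and the `NEBULA_ROOT` constant are made up for this example.

```python
import posixpath
from urllib.parse import parse_qs, urlparse

# Assumption: all published datasets live directly under this directory on Nebula.
NEBULA_ROOT = "/scratch/mdf"


def globus_url_to_nebula_path(url, root=NEBULA_ROOT):
    """Map a Globus transfer URL to the matching directory on Nebula."""
    query = parse_qs(urlparse(url).query)
    origin_path = query["origin_path"][0]  # e.g. /unpublished/publication_113/
    # Keep only the last path component (e.g. publication_113).
    name = posixpath.basename(origin_path.rstrip("/"))
    if not name:
        raise ValueError(f"no dataset directory in origin_path: {origin_path!r}")
    return posixpath.join(root, name)
```

Called with the transfer URL above, this returns /scratch/mdf/publication_113.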

Component Options

We will need to select one from each of the following categories.

All combinations are possible, although some combinations will likely be easier to accomplish than others.

"Repository" - User Frontend
1. ~~user installs bookmarklet~~ (bookmarklets may be restricted in modern browsers; more research is necessary)
  - Pros
    - browser-agnostic
  - Cons
    - probably lots of learning involved here
    - user must seek out and install this
    - injecting arbitrary JavaScript into pages does not feel very secure, and bookmarklets have since been largely replaced by modern browser extensions
      - see https://hypothes.is/blog/farewell-to-bookmarklets/
      - see https://medium.com/making-instapaper/bookmarklets-are-dead-d470d4bbb626#.co504ji62
2. user installs browser extension
  - Pros
    - more secure than bookmarklets... I guess?
  - Cons
    - probably lots of learning involved here
    - user must seek out and install this
    - browser-specific (we would need to develop and maintain one for each browser)
3. developer(s) add a link to repo UI which leads to the existing ToolManager UI landing page, as in the NDSC6 demo
  - Pros
    - user does not need to install anything special on their local machine to launch tools
  - Cons
    - repo UI developers who want to integrate with us need to add one line to their source
      - Dataverse, Clowder, Globus Publish, etc
"Resolver" - API endpoint to resolve DOIs to tmpnb proxy URLs
1. Serve a JSON file from disk? (this is more or less how the existing ToolManager works)
  - Pros
    - Easy to set up and modify as we need to
  - Cons
    - Likely not a long-term solution, but simple enough to accomplish in the short term
2. Girder?
  - Pros
    - Well-documented, extensible API, with existing notions of file, resource, and user management
  - Cons
    - likely overkill for this system, as we don't need any of the file management capabilities for resolving
3. etcd?
  - Pros
    - familiar - this is how the ndslabs API server works, so we can possibly leverage Craig's etcd.go
  - Cons
    - it might be quite a bit of work to build up a new API around etcd
4. PRAGMA PID service?
  - Pros
    - sounds remarkably similar to what we're trying to accomplish here
    - supports a wide variety of different handle types (and repos?)
  - Cons
    - may be too complex to accomplish in the short term
    - unfamiliar code base / languages
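
As a rough illustration of option 1 (serving a JSON file from disk), a resolver could look something like the sketch below. It uses only the Python standard library. The file layout (a flat object mapping identifiers to proxy URLs), the `/resolve?id=` route, and all function names are assumptions made for this example, not the existing ToolManager behavior.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse


def load_mapping(path):
    """Read the identifier -> proxy URL table from a JSON file on disk."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def resolve(mapping, identifier):
    """Return the proxy URL for an identifier, or None if it is unknown."""
    return mapping.get(identifier)


def make_handler(mapping):
    """Build a handler that answers GET /resolve?id=<identifier> (hypothetical route)."""

    class ResolverHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            parsed = urlparse(self.path)
            identifier = parse_qs(parsed.query).get("id", [None])[0]
            url = resolve(mapping, identifier) if parsed.path == "/resolve" else None
            body = json.dumps({"id": identifier, "url": url}).encode("utf-8")
            self.send_response(200 if url else 404)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return ResolverHandler


def serve(mapping_path, port=8080):
    """Serve the mapping on localhost until interrupted."""
    server = HTTPServer(("127.0.0.1", port), make_handler(load_mapping(mapping_path)))
    server.serve_forever()
```

Editing the JSON file and restarting the process is all it takes to change the mapping, which matches the "easy to set up and modify" point above; it also shows why this is unlikely to hold up as a long-term solution.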
"Agent" - launches containers alongside the data on a Docker-enabled host
1. existing ToolManager?
  - Pros
    - already parameterized to launch multiple tools (jupyter and rstudio)
  - Cons
    - no notion of "user" or authentication
2. Girder/tmpnb?
  - Pros
    - notebooks automatically time out after a given period
  - Cons
    - can currently launch only a single image type (jupyter)
3. Kubernetes / Docker Swarm?
  - Pros
    - familiar - this is how the ndslabs API server works, so we can possibly leverage Craig's kube.go
    - orchestration keeps containers alive if possible when anything goes wrong
  - Cons
    - may be too complex to accomplish in the short term
4. docker -H?
  - Pros
    - zero setup necessary, just need Docker installed and the port open
  - Cons
    - HIGHLY insecure (anyone who can reach an unauthenticated Docker port can control the host's containers)
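
To make option 4 concrete, the sketch below builds (but does not run) a `docker -H` command line that starts a tool container with the dataset mounted read-only. The host address, the image name, and the `/data` mount point are placeholders chosen for this example; only the `-H`, `run`, `-d`, `-P`, and `-v` flags are standard Docker CLI options.

```python
def docker_run_command(docker_host, image, data_path, mount_point="/data"):
    """Build the argv for launching a tool container on a remote Docker host."""
    return [
        "docker", "-H", docker_host,            # talk to the remote Docker daemon
        "run", "-d",                            # start the container in the background
        "-P",                                   # publish the image's exposed ports
        "-v", f"{data_path}:{mount_point}:ro",  # mount the dataset read-only
        image,
    ]


# Hypothetical host and image, for illustration only.
cmd = docker_run_command(
    "tcp://nebula.example.org:2375",
    "jupyter/minimal-notebook",
    "/scratch/mdf/publication_113",
)
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Nothing in this command authenticates the caller, which is the security concern noted in the Cons.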
"Data" - large datasets need to be mountable on a Docker-enabled host
1. NFS?
2. GFS?
3. other options?

Federation options

Centralized
1. New sites register with central API server as they come online (i.e. POST to /metadata)
  1. POSTed metadata should include all URLs, DOIs, and other necessary info
2. Central API server (Resolver) receives all requests, resolves DOIs to sites that have registered, and delegates jobs to the Agent
Decentralized
1. New sites register with each other (is this a broadcast? handshake? how to handle synchronization?)
2. Any API server receives request and can resolve and delegate to the appropriate Agent
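
The centralized option can be sketched as a small in-memory registry: sites register their metadata, and the resolver looks up which site (and therefore which Agent) holds a given DOI. The metadata fields used here (`name`, `agent_url`, `datasets`) are assumptions for illustration; the actual payload of a POST to /metadata has not been decided.

```python
class CentralRegistry:
    """In-memory stand-in for the central API server's site registry."""

    def __init__(self):
        self._sites = {}      # site name -> registered metadata
        self._doi_index = {}  # DOI -> site name

    def register(self, metadata):
        """Handle a site's registration (the body of a POST to /metadata)."""
        name = metadata["name"]
        # Re-registering replaces whatever the site announced earlier.
        self._doi_index = {d: s for d, s in self._doi_index.items() if s != name}
        self._sites[name] = metadata
        for doi in metadata["datasets"]:
            self._doi_index[doi] = name

    def resolve(self, doi):
        """Return the site, Agent URL, and data path for a DOI, or None."""
        name = self._doi_index.get(doi)
        if name is None:
            return None
        site = self._sites[name]
        return {
            "site": name,
            "agent_url": site["agent_url"],
            "path": site["datasets"][doi],
        }
```

In the decentralized option, every site would hold its own copy of such a registry, and the open questions above (broadcast, handshake, synchronization) are about keeping those copies consistent.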

Synchronization options

Sites push their status to the API
- Assumption: failures are retried after a reasonable period
- Pros
  - Updates happen in real-time (no delay except network latency)
- Cons
  - Congestion if many sites come online at precisely the same second
API polls for each site's status
- Assumption: failures are silent, and retried on the next poll interval
- Pros
  - ???
- Cons
  - Time delay between polls means we could be desynchronized
  - Not scalable - this is either one thread per site, or one giant thread looping through all sites
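
The two approaches can be compared with a minimal sketch. `push_status` retries a failed push after a delay, and `poll_once` is the "one giant thread looping through all sites" variant in which failures are silent. The `send` and `fetch` callables stand in for the real network calls, and the exponential delay with random jitter is an assumption added here as one way to spread out retries when many sites come online at once.

```python
import random
import time


def push_status(send, status, retries=3, base_delay=1.0,
                sleep=time.sleep, rng=random.random):
    """Push a site's status to the API, retrying failures after a delay."""
    for attempt in range(retries + 1):
        try:
            send(status)
            return True
        except OSError:
            if attempt == retries:
                return False
            # Assumption: exponential backoff with jitter (factor in [0.5, 1.5)).
            sleep(base_delay * (2 ** attempt) * (0.5 + rng()))
    return False


def poll_once(sites, fetch, previous=None):
    """Poll every site in turn; failed sites keep their last known status."""
    statuses = dict(previous or {})
    for site in sites:
        try:
            statuses[site] = fetch(site)
        except OSError:
            pass  # silent failure, retried on the next poll interval
    return statuses
```

With polling, a site that fails keeps its stale status until the next interval, which is the desynchronization risk listed in the Cons.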

Storyboard for Demo Presentation

...
