...
- A dataset has a Globus Publish landing page https://publish.globus.org/jspui/handle/ITEM/113
- This dataset has the URL
- This would map to Nebula:
- /scratch/mdf/publication_113
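The mapping in the example above could be sketched as follows. This is only an illustration: the function name `landing_page_to_path` and the constant `NEBULA_PREFIX` are hypothetical, and the sketch assumes that every landing page ends in `ITEM/<number>` and that every dataset sits under the same prefix as the one example given.

```python
from urllib.parse import urlparse

# Assumption: all datasets live under this prefix on Nebula, as in the one example above.
NEBULA_PREFIX = "/scratch/mdf/publication_"


def landing_page_to_path(landing_url: str) -> str:
    """Map a Globus Publish landing page URL to a Nebula path (hypothetical helper)."""
    # Split the URL path into non-empty segments, e.g. ["jspui", "handle", "ITEM", "113"].
    segments = [s for s in urlparse(landing_url).path.split("/") if s]
    # Assumption: the handle always ends in "ITEM/<numeric id>".
    if len(segments) < 2 or segments[-2] != "ITEM" or not segments[-1].isdigit():
        raise ValueError(f"not a recognized landing page URL: {landing_url}")
    return NEBULA_PREFIX + segments[-1]
```

Rejecting anything that does not match the expected pattern keeps a malformed URL from being turned into an arbitrary filesystem path.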
Component Options
We will need to select one from each of the following categories.
All combinations are possible, although some combinations will likely be easier to accomplish than others.
- "Repository" - User Frontend
- user installs bookmarklet
- this may be restricted in modern browsers... more research is necessary
- Pros
- browser-agnostic
- Cons
- probably lots of learning involved here
- user must seek out and install this
- injecting arbitrary JavaScript into pages does not feel very secure, and bookmarklets have largely been superseded by modern browser extensions
- user installs browser extension
- Pros
- more secure than bookmarklets... I guess?
- Cons
- probably lots of learning involved here
- user must seek out and install this
- browser-specific (we would need to develop and maintain one for each browser)
- developer(s) add a link to repo UI which leads to the existing ToolManager UI landing page, as in the NDSC6 demo
- Pros
- user does not need to install anything special on their local machine to launch tools
- Cons
- repo UI developers who want to integrate with us need to add one line to their source
- e.g. Dataverse, Clowder, Globus Publish, etc.
- "Resolver" - API endpoint to resolve DOIs to tmpnb proxy URLs
- Serve a JSON file from disk? (this is more or less how the existing ToolManager works)
- Pros
- Easy to set up and modify as we need to
- Cons
- Likely not a long-term solution, but simple enough to accomplish in the short term
- Girder?
- Pros
- Well-documented, extensible API, with existing notions of file, resource, and user management
- Cons
- likely overkill for this system, as we don't need any of the file management capabilities for resolving
- etcd?
- Pros
- familiar - this is how the ndslabs API server works, so we can possibly leverage Craig's etcd.go
- Cons
- it might be quite a bit of work to build up a new API around etcd
- PRAGMA PID service?
- Pros
- sounds remarkably similar to what we're trying to accomplish here
- supports a wide variety of different handle types (and repos?)
- Cons
- may be too complex to accomplish in the short term
- unfamiliar code base / languages
- "Agent" - launches containers alongside the data on a Docker-enabled host
- existing ToolManager?
- Pros
- already parameterized to launch multiple tools (jupyter and rstudio)
- Cons
- no notion of "user" or authentication
- Girder/tmpnb?
- Pros
- notebooks automatically time out after a given period
- Cons
- can currently launch only a single image type (jupyter)
- Kubernetes / Docker Swarm?
- Pros
- familiar - this is how the ndslabs API server works, so we can possibly leverage Craig's kube.go
- orchestration keeps containers alive if possible when anything goes wrong
- Cons
- may be too complex to accomplish in the short term
- docker -H?
- Pros
- zero setup necessary, just need Docker installed and the port open
- Cons
- HIGHLY insecure: anyone who can reach an unprotected Docker port gets root-equivalent control of the host
- "Data" - large datasets need to be mountable on a Docker-enabled host
- NFS?
- GFS?
- other options?
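To make the hand-off between the categories above more concrete, here is a minimal sketch of the simplest combination: a Resolver that reads a JSON file from disk, and an Agent that builds a `docker run` command mounting the dataset read-only. Everything named here is an assumption for illustration: the JSON layout (dataset ID mapped to a record with a `path` field), the function names, the `/data` mount point, and the image names in `TOOL_IMAGES`. It does not describe how the existing ToolManager is implemented.

```python
import json

# Assumption: tool names map to these public images; the real mapping may differ.
TOOL_IMAGES = {"jupyter": "jupyter/base-notebook", "rstudio": "rocker/rstudio"}


def load_registry(path):
    """Resolver: load the dataset ID -> record mapping from a JSON file on disk."""
    with open(path) as f:
        return json.load(f)


def resolve(registry, dataset_id):
    """Resolver: return the record for a dataset ID, or None if it is unknown."""
    return registry.get(dataset_id)


def docker_run_args(record, tool):
    """Agent: build (but do not execute) a docker command for the requested tool."""
    image = TOOL_IMAGES[tool]  # raises KeyError for an unsupported tool
    return [
        "docker", "run",
        "-d",                                # run detached
        "-P",                                # publish exposed ports on random host ports
        "-v", f"{record['path']}:/data:ro",  # mount the dataset read-only at /data
        image,
    ]
```

A registry file in this hypothetical layout might contain `{"ITEM/113": {"site": "nebula", "path": "/scratch/mdf/publication_113"}}`. Returning the argument list rather than running it keeps the sketch testable on a machine without Docker.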
Federation options
- Centralized
- New sites register with central API server as they come online (i.e. POST to /metadata)
- POSTed metadata should include all urls, DOIs, and other necessary info
- Central API server (Resolver) receives all requests, resolves DOIs to sites that have registered, and delegates jobs to the Agent
- Decentralized
- New sites register with each other (is this a broadcast? handshake? how to handle synchronization?)
- Any API server receives request and can resolve and delegate to the appropriate Agent
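The centralized option could look roughly like the in-memory sketch below. The class name `CentralResolver` and the metadata fields (`site`, `agent_url`, `datasets`) are hypothetical; the notes above only say that the POSTed metadata should include all URLs, DOIs, and other necessary info. The HTTP layer is left out on purpose, so `register` stands in for the handler of a POST to /metadata.

```python
class CentralResolver:
    """In-memory stand-in for the central API server (hypothetical sketch)."""

    def __init__(self):
        # dataset ID -> {site name: agent URL}, in registration order
        self._sites = {}

    def register(self, metadata):
        """Record a site's metadata, as received in the body of a POST to /metadata."""
        for dataset_id in metadata["datasets"]:
            # Keying by site name makes re-registration idempotent.
            self._sites.setdefault(dataset_id, {})[metadata["site"]] = metadata["agent_url"]

    def resolve(self, dataset_id):
        """Return the agent URL of the first site registered for this dataset."""
        sites = self._sites.get(dataset_id)
        if not sites:
            raise KeyError(f"no registered site holds {dataset_id}")
        # Assumption: first registered site wins; a real policy might prefer locality or load.
        return next(iter(sites.values()))
```

In the decentralized option every site would hold a copy of this table, which is where the open questions about broadcast, handshake, and synchronization come in.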
Synchronization options
- Sites push their status to the API
- Assumption: failures are retried after a reasonable period
- Pros
- Updates happen in real-time (no delay except network latency)
- Cons
- Congestion if many sites come online at precisely the same second
- API polls for each site's status
- Assumption: failures are silent, and retried on the next poll interval
- Pros
- ???
- Cons
- Time delay between polls means we could be desynchronized
- Not scalable - this is either one thread per site, or one giant thread looping through all sites
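The two options above differ mainly in who retries and when. The sketch below shows both under stated assumptions: `send` and `fetch` are hypothetical callables standing in for the actual network calls, network failures are assumed to surface as `OSError`, and the retry count and delay are placeholders rather than recommended values.

```python
import time


def push_status(send, status, retries=3, delay=1.0, sleep=time.sleep):
    """Push model: a site sends its status and retries after a delay on failure.

    Returns True once a send succeeds, False if every attempt fails.
    """
    for attempt in range(retries + 1):
        try:
            send(status)
            return True
        except OSError:
            # Wait before the next attempt, but not after the final one.
            if attempt < retries:
                sleep(delay)
    return False


def poll_all(sites, fetch):
    """Poll model: one pass of a single loop over all sites.

    Failures are silent; a failed site is simply tried again on the next pass.
    """
    statuses = {}
    for site in sites:
        try:
            statuses[site] = fetch(site)
        except OSError:
            continue
    return statuses
```

With a fixed delay, sites that fail at the same moment will also retry at the same moment, so the congestion concern listed under the push model applies to retries as well.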
Storyboard for Demo Presentation
...