...
- Centralized
- All datasets are registered with a central service that is responsible for resolving identifiers to locations and launching notebooks at those locations.
- Pros
- synchronization / authentication (see below) may be slightly easier to solve (users need only one account, on the central service)
- Can mix data from two sites: having information about data in one place allows users to compose freely.
- Cons
- single point of failure
- Local access to datasets at a site requires going through an external service
- Federated
- Each site has its own local stack but registers with a federation server for id → location resolution (see the resolution sketch below the diagram)
- Pros
- Users at each site can access data directly/launch notebooks without federation server
- Can use existing instances/services, such as hub.yt
- Cons
- synchronization (see below) is still an open question (is this an open broadcast? handshake? do we keep a record of nearest neighbors?)
- authentication (see below) and sharing credentials between sites becomes a more complex problem
- Can't mix data from two sites
[Gliffy Diagram]
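Regardless of which model we choose, both rely on the same basic contract: resolve a dataset identifier (DOI, URL, URN) to the site and site-local handle where a notebook can be launched. Below is a minimal sketch of that contract in Go; the route name (/resolve), the DatasetLocation fields, and the in-memory registry are all illustrative assumptions, not an existing API.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// DatasetLocation is what the resolver returns for a registered identifier.
type DatasetLocation struct {
	Identifier string `json:"identifier"` // DOI, URL, or URN
	SiteURL    string `json:"siteUrl"`    // base URL of the site hosting the data
	FolderID   string `json:"folderId"`   // site-local handle (e.g. a Girder folderId)
}

// registry stands in for whatever backing store the resolver would use.
var registry = map[string]DatasetLocation{
	"doi:10.1234/example": {
		Identifier: "doi:10.1234/example",
		SiteURL:    "https://site-a.example.org",
		FolderID:   "abc123",
	},
}

// resolve handles GET /resolve?id=<identifier> and returns the location record.
func resolve(w http.ResponseWriter, r *http.Request) {
	id := r.URL.Query().Get("id")
	loc, ok := registry[id]
	if !ok {
		http.Error(w, "unknown identifier", http.StatusNotFound)
		return
	}
	json.NewEncoder(w).Encode(loc)
}

func main() {
	http.HandleFunc("/resolve", resolve)
	fmt.Println("resolver listening on :8080")
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

In the centralized model this service owns the registry outright; in the federated model the registry is populated by per-site registration (see the Federate sketch further down).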
...
- Centralized
- Assuming that we use Girder as-is, the centralized model requires mounting each dataset filesystem via NFS/SSHFS for the initial metadata "ingest". This is only temporary and does not ingest the actual file data, but is awkward.
- We would need to use or extend the Girder API to support the remote repository request – resolving the dataset identifier (DOI, URL, URN) to the Girder folderId used to get the notebook (a client-side sketch follows this list)
- Requires a user account on the Girder instance to launch notebooks at each site
- Solves the Whole Tale problem of running remote docker containers.
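A rough sketch of the client side of that resolution step, assuming the extension above exposes something like a /api/v1/repository/lookup route (hypothetical; not an existing Girder endpoint). The Girder-Token header is how Girder authenticates REST calls, which is why the per-user account requirement above applies.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"net/url"
)

type lookupResponse struct {
	FolderID string `json:"folderId"`
}

// lookupFolder asks the (hypothetical) Girder extension to resolve an
// identifier to the folderId a notebook should be launched against.
func lookupFolder(girderBase, identifier, token string) (string, error) {
	u := fmt.Sprintf("%s/api/v1/repository/lookup?identifier=%s",
		girderBase, url.QueryEscape(identifier))
	req, err := http.NewRequest("GET", u, nil)
	if err != nil {
		return "", err
	}
	// Girder authenticates REST calls via the Girder-Token header, which is
	// why this model requires a user account on the central instance.
	req.Header.Set("Girder-Token", token)
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("lookup failed: %s", resp.Status)
	}
	var out lookupResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return "", err
	}
	return out.FolderID, nil
}

func main() {
	folderID, err := lookupFolder("https://girder.example.org", "doi:10.1234/example", "GIRDER_TOKEN")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("launch notebook against folder", folderID)
}
```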
- Federated:
- New "Data DNS" component to handle registration and resolution of IDs to sites
- New "Federate" component at each site is needed to post data to the federation/Data DNS service
- Assumes local user accounts at each site – which means users can access the datasets without the federation server, but also means that there are unique user accounts at each site. Using tmpnb, we can't have a single guest user, since there's one notebook per user?
- In this model, we could use the hub.yt infrastructure as-is, with the addition of the "Federate" component. No need to copy or mount the DarkSky dataset.
- Doesn't solve the Whole Tale problem of running remote docker containers
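For illustration, the Federate agent could be as small as a periodic POST of the site's dataset identifiers to the Data DNS. The /register route, payload shape, and example identifiers are assumptions; how failures are retried ties into the synchronization options below.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// registration is the payload a site posts to the Data DNS service.
type registration struct {
	SiteURL  string   `json:"siteUrl"`
	Datasets []string `json:"datasets"` // identifiers (DOI/URL/URN) served locally
}

func register(dataDNS string, reg registration) error {
	body, err := json.Marshal(reg)
	if err != nil {
		return err
	}
	resp, err := http.Post(dataDNS+"/register", "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("registration failed: %s", resp.Status)
	}
	return nil
}

func main() {
	reg := registration{
		SiteURL:  "https://hub.yt",
		Datasets: []string{"doi:10.1234/darksky"}, // placeholder identifier
	}
	if err := register("https://data-dns.example.org", reg); err != nil {
		fmt.Println("registration failed, will retry later:", err)
	}
}
```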
Synchronization options
- Sites push their status to the resolver API (see the push sketch after this list)
- Assumption: failures are retried after a reasonable period
- Pros
- Updates happen in real-time (no delay except network latency)
- Cons
- Congestion if many sites come online at precisely the same second
- More work for whatever we choose as the scheduler / orchestration system: if a site misses a scheduled push, we may need to pull it out of rotation
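A minimal sketch of the push model, assuming a /status endpoint on the resolver and a fixed interval that doubles as the retry period after a failed push (all placeholders):

```go
package main

import (
	"bytes"
	"log"
	"net/http"
	"time"
)

const (
	resolverURL  = "https://resolver.example.org/status"
	pushInterval = 5 * time.Minute // also the "reasonable period" before retrying a failure
)

// pushStatus sends a heartbeat for this site to the resolver.
func pushStatus(site string) error {
	payload := []byte(`{"site":"` + site + `","state":"up"}`)
	resp, err := http.Post(resolverURL, "application/json", bytes.NewReader(payload))
	if err != nil {
		return err
	}
	resp.Body.Close()
	return nil
}

func main() {
	for {
		if err := pushStatus("https://site-a.example.org"); err != nil {
			// Failed pushes are simply retried on the next loop iteration.
			log.Println("push failed, retrying later:", err)
		}
		time.Sleep(pushInterval)
	}
}
```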
- Resolver service polls for each site's status
- Assumption: failures are silent, and retried on the next poll interval
- Pros
- We will know explicitly when sites are no longer available for launching tools
- Cons
- Time delay between polls means we could have stale data
- Threading nightmare: this is either one short-lived thread per site, or one giant thread looping through all sites (see the polling sketch after this list)
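Sketch of the poll model; in Go the per-site loop is a lightweight goroutine rather than an OS thread, which softens the threading concern somewhat. The site list, /healthz path, and interval are placeholders.

```go
package main

import (
	"log"
	"net/http"
	"time"
)

var sites = []string{"https://site-a.example.org", "https://site-b.example.org"}

const pollInterval = 1 * time.Minute

// pollSite checks one site's health endpoint forever; failures are silent
// and simply retried on the next interval.
func pollSite(site string) {
	for {
		resp, err := http.Get(site + "/healthz")
		if err != nil {
			log.Printf("site %s unreachable: %v", site, err)
		} else {
			if resp.StatusCode != http.StatusOK {
				// Mark the site as unavailable for launching tools.
				log.Printf("site %s unavailable: %s", site, resp.Status)
			}
			resp.Body.Close()
		}
		time.Sleep(pollInterval)
	}
}

func main() {
	// One lightweight goroutine per site, not one OS thread per site.
	for _, s := range sites {
		go pollSite(s)
	}
	select {} // block forever while the pollers run
}
```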
...
- leverage existing Labs/Kubernetes API for authentication and container orchestration / access across remote sites
- etcd.go / kube.go can likely take care of talking to the necessary APIs for us, perhaps with slight modification
- possibly extend Labs apiserver to include the functionality of delegating jobs to tmpnb and/or ToolManager agents?
- this leaves an open question: a single geo-distributed Kubernetes cluster, or one Kubernetes cluster per site, federated across all sites ("ubernetes")? (a rough REST sketch follows)
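If each site runs its own Kubernetes cluster, the extended Labs apiserver would ultimately be making calls like the one below against each remote cluster's API server (listing pods in a namespace via the standard Kubernetes REST API). The cluster URL, namespace, token handling, and TLS shortcut are assumptions for the sketch only.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"io"
	"log"
	"net/http"
)

// listPods queries a remote site's Kubernetes API server for the pods in a
// namespace, authenticating with a bearer token (e.g. a service-account token).
func listPods(apiServer, namespace, token string) (string, error) {
	// TLS verification is skipped only to keep the sketch short; a real
	// deployment would trust each site's cluster CA instead.
	client := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}}
	url := fmt.Sprintf("%s/api/v1/namespaces/%s/pods", apiServer, namespace)
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return "", err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	resp, err := client.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	pods, err := listPods("https://site-a.example.org:6443", "whole-tale", "SERVICE_ACCOUNT_TOKEN")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(pods)
}
```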
Storyboard for Demo Presentation
...