...

  1. Centralized
    1. All datasets are registered with a central service that is responsible for resolving identifiers to locations and launching notebooks at those locations.
    • Pros
      • synchronization / authentication (see below) may be slightly easier to solve (only one user)
      • Can mix data from two sites: having information about data in one place allows users to compose freely.
    • Cons
      • single point of failure
      • Local access to datasets at a site requires going through an external service
  2. Federated
    1. Each site has its own local stack but registers with a federation server for id → location resolution (a client-side resolution sketch follows this list)
    • Pros
      • Users at each site can access data directly / launch notebooks without the federation server
      • Can use existing instances/services, such as hub.yt
    • Cons
      • synchronization (see below) is still an open question (is this an open broadcast? handshake? do we keep a record of nearest neighbors?)
      • authentication (see below) and sharing credentials between sites becomes a more complex problem
      • Can't mix data from two sites
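
In either model, the core client operation is the same: resolve a dataset identifier to the site that can launch a notebook next to the data. Below is a minimal sketch of that lookup in Python, assuming a hypothetical /resolve endpoint (the URL, route, and response fields are illustrative, not an existing API); in the centralized model the same request would instead map onto an extended Girder endpoint.

    import requests

    # Hypothetical resolver endpoint (central service or federation "Data DNS");
    # the URL, route, and response fields are assumptions for illustration.
    RESOLVER_URL = "https://resolver.example.org/api/v1"

    def resolve_dataset(identifier):
        """Resolve a dataset identifier (DOI, URL, URN) to the hosting site."""
        resp = requests.get(RESOLVER_URL + "/resolve", params={"id": identifier})
        resp.raise_for_status()
        # e.g. {"site": "https://hub.yt", "launch_url": "...", "path": "..."}
        return resp.json()

    record = resolve_dataset("doi:10.1234/example-dataset")  # placeholder identifier
    print("Launch notebooks at:", record["site"])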

 

[Gliffy Diagram: sc16-central-feder8]

...

  • Centralized 
    • Assuming that we use Girder as-is, the centralized model requires mounting each dataset filesystem via NFS/SSHFS for the initial metadata "ingest". This is only temporary and does not ingest the actual file data, but is awkward.
    • We would need to use or extend the Girder API to support the remote repository request – resolving the dataset identifier (DOI, URL, URN) to the Girder folderId to get the notebook.
    • Requires a user account on the Girder instance to launch notebooks at each site
    • Solves the Whole Tale problem of running remote docker containers.
  • Federated:
    • New "Data DNS" component to handle registration and resolution of IDs to sites
    • New "Federate" component at each site is needed to post data to the federation/Data DNS service
    • Assumes local user accounts at each site – which means users can access the datasets without the federation server, but also means that there are unique user accounts at each site. Using tmpnb, we can't have a single guest user, since there's one notebook per user?
    • In this model, we could use the hub.yt infrastructure as-is, with the addition of the "federat8" component.  No need to copy or mount the DarkSky dataset.
    • Doesn't solve the Whole Tale problem of running remote docker containers
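
A rough sketch of what the per-site "Federate" component could post to the Data DNS service when registering local datasets (the endpoint, payload, and example id/path are assumptions, not an existing API):

    import requests

    # Hypothetical "Data DNS" registration endpoint and payload.
    DATA_DNS_URL = "https://data-dns.example.org/api/v1"
    SITE_URL = "https://hub.yt"  # this site's local stack

    def register_datasets(datasets):
        """Advertise this site's datasets so ids can be resolved back to this site."""
        for ds in datasets:
            payload = {
                "id": ds["id"],      # DOI / URL / URN
                "site": SITE_URL,
                "path": ds["path"],  # local path or collection id at this site
            }
            requests.post(DATA_DNS_URL + "/datasets", json=payload).raise_for_status()

    # Placeholder id/path for illustration only.
    register_datasets([{"id": "doi:10.1234/darksky", "path": "/data/darksky"}])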

Synchronization options

  1. Sites push their status to the resolver API
    • Assumption: failures are retried after a reasonable period
    • Pros
      • Updates happen in real-time (no delay except network latency)
    • Cons
      • Congestion if many sites come online at precisely the same second
      • More work for whatever we choose as the scheduler / orchestration system - a site missing a scheduled push means we may need to pull it out of rotation
  2. Resolver service polls for each site's status
    • Assumption: failures are silent, and retried on the next poll interval
    • Pros
      • We will know explicitly when sites are no longer available for launching tools
    • Cons
      • Time delay between polls means we could have stale data
      • Threading nightmare - either one short-lived thread per site, or one giant thread looping through all sites (both styles are sketched below)
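
For comparison, a minimal sketch of both synchronization styles from the resolver's point of view (member sites, routes, and intervals are illustrative assumptions):

    import time
    import requests

    RESOLVER_URL = "https://resolver.example.org/api/v1"       # hypothetical resolver
    SITES = ["https://hub.yt", "https://site-b.example.org"]   # hypothetical member sites

    # Option 1: each site pushes a heartbeat to the resolver on its own schedule.
    def push_heartbeat(site_url):
        resp = requests.post(RESOLVER_URL + "/heartbeat",
                             json={"site": site_url, "status": "up"})
        resp.raise_for_status()

    # Option 2: the resolver polls every site; a failed poll is silent and retried
    # on the next interval, so status can be stale for up to `interval` seconds.
    def poll_sites(interval=60):
        while True:
            for site in SITES:
                try:
                    up = requests.get(site + "/status", timeout=5).ok
                except requests.RequestException:
                    up = False
                print(site, "available" if up else "out of rotation")
            time.sleep(interval)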

...

  • leverage existing Labs/Kubernetes API for authentication and container orchestration / access across remote sites
    • etcd.go / kube.go can likely take care of talking to the necessary APIs for us, maybe needing some slight modification
    • possibly extend the Labs apiserver to delegate jobs to tmpnb and/or ToolManager agents?
    • this leaves an open question: a single geo-distributed Kubernetes cluster, or one Kubernetes cluster per site, federated across all sites ("ubernetes")? (see the sketch below)
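
To make the delegation concrete, here is a sketch of scheduling a single-user notebook pod through the Kubernetes Python client, assuming one kubeconfig context per site (the context name, namespace, and image are illustrative):

    from kubernetes import client, config

    # Select the target site's cluster (or a single federated cluster).
    config.load_kube_config(context="site-a")

    # Minimal single-user notebook pod, in the spirit of what tmpnb launches.
    pod = client.V1Pod(
        api_version="v1",
        kind="Pod",
        metadata=client.V1ObjectMeta(name="user-notebook", labels={"app": "tmpnb"}),
        spec=client.V1PodSpec(containers=[
            client.V1Container(
                name="notebook",
                image="jupyter/minimal-notebook",
                ports=[client.V1ContainerPort(container_port=8888)],
            ),
        ]),
    )
    client.CoreV1Api().create_namespaced_pod(namespace="whole-tale", body=pod)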


Storyboard for Demo Presentation

...