Related issue: NDS-736

Use cases

Phenome2017

  • Asked to copy ~100MB of data to each user's directory (along with .betykey file for API access). Although users could have done this themselves (i.e., downloaded from Box), this was thought to be friendlier for the workshop.
  • Note: There was some confusion about the difference between "~" (the effective user's home) and /home/namespace, but this is likely out of scope for this issue.

iSchool (anticipated requirements, since we haven't talked with them yet)

  • Classes often have the need to share data. This might be official data for the class or data shared between users.

TERRA-REF

  • Ability to mount RO data volume (e.g., ROGER FS) in every container. This would likely be an NFS mount point on each compute node (hostPath).
  • Ability for containers to retrieve data from remote API

CyVerse (emerging requirements)

  • CyVerse Data Store provides access to shared data via iRODS and FUSE.  
  • Request from the CyVerse team to configure the workbench to "use the Data Store via iRODS". This would allow containers to access CyVerse data.

DataDNS

  • We've talked about using Labs Workbench as the platform for the DataDNS service. This will mean providing some facility for containers to access data stored on an underlying storage system (e.g., Swift, NFS).

Other requirements

  • Can we support permissions? 
  • How do they get data into the volume?  (File manager, sftp, etc.)
  • What about when we move to commercial cloud providers that don't support Gluster?

Current implementation

  • Global shared directory via hostPath. This is usually a Gluster filesystem. 
  • Each user has a home directory on this global disk. 
  • Containers with volumes map to home/AppData/<stack> to persist data across restarts (see the sketch after this list)
  • Users can copy data to/from their home directory, but do not have access to other users' directories
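
For reference, a minimal sketch of how the current mapping might look in a generated pod spec. The image, user, stack name, and the Gluster mount point (/var/glfs/global) are illustrative assumptions; the real specs are built by the API server.

    apiVersion: v1
    kind: Pod
    metadata:
      name: example-stack
    spec:
      containers:
      - name: rstudio
        image: rocker/rstudio
        volumeMounts:
        - name: appdata
          mountPath: /data
      volumes:
      - name: appdata
        hostPath:
          # Global Gluster filesystem mounted at the same path on every node;
          # each stack's data lives under the user's home at AppData/<stack>.
          path: /var/glfs/global/home/demo/AppData/rstudio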

Design ideas

Global directories

  • Our current Gluster implementation is essentially a Global directory that gets mounted to "home". This is a sort of scratch space for users.
  • We could generalize this solution to optionally allow for multiple Global directories mounted in each container. 
    • In the TERRA-REF case, the ROGER:/terraref directory would be NFS mounted RO to each compute node
    • This could then be exposed to every container under /data/shared/terraref or similar
  • A given installation could have as many Global directories as desired. At first, each will likely need to be entirely RO or entirely RW (unless we solve the permissions problem); see the sketch below.
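
A sketch of what a second, read-only Global directory might look like in a pod spec, assuming ROGER:/terraref is NFS-mounted at /mnt/terraref on each compute node. All image names, paths, and the Gluster mount point are assumptions for illustration.

    apiVersion: v1
    kind: Pod
    metadata:
      name: global-dirs-example
    spec:
      containers:
      - name: notebook
        image: jupyter/base-notebook
        volumeMounts:
        - name: home                       # existing read-write scratch space
          mountPath: /home/jovyan/work
        - name: terraref                   # additional read-only Global directory
          mountPath: /data/shared/terraref
          readOnly: true
      volumes:
      - name: home
        hostPath:
          path: /var/glfs/global/home/demo
      - name: terraref
        hostPath:
          path: /mnt/terraref              # node-level NFS mount of ROGER:/terraref

Kubernetes also has a native nfs volume type, which would avoid the node-level mount at the cost of putting the NFS server details in each pod spec.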

Data Providers

This is still pretty rough: to use Labs Workbench in place of the TERRA-REF tool server, we need a consistent mechanism for getting data into containers via API. This is related to the request from CyVerse to integrate with iRODS, which is effectively a set of commands for transferring data.

  • Provide a "Data Provider" interface that can be used to implement support for transferring data from different services (e.g., Clowder, DataVerse). To support the toolserver approach, we'll need the ability to "add" data to the running container. For us, this should be pretty easy since we can just transfer data into the user's home or AppData directory.
  • I'm still working through iRODS. It might be easy enough for the user to simply run the iCommands to transfer data locally and pick it up inside a Jupyter or RStudio container. I'm not sure that we need to wrap iRODS with another API.
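
As a strawman, "add data to a running container" could be an API call that a Data Provider plugin resolves into a transfer into the user's home/AppData space. The endpoint and every field name below are hypothetical, not an existing API:

    # POST /api/users/demo/data    (hypothetical endpoint)
    provider: clowder                  # plugin that knows how to talk to the service
    source:
      url: https://clowder.example.org/api/datasets/abc123/download
      key: <user-supplied API key>     # e.g., the contents of a .betykey file
    destination: AppData/rstudio/data  # relative to the user's home directory
    mode: copy                         # copy for now; "mount" later (e.g., iRODS/FUSE)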

Permissions

Currently, all containers run as root. The global filesystem is a shared Gluster filesystem owned by root. Data is protected only by not allowing users to see other users' home directories. 

Kubernetes does support fsGroup and supplementalGroups options on the Pod spec to control group ownership of files. Pod specs also support the runAsUser security context to specify an effective UID. However, these options are only available for a subset of filesystem types, and not for hostPath (runAsUser does appear to work). SELinux is also an option, but will require more investigation. Per this document, Kubelet should never manage SELinux permissions for hostPath; in fact, the same document makes SELinux support for NFS and Gluster seem unlikely.

To summarize: it is currently possible to set the runAsUser UID on any running container to control file and directory ownership; however, the GID is not set, because the fsGroup and supplementalGroups options only apply to some filesystem types, and not to hostPath (which we currently use). A sketch of the relevant fields follows.
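
The UID/GID values below are arbitrary examples; the point is only where the fields sit in the spec.

    apiVersion: v1
    kind: Pod
    metadata:
      name: uid-gid-example
    spec:
      securityContext:
        runAsUser: 1001             # effective UID; appears to work for any volume type
        fsGroup: 2001               # GID applied only to supported volume types (not hostPath)
        supplementalGroups: [3001]  # extra GIDs for the container processes
      containers:
      - name: shell
        image: busybox
        command: ["sh", "-c", "id; sleep 3600"]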

Kubernetes Volumes

Volume support in Kubernetes seems largely unchanged since we started. They do now offer subPath on volume mounts, which lets a container mount just a sub-directory of a volume (e.g., one user's directory on a shared volume). It is likely that we'll need to use persistent volume claims as a way to support global volumes on commercial cloud providers that don't support Gluster; a sketch combining the two follows.
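
A sketch of a PVC-backed global volume using subPath to expose only one user's directory. The claim name and paths are assumptions.

    apiVersion: v1
    kind: Pod
    metadata:
      name: subpath-example
    spec:
      containers:
      - name: notebook
        image: jupyter/base-notebook
        volumeMounts:
        - name: global
          mountPath: /home/jovyan/work
          subPath: home/demo            # mount only this user's slice of the volume
      volumes:
      - name: global
        persistentVolumeClaim:
          claimName: workbench-global   # provisioned by the cloud provider's storage class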

Transferring Data

Individual users can use the File Manager or console to transfer data into the shared directory via standard upload, SFTP, or SCP methods.

Implementation

These would roughly translate into new tickets:

  • Global NFS shared directory support (other than Gluster): Extend the API server to support multiple global shared filesystems, beginning with NFS
  • Data Provider interfaces for Clowder and iRODS: Extend the API server to support a) a Data Provider plugin and b) the ability to request that data be added to the user's home/scratch space and mounted into a container. This is to support the TERRA-REF "analyze this datafile" requirement.
  • Filesystem ownership by UID: Extend the user model to include a UID that can be mapped to filesystem ownership (API server)
  • Filesystem ownership by GID: Extend the user model to include a GID that can be mapped to filesystem ownership (API server)

 
