Type: Bug
Resolution: Fixed
Priority: Critical
Project: Labs Workbench - Beta
Sprint: NDS Sprint 30
We saw this with ETK2017 and, later, with the EarthCube Workbench instance: filling up the disk is the most damaging failure one of these clusters can experience, and the hardest to recover from without a full reprovision/migration.
We should discuss deployment strategies that prevent us from forgetting to mount a dedicated data volume.
We should also make sure that /ndslabs/data is housed on this new volume.
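As a rough sketch of what "never forget the data volume" means in practice, the provisioning step could look like the following. The device name (/dev/vdb), label, and mount point are assumptions and must be adapted to the actual attached cloud volume:

```shell
# Provision the attached data volume and relocate Docker state and
# cluster data onto it. Device, label, and mount point are assumptions;
# run once, as root, on a fresh node.
provision_data_volume() {
  dev="${1:-/dev/vdb}"                 # confirm the device with lsblk first
  mkfs.ext4 -L nds-data "$dev"
  mkdir -p /media/storage
  echo 'LABEL=nds-data /media/storage ext4 defaults 0 2' >> /etc/fstab
  mount /media/storage

  # Keep Docker images/containers and /ndslabs/data off the root disk.
  systemctl stop docker
  mkdir -p /media/storage/ndslabs/data
  mv /var/lib/docker /media/storage/docker
  ln -s /media/storage/docker /var/lib/docker
  mkdir -p /ndslabs
  ln -sfn /media/storage/ndslabs/data /ndslabs/data
  systemctl start docker
}
```

Alternatively, instead of symlinking /var/lib/docker, Docker's data directory can be pointed at the new volume via the `data-root` setting in /etc/docker/daemon.json.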
Ultimately, we should do at least one of the following:
- Formalize and thoroughly document the single-node deployment via ndslabs-startup, automating whenever possible, to prevent us from accidentally skipping steps
- Think of a way to deploy a gluster-free cluster, consisting of a single master and a single compute node:
  - Attempt to use deploy-tools as-is with a gluster-less inventory to produce such a cluster
  - Write a new playbook to deploy such a cluster
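As a concrete starting point for the gluster-free option, a deploy-tools inventory might look something like the following. This is a hypothetical sketch: the group and host names are assumptions and need to be checked against the actual playbooks:

```ini
; Hypothetical inventory for a gluster-less two-node cluster
[master]
workbench-master

[compute]
workbench-node1

[glfs]
; intentionally empty: no GlusterFS servers in this deployment
```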
The big checklist of things to include in the documentation:
- Request / verify TLS certs
- Verify MTU settings
- Verify volume sizes
- ALWAYS attach a large data volume for /var/lib/docker and cluster data
- Import or mount existing data / configuration
- Verify NGINX config
  - for example, max body size
- Ensure that custom default backend is deployed (Ansible does not do this yet)
- Enable NAGIOS / LMA
- Disable Logging via ElasticSearch
- Double-check node labels
- Create accounts
- Double-check that a basic auth secret exists for all accounts
- Verify service specs
  - for example: is everything present? Are sensible limits set for all specs?
- Cache service images
- Smoke test
- Write some tutorials / examples for usage of the new instance
  - for example: is there anything new that is specific to this instance? New services, new features, etc.
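Several of the checks above can be scripted so the smoke test is repeatable rather than ad hoc. A minimal sketch, assuming a GNU/Linux node with openssl and kubectl available; the endpoint hostname and the 80% disk threshold are placeholders:

```shell
#!/bin/sh
# Post-deployment smoke checks; hostnames and thresholds are assumptions.

# Succeed if usage of the filesystem at $1 is below $2 percent.
disk_below_threshold() {
  pct=$(df --output=pcent "$1" | tail -1 | tr -dc '0-9')
  [ "$pct" -lt "$2" ]
}

run_checks() {
  endpoint="$1"   # e.g. the public Workbench hostname

  # Disk: warn before the Docker/data volume fills up
  disk_below_threshold /var/lib/docker 80 \
    || echo "WARN: Docker/data volume above 80% usage"

  # TLS: report certificate expiry for the public endpoint
  echo | openssl s_client -connect "$endpoint:443" 2>/dev/null \
    | openssl x509 -noout -enddate

  # MTU: list per-interface MTUs to compare against the expected value
  ip -o link show | awk '{print $2, $4, $5}'

  # Node labels and per-account basic-auth secrets
  kubectl get nodes --show-labels
  kubectl get secrets --all-namespaces | grep basic-auth
}
```

Something like `run_checks <public-hostname>` could then be run on the master after every deployment, and extended as new checklist items are formalized.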
This ticket is complete when the above has been discussed, formalized, presented in a digestible fashion, and automated wherever possible; it may need to be broken down into smaller tasks.