National Data Service / NDS-984

Single-node installations can fill up the root disk, causing catastrophic failure

    • NDS Sprint 30

      We saw this with ETK2017 and (later) the EarthCube Workbench instance. A full root disk is among the worst failures one of these clusters can suffer, and the most difficult to recover from without a full reprovision or migration.

      We should discuss deployment strategies that prevent us from forgetting to mount a dedicated data volume separate from the root disk.

      We should also make sure that /ndslabs/data is housed on that data volume.
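A minimal sketch of how such a check could be automated, assuming it runs directly on the node. The helper names (`on_separate_volume`, `check_data_paths`) are hypothetical and not part of ndslabs-startup or deploy-tools. It compares device IDs, so a bind mount from the root disk would still be reported as sharing the root disk.

```python
import os


def on_separate_volume(path, root="/"):
    """Return True if `path` lives on a different device than `root`."""
    return os.stat(path).st_dev != os.stat(root).st_dev


def check_data_paths(paths, root="/"):
    """Map each path to 'ok', 'on-root-disk', or 'missing'."""
    results = {}
    for path in paths:
        if not os.path.exists(path):
            results[path] = "missing"
        elif on_separate_volume(path, root):
            results[path] = "ok"
        else:
            results[path] = "on-root-disk"
    return results


if __name__ == "__main__":
    # Paths taken from this ticket; adjust for the instance being deployed.
    for path, status in check_data_paths(["/var/lib/docker", "/ndslabs/data"]).items():
        print(f"{path}: {status}")
```

Anything reported as `missing` or `on-root-disk` would indicate that the data volume was not mounted before the cluster was started.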

      Ultimately, we should do at least one of the following:

      • Formalize and thoroughly document the single-node deployment via ndslabs-startup, automating whenever possible, to prevent us from accidentally skipping steps
      • Think of a way to deploy a gluster-free cluster, consisting of a single master and a single compute node
        1. Attempt to use deploy-tools as-is with a gluster-less inventory to produce such a cluster
        2. Write a new playbook to deploy such a cluster

      The big checklist of things to include in the documentation:

      1. Request / verify TLS certs
      2. Verify MTU settings
      3. Verify volume sizes
        • ALWAYS attach a large data volume for /var/lib/docker and cluster data
      4. Import or mount existing data / configuration
      5. Verify NGINX config
        • for example, max body size
      6. Ensure that custom default backend is deployed (Ansible does not do this yet)
      7. Enable Nagios / LMA
      8. Disable logging via Elasticsearch
      9. Double-check node labels
      10. Create accounts
      11. Double-check that a basic auth secret exists for all accounts
      12. Verify service specs
        • for example, is everything present? are sensible limits set for all specs?
      13. Cache service images
      14. Smoke test
      15. Write some tutorials / examples for usage of the new instance
        • for example, is there anything new that is specific to this instance? new services, new features, etc
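The "Verify volume sizes" step in the checklist above could be partially automated along these lines. This is only a sketch: the function names and both thresholds are assumptions, since the ticket does not specify a minimum volume size or a usage limit.

```python
import shutil

GIB = 1024 ** 3


def volume_report(path):
    """Return total size in GiB and percent used for the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    used_pct = 100.0 * usage.used / usage.total if usage.total else 0.0
    return {"total_gib": usage.total / GIB, "used_pct": used_pct}


def volume_problems(path, min_total_gib, max_used_pct):
    """Return a list of human-readable problems; an empty list means the volume passes."""
    report = volume_report(path)
    problems = []
    if report["total_gib"] < min_total_gib:
        problems.append(
            f"{path}: {report['total_gib']:.1f} GiB total, expected at least {min_total_gib} GiB"
        )
    if report["used_pct"] > max_used_pct:
        problems.append(
            f"{path}: {report['used_pct']:.1f}% used, limit is {max_used_pct}%"
        )
    return problems


if __name__ == "__main__":
    # Hypothetical thresholds: at least 100 GiB total, at most 80% used.
    for problem in volume_problems("/", min_total_gib=100, max_used_pct=80):
        print(problem)
```

The same check could be run periodically after deployment, pointed at /var/lib/docker and /ndslabs/data, to catch a volume that is filling up before it becomes unrecoverable.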

      This ticket is complete when the above has been discussed, formalized, presented in a digestible fashion, and automated wherever possible. This ticket might need to be broken down into smaller tasks.

              lambert8 Sara Lambert

                  Original Estimate: 1d 4h
                  Remaining Estimate: 1d 3h 30m
                  Time Spent: 30m