National Data Service / NDS-985

Single-node installations flake out in high-load scenarios


    • NDS Sprint 30

      We saw this with ETK2017: if a single-node installation, however large, does not reserve dedicated resources for the master, the master pod can be starved of resources in high-load scenarios.

      The master went into CrashLoopBackOff during the ETK2017 compile phase, when the load average was around 60 or higher on a 64-core node (i.e. approaching the danger zone). Anything above 45 should have been cause for concern. We have historically over-provisioned these clusters, and for a short-lived workshop where every minute matters it is much better to be over-provisioned than under-provisioned.
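      The thresholds above can be turned into a simple check. The following is a minimal sketch, not something that was run on the workshop cluster: the function names are hypothetical, and the 70% default is an assumption taken from the figure proposed in this ticket (70% of 64 cores is 44.8, close to the 45 mentioned above).

```shell
# Hypothetical helpers for illustration; the names and the 70% default are assumptions.

# Print the load-average threshold for a node: cores * percent / 100.
load_threshold() {
    local cores="$1" percent="${2:-70}"
    awk -v c="$cores" -v p="$percent" 'BEGIN { printf "%.1f\n", c * p / 100 }'
}

# Succeed (exit status 0) when the given load average exceeds the threshold.
load_exceeds() {
    local load="$1" threshold="$2"
    awk -v l="$load" -v t="$threshold" 'BEGIN { exit !(l > t) }'
}

# Example use on a Linux node (not executed here):
#   load=$(cut -d' ' -f1 /proc/loadavg)
#   if load_exceeds "$load" "$(load_threshold "$(nproc)")"; then
#       echo "load $load is above the threshold" >&2
#   fi
```

      With the numbers from ETK2017, `load_threshold 64 70` prints 44.8, and `load_exceeds 60 44.8` succeeds, so a check like this would have flagged the node before the master started crashing.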

      We should look into capacity planning that limits user workloads to 70% of the system's resources and leaves the remaining 30% for overhead.
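      As a rough illustration of what a 70% cap means per user, the sketch below does the arithmetic in millicores (thousandths of a core, the unit Kubernetes uses for CPU). The function names are hypothetical, and how such a number would actually be enforced (for example as a per-pod CPU limit) is an assumption that this ticket leaves open.

```shell
# Hypothetical helpers for illustration; integer arithmetic, results rounded down.

# Print each user's CPU share in millicores when only `percent` of the
# node's cores are available to user workloads.
per_user_millicores() {
    local cores="$1" percent="$2" users="$3"
    echo $(( cores * 1000 * percent / 100 / users ))
}

# Print the CPU held back for the master and other overhead, in millicores.
reserved_millicores() {
    local cores="$1" percent="$2"
    echo $(( cores * 1000 * (100 - percent) / 100 ))
}
```

      For a 64-core node shared by 30 users, `per_user_millicores 64 70 30` prints 1493 (about 1.5 cores each) and `reserved_millicores 64 70` prints 19200 (19.2 cores held back). Under that budget a two-job build per user would not fit: 30 users running two compile jobs each is 60 concurrent jobs, which is in line with the load average observed during ETK2017.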

      This ticket is complete when we have discussed the above, although there may be no concrete deliverable resulting from the discussion.

      To test: run through the ETK tutorial use case, replicating the 30-user scenario, and recommend the best solution, which may be master+1 (NDS-984).

      From https://docs.einsteintoolkit.org/et-docs/Simplified_Tutorial_for_New_Users (with slight modifications):

      # Override using Python2.7
      source ~/../.bashrc
       
      # Reset test state
      cd ~
      rm -rf /home/jovyan/work/*
       
      # Fetch the Einstein Toolkit components and build Cactus
      curl -O -L https://raw.githubusercontent.com/gridaphobe/CRL/ET_2017_06/GetComponents
      chmod a+x GetComponents
      ./GetComponents --parallel https://bitbucket.org/einsteintoolkit/manifest/raw/ET_2017_06/einsteintoolkit.th
      cd Cactus
      ./simfactory/bin/sim setup-silent --optionlist=ubuntu.cfg --runscript debian.sh
      ./simfactory/bin/sim build --mdbkey make 'make -j2' --thornlist=manifest/einsteintoolkit.th
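      To approximate the 30-user scenario from a single shell, the steps above could be launched many times at once. The following is a minimal sketch under stated assumptions: `run_concurrent_users` is a hypothetical helper, each simulated user gets its own scratch directory, and the command being run must work relative to its current directory. It does not reproduce per-user containers, so it only approximates the workshop load.

```shell
# Hypothetical helper for illustration.
# Run a command once per simulated user, all at the same time, each in its
# own directory under `base`. Output goes to <base>/user<N>.log.
# Returns non-zero if any run fails.
run_concurrent_users() {
    local users="$1" base="$2"
    shift 2
    local pids="" failed=0 i=1 pid
    while [ "$i" -le "$users" ]; do
        mkdir -p "$base/user$i"
        # Each run happens in a background subshell so they overlap.
        ( cd "$base/user$i" && "$@" ) > "$base/user$i.log" 2>&1 &
        pids="$pids $!"
        i=$(( i + 1 ))
    done
    # Wait for every run and count the failures.
    for pid in $pids; do
        wait "$pid" || failed=$(( failed + 1 ))
    done
    echo "$failed of $users runs failed"
    [ "$failed" -eq 0 ]
}
```

      For example, if the build steps were saved as a script (say a hypothetical `/tmp/etk_build.sh`, adapted to use the current directory instead of `~` and `/home/jovyan/work`), then `run_concurrent_users 30 /tmp/etk-load /tmp/etk_build.sh` would start 30 builds simultaneously while the load average and the master pod are watched separately.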

       

              lambert8 Sara Lambert

              Original Estimate: 1 day
              Remaining Estimate: 1 day
              Time Spent: Not Specified