Objective
- Generate load on the system for a given number of users
- Monitor the system's resource utilization using Grafana
- This will give us a benchmark of the expected "load" on the cluster
- Take user feedback regarding general usability of the system under the desired load conditions
- This will tell us whether the user experience degrades when the system is under stress
- Note how node additions and removals affect resource constraints, and to what degree
- If the system's resources become constrained, add a node to the cluster to alleviate the constraint
- If the system has far more resources than the load requires, remove a node from the cluster to simulate a downed node
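The add/remove rule in the last two bullets can be sketched as a small decision function. This is only an illustration: the function name and the 80% / 20% thresholds are assumptions, since the plan does not define what counts as "constrained" or "over-saturated".

```python
def scaling_action(cpu_pct: float, mem_pct: float,
                   high: float = 80.0, low: float = 20.0) -> str:
    """Suggest a cluster action from peak CPU and memory utilization.

    The `high` and `low` thresholds are hypothetical; tune them to
    whatever the Grafana dashboard shows as a realistic ceiling.
    """
    # The most heavily used resource decides whether we are constrained.
    peak = max(cpu_pct, mem_pct)
    if peak >= high:
        return "add_node"      # resources constrained: add a node
    if peak <= low:
        return "remove_node"   # far more capacity than needed: simulate a downed node
    return "hold"              # utilization in the normal band: change nothing


print(scaling_action(cpu_pct=4.0, mem_pct=6.0))
```

With these assumed thresholds, the peak figures reported under Prognosis (6% memory, 3-4% CPU) would fall in the "remove a node" case.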
Resulting Actions
1 hour group testing
1 hour writing new issues
Phase 1: Labs Workbench + Management
Current Cluster Configuration
See inventory at: https://github.com/nds-org/ndslabs-deploy-tools/commit/d8d8ef30dac74b1fe84185c7abc6136516d60e7b
Participants
- Craig Willis
- David Raila
- Mike Lambert
Measurement Utilities
- https://kubedash.workbench.nationaldataservice.org/#!/
- https://grafana.workbench.nationaldataservice.org/dashboard/db/cluster
Results
- Mike: API server crashed with an unknown error shortly after beginning the test
- I started owncloud + cloudcmd + postgres x2 + mysql + dspace simultaneously
- David: Catalog links do not seem to work in Firefox
- David: File Manager occasionally refuses to start in Chrome
- Popup blocker?
- David: HTTP Basic is old-timey and gross (I concur)
- Mike: Redis has an HTTP endpoint?
- Mike: Jenkins encountered the "no data available" error
- Craig: PyCharm encountered the "no data available" error
- Process was hung and would not shut down
- See NDS-464
- See https://nationaldataservice.slack.com/files/craig-willis/F2MK9HJKY/example_error.txt
- Mike: Fedora Commons encountered SSL errors on the REST endpoint
- Mike: Clowder Digest Extractor label missing from dropdown
- Mike: Clowder starts slower than it used to - need to bump up the readinessProbe to accommodate this
- Mike: Clowder extractors / toolserver fail due to 401 (HTTP basic auth)
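For context on the 401 failures: an endpoint protected by HTTP Basic auth rejects any request that lacks an `Authorization` header. The sketch below shows what a client would have to send. The helper name and the credentials are placeholders, and nothing here reflects how the Clowder extractors or toolserver are actually configured.

```python
import base64


def basic_auth_header(username: str, password: str) -> dict:
    """Build the Authorization header for HTTP Basic auth (RFC 7617).

    The credentials are joined as "username:password" and base64-encoded.
    """
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# Placeholder credentials, for illustration only.
print(basic_auth_header("user", "pass"))
```

A service-to-service call that omits this header gets the same 401 a browser would get before the user logs in, which is consistent with the extractor and toolserver failures noted above.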
Prognosis
So far, aside from a few minor issues, everything is running super smoothly.
Peak usage was measured at:
- 6% cluster memory usage
- 3-4% cluster CPU usage
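As a rough way to read these numbers, the sketch below extrapolates how many concurrent users the cluster might support. It rests on assumptions the test did not verify: that the three listed participants were the only users, that usage grows linearly with user count, that idle baseline usage is negligible, and that 80% is an acceptable utilization ceiling. The function name is hypothetical.

```python
def estimated_user_capacity(users: int, peak_pct: float,
                            ceiling_pct: float = 80.0) -> int:
    """Linearly extrapolate how many users fit under a utilization ceiling.

    Assumes each user contributes an equal share of `peak_pct`, which this
    test did not measure; treat the result as an order-of-magnitude guess.
    """
    if users <= 0 or peak_pct <= 0:
        raise ValueError("users and peak_pct must be positive")
    # ceiling / (peak / users), rearranged to avoid rounding surprises
    return int(ceiling_pct * users // peak_pct)


# 3 participants at 6% peak memory, the tighter of the two resources.
print(estimated_user_capacity(users=3, peak_pct=6.0))
```

Memory is used as the input because it peaked higher than CPU; a larger test would be needed to check whether the linear assumption holds.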