Overview
This page records the results of the manual load testing done on the NDS Labs Workbench (Beta).
Objective
- Generate load on the system for a given number of users
- Monitor the system's resource utilization using Grafana
- This will give us a benchmark of the expected "load" on the cluster
- Take user feedback regarding general usability of the system under the desired load conditions
- This will let us know if user performance has degraded due to any stress on the system
- Take note of how node additions and removals affect resource constraints, and to what degree
- If the system's resources become constrained, add a node to the cluster to alleviate the constraint
- If the system is heavily over-provisioned, remove a node from the cluster to simulate a downed node
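The add/remove rule in the last objective can be sketched as a simple threshold check. This is only an illustration: the 80% and 20% thresholds and the `recommend_node_action` helper are hypothetical and do not come from the test plan, and the utilization figures are assumed to be read manually from Grafana.

```python
from enum import Enum


class NodeAction(Enum):
    ADD_NODE = "add a node"
    REMOVE_NODE = "remove a node"
    NO_CHANGE = "no change"


# Hypothetical thresholds, chosen for illustration only.
CONSTRAINED_PCT = 80.0
OVERPROVISIONED_PCT = 20.0


def recommend_node_action(cpu_pct: float, mem_pct: float) -> NodeAction:
    """Map cluster-wide CPU and memory utilization (0-100) to a node action."""
    for value in (cpu_pct, mem_pct):
        if not 0.0 <= value <= 100.0:
            raise ValueError(f"utilization must be within 0-100, got {value}")
    # The busier of the two resources decides the outcome.
    busiest = max(cpu_pct, mem_pct)
    if busiest >= CONSTRAINED_PCT:
        return NodeAction.ADD_NODE
    if busiest <= OVERPROVISIONED_PCT:
        return NodeAction.REMOVE_NODE
    return NodeAction.NO_CHANGE
```

Under these assumed thresholds, a reading such as 4% CPU and 6% memory would fall in the "remove a node" range, while anything between the two thresholds leaves the cluster unchanged.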
Current Cluster Configuration
See inventory at: https://github.com/nds-org/ndslabs-deploy-tools/commit/d8d8ef30dac74b1fe84185c7abc6136516d60e7b
Resulting Actions
1 hour group testing
1 hour writing new issues
Phase 1: Labs Workbench + Management
Workbench Version
- 1.0.5
Participants
- Craig Willis
- David Raila
- Mike Lambert
Measurement Utilities
- https://kubedash.workbench.nationaldataservice.org/#!/
- https://grafana.workbench.nationaldataservice.org/dashboard/db/cluster
Results
- Mike: API server crashed with an unknown error shortly after beginning the test
- I started owncloud + cloudcmd + postgres x2 + mysql + dspace simultaneously
- Craig: server.go 1200 on latest (but which latest?) - changed to 1.0.5
- no stack trace, so no ticket filed... if we see it again we will address it
- David: Catalog links do not seem to work in Firefox
- See https://files.slack.com/files-pri/T16F0Q17E-F2MK2PPT9/firefox_error.txt
- Cached page?
- See NDS-173
- David: File Manager occasionally refuses to start in Chrome
- Popup blocker?
- David: HTTP Basic is old-timey and gross
- I agree.
- Mike: Redis has an HTTP endpoint?
- See NDS-621
- Mike: Jenkins encountered the "no data available" error
- See https://nationaldataservice.slack.com/files/bodom0015/F2MKNMY2E/jenkins_error.txt
- See NDS-464
- Craig: PyCharm encountered the "no data available" error
- Process was hung and would not shut down
- See https://nationaldataservice.slack.com/files/craig-willis/F2MK9HJKY/example_error.txt
- See NDS-464
- Mike: Fedora Commons encountered SSL errors on the REST endpoint
- SSL errors prevented CSS from rendering
- New ticket: NDS-644
- Mike: Clowder Digest Extractor label missing from dropdown
- Mike: Clowder starts more slowly than it used to - the readinessProbe timing needs to be relaxed to accommodate
- New ticket: NDS-645
- Mike: Clowder extractors / toolserver fails due to 401 (HTTP basic auth)
- PlantCV failed (the TERRA demo from NDSC5)
- See https://nationaldataservice.slack.com/files/bodom0015/F2MLHT2AZ/extractors-error.txt
- New ticket: NDS-646
- Craig: Somehow the endpoints are being returned as the home page?
- I have seen this intermittently, but am unable
- Craig: RStudio has a default password
- Craig: Cloud9 needs Java 8 to build Dataverse; the process was also killed with OOM
- New ticket: NDS-640
- Craig: Redis endpoint shouldn't be external
- See NDS-621
- David: No numpy in JupyterLab
- New ticket: NDS-647
- Mike: Kibana redirects to Grafana
- New ticket: NDS-649
- Craig: Chisel didn't work as expected.
- See NDS-646
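For the Clowder readinessProbe item above (NDS-645), one way to picture the change is as a small adjustment to the probe's timing. The sketch below uses the standard Kubernetes probe field names (`initialDelaySeconds`, `timeoutSeconds`); the `relax_readiness_probe` helper, the path, the port, and every number are placeholders, not the values in the actual Clowder spec.

```python
from copy import deepcopy


def relax_readiness_probe(probe: dict, extra_delay_seconds: int) -> dict:
    """Return a copy of `probe` whose first check is delayed by extra_delay_seconds."""
    if extra_delay_seconds < 0:
        raise ValueError("extra_delay_seconds must be non-negative")
    relaxed = deepcopy(probe)  # leave the caller's spec untouched
    relaxed["initialDelaySeconds"] = (
        probe.get("initialDelaySeconds", 0) + extra_delay_seconds
    )
    return relaxed


# Hypothetical current probe; path, port, and timings are placeholders.
clowder_probe = {
    "httpGet": {"path": "/", "port": 9000},
    "initialDelaySeconds": 10,
    "timeoutSeconds": 5,
}

print(relax_readiness_probe(clowder_probe, 30)["initialDelaySeconds"])  # prints 40
```

How much extra delay is appropriate would have to come from measuring Clowder's real startup time on the cluster.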
Prognosis
So far, aside from a few minor issues, everything is running super smoothly.
Peak usage was measured at:
- 6% cluster memory usage
- 3-4% cluster CPU usage
Nearly every available service was started at some point during roughly 2 hours of testing, and only 2 services encountered the notorious "no data available" problem:
- PyCharm
- Jenkins
Overall, this is fantastic news for the stability of the platform. The testing has brought to light several issues that will need to be addressed.
Resulting Actions
Higher priority:
- NDS-464
- NDS-640
- NDS-621
- NDS-648
- NDS-173
Lower priority:
- NDS-646
- NDS-647
- NDS-645
- NDS-644
- NDS-649
Phase 2: Bug Party
Workbench Version
- 1.0.6
Participants
- Craig Willis
- David Raila
- Mike Lambert
- Michal
- Jing
- Sandeep
- Qiyue
- Marcus
Measurement Utilities
- Node Performance: https://kubedash.workbench.nationaldataservice.org/#!/
- Cluster Performance: https://grafana.workbench.nationaldataservice.org/dashboard/db/cluster
- Centralized Logging: https://kibana.workbench.nationaldataservice.org/
Results
- Michal: No indication of which fields are required for registration
- Mike: I have a better UI design for the catalog to propose :X
- David: Recommend that users whitelist our site in their pop-up blocker or disable it - can we detect this and show a recommendation to users without the correct settings?
- Michal: couldn't sign up for DSpace - address in use
- This is a more general problem with any service that generates admin credentials; the user should be directed to the Config page
- Jing: Docker image name validation is incomplete
- Underscore should be among accepted characters
- Mike: Saw a failure adding Sufia once; on the next attempt it was added properly
- Jing: Custom service failed to start
- Qiyue: How do we use different versions of a service? For example: Cloud9 Java7 vs Cloud9 Java8
- Qiyue: What is the storage quota? 20GB
- Jing: Redis is missing an info link
- Marcus: NDS Confluence went down; as a result, icons could not load
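Jing's image-name validation item above can be made concrete with a pattern that accepts underscores. The sketch below is a simplified reading of Docker's image reference grammar, not the Workbench's actual validator: `is_valid_image_name` is a hypothetical helper, registry hosts are restricted to lowercase, and digests (`@sha256:...`) are not handled.

```python
import re

# One path component: lowercase alphanumeric runs joined by a single
# separator ('.', '_', '__', or one or more '-').
_COMPONENT = r"[a-z0-9]+(?:(?:[._]|__|-+)[a-z0-9]+)*"
# Optional registry host with optional port (simplified: lowercase only).
_HOST = r"[a-z0-9]+(?:(?:\.|-+)[a-z0-9]+)*(?::[0-9]+)?"
# Tag: up to 128 characters, may not start with '.' or '-'.
_TAG = r"[A-Za-z0-9_][A-Za-z0-9_.-]{0,127}"

IMAGE_NAME_RE = re.compile(
    rf"(?:{_HOST}/)?{_COMPONENT}(?:/{_COMPONENT})*(?::{_TAG})?"
)


def is_valid_image_name(name: str) -> bool:
    """Return True if `name` looks like [host[:port]/]path[:tag]."""
    return IMAGE_NAME_RE.fullmatch(name) is not None


# Example names below are made up for illustration.
for candidate in ("example/my_image:1.0", "my_image", "_image", "My_Image"):
    print(candidate, is_valid_image_name(candidate))
```

Names with an underscore between alphanumeric runs pass, while a leading underscore or uppercase letters in the repository path are rejected.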