Overview
This page records the results of the manual load testing performed on the NDS Labs Workbench (Beta).
Objective
- Generate load on the system for a given number of users
- Monitor the system's resource utilization using Grafana
- This will give us a benchmark of the expected "load" on the cluster
- Take user feedback regarding general usability of the system under the desired load conditions
- This will let us know if user performance has degraded due to any stress on the system
- Take note of how any node additions / removals affect resource constraints, and to what degree
- If the system's resources become constrained, add a node to the cluster to alleviate the constraint
- If the system is far over-saturated with resources, remove a node from the cluster to simulate a downed node
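The add / remove rule in the last two bullets can be sketched as a small decision function. This is only an illustration: the threshold values and the names `ClusterUsage` and `recommend_node_action` are hypothetical, since this page does not define numeric cut-offs for "constrained" or "over-saturated".

```python
from dataclasses import dataclass

# Hypothetical thresholds; the test plan does not specify numeric values.
CONSTRAINED_PCT = 80.0     # at or above this, treat the cluster as constrained
OVERSATURATED_PCT = 20.0   # at or below this, treat the cluster as over-provisioned


@dataclass
class ClusterUsage:
    """Cluster-wide utilization percentages, e.g. as read off a Grafana dashboard."""
    memory_pct: float
    cpu_pct: float


def recommend_node_action(usage: ClusterUsage) -> str:
    """Return 'add-node', 'remove-node', or 'no-change' for the given usage."""
    # The busier of the two resources drives the decision.
    peak = max(usage.memory_pct, usage.cpu_pct)
    if peak >= CONSTRAINED_PCT:
        return "add-node"
    if peak <= OVERSATURATED_PCT:
        return "remove-node"
    return "no-change"
```

With these assumed thresholds, a reading such as `ClusterUsage(memory_pct=6.0, cpu_pct=4.0)` would suggest removing a node to simulate a downed node.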
Current Cluster Configuration
See inventory at: https://github.com/nds-org/ndslabs-deploy-tools/commit/d8d8ef30dac74b1fe84185c7abc6136516d60e7b
Resulting Actions
1 hour group testing
1 hour writing new issues
Phase 1: Labs Workbench + Management
Workbench Version
- 1.0.5
Participants
- Craig Willis
- David Raila
- Mike Lambert
Measurement Utilities
- https://kubedash.workbench.nationaldataservice.org/#!/
- https://grafana.workbench.nationaldataservice.org/dashboard/db/cluster
Results
- Mike: API server crashed with an unknown error shortly after beginning the test
- I started owncloud + cloudcmd + postgres x2 + mysql + dspace simultaneously
- Craig: server.go 1200 on latest (but which latest?) - changed to 1.0.5
- no stack trace, so no ticket filed... if we see it again we will address it
- David: Catalog links do not seem to work in Firefox
- See https://files.slack.com/files-pri/T16F0Q17E-F2MK2PPT9/firefox_error.txt
- Cached page?
- See NDS-173
- David: File Manager occasionally refuses to start in Chrome
- Popup blocker?
- David: HTTP Basic is old-timey and gross
- I agree.
- Mike: Redis has an HTTP endpoint?
- See NDS-621
- Mike: Jenkins encountered the "no data available error"
- See https://nationaldataservice.slack.com/files/bodom0015/F2MKNMY2E/jenkins_error.txt
- See NDS-464
- Craig: PyCharm encountered the "no data available" error
- Process was hung and would not shut down
- See https://nationaldataservice.slack.com/files/craig-willis/F2MK9HJKY/example_error.txt
- See NDS-464
- Mike: Fedora Commons encountered SSL errors on the REST endpoint
- SSL errors prevented CSS from rendering
- New ticket: NDS-644
- Mike: Clowder Digest Extractor label missing from dropdown
- Mike: Clowder starts slower than it used to - need to bump up the readinessProbe to accommodate
- New ticket: NDS-645
- Mike: Clowder extractors / toolserver fails due to 401 (HTTP basic auth)
- PlantCV failed (the TERRA demo from NDSC5)
- See https://nationaldataservice.slack.com/files/bodom0015/F2MLHT2AZ/extractors-error.txt
- New ticket: NDS-646
- Craig: Somehow the endpoints are being returned as the home page?
- I have seen this intermittently, but am unable
- Craig: RStudio has a default password
- Craig: Cloud9 needs Java 8 to build Dataverse, and was also killed with OOM
- New ticket: NDS-640
- Craig: Redis endpoint shouldn't be external
- See NDS-621
- David: No NumPy in JupyterLab
- New ticket: NDS-647
- Mike: Kibana redirects to Grafana
- New ticket: NDS-649
- Craig: Chisel didn't work as expected.
- See NDS-646
Prognosis
So far, aside from a few minor issues, everything is running super smoothly.
Peak usage was measured at:
- 6% cluster memory usage
- 3-4% cluster CPU usage
Nearly every service possible was started at some point during 2-ish hours of testing, and only 2 or 3 services encountered the notorious "no data available" problem:
- PyCharm
- Jenkins
Overall, this is fantastic news for the stability of the platform. The testing has brought to light several issues that will need to be addressed.
Resulting Actions
Higher priority:
- NDS-464
- NDS-640
- NDS-621
- NDS-648
- NDS-173
Lower priority:
- NDS-646
- NDS-647
- NDS-645
- NDS-644
- NDS-649
Phase 2: Bug Party
Workbench Version
- 1.0.6
Participants
- Craig Willis
- David Raila
- Mike Lambert
- Michal
- Jing
- Sandeep
- Qiyue
- Marcus
Measurement Utilities
- Node Performance: https://kubedash.workbench.nationaldataservice.org/#!/
- Cluster Performance: https://grafana.workbench.nationaldataservice.org/dashboard/db/cluster
- Centralized Logging: https://kibana.workbench.nationaldataservice.org/
Results
- Michal: No indication of which fields are required for registration
- New ticket: NDS-661
- Michal: Needs to know what they are doing (i.e., Quickstart)
- See NDS-485
- David: Recommend whitelisting our site for / disabling pop-ups - can we detect this and make a recommendation to users without correct settings?
- New ticket: NDS-662
- Michal: couldn't sign up for DSpace - address in use
- This is a more general problem with any service that generates admin credentials... user should be directed to the Config page
- See NDS-560
- Jing: Docker image name validation is incomplete
- Underscore should be among accepted characters
- New ticket: NDS-663
- Jing: No indication of required fields during spec create?
- See NDS-661
- Mike: Saw a failure adding Sufia, only one time... next time it added properly
- Was not able to reliably recreate, and no error message given... will file a ticket if I see it again
- Jing: Custom service failed to start
- See JSON: https://nationaldataservice.slack.com/files/bodom0015/F2NG4BRHS/jings_service_error.txt
- This was due to her container running a single command and then stopping
- To the user, this appears to be a CrashLoop, even though the command has successfully run
- Need to discuss how to handle non-service containers... perhaps Kubernetes jobs instead of pods?
- New ticket: NDS-664
- Qiyue: No indication of which fields are required for registration
- See NDS-661
- Qiyue: How do we use different versions... for example: Cloud9 Java7 vs Cloud9 Java8
- New ticket: NDS-665
- Qiyue: What is the storage quota? 20GB
- See NDS-201
- Jing: Redis is missing an info link
- Marcus: NDS Confluence went down, as a result icons could not load
- See NDS-591
- Jing: Error messages are confusing - need to translate the error messages (or document them)
- New ticket: NDS-666
- Michal: Would it be better to have pre-populated instances?
- This would be nice, but may be difficult to handle programmatically in a general way
- Qiyue: Any plans to support Fortran?
- New ticket: NDS-667
- Mike: Kibana caused the following Nagios alerts to come from the LMA node:
- "workbench-lma/Load is WARNING:"
- "WARNING - load average: 8.94, 7.92, 6.52"
- New ticket: NDS-668
- Jing: Order of top menu – Catalog then Applications?
- New ticket: NDS-669
- Michal: Can I use this framework to compare Monte Carlo simulations?
- See NDS-664
- David: Green/red bars are too big or other parts of application UI are too small.
- ???
- David: Stopped "X" is confusing – thought it was delete
- New ticket: NDS-670
- Sandeep: Better way of differentiating user versus system specs (little icon isn't readily apparent)
- Sandeep: Help pages as Wiki isn't great – should be part of application
- See NDS-485
- Marcus: Not sure what to do (quickstart/tutorial)
- See NDS-485
- Marcus: Documentation isn't clear
- See NDS-485
- Marcus: Can I use this to launch Jupyter notebooks for BrownDog users?
- Yes
- Craig: iRODS problems (multiple volumes; CloudBrowser Zone)
- See NDS-654
- Craig: Multiple port problem
- See NDS-655
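For the image name validation issue raised by Jing above (NDS-663), a validator that accepts underscores might look roughly like the sketch below. It is not the Workbench's actual validation code. The pattern is loosely modeled on Docker's repository-name rules (lowercase alphanumerics separated by a period, one or two underscores, or dashes), and it deliberately ignores registry hosts, tags, and digests. The name `is_valid_repository_name` is hypothetical.

```python
import re

# One path component: lowercase alphanumerics, optionally joined by a single
# period, one or two underscores, or one or more dashes.
# Assumption: simplified from Docker's naming rules; no registry host, tag, or digest.
_COMPONENT = re.compile(r"[a-z0-9]+(?:(?:[._]|__|-+)[a-z0-9]+)*")


def is_valid_repository_name(name: str) -> bool:
    """Return True if every '/'-separated component of name matches the pattern."""
    components = name.split("/")
    # An empty component (leading, trailing, or doubled slash) fails fullmatch.
    return all(_COMPONENT.fullmatch(part) is not None for part in components)
```

Under this sketch, `my_image` and `ndslabs/cloud9-java` are accepted, while names with uppercase letters or a leading underscore are rejected.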
Prognosis
Resulting Actions
- NDS-201
- NDS-560
- NDS-591
- NDS-485
- NDS-654
- NDS-655
- NDS-661
- NDS-662
- NDS-663
- NDS-664
- NDS-665
- NDS-666
- NDS-667
- NDS-668
- NDS-669
- NDS-670