Overview
This page records the results of the manual load testing done on the NDS Labs Workbench (Beta).
Objective
- Generate load on the system for a given number of users
- Monitor the system's resource utilization using Grafana
- This will give us a benchmark of the expected "load" on the cluster
- Take user feedback regarding general usability of the system under the desired load conditions
- This will let us know if user performance has degraded due to any stress on the system
- Take note of how any node additions / removals affect resource constraint, and to what degree
- If the system's resources become constrained, add a node to the cluster to alleviate the constraint
- If the system has far more resources than it needs, remove a node from the cluster to simulate a downed node
Current Cluster Configuration
See inventory at: https://github.com/nds-org/ndslabs-deploy-tools/commit/d8d8ef30dac74b1fe84185c7abc6136516d60e7b
Resulting Actions
1 hour group testing
1 hour writing new issues
Phase 1: Labs Workbench + Management
Workbench Version
- 1.0.5
Participants
- Craig Willis
- David Raila
- Mike Lambert
Measurement Utilities
- https://kubedash.workbench.nationaldataservice.org/#!/
- https://grafana.workbench.nationaldataservice.org/dashboard/db/cluster
Results
- Mike: API server crashed with an unknown error shortly after beginning the test
- I started owncloud + cloudcmd + postgres x2 + mysql + dspace simultaneously
- Craig: server.go 1200 on latest (but which latest?) - changed to 1.0.5
- no stack trace, so no ticket filed... if we see it again we will address it
- David: Catalog links do not seem to work in Firefox
- See https://files.slack.com/files-pri/T16F0Q17E-F2MK2PPT9/firefox_error.txt
- Cached page?
- See NDS-173
- David: File Manager occasionally refuses to start in Chrome
- Popup blocker?
- David: HTTP Basic is old-timey and gross
- I agree.
- Mike: Redis has an HTTP endpoint?
- See NDS-621
- Mike: Jenkins encountered the "no data available" error
- See https://nationaldataservice.slack.com/files/bodom0015/F2MKNMY2E/jenkins_error.txt
- See NDS-464
- Craig: PyCharm encountered the "no data available" error
- Process was hung and would not shut down
- See https://nationaldataservice.slack.com/files/craig-willis/F2MK9HJKY/example_error.txt
- See NDS-464
- Mike: Fedora Commons encountered SSL errors on the REST endpoint
- SSL errors prevented CSS from rendering
- New ticket: NDS-644
- Mike: Clowder Digest Extractor label missing from dropdown
- Mike: Clowder starts slower than it used to - need to bump up the readinessProbe to accommodate
- New ticket: NDS-645
- Mike: Clowder extractors / toolserver fails due to 401 (HTTP basic auth)
- PlantCV failed (the TERRA demo from NDSC5)
- See https://nationaldataservice.slack.com/files/bodom0015/F2MLHT2AZ/extractors-error.txt
- New ticket: NDS-646
- Craig: Somehow the endpoints are being returned as the home page?
- I have seen this intermittently, but am unable
- Craig: RStudio has a default password
- Craig: Cloud9 needs Java 8 to build Dataverse, and was also killed with OOM
- New ticket: NDS-640
- Craig: Redis endpoint shouldn't be external
- See NDS-621
- David: No numpy in JupyterLab
- New ticket: NDS-647
- Mike: Kibana redirects to Grafana
- New ticket: NDS-649
- Craig: Chisel didn't work as expected.
- See NDS-646
Prognosis
So far, aside from a few minor issues, everything is running super smoothly.
Peak usage was measured at:
- 6% cluster memory usage
- 3-4% cluster CPU usage
Nearly every service possible was started at some point during 2-ish hours of testing, and only 2 or 3 services encountered the notorious "no data available" problem:
- PyCharm
- Jenkins
Overall, this is fantastic news for the stability of the platform. The testing has brought to light several issues that will need to be addressed.
Resulting Actions
Higher priority:
- NDS-464
- NDS-640
- NDS-621
- NDS-648
- NDS-173
Lower priority:
- NDS-646
- NDS-647
- NDS-645
- NDS-644
- NDS-649
Phase 2: Bug Party
Workbench Version
- 1.0.6
Participants
- Craig Willis
- David Raila
- Mike Lambert
- Michal
- Jing
- Sandeep
- Qiyue
- Marcus
Measurement Utilities
- Node Performance: https://kubedash.workbench.nationaldataservice.org/#!/
- Cluster Performance: https://grafana.workbench.nationaldataservice.org/dashboard/db/cluster
- Centralized Logging: https://kibana.workbench.nationaldataservice.org/
Results
- Michal: No indication of which fields are required for registration
- New ticket: NDS-661
- Michal: Needs to know what they are doing (i.e., Quickstart)
- See NDS-485
- David: Recommend whitelisting our site for / disabling pop-ups - can we detect this and make a recommendation to users without correct settings?
- New ticket: NDS-662
- Michal: couldn't sign up for DSpace - address in use
- This is a more general problem with any service that generates admin credentials... user should be directed to the Config page
- See NDS-560
- Jing: Docker image name validation is incomplete
- Underscore should be among accepted characters
- New ticket: NDS-663
- Jing: No indication of required fields during spec create?
- See NDS-661
- Mike: Saw a failure adding Sufia, only one time... next time it added properly
- Was not able to reliably recreate, and no error message given... will file a ticket if I see it again
- Jing: Custom service failed to start
- See JSON: https://nationaldataservice.slack.com/files/bodom0015/F2NG4BRHS/jings_service_error.txt
- This was due to her container running a single command and then stopping
- To the user, this appears to be a CrashLoop, even though the command has successfully run
- Need to discuss how to handle non-service containers... perhaps Kubernetes jobs instead of pods?
- New ticket: NDS-664
- Qiyue: No indication of which fields are required for registration
- See NDS-661
- Qiyue: How do we use different versions... for example: Cloud9 Java7 vs Cloud9 Java8
- New ticket: NDS-665
- Qiyue: What is the storage quota? 20GB
- See NDS-201
- Jing: Redis is missing an info link
- Marcus: NDS Confluence went down, as a result icons could not load
- See NDS-591
- Jing: Error messages are confusing - need to translate the error messages (or document them)
- New ticket: NDS-666
- Michal: Would it be better to have pre-populated instances?
- This would be nice, but may be difficult to handle programmatically in a general way
- Qiyue: Any plans to support Fortran?
- New ticket: NDS-667
- Mike: Kibana caused the following Nagios alerts to come from the LMA node:
- "workbench-lma/Load is WARNING:"
- "WARNING - load average: 8.94, 7.92, 6.52"
- New ticket: NDS-668
- Jing: Order of top menu – Catalog then Applications?
- New ticket: NDS-669
- Michal: Can I use this framework to compare Monte Carlo simulations?
- See NDS-664
- David: Green/red bars are too big or other parts of application UI are too small.
- I would be happy to look over any UI mockups that you would be willing to provide
- David: Stopped "X" is confusing – thought it was delete
- New ticket: NDS-670
- Sandeep: Better way of differentiating user versus system specs (little icon isn't readily apparent)
- Sandeep: Help pages as Wiki isn't great – should be part of application
- See NDS-485
- Marcus: Not sure what to do (quickstart/tutorial)
- See NDS-485
- Marcus: Documentation isn't clear
- See NDS-485
- Marcus: Can I use this to launch Jupyter notebooks for BrownDog users?
- Labs Workbench is more for testing and development - publicly accessible services with real users are highly discouraged
- That being said, if users did want to use Workbench to spin up a personal notebook for their own private analysis, that would be highly encouraged
- Craig: iRODS problems (multiple volumes; CloudBrowser Zone)
- See NDS-654
- Craig: Multiple port problem
- See NDS-655
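One of the results above notes that Docker image name validation should accept underscores. As a rough illustration only, the check could look something like the sketch below. It is not the Workbench's actual validator: the function name `is_valid_image_name` is hypothetical, the pattern is a simplified reading of Docker's repository naming rules (lowercase alphanumerics separated by a period, one or two underscores, or hyphens), and registry hosts with ports and digests are deliberately not handled.

```python
import re

# Simplified, assumed pattern for one path component of an image name:
# lowercase alphanumerics, optionally joined by ".", "_", "__", or hyphens.
COMPONENT = r"[a-z0-9]+(?:(?:[._]|__|-+)[a-z0-9]+)*"
# Assumed pattern for an optional tag such as ":latest" or ":1.0.6".
TAG = r"[A-Za-z0-9_][A-Za-z0-9_.-]{0,127}"
IMAGE_NAME = re.compile(rf"{COMPONENT}(?:/{COMPONENT})*(?::{TAG})?")


def is_valid_image_name(name: str) -> bool:
    """Return True if name looks like repo[/repo...][:tag] (hypothetical helper)."""
    return IMAGE_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["my_image", "ndslabs/cloud9_java:latest", "My_Image", "_bad"]:
        print(candidate, is_valid_image_name(candidate))
```

Under these assumptions, names containing single or double underscores between alphanumeric runs are accepted, while uppercase names and names starting with a separator are rejected.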
Prognosis
Aside from a slew of UX problems, the platform itself performed rather well!
Usage from 8 users peaked at:
- ~10% Memory
- ~6% CPU
This means that we should be able to easily support our target of 50 users.
Optimistically, assuming that gluster doesn't fall over and that our usage scales fairly linearly with increasing users, these results mean that we might be able to support upward of 60 or 70 users simultaneously using the Beta cluster without needing to resize it.
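The arithmetic behind that estimate can be sketched as follows. This is only a back-of-the-envelope illustration that takes the linear-scaling assumption at face value; the function name `estimate_capacity` and the 80% utilization ceiling used in the second call are assumptions made for this example, not measured limits.

```python
def estimate_capacity(observed_users, peak_usage_pct, ceiling_pct=100.0):
    """Largest user count that keeps every resource at or under ceiling_pct,
    assuming usage grows linearly with the number of users."""
    limits = []
    for resource, pct in peak_usage_pct.items():
        per_user_pct = pct / observed_users  # e.g. 10% / 8 users = 1.25% each
        limits.append(int(ceiling_pct // per_user_pct))
    # The scarcest resource determines the overall limit.
    return min(limits)


# Approximate Phase 2 peaks from 8 users, as reported above.
phase2_peaks = {"memory": 10.0, "cpu": 6.0}

print(estimate_capacity(8, phase2_peaks))                    # 80
print(estimate_capacity(8, phase2_peaks, ceiling_pct=80.0))  # 64
```

Memory is the binding resource here. Leaving roughly 20% headroom (an assumed margin) gives a figure in the same range as the 60 to 70 users quoted above.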
Resulting Actions
- NDS-201
- NDS-560
- NDS-591
- NDS-485
- NDS-654
- NDS-655
- NDS-661
- NDS-662
- NDS-663
- NDS-664
- NDS-665
- NDS-666
- NDS-667
- NDS-668
- NDS-669
- NDS-670