Overview
This page records the results of the manual load testing done on the NDS Labs Workbench (Beta).
Objective
- Generate load on the system for a given number of users
- Monitor the system's resource utilization using Grafana
- This will give us a benchmark of the expected "load" on the cluster
- Take user feedback regarding general usability of the system under the desired load conditions
- This will let us know if the user experience has degraded due to any stress on the system
- Take note of how any node additions / removals affect resource constraints, and to what degree
- If the system's resources become constrained, add a node to the cluster to alleviate the constraint
- If the system is heavily over-provisioned, remove a node from the cluster to simulate a downed node
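The add / remove rule in the last two bullets can be sketched as a small decision function. This is only an illustration: the 80% and 20% thresholds and the name `scaling_action` are assumptions made for the example, not values from the test plan, and the utilization figures would be read manually from Grafana.

```python
def scaling_action(cpu_pct: float, mem_pct: float,
                   high: float = 80.0, low: float = 20.0) -> str:
    """Suggest a manual cluster action from observed utilization.

    `high` and `low` are hypothetical thresholds, in percent of
    total cluster capacity.
    """
    peak = max(cpu_pct, mem_pct)
    if peak >= high:
        return "add-node"     # resources are constrained
    if peak <= low:
        return "remove-node"  # heavily over-provisioned: simulate a downed node
    return "hold"


print(scaling_action(cpu_pct=85.0, mem_pct=40.0))
print(scaling_action(cpu_pct=5.0, mem_pct=8.0))
```

Using the larger of the two readings keeps the rule conservative: a node is only removed when both CPU and memory are nearly idle.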
Current Cluster Configuration
See inventory at: https://github.com/nds-org/ndslabs-deploy-tools/commit/d8d8ef30dac74b1fe84185c7abc6136516d60e7b
Resulting Actions
- 1 hour group testing
- 1 hour writing new issues
Phase 1: Labs Workbench + Management
Workbench Version
- 1.0.5
Participants
- Craig Willis
- David Raila
- Mike Lambert
Measurement Utilities
- https://kubedash.workbench.nationaldataservice.org/#!/
- https://grafana.workbench.nationaldataservice.org/dashboard/db/cluster
Results
- Mike: API server crashed with an unknown error shortly after beginning the test
- I started owncloud + cloudcmd + postgres x2 + mysql + dspace simultaneously
- Craig: server.go 1200 on latest (but which latest?) - changed to 1.0.5
- no stack trace, so no ticket filed... if we see it again we will address it
- David: Catalog links do not seem to work in Firefox
- See https://files.slack.com/files-pri/T16F0Q17E-F2MK2PPT9/firefox_error.txt
- Cached page?
- See NDS-173
- David: File Manager occasionally refuses to start in Chrome
- Popup blocker?
- David: HTTP Basic is old-timey and gross
- I agree.
- Mike: Redis has an HTTP endpoint?
- Mike: Jenkins encountered the "no data available" error
- Craig: PyCharm encountered the "no data available" error
- Process was hung and would not shut down
- See https://nationaldataservice.slack.com/files/craig-willis/F2MK9HJKY/example_error.txt
- See NDS-464
- Mike: Fedora Commons encountered SSL errors on the REST endpoint
- Mike: Clowder Digest Extractor label missing from dropdown
- Mike: Clowder starts slower than it used to - need to increase the readinessProbe delay / timeout to accommodate
- Mike: Clowder extractors / toolserver fail due to 401 (HTTP basic auth)
- PlantCV failed (the TERRA demo from NDSC5)
- See https://nationaldataservice.slack.com/files/bodom0015/F2MLHT2AZ/extractors-error.txt
- New ticket: NDS-646
- Craig: Somehow the endpoints are being returned as the home page?
- I have seen this intermittently, but am unable
- Craig: RStudio has a default password
- Craig: Cloud9 needs Java 8 to build Dataverse, and was also killed with OOM
- Craig: Redis endpoint shouldn't be external
- David: No numpy in JupyterLab
- Mike: Kibana redirects to Grafana
- Craig: Chisel didn't work as expected.
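For the note above about Clowder starting more slowly, a longer readiness delay might look like the following. This is a minimal sketch using Kubernetes-style probe field names; the path, port, and timing values are placeholders chosen for illustration, and the Workbench's own service spec format may name these fields differently.

```python
import json


def readiness_probe(path: str, port: int, initial_delay_s: int) -> dict:
    """Build a Kubernetes-style HTTP readinessProbe fragment."""
    return {
        "httpGet": {"path": path, "port": port},
        "initialDelaySeconds": initial_delay_s,  # wait before the first check
        "periodSeconds": 10,     # interval between checks
        "timeoutSeconds": 5,     # per-check timeout
        "failureThreshold": 3,   # consecutive failures before "not ready"
    }


# Hypothetical values: give a slow-starting service 60 s before probing.
probe = readiness_probe(path="/", port=9000, initial_delay_s=60)
print(json.dumps({"readinessProbe": probe}, indent=2))
```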
Prognosis
So far, aside from a few minor issues, everything is running super smoothly.
Peak usage was measured at:
- 6% cluster memory usage
- 3-4% cluster CPU usage
Nearly every service possible was started at some point during 2-ish hours of testing, and only 2 or 3 services encountered the notorious "no data available" problem:
- PyCharm
- Jenkins
Overall, this is fantastic news for the stability of the platform. The testing has brought to light several issues that will need to be addressed.
Resulting Actions
Higher priority:
- NDS-464
- NDS-640
- NDS-621
- NDS-648
- NDS-173
Lower priority:
- NDS-646
- NDS-647
- NDS-645
- NDS-644
- NDS-649
Phase 2: Bug Party
Workbench Version
- 1.0.6
Participants
- Craig Willis
- David Raila
- Mike Lambert
- Michal
- Jing
- Sandeep
- Qiyue
- Marcus
Measurement Utilities
- Node Performance: https://kubedash.workbench.nationaldataservice.org/#!/
- Cluster Performance: https://grafana.workbench.nationaldataservice.org/dashboard/db/cluster
- Centralized Logging: https://kibana.workbench.nationaldataservice.org/
Results
- Michal: No indication of which fields are required for registration
- Michal: Users need guidance on what to do (e.g., a Quickstart)
- David: Recommend that users whitelist our site for pop-ups or disable pop-up blocking - can we detect this and make a recommendation to users without the correct settings?
- Michal: couldn't sign up for DSpace - address in use
- Jing: Docker image name validation is incomplete
- Jing: No indication of required fields during spec creation?
- Mike: Saw a failure adding Sufia, only one time... next time it added properly
- Was not able to reliably recreate, and no error message given... will file a ticket if I see it again
- Jing: Custom service failed to start
- See JSON: https://nationaldataservice.slack.com/files/bodom0015/F2NG4BRHS/jings_service_error.txt
- This was due to her container running a single command and then stopping
- To the user, this appears to be a CrashLoop, even though the command has successfully run
- Need to discuss how to handle non-service containers... perhaps Kubernetes jobs instead of pods?
- New ticket: NDS-664
- Qiyue: No indication of which fields are required for registration
- Qiyue: How do we use different versions... for example: Cloud9 Java7 vs Cloud9 Java8
- Qiyue: What is the storage quota? 20GB
- Jing: Redis is missing an info link
- Marcus: NDS Confluence went down, as a result icons could not load
- Jing: Error messages are confusing - need to translate the error messages (or document them)
- Michal: would it be better to have pre-populated instances?
- This would be nice, but may be difficult to handle programmatically in a general way
- Qiyue: Any plans to support Fortran?
- Mike: Kibana caused the following nagios alerts to come from the LMA node:
- Jing: Order of top menu – Catalog then Applications?
- Michal: Can I use this framework to compare Monte Carlo simulations?
- David: Green/red bars are too big or other parts of application UI are too small.
- I would be happy to look over any UI mockups that you would be willing to provide
- David: Stopped "X" is confusing – thought it was delete
- Sandeep: Better way of differentiating user versus system specs (little icon isn't readily apparent)
- Sandeep: Help pages as Wiki isn't great – should be part of application
- Marcus: Not sure what to do (quickstart/tutorial)
- Marcus: Documentation isn't clear
- Marcus: Can I use this to launch Jupyter notebooks for BrownDog users?
- Labs Workbench is more for testing and development - publicly accessible services with real users are highly discouraged
- That being said, if users did want to use Workbench to spin up a personal notebook for their own private analysis, that would be highly encouraged
- Craig: iRODS problems (multiple volumes; CloudBrowser Zone)
- Craig: Multiple port problem
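For the run-to-completion container described above (the custom service that ran a single command and then appeared to crash-loop), one option raised was Kubernetes jobs instead of pods. The following is a minimal sketch of what such a Job manifest could look like, built as a Python dict. The job name, image, and command are hypothetical, and nothing here reflects how the Workbench API server actually creates resources.

```python
import json


def one_shot_job(name: str, image: str, command: list) -> dict:
    """Build a minimal Kubernetes Job manifest for a run-once container."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {"name": name, "image": image, "command": command}
                    ],
                    # Jobs only allow "Never" or "OnFailure", so a clean
                    # exit is recorded as completion rather than restarted.
                    "restartPolicy": "Never",
                }
            }
        },
    }


# Hypothetical example values.
job = one_shot_job("run-once", "busybox", ["echo", "done"])
print(json.dumps(job, indent=2))
```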
Prognosis
Aside from a slew of UX problems, the platform itself performed rather well!
Usage from 8 users peaked at:
- ~10% Memory
- ~6% CPU
This means that we should be able to easily support our target of 50 users.
Optimistically, assuming that Gluster doesn't fall over and that our usage scales fairly linearly with increasing users, these results mean that we might be able to support upward of 60 or 70 users simultaneously using the Beta cluster without needing to resize it.
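The linear extrapolation behind this estimate can be checked with a few lines of arithmetic. This sketch uses the Phase 2 peak figures (8 users, ~10% memory, ~6% CPU); the function name `max_users` and the 80% ceiling used to leave headroom are assumptions made for the example.

```python
def max_users(observed_users: int, peak_pct: float,
              ceiling_pct: float = 100.0) -> int:
    """Users supportable before a resource reaches `ceiling_pct`,
    assuming usage grows linearly with the number of users."""
    per_user_pct = peak_pct / observed_users
    return int(ceiling_pct // per_user_pct)


# Phase 2 peaks: 8 users at ~10% memory and ~6% CPU.
mem_limit = max_users(8, 10.0)                        # 1.25% memory per user
cpu_limit = max_users(8, 6.0)                         # 0.75% CPU per user
mem_with_headroom = max_users(8, 10.0, ceiling_pct=80.0)

print(mem_limit, cpu_limit, mem_with_headroom)
```

Memory is the tighter of the two resources. At full saturation the linear model gives 80 users, and with the assumed 80% ceiling it gives 64, which is in line with the 60 to 70 figure above.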
Resulting Actions
- NDS-201
- NDS-560
- NDS-591
- NDS-485
- NDS-654
- NDS-655
- NDS-661
- NDS-662
- NDS-663
- NDS-664
- NDS-665
- NDS-666
- NDS-667
- NDS-668
- NDS-669
- NDS-670