Overview
This page records the results of the manual load testing done on the NDS Labs Workbench (Beta).
Objective
- Generate load on the system for a given number of users
- Monitor the system's resource utilization using Grafana
- This will give us a benchmark of the expected "load" on the cluster
- Take user feedback regarding general usability of the system under the desired load conditions
- This will let us know if user performance has degraded due to any stress on the system
- Take note of how any node additions / removals affect resource constraints, and to what degree
- If the system's resources become constrained, add a node to the cluster to alleviate the constraint
- If the system has far more resources than the load requires, remove a node from the cluster to simulate a downed node
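The add / remove rule in the last two bullets can be sketched as a small decision function. This is only an illustration: the name `scaling_action` and both thresholds (80% and 20%) are assumptions made for the example, not values taken from this test.

```python
def scaling_action(cpu_pct: float, mem_pct: float,
                   high: float = 80.0, low: float = 20.0) -> str:
    """Suggest a manual cluster action from utilization percentages.

    `high` and `low` are hypothetical thresholds, not values from the test plan.
    """
    # Judge the cluster by whichever resource is under the most pressure.
    busiest = max(cpu_pct, mem_pct)
    if busiest >= high:
        return "add node"      # resources are constrained
    if busiest <= low:
        return "remove node"   # plenty of headroom: simulate a downed node
    return "no change"
```

The CPU and memory percentages would be read off the Grafana dashboards by hand; the function only makes the decision rule explicit.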
Current Cluster Configuration
See inventory at: https://github.com/nds-org/ndslabs-deploy-tools/commit/d8d8ef30dac74b1fe84185c7abc6136516d60e7b
Resulting Actions
1 hour group testing
1 hour writing new issues
Phase 1: Labs Workbench + Management
Workbench Version
Participants
- Craig Willis
- David Raila
- Mike Lambert
Measurement Utilities
Results
- Mike: API server crashed with an unknown error shortly after beginning the test
- I started owncloud + cloudcmd + postgres x2 + mysql + dspace simultaneously
- Craig: server.go 1200 on latest (but which latest?) - changed to 1.0.5
- no stack trace, so no ticket filed... if we see it again we will address it
- David: Catalog links do not seem to work in Firefox
- David: File Manager occasionally refuses to start in Chrome
- David: HTTP Basic is old-timey and gross
- Mike: Redis has an HTTP endpoint?
- See
- Mike: Jenkins encountered the "no data available" error
- Craig: PyCharm encountered the "no data available" error
- Mike: Fedora Commons encountered SSL errors on the REST endpoint
- SSL errors prevented CSS from rendering
- New ticket:
- Mike: Clowder Digest Extractor label missing from dropdown
- Mike: Clowder starts slower than it used to - need to bump up the readinessProbe to accommodate
- New ticket:
- Mike: Clowder extractors / toolserver fail due to 401 (HTTP basic auth)
- Craig: Somehow the endpoints are being returned as the home page?
- I have seen this intermittently, but am unable
- Craig: RStudio has a default password
- See
- New ticket:
- Craig: Cloud9 needs Java 8 to build Dataverse, and was also killed with OOM
- New ticket:
- Craig: Redis endpoint shouldn't be external
- See
- David: No NumPy in JupyterLab
- New ticket:
- Mike: Kibana redirects to Grafana
- New ticket:
- Craig: Chisel didn't work as expected.
- See
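The Clowder item above (slower startup, so the readinessProbe needs to be bumped up) comes down to how long a service is given before it is declared not ready. The sketch below illustrates that idea with a simple polling loop. It is not how Kubernetes or the Workbench implements readiness checks; the function name and the parameters `initial_delay`, `period`, and `timeout` are assumptions chosen for the example.

```python
import time
from typing import Callable


def wait_until_ready(check: Callable[[], bool],
                     initial_delay: float,
                     period: float,
                     timeout: float,
                     sleep: Callable[[float], None] = time.sleep,
                     clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll `check` until it returns True or `timeout` seconds have passed.

    All parameter names are hypothetical; `sleep` and `clock` are injectable
    so the loop can be exercised without real waiting.
    """
    sleep(initial_delay)            # give the service time to boot first
    deadline = clock() + timeout
    while True:
        if check():
            return True             # service answered: ready
        if clock() >= deadline:
            return False            # gave up: service reported as not ready
        sleep(period)
```

A service that needs 30 seconds to come up is reported as failed with a 10 second timeout and as ready with a 60 second timeout, which is the effect of raising the probe's limits for a slower-starting Clowder.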
Prognosis
So far, aside from a few minor issues, everything is running super smoothly.
Peak usage was measured at:
- 6% cluster memory usage
- 3-4% cluster CPU usage
Nearly every service possible was started at some point during 2-ish hours of testing, and only 2 or 3 services encountered the notorious "no data available" problem.
Overall, this is fantastic news for the stability of the platform. The testing has brought to light several issues that will need to be addressed.
Resulting Actions
Higher priority:
Lower priority:
Phase 2: Bug Party
Workbench Version
Participants
- Craig Willis
- David Raila
- Mike Lambert
- Michal
- Jing
- Sandeep
- Qiyue
- Marcus
Measurement Utilities
Results
- Michal: No indication of which fields are required for registration
- Michal: Needs to know what they are doing (i.e., Quickstart)
- Mike: I have a better UI design for the catalog to propose :X
- David: Recommend whitelisting our site for / disabling pop-ups - can we detect this and make a recommendation to users without correct settings?
- Michal: Couldn't sign up for DSpace - address in use
- This is a more general problem with any service that generates admin credentials... user should be directed to the Config page
- Jing: Docker image name validation is incomplete
- Underscore should be among accepted characters
- Jing: No indication of required fields during spec create?
- Mike: Saw a failure adding Sufia, only one time... next time it added properly
- Jing: Custom service failed to start
- Qiyue: No indication of which fields are required for registration
- Qiyue: How do we use different versions... for example: Cloud9 Java7 vs Cloud9 Java8
- Qiyue: What is the storage quota? 20GB
- Jing: Redis is missing an info link
- Marcus: NDS Confluence went down, as a result icons could not load
- Jing: Error messages are confusing - need to translate the error messages (or document them)
- Michal: Would it be better to have pre-populated instances?
- This would be nice, but may be difficult to handle programmatically in a general way
- Qiyue: Any plans to support Fortran?
- Mike: Kibana caused the following Nagios alerts to come from the LMA node:
- "workbench-lma/Load is WARNING:"
- "WARNING - load average: 8.94, 7.92, 6.52"
- Jing: Order of top menu – Catalog then Applications?
- Michal: Can I use this framework to compare Monte Carlo simulations?
- David: Green/red bars are too big or other parts of application UI are too small.
- David: Stopped "X" is confusing – thought it was delete
- Sandeep: Better way of differentiating user versus system specs (little icon isn't readily apparent)
- Sandeep: Help pages as Wiki isn't great – should be part of application
- Marcus: Not sure what to do (quickstart/tutorial)
- Marcus: Documentation isn't clear
- Marcus: Can I use this to launch Jupyter notebooks for BrownDog users?
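Jing's report above that image name validation rejects underscores can be made concrete with a small validator. This is a simplified sketch modeled on Docker's naming rules for repository path components (lowercase alphanumerics separated by a period, one or two underscores, or dashes), plus an optional tag. It is not the Workbench's actual validation code, it does not handle registry hosts with ports or digests, and the name `is_valid_image_name` is an assumption.

```python
import re

# One path component: lowercase alphanumerics joined by a single '.',
# one or two '_', or one or more '-'.
_COMPONENT = r"[a-z0-9]+(?:(?:\.|_{1,2}|-+)[a-z0-9]+)*"
# Optional tag: starts with a word character, then up to 127 more
# word characters, dots, or dashes.
_TAG = r"[A-Za-z0-9_][A-Za-z0-9_.-]{0,127}"
_IMAGE_RE = re.compile(rf"{_COMPONENT}(?:/{_COMPONENT})*(?::{_TAG})?")


def is_valid_image_name(name: str) -> bool:
    """Return True if `name` looks like `repo/name[:tag]` (simplified rules)."""
    return _IMAGE_RE.fullmatch(name) is not None
```

Under these rules a name such as `ndslabs/cloud9_java:latest` is accepted, while names with uppercase letters or a leading underscore are rejected.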
Prognosis
Resulting Actions