Overview
This page records the results of the manual load testing performed on the NDS Labs Workbench (Beta).
Objective
- Generate load on the system for a given number of users
- Monitor the system's resource utilization using Grafana
- This will give us a benchmark of the expected "load" on the cluster
- Take user feedback regarding general usability of the system under the desired load conditions
- This will let us know whether the user experience has degraded due to any stress on the system
- Take note of how any node additions / removals affect resource constraints, and to what degree
- If the system's resources become constrained, add a node to the cluster to alleviate the constraint
- If the system has far more resources than it needs, remove a node from the cluster to simulate a downed node
Current Cluster Configuration
See inventory at: https://github.com/nds-org/ndslabs-deploy-tools/commit/d8d8ef30dac74b1fe84185c7abc6136516d60e7b
Resulting Actions
1 hour group testing
1 hour writing new issues
Phase 1: Labs Workbench + Management
Workbench Version
Participants
- Craig Willis
- David Raila
- Mike Lambert
Measurement Utilities
Results
- Mike: API server crashed with an unknown error shortly after beginning the test
- I started owncloud + cloudcmd + postgres x2 + mysql + dspace simultaneously
- Craig: server.go 1200 on latest (but which latest?) - changed to 1.0.5
- no stack trace, so no ticket filed... if we see it again we will address it
- David: Catalog links do not seem to work in Firefox
- David: File Manager occasionally refuses to start in Chrome
- David: HTTP Basic is old-timey and gross
- Mike: Redis has an HTTP endpoint?
- See
- Mike: Jenkins encountered the "no data available" error
- Craig: PyCharm encountered the "no data available" error
- Mike: Fedora Commons encountered SSL errors on the REST endpoint
- SSL errors prevented CSS from rendering
- New ticket:
- Mike: Clowder Digest Extractor label missing from dropdown
- Mike: Clowder starts slower than it used to - need to bump up the readinessProbe to accommodate
- New ticket:
- Mike: Clowder extractors / toolserver fail due to 401 (HTTP basic auth)
- Craig: Somehow the endpoints are being returned as the home page?
- I have seen this intermittently, but am unable
- Craig: RStudio has a default password
- See
- New ticket:
- Craig: Cloud9 needs Java 8 to build Dataverse; it was also killed with OOM
- New ticket:
- Craig: Redis endpoint shouldn't be external
- See
- David: No numpy in JupyterLab
- New ticket:
- Mike: Kibana redirects to Grafana
- New ticket:
- Craig: Chisel didn't work as expected.
- See
Prognosis
So far, aside from a few minor issues, everything is running super smoothly.
Peak usage was measured at:
- 6% cluster memory usage
- 3-4% cluster CPU usage
Nearly every service possible was started at some point during 2-ish hours of testing, and only 2 or 3 services encountered the notorious "no data available" problem:
Overall, this is fantastic news for the stability of the platform. The testing has brought to light several issues that will need to be addressed.
Resulting Actions
Higher priority:
Lower priority:
Phase 2: Bug Party
Workbench Version
Participants
- Craig Willis
- David Raila
- Mike Lambert
- Michal
- Jing
- Sandeep
- Qiyue
- Marcus
Measurement Utilities
Results
- Michal: No indication of which fields are required for registration
- New ticket:
- Michal: Needs to know what they are doing (i.e., Quickstart)
- See
- David: Recommend that users whitelist our site for pop-ups or disable pop-up blocking - can we detect this and make a recommendation to users without the correct settings?
- New ticket:
- Michal: Couldn't sign up for DSpace - address in use
- This is a more general problem with any service that generates admin credentials... user should be directed to the Config page
- See
- Jing: Docker image name validation is incomplete
- Underscore should be among accepted characters
- New ticket:
- Jing: No indication of required fields during spec create?
- See
- Mike: Saw a failure adding Sufia, only one time... next time it added properly
- Was not able to reliably recreate, and no error message given... will file a ticket if I see it again
- Jing: Custom service failed to start
- Qiyue: No indication of which fields are required for registration
- See
- Qiyue: How do we use different versions... for example: Cloud9 Java7 vs Cloud9 Java8
- New ticket:
- Qiyue: What is the storage quota? 20GB
- See
- Jing: Redis is missing an info link
- Marcus: NDS Confluence went down, as a result icons could not load
- See
- Jing: Error messages are confusing - need to translate the error messages (or document them)
- New ticket:
- Michal: Would it be better to have pre-populated instances?
- This would be nice, but may be difficult to handle programmatically in a general way
- Qiyue: Any plans to support Fortran?
- New ticket:
- Mike: Kibana caused the following nagios alerts to come from the LMA node:
- "workbench-lma/Load is WARNING:"
- "WARNING - load average: 8.94, 7.92, 6.52"
- New ticket:
- Jing: Order of top menu – Catalog then Applications?
- New ticket:
- Michal: Can I use this framework to compare Monte Carlo simulations?
- See
- David: Green/red bars are too big or other parts of application UI are too small.
- I would be happy to look over any UI mockups that you would be willing to provide
- David: Stopped "X" is confusing – thought it was delete
- New ticket:
- Sandeep: Better way of differentiating user versus system specs (little icon isn't readily apparent)
- Sandeep: Hosting the help pages on a wiki isn't great – they should be part of the application
- See
- Marcus: Not sure what to do (quickstart/tutorial)
- See
- Marcus: Documentation isn't clear
- See
- Marcus: Can I use this to launch Jupyter notebooks for BrownDog users?
- Labs Workbench is more for testing and development - publicly accessible services with real users are highly discouraged
- That being said, if users did want to use Workbench to spin up a personal notebook for their own private analysis, that would be highly encouraged
- Craig: iRODS problems (multiple volumes; CloudBrowser Zone)
- See
- Craig: Multiple port problem
- See
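Jing's note above about Docker image name validation (underscores should be accepted) can be illustrated with a small check. This is a minimal sketch, not the Workbench's actual validator: the function name `is_valid_image_name` is hypothetical, and the patterns are modeled on the path-component grammar of Docker's image reference format (lowercase alphanumerics separated by a period, one or two underscores, or dashes). Registry hosts with ports and digests are deliberately not handled here.

```python
import re

# One path component: lowercase alphanumerics, optionally joined by
# ".", "_", "__", or one or more "-" (assumption: simplified from the
# Docker reference grammar).
_COMPONENT = r"[a-z0-9]+(?:(?:[._]|__|-+)[a-z0-9]+)*"
_NAME_RE = re.compile(rf"{_COMPONENT}(?:/{_COMPONENT})*")
# Tag: up to 128 characters, must not start with "." or "-".
_TAG_RE = re.compile(r"[A-Za-z0-9_][A-Za-z0-9_.-]{0,127}")


def is_valid_image_name(image: str) -> bool:
    """Return True if `image` looks like `name[:tag]` (hypothetical helper)."""
    name, sep, tag = image.partition(":")
    if sep and not _TAG_RE.fullmatch(tag):
        return False
    return bool(_NAME_RE.fullmatch(name))


if __name__ == "__main__":
    for candidate in ["my_image", "team/my_image:1.0", "My_Image", "_image"]:
        print(candidate, is_valid_image_name(candidate))
```

Under these assumptions, names containing single or double underscores between alphanumeric runs are accepted, while uppercase letters and leading separators are rejected.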
Prognosis
Aside from a slew of UX problems, the platform itself performed rather well!
Usage from 8 users peaked at:
This means that we should be able to easily support our target of 50 users.
Optimistically, assuming that gluster doesn't fall over and that our usage scales fairly linearly with increasing users, these results mean that we might be able to support upward of 60 or 70 users simultaneously using the Beta cluster without needing to resize it.
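The linear-scaling assumption above can be written out as a quick back-of-the-envelope calculation. The Phase 2 peak figures are not recorded on this page, so this sketch plugs in the Phase 1 numbers instead (3 participants, 6% memory, and the upper end of 3-4% CPU). The function name `estimate_max_users` and the `ceiling_percent` parameter are hypothetical, and Phase 1 testers started nearly every service, so the per-user load here is likely heavier than typical use.

```python
import math


def estimate_max_users(observed_users, peak_usage_percent, ceiling_percent=100.0):
    """Linearly extrapolate how many users fit under a usage ceiling.

    observed_users: number of users active during the measurement.
    peak_usage_percent: mapping of resource name -> peak usage in percent.
    ceiling_percent: usage level we are willing to reach (assumption).
    """
    if observed_users <= 0:
        raise ValueError("observed_users must be positive")
    if any(p <= 0 for p in peak_usage_percent.values()):
        raise ValueError("peak usage values must be positive")
    # The most constrained resource determines the limit.
    return min(
        math.floor(observed_users * ceiling_percent / peak)
        for peak in peak_usage_percent.values()
    )


if __name__ == "__main__":
    phase1 = {"memory": 6.0, "cpu": 4.0}  # peak usage from Phase 1
    print(estimate_max_users(3, phase1))                        # 50
    print(estimate_max_users(3, phase1, ceiling_percent=80.0))  # 40
```

Memory is the limiting resource in this example. Leaving headroom below 100% (the second call) gives a more conservative estimate, which matters if storage or other components degrade before CPU and memory are exhausted.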
Resulting Actions