Details
- Story
- Resolution: Unresolved
- Normal
- None
- Labs Workbench - Beta
- None
- NDS Sprint 42, NDS Sprint 43
Description
The Workbench public beta has been online at SDSC for almost a year. Since launch, we have made two changes that would let us reduce this cluster's resource footprint:
- Reduced the required number of GLFS node replicas from 4 to 2
- Nodes can now carry multiple labels, which makes more efficient use of resources
The following resources are currently provisioned at SDSC:
Instance Name | Flavor | vCPUs (M) | RAM Size (GB) | Root Disk Size (GB) |
---|---|---|---|---|
workbench-master1 | m1.large | 2 | 8 | 20 |
workbench-loadbal | m1.large | 2 | 8 | 20 |
workbench-lma | r1.xlarge | 4 | 32 | 20 |
workbench-node1 | r1.xlarge | 4 | 32 | 20 |
workbench-node2 | r1.xlarge | 4 | 32 | 20 |
workbench-gfs1 | r1.large | 2 | 16 | 20 |
workbench-gfs2 | r1.large | 2 | 16 | 20 |
workbench-gfs3 | r1.large | 2 | 16 | 20 |
workbench-gfs4 | r1.large | 2 | 16 | 20 |
Total | — | 24 | 176 | 180 |
Planned changes:
- Cut 4 dedicated GLFS instances down to 2
- Remove the dedicated LMA node (it is only running Grafana)
After reprovision:
Instance Name | Flavor | vCPUs (M) | RAM Size (GB) | Root Disk Size (GB) |
---|---|---|---|---|
workbench-master1 | m1.large | 2 | 8 | 20 |
workbench-loadbal | m1.large | 2 | 8 | 20 |
workbench-node1 | r1.xlarge | 4 | 32 | 20 |
workbench-node2 | r1.xlarge | 4 | 32 | 20 |
workbench-gfs1 | r1.large | 2 | 16 | 20 |
workbench-gfs2 | r1.large | 2 | 16 | 20 |
Total | — | 16 | 112 | 120 |
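The totals in both tables can be cross-checked with a short script. This is only a sketch: the flavor sizes and instance names are copied from the tables above, while the names `Flavor`, `FLAVORS`, `CURRENT_INSTANCES`, `REMOVED` and `totals` are made up for illustration and are not part of any deploy tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flavor:
    vcpus: int
    ram_gb: int
    disk_gb: int


# Flavor sizes as listed in the tables above.
FLAVORS = {
    "m1.large": Flavor(vcpus=2, ram_gb=8, disk_gb=20),
    "r1.large": Flavor(vcpus=2, ram_gb=16, disk_gb=20),
    "r1.xlarge": Flavor(vcpus=4, ram_gb=32, disk_gb=20),
}

# Instances currently provisioned at SDSC.
CURRENT_INSTANCES = {
    "workbench-master1": "m1.large",
    "workbench-loadbal": "m1.large",
    "workbench-lma": "r1.xlarge",
    "workbench-node1": "r1.xlarge",
    "workbench-node2": "r1.xlarge",
    "workbench-gfs1": "r1.large",
    "workbench-gfs2": "r1.large",
    "workbench-gfs3": "r1.large",
    "workbench-gfs4": "r1.large",
}

# Instances absent from the "after reprovision" table.
REMOVED = {"workbench-lma", "workbench-gfs3", "workbench-gfs4"}


def totals(instances):
    """Return (vCPUs, RAM in GB, root disk in GB) summed over all instances."""
    flavors = [FLAVORS[name] for name in instances.values()]
    return (
        sum(f.vcpus for f in flavors),
        sum(f.ram_gb for f in flavors),
        sum(f.disk_gb for f in flavors),
    )


planned_instances = {
    name: flavor
    for name, flavor in CURRENT_INSTANCES.items()
    if name not in REMOVED
}

before = totals(CURRENT_INSTANCES)
after = totals(planned_instances)
saved = tuple(b - a for b, a in zip(before, after))

for label, (vcpus, ram, disk) in (("before", before), ("after", after), ("saved", saved)):
    print(f"{label:<7} vCPUs={vcpus} RAM={ram} GB disk={disk} GB")
```

The same mapping could be edited to try out the options listed for discussion, for example by dropping `workbench-loadbal` or assigning the compute nodes a different flavor, and re-running to see the effect on the totals.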
For discussion:
- Remove dedicated loadbal node?
- Resize compute nodes?
- Dedicated etcd node?
- Are there any other considerations that might have been missed?
Assume that GLFS and all running stacks will be purged. See also https://opensource.ncsa.illinois.edu/confluence/display/NDS/Beta+release+communication
Completion criteria:
- Send downtime announcement
- Back up etcd (ndslabs user data)
- Deploy new cluster based on above specs
- Integration tests pass
- Send resume announcement
- Tear down old cluster
This ticket is complete when the above reprovision has been discussed and executed.
Issue Links
- depends on
  - NDS-1125 API server errors and restarts while trying to shutdown inactive service (Open)
  - NDS-1133 Stack trace starting standard Docker application (Open)
  - NDS-1168 API server stack trace + crash + restart when starting toolmanager (Open)
  - NDS-1212 Clicking endpoint link creates onslaught of check_token calls (Open)
  - NDS-1130 Frequent 500 errors from /accounts/{account-id} (Open)
  - NDS-1213 Flannel subnet changes break networking (Resolved)
  - NDS-998 Intermittent etcd timeouts from apiserver (Resolved)
  - NDS-1199 MTU problems deploying at SDSC (Resolved)
  - NDS-1200 Deploy tools bug with conditional register (Resolved)
  - NDS-1201 Deploy tools has wrong ingress configuration (Resolved)
- duplicates
  - NDS-1173 Deploy 1.1 to beta (Closed)
- is related to
  - NDS-833 Fix problem with flavors when combining node function (Open)