Monitoring
See NDS Labs Monitoring.
Backup/Disaster Recovery
- GFS, etcd: "best effort" for beta
- Cluster config (using kubectl)
- Deploy tools provisioning
- Where? AWS, BW, TACC
- How? Some script/Job/rsync
- When? Daily rolling
- Questions:
- Hot backup of DBs – backupz + sidecar
- GFS backup options (depend on number of users):
  - Snapshots + diffs
  - Checkpointing
  - Replication to another GFS/geolocation
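
The "How? Some script/Job/rsync" and "When? Daily rolling" items could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the absolute paths, and the seven-day window are hypothetical, and nothing here is the chosen backup design. It uses rsync's `--link-dest` option so that each daily snapshot stores only the files that changed since the previous one.

```python
from datetime import date, timedelta


def rsync_command(source, backup_root, today, previous=None):
    """Build an rsync argv that writes a dated snapshot directory.

    Assumes absolute paths; all names here are hypothetical.
    """
    dest = f"{backup_root}/{today.isoformat()}/"
    cmd = ["rsync", "-a", "--delete"]
    if previous is not None:
        # Files unchanged since the previous snapshot become hard links,
        # so each daily snapshot only costs the size of the diff.
        cmd.append(f"--link-dest={backup_root}/{previous.isoformat()}/")
    cmd += [source, dest]
    return cmd


def snapshots_to_prune(snapshot_dates, keep=7):
    """Return the dates that fall outside the rolling window of `keep` newest."""
    newest_first = sorted(snapshot_dates, reverse=True)
    return sorted(newest_first[keep:])


if __name__ == "__main__":
    today = date.today()
    print(" ".join(rsync_command("/data/", "/backup", today,
                                 today - timedelta(days=1))))
```

The command is built as a list so that a daily Job could pass it to `subprocess.run` without shell quoting problems; pruning is kept as a separate pure function so the retention window can be tested without touching storage.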
Performance Testing
- GFS
- “iassist” redux
- Est. per-user quotas?
- What do we need on day 1?
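
One possible starting point for the per-user quota question is simple arithmetic on raw capacity. The sketch below assumes the GFS volume is replicated and that a fraction of usable space is held back as headroom; the function name, the replica count, the reserve fraction, and every number in the example are hypothetical placeholders, not measured values.

```python
def per_user_quota_gb(raw_capacity_gb, replica_count, expected_users,
                      reserve_fraction=0.2):
    """Estimate an even per-user quota in GB.

    Assumes a replicated volume (usable = raw / replicas) and a reserve
    kept free for growth and rebalancing. All inputs are placeholders.
    """
    if replica_count < 1 or expected_users < 1:
        raise ValueError("replica_count and expected_users must be >= 1")
    if not 0 <= reserve_fraction < 1:
        raise ValueError("reserve_fraction must be in [0, 1)")
    usable_gb = raw_capacity_gb / replica_count
    allocatable_gb = usable_gb * (1 - reserve_fraction)
    return allocatable_gb / expected_users


if __name__ == "__main__":
    # Hypothetical example: 10 TB raw, 2 replicas, 50 beta users.
    print(per_user_quota_gb(10_000, 2, 50))
```

An even split is the simplest policy; it would need revisiting once the number of beta users and the workload are known.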
Open Questions
- How many beta users?
- What is the workload?
- By what performance metrics do we judge pass/fail?
- How do we learn our limits?
- Capacity planning / monitoring
- What happens when we need to:
  - Add GFS bricks?
  - Add Kubernetes nodes?
- What constitutes a failure?
  - Dead node
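
As one way to make "dead node" concrete, a check could treat any node whose `Ready` condition is not `"True"` as failed. The sketch below parses the JSON produced by `kubectl get nodes -o json`; the function name is hypothetical, and equating "dead" with "not Ready" is an assumption rather than an agreed definition.

```python
import json


def not_ready_nodes(node_list_json):
    """Return names of nodes whose Ready condition is missing or not "True".

    Expects the JSON text printed by `kubectl get nodes -o json`.
    """
    nodes = json.loads(node_list_json)["items"]
    dead = []
    for node in nodes:
        conditions = node.get("status", {}).get("conditions", [])
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        # A status of "False" or "Unknown" both count as not ready here.
        if ready is None or ready.get("status") != "True":
            dead.append(node["metadata"]["name"])
    return dead


if __name__ == "__main__":
    import subprocess
    output = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    print(not_ready_nodes(output))
```

Counting `Unknown` as a failure is deliberately conservative: a node that has stopped reporting is indistinguishable from a dead one until it comes back.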