Monitoring, Backup/Disaster Recovery, Performance Testing, Capacity Planning

Created by Craig Willis, last modified on Sep 28, 2016

Monitoring

See NDS Labs Monitoring.

Backup/Disaster Recovery

GFS, etcd "best effort" for beta
Cluster config (using kubectl)
Deploy tools provisioning

Where? AWS, BW, TACC
How? Some script/Job/rsync
When? Daily rolling
Q
- Hot backup of DBs – backupz + sidecar
GFS backup options, depends on # of users
- Snapshots + diffs
- Checkpointing
- Replication to another GFS/geolocation
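The "daily rolling" schedule above could be sketched as a simple retention rule: keep the last N days of backups and prune anything older. This is only an illustration. The function name `backups_to_prune` and the 7-day default window are assumptions, not decisions recorded on this page, and the actual copy step (script/Job/rsync) is left out.

```python
from datetime import date, timedelta


def backups_to_prune(backup_dates, today, keep_days=7):
    """Return the backup dates that fall outside the rolling window.

    keep_days is a hypothetical retention length; backups taken on the
    most recent keep_days calendar days (including today) are kept.
    """
    cutoff = today - timedelta(days=keep_days)
    # Anything dated on or before the cutoff is older than the window.
    return sorted(d for d in backup_dates if d <= cutoff)


if __name__ == "__main__":
    today = date(2016, 9, 28)
    existing = [today - timedelta(days=i) for i in range(10)]
    for stale in backups_to_prune(existing, today):
        print("would delete backup from", stale.isoformat())
```

Keeping the rule as a pure function makes it easy to check before wiring it to whatever job ends up deleting old snapshots.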

Performance Testing

GFS
“iassist” redux

Est. per-user quotas?
What do we need on day 1?
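One rough way to approach the per-user quota question is plain arithmetic: usable capacity divided by the expected number of users. The sketch below assumes a replicated GFS volume (usable space is raw space divided by the replica count) and a reserved headroom fraction. All parameter names and the example numbers are hypothetical placeholders, not measurements.

```python
def per_user_quota_gb(raw_capacity_gb, replica_count, reserve_fraction, users):
    """Estimate an even per-user quota in GB.

    raw_capacity_gb  -- total brick capacity across the volume (assumed)
    replica_count    -- number of copies kept of each file (assumed)
    reserve_fraction -- share of usable space held back as headroom
    users            -- expected number of users
    """
    if users <= 0 or replica_count <= 0:
        raise ValueError("users and replica_count must be positive")
    if not 0 <= reserve_fraction < 1:
        raise ValueError("reserve_fraction must be in [0, 1)")
    usable_gb = raw_capacity_gb / replica_count * (1 - reserve_fraction)
    return usable_gb / users


if __name__ == "__main__":
    # Placeholder inputs only; substitute real capacity and user counts.
    quota = per_user_quota_gb(raw_capacity_gb=1000, replica_count=2,
                              reserve_fraction=0.2, users=10)
    print(f"approximate quota per user: {quota:.1f} GB")
```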

Open Questions

How many beta users?

What is the workload?

By what performance metrics do we judge pass/fail?
How do we learn our limits?
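Once a metric is chosen, a pass/fail check could be as simple as comparing a latency percentile against a threshold. The sketch below uses the nearest-rank percentile. The choice of percentile, the threshold, and the function names are all assumptions for illustration, since the metrics themselves are still an open question here.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Nearest-rank: smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def passes(samples_ms, pct, threshold_ms):
    """True if the chosen percentile is within the (hypothetical) threshold."""
    return percentile(samples_ms, pct) <= threshold_ms


if __name__ == "__main__":
    # Synthetic latencies in milliseconds, for demonstration only.
    latencies_ms = list(range(1, 101))
    print("p95:", percentile(latencies_ms, 95))
    print("pass at 100 ms:", passes(latencies_ms, 95, 100))
```

Running the same check at increasing load levels is one way to find where the limit is: the first load level that fails marks it.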

Capacity planning / monitoring

What happens when we need to:

add GFS bricks?
add Kubernetes nodes?

What constitutes a failure?

Dead node

Capacity Planning
