Monitoring
- Qualys (vulnerability)
- Load balancer, Nginx controller
- Nagios
- Need to understand
  - Where? AWS, TACC, ISDA instance
  - Who gets notified?
  - When does it run?
- Kube tools/Prometheus
- Log aggregation
- Healthz on all services?
- Priorities
  - Ingress (Nginx): using default backend 404
  - Web UI / API (Kube API / etcd availability)
  - Kube system (GFS, etc.)
  - OpenStack
  - Backups
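
A minimal sketch of what a "healthz on all services" check could look like, assuming each service exposes an HTTP health URL and that a 404 from the ingress default backend is read as "ingress is serving". The function names, the endpoint map, and the expected status codes are all hypothetical; nothing here is an agreed design.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=5.0):
    """Return the HTTP status code for url, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses still prove that something answered.
        return err.code
    except (urllib.error.URLError, OSError):
        return None


def check_services(endpoints, fetch=fetch_status):
    """Map each service name to True if its URL returned the expected status.

    endpoints: {name: (url, expected_status)} -- a hypothetical layout.
    fetch: injectable so the check can be exercised without a network.
    """
    return {
        name: fetch(url) == expected
        for name, (url, expected) in endpoints.items()
    }
```

A Nagios or Prometheus check could wrap `check_services` and alert on any `False` value; who gets notified and how often it runs are still the open questions listed above.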
Backup/Disaster Recovery
- GFS, etcd "best effort" for beta
- Cluster config (using kubectl)
- Deploy tools provisioning
- Where? AWS, BW, TACC
- How? Some script/Job/rsync
- When? Daily rolling
- Questions
  - Hot backup of DBs – backupz + sidecar
  - GFS backup options, depends on number of users
    - Snapshots + diffs
    - Checkpointing
    - Replication to another GFS/geolocation
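
One way the "daily rolling" idea could be sketched: name each backup by date and prune everything older than the newest N. The `backup-YYYY-MM-DD.tar.gz` naming scheme and the default retention of 7 are assumptions for illustration only; the actual copy step (script, Job, or rsync) is left out.

```python
from datetime import date, datetime

PREFIX = "backup-"   # hypothetical naming scheme: backup-YYYY-MM-DD.tar.gz
SUFFIX = ".tar.gz"


def backup_name(day):
    """Return the archive name for a given date."""
    return f"{PREFIX}{day.isoformat()}{SUFFIX}"


def parse_day(name):
    """Return the date encoded in name, or None if it is not a backup file."""
    if not (name.startswith(PREFIX) and name.endswith(SUFFIX)):
        return None
    stamp = name[len(PREFIX):-len(SUFFIX)]
    try:
        return datetime.strptime(stamp, "%Y-%m-%d").date()
    except ValueError:
        return None


def backups_to_prune(names, keep=7):
    """Return backup names older than the newest `keep`, oldest first."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    dated = []
    for name in names:
        day = parse_day(name)
        if day is not None:  # ignore files that are not backups
            dated.append((day, name))
    dated.sort()
    return [name for _, name in dated[:-keep]]
```

A daily job could create `backup_name(date.today())`, then delete whatever `backups_to_prune` returns for the target directory listing.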
Performance Testing