Monitoring
Qualys
Qualys is used by NCSA IT for vulnerability assessment and management. Qualys requires SSH access to any public-facing host or service, which will likely mean the load balancer host and the Nginx ingress controller container.
- Create SSH keypair
- Open SSH access to NCSA Qualys server (IP)
- Create non-root user
- Install Qualys client?
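The access steps above could be scripted roughly as follows. This is only a sketch: the user name `qualys`, the key path, the RSA key type, and the use of `iptables` are all assumptions, and the scanner address is a placeholder since the real NCSA Qualys server IP is not recorded here. The function only builds the commands; it does not run them.

```python
import shlex


def qualys_access_commands(scanner_ip, user="qualys", key_path="./qualys_id_rsa"):
    """Build (but do not execute) commands for the SSH-access steps.

    All defaults are hypothetical; adjust to whatever NCSA IT specifies.
    """
    return [
        # Create an SSH keypair (RSA 4096, empty passphrase) for the scan account.
        ["ssh-keygen", "-t", "rsa", "-b", "4096", "-N", "", "-C", user, "-f", key_path],
        # Open SSH (TCP 22) only to the Qualys scanner address.
        ["iptables", "-A", "INPUT", "-p", "tcp", "-s", scanner_ip,
         "--dport", "22", "-j", "ACCEPT"],
        # Create a non-root user whose home will hold authorized_keys.
        ["useradd", "--create-home", "--shell", "/bin/bash", user],
    ]


def render(commands):
    """Render argv lists as shell-quoted lines for review before running."""
    return [shlex.join(argv) for argv in commands]
```

Printing `render(qualys_access_commands("192.0.2.10"))` gives three lines that can be reviewed and run by hand on the load balancer host. On OpenStack the firewall step may instead be a security group rule.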
Nagios
- Need to understand:
- Where? AWS, TACC, ISDA instance
- Who gets notified?
- When does it run?
Features:
- Public service monitoring
- Private service monitoring (CPU, memory, disk, logged in users)
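As an illustration of the private service monitoring item, a Nagios check is just a program that prints one status line and exits with 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN). A minimal disk check might look like the sketch below; the 80%/90% thresholds and the function names are assumptions, not agreed values.

```python
import shutil

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def classify(used_pct, warn=80.0, crit=90.0):
    """Map a usage percentage to a Nagios status code."""
    if used_pct >= crit:
        return CRITICAL
    if used_pct >= warn:
        return WARNING
    return OK


def check_disk(path="/", warn=80.0, crit=90.0):
    """Return (exit_code, status_line) for the filesystem holding `path`."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    used_pct = 100.0 * usage.used / usage.total
    status = classify(used_pct, warn, crit)
    # Text after the pipe is performance data: label=value;warn;crit
    line = (f"DISK {LABELS[status]} - {used_pct:.1f}% used on {path}"
            f" | used_pct={used_pct:.1f}%;{warn};{crit}")
    return status, line
```

A plugin script would call `check_disk`, print the returned line, and pass the code to `sys.exit`. CPU, memory, and logged-in-user checks follow the same pattern with a different measurement.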
Kube tools/Prometheus
- Log aggregation
- Healthz on all services?
- Priorities
- Ingress - Nginx - using default backend 404
- Web UI / API (Kube API/Etcd availability)
- Kube system (GFS, etc)
- OpenStack
- Backups
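A probe for the healthz and ingress items could be as small as the sketch below. It assumes each service exposes an HTTP endpoint such as `/healthz` (not yet decided) and treats a configurable status code as healthy, so that the Nginx default backend, which answers 404, can be checked by expecting 404. The function name and defaults are hypothetical.

```python
import http.client
import urllib.error
import urllib.request


def probe(url, expected_status=200, timeout=3.0):
    """Return True if `url` answers with `expected_status` within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as exc:
        # urllib reports 4xx/5xx responses as exceptions; the code is still usable.
        status = exc.code
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Connection refused, timeout, DNS failure, or malformed URL.
        return False
    return status == expected_status


def probe_all(endpoints, timeout=3.0):
    """Probe a {name: (url, expected_status)} mapping and return {name: bool}."""
    return {name: probe(url, expected, timeout)
            for name, (url, expected) in endpoints.items()}
```

For example, `probe_all({"ingress": ("http://ingress.example/", 404)})` would report whether the default backend is answering. The host name here is a placeholder.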
Backup/Disaster Recovery
- GFS, Etcd "best effort" for beta
- Cluster config (using kubectl)
- Deploy tools provisioning
- Where? AWS, BW, TACC
- How? Some script/Job/rsync
- When? Daily rolling
- Q
- Hot backup of DBs: backupz + sidecar
- GFS backup options, depends on # of users
- Snapshots + diffs
- Checkpointing
- Replication to another GFS/geolocation
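One way to read "daily rolling" is a fixed window of the most recent daily backups, with anything older pruned. The sketch below shows that retention rule only; the 7-day window and the file naming scheme are assumptions, and the actual copy step (script, Job, or rsync) is out of scope.

```python
from datetime import date, timedelta


def backup_name(prefix, day):
    """Hypothetical naming scheme, e.g. 'etcd-2017-01-10.tar.gz'."""
    return f"{prefix}-{day.isoformat()}.tar.gz"


def backups_to_delete(backup_dates, today, keep_days=7):
    """Return the dates that fall outside the rolling window, oldest first.

    With keep_days=7 the backups for today and the six previous days are kept.
    """
    cutoff = today - timedelta(days=keep_days)
    return sorted(d for d in backup_dates if d <= cutoff)
```

A daily job would write `backup_name(...)` for today, then remove the files for every date returned by `backups_to_delete`.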
Performance Testing
- GFS
- “iassist” redux
- est. per-user quotas?
- What do we need on day 1?
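For the per-user quota question, a first estimate can come from simple arithmetic. The sketch assumes GFS here means a replicated GlusterFS volume, where usable space is roughly raw space divided by the replica count, and reserves a headroom fraction for growth. All numbers in the example are made up.

```python
def per_user_quota_gb(raw_capacity_gb, replica_count, users, headroom=0.2):
    """Estimate an even per-user quota in GB.

    usable = raw / replicas, minus a reserved headroom fraction,
    split evenly across the expected number of users.
    """
    if replica_count < 1 or users < 1:
        raise ValueError("replica_count and users must be at least 1")
    if not 0.0 <= headroom < 1.0:
        raise ValueError("headroom must be in [0, 1)")
    usable_gb = raw_capacity_gb / replica_count * (1.0 - headroom)
    return usable_gb / users
```

For instance, 3000 GB raw with 3 replicas, 20% headroom, and 40 users works out to about 20 GB per user. The real inputs depend on the beta user count and workload, which are still open.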
Open Questions
- How many beta users?
- What is the workload?
- By what performance metrics do we judge pass/fail?
- How do we learn our limits?
- Capacity planning / monitoring
- What happens when we need to:
- add GFS bricks?
- add Kubernetes nodes?
- What constitutes a failure?
- Dead node