Monitoring, Backup/Disaster Recovery, Performance Testing, Capacity Planning

Monitoring

Qualys

Qualys is used by NCSA IT for vulnerability assessment and management. Qualys requires SSH access to any public-facing host or service, which will likely mean the load balancer host and the Nginx ingress controller container.

Create SSH keypair
Open SSH access to NCSA Qualys server (IP)
Create non-root user
Install Qualys client?
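
The checklist above could be scripted roughly as follows. This is a minimal sketch, not an NCSA procedure: the user name `qualys-scan`, the key path, and the use of `iptables` for the firewall rule are all assumptions, and the scanner address must be supplied by the caller because the real IP is not recorded in these notes. The function only builds the commands so they can be reviewed before anyone runs them.

```python
def qualys_setup_commands(scanner_ip, user="qualys-scan",
                          key_path="/root/qualys_scan_key"):
    """Build (but do not run) the host preparation commands.

    All default names are hypothetical; adjust them to local policy.
    """
    return [
        # Create an Ed25519 keypair with an empty passphrase.
        ["ssh-keygen", "-t", "ed25519", "-N", "", "-f", key_path,
         "-C", f"{user}@qualys"],
        # Create the non-root account the scanner will log in as.
        ["useradd", "--create-home", "--shell", "/bin/bash", user],
        # Allow SSH only from the scanner address (assumes iptables).
        ["iptables", "-A", "INPUT", "-p", "tcp", "-s", scanner_ip,
         "--dport", "22", "-j", "ACCEPT"],
    ]
```

Calling `qualys_setup_commands("192.0.2.10")` returns three argument lists; `192.0.2.10` is a documentation-range placeholder, not the Qualys server. Whether a Qualys client also needs to be installed is still an open item.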

Nagios

Need to understand:
- Where does it run? AWS, TACC, or an ISDA instance
- Who gets notified?
- When does it run?

Features:

Public service monitoring
Private service monitoring (CPU, memory, disk, logged in users)
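
As an illustration of a private service check, the sketch below follows the standard Nagios plugin convention of exit codes 0 (OK), 1 (WARNING), and 2 (CRITICAL). The 80% and 90% thresholds and the default path are assumptions for illustration only.

```python
import shutil

OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios plugin exit codes


def disk_status(used_fraction, warn=0.80, crit=0.90):
    """Map a disk usage fraction (0.0-1.0) to a Nagios code and message."""
    percent = used_fraction * 100
    if used_fraction >= crit:
        return CRITICAL, f"DISK CRITICAL - {percent:.0f}% used"
    if used_fraction >= warn:
        return WARNING, f"DISK WARNING - {percent:.0f}% used"
    return OK, f"DISK OK - {percent:.0f}% used"


def check_path(path="/"):
    """Measure real usage for a path using only the standard library."""
    usage = shutil.disk_usage(path)
    return disk_status(usage.used / usage.total)
```

A wrapper script would print the message and exit with the returned code so that Nagios can pick up the state. CPU, memory, and logged-in user checks could follow the same pattern.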

Kube tools/Prometheus

Log aggregation
Healthz on all services?
Priorities
- Ingress (Nginx): probe the default backend, which returns 404
- Web UI / API (Kube API/Etcd availability)
- Kube system (GFS, etc.)
- OpenStack
- Backups
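
A health probe along these lines could cover the first two priorities. It is a minimal sketch using only the Python standard library; the endpoint URLs are hypothetical placeholders, and the `/healthz` path assumes each service ends up exposing one. Because the Nginx default backend answers 404, the ingress check treats 404 as the healthy response.

```python
import urllib.error
import urllib.request


def probe(url, expected_status=200, timeout=5.0,
          opener=urllib.request.urlopen):
    """Return (healthy, detail) for one HTTP endpoint.

    `opener` is injectable so the logic can be tested without a network.
    """
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is still meaningful.
        status = err.code
    except OSError as err:
        # Covers URLError, refused connections, and timeouts.
        return False, f"unreachable: {err}"
    return status == expected_status, f"HTTP {status}"


# Hypothetical endpoints, listed in the priority order from the notes.
CHECKS = [
    # A 404 from the default backend means the ingress itself is up.
    ("ingress", "http://ingress.example.org/", 404),
    ("api", "http://api.example.org/healthz", 200),
]


def run_checks(checks=CHECKS, **kwargs):
    """Probe every endpoint and return {name: (healthy, detail)}."""
    return {name: probe(url, expected, **kwargs)
            for name, url, expected in checks}
```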

Backup/Disaster Recovery

GFS, Etcd "best effort" for beta
Cluster config (using kubectl)
Deploy tools provisioning

Where? AWS, BW, TACC
How? Some script/Job/rsync
When? Daily rolling
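
The "daily rolling" schedule implies a retention rule. The sketch below shows one possible interpretation: keep the most recent N days and prune the rest. The 7-day window and the archive naming scheme are assumptions, not decisions recorded in these notes.

```python
from datetime import date, timedelta


def backups_to_prune(backup_dates, today, keep_days=7):
    """Return the backup dates that fall outside the rolling window.

    A backup is kept if it was taken within the last `keep_days` days,
    counting today.
    """
    cutoff = today - timedelta(days=keep_days - 1)
    return sorted(d for d in backup_dates if d < cutoff)


def backup_name(prefix, day):
    """Build a sortable, date-stamped archive name (naming is an assumption)."""
    return f"{prefix}-{day.isoformat()}.tar.gz"
```

The script, Job, or rsync step that actually copies the data would run before pruning, so that a failed backup never removes the last good copy.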
Questions
- Hot backup of DBs: backupz + sidecar
GFS backup options (depend on the number of users)
- Snapshots + diffs
- Checkpointing
- Replication to another GFS/geolocation

Performance Testing

GFS
“iassist” redux

Est. per-user quotas?
What do we need on day 1?

Open Questions

How many beta users?

What is the workload?

By what performance metrics do we judge pass/fail?
How do we learn our limits?
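
One possible answer to the pass/fail question is a latency percentile plus an error-rate ceiling. The sketch below is only an illustration of that idea: the 95th percentile, the 500 ms limit, and the 1% error budget are placeholder assumptions until the workload and beta user count are known.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def judge(latencies_ms, errors=0, p95_limit_ms=500.0, max_error_rate=0.01):
    """Summarize one test run and decide pass/fail.

    `latencies_ms` holds successful requests; `errors` counts failed ones.
    """
    p95 = percentile(latencies_ms, 95)
    error_rate = errors / (len(latencies_ms) + errors)
    passed = p95 <= p95_limit_ms and error_rate <= max_error_rate
    return {"p95_ms": p95, "error_rate": error_rate, "passed": passed}
```

Repeating the run at increasing load until `passed` flips to false is one simple way to learn where the limits are.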

Capacity planning / monitoring

What happens when we need to:

add GFS bricks?
add Kubernetes nodes?
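
Knowing when bricks or nodes must be added requires some forecast of usage. The sketch below assumes simple linear growth and an 80% usage threshold, both of which are illustrative assumptions; the units (TB) are likewise arbitrary.

```python
import math


def days_until_threshold(used_tb, capacity_tb, growth_tb_per_day,
                         threshold=0.80):
    """Days until usage reaches `threshold` of capacity at linear growth.

    Returns 0 if already past the threshold and None if usage is not growing.
    """
    limit = capacity_tb * threshold
    if used_tb >= limit:
        return 0
    if growth_tb_per_day <= 0:
        return None
    return math.ceil((limit - used_tb) / growth_tb_per_day)
```

Feeding this from the monitoring data would turn "when do we add GFS bricks?" into an alert with lead time rather than a reaction to a full volume.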

What constitutes a failure?

Dead node
