Monitoring
Qualys
Qualys is used by NCSA IT for vulnerability assessment and management. Qualys requires SSH access to any public-facing host or service it scans.
NCSA Security has opened a ticket for this: https://jira.ncsa.illinois.edu/browse/SECOPS-340. We need to:
- Provide a list of IPs that we want scanned (in general they try to scan one system of each type).
- Security will provide the SSH public key used to log in to the local qualys user account.
- Instructions for setting up the Qualys user: https://wiki.ncsa.illinois.edu/pages/viewpage.action?pageId=41461115
- Provide an email address for reports.
- We will also need to do this for public-facing containers (e.g., the Nginx ingress controller).
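The wiki page above has the authoritative setup steps; a rough runbook sketch (the username `qualys` is per those instructions, but the key below is a placeholder to be replaced with the public key Security provides, run as root):

```shell
# Sketch only: create a local qualys user and install the Security-provided
# SSH public key. The key below is a placeholder.
useradd --create-home --shell /bin/bash qualys
install -d -m 700 -o qualys -g qualys /home/qualys/.ssh
echo 'ssh-ed25519 AAAA...placeholder... qualys-scanner' \
  > /home/qualys/.ssh/authorized_keys
chown qualys:qualys /home/qualys/.ssh/authorized_keys
chmod 600 /home/qualys/.ssh/authorized_keys
```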
Associated tickets:
- NDS-565
Nagios
Nagios is an open source monitoring system. In general, the Nagios server is installed in one location and the Nagios Remote Plugin Executor (NRPE) runs on each node to be monitored. Nagios provides public service monitoring through standard plugins (e.g., DNS, HTTP, SMTP) and private service monitoring through NRPE (CPU, memory, disk, etc.).
For NDS Labs, we'll do the following:
- Evaluate using https://github.com/QuantumObject/docker-nagios
- Create Nagios server Docker image if docker-nagios is not acceptable, following the instructions in
- Create Nagios daemonset for NRPE following the instructions in
- Provision VM to run Nagios server at remote site (TACC)
- Create a Nagios configuration GitHub repository to maintain versioned, per-cluster monitoring configurations (starting with the beta cluster)
- Configure Nagios contacts
- Configure Nagios hosts for priority systems. This includes:
  - Ingress/Nginx
  - Web UI/API, including Kube API/etcd availability
  - Kube system (GFS, LMA tools, etc.)
  - OpenStack
  - Backups
- NOTE: the Nagios server cannot directly reach cluster servers, which currently live on a private network, without going through the ingress load balancer. Monitoring should be direct where possible; this is addressed by NDS-581
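As a sketch of what the per-cluster configuration repository might hold (host names, addresses, and thresholds here are hypothetical placeholders, not the actual beta configuration):

```
# nrpe.cfg (on each monitored node) - example private-service checks
allowed_hosts=<nagios-server-ip>
command[check_load]=/usr/lib/nagios/plugins/check_load -w 15,10,5 -c 30,25,20
command[check_disk_root]=/usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /

# hosts.cfg (on the Nagios server) - example host/service definitions
define host {
  use        linux-server
  host_name  beta-ingress
  address    <ingress-ip>
}

define service {
  use                  generic-service
  host_name            beta-ingress
  service_description  HTTP
  check_command        check_http
}
```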
Additionally, we will want to add health checks (healthz) to all system services.
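A healthz check can be surfaced to Nagios with a small plugin-style probe. This is a sketch: the `/healthz` path is the convention proposed above, and the URL argument is whatever endpoint a given service exposes.

```shell
# Sketch of a Nagios-plugin-style healthz probe.
# Return codes follow the Nagios plugin convention: 0=OK, 2=CRITICAL.
check_healthz() {
  url="$1"
  if curl -fsS --max-time 5 "$url" >/dev/null 2>&1; then
    echo "OK - $url is healthy"
    return 0
  else
    echo "CRITICAL - $url is unreachable or unhealthy"
    return 2
  fi
}
```

Wired into NRPE, this becomes one `command[...]` entry per service in nrpe.cfg.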
Associated tickets:
- NDS-566
Usage monitoring
We will use the Kubernetes add-ons, specifically ELK and Grafana, to monitor usage during the beta period.
Backup/Disaster Recovery
- GFS, etcd: "best effort" for beta
- Cluster config (using kubectl)
- Deploy tools provisioning
- Where? AWS, BW, TACC
- How? Some script/Job/rsync
- When? Daily rolling
- Q
- Hot backup of DBs: backupz + sidecar
- GFS backup options (depends on # of users):
  - Snapshots + diffs
  - Checkpointing
  - Replication to another GFS/geolocation
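The "daily rolling" script/rsync option above could look roughly like this. Everything here is an assumption for illustration: `SRC` and `DEST_ROOT` are placeholders, and in production `DEST_ROOT` would be a remote target at AWS, BW, or TACC rather than a local path.

```shell
# Sketch of a daily rolling backup with a seven-day window.
# Cluster config (e.g., kubectl output) and etcd/GFS dumps would be
# staged into SRC before syncing.
SRC="${SRC:-./backup-staging}"
DEST_ROOT="${DEST_ROOT:-./backups}"
DAY="$(date +%u)"    # day of week 1..7 -> seven rolling slots
mkdir -p "$SRC" "$DEST_ROOT/day-$DAY"
rsync -a --delete "$SRC/" "$DEST_ROOT/day-$DAY/"
```

Run from cron once a day, each weekday slot is overwritten in place, giving seven days of retention without unbounded growth.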
Performance Testing
- GFS
- “iassist” redux
- est. per-user quotas?
- What do we need on day 1?
Open Questions
- How many beta users?
- What is the workload?
- By what performance metrics do we judge pass/fail?
- How do we learn our limits?
- Capacity planning / monitoring
- What happens when we need to:
- add GFS bricks?
- add kubernetes nodes?
- What constitutes a failure?
- Dead node