Monitoring

Qualys

NCSA IT uses Qualys for vulnerability assessment and management. Qualys requires SSH access to every public-facing host or service.

NCSA Security has opened a ticket for this: https://jira.ncsa.illinois.edu/browse/SECOPS-340. We need to:

Associated tickets:

Nagios

Nagios is an open source monitoring system. Typically, the Nagios server runs in one location and the Nagios Remote Plugin Executor (NRPE) is installed on each node to be monitored. Nagios monitors public services (e.g., DNS, HTTP, SMTP) through standard plugins. It monitors private resources (CPU, memory, disk, etc.) through NRPE.
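
To illustrate what an NRPE-executed check looks like, here is a minimal sketch of a disk check that follows the standard Nagios plugin convention: one status line on stdout and an exit code of 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN). The function names and the 80%/90% thresholds are assumptions chosen for the example, not values taken from the NDS Labs setup.

```python
"""Sketch of a Nagios-style disk check that NRPE could execute."""
import shutil
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(used_pct, warn=80.0, crit=90.0):
    """Map a usage percentage to a (status code, status line) pair.

    The default thresholds are illustrative assumptions.
    """
    if used_pct >= crit:
        return CRITICAL, f"DISK CRITICAL - {used_pct:.1f}% used"
    if used_pct >= warn:
        return WARNING, f"DISK WARNING - {used_pct:.1f}% used"
    return OK, f"DISK OK - {used_pct:.1f}% used"


def check_disk(path="/", warn=80.0, crit=90.0):
    """Measure disk usage for `path` and evaluate it against the thresholds."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        # A check that cannot run reports UNKNOWN rather than OK.
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    if usage.total == 0:
        return UNKNOWN, "DISK UNKNOWN - filesystem reports zero size"
    return evaluate(100.0 * usage.used / usage.total, warn, crit)


def main(argv):
    """Print the status line and return the exit code for Nagios."""
    path = argv[1] if len(argv) > 1 else "/"
    code, message = check_disk(path)
    print(message)
    return code

# As a plugin entry point, the script would end with: sys.exit(main(sys.argv))
```

Keeping the threshold logic in `evaluate` separate from the measurement makes the check easy to test without touching a real filesystem.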

For NDS Labs, we'll do the following:

Additionally, we will want to add health checks (healthz) to all system services.
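
As a rough sketch of such a health check, the following serves a `/healthz` endpoint that returns HTTP 200 while the process is up, together with a small probe of the kind a monitoring check could run against it. The handler, the function names, and the loopback bind address are assumptions for illustration; a real service would report on its own dependencies and bind to whatever address its deployment requires.

```python
"""Sketch of a /healthz endpoint and a matching probe."""
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Answer GET /healthz with 200 and anything else with 404."""

    def do_GET(self):
        if self.path == "/healthz":
            status, body = 200, b"ok"
        else:
            status, body = 404, b"not found"
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Suppress per-request logging so probe traffic stays quiet.
        pass


def start_health_server(host="127.0.0.1", port=0):
    """Start the health server in a daemon thread (port 0 picks a free port)."""
    server = HTTPServer((host, port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def probe(host, port, path="/healthz", timeout=2.0):
    """Return the HTTP status code of a GET request to the given path."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        return conn.getresponse().status
    finally:
        conn.close()
```

A caller would start the server with `server = start_health_server()`, read the chosen port from `server.server_address[1]`, and treat any status other than 200 from `probe` as unhealthy.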

Associated tickets:

Usage monitoring

We will use the Kubernetes addons, specifically ELK and Grafana, to monitor usage during the beta period.

Backup/Disaster Recovery

  1. GFS, etcd: "best effort" for beta
  2. Cluster config (using kubectl)
  3. Deploy tools provisioning
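
The cluster config item above could be approached by exporting resource definitions with `kubectl get ... -o yaml` and writing them to files. The sketch below assumes `kubectl` is on the PATH and already configured for the cluster. The list of resource kinds, the output layout, and the function names are hypothetical choices for the example, not a decided backup procedure.

```python
"""Sketch of exporting Kubernetes resource definitions with kubectl."""
import subprocess
from pathlib import Path

# Hypothetical selection of resource kinds; adjust to what the cluster uses.
RESOURCE_KINDS = ["services", "replicationcontrollers", "persistentvolumeclaims"]


def export_command(kind):
    """Build the kubectl command that dumps one resource kind as YAML."""
    return ["kubectl", "get", kind, "--all-namespaces", "-o", "yaml"]


def backup_cluster_config(out_dir, kinds=RESOURCE_KINDS, run=subprocess.run):
    """Write one YAML file per resource kind into out_dir.

    `run` is injectable so the function can be exercised without a cluster.
    Returns the list of files written.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for kind in kinds:
        # check=True raises CalledProcessError if kubectl exits non-zero.
        result = run(export_command(kind), capture_output=True, text=True, check=True)
        target = out_path / f"{kind}.yaml"
        target.write_text(result.stdout)
        written.append(target)
    return written
```

Calling `backup_cluster_config("cluster-backup")` would produce files such as `cluster-backup/services.yaml`. The persistent data in GFS and etcd is not covered by this export.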

 

Performance Testing

Open Questions

Capacity Planning