Monitoring

  • Qualys (vulnerability)
    • Load balancer, Nginx controller
  • Nagios
    • Need to understand
    • Where? AWS, TACC, ISDA instance
    • Who gets notified?
    • When does it run?
  • Kube tools/Prometheus
  • Log aggregation
  • Healthz on all services?
  • Priorities
    • Ingress - Nginx - using default backend 404
    • Web UI / API (Kube API/Etcd availability)
    • Kube system (GFS, etc)
    • OpenStack
    • Backups
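
One way to read the "Healthz on all services?" item is that every service exposes a small HTTP endpoint that a monitor (Nagios, a Kubernetes probe, or a script) can poll. The sketch below is only an illustration under assumptions: the `/healthz` path, the plain-text `ok` body, and the names `HealthHandler` and `probe` are hypothetical and do not come from these notes.

```python
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Answers GET /healthz with 200 and everything else with 404."""

    def do_GET(self):
        if self.path == "/healthz":  # assumed path, not specified in the notes
            status, body = 200, b"ok"
        else:
            status, body = 404, b"not found"
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Silence per-request logging so polling does not flood the logs.
        pass


def probe(url, timeout=2.0):
    """Return the HTTP status code for url, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404 or 500.
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout.
        return None
```

A service could run `HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()` in a background thread, and a monitoring job would alert whenever `probe(...)` returns anything other than 200.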

...

  • Where? AWS, BW, TACC
  • How? Some script/Job/rsync
  • When? Daily rolling
  • Question
    • Hot backup of DBs – backupz + sidecar
  • GFS backup options, depends on # of users
    • Snapshots + diffs
    • Checkpointing
    • Replication to another GFS/geolocation
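
The "Daily rolling" idea above can be sketched as a retention rule: name each backup by date and expire anything older than a fixed window. This is a minimal sketch under assumptions: the 7-day window, the `backup-YYYY-MM-DD` naming, and the function names are hypothetical, and the actual copy step (script, Job, or rsync) is left out.

```python
from datetime import date, timedelta


def backup_name(day, prefix="backup"):
    """Build a dated backup name, e.g. backup-2024-01-10 (assumed scheme)."""
    return f"{prefix}-{day.isoformat()}"


def split_rolling(days, today, keep_days=7):
    """Split backup dates into (keep, expire) for a rolling daily window.

    The window covers `today` and the `keep_days - 1` days before it.
    """
    cutoff = today - timedelta(days=keep_days - 1)
    keep = sorted(d for d in days if d >= cutoff)
    expire = sorted(d for d in days if d < cutoff)
    return keep, expire
```

A daily job would create `backup_name(today)`, call `split_rolling` on the dates it finds at the backup destination, and delete only the entries in `expire`.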

 

Performance Testing

...

  • GFS
  • “iassist” redux
    • est. per-user quotas?
    • what do we need on day 1?

Open Questions

  • How many beta users?
    • what is the workload?
  • By what performance metrics do we judge pass/fail?
  • How do we learn our limits?
    • Capacity planning / monitoring
  • What happens when we need to:
    • add GFS bricks?
    • add kubernetes nodes?
  • What constitutes a failure?
    • Dead node
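
The pass/fail question above is still open. Purely as an illustration of what such a criterion could look like, the sketch below judges a test run by a latency percentile. The choice of p95, the nearest-rank method, and the threshold are all assumptions, not decisions recorded in these notes.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile of samples, with p between 0 and 100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Nearest-rank: the smallest value with at least p% of samples at or below it.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def passes(latencies_ms, limit_ms, p=95):
    """True if the p-th percentile latency is within the (hypothetical) limit."""
    return percentile(latencies_ms, p) <= limit_ms
```

For example, `passes(latencies, limit_ms=500)` would fail a run in which more than 5% of requests took longer than 500 ms.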

Capacity Planning