
This is a tracking/accounting task for work done last week after the catastrophic loss of the beta loadbalancer instance.

On 1/12, we received alerts from Nagios about timeouts from the beta loadbalancer instance.

The instance was slow to respond, so I rebooted it. The reboot took >30 minutes (usual time is ~30 seconds).
Opened a ticket with the Nebula team and worked with Chris Lindsey to investigate the instance. David thought the issue might be related to previous Nebula GFS problems. Chris was unable to find any problems; he live-migrated the instance to another compute node, but the problem continued.
David suggested the following solution:
- Create a snapshot of an existing instance via OpenStack. In this case, we chose LMA.
- Rename old loadbal instance (loadbal-old)
- Use the snapshot to provision a new instance (loadbal)
- Disassociate public IP from old instance, assign to new instance
- Update /etc/kubernetes/kubelet config to have correct hostname
- Add nagios user/key, qualys user/key per usual install process.
- Change passwd/group/shadow/permissions
- Shut down old loadbal node
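
For future reference, the OpenStack-side steps above can be sketched as a dry-run script. This is only an illustration, not a record of the commands actually run: it assumes the standard `openstack` command-line client is used, and the snapshot name, flavor, network, and floating IP below are hypothetical placeholders. The script only prints the commands; it does not execute anything. The host-level steps (kubelet config, nagios/qualys users, passwd/group/shadow/permissions) are not covered.

```python
import shlex


def build_plan(source, snapshot, flavor, network, floating_ip,
               name="loadbal", retired_name="loadbal-old"):
    """Return the OpenStack CLI commands for the rebuild, in order, as argv lists.

    All argument values are supplied by the caller; nothing here is taken
    from the real environment.
    """
    return [
        # Snapshot an existing, healthy instance (LMA in this incident).
        ["openstack", "server", "image", "create", "--name", snapshot, source],
        # Rename the broken instance so the original name is free.
        ["openstack", "server", "set", "--name", retired_name, name],
        # Provision the replacement from the snapshot under the original name.
        ["openstack", "server", "create", "--image", snapshot,
         "--flavor", flavor, "--network", network, name],
        # Move the public IP from the old instance to the new one.
        ["openstack", "server", "remove", "floating", "ip", retired_name, floating_ip],
        ["openstack", "server", "add", "floating", "ip", name, floating_ip],
        # Host-level fixes on the new instance happen here (not shown).
        # Finally, shut down the old instance.
        ["openstack", "server", "stop", retired_name],
    ]


def render(plan):
    """Format each argv list as a shell-safe command line."""
    return [" ".join(shlex.quote(arg) for arg in cmd) for cmd in plan]


if __name__ == "__main__":
    # Hypothetical values: replace with the real snapshot name, flavor,
    # network, and public IP before using any of these commands.
    plan = build_plan("LMA", "lma-snapshot", "m1.medium", "beta-net", "192.0.2.10")
    for line in render(plan):
        print(line)
```

Printing the plan first makes it easy to review the order of operations (in particular, that the rename happens before the new instance is created and that the old node is stopped last) before running anything by hand.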