National Data Service / NDS-724

Loadbalancer failure


    • Type: Task
    • Resolution: Fixed
    • Priority: Normal
    • Labs Workbench - Beta
    • Sprint: NDS Sprint 18, NDS Sprint 19

      This is a tracking/accounting task for work done last week after the catastrophic loss of the beta loadbalancer instance.

      On 1/12, we received alerts from Nagios about timeouts from the beta loadbalancer instance.

      • The instance was slow to respond, so I rebooted it. The reboot took >30 minutes (the usual time is ~30 seconds).
      • Opened a ticket with the Nebula team and worked with Chris Lindsey to investigate the instance. David thought the issue might be related to previous Nebula GFS problems. Chris was unable to find any problems; he live-migrated the instance to another compute node, but the problem continued.
      • David suggested the following solution:
        • Create a snapshot of an existing instance via OpenStack. In this case, we chose LMA.
        • Rename the old loadbal instance (loadbal-old)
        • Use the snapshot to provision a new instance (loadbal)
        • Disassociate the public IP from the old instance and assign it to the new instance
        • Update the /etc/kubernetes/kubelet config to have the correct hostname
        • Add the nagios user/key and qualys user/key per the usual install process
        • Change passwd/group/shadow/permissions
        • Shut down the old loadbal node
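
The OpenStack-side steps above can be sketched as a dry-run plan. This is only an illustration: the ticket does not say which tool was used, so the `openstack` CLI commands here are an assumption, and the function name, the instance name `lma`, the floating IP, and the flavor are hypothetical placeholders. The host-level steps (kubelet config, nagios/qualys users, passwd/group/shadow) are run on the new instance itself and are not covered.

```python
from typing import List


def loadbal_rebuild_plan(source: str, name: str, floating_ip: str, flavor: str) -> List[List[str]]:
    """Return the OpenStack CLI commands for the rebuild, in order, without running them.

    All arguments are placeholders chosen for illustration; a real deployment
    will likely need extra site-specific flags (network, key pair, security groups).
    """
    snapshot = f"{source}-snapshot"   # hypothetical snapshot image name
    retired = f"{name}-old"           # the ticket renames the old instance to loadbal-old
    return [
        # Snapshot an existing, healthy instance.
        ["openstack", "server", "image", "create", "--name", snapshot, source],
        # Rename the failing instance so the original name is free.
        ["openstack", "server", "set", "--name", retired, name],
        # Provision the replacement from the snapshot.
        ["openstack", "server", "create", "--image", snapshot, "--flavor", flavor, name],
        # Move the public IP from the old instance to the new one.
        ["openstack", "server", "remove", "floating", "ip", retired, floating_ip],
        ["openstack", "server", "add", "floating", "ip", name, floating_ip],
        # Shut down the old node last, after the host-level steps are done.
        ["openstack", "server", "stop", retired],
    ]


if __name__ == "__main__":
    for cmd in loadbal_rebuild_plan("lma", "loadbal", "203.0.113.10", "m1.medium"):
        print(" ".join(cmd))
```

Printing the plan rather than executing it leaves room to review each command, and to slot in the manual host-level steps before the final shutdown.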

              willis8 Craig Willis