Starting to track incidents and how they were resolved.
Date/Time | What happened | How was it resolved |
---|---|---|
10/31/2016 ~8am | Nagios errors on gfs2, node1, node3 | Attempted to reboot the nodes, but encountered an error ("Error: Failed to perform requested operation on instance "workbench-"). Emailed the Nebula group; apparently a problem with their glusterfs. Resolved at 10:30 AM. |
11/3/2016 | Nagios errors "could not complete SSH Handshake" on node6 | Looked at the node6 console via Nebula; appears to be an OOM problem (maybe the old swap issue?). kubectl get nodes reported all nodes Ready except node6, and the node was totally inaccessible. Tried a soft reboot via Horizon, but the instance then went into an error state. Spoke with the Nebula group; this was related to Monday's error. They resolved the underlying problem, but I still wasn't able to start the instance. Using the CLI, nova show <instance>, nova reset-state --active <instance>, then nova start <instance> did the trick (see the command sketch below the table). |
11/4/2016 | Nagios error for labstest-lma | Same as above: the Nebula team resolved the glusterfs issue, but I did not have permission to issue the reset-state command. |
11/8/2016 | API server not accessible – all Kubernetes services down on workbench-master1 | It again appears that etcd2 went down, probably due to memory problems. Rebooted the node (see the diagnostic sketch below the table). |
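
For reference, the instance-recovery sequence from the 11/3 incident, written out as a sketch. It assumes the OpenStack nova CLI is installed and sourced with credentials for the project; `workbench-node6` is a placeholder instance name, not necessarily the exact name shown in Horizon.

```bash
# Show the instance details and current fault (placeholder instance name).
nova show workbench-node6

# Clear the stuck ERROR state so nova will accept further actions.
nova reset-state --active workbench-node6

# Boot the instance again now that its state has been reset.
nova start workbench-node6

# From a host with kubectl configured, verify the node reports Ready again.
kubectl get nodes
```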
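
The 11/8 entry only records that etcd2 appeared to be down due to memory problems before the reboot. The checks below are a hedged sketch of how that could be confirmed on workbench-master1, assuming etcd2 runs there as a systemd unit; they are not the exact commands used at the time.

```bash
# Is the etcd2 unit still running? (assumes systemd-managed etcd2)
systemctl status etcd2

# Look for etcd2 crashes or OOM-killer activity leading up to the outage.
journalctl -u etcd2 --since "2 hours ago" | tail -n 100
dmesg | grep -i "out of memory"

# How much memory headroom is left on the master?
free -m

# After rebooting, confirm the API server and etcd report healthy.
kubectl get componentstatuses
```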