Starting to track incidents and how they were resolved.
Date/Time | What happened | How was it resolved |
---|---|---|
10/31/2016 ~8am | Nagios errors on gfs2, node1, node3 | Attempted to reboot the nodes, but encountered an error ("Error: Failed to perform requested operation on instance "workbench-"). Emailed the Nebula group; apparently a problem with their glusterfs. Resolved at 10:30 AM. |
11/3/2016 | Nagios errors "could not complete SSH Handshake" on node6 | Looked at the node6 console via Nebula; appears to be an OOM problem (maybe the old swap issue?). kubectl get nodes reported all nodes Ready except node6, and the node was totally inaccessible. Tried a soft reboot via Horizon, but the instance then went into an error state. Spoke with the Nebula group; this was related to Monday's error. They resolved the underlying problem, but I still wasn't able to start the instance. Using the CLI, nova show <instance>, nova reset-state --active <instance>, then nova start <instance> did the trick (see the command sketch below the table). |
11/4/2016 | Nagios error for labstest-lma | Same as above: the Nebula team resolved the glusterfs issue, but I did not have permission to issue the reset-state command. |
11/8/2016 | API server not accessible – all Kubernetes services down on workbench-master1 | It again appears that etcd2 went down, probably due to memory problems. Rebooted the node (see the diagnostic sketch below the table). |
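
For reference, the instance-recovery sequence from the 11/3 incident, written out as a sketch. It assumes the OpenStack nova CLI is installed and sourced with credentials for the project; `workbench-node6` is a placeholder instance name, not necessarily the exact name shown in Horizon.

```bash
# Show the instance details and current fault (placeholder instance name).
nova show workbench-node6

# Clear the stuck ERROR state so nova will accept further actions.
nova reset-state --active workbench-node6

# Boot the instance again now that its state has been reset.
nova start workbench-node6

# From a host with kubectl configured, verify the node reports Ready again.
kubectl get nodes
```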
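
The 11/8 entry only records that etcd2 appeared to be down due to memory problems before the reboot. The checks below are a hedged sketch of how that could be confirmed on workbench-master1, assuming etcd2 runs there as a systemd unit; they are not the exact commands used at the time.

```bash
# Is the etcd2 unit still running? (assumes systemd-managed etcd2)
systemctl status etcd2

# Look for etcd2 crashes or OOM-killer activity leading up to the outage.
journalctl -u etcd2 --since "2 hours ago" | tail -n 100
dmesg | grep -i "out of memory"

# How much memory headroom is left on the master?
free -m

# After rebooting, confirm the API server and etcd report healthy.
kubectl get componentstatuses
```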