
This page tracks incidents and how they were resolved.

 

Date/Time: 10/31/2016 ~8am

What happened: NAGIOS errors on gfs2, node1, node3.

How it was resolved: Attempted to reboot the nodes, but encountered this error:

> "Error: Failed to perform requested operation on instance "workbench-gfs2",
> the instance has an error status: Please try again later [Error: cannot write
> data to file '/etc/libvirt/qemu/instance-00002cf1.xml.new': No space left on device]."

Emailed the Nebula group – apparently a problem with their glusterfs. Resolved at 10:30 AM.

Date/Time: 11/3/2016

What happened: NAGIOS errors "could not complete SSH Handshake" on node6. Looked at the node6 console via Nebula; it appears to be an OOM problem (maybe the old swap issue?). kubectl get nodes says all nodes except node6 are Ready. The node is totally inaccessible. Tried a soft reboot via Horizon, but the node then went into an error state.

How it was resolved: Spoke with the Nebula group; this was related to the error from Monday. They resolved the underlying problem, but I still wasn't able to start the instance. Using the CLI:

nova show <instance>
nova reset-state --active <instance>
nova start <instance>

did the trick.
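For reference, the same recovery sequence with an illustrative instance name and a verification step added (the name, the comments, and the final kubectl check are assumptions, not part of the original notes):

# inspect the instance status and the fault reported by Nova (instance name is illustrative)
nova show workbench-node6

# clear the ERROR state so the instance can be started again
nova reset-state --active workbench-node6
nova start workbench-node6

# once it boots, confirm the node rejoins the cluster
kubectl get nodes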

Date/Time: 11/4/2016

What happened: NAGIOS error for labstest-lma.

How it was resolved: Same as above. The Nebula team resolved the glusterfs issue. I did not have permission to issue the reset-state command.
Date/Time: 11/8/2016

What happened: API server not accessible – all Kubernetes services down on workbench-master1.

How it was resolved: It again appears that etcd2 went down, probably due to memory problems. Rebooted the node.
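When this recurs, a minimal diagnostic pass on workbench-master1 before rebooting might look like the following (the unit names are assumptions based on a CoreOS host running etcd2 and the kubelet as systemd services):

# check whether etcd2 and the kubelet are still running
systemctl status etcd2 kubelet

# look for OOM kills and memory pressure in the recent logs
journalctl -u etcd2 --no-pager | tail -n 50
free -m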
Date/Time: 12/26/2016

What happened: GFS1 not accessible. Rebooting via Nebula put the node in an error state.

How it was resolved: Resolved on 1/3 by the Nebula team – apparent problem with the Gluster server. The node was able to restart.
Date/Time: 1/4/2017

What happened: GFS4 not accessible.

How it was resolved: Resolved 1/4 by the Nebula team – continued problem with the Gluster server.
Date/Time: 1/12/2017

What happened: Loadbalancer sluggish. workbench-loadbal has been sluggish, with slow response times resulting in numerous false-positive Nagios alerts. At some point this afternoon it was unresponsive. A hard reboot via Horizon took >30 minutes for CoreOS 1122 (which takes ~30 seconds on a normal day). Login was slow after the reboot and services never fully revived. David suggests this is a storage problem, but the Nebula team can find no apparent cause. Starting standalone CoreOS instances works without error.

How it was resolved: Tried two different approaches:

1. shutdown -h of the instance and restart, to see if the hypervisor moves it somewhere more friendly.
2. Create a snapshot of another node (lma) and use it to create a new instance. After boot, edit /etc/kubernetes/kubelet and change KUBELET_HOSTNAME from lma to loadbal, then systemctl restart kubelet. After this, kubectl get nodes showed loadbal in the Ready state with the correct label. Disassociated the IP, associated it with the new instance, and shut down the bad instance (a sketch of the commands follows this entry).
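A sketch of the second approach as shell commands. The instance names, the floating IP, and the exact format of the KUBELET_HOSTNAME line are placeholders; only the kubelet edit, the kubelet restart, and the IP re-association come from the notes above:

# on the new instance booted from the lma snapshot:
# point the kubelet at the loadbal hostname instead of lma
sudo sed -i '/KUBELET_HOSTNAME/s/lma/loadbal/' /etc/kubernetes/kubelet
sudo systemctl restart kubelet
kubectl get nodes    # loadbal should show Ready with the correct label

# from a machine with the nova CLI: move the floating IP and retire the bad instance
nova floating-ip-disassociate workbench-loadbal-old 203.0.113.10
nova floating-ip-associate workbench-loadbal-new 203.0.113.10
nova stop workbench-loadbal-old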

 

 

Date/Time: 1/20/2017

What happened: Node1/Node3 unavailable. Nodes 1 and 3 were not accessible via SSH from the Nagios instances. Node3 was totally inaccessible – the Horizon console indicated OOM.

How it was resolved: A hard reboot succeeded, but CoreOS upgraded to 1235, introducing the flannel error. Copied the flannel config to /run/flannel and restarted. Node3 was then accessible, but docker was down. Restarting docker failed until /var/lib/docker was deleted. The other node also upgraded to 1235, requiring the same flannel change (a sketch of the fix follows).
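A rough sketch of the flannel/docker fix, assuming the flannel options file lives at /etc/flannel/options.env (the notes only say "the flannel config" was copied into /run/flannel; the source path and filename here are assumptions):

# restore the flannel config that the 1235 upgrade left out of /run/flannel
sudo mkdir -p /run/flannel
sudo cp /etc/flannel/options.env /run/flannel/options.env
sudo systemctl restart flanneld

# docker would not start until its state directory was cleared
sudo systemctl stop docker
sudo rm -rf /var/lib/docker
sudo systemctl start docker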
Date/Time: (not recorded)

What happened: OPS node read-only.

How it was resolved: The OPS node is currently in a read-only state (same old Nebula problem). Should be resolved by a reboot when needed.
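A quick way to confirm the read-only state before scheduling the reboot (illustrative commands, not taken from the notes):

# list any filesystems that are mounted read-only
awk '$4 ~ /^ro(,|$)/ {print $2, $4}' /proc/mounts

# a write test fails with "Read-only file system" when the root fs is affected
sudo touch /var/rw-test && sudo rm /var/rw-test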
Date/Time: (not recorded)

What happened: Master kubelet down.

How it was resolved: The master kubelet died due to the etcd memory error (a known issue). Rebooted; the CoreOS upgrade then required the flannel fix described above.