
This page tracks incidents and how they were resolved.

 

Each entry records the date/time, what happened, and how it was resolved.
10/31/2016 ~8am

What happened: NAGIOS errors on gfs2, node1, node3. Attempted to reboot the nodes, but encountered the error:

> "Error: Failed to perform requested operation on instance "workbench-
> gfs2", the instance has an error status: Please try again later
> [Error: cannot write data to file '/etc/libvirt/qemu/instance-
> 00002cf1.xml.new': No space left on device]."

How it was resolved: Emailed the Nebula group; apparently a problem with their glusterfs. Resolved at 10:30 AM.

11/3/2016

What happened: NAGIOS errors "could not complete SSH Handshake" on node6.

Looked at the node6 console via Nebula. Appears to be an OOM problem (possibly the old swap issue?).

kubectl get nodes says all nodes except node6 are ready.
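A minimal sketch of digging further into a suspected OOM node from a working master, assuming kubectl is configured against this cluster and the node is registered under the name node6:

# Confirm which nodes report Ready
kubectl get nodes

# Inspect node6's conditions (e.g. MemoryPressure, OutOfDisk) and recent events
kubectl describe node node6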

The node is completely inaccessible. Tried a soft reboot via Horizon, but the node then went into an error state.

How it was resolved: Spoke with the Nebula group; this was related to the error from Monday. They resolved the underlying problem, but I still wasn't able to start the instance. Using the nova CLI:

nova show <instance>

nova reset-state --active <instance>

nova start <instance>

That did the trick.
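For reference, a sketch of that recovery sequence with the checks spelled out, assuming the OpenStack nova CLI and credentials that allow reset-state (it is typically an admin-only action), with <instance> standing for the instance name or UUID:

# Check the current status and any fault reported for the instance
nova show <instance>

# Clear the ERROR state so the instance can be managed again
nova reset-state --active <instance>

# Start the instance once the underlying storage problem is fixed
nova start <instance>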

11/4/2016

What happened: NAGIOS error for labstest-lma.

How it was resolved: Same as above. The Nebula team resolved the glusterfs issue. Did not have permission to issue the reset-state command.
11/8/2016

What happened: API server not accessible; all Kubernetes services down on workbench-master1.

How it was resolved: It again appears that etcd2 went down, probably due to memory problems. Rebooted the node.
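A minimal sketch of confirming an OOM kill and restarting etcd2 without rebooting the whole node, assuming etcd2 runs as a systemd unit named etcd2 on workbench-master1 (an assumption about this deployment):

# Look for OOM-killer entries in the kernel log
journalctl -k | grep -i 'killed process'

# Check the unit and restart it if it is down
systemctl status etcd2
sudo systemctl restart etcd2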
12/26/2016

What happened: GFS1 not accessible. Rebooting via Nebula put the node in an error state.

How it was resolved: Resolved on 1/3 by the Nebula team; apparent problem with the Gluster server. The node was able to restart.
1/4/2017

What happened: GFS4 not accessible.

How it was resolved: Resolved 1/4 by the Nebula team; continued problem with the Gluster server.