...
Date/Time | What happened | How was it resolved
---|---|---
10/31/2016 ~8am | NAGIOS errors on gfs2, node1, node3 | Attempted to reboot the nodes, but encountered the error "Error: Failed to perform requested operation on instance "workbench-". Emailed the Nebula group – apparently a problem with their glusterfs. Resolved at 10:30 AM.
11/3/2016 | NAGIOS errors "could not complete SSH Handshake" on node6 | Looked at the node6 console via Nebula. Appears to be an OOM problem (maybe the old swap issue?). kubectl get nodes says all nodes except node6 are Ready; the node is totally inaccessible. Tried a soft reboot via Horizon, but the node was then in an error state. Spoke with the Nebula group – this was related to the error from Monday. They resolved the underlying problem, but I still wasn't able to start the instance. Using the CLI did the trick: `nova show <instance>`, then `nova reset-state --active <instance>`, then `nova start <instance>` (see the nova sketch below the table).
11/4/2016 | NAGIOS error for labstest-lma | Same as above. Nebula team resolved the glusterfs issue. Did not have permission to issue the reset-state command.
11/8/2016 | API server not accessible – all Kubernetes services down on workbench-master1 | It again appears that etcd2 went down, probably due to memory problems. Rebooted the node.
12/26/2016 | GFS1 not accessible. Rebooting via Nebula put the node in an error state | Resolved on 1/3 by the Nebula team – apparent problem with the Gluster server. Node was able to restart.
1/4/2017 | GFS4 not accessible | Resolved 1/4 by the Nebula team – continued problem with the Gluster server.
1/12/2017 | Loadbalancer sluggish | workbench-loadbal has been sluggish, with slow response times resulting in numerous false-positive Nagios alerts. At some point this afternoon it became unresponsive. A hard reboot via Horizon took >30 minutes on CoreOS 1122 (which takes ~30 seconds on a normal day). Login was slow after the reboot and services never fully revived. David suggests this is a storage problem, but the Nebula team can find no apparent cause; starting standalone CoreOS instances works without error. Tried two approaches: (1) shutdown -h the instance and restart it, to see if the hypervisor moves it somewhere more friendly; (2) create a snapshot of another node (lma) and use it to create a new instance. After boot, edit /etc/kubernetes/kubelet to change KUBELET_HOSTNAME from lma to loadbal, then systemctl restart kubelet. After this, kubectl get nodes showed loadbal in the Ready state with the correct label. Disassociated the IP from the bad instance, associated it with the new one, and shut down the bad instance (see the rebuild sketch below the table).
1/20/2017 | Node1/Node3 unavailable | Nodes 1 and 3 were not accessible via SSH from the Nagios instances. Node3 was totally inaccessible – the Horizon console indicated OOM. A hard reboot succeeded, but CoreOS upgraded to 1235, introducing the flannel error. Copied the flannel config to /run/flannel and restarted. Node3 was then accessible, but docker was down; restarting docker failed until /var/lib/docker was deleted. Node1 also upgraded to 1235, requiring the same flannel change (see the flannel sketch below the table).
| | OPS node read-only | The OPS node is currently in a read-only state (same old Nebula problem). Should be resolved by a reboot when needed.
| | Master kubelet down | The master kubelet died due to the etcd memory error (known issue). Rebooted; the CoreOS upgrade required the flannel fix.
1/25/2017 | GFS nodes | Multiple incidents of GFS server pods not responding during healthz checks. In all cases, one or more glfs-server pods will not respond to exec. SSH to the GFS node is fine, but docker is unresponsive (docker ps hangs). journalctl shows errors related to the registry cache, e.g. `Jan 28 11:10:23 workbench-gfs4.os.ncsa.edu dockerd[26174]: time="2017-01-28T11:10:23.365820903-06:00" level=warning msg="Error getting v2 registry: Get http://localhost:5001/v2/: read tcp 127.0.0.1:36906->127.0.0.1:5001: read: connection reset by peer"`. Generally, restarting the docker daemon temporarily resolves the problem (see the docker restart sketch below the table).
1/29 | node1 | `-bash: /usr/bin/wc: Input/output error` – Gluster problems on Nebula.
2/9 | Multiple instances | Multiple instance I/O errors across projects, apparently due to a Gluster outage on Nebula. Problem first detected at 3 AM, reported at 6 AM. No updates as of 10:30 AM.
7/4 | gfs4 out of disk | /media/storage ran out of disk due to the docker-cache pod filling up the disk. The pod had already been recreated, so from workbench-master1 I looked up the uid of the broken pod (`kubectl get pods -o wide`, then `kubectl get pod -o yaml docker-cache-fwkd6 \| grep uid`), deleted it, and SSH'd into gfs4 (sudo su) to delete its folder. The nagios pod was missing on gfs4 after this, so from the master re-applied the daemonset: `kubectl get ds nagios-nrpe --namespace=kube-system`, `kubectl apply -f nagios-nrpe-ds.yaml`, `kubectl get ds nagios-nrpe --namespace=kube-system` (see the cleanup sketch below the table).
7/22 | Single pod restart threshold surpassed | NAGIOS started complaining shortly after 7 PM: "workbench-master1/Kubernetes Pods is WARNING: 1 pods exceeding WARNING restart threshold." A Fedora Commons pod had restarted a sixth time (due to OOMKilled), which started triggering these warnings. The solution was to delete the pod in question to reset the restart count (see the pod-deletion sketch below the table).
8/8 | loadbal out of disk | NAGIOS alerted that the node was nearly out of disk space. Craig restarted the ilb pod to clear out the huge 9.5 GB log file.
8/9 | loadbal | NAGIOS alerted that the node was nearly out of disk space (again). Mike restarted the loadbalancer node with a sudo reboot.
9/24 ~1pm | loadbal | NAGIOS alerted that the node was nearly out of disk space (again). Mike restarted the ilb pod to clear out the log file. This did not appear to alleviate the symptom, so he also restarted the node with a sudo reboot.
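
For the 11/3/2016 node6 incident, recovery was done with the standard OpenStack nova client. A minimal sketch, assuming OpenStack credentials are already sourced and using a placeholder instance name:

```bash
# Inspect the instance; after the failed soft reboot it reports an ERROR state.
nova show workbench-node6               # instance name is a placeholder

# Clear the ERROR state, then start the instance normally.
nova reset-state --active workbench-node6
nova start workbench-node6

# From the Kubernetes master, confirm the node rejoins the cluster.
kubectl get nodes
```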
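
For the 1/12/2017 loadbalancer rebuild, the key step after booting the instance created from the lma snapshot was re-pointing the kubelet at the loadbal identity. A rough sketch, assuming the kubelet reads its hostname override from /etc/kubernetes/kubelet as described above; the exact hostname value, sed pattern, and instance names are assumptions:

```bash
# On the replacement instance: change the kubelet's registered hostname from lma to loadbal.
sudo sed -i 's/^KUBELET_HOSTNAME=.*/KUBELET_HOSTNAME=workbench-loadbal/' /etc/kubernetes/kubelet
sudo systemctl restart kubelet

# From the master: the node should register as Ready under the loadbal name with the right label.
kubectl get nodes --show-labels

# Finally, move the floating IP to the new instance (this can also be done in Horizon);
# instance names and the IP are placeholders.
nova floating-ip-disassociate workbench-loadbal-old <floating-ip>
nova floating-ip-associate workbench-loadbal-new <floating-ip>
```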
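
For the 1/20/2017 flannel breakage after the CoreOS 1235 upgrade, the entry above only says the flannel config was copied into /run/flannel before restarting. A sketch of that recovery, with the environment-file name and the exact services to restart treated as assumptions:

```bash
# Recreate the runtime flannel directory and copy the saved flannel environment
# file back into place (file name is an assumption; use whatever the node had before).
sudo mkdir -p /run/flannel
sudo cp /path/to/saved/flannel_docker_opts.env /run/flannel/

# Restart the networking-dependent services so docker picks the flannel options back up.
sudo systemctl restart flanneld docker kubelet

# If docker still refuses to start (as on node3), wiping its state directory was the
# only thing that worked. Destructive: all local images and containers are removed.
sudo systemctl stop docker
sudo rm -rf /var/lib/docker
sudo systemctl start docker
```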
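
For the recurring 1/25/2017 GFS-node hangs, the temporary fix was simply restarting the docker daemon on the affected node. A short sketch; the pod name used for the exec check is a placeholder:

```bash
# On the affected GFS node: docker ps hangs, so bounce the daemon.
sudo systemctl restart docker
docker ps                                    # should return promptly again

# From the master: confirm the glfs-server pod responds to exec once more.
kubectl exec glfs-server-xxxxx -- ls /       # pod name is a placeholder
```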
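
For the 7/4 gfs4 cleanup, the commands recorded above fit together roughly as follows. The path to the orphaned pod folder is not recorded in the original log, so it is shown only as a placeholder:

```bash
# From workbench-master1: find the broken docker-cache pod and its uid.
kubectl get pods -o wide
kubectl get pod -o yaml docker-cache-fwkd6 | grep uid

# On gfs4 (as root via sudo su): remove the orphaned pod's folder to free /media/storage.
# The actual directory is elided in the log; <path-to-pod-folder> is a placeholder.
sudo rm -rf <path-to-pod-folder>/<uid>

# The nagios-nrpe daemonset pod disappeared from gfs4 afterwards, so re-apply it from the master.
kubectl get ds nagios-nrpe --namespace=kube-system
kubectl apply -f nagios-nrpe-ds.yaml
kubectl get ds nagios-nrpe --namespace=kube-system
```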
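
For the 7/22 restart-threshold warning, deleting the offending pod clears its restart count; assuming the Fedora Commons pod is managed by a controller, a replacement is created automatically. The pod name fragment and namespace below are placeholders:

```bash
# Find the pod with the high restart count.
kubectl get pods --all-namespaces | grep fcrepo        # name fragment is an assumption

# Delete it; its controller recreates it with the restart counter back at zero.
kubectl delete pod <fcrepo-pod-name> --namespace=<namespace>
kubectl get pods --namespace=<namespace>
```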
...