...
Date/Time | What happened | How was it resolved |
---|---|---|
9/24/2018 | Disk space errors gfs2 | Log files for the kube registry and GLFS client pods were using ~1.5GB each. Cleared out the offending log files in place, without restarting any containers (see the log truncation sketch below the table). |
9/7/2018 - 9/8/2018 | Disk space errors gfs1 | Gluster client log file was using 9.6GB. Cleared out the log file without restarting any containers (same truncation approach as above). |
8/12/2018 | Disk space errors lma | Gluster client log file was using far too much space. Cleared out the log file without restarting any containers (same truncation approach as above). |
8/10/2018 | Disk space errors | Gluster client log file was using far too much space. Cleared out the log file without restarting any containers (same truncation approach as above). |
7/24/2018 | load warnings gfs2/node2/loadbal | Load warnings returned on these same three nodes and continued for several hours. Still unresolved: the warnings again stopped after a time without any obvious manual intervention. |
7/2/2018 | load warnings | More unexplained load warnings on loadbal. |
6/28/2018 | load warnings | More unexplained load warnings on gfs2. Cause is still unknown, but we think this may be related to when users are accessing the NBI data. |
6/26/2018 | load warnings pod restart warnings NBI data loss | Several pods went into CrashLoopBackOff after the NBI data was somehow reset: MongoDB reported the data size as 500MB instead of the expected ~20GB. NBI was scaled down and the data was restored (I think?). |
6/18/2018 - 6/22/2018 | load warnings gfs2/node2/loadbal | Load warnings started popping up on these three nodes and continued for several hours. Still unresolved: the warnings stopped after a time without any obvious manual intervention. |
6/18/2018 | SSH brute force attempts all nodes | Noticed a lot of brute force attempts on many of our nodes. For now, only allowing a subset of NCSA/TACC/SDSC public IPs, plus my home IP when remote access is needed (see the firewall sketch below the table). |
6/14/2018 | Disk space errors gfs3 | The registry cache was using ~37GB. Couldn't exec into the cache pod (as in the 1/19/2018 entry) because it was in OutOfDisk state: `default docker-cache-gnc8m 0/1 OutOfDisk 0 345d`, while a replacement `default docker-cache-q1jh2 1/1 Running 0 17m` was already up. Since the pod had already been moved elsewhere, just deleted it. However, the DaemonSet wouldn't recreate the pod on gfs3 unless the spec was edited; adding a simple label (other; test) made the pod appear. |
4/29/2018 | gfs2 disk space warnings | Same problem as 4/23/2018 and 2/12/2018, except on gfs2. On nodes where we did not initially plan to run user services, we did not mount /var/lib/docker. Hopefully in the coming weeks we will be able to reprovision the Workbench Beta and reset the clock on these warnings. |
4/23/2018 | LMA disk space warnings | Same thing as 2/12/2018... I deleted the jupyter-nbi Docker image from that node (again) to clear up some space. We should probably consider/discuss removing the "compute" node label from this node to prevent it from happening again. |
4/10/2018 | SSL handshake errors | The Nagios NRPE container disappeared from node2 only. Running `kubectl apply -f ~/nagios-nrpe.ds.yaml` brought it back. Also cleared out some space on node2's /var/lib/docker (it was at 94%) by deleting /var/lib/docker/tmp and restarting the docker daemon. |
2/12/2018 | LMA disk space warnings | The LMA node on public beta does not appear to have a /var/lib/docker mount. This would be fine, except the node also had `ndslabs-role-compute: true` set, so client pods had been scheduled there, including one instance each of NBI and MDF Forge. Both have huge images (~4GB), and NBI also has a larger-than-average docker overlay folder. Short term: I have temporarily removed the compute label from LMA and deleted the MDF Forge pod and image; the NBI instance is Akshay's, so I will leave it running to avoid interrupting their work. Long term: once the user services are gone from this node (e.g., timeout), we can stop the docker daemon on LMA and remount /var/lib/docker as a bind-mount from /media/storage, as is standard on the other nodes (see the bind-mount sketch below the table). |
1/19/2018 | Disk space warnings gfs3 | The registry cache was using ~34GB of disk. Cleanup: `kubectl exec -it registry sh`; `wget localhost:5001/v2/_catalog -O -` (lists images in the cache); `cd /var/lib/registry/docker/registry/v2`; find something that can be removed (e.g., `repositories/craigwillis/apiserver`); `rm -r repositories/craigwillis/apiserver`; then `/bin/registry garbage-collect /etc/docker/registry/config.yml` to delete the now-unreferenced cached blobs (see the registry cleanup sketch below the table). |
1/14/2018 | gfs4 load warnings | Ongoing load warning on gfs4. Noticed the gfs2 brick was not connected. Restarted the gfs2 gluster server and rebooted the gfs4 node. Ran `gluster volume heal global info` and then `gluster volume heal global` to heal files (see the heal sketch below the table). |
1/8/2018 | transport connection errors | Started receiving alerts about exceeded pod restart thresholds for two mongo containers. Noticed I/O errors in the mongo logs. Exec'd into the Gluster server and noted that two bricks (node1, node2) were offline. Restarted both pods, one at a time. |
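The log-file cleanups above (8/10/2018 through 9/24/2018) all truncated the oversized log in place. The exact command wasn't recorded, so this is a minimal sketch assuming the stock GlusterFS client log path:

```sh
# Find the largest logs on the node (path is the GlusterFS default; adjust as needed)
du -sh /var/log/glusterfs/*.log | sort -h | tail

# Truncate in place instead of deleting: the writing process keeps its open
# file descriptor, so `rm` would not free the space until the container
# restarts, while truncation frees it immediately.
truncate -s 0 /var/log/glusterfs/glusterfs.log
```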
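For the 6/18/2018 SSH lockdown, the actual allowed ranges weren't recorded; a sketch using iptables with placeholder CIDRs (RFC 5737 example ranges, substitute the real NCSA/TACC/SDSC blocks and the home IP):

```sh
# Placeholder CIDRs only; replace with the real NCSA/TACC/SDSC ranges
for cidr in 192.0.2.0/24 198.51.100.0/24 203.0.113.0/24; do
    iptables -A INPUT -p tcp --dport 22 -s "$cidr" -j ACCEPT
done
# Drop SSH from everywhere else
iptables -A INPUT -p tcp --dport 22 -j DROP
```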
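The registry cleanup sketch referenced by the 1/19/2018 and 6/14/2018 entries, expanded from the commands recorded on 1/19 (the pod name and example repository come from that entry):

```sh
# Open a shell in the registry cache pod
kubectl exec -it registry -- sh

# Inside the pod:
# List the images currently held in the cache
wget localhost:5001/v2/_catalog -O -

# Pick a repository that can be safely evicted and remove it
cd /var/lib/registry/docker/registry/v2
rm -r repositories/craigwillis/apiserver

# Garbage-collect to actually delete the now-unreferenced blobs;
# `rm` alone only removes manifests/tags, not the blob data
/bin/registry garbage-collect /etc/docker/registry/config.yml
```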
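The heal sketch referenced by the 1/14/2018 entry. The volume name `global` comes from that entry; the `volume status` pre-check is a standard first step and an addition here, not something recorded in the log:

```sh
# Check brick status first: offline bricks show "N" in the Online column
gluster volume status global

# List entries pending heal, then trigger the self-heal
gluster volume heal global info
gluster volume heal global
```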
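The bind-mount sketch referenced by the 2/12/2018 entry. The log only says to remount /var/lib/docker as a bind-mount from /media/storage; the exact steps below are an assumption based on how the other nodes are described:

```sh
# Stop docker before touching its data directory
systemctl stop docker

# Move existing state onto the large volume, then bind-mount it back
mkdir -p /media/storage/docker
mv /var/lib/docker/* /media/storage/docker/
mount --bind /media/storage/docker /var/lib/docker

# Persist the bind mount across reboots (assumed fstab entry)
echo '/media/storage/docker /var/lib/docker none bind 0 0' >> /etc/fstab

systemctl start docker
```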
...