...
Date/Time | What happened | How was it resolved |
---|---|---|
6/14/2018 | Disk space errors gfs3 | The registry cache was using ~37GB Couldn't exec into cache as below, due to OutOfDisk default docker-cache-gnc8m 0/1 OutOfDisk 0 345d default docker-cache-q1jh2 1/1 Running 0 17m Since the pod had already been moved elsewhere, just deleted it. |
1/19/2018 | Disk space warnings gfs3 | The registry cache was using ~34GB disk. kubectl exec -it regsitry sh wget localhost:5001/v2/_catalog -O - (lists images in cache) cd /var/lib/registry/docker/registry/v2 find something that can be removed (e.g., repositories/craigwillis/apiserver) rm -r repositories/craigwillis/apiserver /bin/registry garbage-collect /etc/docker/registry/config.yml Deletes cached blobs |
1/8/2018 | transport connection errors | Started receiving alerts about exceeded pod restart thresholds for two mongo containers. Noticed I/O errors in mongo logs. Exec'd into Gluster server and noted that two bricks (node1, node2) were offline. Restarted both pods, one at a time. |
1/14/2018 | gfs4 load warnings | Ongoing load warning on gfs4. Noticed gfs2 brick not connected. Restarted gfs2 gluster server. Rebooted gfs4 node. Ran gluster volume heal global info gluster volume heal global to heal files |
2/12/2018 | LMA disk space warnings | LMA node on public beta does not appear to have a /var/lib/docker mount.. this would be fine, except that the node also had "ndslabs-role-compute: true" set, so client pods had been scheduled there. This included one instance each of NBI and MDF Forge, each of which have huge images (~4GB) with NBI also having a larger-than-average docker overlay folder. Short term: I have temporarily removed the compute label from LMA and deleted the MDF Forge pod and image - the NBI instance is Akshay's, so I will leave it running to avoid interrupting their work. Long term: Once the user services are gone from this node (e.g. timeout), we can stop the docker daemon on LMA and remount /var/lib/docker as a bind-mount from /media/storage, as is standard on the other nodes. |
4/10/2018 | SSL handshake errors | Nagios NRPE container disappeared from only node2 Performing a "kubectl apply -f ~/nagios-nrpe.ds.yaml" brought it back on Also cleared out some space on node2's /var/lib/docker (it was at 94%) by deleting /var/lib/docker/tmp and restarting the docker daemon |
4/23/2018 | LMA disk space warnings | Same thing as 2/12/2018... I deleted the jupyter-nbi Docker image from that node (again) to clear up some space. We should probably consider/discuss removing the "compute" node label from this node to prevent it from happening again. |
4/29/2018 | gfs2 disk space warnings | Same problem as 4/23/2018 and 2/12/2018, except on GFS2. On nodes where we did not initially plan to execute user services, we did not mount /var/lib/docker. Hopefully in the coming weeks we will be able to reprovision the Workbench Beta to reset the clock on these warnings. |
...