...

Date/Time: 10/26/2018
What happened: Disk space errors on gfs2
How was it resolved: Same as below, on gfs2. We really need to do some cleanup on SDSC and redeploy the beta instance.

Date/Time: 9/24/2018
What happened: Disk space errors on gfs2. Log files for the Kube registry and GLFS client pods were using ~1.5GB each.
How was it resolved: Cleared out the offending log files without restarting any containers using echo " " > big-log-file.json
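
A minimal sketch of this truncation trick, with illustrative paths (on these nodes the offenders were Docker JSON logs and Gluster client logs):

# Find the biggest log files first (paths are examples)
du -ah /var/lib/docker/containers /var/log/glusterfs 2>/dev/null | sort -rh | head

# Truncate in place; rm would not free the space while a running
# process still holds the file handle open
echo " " > big-log-file.json
truncate -s 0 big-log-file.json   # equivalent alternative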

Date/Time: 9/7/2018 - 9/8/2018
What happened: Disk space errors on gfs1. The Gluster client log file was using 9.6GB.
How was it resolved: Cleared out the log file without restarting any containers using echo " " > big-log-file.json

Date/Time: 8/12/2018
What happened: Disk space errors on lma. The Gluster client log file was using way too much space.
How was it resolved: Cleared out the log file without restarting any containers using echo " " > big-log-file.json

Date/Time: 8/10/2018
What happened: Disk space errors on gfs4. The Gluster client log file was using way too much space.
How was it resolved: Cleared out the log file without restarting any containers using echo " " > big-log-file.json

Date/Time: 7/24/2018
What happened: Load warnings on gfs2/node2/loadbal. The warnings returned on these same three nodes and continued for several hours.
How was it resolved: Still unresolved; the load warnings stopped after a time without any obvious manual intervention.

Date/Time: 7/2/2018
What happened: More unexplained load warnings on loadbal.

Date/Time: 6/28/2018
What happened: More unexplained load warnings on gfs2.
How was it resolved: Cause is still unknown, but we think this may be related to when users are accessing the NBI data.

Date/Time: 6/26/2018
What happened: Load warnings, pod restart warnings, and NBI data loss. Several pods went into CrashLoopBackOff after the NBI data was somehow reset; MongoDB reported the data size as 500MB instead of the expected ~20GB.
How was it resolved: NBI was scaled down and the data was restored (I think?).
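
A hedged sketch of verifying the reported size; the pod name (nbi-mongo-0) and database name (nbi) are assumptions, not taken from this log:

# Report database size in MB and compare against the expected ~20GB
kubectl exec -it nbi-mongo-0 -- mongo --quiet \
  --eval 'printjson(db.getSiblingDB("nbi").stats(1024*1024))'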

Date/Time: 6/18/2018 - 6/22/2018
What happened: Load warnings on gfs2/node2/loadbal. Load warnings started popping up on these three nodes and continued for several hours.
How was it resolved: Still unresolved; the warnings stopped after a time without any obvious manual intervention.

Date/Time: 6/18/2018
What happened: SSH brute force attempts on all nodes. Noticed a lot of brute force attempts on many of our nodes.
How was it resolved: Only allowing a subset of NCSA/TACC/SDSC public IPs for now, plus my home IP when remote access is needed.
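
A minimal sketch of the allow-list, assuming iptables; the CIDR ranges are placeholders, not the real NCSA/TACC/SDSC ranges:

# Allow SSH only from approved ranges, then drop everything else
iptables -A INPUT -p tcp --dport 22 -s 192.0.2.0/24 -j ACCEPT     # e.g., an NCSA range
iptables -A INPUT -p tcp --dport 22 -s 198.51.100.0/24 -j ACCEPT  # e.g., a TACC range
iptables -A INPUT -p tcp --dport 22 -j DROP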

Date/Time: 6/14/2018
What happened: Disk space errors on gfs3. The registry cache was using ~37GB.
How was it resolved: Couldn't exec into the cache as below (see the 1/19/2018 entry), due to OutOfDisk:

default       docker-cache-gnc8m                     0/1       OutOfDisk     0          345d
default       docker-cache-q1jh2                     1/1       Running       0          17m

Since the pod had already been moved elsewhere, just deleted it. However, the daemonset wouldn't recreate the pod on gfs3 unless I edited the spec; adding a simple label (other: test) made the pod appear.
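
A sketch of those recovery steps, assuming the daemonset is named docker-cache (suggested by the pod names above); the label is illustrative:

kubectl delete pod docker-cache-gnc8m
# Adding a label to the pod template forces the daemonset to recreate its pods
kubectl edit daemonset docker-cache
kubectl get pods -o wide | grep docker-cache   # confirm a new pod landed on gfs3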

Date/Time: 1/19/2018
What happened: Disk space warnings on gfs3. The registry cache was using ~34GB of disk.
How was it resolved:

kubectl exec -it registry sh
wget localhost:5001/v2/_catalog -O -   # lists the images in the cache
cd /var/lib/registry/docker/registry/v2
# find something that can be removed (e.g., repositories/craigwillis/apiserver)
rm -r repositories/craigwillis/apiserver
/bin/registry garbage-collect /etc/docker/registry/config.yml   # deletes the cached blobs

Date/Time: 1/8/2018
What happened: Transport connection errors. Started receiving alerts about exceeded pod restart thresholds for two mongo containers; noticed I/O errors in the mongo logs. Exec'd into the Gluster server and noted that two bricks (node1, node2) were offline.
How was it resolved: Restarted both pods, one at a time.
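
A sketch of how the offline bricks can be spotted, assuming the Gluster server runs in a pod (the pod name is a placeholder) and the volume is named global, as in the 1/14/2018 entry:

kubectl exec -it glfs-server-0 -- gluster volume status global
# Bricks showing "N" in the Online column are offline; restart those pods
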
Date/Time: 1/14/2018
What happened: Ongoing load warnings on gfs4. Noticed that the gfs2 brick was not connected.
How was it resolved: Restarted the gfs2 gluster server and rebooted the gfs4 node, then ran:

gluster volume heal global info   # list files pending heal
gluster volume heal global        # heal the files

Date/Time: 2/12/2018
What happened: LMA disk space warnings. The LMA node on public beta does not appear to have a /var/lib/docker mount. This would be fine, except that the node also had "ndslabs-role-compute: true" set, so client pods had been scheduled there. This included one instance each of NBI and MDF Forge, each of which has a huge image (~4GB), with NBI also having a larger-than-average docker overlay folder.
How was it resolved:
Short term: I have temporarily removed the compute label from LMA and deleted the MDF Forge pod and image. The NBI instance is Akshay's, so I will leave it running to avoid interrupting their work.
Long term: Once the user services are gone from this node (e.g., timeout), we can stop the docker daemon on LMA and remount /var/lib/docker as a bind-mount from /media/storage, as is standard on the other nodes (see the sketch below).
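
A minimal sketch of that long-term fix; the /media/storage/docker path is an assumption based on "as is standard on the other nodes":

systemctl stop docker
mkdir -p /media/storage/docker
mv /var/lib/docker/* /media/storage/docker/      # preserve existing images and containers
mount --bind /media/storage/docker /var/lib/docker
systemctl start docker
# To persist across reboots, add to /etc/fstab:
# /media/storage/docker  /var/lib/docker  none  bind  0  0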

Date/Time: 4/10/2018
What happened: SSL handshake errors. The Nagios NRPE container disappeared from node2 only.
How was it resolved: Running "kubectl apply -f ~/nagios-nrpe.ds.yaml" brought it back on node2. Also cleared out some space on node2's /var/lib/docker (it was at 94%) by deleting /var/lib/docker/tmp and restarting the docker daemon.
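
A sketch of that cleanup, assuming docker is managed by systemd:

rm -rf /var/lib/docker/tmp
systemctl restart docker
df -h /var/lib/docker   # confirm usage dropped from 94%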

Date/Time: 4/23/2018
What happened: LMA disk space warnings. Same thing as 2/12/2018.
How was it resolved: Deleted the jupyter-nbi Docker image from that node (again) to clear up some space. We should probably discuss removing the "compute" node label from this node to prevent it from happening again.

Date/Time: 4/29/2018
What happened: gfs2 disk space warnings. Same problem as 4/23/2018 and 2/12/2018, except on gfs2: on nodes where we did not initially plan to run user services, we did not mount /var/lib/docker.
How was it resolved: Hopefully in the coming weeks we will be able to reprovision the Workbench Beta to reset the clock on these warnings.

...