...

Date/Time: 10/26/2018
What happened: Disk space errors on gfs2
How was it resolved: Same as below, on gfs2. We really need to do some cleanup on SDSC and redeploy the beta instance.

Date/Time: 9/24/2018
What happened: Disk space errors on gfs2. Log files for the Kube registry and GLFS client pods were using ~1.5GB each.
How was it resolved: Cleared out the offending log files without restarting any containers using echo " " > big-log-file.json
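
A minimal sketch of this truncation trick, with illustrative paths (on these nodes the offenders were Docker JSON logs and Gluster client logs):

# Find the biggest log files first (paths are examples)
du -ah /var/lib/docker/containers /var/log/glusterfs 2>/dev/null | sort -rh | head

# Truncate in place; rm would not free the space while a running
# process still holds the file handle open
echo " " > big-log-file.json
truncate -s 0 big-log-file.json   # equivalent alternative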

Date/Time: 9/7/2018 - 9/8/2018
What happened: Disk space errors on gfs1. The Gluster client log file was using 9.6GB.
How was it resolved: Cleared out the log file without restarting any containers using echo " " > big-log-file.json

Date/Time: 8/12/2018
What happened: Disk space errors on lma. The Gluster client log file was using way too much space.
How was it resolved: Cleared out the log file without restarting any containers using echo " " > big-log-file.json

Date/Time: 8/10/2018
What happened: Disk space errors on gfs4. The Gluster client log file was using way too much space.
How was it resolved: Cleared out the log file without restarting any containers using echo " " > big-log-file.json

Date/Time: 7/24/2018
What happened: Load warnings on gfs2/node2/loadbal. The warnings returned on these same three nodes and continued for several hours.
How was it resolved: Still unresolved; the load warnings stopped after a time without any obvious manual intervention.

Date/Time: 7/2/2018
What happened: More unexplained load warnings on loadbal.

Date/Time: 6/28/2018
What happened: More unexplained load warnings on gfs2.
How was it resolved: Cause is still unknown, but we think this may be related to when users are accessing the NBI data.

Date/Time: 6/26/2018
What happened: Load warnings, pod restart warnings, and NBI data loss. Several pods went into CrashLoopBackOff after the NBI data was somehow reset; MongoDB reported the data size as 500MB instead of the expected ~20GB.
How was it resolved: NBI was scaled down and the data was restored (I think?).
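
A hedged sketch of verifying the reported size; the pod name (nbi-mongo-0) and database name (nbi) are assumptions, not taken from this log:

# Report database size in MB and compare against the expected ~20GB
kubectl exec -it nbi-mongo-0 -- mongo --quiet \
  --eval 'printjson(db.getSiblingDB("nbi").stats(1024*1024))'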

Date/Time: 6/18/2018 - 6/22/2018
What happened: Load warnings on gfs2/node2/loadbal. Load warnings started popping up on these three nodes and continued for several hours.
How was it resolved: Still unresolved; the warnings stopped after a time without any obvious manual intervention.

Date/Time: 6/18/2018
What happened: SSH brute force attempts on all nodes. Noticed a lot of brute force attempts on many of our nodes.
How was it resolved: Only allowing a subset of NCSA/TACC/SDSC public IPs for now, plus my home IP when remote access is needed.
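
A minimal sketch of the allow-list, assuming iptables; the CIDR ranges are placeholders, not the real NCSA/TACC/SDSC ranges:

# Allow SSH only from approved ranges, then drop everything else
iptables -A INPUT -p tcp --dport 22 -s 192.0.2.0/24 -j ACCEPT     # e.g., an NCSA range
iptables -A INPUT -p tcp --dport 22 -s 198.51.100.0/24 -j ACCEPT  # e.g., a TACC range
iptables -A INPUT -p tcp --dport 22 -j DROP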

Date/Time: 6/14/2018
What happened: Disk space errors on gfs3. The registry cache was using ~37GB.
How was it resolved: Couldn't exec into the cache as below (see the 1/19/2018 entry), due to OutOfDisk:

default       docker-cache-gnc8m                     0/1       OutOfDisk     0          345d
default       docker-cache-q1jh2                     1/1       Running       0          17m

Since the pod had already been moved elsewhere, just deleted it. However, the daemonset wouldn't recreate the pod on gfs3 unless I edited the spec; adding a simple label (other: test) made the pod appear.
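
A sketch of those recovery steps, assuming the daemonset is named docker-cache (suggested by the pod names above); the label is illustrative:

kubectl delete pod docker-cache-gnc8m
# Adding a label to the pod template forces the daemonset to recreate its pods
kubectl edit daemonset docker-cache
kubectl get pods -o wide | grep docker-cache   # confirm a new pod landed on gfs3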

Date/Time: 1/19/2018
What happened: Disk space warnings on gfs3. The registry cache was using ~34GB of disk.
How was it resolved:

kubectl exec -it registry sh
wget localhost:5001/v2/_catalog -O -   # lists the images in the cache
cd /var/lib/registry/docker/registry/v2
# find something that can be removed (e.g., repositories/craigwillis/apiserver)
rm -r repositories/craigwillis/apiserver
/bin/registry garbage-collect /etc/docker/registry/config.yml   # deletes the cached blobs

Date/Time: 1/8/2018
What happened: Transport connection errors. Started receiving alerts about exceeded pod restart thresholds for two mongo containers; noticed I/O errors in the mongo logs. Exec'd into the Gluster server and noted that two bricks (node1, node2) were offline.
How was it resolved: Restarted both pods, one at a time.
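
A sketch of how the offline bricks can be spotted, assuming the Gluster server runs in a pod (the pod name is a placeholder) and the volume is named global, as in the 1/14/2018 entry:

kubectl exec -it glfs-server-0 -- gluster volume status global
# Bricks showing "N" in the Online column are offline; restart those pods
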
Date/Time: 1/14/2018
What happened: Ongoing load warnings on gfs4. Noticed that the gfs2 brick was not connected.
How was it resolved: Restarted the gfs2 gluster server and rebooted the gfs4 node, then ran:

gluster volume heal global info   # list files pending heal
gluster volume heal global        # heal the files

Date/Time: 2/12/2018
What happened: LMA disk space warnings. The LMA node on public beta does not appear to have a /var/lib/docker mount. This would be fine, except that the node also had "ndslabs-role-compute: true" set, so client pods had been scheduled there. This included one instance each of NBI and MDF Forge, each of which has a huge image (~4GB), with NBI also having a larger-than-average docker overlay folder.
How was it resolved:
Short term: I have temporarily removed the compute label from LMA and deleted the MDF Forge pod and image. The NBI instance is Akshay's, so I will leave it running to avoid interrupting their work.
Long term: Once the user services are gone from this node (e.g., timeout), we can stop the docker daemon on LMA and remount /var/lib/docker as a bind-mount from /media/storage, as is standard on the other nodes (see the sketch below).
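
A minimal sketch of that long-term fix; the /media/storage/docker path is an assumption based on "as is standard on the other nodes":

systemctl stop docker
mkdir -p /media/storage/docker
mv /var/lib/docker/* /media/storage/docker/      # preserve existing images and containers
mount --bind /media/storage/docker /var/lib/docker
systemctl start docker
# To persist across reboots, add to /etc/fstab:
# /media/storage/docker  /var/lib/docker  none  bind  0  0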

Date/Time: 4/10/2018
What happened: SSL handshake errors. The Nagios NRPE container disappeared from node2 only.
How was it resolved: Running "kubectl apply -f ~/nagios-nrpe.ds.yaml" brought it back on node2. Also cleared out some space on node2's /var/lib/docker (it was at 94%) by deleting /var/lib/docker/tmp and restarting the docker daemon.
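
A sketch of that cleanup, assuming docker is managed by systemd:

rm -rf /var/lib/docker/tmp
systemctl restart docker
df -h /var/lib/docker   # confirm usage dropped from 94%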

Date/Time: 4/23/2018
What happened: LMA disk space warnings. Same thing as 2/12/2018.
How was it resolved: Deleted the jupyter-nbi Docker image from that node (again) to clear up some space. We should probably discuss removing the "compute" node label from this node to prevent it from happening again.

Date/Time: 4/29/2018
What happened: gfs2 disk space warnings. Same problem as 4/23/2018 and 2/12/2018, except on gfs2: on nodes where we did not initially plan to run user services, we did not mount /var/lib/docker.
How was it resolved: Hopefully in the coming weeks we will be able to reprovision the Workbench Beta to reset the clock on these warnings.

...