Date/Time | What happened | How was it resolved
6/14/2018 | Disk space errors gfs3

The registry cache was using ~37GB

Couldn't exec into the cache pod because it was in the OutOfDisk state, as shown below:

default       docker-cache-gnc8m                     0/1       OutOfDisk     0          345d

default       docker-cache-q1jh2                     1/1       Running       0          17m

Since the pod had already been moved elsewhere, I just deleted it.

However, the daemonset wouldn't create the pod on gfs3 until I edited the spec. After I added a simple label (other; test), the pod appeared.
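The workaround above could be scripted roughly as follows. This is a sketch under assumptions: the DaemonSet name (docker-cache) is only inferred from the pod names, and the label is assumed to go on the pod template; the original edit may have been made elsewhere in the spec. Patching is just a non-interactive stand-in for kubectl edit.

```shell
# Delete a pod stuck in OutOfDisk, then add a throwaway label to the
# DaemonSet's pod template so the controller re-evaluates the node.
# Assumption: the DaemonSet name passed as $2 (e.g. docker-cache) is a guess.
recreate_ds_pod() {
    pod="$1"   # e.g. docker-cache-gnc8m
    ds="$2"    # e.g. docker-cache (assumed name)
    kubectl delete pod "$pod" &&
    kubectl patch daemonset "$ds" \
        -p '{"spec":{"template":{"metadata":{"labels":{"other":"test"}}}}}'
}

# Usage: recreate_ds_pod docker-cache-gnc8m docker-cache
```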

1/19/2018 | Disk space warnings gfs3

The registry cache was using ~34GB disk.

kubectl exec -it registry sh

 wget localhost:5001/v2/_catalog -O - (lists images in cache)

cd /var/lib/registry/docker/registry/v2

find something that can be removed (e.g., repositories/craigwillis/apiserver)

rm -r repositories/craigwillis/apiserver

/bin/registry  garbage-collect  /etc/docker/registry/config.yml

Deletes cached blobs
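The cleanup steps above can be wrapped in two small functions to run inside the registry container. This is a sketch, not the procedure that was actually scripted. The paths are the ones used above; the variable and function names are made up for illustration, and the preview assumes the installed registry version supports garbage-collect --dry-run.

```shell
# Defaults match the paths used in the steps above; override if they differ.
REGISTRY_BIN="${REGISTRY_BIN:-/bin/registry}"
REGISTRY_ROOT="${REGISTRY_ROOT:-/var/lib/registry/docker/registry/v2}"
REGISTRY_CONFIG="${REGISTRY_CONFIG:-/etc/docker/registry/config.yml}"

# Preview only: list what garbage-collect would delete without deleting it.
preview_gc() {
    "$REGISTRY_BIN" garbage-collect --dry-run "$REGISTRY_CONFIG"
}

# Remove one cached repository (e.g. craigwillis/apiserver), then delete the
# blobs that nothing references any more.
purge_cached_repo() {
    # Refuse an empty argument so the whole repositories/ tree is never removed
    [ -n "$1" ] || return 1
    rm -r "$REGISTRY_ROOT/repositories/$1" &&
    "$REGISTRY_BIN" garbage-collect "$REGISTRY_CONFIG"
}

# Usage: purge_cached_repo craigwillis/apiserver
```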


1/8/2018 | transport connection errors

Started receiving alerts about exceeded pod restart thresholds for two mongo containers. Noticed I/O errors in mongo logs. Exec'd into Gluster server and noted that two bricks (node1, node2) were offline. Restarted both pods, one at a time.

1/14/2018 | gfs4 load warnings

Ongoing load warning on gfs4. Noticed gfs2 brick not connected. Restarted gfs2 gluster server. Rebooted gfs4 node.

Ran

gluster volume heal global info

to list the files needing heal, then

gluster volume heal global

to heal files

2/12/2018 | LMA disk space warnings

The LMA node on public beta does not appear to have a /var/lib/docker mount. This would be fine, except that the node also had "ndslabs-role-compute: true" set, so client pods had been scheduled there.

This included one instance each of NBI and MDF Forge, each of which has a huge image (~4GB), with NBI also having a larger-than-average docker overlay folder.

Short term: I have temporarily removed the compute label from LMA and deleted the MDF Forge pod and image - the NBI instance is Akshay's, so I will leave it running to avoid interrupting their work.

Long term: Once the user services are gone from this node (e.g. timeout), we can stop the docker daemon on LMA and remount /var/lib/docker as a bind-mount from /media/storage, as is standard on the other nodes.
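A minimal sketch of the long-term remount, under stated assumptions: docker is managed by systemd, and /media/storage/docker is a hypothetical target directory (the note above names only /media/storage). Anything already under /var/lib/docker is hidden by the bind mount, so images would have to be copied over first or pulled again, and the mount does not survive a reboot unless it is also added to /etc/fstab.

```shell
# Stop docker, bind-mount a directory from the storage volume over
# /var/lib/docker, then start docker again. Must run as root.
# Assumption: /media/storage/docker is a hypothetical default target.
remount_docker_dir() {
    src="${1:-/media/storage/docker}"
    systemctl stop docker &&
    mkdir -p "$src" &&
    mount --bind "$src" /var/lib/docker &&
    systemctl start docker
}

# Usage: remount_docker_dir /media/storage/docker
```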

4/10/2018 | SSL handshake errors

The Nagios NRPE container disappeared from node2 only.

Running "kubectl apply -f ~/nagios-nrpe.ds.yaml" brought it back.

Also cleared out some space on node2's /var/lib/docker (it was at 94%) by deleting /var/lib/docker/tmp and restarting the docker daemon
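A quick way to check how full a filesystem is before and after this kind of cleanup is sketched below. The function names and the 90% default threshold are illustrative choices, not values taken from the monitoring setup.

```shell
# Print the used percentage (integer, no % sign) of the filesystem holding $1.
disk_used_pct() {
    # -P forces the POSIX one-line-per-filesystem format; column 5 is capacity
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Report usage and return non-zero when it reaches the threshold
# (second argument; 90 is an arbitrary example default).
check_disk() {
    pct=$(disk_used_pct "$1")
    if [ "$pct" -ge "${2:-90}" ]; then
        echo "WARNING: $1 is at ${pct}%"
        return 1
    fi
    echo "OK: $1 is at ${pct}%"
}

# Usage: check_disk /var/lib/docker 90
```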

4/23/2018 | LMA disk space warnings

Same thing as 2/12/2018... I deleted the jupyter-nbi Docker image from that node (again) to clear up some space.

We should probably consider/discuss removing the "compute" node label from this node to prevent it from happening again.
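If the label is removed, it could be done along these lines. The label key is the one quoted in the 2/12/2018 entry; the node name is a placeholder and the function name is made up for illustration. A trailing dash after the key is kubectl's syntax for removing a label.

```shell
# Remove the compute role label from a node so new client pods that select
# on it are no longer scheduled there. Pods already running are not evicted.
remove_compute_label() {
    node="$1"   # placeholder: the node's name as shown by "kubectl get nodes"
    kubectl label node "$node" ndslabs-role-compute-
}

# To restore it later:
#   kubectl label node <node> ndslabs-role-compute=true
```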

4/29/2018 | gfs2 disk space warnings

Same problem as 4/23/2018 and 2/12/2018, except on GFS2. On nodes where we did not initially plan to execute user services, we did not mount /var/lib/docker.

Hopefully in the coming weeks we will be able to reprovision the Workbench Beta to reset the clock on these warnings.
