  1. National Data Service
  2. NDS-986

Kubernetes/CoreOS occasionally fails to pull large images


    • Type: Bug
    • Resolution: Fixed
    • Priority: Normal
    • Component: Infrastructure
    • Sprint: NDS Sprint 30, NDS Sprint 32

      The error encountered was:

      ErrImagePull: "net/http: request canceled"

      Is this timeout value configurable, or is it hard-coded? We might consider raising the timeout, but I would much prefer to investigate whatever is causing the network slowness in the first place.
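As a stopgap while the root cause is investigated, large images could be pre-pulled onto each node so that the kubelet finds them in the local cache. The following is only a sketch, assuming the Docker CLI is available on the node and that the pods use an image pull policy of IfNotPresent. The function name `pull_with_retries` and its defaults (3 attempts, 5-second base delay, 600-second per-attempt limit) are hypothetical, and it does not change the kubelet's own timeout.

```python
import subprocess
import time


def pull_with_retries(image, attempts=3, base_delay=5.0, timeout=600.0,
                      run=subprocess.run, sleep=time.sleep):
    """Try `docker pull <image>` up to `attempts` times.

    Returns the 1-based number of the attempt that succeeded, or None
    if every attempt failed or timed out. `run` and `sleep` are
    injectable so the helper can be exercised without Docker.
    """
    for attempt in range(1, attempts + 1):
        try:
            result = run(["docker", "pull", image], timeout=timeout)
            if result.returncode == 0:
                return attempt
        except subprocess.TimeoutExpired:
            pass  # treat a hung pull like a failed one
        if attempt < attempts:
            # Exponential backoff between attempts: 5s, 10s, 20s, ...
            sleep(base_delay * 2 ** (attempt - 1))
    return None
```

Logging which attempt succeeded (and how often none does) would also give a rough measure of how frequently the slowness occurs on a given node.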

      A few theories have surfaced:

      • it could be MTU-related, as we've seen previously
      • it could be a misconfiguration of our deployments, resulting from a misunderstanding of the "layer-cake" of volumes and filesystem types
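One way to test the MTU theory is to compare the MTU of each interface on a node against the MTU of its uplink: a bridge or overlay interface with a larger MTU than the uplink can cause large packets to be dropped, which tends to show up as stalled transfers. This is a minimal sketch, assuming a Linux node that exposes interface MTUs under `/sys/class/net`. The helper names `read_mtus` and `oversized` are hypothetical.

```python
from pathlib import Path


def read_mtus(sys_net="/sys/class/net"):
    """Return {interface_name: mtu} for every interface under sys_net."""
    mtus = {}
    for iface in sorted(Path(sys_net).iterdir()):
        mtu_file = iface / "mtu"
        try:
            mtus[iface.name] = int(mtu_file.read_text().strip())
        except (OSError, ValueError):
            continue  # skip entries without a readable mtu file
    return mtus


def oversized(mtus, uplink):
    """Return the interfaces whose MTU exceeds the uplink's MTU."""
    limit = mtus[uplink]
    return sorted(name for name, mtu in mtus.items() if mtu > limit)
```

On a node, `oversized(read_mtus(), "eth0")` would list any interfaces to look at first, where `eth0` is an assumed uplink name that will differ between hosts.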

      We have seen this occur in several scenarios:

      1. multi-node on OpenStack volumes with underlying XFS
      2. single-node without an OpenStack volume at all

      As I recall, changing our Docker volume from XFS to ext4 was a slight improvement, but that doesn't explain the poor performance on a single-node installation when no volumes are involved. Perhaps we should default to ext4 going forward? More discussion will certainly be needed.
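To compare the scenarios fairly, it helps to record which filesystem actually backs Docker's storage on each node. The sketch below is illustrative only: it assumes the text of `/proc/mounts` is passed in and that Docker uses its default data directory, `/var/lib/docker`. The function name `fs_type_for` is hypothetical, and symlinks in the path are not resolved.

```python
import os


def fs_type_for(path, mounts_text):
    """Return (mount_point, fs_type) of the mount containing `path`.

    `mounts_text` is the content of /proc/mounts. The longest matching
    mount point wins; on a tie, the later (shadowing) entry wins.
    Returns None if no mount matches.
    """
    path = os.path.normpath(path)
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or (path + "/").startswith(prefix):
            if best is None or len(mount_point) >= len(best[0]):
                best = (mount_point, fs_type)
    return best
```

On a node this could be called as `fs_type_for("/var/lib/docker", open("/proc/mounts").read())`, and the result noted alongside each pull failure.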

      Should we just abandon CoreOS? This is a reasonable solution.

              willis8 Craig Willis
              lambert8 Sara Lambert


                  Original Estimate - 4 hours
                  Remaining Estimate - 4 hours
                  Time Spent - Not Specified