National Data Service / NDS-691

Explore what happens when bad things happen


    • Type: Task
    • Resolution: Fixed
    • Priority: Normal
    • Labs Workbench - Beta
    • Sprint: NDS Sprint 16, NDS Sprint 17

      For each scenario below, test and document what happens on each kind of node (compute, loadbal, gfs, master, etc.):

      • Reboot node
      • Cordon/drain node
      • Bring node back online
      • Pod in pending state
      • Node not responding:
        • Hung node due to a resource constraint: pegged CPU, out of memory, out of disk, etc.
        • Paused node
        • Dead kubelet (this is apparently caused by resource constraints)
        • Unschedulable node
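
      The cordon/drain and bring-back-online scenarios above could be driven with standard kubectl subcommands. The following is only a minimal sketch, not the procedure used for this task: the node name `compute-1` and the helper `node_commands` are hypothetical, and the script only builds and prints the commands rather than running them against a cluster.

```python
from typing import Dict, List


def node_commands(node: str) -> Dict[str, List[str]]:
    """Build kubectl command lines for one node (nothing is executed here)."""
    return {
        # Mark the node unschedulable without evicting running pods.
        "cordon": ["kubectl", "cordon", node],
        # Evict pods from the node; DaemonSet-managed pods are left in place.
        "drain": ["kubectl", "drain", node, "--ignore-daemonsets"],
        # Allow scheduling on the node again once it is back online.
        "uncordon": ["kubectl", "uncordon", node],
        # Inspect node conditions and events while it is misbehaving.
        "describe": ["kubectl", "describe", "node", node],
    }


if __name__ == "__main__":
    # "compute-1" is a placeholder node name, not one taken from the cluster.
    for step, cmd in node_commands("compute-1").items():
        print(f"{step}: {' '.join(cmd)}")
```

      To actually run a step, each command list could be passed to `subprocess.run`, checking pod and service state before and after.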

      Be sure to take note of:

      • What happens to running pods?
        • Read-only pods
        • Read-write pods
        • Is some manual step needed?
        • Do they recover automatically? (reboot => ok)
      • What happens to kube services?
        • Do they fail?
        • Do they recover?
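
      One way to keep the documentation complete is to enumerate every scenario and node-kind combination up front and fill in the observations as each test is run. The sketch below assumes the four node kinds named above (the "etc." means the list may be incomplete); the names `build_matrix`, `NODE_KINDS`, `SCENARIOS`, and `QUESTIONS` are illustrative only.

```python
from itertools import product

# Node kinds named in the description; extend if the cluster has others.
NODE_KINDS = ["compute", "loadbal", "gfs", "master"]

SCENARIOS = [
    "reboot node",
    "cordon/drain node",
    "bring node back online",
    "pod in pending state",
    "hung node (resource constraint)",
    "paused node",
    "dead kubelet",
    "unschedulable node",
]

# Questions to answer for every (node kind, scenario) pair.
QUESTIONS = [
    "what happens to read-only pods?",
    "what happens to read-write pods?",
    "is a manual step needed?",
    "do pods recover automatically?",
    "do kube services fail?",
    "do kube services recover?",
]


def build_matrix():
    """Return one empty record per (node kind, scenario) combination."""
    return [
        {
            "node_kind": kind,
            "scenario": scenario,
            "observations": {q: None for q in QUESTIONS},
        }
        for kind, scenario in product(NODE_KINDS, SCENARIOS)
    ]


if __name__ == "__main__":
    matrix = build_matrix()
    print(f"{len(matrix)} test cases to run and document")
```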

              lambert8 Sara Lambert
              willis8 Craig Willis

                  Original Estimate: 2d
                  Remaining Estimate: 1d 2h
                  Time Spent: 6h