National Data Service / NDS-422

Cluster Deploy Fails in Nebula


    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Blocking
    • Labs Workbench - Beta
    • Labs Workbench - Beta
    • Development
    • None
    • NDS Sprint 9

      The primary issue is the MTU change in Nebula networking, which currently
      causes every cluster deploy to fail, regardless of the deploy-tools version used.

      Nebula team increased the MTU on 8/2: https://wiki.ncsa.illinois.edu/display/NEBULA/2016-08-02+Nebula+Maintenance

      The secondary issue is a required change in ansible related to NDS-344, where
      provisioning a node with or without a public IP interacts with provisioning it with or without security groups, and this interaction causes the load balancer to fail.

      Note that currently running systems should stay running as-is, but may fail if they are rebooted and pick up the new MTU.

      Where MTU matters:
      CoreOS - The MTU is picked up on boot and reboot, so it can change if a machine that has been running since before 8/2 is rebooted.
      Docker Networking - Docker picks up the MTU from the system, and the docker engine passes it on to the system network bridges. These are not reset when the host system is rebooted.
      Flannel - The overlay network picks up its MTU from the hosts; its behavior on reboot is unclear. This affects other networking overlays as well.
      Containers - Containers inherit the MTU of their host interfaces and bridges.
      Neutron Components - Some OpenStack network components, such as networks, have MTU settings that we can't see as OpenStack clients.
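
      To compare the MTUs listed above on a single host, something like the following Python sketch could be used. It is only an illustration: the function names, the default uplink name "eth0", and the overhead parameter are assumptions, not part of deploy-tools. It reads the per-interface MTU that Linux exposes under /sys/class/net and flags any interface whose MTU is larger than the uplink can carry.

```python
from pathlib import Path


def read_mtus(sys_net="/sys/class/net"):
    """Return {interface_name: mtu} for every interface listed under sys_net."""
    mtus = {}
    for iface in Path(sys_net).iterdir():
        mtu_file = iface / "mtu"
        if mtu_file.is_file():
            mtus[iface.name] = int(mtu_file.read_text().strip())
    return mtus


def find_oversized(mtus, uplink="eth0", overhead=0):
    """List (name, mtu, allowed) for interfaces that exceed the uplink MTU.

    `uplink` is the interface facing the OpenStack network (assumed name).
    `overhead` is the number of bytes reserved for overlay encapsulation;
    the right value depends on the overlay backend and is an assumption here.
    """
    allowed = mtus[uplink] - overhead
    problems = []
    for name, mtu in sorted(mtus.items()):
        if name in (uplink, "lo"):
            continue  # the loopback MTU is unrelated to the network path
        if mtu > allowed:
            problems.append((name, mtu, allowed))
    return problems
```

      Calling find_oversized(read_mtus()) on a node gives a quick list of bridges or overlay interfaces that kept a larger MTU than the uplink; an empty list only means that no interface on that host is oversized, not that the end-to-end path is fine.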

      Symptoms:
      Cluster provisioning appears to succeed, but fails underneath: the docker daemon never starts.
      Flannel cluster networking fails because it cannot talk to etcd:
      {{etcdctl -C http://master1:2379/ ls /cluster.local/network/subnets
      Error: client: etcd cluster is unavailable or misconfigured
      error #0: client: endpoint http://192.168.100.216:2379 exceeded header timeout}}
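
      One pattern consistent with the timeout above is a TCP connection that opens normally while the HTTP response never arrives. The Python sketch below separates those two cases; it is a hypothetical helper (probe_endpoint is not an etcd or deploy-tools function), and a "no-response" result can have causes other than MTU.

```python
import http.client
import socket
import urllib.error
import urllib.request
from urllib.parse import urlparse


def probe_endpoint(url, timeout=2.0):
    """Classify an HTTP endpoint as 'unreachable', 'no-response', or 'ok'."""
    parts = urlparse(url)
    port = parts.port or 80
    try:
        # Step 1: bare TCP handshake, which only needs small packets.
        socket.create_connection((parts.hostname, port), timeout=timeout).close()
    except OSError:
        return "unreachable"
    try:
        # Step 2: full HTTP exchange, which needs the response to get back.
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()
    except urllib.error.HTTPError:
        return "ok"  # the server answered, even if with an error status
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return "no-response"
    return "ok"
```

      For example, probe_endpoint("http://master1:2379/version") could be run from a node, assuming the etcd member serves its usual /version path on the client port.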

      Fix strategies:
      • Determine how end-to-end MTU behaves in general on new instances on the new network.
        Confirms that the MTU is OK at the system and OpenStack level.
      • Determine the end-to-end MTU between containers on multiple hosts.
        Confirms that the Docker, container, bridge, and intervening system MTUs are compatible.
      • Determine that the Kubernetes supporting services are compatible: etcd, flannel, and the kube components.
        Confirms that Kubernetes should run properly.
      • Confirm that NDS-344 can run/deploy a full system.
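
      The end-to-end MTU checks above could be approached with a probe like the following Python sketch. It is a minimal illustration, not the procedure used on this ticket: the function names are hypothetical, the default probe assumes Linux iputils ping (the -M do option sets Don't Fragment), and the search assumes that if one payload size gets through, every smaller size does too.

```python
import subprocess

ICMP_OVERHEAD = 28  # 20-byte IPv4 header + 8-byte ICMP header


def ping_df(host, payload):
    """Send one ping with Don't Fragment set; True if a reply came back."""
    result = subprocess.run(
        ["ping", "-M", "do", "-c", "1", "-W", "1", "-s", str(payload), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def largest_payload(probe, low=0, high=8972):
    """Binary-search the largest size in [low, high] for which probe(size) is True.

    Returns None if even `low` fails. The default `high` (9000 - 28) is an
    assumed upper bound on the payload, not a value taken from Nebula.
    """
    if not probe(low):
        return None
    while low < high:
        mid = (low + high + 1) // 2
        if probe(mid):
            low = mid
        else:
            high = mid - 1
    return low


def path_mtu(host, probe=ping_df):
    """Estimate the path MTU to host, or None if the host does not answer."""
    best = largest_payload(lambda size: probe(host, size))
    return None if best is None else best + ICMP_OVERHEAD
```

      Running path_mtu between two instances, and then between two containers on different hosts, gives numbers that can be compared against the interface MTUs on each side. The probe is passed in as a parameter so the search logic can be exercised without network access.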

      Original estimate includes time already spent

              raila David Raila
              raila David Raila


                  Estimated: 1d 2h
                  Remaining: 1d 2h
                  Logged: Not Specified