-
Bug
-
Resolution: Won't Fix
-
Blocking
-
Labs Workbench - Beta
-
None
-
NDS Sprint 9
The primary issue is the MTU change in Nebula networking, which currently
causes any cluster deploy to fail, from any version of deploy-tools.
The Nebula team increased the MTU on 8/2: https://wiki.ncsa.illinois.edu/display/NEBULA/2016-08-02+Nebula+Maintenance
The secondary issue is a required change in ansible related to NDS-344, where there is
an interaction between provisioning a node with/without a public IP and with/without security groups, which causes failure of the load balancer.
Note that currently running systems should stay running as-is, but may fail if they are rebooted and pick up the new MTU.
Where MTU matters:
CoreOS - picks up the MTU on boot and reboot, so it can change if a machine that was running prior to 8/2 is rebooted.
Docker Networking - picks up the MTU from the system; the docker engine passes the MTU on to the system network bridges. These are not reset when the host system is rebooted.
Flannel - the overlay network picks up its MTU from the hosts; behavior on reboot is unclear. This affects other networking overlays as well.
Containers - containers inherit the MTU of their host interfaces and bridges.
Neutron Components - some OpenStack network components, such as networks, have MTU settings that we can't see as OpenStack clients.
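A quick way to see which of the layers above disagree on a given host is to list the MTU of every interface and flag any that is larger than the uplink. The following is a minimal sketch, not part of deploy-tools; it assumes a Linux host exposing /sys/class/net, and the interface names used in the example (eth0, docker0, flannel.1) are assumptions that will differ per node.

```python
from pathlib import Path


def read_mtus(sys_net="/sys/class/net"):
    """Return {interface_name: mtu} for every interface found under sys_net.

    Returns an empty dict if the directory does not exist (non-Linux host).
    """
    mtus = {}
    root = Path(sys_net)
    if not root.is_dir():
        return mtus
    for iface in sorted(root.iterdir()):
        try:
            mtus[iface.name] = int((iface / "mtu").read_text().strip())
        except (OSError, ValueError):
            # Skip entries that are not interfaces or are unreadable.
            continue
    return mtus


def oversized(mtus, uplink):
    """Return the interfaces whose MTU is larger than the uplink's MTU.

    The loopback interface is ignored because its MTU is unrelated to
    what the physical/virtual network can carry.
    """
    limit = mtus[uplink]
    return {
        name: mtu
        for name, mtu in mtus.items()
        if name not in (uplink, "lo") and mtu > limit
    }


# Example with assumed interface names and values:
sample = {"eth0": 1450, "docker0": 1500, "flannel.1": 1400, "lo": 65536}
print(oversized(sample, "eth0"))
```

On a real node, oversized(read_mtus(), "eth0") would be run before and after a reboot to see whether a bridge or overlay kept a stale MTU. The uplink name is an assumption and has to be adjusted per host.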
Symptoms:
Cluster provisioning appears to succeed, but fails underneath: the docker daemon does not start
Flannel cluster networking fails due to inability to talk with etcd:
{{etcdctl -C http://master1:2379/ ls /cluster.local/network/subnets
Error: client: etcd cluster is unavailable or misconfigured
error #0: client: endpoint http://192.168.100.216:2379 exceeded header timeout}}
Fix strategies:
determine that end-to-end MTU works in general on new instances on the new network - confirms that the MTU is OK at the system and OpenStack level
determine that MTU works end-to-end between containers on multiple hosts - confirms that docker, container, bridge, and intervening system MTUs are compatible
determine that the Kubernetes supporting services (etcd, flannel, kube components) are compatible - confirms that Kubernetes should run properly
confirm that NDS-344 can run/deploy a full system.
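The end-to-end MTU checks in the first two strategies can be approximated with a don't-fragment ping sized to exactly fill one packet at the MTU under test. The sketch below is illustrative only: it assumes IPv4 without IP options, the Linux iputils ping (its -M do, -s and -c flags), and the helper names are hypothetical rather than anything in deploy-tools.

```python
import subprocess

IP_HEADER = 20    # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo header


def icmp_payload_for_mtu(mtu):
    """Return the ICMP payload size that produces a packet of exactly mtu bytes."""
    overhead = IP_HEADER + ICMP_HEADER
    if mtu <= overhead:
        raise ValueError("MTU must be larger than %d bytes" % overhead)
    return mtu - overhead


def ping_command(host, mtu, count=3):
    """Build an iputils ping command that forbids fragmentation (-M do)."""
    return [
        "ping", "-M", "do",
        "-s", str(icmp_payload_for_mtu(mtu)),
        "-c", str(count),
        host,
    ]


def path_carries_mtu(host, mtu):
    """Return True if host answers unfragmented packets of size mtu."""
    result = subprocess.run(
        ping_command(host, mtu),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0
```

For example, path_carries_mtu("master1", 1500) run from one node, and again from inside a container on another node, would show whether the host path and the container path agree. The MTU values to probe are assumptions and should be taken from what the interfaces actually report.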
Original estimate includes time already spent