
This section documents the results of NDS-239. The goal of this ticket was to determine whether the Nginx ingress controller would be a performance bottleneck for the NDS Labs system.

Cluster configuration

The cluster used for testing is the original IASSIST cluster, which uses an m1.medium (2 CPU, 4 GB) for the loadbalancer and 4 compute nodes.

 

[Image: cluster configuration]

Baseline service: Nginx

This test uses the nginx-ingress-controller as the loadbalancer and a simple Nginx webserver as the backend service. An ingress rule was created manually to map perf-nginx.cluster.ndslabs.org to the backend service.
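
For reference, the rule amounts to a simple host-to-service mapping along the lines of the sketch below (the service name perf-nginx and port 80 are assumptions; the actual rule was created by hand).

Code Block
# Sketch of the manually created ingress rule; service name/port are assumptions
cat <<EOF | kubectl create -f -
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: perf-nginx
spec:
  rules:
  - host: perf-nginx.cluster.ndslabs.org
    http:
      paths:
      - path: /
        backend:
          serviceName: perf-nginx
          servicePort: 80
EOF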

Load generation: boom

Use the boom load test generator to scale up concurrent requests from a Nebula large VM (8 VCPUs). The following script calls boom with an increasing number of concurrent requests (-c in 100:1000) while also increasing the total number of requests (-n in 1000:10000).

Code Block
for i in `seq 1 10`
do
   con=$((100*$i))
   req=$((1000*$i))
   echo "bin/boom -cpus 4 -n $req -c $con http://perf-nginx.iassist.ndslabs.org/"
   bin/boom -cpus 4 -n $req -c $con http://perf-nginx.iassist.ndslabs.org/
   sleep 1
done

Measuring latency and resource usage

Measuring latency: boom

The boom utility produces response time output including a summary of the average response time for each request, as well as the distribution of response times and latency.

Code Block
bin/boom -cpus 4 -n 8000 -c 800 http://perf-nginx.iassist.ndslabs.org/
Summary:
  Total:        3.4305 secs
  Slowest:      3.0162 secs
  Fastest:      0.0009 secs
  Average:      0.1335 secs
  Requests/sec: 2332.0068

Status code distribution:
  [200] 8000 responses

Response time histogram:
  0.001 [1]    |
  0.302 [7093] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.604 [371]  |∎∎
  0.906 [4]    |
  1.207 [471]  |∎∎
  1.509 [28]   |
  1.810 [0]    |
  2.112 [0]    |
  2.413 [0]    |
  2.715 [0]    |
  3.016 [32]   |

Latency distribution:
  10% in 0.0111 secs
  25% in 0.0183 secs
  50% in 0.0305 secs
  75% in 0.0554 secs
  90% in 0.3304 secs
  95% in 1.0200 secs
  99% in 1.0767 secs

Measuring latency: netperf

Netperf can be used to measure latency and throughput to services running inside Kubernetes.
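
A possible approach is sketched below: run a netserver pod inside the cluster and point netperf at its pod IP (the container image name is an assumption; the TCP_RR test reports request/response round trips per second, TCP_STREAM reports bulk throughput).

Code Block
# Start a netserver pod inside the cluster (image name is an assumption)
kubectl run netperf-server --image=networkstatic/netperf -- netserver -D

# Find the pod IP
kubectl get pods -l run=netperf-server -o wide

# From another pod or node, measure request/response latency and bulk throughput
netperf -H <pod-ip> -t TCP_RR -l 30
netperf -H <pod-ip> -t TCP_STREAM -l 30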

Measuring CPU/Memory/IO utilization

 

Results


 

Below is a plot of average response time with increasing concurrent requests (-n 1000 requests) and increasing numbers of backend replicas. Average response times increase as the number of concurrent requests increases, but still remain below 1 second. Adding more replicas does not have an apparent effect, suggesting that the response time is related to the ingress loadbalancer, not the backend service.

 

[Figure: average response time vs. concurrent requests, for varying numbers of backend replicas]
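
For reference, the backend replica count was varied with kubectl between runs; a sketch is below (the replication controller name perf-nginx is an assumption).

Code Block
# Scale the backend replication controller (name is an assumption)
kubectl scale rc perf-nginx --replicas=4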

 

Below is a plot of the latency distribution at the 25th, 50th, 75th, and 95th percentiles with increasing concurrent connections. Up to 1000 concurrent connections, 75% of requests have latency below 0.1 seconds. Starting at around 200 concurrent requests, the slowest 5% of requests show increasing latency, up to roughly 1 second.

[Figure: latency distribution (25th, 50th, 75th, 95th percentiles) vs. concurrent connections]

 

Measuring CPU/Memory utilization

Memory and CPU utilization were measured using pidstat. The nginx ingress controller has two worker processes in this test, labeled proc1 and proc2.
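
The per-process samples can be collected with something like the sketch below (the pgrep pattern used to find the worker PIDs is an assumption).

Code Block
# Report CPU (-u) and memory/page-fault (-r) statistics once per second
# for the two nginx worker processes (pgrep pattern is an assumption)
pidstat -u -r -p $(pgrep -d, -f "nginx: worker process") 1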

 

CPU utilization

The following table reports CPU utilization for each process during the boom test. %CPU peaks at 12%.

[Table: pidstat per-second CPU utilization (%usr, %system, %guest, %CPU) for proc1 and proc2 during the boom test]

 

Memory utilization

The following table reports memory utilization for each process during the boom test. %MEM remains relatively stable throughout the test.

 

[Table: pidstat per-second memory utilization (minflt/s, majflt/s, VSZ, RSS, %MEM) for proc1 and proc2 during the boom test; %MEM stays in the 0.36-0.51 range]

 

Scaling services

Large-file upload/download


Killing the loadbalancer

After running kubectl delete pod on the nginx-ilb pod, the running pod remains in a terminating state for ~30 seconds. During this time, the replication controller creates a new pod, but it stays in a pending state for the same ~30 second period. Some responses are still handled, but there is a risk of ~30 seconds of downtime between pod restarts. This may be related to the shutdown of the default-http-backend, but this isn't clear.
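
For reference, the test amounted to something like the sketch below (the pod name is a placeholder).

Code Block
# Delete the ingress loadbalancer pod and watch it get rescheduled
kubectl get pods | grep nginx-ilb
kubectl delete pod <nginx-ilb-pod-name>
kubectl get pods --watch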