Overview

TERRA-REF PI David LeBauer is teaching a two-week bootcamp for the NSF-funded "PI4 Program for Interdisciplinary and Industrial Internships at Illinois" [1]. The focus this year is the use and analysis of large and complex data. The course will make use of RStudio, Jupyter, and OpenRefine containers and associated tutorials for analysis of the TERRA-REF dataset, and is being co-taught by Neal Davis from Computer Science.

The original plan was to use the official TERRA-REF Workbench instance, but we decided that it didn't make sense for these students to have long-term access to the TERRA-REF system. Also, the TERRA allocation in Nebula doesn't have sufficient resources to scale.

Procedures

A few notes about monitoring the system:

  • Open Grafana (http://grafana.pi4.ndslabs.org/), log in as "admin" with the apiserver admin password. Keep an eye on Cluster CPU and memory usage.
  • ssh pi4-master1
    • watch "kubectl get pods --all-namespaces  -o wide | grep -v kube-system | grep -v default"
    • Look for restarts, Pending, and Terminating instances – these can be indicative of resource problems (see the sketch after this list).
    • We don't yet have good Kubernetes monitoring that can help us identify potential problems.
  • Enable full monitoring on Nagios server
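
For reference, the following is a minimal sketch of the ad-hoc checks run from pi4-master1. It is illustrative only – the grep patterns and the describe-nodes check are not part of any formal monitoring setup.

```
# Flag non-system pods that are not in a healthy state (refreshes every 2s).
watch 'kubectl get pods --all-namespaces -o wide | grep -v kube-system | grep -v default | grep -E "Pending|Terminating|Error|CrashLoopBackOff"'

# List non-system pods that have restarted at least once (RESTARTS is column 5).
kubectl get pods --all-namespaces | awk 'NR > 1 && $1 != "kube-system" && $5 > 0'

# Rough per-node picture of requested CPU/memory vs. capacity.
kubectl describe nodes | grep -A 5 "Allocated resources"
```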

Notes

The following are notes from active support of Workbench in this scenario. These notes will be used to inform future workshops and hackathons.

  • 5/22 
    • Provisioned new Workbench instance in NDS-Hackathon space on Nebula using the new "minimal" configuration (1 master, 2 compute nodes)
    • PI requested longer timeouts for services
  • 5/23 
    • Because course materials were available online, students began signing up almost immediately.
    • PI requested student accounts be imported from spreadsheet (this is something we also did for Phenome)
    • PI requested inclusion of OpenRefine container and BETYdb/PostGIS container (only recently developed for TERRA-REF project)
  • 5/24
    • Requested TERRA/ROGER NFS export to new Workbench instances. Required urgent attention, since TERRA's admin was on vacation and there wasn't a clear path to getting the filesystem exported.
  • 5/26
    • Bootcamp starts. PI established Slack organization and invited me.
    • Received Slack message reporting a problem with the OpenRefine application (cause: incorrect image pushed to Docker Hub).
    • 3 students had problems logging in (via Slack support channel)
  • 5/29: No bootcamp (holiday)
  • 5/30
    • See usage information below. First-day usage was apparently unproblematic.
  • 5/31
    • ~9:30 AM received Slack from bootcamp TA reporting problems launching PostgreSQL
      • Kubernetes logs indicated scheduling issues due to insufficient CPU.  Added 3rd node.
    • Students ran RStudio and PostgreSQL
    • 10:30 AM 
      • Received email from co-instructor with new spec for Hadoop container and request to add shared data.
    • ~1 PM received Slack from bootcamp TA concerning access problems. 
      • Appeared to be network problem with node1, tried moving loadbalancer to node3 – no success.
      • NCSA security had blackholed the loadbalancer due to unauthenticated OpenRefine containers.
      • Problem identification and resolution took ~1 hour (Security doesn't notify when this happens, so we have to figure it out from symptoms).
      • We added authentication to OpenRefine and instructed students to restart.
      • Noticed that 3rd node added in AM didn't have access to ROGER data for tutorial. Sent message to help+roger to update export (still no response).  In hindsight, resizing the node might have been more effective than adding a node, given the mount requirements.
      • Students reporting problems with applications not starting or stopping.  ~3% of containers in hung Pending or Terminating state.  Possibly due to problems caused by steps to troubleshoot/resolve blackholing.
      • Notified students that Workbench would go offline at ~8pm
    • 8pm
      • Rebooted node1
      • Noticed imported accounts do not have quotas, logged NDS-925
      • Looked at current usage information and deemed the current allocation sufficient for tomorrow
      • Added co-instructor data to /data mount on all nodes (copied).
  • 6/1 Jupyter tutorial
    • 9am:  
      • 30 users/namespaces
      • 67 active containers (non-system)
      • 24 of these were TERRA jupyter/netcdf or jupyter/plantcv
    • 10:30am:
      • Actively monitoring Grafana and "watch kubectl get pods --all-namespaces  | grep -v kube-system | grep -v default"
      • Memory utilization climbed to ~70GB of 96GB available. Notified PI to ask students to shutdown unused applications.
      • !! We don't really understand the numbers in Grafana.
      • Considering provisioning 4th node
      • Spoke with the Nebula team about re-sizing via OpenStack – which is apparently inadvisable or at least untested.
    • 1:30pm
      • So far quiet.  RAM usage hovering around ~70GB according to Grafana. CPU much lower (as usual)
      • Students often complain about a 503 error after startup, which goes away after ~10 seconds.
    • 2:30pm
      • PI requested the ability to run scripts in the PostGIS container, which is currently only available to admins
      • 50+ active non-system containers
      • Memory usage climbing again. Less than 500MB free on each node at this point.
      • Planning to deploy 4th node to extend resources for tomorrow/next week.  Need to resolve NFS mount problem.
    • 9:30pm
      • Unmounted ROGER shares (storage team will export to CIDR range)
      • Added 4th node to pi4 cluster (now running 20 VCPUs, 128GB RAM)
      • Changed shutdown timeout to 3 hours (was 24, but no services were shutting down)
  • 6/2
    • Added full Nagios monitoring (disk/load).  Previously just had basic UI/API monitor.
    • 9:15am
      • Students reporting problems accessing applications
      • I was able to repeat the problem – no errors in the log, and Kubernetes services and pods all looked OK, but the UI reported the instance as stopped even though it was starting.
      • Restarting the API server pod seems to have resolved it.
    • 10am
      • Re-mounted NFS mounts from ROGER on all four nodes
      • Student report of problem accessing Jupyter
      • Found a way to access private instances by adding my credentials to their htpasswd file in ILB container!
      • I'm able to access their Jupyter instance, but they aren't?
      • Student is able to access via Chrome.  Was using Safari 8.0 (100600.1.25) on MacOS 10.10 (Yosemite). This may be a bug in Jupyter.
    • 11:30 am
      • KDtree tutorial in Jupyter – very CPU intensive (seeing high load averages across all nodes, but particularly the new one).
      • In the future we'll need a better understanding of the anticipated workload so we can provision accordingly. In the past we've way-overprovisioned CPUs; for this workshop we're underprovisioned.
    • 6pm
      • Everything is/was very quiet this afternoon - nothing at all being reported from the instructors or users
      • So far, only a single restart on a single pgstudio container - everything else appears to be humming along 
  • 6/4
    • Added Analytics to track usage
  • 6/5
    • 4:30 am
      • Began getting Nagios alerts ~2:30am. It looks like node3 is in a bad state, but unclear why. This is the current loadbalancer.
      • Reporting I/O errors, with thousands of defunct "wc" processes.
      • The wlinz2 user's RStudio instance is in a bad state.
      • No choice but to reboot the instance.
      • As usual, kubelet wasn't started after the reboot.
      • Rebooted node2, for grins
    • 8:30am
      • Looking at Kibana (for the first time) and realizing that we really don't understand how to troubleshoot via central logs.
    • 9:30am
      • Student reported an RStudio initialization error. Fortunately, I'm also seeing the problem. The cause is a permissions issue with the stack's AppData directory. Manually running chmod a+w on the directory and restarting the service resolves it. However, it's unclear why this would happen, since all directories should be created with wide-open permissions.
    • 1pm
      • Co-instructor requested addition of Hadoop container.  Initial tests revealed problems running MR process within Workbench, but cause so far unclear.  Confirmed with instructor that they hadn't verified MR was working in Workbench, only basic hadoop commands.  Decided to use AWS EMR for class. We might consider how Hadoop/MR sandboxes fit in Workbench going forward.
  • 6/6
  • 6/7
    • 9:00am - 10am
      • TA reported problems with OpenRefine containers not showing the endpoint link. It looks like a spec change from port 3333 to 8080 was never pushed, so restarted instances had services trying to connect to 3333. Fixed the spec and manually fixed running instances.
      • Additional permissions problems, seemingly related to Monday's Nebula outage. New directories are being created with umask 0022, which means no write permissions for non-root users (e.g., openrefine, rstudio, jupyter). The API server entrypoint specifically starts with umask 0 to overcome this, so it's unclear why folders are being created with umask 0022 at this point. A possible fix is to change the default umask in /etc/profile or the root user's bashrc in the container; will try this tonight after class ends.
      • On a positive note, Grafana has been incredibly useful for monitoring usage – and adding Analytics will also help us understand more about user environments (browser/OS, etc.). See screenshot below. Hello Windows...
    • 10:30am
  • 6/8

    • 5:30am
      • Increased the Nginx body-size limit (client_max_body_size) to 50m and restarted the API server (seems to have resolved the permission-denied error). In the end, the permissions error was probably caused by mounting the TERRA data without restarting the API server.
    • 9:30am
      • Students using AWS EMR. The instructor has asked them to sign up for a free account (GitHub student account).
      • The process is certainly not straightforward; we shouldn't feel too bad. Excerpts from the class Slack channel:
        • do we sign up for company account or personal?
        •  If you get the github student account you get 110 h free.
        • there is a free tier
        • 9:30: want to lend me your cc info for this then? .... sure ...its 5555-5555-5555-5555
        • 10:00: Is anyone getting something like this:

          Error connecting to ec2-13-58-210-240.us-east-2.compute.amazonaws.com, reason:
          -> Connection timed out: no further information

          Did I do something wrong?

        • I got an error too

        • Failed to provision ec2 instances because 'The requested instance profile EMR_AutoScaling_DefaultRole is invalid'
        • 10:47: Has anyone been successful?
          • not me... terminated with errors
        • 11:00: Event with a severity of CRITICAL: "Amazon EMR Cluster j-1PMC42O8DPAHP (Justin's Cluster) has terminated at 2017-06-08 15:56 UTC with a reason of ." How informative
        • 11:24: Terminated with errors: Service role EMR_AutoScaling_DefaultRole has insufficient EC2 permissions
        • 11:30: Did anyone successfully do the word count problem?
  • 6/9: Last day
    • 10:50am:
      • TA report: a lot of people are using RStudio and it seems that it keeps crashing
      • Here are the issues they're reporting:
        1. Taking an unreasonable amount of time to run basic R commands and plot things
        2. Regular disconnects
        3. Basic commands in the shell take a long time to execute (I noticed this even with running things like a git status or git stash)

      • I was able to repeat the problem. Git clone was very slow in the container and on Gluster-mounted volumes.

      • Engaged Chris from Nebula team to help/advise with Gluster.
      • Tried several things
        • strace
        • Setting revocation locks
      • Finally, restarted the Gluster servers one at a time (a rough sketch of the procedure is included at the end of this section):
        • Kill pod 1, wait for it to start, then run gluster volume heal global and gluster volume heal global info. Wait for all files to heal.
        • Kill pod 2, wait for it to start, and run the same commands.
      • He noted possible configuration problems 
      • It looks like we may want to reconsider running separate dedicated gluster instances for intensive use
      • Problem finally solved by ~1pm.
      • Noting that load averages on instances running the Gluster server are much higher than on instances not running Gluster.
    • 2pm
      • Student sent Rmarkdown (below), reporting that "knitting" was failing.
      • I stepped through the code while watching "docker stats" for my container, and clearly they were exceeding the resource limits of 500m/2GB.
      • I upped the cores to 1000m and memory to 4GB, but this is likely causing a problem for other students as they work on their final projects.
    • 9pm
      • Workshop is over, most containers are shutdown, but still seeing load average warnings on node2
      • This points to a problem other than user traffic. The td-agent and java processes are related to the ELK and Grafana stacks.
      • Perhaps we should make sure Elasticsearch is not running on the GLFS server nodes?
      • This again speaks to a separate instance for the ELK stack.

top output captured during the load-average warnings:

top - 20:49:50 up 4 days, 16:09, 2 users, load average: 4.44, 5.29, 5.44
Tasks: 260 total, 1 running, 259 sleeping, 0 stopped, 0 zombie
%Cpu(s): 4.1 us, 10.7 sy, 0.0 ni, 65.0 id, 19.4 wa, 0.6 hi, 0.3 si, 0.0 st
KiB Mem: 32952392 total, 23527668 used, 9424724 free, 71540 buffers
KiB Swap: 0 total, 0 used, 0 free. 14416340 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
25841 root 20 0 301944 155728 0 S 27.5 0.5 4:57.66 td-agent
5876 1000 20 0 5432596 626948 94568 S 11.6 1.9 604:44.74 java
2003 root 20 0 2510056 99904 37432 S 4.0 0.3 383:27.95 kubelet
5877 1000 20 0 5432508 659784 143856 S 2.0 2.0 606:37.72 java
901 root 20 0 42784 32508 17992 S 1.0 0.1 114:54.32 kube-proxy
1093 root 20 0 2939924 102256 29648 S 0.7 0.3 75:08.41 dockerd
20368 root 20 0 2433332 103080 7956 S 0.7 0.3 52:01.33 glusterfsd
826 root 20 0 2034064 37380 9044 S 0.3 0.1 11:22.71 containerd
3314 root 20 0 1443120 276420 17344 S 0.3 0.8 40:40.12 node
3822 core 20 0 20156 2808 2212 R 0.3 0.0 0:00.04 top
8814 root 20 0 795120 273580 20112 S 0.3 0.8 5:22.40 influxd
16407 999 20 0 191380 10360 9140 S 0.3 0.0 0:06.74 rserver
20424 root 20 0 546376 33192 5344 S 0.3 0.1 0:01.42 glusterfs
26532 root 20 0 0 0 0 S 0.3 0.0 0:00.79 kworker/u8:0
32640 root 20 0 0 0 0 S 0.3 0.0 0:01.85 kworker/u8:2
1 root 20 0 133656 8360 5880 S 0.0 0.0 0:15.86 systemd

/var/log/fluentd.log


2017-06-10 01:30:37 +0000 [warn]: Could not push logs to Elasticsearch, resetting connection and trying again. read timeout reached
2017-06-10 01:30:41 +0000 [info]: Connection opened to Elasticsearch cluster => {:host=>"elasticsearch-logging", :port=>9200, :scheme=>"http"}
2017-06-10 01:31:11 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2017-06-10 01:29:35 +0000 error_class="Fluent::ElasticsearchOutput::ConnectionFailure" error="Could not push logs to Elasticsearch after 2 retries. read timeout reached" plugin_id="object:3fce62d15360"
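
For future workshops, the following is a minimal sketch of the rolling GlusterFS restart/heal procedure described in the 6/9 notes above. The pod names (glfs-server-0, glfs-server-1) are illustrative placeholders rather than the actual pod names; the volume name global is the one referenced above.

```
#!/bin/bash
# Sketch only: substitute the actual Gluster server pod names for the deployment.

heal_and_wait() {
  pod="$1"
  # Trigger a heal of the "global" volume, then poll heal info until no
  # brick reports outstanding entries.
  kubectl exec "$pod" -- gluster volume heal global
  while kubectl exec "$pod" -- gluster volume heal global info | grep -qE 'Number of entries: [1-9]'; do
    sleep 30
  done
}

for pod in glfs-server-0 glfs-server-1; do
  # Kill the server pod, wait for its replacement to report Running, then heal.
  kubectl delete pod "$pod"
  until kubectl get pod "$pod" 2>/dev/null | grep -q Running; do
    sleep 10
  done
  heal_and_wait "$pod"
done
```

If the server pods are managed by a replication controller rather than having stable names, the replacement pod's name would need to be looked up after each delete before running the heal.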

Usage

Screenshot of PI4 Grafana (5/25 - 6/6/2017)



Sample R project

```{r, echo = FALSE}
library(stringr)
library(dplyr)
bety_src <- src_postgres(dbname = "bety", 
                password = 'bety', 
                host = 'terra-bety.default', 
                user = 'bety', 
                port = 5432)
# to see all available columns in traits table
original_traits <- tbl(bety_src, 'traits') %>%
  collect(n=1)
# local version of variables for reference
variables_local <- tbl(bety_src, 'variables', n = Inf) %>%
  mutate(variable_id = id, variable_name = name) %>%
  dplyr::select(variable_id, variable_name, description, units) %>%
  collect()
traits <- tbl(bety_src, 'traits', n = Inf) %>%
  mutate(trait_id = id) %>%
  dplyr::select(trait_id, site_id, specie_id, cultivar_id, date, mean, variable_id, method_id, treatment_id, entity_id)
variables <- tbl(bety_src, 'variables', n = Inf) %>%
  mutate(variable_id = id, variable_name = name) %>%
  dplyr::select(variable_id, variable_name)
cultivars <- tbl(bety_src, 'cultivars', n = Inf) %>%
  mutate(cultivar_id = id, cultivar = name) %>%
  dplyr::select(cultivar_id, cultivar)
entities <- tbl(bety_src, 'entities', n = Inf) %>%
  mutate(entity_name = name, entity_id = id) %>%
  dplyr::select(entity_name, entity_id)
sites <- tbl(bety_src, 'sites', n = Inf) %>%
  mutate(site_id = id) %>%
  dplyr::select(site_id, city, state, country, notes, sitename)
treatments <- tbl(bety_src, 'treatments', n = Inf) %>%
  mutate(treatment_id = id, treatment_definition = definition, treatment_name = name) %>%
  dplyr::select(treatment_id, treatment_name, treatment_definition) 
joined_table <- traits %>%
  left_join(variables, by = 'variable_id') %>%
  left_join(cultivars, by = 'cultivar_id') %>%
  left_join(entities, by = 'entity_id') %>%
  left_join(sites, by = 'site_id') %>%
  left_join(treatments, by = 'treatment_id') %>%
  dplyr::select(trait_id, date, mean, variable_name, sitename, treatment_name, cultivar)
filtered_table <- filter(joined_table, variable_name %in% c("height", "canopy_cover", "canopy_height", "perimeter", "aboveground_dry_biomass", "leaf_length", "leaf_width", "plant_height", "aboveground_fresh_biomass", "growth_respiration_coefficient", "germination_score", "stem_diameter", "emergence_count")) %>%
  collect(n = Inf)
#Separating the filtered table into the outdoor and the indoor table
outdoortable = filtered_table %>%
  filter(str_detect(sitename, "Field Scanner"))
indoortable = filtered_table %>%
  filter(str_detect(sitename, "Danforth Plant Science Center")) 
```
#Outdoor Analysis
We first check if any of the variables in the outdoor table are missing values.
```{r}
sum(is.na(outdoortable$date))
sum(is.na(outdoortable$trait_id))
sum(is.na(outdoortable$mean))
sum(is.na(outdoortable$variable_name))
sum(is.na(outdoortable$treatment_name))
```
Since $treatment\_name$ is missing 697039 values, it is not considered in any of the modeling.
As $date$ is missing 7020 values, we create a new dataset, $outdata$, that excludes the rows with missing date entries.
```{r}
a = which(is.na(outdoortable$date))
outdata = outdoortable[-a,]
```
```{r}
model1 = lm(log(mean + 1) ~ variable_name + trait_id + date, data = outdata)
summary(model1) 
summary(model1)$adj.r.squared
```
$trait\_id$ is insignificant while $variable\_name$ and $date$ are significant. The $Adjusted\;R^2$ value is 0.9464145, indicating this model is pretty good.
Considering interaction terms, we get,
```{r}
model2 = lm(log(mean + 1) ~ variable_name * trait_id * date, data = outdata)
summary(model2) 
summary(model2)$adj.r.squared
```
This has a higher $Adjusted\;R^2$ value of 0.9592466.
```{r}
model1b = lm(log((mean + 1)^2) ~ variable_name + trait_id + date, data = outdata)
model2b = lm(log((mean + 1)^2) ~ variable_name * trait_id * date, data = outdata)
summary(model1b)$adj.r.squared
summary(model2b)$adj.r.squared
```
```{r}
plot(model1, which = c(2))
```
```{r}
plot(model2, which = c(2))
```
```{r}
plot(model1b, which = c(2))
```
The residuals of the above models do not follow a normal distribution.
```{r}
model3 = lm(log(mean + 1) ~ variable_name * date, data = outdata)
summary(model3) 
summary(model3)$adj.r.squared
```
```{r}
plot(model3, which = c(2))
```
#Indoor Analysis
```{r}
modeli1 = lm(mean ~ variable_name + trait_id + date + treatment_name, data = indoortable)
summary(modeli1)
```
```{r}
modeli2 = lm(mean ~ 1 + date * treatment_name, data = indoortable, subset = variable_name == 'perimeter')
plot(modeli2, which = c(2))
summary(modeli2)
modeli3 = lm(mean ~ 1 + date * treatment_name, data = indoortable, subset = variable_name =='plant_height')
plot(modeli3, which = c(2))
library(ggplot2) 
ggplot(indoortable, aes(date, mean, color = treatment_name))+
  geom_point() +
  geom_smooth(se = FALSE) +
  facet_wrap(~variable_name, scales = 'free')
```
```{r}
modeli2box = lm((((mean ^ 0.15) - 1) / 0.15) ~  1 + date * treatment_name, data = indoortable, subset = variable_name == 'perimeter')
plot(modeli2box, which = c(2))
summary(modeli2box)
```
```{r}
modeli3box = lm((((mean ^ 0.45) - 1) / 0.45) ~ 1 + date * treatment_name, data = indoortable, subset = variable_name =='plant_height')
plot(modeli3, which = c(2))
```
```{r}
total4 = which(cooks.distance(modeli2box) < 4 / length(cooks.distance(modeli2box)))
lev4 = which(hatvalues(modeli2box) > 2 * mean(hatvalues(modeli2box)))
indoordata4 = indoortable[-c(total4, lev4),]
modeli2newbox = lm((((mean ^ 0.15) - 1) / 0.15) ~  1 + date * treatment_name, data = indoordata4, subset = variable_name == 'perimeter')
plot(modeli2newbox, which = c(2))
```
```{r}
plot(modeli2, which = c(2))
```
So, we next try to remove the influential points.
#Remove influential points
##Outdoor
```{r}
total = which(cooks.distance(model2) < 4 / length(cooks.distance(model2)))
lev = which(hatvalues(model2) > 2 * mean(hatvalues(model2)))
outtotdata = outdata[-total,]
model2new = lm(log(mean + 1) ~ variable_name * trait_id * date, data = outtotdata)
plot(model2new, which = c(2))
```
```{r}
total2 = which(cooks.distance(model3) < 4 / length(cooks.distance(model3)))
lev2 = which(hatvalues(model3) > 2 * mean(hatvalues(model3)))
outtotdata2 = outdata[-c(total2, lev2),]
model3new = lm(mean ~ variable_name * date, data = outtotdata2)
plot(model3new, which = c(2))
```
```{r}
total3 = which(cooks.distance(modeli2) < 4 / length(cooks.distance(modeli2)))
lev3 = which(hatvalues(modeli2) > 2 * mean(hatvalues(modeli2)))
indoordata2 = indoortable[-total3 - lev3,]
modeli2new = lm(log(mean + 1) ~ variable_name * date * treatment_name, data = indoordata2)
plot(modeli2new, which = c(2))
plot(modeli2, which = c(2))
```
$model2$ and $modeli2new$ are the best models for the Outdoor and Indoor tables, respectively, because both have good $Adjusted\;R^2$ values and their QQ plots indicate that the residuals approximately follow a normal distribution.
