Comment: added "Other Considerations"

...

  1. Should an extractor type be managed by both a VM image and docker, or only one of them?
    A: Only one, for simplicity. This needs to be specified in the config file.
  2. Should the module support managing the extractors using VM images and using Docker containers at the same time?
    A: Yes.
  3. Should the docker machines be separate from the other machines, or on the same machines?
    A: Separate.

  4. Docker image storage: Docker Hub, or a private registry?
    A: Docker Hub for now. Setting up a private registry takes time, and a secure one requires getting a certificate from a CA. This can be done later when needed. Use ncsa/clowder-ocr, ncsa/clowder-python-base, ncsa/clowder-opencv-closeups, etc. for now. Done.

  5. How do we restart a docker container if the application crashed/stopped?
    A: docker run --restart=always ...
    This retries indefinitely, with a delay that doubles before each retry, starting from 100 ms (0.1 second), to avoid flooding the server. "--restart=on-failure[:max-retries]" could be used to limit the number of retries, but that could leave a container in the stopped state with no component to restart it. Usually a RabbitMQ server restart would cause an error, and the error was observed to persist for about 2 minutes.
  6. How do we scale up an application managed by docker?
    A: see below.
  7. How do we scale down?
    A: see below.
    1. Do we suspend and resume docker VMs, or always keep them running?
      A: We suspend and resume docker VMs, but keep at least 1.
  8. A manually created and updated VM image is used to start "Docker machines" or "Docker VMs" to host the docker containers. How do we start the Docker machines: bootstrap them externally, or start them using the elasticity module?
    A: Use the elasticity module. Add the docker VM image info to the config file, so the module knows how to start a new docker VM. Set a minimum number of 1; later on the scaling-up logic will start more as needed. This needs special handling: the dockerized extractors depend on a Docker machine existing, so if one does not already exist, a Docker machine needs to be started first.
  9. How do we detect idle extractors managed by docker?
    A: Same logic using the RabbitMQ API as before. After detection, perform docker-specific commands to stop the idle extractors.
  10. How do we detect idle docker machines if no container runs on them?
    A: A Docker machine itself does not make RabbitMQ connections. If no extractor runs on it, it won't show up in the extr->VM or VM->extr maps, so we need to maintain separate maps for the Docker machines. Find the Docker machines that have no extractors running on them and add them to the idle VM list.
  11. How do we specify the mapping between docker images and extractors?
    A: Add a [Docker] section in the config file, with a 1-to-1 mapping such as "extr1: dockerimg1". When starting the elasticity module, load the config file and check for errors: one extractor type should be managed by only one method, either docker or a VM image. If such configuration errors exist, print them out and use a default type such as docker. This default should also be a configurable item in the config file.
  12. Details of the Docker VM image?
    A: Ubuntu 14.04 base image + Docker installed.   
    In the config file [OpenStack Image Info] section:   
         docker-ubuntu-trusty = docker, m1.large, ubuntu, NCSA-Nebula, ''
    Use a larger flavor (4 or 8 CPUs), since one docker VM hosts multiple containers. Pull all needed docker images for the extractors in advance, for faster container start at run time. Nice to have: when starting the module, ensure that the docker images specified are valid and available, and pull them if they are not already present.
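
The config check described in item 11 could be sketched as below. This is only an illustration under assumptions: the [Docker] section name comes from the notes above, while the "[VM Images]" and "[Defaults]" section names, the "conflict_winner" key, and the function name are hypothetical and not part of the actual config format.

```python
import configparser

# Hypothetical sample; only the [Docker] section name comes from the design notes.
SAMPLE_CONFIG = """
[Docker]
ocr = ncsa/clowder-ocr
closeups = ncsa/clowder-opencv-closeups

[VM Images]
closeups = closeups-vm-image
face = face-vm-image

[Defaults]
conflict_winner = docker
"""


def load_extractor_mapping(text):
    """Return ({extractor: (method, image)}, [names listed under both methods])."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    docker = dict(parser["Docker"]) if parser.has_section("Docker") else {}
    vm = dict(parser["VM Images"]) if parser.has_section("VM Images") else {}
    # Which method wins when an extractor is listed twice (configurable).
    winner = parser.get("Defaults", "conflict_winner", fallback="docker")

    conflicts = sorted(set(docker) & set(vm))
    for name in conflicts:
        print(f"config error: extractor '{name}' is listed under both "
              f"[Docker] and [VM Images]; using '{winner}'")

    mapping = {name: ("vm", image) for name, image in vm.items()}
    for name, image in docker.items():
        if name in vm and winner != "docker":
            continue  # keep the VM image entry for this extractor
        mapping[name] = ("docker", image)
    return mapping, conflicts
```

Calling load_extractor_mapping(SAMPLE_CONFIG) reports "closeups" as a conflict and resolves it to the docker image, since the sample sets the default to docker.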
  • Algorithm / Logic

    • Assumptions:
  1. The module manages the extractors using both VM images and Docker containers at the same time.
  2. A manually created and updated VM image is used to start "Docker machines" or "Docker VMs".
  3. Dockerized extractors run only in the Docker machines; the extractors managed by VM images do not run in the Docker machines.
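
Given these assumptions, a Docker machine with no containers never appears in the maps built from RabbitMQ connections, so the module has to track Docker machines itself. A minimal sketch of that bookkeeping follows; the function names and the shape of the maps are assumptions for illustration, not the module's actual data structures.

```python
def find_idle_docker_machines(docker_machines, vm_to_extractors):
    """Return running Docker machines that host no extractor.

    docker_machines: names of running Docker machines the module started.
    vm_to_extractors: VM name -> list of extractors, built from RabbitMQ
        connections; an empty Docker machine is simply absent from it.
    """
    return sorted(m for m in docker_machines if not vm_to_extractors.get(m))


def pick_docker_machines_to_suspend(docker_machines, vm_to_extractors,
                                    min_running=1):
    """Choose idle Docker machines to suspend, keeping at least min_running up."""
    idle = find_idle_docker_machines(docker_machines, vm_to_extractors)
    max_suspend = max(0, len(docker_machines) - min_running)
    return idle[:max_suspend]
```

For example, with Docker machines {"docker-vm-1", "docker-vm-2"} and a map in which only "docker-vm-1" hosts an extractor, "docker-vm-2" is the one reported as idle and eligible for suspension.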

...

If there is no RabbitMQ activity on a VM in a configurable time period (say, 1 hour), then there is no work for the extractors on it to do, so we can suspend the VM to save resources for other tasks. However, if suspending this VM would decrease the number of running instances for any extractor that runs on it below the minimum number configured for that extractor type, we do not suspend it and will leave it running.
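
The suspend rule above can be sketched as a single check per VM. This is a sketch under assumptions: the function name, the argument names, and the way activity timestamps and instance counts are stored are hypothetical.

```python
import time
from collections import Counter


def can_suspend(vm, last_activity, vm_to_extractors, running_counts,
                min_counts, idle_seconds=3600, now=None):
    """Return True if vm has been idle long enough and suspending it keeps
    every extractor type on it at or above its configured minimum.

    last_activity: VM -> timestamp of its last observed RabbitMQ activity.
    vm_to_extractors: VM -> list of extractor types, one entry per instance.
    running_counts: extractor type -> number of running instances overall.
    min_counts: extractor type -> configured minimum number of instances.
    """
    now = time.time() if now is None else now
    if vm not in last_activity:
        return False  # no activity record: do not guess, leave it running
    if now - last_activity[vm] < idle_seconds:
        return False  # still within the configurable idle period

    for extractor, count in Counter(vm_to_extractors.get(vm, [])).items():
        remaining = running_counts.get(extractor, 0) - count
        if remaining < min_counts.get(extractor, 0):
            return False  # would drop below the minimum for this type
    return True
```

The check treats each VM independently. If several VMs are suspended in one pass, running_counts should be updated after each suspension so that later checks see the reduced totals.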

  • Other Considerations

...

  1. Threads:
    May need to use multiple threads. When scaling up, after resuming a suspended VM or starting a new docker machine, the module needs to go into it to start a new container for the application, so it has to block waiting for the resume or start to finish, and this may take a while.
    The VM suspension logic is suitable for a production environment. For a testing environment or a system that is not busy, it could suspend many or even all VMs, since there are few or no requests, and lead to a slow start: only the next time the check is done (say, every minute) will this module notice that the number of extractors for the queues is 0 and resume the VMs. We could make it configurable whether or not to keep at least one extractor running for each queue. It is a trade-off between potential waste of system resources and fast processing startup time.
  2. Remove stopped containers?
    Since starting a new container is fast, simplify the design of scaling down by removing the containers instead of leaving them around. A possible future improvement is to keep them stopped when scaling down and start the stopped ones when scaling up.
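
One way to handle the blocking wait described under "Threads" is to hand each scale-up to a worker thread, so the main checking loop keeps running. The sketch below uses Python's standard thread pool; the resume_vm and start_container callables are hypothetical placeholders for the real OpenStack and docker operations, and the fake_* functions exist only to make the sketch runnable.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def resume_then_start(vm, image, resume_vm, start_container):
    """Runs in a worker thread: wait for the VM, then start the container."""
    resume_vm(vm)  # assumed to block until the VM is usable
    return start_container(vm, image)


def submit_scale_up(pool, vm, image, resume_vm, start_container):
    """Queue one scale-up and return a Future without blocking the caller."""
    return pool.submit(resume_then_start, vm, image, resume_vm, start_container)


# Stand-ins for the real operations, for illustration only.
def fake_resume_vm(vm):
    time.sleep(0.05)  # pretend the resume takes a while


def fake_start_container(vm, image):
    return f"{image}@{vm}"
```

A caller would create one ThreadPoolExecutor at module start, call submit_scale_up for each needed container, and inspect the returned futures on later passes of the checking loop instead of waiting on them immediately.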
  • Programming Language and Application Type

...