
...

Should an extractor type be managed by both a VM image and docker, or only one of them?
A: only one, for simplicity. So we need to specify this piece of information in the config file.
Docker image storage: docker hub, or a private registry?
A: go with the docker hub for now. Setting up a private registry takes time and a secure one requires getting a certificate from a CA. Can do it later when needed. Use ncsa/clowder-ocr, ncsa/clowder-python-base, ncsa/clowder-opencv-closeups, etc. for now. Done.
How do we restart a docker container if the application crashed/stopped?
A: docker run --restart=always ...
This will retry indefinitely, but with a delay that doubles before each retry, starting from 100 ms (0.1 second), to avoid flooding the server. Can also consider using "--restart=on-failure[:max-retries]" to limit the number of retries, but then that could leave a container in the stopped state, without any component to restart it. Usually a RabbitMQ server restart would cause an error, and the error was observed to persist for about 2 minutes.
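To get a feel for how the doubling delay interacts with the roughly 2-minute RabbitMQ error window, here is a small sketch. It models only the schedule described above (100 ms, doubling each time) and assumes no upper cap on the delay. The function names are made up for illustration and are not part of docker.

```python
def restart_delays_ms(retries, first_delay_ms=100):
    """Delay (in ms) before each of the first `retries` restarts, doubling each time."""
    return [first_delay_ms * 2 ** i for i in range(retries)]


def restarts_to_outlast(outage_ms, first_delay_ms=100):
    """Count restarts until the summed delays cover an outage of `outage_ms`.

    Assumption: the container fails immediately on every restart while the
    outage lasts, so only the delays contribute to elapsed time.
    """
    waited, delay, attempts = 0, first_delay_ms, 0
    while waited < outage_ms:
        waited += delay
        delay *= 2
        attempts += 1
    return attempts
```

Under these assumptions, `restarts_to_outlast(120_000)` gives a rough lower bound to compare against when choosing a value for max-retries.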
How do we scale up an application managed by docker?
A: see below.
How do we scale down?
A: see below.
Do we suspend and resume docker VMs, or always keep them running?
A: We suspend and resume docker VMs, but keep at least 1.
We need to run some VMs exclusively to host the docker containers. How do we start them: bootstrap them externally, or start them using the elasticity module?
A: need to add the docker VM image info in the config file, so the module knows how to start a new docker VM. Can start one when the elasticity module starts. Later on, start more as needed.
How do we detect idle extractors managed by docker?
A: Same logic using the RabbitMQ API as before. After detection, perform docker-specific commands to stop the idle extractors.
How do we detect idle docker machines if no container runs on them?
A: add a data structure for docker machines. Find docker VMs that have no extractors running on them, add them to the idle machine list, or somehow signal that they can be suspended.
How do we specify the mapping of docker images to extractors?
A: add a [Docker] section in the config file, with line items such as: "extr1: dockerimg1". When starting the elasticity module, load the config file and check for errors: one extractor type should be managed by only one method, either docker or a VM image. If such configuration errors exist, print them out and use a default type such as docker; also make this choice a configurable item in the config file.
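One possible shape for the docker machine data structure, as a minimal sketch. The class and field names are hypothetical. The `keep_at_least` parameter reflects the "keep at least 1" answer above. How the container counts get filled in (for example, by querying each docker VM) is left out.

```python
from dataclasses import dataclass, field


@dataclass
class DockerMachine:
    name: str
    # extractor name -> number of containers of that extractor running here
    containers: dict = field(default_factory=dict)

    def is_idle(self):
        """A docker VM is idle when no extractor container runs on it."""
        return sum(self.containers.values()) == 0


def machines_to_suspend(machines, keep_at_least=1):
    """Return idle docker VMs that can be suspended.

    If fewer than `keep_at_least` VMs are busy, the first idle VMs in the
    list are held back so that many docker VMs stay running in total.
    """
    idle = [m for m in machines if m.is_idle()]
    busy_count = len(machines) - len(idle)
    held_back = max(0, keep_at_least - busy_count)
    return idle[held_back:]
```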
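A sketch of how the config check could look, using Python's standard configparser. Only the [Docker] section and the "extr1: dockerimg1" line format come from the answer above. The [VMImages] and [Elasticity] section names, the `default_method` key, and the function name are assumptions made up for this example.

```python
import configparser

SAMPLE_CONFIG = """
[Elasticity]
default_method: docker

[Docker]
extr1: dockerimg1
extr2: dockerimg2

[VMImages]
extr2: vmimg2
extr3: vmimg3
"""


def load_extractor_methods(text):
    """Map each extractor to (method, image) and collect config errors."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep extractor names case-sensitive
    parser.read_string(text)

    docker = dict(parser["Docker"]) if parser.has_section("Docker") else {}
    vm_images = dict(parser["VMImages"]) if parser.has_section("VMImages") else {}
    default = parser.get("Elasticity", "default_method", fallback="docker")
    if default not in ("docker", "vm"):
        raise ValueError(f"unknown default_method: {default}")

    methods, errors = {}, []
    for name in sorted(set(docker) | set(vm_images)):
        if name in docker and name in vm_images:
            # Conflict: report it and fall back to the configured default.
            errors.append(f"{name} is in both [Docker] and [VMImages]; using {default}")
            method = default
        else:
            method = "docker" if name in docker else "vm"
        image = docker[name] if method == "docker" else vm_images[name]
        methods[name] = (method, image)
    return methods, errors
```

With the sample above, extr2 is listed under both sections, so it is reported as an error and handled by the default method.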

Algorithm / Logic

Assumptions:

...

If the above criterion is reached for a given queue, say, q1, then use the data item 2 above, find the corresponding extractor (e1). Currently this is hardcoded in the extractors, so that queue name == extractor name.
Look up e1 to find the corresponding running VM list, say, (vm1, vm2, vm3).
Go through the list one by one. If there's an open slot in the VM, meaning its #vCPUs > loadavg + <cpu_buffer_room> (configurable, such as 0.5), for example, vm1 #vCPUs == 2, loadavg = 1.2, then start another instance of e1 on vm1. If there's no open slot on vm1, look at the next VM in the list. Once an open slot is found and another instance of e1 is started, finish working on this queue and go back to Step 1 for the next queue.
If we go through the entire list and there's no open slot, or the list is empty, then look up e1 to find the corresponding suspended VM list, say, (vm4, vm5). If the list is not empty, resume the first VM in the list. If unsuccessful, go to the next VM in the list. After a successful resumption, look up and find the other extractors running in the VM, and set a mark for them so that this scaling logic will skip these other extractors, as resuming this VM would also resume them. Finish working on this queue and go back to Step 1 for the next queue.
If the above suspended VM list is empty, then we need to start a new VM to have more e1 instances. Look up e1 to find a VM image that contains it. Start a new VM using that image. Similar to the above step, after this, look up and find the other extractors available in the VM, and set a mark for them so that this scaling logic will skip these other extractors, as starting this VM would also start them.
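The scale-up decision in the steps above can be sketched as a pure function. This is only an illustration under assumptions: the `VM` class, the action names, and the function signatures are hypothetical. Actually starting extractors, resuming VMs (including retrying the next VM on failure), and marking the other extractors as handled are left to the caller.

```python
from dataclasses import dataclass


@dataclass
class VM:
    name: str
    vcpus: int
    loadavg: float


def has_open_slot(vm, cpu_buffer_room=0.5):
    """Open slot test from the text: #vCPUs > loadavg + cpu_buffer_room."""
    return vm.vcpus > vm.loadavg + cpu_buffer_room


def plan_scale_up(running, suspended, image, cpu_buffer_room=0.5):
    """Decide how to add one more instance of an extractor.

    Returns one of:
      ("start_extractor", vm_name)  - a running VM has an open slot
      ("resume_vm", [vm_names])     - suspended VMs to try, in order
      ("start_vm", image)           - no running or suspended VM can help
    """
    for vm in running:
        if has_open_slot(vm, cpu_buffer_room):
            return ("start_extractor", vm.name)
    if suspended:
        return ("resume_vm", [vm.name for vm in suspended])
    return ("start_vm", image)
```

With the example from the text (vm1 with 2 vCPUs and loadavg 1.2, buffer 0.5), the test is 2 > 1.7, so vm1 has an open slot.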

...

This logic is suitable for a production environment. For a testing environment or a system that's not busy, this logic could suspend many or even all VMs since there are few or no requests, and lead to a slow start: only at the next check (say, every minute) will this module notice that the number of extractors for the queues is 0 and resume the VMs. We could make it configurable whether or not to maintain at least one extractor running for each queue; it's a balance of potential waste of system resources vs. fast processing startup time.
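The configurable "keep at least one extractor per queue" option could be a simple guard in the scale-down path. A minimal sketch; the function and parameter names are hypothetical:

```python
def stoppable_instances(running, idle, min_per_queue=1):
    """How many idle extractor instances may be stopped for one queue.

    `min_per_queue` = 1 keeps one extractor up for fast startup;
    0 allows scaling the queue down to nothing to save resources.
    """
    return max(0, min(idle, running - min_per_queue))
```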
In the future, we could support migrating extractors. For example, if 4 extractors run on vm1, and only extractor 1 has work to do, and there is an open slot to run extractor 1 on vm2 (where extractor 1 is already running), we could migrate the extractor 1 instance from vm1 to vm2, and suspend vm1 – but this is considered lower priority.

...

Testing
Input
Use a script to generate high request rates for the OCR and OpenCV extractors to test the scaling up part. Stop sending the requests to test the scaling down part.
Output
Use the OpenStack CLI / web UI for the VM part, RabbitMQ web UI and SSH into the docker machines for the extractor part, to verify that the docker containers are started/stopped, and docker VMs are started/resumed/suspended as expected.

...
