  • Design Goal / Requirement

To support auto-scaling of system resources to adapt to the load of external requests to the Brown Dog Data Tiling Service. In general this covers Medici, MongoDB, RabbitMQ, and the extractors; currently the design focuses only on auto-scaling the extractors. Specifically, the system needs to start or use more extractors when certain criteria are met, such as the number of outstanding requests exceeding a threshold, and suspend or stop extractors when other criteria are met. The intent of scaling down is mainly to free resources (CPU, memory, etc.) for other purposes.

  • Investigated technologies
    • Olive (olivearchive.org): mainly developed at Carnegie Mellon University (CMU).

Three main ideas: Internet Suspend/Resume (allowing a user to start using a VM before it is fully downloaded), indexing and searching of VM images, and incrementally composing VM images. The last two have also been done in Docker, discussed below.

Impression: academic quality; its features and maturity do not seem a good fit for the Brown Dog project.

    • OpenVZ (openvz.org): a community project supported by Parallels, Inc. It is an OS-level virtualization technology based on the Linux kernel.

Impression: aimed at server consolidation and web hosting; seems to be production quality. Linux-only, so it does not seem a good fit as the high-level VM architecture technology.

    • Docker (docker.com): automates the deployment of applications inside software containers, providing abstraction and automation of operating system–level virtualization on Linux.

Impression: similar to OpenVZ, an OS-level virtualization technology: a container runs the same kernel as the host, so it does not seem a good fit as the high-level VM architecture technology. It offers OpenVZ-like functionality plus a distributed architecture for setting up repositories and pushing images to and pulling images from them. Popular and under active development.

    • OpenStack (openstack.org).

As a hardware virtualization technology, OpenStack supports multiple OSes, such as Linux and Windows.

    • Current choice: OpenStack.

The Brown Dog VM elasticity project needs to support multiple OSes, so OpenStack seems a viable high-level solution and is the current choice. Docker may be used on the VMs if needed.

  • Algorithm / Logic

Assumptions:

The following assumptions are made in the design:

  1. the extractor is installed as a service on a VM, so when a VM starts, all the extractors it contains as services start automatically;
  2. the limiting resource when extractors process input data is CPU, not memory, hard disk I/O, or network I/O, so the design scales on CPU usage only;
  3. we need to support multiple OS types, including both Linux and Windows;
  4. the entire Brown Dog system will use RabbitMQ as its messaging technology.

Algorithm:

This VM elasticity system / module maintains and uses the following data:

  1. RabbitMQ queue lengths and the number of consumers for each queue;
    Can be obtained using the RabbitMQ management API.
  2. for each queue, the corresponding extractor name;
    Currently hardcoded in the extractor code, so that queue name == extractor name.
  3. for a given extractor, the list of running VMs where an instance of the extractor is running, and the list of suspended VMs where it was running;
    Running VM list: can be obtained using the RabbitMQ management API: queue --> connections --> IP.
    Suspended VM list: when suspending a VM, update the mapping for the given extractor: remove the entry from the running VM list and add it to the suspended VM list.
  4. the number of vCPUs of each VM;
    This is fixed for a given OpenStack flavor. The flavor must be specified when starting a VM, and this data can be recorded at that time.
  5. the load averages of the VMs;
    For Linux, can be obtained by running a command ("uptime" or "cat /proc/loadavg") over ssh (a bit slow: the last test took 12 seconds from the devstack host to an Ubuntu machine, using an ssh public key).
  6. for a given extractor, the list of VM images where the extractor is available.
    This is manual, static data. It can be stored in a config file, a MongoDB collection, or elsewhere.

Of the above data, items 2, 4, and 6 are static (or nearly static); the others are dynamic and change at run time.
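A minimal sketch of how the dynamic items 1, 3, and 5 might be fetched, assuming the RabbitMQ management plugin runs at localhost:15672 with the default guest credentials and vhost, and that the VMs allow ssh public-key access; the response field names (consumer_details, peer_host) should be verified against the management API version in use:

```python
# Sketch: fetch queue stats, consumer IPs, and VM load averages.
import subprocess
import requests

MGMT = "http://localhost:15672/api"   # assumed management API endpoint
AUTH = ("guest", "guest")             # assumed credentials

def queue_stats():
    """Item 1: queue length and consumer count for every queue."""
    queues = requests.get(MGMT + "/queues", auth=AUTH).json()
    return {q["name"]: (q["messages"], q["consumers"]) for q in queues}

def running_vm_ips(queue_name):
    """Item 3, running list: queue --> connections --> IP (default vhost)."""
    q = requests.get(MGMT + "/queues/%2F/" + queue_name, auth=AUTH).json()
    return {c["channel_details"]["peer_host"]
            for c in q.get("consumer_details", [])}

def vm_load(ip):
    """Item 5: 1-minute load average of a Linux VM over ssh."""
    out = subprocess.check_output(
        ["ssh", "-o", "BatchMode=yes", ip, "cat", "/proc/loadavg"])
    return float(out.decode().split()[0])
```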

 

Periodically (at a configurable interval, such as every minute), the system checks whether it needs to scale up and whether it needs to scale down. The two checks can run in parallel, but then the system must protect and synchronize the shared data, such as the list of running VMs, as sketched below.
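A sketch of the periodic driver under that design; check_scale_up and check_scale_down are the two passes sketched in the following sections, and the interval and lock names are illustrative:

```python
# Sketch: run both checks in parallel at a configurable interval,
# guarding the shared running/suspended VM lists with a lock.
import threading
import time

CHECK_INTERVAL = 60                # seconds; configurable
vm_lists_lock = threading.Lock()   # protects the shared VM lists

def elasticity_loop():
    while True:
        up = threading.Thread(target=check_scale_up)
        down = threading.Thread(target=check_scale_down)
        up.start(); down.start()
        up.join(); down.join()
        time.sleep(CHECK_INTERVAL)
```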

Scaling up:

The criterion for scaling up is that a RabbitMQ queue length exceeds a pre-defined threshold, such as 100 or 1000.

Get the list of active queues and iterate through them (a code sketch follows this list):

  1. If the threshold is reached for a given queue, say q1, use data item 2 above to find the corresponding extractor, e1. Currently this mapping is hardcoded in the extractors, so that queue name == extractor name.
  2. Look up e1 to find its running VM list, say (vm1, vm2, vm3).
  3. Go through the list one by one. If a VM has an open slot, meaning its #vCPUs > loadavg + <cpu_buffer_room> (configurable, such as 0.5; for example, vm1 has #vCPUs == 2 and loadavg == 1.2), then start another instance of e1 on that VM and return. Otherwise look at the next VM in the list, returning as soon as an open slot is found and another e1 instance is started.
  4. If the entire list has no open slot, or the list is empty, look up e1's suspended VM list, say (vm4, vm5). If that list is not empty, resume its first VM and return.
  5. If the suspended VM list is also empty, we need to start a new VM to get more e1 instances: look up e1 to find a VM image that contains it and start a new VM from that image.
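A sketch of the scale-up pass tying these steps together. Here running_vms, suspended_vms, and extractor_images are the per-extractor tables from data items 3 and 6 (assumed to be in-memory dicts), and vm_vcpus(), start_extractor(), resume_vm(), and boot_vm_from_image() are hypothetical helpers wrapping the item 4 lookup, the extractor launch, and the OpenStack resume/boot calls:

```python
QUEUE_THRESHOLD = 100   # configurable, per the criterion above
CPU_BUFFER_ROOM = 0.5   # configurable

def check_scale_up():
    for queue, (length, consumers) in queue_stats().items():
        if length <= QUEUE_THRESHOLD:
            continue
        extractor = queue   # item 2: queue name == extractor name
        with vm_lists_lock:
            # Step 3: look for an open CPU slot on a running VM.
            for vm in running_vms[extractor]:
                if vm_vcpus(vm) > vm_load(vm) + CPU_BUFFER_ROOM:
                    start_extractor(extractor, vm)
                    break
            else:
                # Step 4: no open slot; resume a suspended VM if any.
                if suspended_vms[extractor]:
                    vm = suspended_vms[extractor].pop(0)
                    resume_vm(vm)
                    running_vms[extractor].append(vm)
                # Step 5: otherwise boot a new VM from an image
                # known to contain the extractor (item 6).
                elif extractor_images[extractor]:
                    boot_vm_from_image(extractor_images[extractor][0])
```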

Scaling down:

Get the list of IPs of the running VMs and iterate through them:

If the number of messages processed in the past time period (configurable, say 1 hour), summed across all extractors running on a given VM, is 0, the extractors on that VM have no work to do, and we can suspend the VM to free resources for other tasks (see the sketch below). Note that the threshold is 0: if any extractor on the VM has work to do, the VM keeps running.
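A sketch of the scale-down pass under the same assumptions; all_running_vms(), extractors_on(), and suspend_vm() are hypothetical helpers, and messages_handled() stands in for summing, per VM, the messages its extractor instances consumed over the look-back window (e.g. from RabbitMQ message statistics):

```python
IDLE_WINDOW = 3600   # seconds; configurable look-back period

def check_scale_down():
    with vm_lists_lock:
        for vm in list(all_running_vms()):
            # Suspend only if every extractor on the VM was idle.
            if messages_handled(vm, IDLE_WINDOW) == 0:
                suspend_vm(vm)
                # Item 3 bookkeeping: move the VM from the running
                # list to the suspended list for each of its extractors.
                for ext in extractors_on(vm):
                    running_vms[ext].remove(vm)
                    suspended_vms[ext].append(vm)
```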

In the future, we could improve this by migrating extractors: for example, if 4 extractors run on vm1, only extractor 1 has work to do, and vm2 (where extractor 1 is already running) has an open slot, we could migrate the extractor 1 instance from vm1 to vm2 and suspend vm1. This is considered lower priority.

  • Programming Language
  • Testing

How to test.

 
