
...

  1. RabbitMQ queue lengths and the number of consumers for the queues;
    Can be obtained using the RabbitMQ management API; the number of consumers can also be used to verify that an action to scale up/down succeeded. (See the management API sketch below.)
  2. for each queue, the corresponding extractor name;
    Currently hard coded in the extractor code, so that queue name == extractor name.
  3. for a given extractor, the list of running VMs where an instance of the extractor is running, and the list of suspended VMs where it was running;
    Running VM list: can be obtained using the RabbitMQ management API, queue --> connections --> IP.
    Suspended VM list: when suspending a VM, update the mapping for the given extractor: remove the entry from the running VM list and add it to the suspended VM list.
    Also maintain the reverse mapping from a running/suspended VM to the extractors it contains. This is useful in the scaling-up part.
  4. the number of vCPUs of the VMs;
    This info is fixed for a given OpenStack flavor. The flavor must be specified when starting a VM, so this data can be recorded at that time.
  5. the load averages of the VMs;
    For Linux, this can be obtained by executing a command ("uptime" or "cat /proc/loadavg") over ssh; in practice this returns quickly, in under 1 second and usually around 0.3 seconds. If needed, a separate thread can collect this data instead of doing it inline in the execution flow. (See the ssh sketch below.)
  6. for a given extractor, the list of VM images where the extractor is available, and the full command line (including arguments) to start another extractor instance, i.e., a pair of (VM image name, command line). The command line is needed only for starting additional extractor instances, since the first instance of that extractor is started automatically as a service.
    Also maintain the reverse mapping from a given VM image to the extractors it contains. This is useful in the scaling-up part.
    This is manual, static data. It can be stored in a config file, a MongoDB collection, or elsewhere. (A sample layout is sketched after this list.)
  7. the last time a request was processed by each VM.
    Can be obtained using the RabbitMQ management API: /api/channels/, fields "idle_since" and "peer_host". The channels with the same peer_host IP need to be aggregated, and the ones on localhost skipped. This info is used in the scaling-down part. (See the management API sketch below.)
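
As a rough illustration of the static data in items 2, 4 and 6, the mappings could be kept in a small configuration structure like the sketch below. All names and values (extractor names, image names, flavors, paths) are made up for illustration; the real data could equally live in a config file or a MongoDB collection.

```python
# Hypothetical static configuration; all names and values are illustrative only.

# Item 2: queue name == extractor name (currently hard-coded in the extractors),
# so an identity mapping is sufficient for now.
QUEUE_TO_EXTRACTOR = {
    "example.extractor.ocr": "example.extractor.ocr",
    "example.extractor.metadata": "example.extractor.metadata",
}

# Item 4: number of vCPUs per OpenStack flavor, recorded when a VM is started.
FLAVOR_VCPUS = {
    "m1.small": 1,
    "m1.medium": 2,
    "m1.large": 4,
}

# Item 6: for each extractor, the VM images that contain it and the full command
# line used to start an *additional* instance (the first instance is started
# automatically as a service when the VM boots).
EXTRACTOR_IMAGES = {
    "example.extractor.ocr": [
        ("extractor-image-1",
         "python /opt/extractors/ocr/ocr.py --config /etc/extractors/ocr.conf"),
    ],
}

# Reverse mapping from a VM image to the extractors it contains, used when a
# new VM is started and the other extractors on it should be skipped.
IMAGE_TO_EXTRACTORS = {
    "extractor-image-1": ["example.extractor.ocr"],
}
```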

In the above data, items 2, 4 and 6 are static (or nearly static); the others are dynamic and change at run time.
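
For the dynamic data in items 1 and 7, a minimal sketch of the queries against the RabbitMQ management API might look like the following. The host, port and credentials are placeholders, and error handling is omitted; the field names (messages, consumers, idle_since, connection_details.peer_host) are the ones exposed by the management API's /api/queues and /api/channels endpoints.

```python
import requests

RABBITMQ_API = "http://rabbitmq.example.org:15672/api"   # placeholder host/port
AUTH = ("guest", "guest")                                 # placeholder credentials


def get_queue_stats():
    """Item 1: message count and consumer count for each queue."""
    resp = requests.get(RABBITMQ_API + "/queues", auth=AUTH)
    resp.raise_for_status()
    return {
        q["name"]: {"messages": q.get("messages", 0),
                    "consumers": q.get("consumers", 0)}
        for q in resp.json()
    }


def get_last_activity_by_vm(local_ips):
    """Item 7: most recent activity per VM IP, aggregated over channels.

    Returns {vm_ip: idle_since_string_or_None}; None means at least one
    channel from that VM is currently busy (no 'idle_since' reported).
    Channels on the local host are skipped.
    """
    resp = requests.get(RABBITMQ_API + "/channels", auth=AUTH)
    resp.raise_for_status()
    last_seen = {}
    for ch in resp.json():
        ip = ch.get("connection_details", {}).get("peer_host")
        if ip is None or ip in local_ips:
            continue
        idle_since = ch.get("idle_since")        # absent while the channel is busy
        if ip not in last_seen:
            last_seen[ip] = idle_since
        elif idle_since is None or last_seen[ip] is None:
            last_seen[ip] = None                 # "busy now" wins over any timestamp
        elif idle_since > last_seen[ip]:         # "YYYY-MM-DD hh:mm:ss" sorts as text
            last_seen[ip] = idle_since
    return last_seen
```

The per-VM timestamps returned by get_last_activity_by_vm() are reused in the scaling-down sketch further below.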
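For item 5, a minimal sketch of reading the load average over ssh, assuming key-based ssh access to the VMs (the user name and timeouts are placeholders):

```python
import subprocess


def get_loadavg(ip, user="ubuntu", timeout_sec=5):
    """Item 5: return the 1-minute load average of the VM at `ip`, or None on failure."""
    cmd = ["ssh",
           "-o", "StrictHostKeyChecking=no",
           "-o", "ConnectTimeout=%d" % timeout_sec,
           "%s@%s" % (user, ip),
           "cat /proc/loadavg"]
    try:
        out = subprocess.check_output(cmd, timeout=timeout_sec + 5)
    except (subprocess.SubprocessError, OSError):
        return None
    # /proc/loadavg looks like: "1.20 0.80 0.65 2/345 6789"
    return float(out.decode().split()[0])
```

If doing this inline for many VMs turns out to be too slow, the same call can be issued from a separate thread or a small thread pool, as noted above.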

...

The criterion for scaling up – that is, starting more extractors for a queue – is:

  1. the length of the RabbitMQ queue

...

  1. > pre-defined thresholds, such as 100 or 1000, or
  2. the number of consumers (extractors) for this queue is 0.
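
A small, hypothetical helper capturing this criterion (the threshold value is configurable, and get_queue_stats() refers to the management API sketch above):

```python
QUEUE_LENGTH_THRESHOLD = 100   # pre-defined threshold, e.g. 100 or 1000


def needs_scale_up(stats):
    """True if the queue needs more extractor instances.

    `stats` is one entry from get_queue_stats(): a dict with the 'messages'
    and 'consumers' counts for a single queue.
    """
    return stats["messages"] > QUEUE_LENGTH_THRESHOLD or stats["consumers"] == 0
```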

 

Get the list of running queues, and iterate through them (a sketch of this whole pass follows the list):

  1. If the criterion is met for a given queue, say q1, then use data item 2 above to find the corresponding extractor (e1). Currently this is hard-coded in the extractors, so that queue name == extractor name.
  2. Look up e1 to find the corresponding running VM list, say, (vm1, vm2, vm3).
  3. Go through the list one by one. If there's an open slot on the VM – meaning its #vCPUs > loadavg + <cpu_buffer_room> (configurable, such as 0.5); for example, vm1 has #vCPUs == 2 and loadavg == 1.2 – then start another instance of e1 on vm1 and return. If there's no open slot on vm1, look at the next VM in the list, and return if an open slot is found and another instance of e1 is started.
  4. If we go through the entire list and there's no open slot, or the list is empty, then look up e1 to find the corresponding suspended VM list, say, (vm4, vm5). If the list is not empty, resume the first VM in the list. After this, look up the other extractors running on that VM and mark them so that this scaling logic skips them, since resuming the VM also resumes those extractors. Return.
  5. If the above suspended VM list is empty, then we need to start a new VM to get more e1 instances. Look up e1 to find a VM image that contains it, and start a new VM using that image. Similar to the previous step, look up the other extractors available in that image and mark them so that this scaling logic skips them, since starting the new VM also starts them.

At the end of the above iterations, we could verify whether the expected increase in the number of extractors actually occurred, and log the result.
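
Putting the above steps together, one scale-up pass might look roughly like the sketch below. The helper functions (start_extractor_on_vm, resume_vm, start_vm_from_image, extractors_on_vm, extractors_in_image) and the exact shapes of the maps are assumptions standing in for the data items and operations described earlier, not an existing API.

```python
CPU_BUFFER_ROOM = 0.5   # configurable head-room, the <cpu_buffer_room> above


def scale_up_pass(queue_stats, running_vms, suspended_vms, vm_vcpus, extractor_images):
    """One pass over all queues, following steps 1-5 above.

    queue_stats:      {queue_name: {'messages': int, 'consumers': int}}
    running_vms:      {extractor_name: [vm_ip, ...]}                        (data item 3)
    suspended_vms:    {extractor_name: [vm_id, ...]}                        (data item 3)
    vm_vcpus:         {vm_ip: number_of_vcpus}                              (data item 4)
    extractor_images: {extractor_name: [(image_name, command_line), ...]}   (data item 6)
    """
    skipped = set()   # extractors already covered by a VM resumed/started in this pass

    for queue_name, stats in queue_stats.items():
        extractor = queue_name                    # item 2: queue name == extractor name
        if extractor in skipped or not needs_scale_up(stats):
            continue

        # Step 3: look for an open slot on a VM that already runs this extractor.
        found_slot = False
        for ip in running_vms.get(extractor, []):
            load = get_loadavg(ip)
            if load is not None and vm_vcpus[ip] > load + CPU_BUFFER_ROOM:
                start_extractor_on_vm(ip, extractor)   # placeholder: run one more instance
                found_slot = True
                break
        if found_slot:
            continue

        # Step 4: no open slot; resume a suspended VM that contains this extractor.
        if suspended_vms.get(extractor):
            vm_id = suspended_vms[extractor][0]
            resume_vm(vm_id)                           # placeholder: OpenStack resume
            skipped.update(extractors_on_vm(vm_id))    # they come back too; skip them
            continue

        # Step 5: nothing to resume; start a new VM from an image that contains it.
        if extractor_images.get(extractor):
            image_name, _command = extractor_images[extractor][0]
            start_vm_from_image(image_name)            # placeholder: OpenStack boot
            skipped.update(extractors_in_image(image_name))
```

The verification mentioned above could then compare the per-queue consumer counts reported by the management API before and after the pass.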

Scaling down:

Get the list of IPs of the running VMs. Iterate through them:

If the number of messages processed in the past time period (configurable, say, 1 hour) is 0 for a given VM, summed across all extractors running on it, then there is no work for those extractors to do, and we can suspend the VM to free resources for other tasks. Note that the threshold is 0: if there is any work for any extractor running on it, we keep the VM running. (A sketch of this check follows.)
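
A possible sketch of this check, using the per-VM "idle_since" timestamps from data item 7 as a proxy for "no messages processed in the period" (the suspend call is a placeholder for the corresponding OpenStack operation, and RabbitMQ reports "idle_since" roughly in the form "2014-02-11 10:42:50"):

```python
from datetime import datetime, timedelta

IDLE_PERIOD = timedelta(hours=1)   # configurable time period


def scale_down_pass(last_seen, now=None):
    """Suspend VMs whose extractors have all been idle for longer than IDLE_PERIOD.

    `last_seen` is the output of get_last_activity_by_vm(): {vm_ip: idle_since or None},
    where None means at least one channel from that VM is busy right now.
    """
    now = now or datetime.utcnow()
    for ip, idle_since in last_seen.items():
        if idle_since is None:
            continue                               # some extractor is working; keep the VM
        ts = datetime.strptime(idle_since, "%Y-%m-%d %H:%M:%S")
        if now - ts > IDLE_PERIOD:
            suspend_vm(ip)                         # placeholder: OpenStack suspend
```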

Notes:

  1. This logic is suitable for a production environment. For a testing environment, or a system that is not busy, it could suspend many or even all VMs, since there are few or no requests, and lead to a slow start – only the next time the check runs (say, every minute) will this module notice that the number of consumers for a queue is 0 and resume the VMs. We could make it configurable whether to keep at least one extractor running for each queue – it's a balance between potential waste of system resources and fast processing start-up time.
  2. In the future, we could improve this by migrating extractors. For example, if 4 extractors run on vm1, only extractor 1 has work to do, and there is an open slot to run extractor 1 on vm2 (where extractor 1 is already running), we could migrate the extractor 1 instance from vm1 to vm2 and suspend vm1 – but this is considered lower priority.
  • Programming Language and Application Type

...