There are several parallel efforts to capture information about Clowder metrics:
- User activity (backend)
  - resources created or uploaded
  - bytes added or removed
  - extractors triggered & runtime
- User activity (frontend)
  - page views
- System health
  - response time
The goal is to minimize the number of moving parts needed to capture and store this data. Below is a summary of our discussion from 12/7.
RabbitMQ Queue & Flask API
Use a queue to store datapoints. This would not be an extractor queue, but a new, dedicated system queue.
- Clowder can write messages directly to RabbitMQ
- A lightweight Flask API, run in a Python container, also connects to RabbitMQ
- Other code can post datapoints to this API, which forwards them to RabbitMQ
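The forwarding step of such an API could look roughly like the sketch below. It is framework-agnostic on purpose: `publish` is a placeholder for whatever call actually writes to RabbitMQ, and the queue name `clowder.metrics` and the required field names are assumptions for illustration, not decisions from the discussion.

```python
import json
from typing import Callable, Tuple

# Hypothetical name for the new system queue; the notes do not fix one.
METRICS_QUEUE = "clowder.metrics"

# Assumed minimum fields on every datapoint (user and timestamp always captured).
REQUIRED_FIELDS = ("component", "event_type", "user", "timestamp")


def forward_datapoint(raw_body: bytes,
                      publish: Callable[[str, bytes], None]) -> Tuple[int, dict]:
    """Validate one posted datapoint and hand it to the queue publisher.

    Returns an (HTTP status, response body) pair for the web layer to send.
    """
    try:
        point = json.loads(raw_body)
    except ValueError:  # covers malformed JSON and undecodable bytes
        return 400, {"error": "body is not valid JSON"}
    if not isinstance(point, dict):
        return 400, {"error": "datapoint must be a JSON object"}
    missing = [name for name in REQUIRED_FIELDS if name not in point]
    if missing:
        return 400, {"error": "missing fields: " + ", ".join(missing)}
    # Re-serialize so the queue only ever receives well-formed JSON.
    publish(METRICS_QUEUE, json.dumps(point, sort_keys=True).encode("utf-8"))
    return 202, {"status": "queued"}
```

A Flask route would pass the request body and a function wrapping the RabbitMQ client's publish call, then return the status and body it gets back. Keeping the publisher injectable also lets the validation be tested without a running broker.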
Internal Clowder events service
For user activity (basically Max's reporting part and Mike's clickstream work), Clowder can call an internal RabbitMQ service to generate datapoints for the events we want to capture.
Clowder health monitor(s)
Bing's external monitor can't rely on Clowder to record its datapoints, because it has to operate even when Clowder is down. Instead, the monitors in different regions can collect their datapoints and post them to the Flask API, which bypasses Clowder and writes to RabbitMQ directly.
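As a rough illustration of a monitor reporting to the Flask API, the sketch below builds the POST for one health ping using only the standard library. The endpoint URL, the `monitor-<region>` user naming, and the payload field names are all assumptions; the notes do not specify them.

```python
import json
import time
import urllib.request

# Hypothetical endpoint of the Flask API; the real URL is not given in the notes.
METRICS_API_URL = "http://localhost:5000/datapoints"


def build_health_request(region: str, response_time_ms: float,
                         queue_length: int,
                         api_url: str = METRICS_API_URL) -> urllib.request.Request:
    """Build (but do not send) the POST that reports one health ping."""
    point = {
        "component": "health",
        "event_type": "ping update",
        "user": "monitor-" + region,  # assumed naming for monitor identities
        "timestamp": time.time(),     # seconds since the epoch
        "data": {"response_time_ms": response_time_ms,
                 "queue_length": queue_length},
    }
    return urllib.request.Request(
        api_url,
        data=json.dumps(point).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_health(request: urllib.request.Request, timeout: float = 5.0) -> int:
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Because the request goes to the Flask API rather than to Clowder, a monitor can still report a failed or slow ping while Clowder itself is unreachable.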
Finally, we need a service that pulls the messages from RabbitMQ and writes them into a database, whether that is MongoDB, InfluxDB, or something else. These services could perhaps register with Clowder the way extractors do, so that each gets its own queue and several can log to different destinations at once.
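The one-queue-per-writer idea can be sketched in memory as follows. This is only a stand-in for the routing behavior: with RabbitMQ the copying would be done by the broker, and the class, method, and writer names here are hypothetical.

```python
import json
from collections import deque
from typing import Callable, Deque, Dict


class FanoutRegistry:
    """In-memory stand-in for one-queue-per-writer message routing."""

    def __init__(self) -> None:
        self._queues: Dict[str, Deque[bytes]] = {}

    def register(self, writer_name: str) -> None:
        # Each registered writer gets its own queue.
        self._queues.setdefault(writer_name, deque())

    def publish(self, body: bytes) -> None:
        # Every registered queue receives its own copy of the message.
        for queue in self._queues.values():
            queue.append(body)

    def drain(self, writer_name: str, write: Callable[[dict], None]) -> int:
        """Pass each pending message for one writer to `write`; return the count."""
        queue = self._queues[writer_name]
        written = 0
        while queue:
            # Remove the message only after the write succeeds, so a failing
            # destination keeps its backlog instead of losing datapoints.
            write(json.loads(queue[0]))
            queue.popleft()
            written += 1
        return written
```

With separate queues, a slow or failing destination only delays its own backlog; a writer for another database keeps draining its queue independently.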
Let's consider some different types of events. Assume the user and timestamp are also captured for every event.
|component||event type||data captured||notes|
| || ||fileid, datasetid, spaceid, bytes|| |
|extractions||extraction event||message, type (queued or working)||do we care about data traffic downloaded to the extractor containers?|
|health||ping update||response time, queue length, other?|| |
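One possible shape for a datapoint, following the columns of the table above, is sketched below. The class and field names are assumptions for illustration, and the example values are made up.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Any, Dict


@dataclass
class Datapoint:
    component: str    # e.g. "extractions" or "health"
    event_type: str   # e.g. "extraction event" or "ping update"
    user: str         # captured for every event
    timestamp: float = field(default_factory=time.time)  # seconds since epoch
    data: Dict[str, Any] = field(default_factory=dict)   # the "data captured" column

    def to_json(self) -> str:
        """Serialize to the JSON string that would be placed on the queue."""
        return json.dumps(asdict(self), sort_keys=True)


# Illustrative values only; none of these come from real Clowder events.
extraction = Datapoint("extractions", "extraction event", "some-user",
                       data={"message": "started", "type": "working"})
health = Datapoint("health", "ping update", "monitor",
                   data={"response_time": 0.25, "queue_length": 4})
```

Keeping the event-specific values in a free-form `data` field lets new event types be added without changing the envelope that every consumer has to parse.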