There are several parallel efforts to capture information about Clowder metrics:
- User activity (backend)
  - resources created or uploaded
  - bytes added or removed
  - extractors triggered & runtime
- User activity (frontend)
  - page views
- System health
  - Response time of pinging the Clowder website
  - Response time of downloading the Clowder homepage, and the size of the homepage in bytes
  - Uptime of the Clowder website
The goal is to minimize the number of moving parts needed to capture and store this data. Below is a summary of our discussion from 12/7.
RabbitMQ Queue & Flask API
Use a queue to store data points. This is not an extractor queue, but a new, dedicated system queue.
- Clowder can write messages directly to RabbitMQ
- A lightweight Flask API, run in a Python container, also connects to RabbitMQ
  - Other code can post datapoints to this API, which forwards them to RabbitMQ
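As a rough illustration of the Python side of this (the Flask container or any other Python producer), the sketch below builds a datapoint and publishes it on a pika-style channel. The queue name `clowder.metrics`, the field names, and both helper functions are assumptions made for this example; the notes do not fix a message format.

```python
import json
import time
import uuid

# Hypothetical names: the notes do not specify an exchange or queue name.
METRICS_EXCHANGE = ""               # the default exchange routes by queue name
METRICS_QUEUE = "clowder.metrics"


def build_datapoint(component, event_type, data, user=None, timestamp=None):
    """Return a JSON-serializable dict describing one event."""
    return {
        "id": str(uuid.uuid4()),
        "component": component,
        "event_type": event_type,
        "user": user,
        "timestamp": timestamp if timestamp is not None else time.time(),
        "data": data,
    }


def publish_datapoint(channel, datapoint):
    """Publish one datapoint and return the encoded message body.

    `channel` only needs a `basic_publish(exchange, routing_key, body)`
    method, which pika's BlockingChannel provides.
    """
    body = json.dumps(datapoint).encode("utf-8")
    channel.basic_publish(
        exchange=METRICS_EXCHANGE, routing_key=METRICS_QUEUE, body=body
    )
    return body
```

With pika, the channel would typically come from `pika.BlockingConnection(pika.ConnectionParameters(host)).channel()`, after a `queue_declare` for the chosen queue. Clowder itself would write to the same queue through its own RabbitMQ client rather than through this Python code.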
Flask API design notes - ideally these endpoints also match the calls on the new backend SinkService:
- Seems lightweight enough for possibly a single generic endpoint to enqueue an item
- Can expand to an endpoint-per-event-type as API evolves or as needs grow
- Use Swagger from the start, as best practice... should make it easier to alter/scale/sync API server/clients when necessary
- Requires authentication via API key or some other mechanism to prevent spam or potentially malicious fake messages
- Cannot fetch / auth using Clowder (since this is for handling the case when Clowder is down)
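The design notes above could look roughly like the following: a single generic endpoint that checks an API key held by the service itself (not by Clowder) and enqueues the item. The `/datapoints` path, the `X-API-Key` header, the required fields, and the `enqueue` callable are all assumptions for illustration, not an agreed interface.

```python
import hmac


def is_authorized(presented_key, valid_keys):
    """Compare the presented API key against the locally configured keys."""
    if not presented_key:
        return False
    presented = presented_key.encode("utf-8")
    # compare_digest avoids leaking key contents through timing differences.
    return any(hmac.compare_digest(presented, key.encode("utf-8")) for key in valid_keys)


def validate_datapoint(payload):
    """Return an error string, or None if the payload can be enqueued."""
    if not isinstance(payload, dict):
        return "body must be a JSON object"
    for field in ("component", "event_type", "data"):  # hypothetical required fields
        if field not in payload:
            return "missing field: " + field
    return None


def create_app(valid_keys, enqueue):
    """Build the Flask app; `enqueue(payload)` forwards the item to RabbitMQ."""
    # Imported here so the helpers above can be used and tested without Flask.
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    @app.route("/datapoints", methods=["POST"])
    def post_datapoint():
        if not is_authorized(request.headers.get("X-API-Key"), valid_keys):
            return jsonify(error="invalid or missing API key"), 401
        payload = request.get_json(silent=True)
        error = validate_datapoint(payload)
        if error:
            return jsonify(error=error), 400
        enqueue(payload)
        return jsonify(status="queued"), 202

    return app
```

Keeping the key check local to this service is what lets external monitors keep posting while Clowder is down. Splitting into an endpoint per event type later would mostly mean adding routes with stricter validation.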
Internal Clowder events service
For the user activity (Max's reporting part and Mike's Clickstream stuff basically) we can call an internal RabbitMQ service for the events that we want to capture, to generate datapoints.
Current (frontend) tracking:
- Allows for configuration of Amplitude API key
- If configured, tracking snippet added to every view (via
- Events tracked:
  - Resource views (files, datasets, collections)
  - Files / datasets submitted to extractor
  - File uploads
Proposed tracking:
- Allow for configuration of Amplitude API key (no change)
- If configured, tracking snippet added to every view (no change)
- For the tracked events above, call the new backend SinkService, which will check for configured integrations with Amplitude/Google Analytics/etc and delegate appropriately:
  - This will be a new piece of code that will submit to our special RabbitMQ queue/exchange
  - If Amplitude is configured, also send to Amplitude via the REST API
  - If GA is configured, also send to Google Analytics (NOTE: may be difficult or impossible - see https://stackoverflow.com/questions/15530487/restful-api-and-google-analytics)
- Bonus points: add a backend action that automatically tracks API calls and sends to the SinkService
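The delegation described for the SinkService can be sketched as a simple fan-out. This is a Python illustration of the idea only; the real service would live in Clowder's backend, and the class layout, config keys, and `*_send` callables here are hypothetical.

```python
class SinkService:
    """Fan one tracked event out to every configured destination."""

    def __init__(self):
        self._sinks = {}

    def register(self, name, send):
        """Register a destination; `send(event)` performs the delivery."""
        self._sinks[name] = send

    def log_event(self, event):
        """Deliver to all sinks and return a dict of failures by sink name.

        A failing sink (for example an unreachable analytics API) must not
        prevent delivery to the others.
        """
        failures = {}
        for name, send in self._sinks.items():
            try:
                send(event)
            except Exception as exc:
                failures[name] = str(exc)
        return failures


def build_sink_service(config, rabbitmq_send, amplitude_send, ga_send):
    """Wire up sinks based on which integrations are configured."""
    service = SinkService()
    service.register("rabbitmq", rabbitmq_send)  # always enabled
    if config.get("amplitude_api_key"):          # hypothetical config key
        service.register("amplitude", amplitude_send)
    if config.get("ga_tracking_id"):             # hypothetical config key
        service.register("google_analytics", ga_send)
    return service
```

Returning the failures instead of raising keeps the tracked user action (an upload, a view) from failing just because an analytics destination is unavailable.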
Clowder health monitor(s)
Bing's external monitor can't call Clowder, because it has to operate even when Clowder is down. Instead the monitors in different regions can collect and post their datapoints to the Flask API, which can go around Clowder into RabbitMQ directly.
We run a service in a Docker container that periodically fetches statistics about the Clowder service, e.g. uptime, response time, and the number of active connections to Clowder. The data is stored in a backend service, e.g. InfluxDB (this will need extra service endpoints), and Grafana retrieves the data and renders it for visualization.
Uptime: this tells us whether the Clowder service is live. It is collected by pinging the target Clowder website with a configured timeout.
Response time: we also record the response time of each ping, and the elapsed time to download the Clowder homepage.
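A minimal sketch of the homepage check, using only the Python standard library, is shown below. It covers the download timing and page size; an ICMP ping would need a separate mechanism and is not shown. The function name, the result fields, and the `opener` parameter (there so the probe can be exercised without a network) are assumptions for this example.

```python
import time
import urllib.request


def probe_homepage(url, timeout=5.0, opener=urllib.request.urlopen):
    """Download `url` once and report liveness, elapsed time, and size."""
    started = time.monotonic()
    try:
        with opener(url, timeout=timeout) as response:
            body = response.read()
            status = response.status
    except OSError as exc:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return {
            "up": False,
            "elapsed_s": time.monotonic() - started,
            "bytes": 0,
            "error": str(exc),
        }
    return {
        "up": 200 <= status < 400,
        "elapsed_s": time.monotonic() - started,
        "bytes": len(body),
        "status": status,
    }
```

Each result could then be posted to the Flask API as one datapoint, so the monitor never depends on Clowder being reachable.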
Number of connections: it would be useful to see how many connections are made to the Clowder website within a given period of time. We would analyze the NGINX log to get this information.
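One way the log analysis might work is sketched below. It assumes NGINX's default "combined" access log format and counts requests and distinct client addresses per fixed time window, which is only a proxy for connections since one connection can carry several requests. The function name and the five-minute default are illustrative.

```python
import re
from collections import Counter
from datetime import datetime, timedelta

# Client address and timestamp at the start of a "combined" format line,
# e.g. 10.0.0.1 - - [07/Dec/2018:10:00:01 +0000] "GET / HTTP/1.1" 200 ...
LOG_LINE = re.compile(r"^(?P<addr>\S+) \S+ \S+ \[(?P<time>[^\]]+)\]")
TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def requests_per_window(lines, window=timedelta(minutes=5)):
    """Map each window start (epoch seconds) to (requests, distinct clients)."""
    seconds = int(window.total_seconds())
    requests = Counter()
    clients = {}
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the expected format
        when = datetime.strptime(match.group("time"), TIME_FORMAT)
        bucket = int(when.timestamp()) // seconds
        requests[bucket] += 1
        clients.setdefault(bucket, set()).add(match.group("addr"))
    return {
        bucket * seconds: (requests[bucket], len(clients[bucket]))
        for bucket in sorted(requests)
    }
```

If true connection counts are needed rather than request counts, the log format would have to be extended or another data source used.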
Finally, we need a service that actually pulls the messages from RabbitMQ and writes them into a database, whether that is MongoDB, InfluxDB, or something else. These services could perhaps register with Clowder the way extractors do, so that each gets its own queue and several of them can log to different destinations at once.
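The consuming side might look like the sketch below: a pika-style message callback that decodes each datapoint, hands it to a storage function, and acknowledges only after the write succeeds. `make_handler` and the `write` callable are placeholders for this example; the actual MongoDB or InfluxDB insert is not shown.

```python
import json


def make_handler(write):
    """Return a pika-style message callback that stores each datapoint.

    `write(datapoint)` stands in for the database insert.
    """

    def on_message(channel, method, properties, body):
        try:
            datapoint = json.loads(body)
        except ValueError:
            # Malformed message: reject without requeueing so it cannot loop.
            channel.basic_nack(delivery_tag=method.delivery_tag, requeue=False)
            return
        write(datapoint)
        # Acknowledge only after the write succeeded, so that a crash before
        # this point leads to redelivery instead of a lost datapoint.
        channel.basic_ack(delivery_tag=method.delivery_tag)

    return on_message
```

With pika 1.x this callback would be passed as `on_message_callback` to `channel.basic_consume` before calling `channel.start_consuming()`. Running one such consumer per destination, each on its own queue, matches the idea of logging to several destinations at once.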
Let's consider some different types of events. Assume user and timestamp for all data captured too.
|component||event type||data captured||notes|
| || ||fileid, datasetid, spaceid, bytes|| |
|extractions||extraction event||message, type (queued or working)||do we care about data traffic downloaded to the extractor containers?|
| || || ||do we care about every page view? this is currently tracking which resources are being viewed but without the full url|
|health||ping update||response time, queue length, other?|| |