PART I: DATABASE DESIGN FOR TRACKING

The overall goal is to add a new group of Mongo collections that track usage and user activity on a per-resource basis:


With those goals in mind, one initial implementation:



PART II: NEW API ENDPOINTS FOR REPORTING

The other component building on this feature is the reporting capability, which lets privileged users download CSV reports summarizing usage.

Proposed API endpoints to support this:

These first three would provide a list of any datasets or collections that have views or downloads > 0, ideally in descending order, although the sorting could be done afterward in Excel (for example) if it has a significant impact on API performance. These statistics are tracked per resource as totals, and these reports contain no identifying user information. I imagine a report structure like:

resource_id  resource_name  views  downloads  last_viewed         last_downloaded
12345678     Max's Dataset  3      1          2018-08-16 4:30:12  2018-08-14 10:20:12
...
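
As a rough illustration of the report structure above, here is a minimal Python sketch that filters out resources with no activity, sorts the rest in descending order, and writes CSV. The function name `build_usage_report` and the in-memory list of dicts are assumptions for illustration only; a real endpoint would read the counts from the tracking collections.

```python
import csv
import io

# Column order taken from the report structure above.
FIELDS = ["resource_id", "resource_name", "views", "downloads",
          "last_viewed", "last_downloaded"]


def build_usage_report(records):
    """Return CSV text for resources with at least one view or download.

    `records` is a hypothetical list of dicts keyed by the column names.
    """
    # Keep only resources where views or downloads > 0.
    active = [r for r in records if r["views"] > 0 or r["downloads"] > 0]
    # Descending order: most views first, downloads as the tie-breaker.
    active.sort(key=lambda r: (r["views"], r["downloads"]), reverse=True)

    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, extrasaction="ignore",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(active)
    return buf.getvalue()
```

If sorting turns out to be expensive on the server, the `sort` call can be dropped and the ordering left to the spreadsheet, as noted above.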




I am going to explore whether it is feasible to also include a simple /metrics report without a resource type that would cover all 3 categories, but I think it is useful to have them available separately as well.

These reports would add a breakdown by individual user. These endpoints are a bit trickier because permissions become more of an issue: it is no longer just a question of whether the requesting user has permission to see each resource, but also of defining a set of rules that dictate whether other users appear in the report. I am not sure whether we have other features that address this (e.g. a check for whether you have permission to see another user's User page; if that is equivalent to being allowed to see another user's activity, this is perhaps not so tricky).

In general it's important to point out that the user and non-user reports will not necessarily total to the same numbers. The non-user totals will include public views and downloads for publicly available datasets that won't have any user associated with them, and depending on permissions and the requesting user, some users may be omitted from the user reports. With a carefully defined permissions model and a user with appropriately elevated privileges, this can be mitigated.
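
To make the discrepancy described in the paragraph above concrete, here is a small worked example in Python. The event records, the `user` field (with `None` standing for an anonymous public view), and the `visible_users` set are all hypothetical; they only stand in for whatever the tracking collections and permissions rules end up looking like.

```python
from collections import Counter

# Hypothetical view events for one public dataset; user None = anonymous visitor.
events = [
    {"resource_id": "12345678", "user": "alice"},
    {"resource_id": "12345678", "user": "bob"},
    {"resource_id": "12345678", "user": None},
    {"resource_id": "12345678", "user": "alice"},
]


def total_views(events):
    """Non-user report: every view counts, including anonymous ones."""
    return len(events)


def per_user_views(events, visible_users):
    """User report: only views by users the requester is allowed to see."""
    counts = Counter(e["user"] for e in events if e["user"] is not None)
    return {u: n for u, n in counts.items() if u in visible_users}


total = total_views(events)                          # 4 views overall
restricted = per_user_views(events, {"alice"})       # bob is hidden
elevated = per_user_views(events, {"alice", "bob"})  # all users visible
```

Here the non-user total is 4, the restricted user report sums to 2, and the elevated user report sums to 3. Elevated privileges remove the gap caused by hidden users, while the anonymous view still appears only in the non-user total.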

Ultimately the goal is to support visualization of these results as well. One thought is that it could also be valuable to include parent resource information:

This could facilitate some cool aggregate charts or allow an interesting look at breakdowns at a higher level than a simple long list of resources, e.g. a pie chart for a collection showing which datasets in that collection are getting the most views. Including these relationships in the report would make this kind of chart much easier to generate in an environment like Excel.
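
As a sketch of how parent information could feed the kind of pie chart described above, the following Python groups report rows by parent and computes each dataset's share of its collection's views. The `parent_id` column name and the sample rows are assumptions for illustration; the actual report columns are not settled.

```python
from collections import defaultdict


def views_by_parent(rows):
    """Group dataset view counts under their parent (e.g. a collection)."""
    grouped = defaultdict(dict)
    for row in rows:
        grouped[row["parent_id"]][row["resource_name"]] = row["views"]
    return dict(grouped)


def view_shares(datasets):
    """Return each dataset's fraction of the parent's total views."""
    total = sum(datasets.values())
    if total == 0:
        return {}
    return {name: count / total for name, count in datasets.items()}


# Hypothetical report rows that include the parent relationship.
rows = [
    {"parent_id": "c1", "resource_name": "Max's Dataset", "views": 3},
    {"parent_id": "c1", "resource_name": "Other Dataset", "views": 1},
    {"parent_id": "c2", "resource_name": "Third Dataset", "views": 7},
]

shares_c1 = view_shares(views_by_parent(rows)["c1"])
```

Each entry in `shares_c1` corresponds to one slice of the pie for collection `c1`. The same grouping could be done with a pivot table in Excel once the parent column is in the report.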