Overall goal is to add a new group of Mongo collections to track usage/user activity on a per-resource basis:
- views & downloads
- files
- datasets
- datasets can be downloaded as a zip file which include complete contents of the dataset. it is important to decide whether dataset metrics correspond to the dataset itself (e.g. that zip file downloaded N times) vs. a metric indicating the sum of downloads for all files in the dataset (e.g. files in this dataset were downloaded N times individually). the database model below suggests the former option, but the latter statistic could be generated on-the-fly with a slightly more complex query if it is desired.
- collections
- see above - database would track views of the collection page itself, not act as a sum of all views of files within the collection.
- distinction between agents
- logged in users via GUI and API
- user API keys
- extractors
- essentially want to differentiate intentional "human" access (even via API, e.g. from external processes) vs. automated access (e.g. from an internal extractor such as preview generator) for accurate usage tracking & "last used" information
- one option would be to implement a separate download endpoint intended for extractors that does not increment metrics and encourage developers to use that endpoint for any extractors that are not relevant to behavior tracking
- ability to export simple report
- per-file statistics
- bytes
- path on disk
- access count (different from simple views)
- last accessed
- potentially filter report by "only objects that have been accessed > X months ago" to avoid massive reports
- start with a downloadable CSV output
- longer-term, could adapt aspects of search results page to show list of resources including files, datasets, etc. could introduce a more compact view than current list view for administrator viewing
With those goals in mind, one initial implementation:
- implement 2 new collections in Mongo
- StatisticTotals - total views & downloads for each resource, including timestamp for last viewed and last downloaded. for driving use case, downloads/"access" is more important than page views which don't necessarily represent engagement.
- StatisticUser - views & downloads on a per-user basis. this would include via the GUI and API calls using the user's API key, with ability to exclude automated extractors from these statistics even if they use a user API key to fetch data.
- Each collection will track views & downloads for files and datasets, and views for collections.