Once we have an Elasticsearch index of PubMed data, we will need a process to update the index when PubMed releases new data.
According to the website (https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/), PubMed updates the bulk packages on a weekly basis.
Your task is to outline our options for updating the PubMed index. A few options come to mind:
- Run a weekly cron job to download the complete bulk OA data, identify new/deleted documents, and update the index accordingly.
- Use the bulk data for initial indexing, then use the OAI-PMH service (https://www.ncbi.nlm.nih.gov/pmc/tools/oai/) to query for any documents added since the last indexing run. This means that we'll need a way to track the time of the last indexing run (or to query it from ES). We'll also need to confirm that the XML format from the OAI-PMH service is the same as the format of the bulk download.
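
For the first option, the "identify new/deleted documents" step could be a plain set comparison of document IDs. The following is a minimal sketch, assuming that each document has a stable ID (for example a PMCID) and that both ID lists fit in memory; the function name `plan_index_updates` and the sample IDs are hypothetical.

```python
def plan_index_updates(indexed_ids, bulk_ids):
    """Compare the IDs already in the index with the IDs in the latest bulk download.

    Returns the IDs to add to the index and the IDs to delete from it.
    Assumption: an ID identifies the same document in both sources.
    """
    indexed = set(indexed_ids)
    bulk = set(bulk_ids)
    return {
        "to_add": sorted(bulk - indexed),      # in the new download, not yet indexed
        "to_delete": sorted(indexed - bulk),   # indexed, but gone from the download
    }


# Hypothetical example IDs, for illustration only.
plan = plan_index_updates(
    indexed_ids=["PMC1", "PMC2", "PMC3"],
    bulk_ids=["PMC2", "PMC3", "PMC4"],
)
print(plan)
```

Note that a comparison on IDs alone will not notice documents whose content changed while the ID stayed the same; catching those would need an extra signal such as a last-updated date or a checksum.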
Time permitting, you can also implement a proof-of-concept. For example, develop a Python script that downloads all PMC articles published after a certain date and adds them to the ES index.
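
As a possible starting point for the proof-of-concept, here is a sketch of the OAI-PMH side only: building a `ListRecords` request for everything after a given date, and reading the record headers and the resumption token from a response. The `verb`, `from`, `metadataPrefix`, and `resumptionToken` arguments and the XML namespace come from the OAI-PMH 2.0 protocol. The base URL and the `pmc` metadata prefix are assumptions that should be checked against the PMC OAI-PMH documentation linked above, and the function names are hypothetical. Fetching the URL and writing to ES are left out.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace defined by the OAI-PMH 2.0 protocol.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Assumption: verify the endpoint against the PMC OAI-PMH documentation.
BASE_URL = "https://www.ncbi.nlm.nih.gov/pmc/oai/oai.cgi"


def list_records_url(from_date, metadata_prefix="pmc",
                     resumption_token=None, base_url=BASE_URL):
    """Build a ListRecords URL for records changed on or after from_date (YYYY-MM-DD).

    When continuing a paged result, OAI-PMH expects the resumption token
    as the only argument besides the verb.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {
            "verb": "ListRecords",
            "from": from_date,
            "metadataPrefix": metadata_prefix,  # assumption: "pmc" is a supported prefix
        }
    return f"{base_url}?{urlencode(params)}"


def parse_list_records(xml_text):
    """Return (record headers, resumption token or None) from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", OAI_NS):
        header = record.find("oai:header", OAI_NS)
        records.append({
            "identifier": header.findtext("oai:identifier", namespaces=OAI_NS),
            "datestamp": header.findtext("oai:datestamp", namespaces=OAI_NS),
            # Deleted records are flagged with status="deleted" on the header.
            "deleted": header.get("status") == "deleted",
        })
    token = root.findtext(".//oai:resumptionToken", namespaces=OAI_NS)
    token = (token or "").strip() or None  # an empty token marks the last page
    return records, token
```

A driver loop would request `list_records_url(last_index_date)`, parse the response, index or delete each record in ES, and repeat with the returned token until it is `None`.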