  1. National Data Service
  2. NDS-961

Define process for updating PubMed OA data


    • Type: Task
    • Resolution: Fixed
    • Priority: Normal
    • Sprint: NDS Sprint 28, NDS Sprint 29

      Once we have an Elasticsearch (ES) index of PubMed data, we will need a process to update the index when PubMed releases new data.

      According to the website (https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/), PubMed updates the bulk packages on a weekly basis.

      Your task is to outline our options for updating the PubMed index. A few options come to mind:

      • Run a weekly cron job to download the complete bulk OA data, identify new/deleted documents, and update the index accordingly.
      • Use the bulk data for initial indexing, then use the OAI-PMH service (https://www.ncbi.nlm.nih.gov/pmc/tools/oai/) to query for documents added since the last index run. This means that we'll need a way to track the last index time (or to query it from ES). We'll also need to confirm that the XML format returned by the OAI-PMH service is the same as that of the bulk download.
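
For the first option, the "identify new/deleted documents" step could be sketched as a comparison of two weekly file lists. This is only an illustration: the column names `Accession ID` and `Last Updated`, the helper names, and the sample rows are all assumptions, and the header of the real bulk file list should be checked before relying on them.

```python
import csv
import io


def load_manifest(csv_text, id_col="Accession ID", updated_col="Last Updated"):
    """Map article ID -> last-updated string from a CSV file list.

    The column names are assumptions; adjust them to the real header row.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[id_col]: row[updated_col] for row in reader}


def diff_manifests(previous, current):
    """Return (added, changed, deleted) ID lists between two manifests."""
    added = sorted(set(current) - set(previous))
    deleted = sorted(set(previous) - set(current))
    # An ID present in both lists counts as changed if its timestamp differs.
    changed = sorted(
        doc_id
        for doc_id in set(current) & set(previous)
        if current[doc_id] != previous[doc_id]
    )
    return added, changed, deleted


# Made-up sample data standing in for two consecutive weekly file lists.
last_week = load_manifest(
    "Accession ID,Last Updated\n"
    "PMC1,2017-01-01\n"
    "PMC2,2017-01-01\n"
    "PMC3,2017-01-01\n"
)
this_week = load_manifest(
    "Accession ID,Last Updated\n"
    "PMC1,2017-01-01\n"
    "PMC2,2017-01-08\n"
    "PMC4,2017-01-08\n"
)

added, changed, deleted = diff_manifests(last_week, this_week)
print("index:", added, "reindex:", changed, "delete:", deleted)
```

Keeping the previous week's list on disk is what makes this work. It avoids asking ES for every indexed ID, at the cost of one more file to manage.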

      Time permitting, you can also implement a proof-of-concept. For example, develop a Python script to download all PMC articles after a certain date and add them to the ES index.
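
A possible starting point for such a proof-of-concept is sketched below, with the network and ES calls left out so the logic can be checked offline. The `verb`, `from`, `metadataPrefix`, and `resumptionToken` parameters and the XML namespace come from the OAI-PMH protocol itself. The base URL is a placeholder, the `pmc` metadata prefix is an assumption to verify against the service documentation, and the function names, index name, and sample response are invented for illustration. The returned dicts are meant to follow the action shape used by the bulk helper in the elasticsearch Python client, which should also be confirmed for the client version in use.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, from_date=None, metadata_prefix="pmc", token=None):
    """Build an OAI-PMH ListRecords URL.

    Per the protocol, a resumptionToken is sent on its own, without
    'from' or 'metadataPrefix'.
    """
    if token:
        params = {"verb": "ListRecords", "resumptionToken": token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date:
            params["from"] = from_date
    return base_url + "?" + urlencode(params)


def to_bulk_actions(response_xml, index_name):
    """Turn one ListRecords response into (actions, next_token)."""
    root = ET.fromstring(response_xml)
    actions = []
    for record in root.iterfind(".//oai:record", OAI_NS):
        header = record.find("oai:header", OAI_NS)
        doc_id = header.findtext("oai:identifier", namespaces=OAI_NS)
        if header.get("status") == "deleted":
            # Deleted records carry a header but no metadata.
            actions.append({"_op_type": "delete", "_index": index_name, "_id": doc_id})
            continue
        metadata = record.find("oai:metadata", OAI_NS)
        article = metadata[0] if metadata is not None and len(metadata) else None
        source = {
            "datestamp": header.findtext("oai:datestamp", namespaces=OAI_NS),
            # Store the raw article XML; field extraction is left out here.
            "xml": ET.tostring(article, encoding="unicode") if article is not None else "",
        }
        actions.append(
            {"_op_type": "index", "_index": index_name, "_id": doc_id, "_source": source}
        )
    token = root.findtext(".//oai:resumptionToken", namespaces=OAI_NS)
    return actions, (token or None)


# Made-up response used only to exercise the parser.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example:1</identifier>
        <datestamp>2017-01-08</datestamp>
      </header>
      <metadata><article><title>Example title</title></article></metadata>
    </record>
    <record>
      <header status="deleted">
        <identifier>oai:example:2</identifier>
        <datestamp>2017-01-09</datestamp>
      </header>
    </record>
    <resumptionToken>token-123</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

print(list_records_url("https://example.org/oai", from_date="2017-01-01"))
actions, next_token = to_bulk_actions(SAMPLE, "pubmed")
for action in actions:
    print(action["_op_type"], action["_id"])
```

A full script would fetch each URL, pass the actions to ES, and repeat with the returned token until none is left. It would then record the run time as the next `from` date.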

       

              willis8 Craig Willis
              willis8 Craig Willis