Type: Task
Resolution: Fixed
Priority: Normal
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels:
- BioCADDIE

Sprint:
NDS Sprint 28, NDS Sprint 29
Epic Link:
BioCADDIE

Once we have an ElasticSearch index of PubMed data, we will need a process to update the index when PubMed releases new data.

According to the website (https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/), PubMed updates the bulk packages weekly.

Your task is to outline our options for updating the PubMed index. A few options come to mind:

- Run a weekly cron job to download the complete bulk OA data, identify new and deleted documents, and update the index accordingly.
- Use the bulk data for initial indexing, then use the OAI-PMH service (https://www.ncbi.nlm.nih.gov/pmc/tools/oai/) to query for any documents added since the last indexing run. This means we'll need a way to track the last index time (or to query it from ES). We'll also need to confirm that the XML format returned by the OAI-PMH service is the same as the format in the bulk download.
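
For the first option, the "identify new/deleted documents" step could be sketched as a plain set comparison. This is only an illustration: `plan_index_update` is a hypothetical helper, and it assumes both the bulk listing and the ES index can be reduced to a mapping from article ID to a last-updated timestamp in ISO format (so that string comparison orders them correctly).

```python
def plan_index_update(bulk, indexed):
    """Compare two {article_id: last_updated} maps and decide what to change.

    `bulk` describes the freshly downloaded bulk listing, `indexed` describes
    what is currently in the index. Timestamps are assumed to be ISO-formatted
    strings, which compare correctly as plain strings.
    """
    bulk_ids, indexed_ids = set(bulk), set(indexed)
    # Present in the bulk data but not yet indexed.
    to_add = sorted(bulk_ids - indexed_ids)
    # Indexed earlier but no longer present in the bulk data.
    to_delete = sorted(indexed_ids - bulk_ids)
    # Present in both, but the bulk copy is newer than the indexed copy.
    to_update = sorted(i for i in bulk_ids & indexed_ids if bulk[i] > indexed[i])
    return to_add, to_update, to_delete
```

The three lists would then drive the corresponding index, re-index, and delete requests against ES. The cost of this approach is that the full listing has to be downloaded and compared every week, which is the main trade-off against the OAI-PMH option.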

Time permitting, you can also implement a proof-of-concept. For example, develop a Python script that downloads all PMC articles published after a certain date and adds them to the ES index.
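
A possible starting point for such a proof-of-concept, using only the standard library, is sketched below. It builds an OAI-PMH `ListRecords` request with a `from` date and reads the record headers and the resumption token from a response. The `verb`, `from`, `metadataPrefix`, and `resumptionToken` parameters and the XML namespace come from the OAI-PMH 2.0 protocol; the function names, the `pmc` default for `metadata_prefix`, and whatever base URL is passed in are assumptions that should be checked against the PMC OAI documentation linked above.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace used by all OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def build_list_records_url(base_url, since=None, metadata_prefix="pmc",
                           resumption_token=None):
    """Build a ListRecords URL; `since` is a YYYY-MM-DD date string."""
    if resumption_token:
        # In OAI-PMH, resumptionToken is exclusive: no other arguments allowed.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if since:
            params["from"] = since
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (headers, resumption_token) from a ListRecords response."""
    root = ET.fromstring(xml_text)
    headers = []
    for header in root.iterfind(".//oai:record/oai:header", OAI_NS):
        headers.append({
            "identifier": header.findtext("oai:identifier", namespaces=OAI_NS),
            "datestamp": header.findtext("oai:datestamp", namespaces=OAI_NS),
            # Repositories that track deletions mark them on the header.
            "deleted": header.get("status") == "deleted",
        })
    token = root.findtext(".//oai:resumptionToken", namespaces=OAI_NS)
    # An empty or missing token means the list is complete.
    return headers, (token or "").strip() or None
```

A script would call `build_list_records_url` with the last index time, fetch the page, pass the body to `parse_list_records`, and repeat with the returned token until it is `None`. Records flagged as deleted would be removed from the ES index and the rest added or re-indexed.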

Assignee: Craig Willis

Reporter: Craig Willis

Votes: 0

Watchers: 1

Created: 11/Jul/17 1:54 PM

Updated: 29/Jul/17 5:03 AM

Resolved: 29/Jul/17 5:03 AM
