This process was run on a sample PubMed dataset (15 documents). The full PubMed dataset contains more than 700k documents; the same process can be applied to index it with minor changes to the scripts (e.g., paths).
1. Create the PubMed index using the same settings.json as the biocaddie index.
Script: create-pubmed-index.sh in biocaddie/elasticsearch
thphan@biocaddie-dev:/data/thphan/biocaddie/elasticsearch$ ./create-pubmed-index.sh
{
  "acknowledged" : true,
  "shards_acknowledged" : true
}
Verify that the pubmed index was created.
thphan@biocaddie-dev:/data/thphan/biocaddie/elasticsearch$ curl -XGET 'localhost:9200/_cat/indices?v&pretty'
health status index     uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   biocaddie 5CCZ1JQ8Ty2a3TWO6Cwdzw   1   0     794929            0      4.8gb          4.8gb
green  open   pubmed    2RU56tAJTESXayIymZIbDw   1   0          0            0       130b           130b
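The same check can be scripted. The following is a minimal Python sketch, assuming the text printed by _cat/indices?v has already been captured and that every column is populated (as for the two open indices here). parse_cat_indices is a hypothetical helper name, not part of the biocaddie scripts.

```python
def parse_cat_indices(text):
    """Parse the whitespace-separated table printed by _cat/indices?v."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    # Assumption: no column is empty, so splitting on whitespace lines up with the header.
    rows = [dict(zip(header, line.split())) for line in lines[1:]]
    # Key the rows by index name for easy lookup.
    return {row["index"]: row for row in rows}


# Output copied from the check right after the index was created.
sample = """\
health status index     uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   biocaddie 5CCZ1JQ8Ty2a3TWO6Cwdzw   1   0     794929            0      4.8gb          4.8gb
green  open   pubmed    2RU56tAJTESXayIymZIbDw   1   0          0            0       130b           130b
"""

indices = parse_cat_indices(sample)
print("pubmed" in indices, indices["pubmed"]["docs.count"])
```

A freshly created index should report a docs.count of 0, which is what the sample output shows for pubmed.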
2. Convert the PubMed nxml data into JSON format for indexing in Elasticsearch.
The PubMed test dataset is stored on the SDSC server under /shared/pubmed/test and includes 15 documents.
To index PubMed documents with Elasticsearch, we need to convert them into JSON format.
Elasticsearch can load a whole dataset in one request through the _bulk API (https://www.elastic.co/guide/en/kibana/current/tutorial-load-dataset.html), so we don't need to load documents one by one.
To use the _bulk API, we first need to convert all documents into a single JSON file, pubmeddata.json.
pubmeddata.json has the following format, with one action line followed by one document line per document:
{"index":{"_id":"<pmcid 1>"}}
{"name":"<filename 1>","pmcid": <pmcid 1>,"text": "<document text 1>"}
{"index":{"_id":"<pmcid 2>"}}
{"name":"<filename 2>","pmcid": <pmcid 2>,"text": "<document text 2>"}
...etc...
We create pubmeddata.json with the script ~/biocaddie/scripts/xml2json-pubmed.sh. The script extracts the pmcid and uses it for both the "_id" and "pmcid" values, uses the filename for the "name" value, and uses the document text (with XML tags stripped and special characters and newlines removed) for the "text" value.
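The conversion described above can be sketched as follows. This is an illustration in Python, not the actual xml2json-pubmed.sh, and it rests on several assumptions: the PMC id is taken from an `<article-id pub-id-type="pmc">` element, the id is numeric, and "special characters" means anything outside letters, digits, spaces and basic punctuation. The function name nxml_to_bulk_lines and the sample document are hypothetical.

```python
import html
import json
import re


def nxml_to_bulk_lines(filename, nxml):
    """Return the two _bulk lines (action line, document line) for one nxml document."""
    # Assumption: the PMC id lives in <article-id pub-id-type="pmc">...</article-id>.
    match = re.search(r'<article-id\s+pub-id-type="pmc">\s*(\d+)\s*</article-id>', nxml)
    if match is None:
        raise ValueError("no pmc article-id found in " + filename)
    pmcid = match.group(1)

    text = re.sub(r"<[^>]+>", " ", nxml)                 # strip XML tags
    text = html.unescape(text)                           # decode entities such as &amp;
    text = re.sub(r"[^A-Za-z0-9.,;:()\- ]+", " ", text)  # drop special characters and newlines
    text = re.sub(r"\s+", " ", text).strip()             # collapse runs of whitespace

    action = {"index": {"_id": pmcid}}
    # "pmcid" is written as a number to match the unquoted value in the format above.
    source = {"name": filename, "pmcid": int(pmcid), "text": text}
    return json.dumps(action), json.dumps(source)


# Hypothetical miniature document, only to show the shape of the output.
sample_nxml = (
    '<article><front><article-id pub-id-type="pmc">2606187</article-id></front>\n'
    "<body><p>Gene &amp; protein\ndata.</p></body></article>"
)
for line in nxml_to_bulk_lines("sample.nxml", sample_nxml):
    print(line)
```

Writing the two returned lines for every document, each followed by a newline, produces a file in the pubmeddata.json layout. Note that this sketch keeps the text of every element, including the article id itself, because it strips tags without interpreting them.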
3. Index pubmeddata.json with Elasticsearch.
Run query: curl -H 'Content-Type: application/json' -XPOST 'localhost:9200/pubmed/dataset/_bulk?pretty' --data-binary "@/data/thphan/pubmed/json_test/pubmeddata.json"
thphan@biocaddie-dev:/data/thphan/biocaddie/elasticsearch$ curl -H 'Content-Type: application/json' -XPOST 'localhost:9200/pubmed/dataset/_bulk?pretty' --data-binary "@/data/thphan/pubmed/json_test/pubmeddata.json"
{
  "took" : 405,
  "errors" : false,
  "items" : [
    {
      "index" : {
        "_index" : "pubmed",
        "_type" : "dataset",
        "_id" : "2606187",
        "_version" : 1,
        "result" : "created",
        "_shards" : {
          "total" : 1,
          "successful" : 1,
          "failed" : 0
        },
        "created" : true,
        "status" : 201
      }
    },
    {
      "index" : {
        "_index" : "pubmed",
        "_type" : "dataset",
        "_id" : "4263610",
        "_version" : 1,
        "result" : "created",
        "_shards" : {
          "total" : 1,
          "successful" : 1,
          "failed" : 0
        },
        "created" : true,
        "status" : 201
      }
    },
    {
      "index" : {
        "_index" : "pubmed",
        "_type" : "dataset",
        "_id" : "2606182",
        "_version" : 1,
        "result" : "created",
        "_shards" : {
          "total" : 1,
          "successful" : 1,
          "failed" : 0
        },
        "created" : true,
        "status" : 201
      }
    },
    ...
  ]
}
Rerunning the index check shows that 15 documents have been added to the pubmed index.
thphan@biocaddie-dev:/data/thphan/biocaddie/elasticsearch$ curl -XGET 'localhost:9200/_cat/indices?v&pretty'
health status index     uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   biocaddie 5CCZ1JQ8Ty2a3TWO6Cwdzw   1   0     794929            0      4.8gb          4.8gb
green  open   pubmed    2RU56tAJTESXayIymZIbDw   1   0         15            0      1.3mb          1.3mb
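With more than a handful of documents, reading the _bulk response by eye is impractical. The sketch below, in Python, lists the items that were not indexed, assuming the response body has been saved as text. failed_bulk_items is a hypothetical helper name, and the failing item in the sample is made up for illustration; only the first item comes from the run above.

```python
import json


def failed_bulk_items(response_text):
    """Return (id, status, error) for every item that was not indexed successfully."""
    response = json.loads(response_text)
    failures = []
    for item in response.get("items", []):
        # Each item has a single key naming the action ("index", "create", ...).
        for result in item.values():
            status = result.get("status", 0)
            if status >= 300:
                failures.append((result.get("_id"), status, result.get("error")))
    return failures


# One item from the real response plus one invented failure.
sample_response = json.dumps({
    "took": 405,
    "errors": True,
    "items": [
        {"index": {"_index": "pubmed", "_type": "dataset", "_id": "2606187",
                   "result": "created", "status": 201}},
        {"index": {"_index": "pubmed", "_type": "dataset", "_id": "0000000",
                   "status": 400, "error": {"type": "mapper_parsing_exception"}}},
    ],
})

for doc_id, status, error in failed_bulk_items(sample_response):
    print(doc_id, status, error)
```

The top-level "errors" field is a quick first check: when it is false, as in the run above, every item succeeded and the list of failures is empty.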
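For the full dataset, a single pubmeddata.json would likely be too large to send in one request, since Elasticsearch caps the HTTP request body with http.max_content_length (100mb by default). One option is to split the bulk file into smaller request bodies. The sketch below is in Python and assumes the two-lines-per-document layout described in step 2; bulk_batches and the batch size are hypothetical choices, not part of the existing scripts.

```python
def bulk_batches(lines, docs_per_batch=1000):
    """Yield _bulk request bodies holding at most docs_per_batch documents each."""
    batch = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines between entries
        batch.append(line)
        # Every document occupies two lines: the action line and the document line.
        if len(batch) == 2 * docs_per_batch:
            yield "\n".join(batch) + "\n"  # a _bulk body must end with a newline
            batch = []
    if batch:
        yield "\n".join(batch) + "\n"


# Five placeholder documents, split into batches of two.
sample_lines = []
for i in range(1, 6):
    sample_lines.append('{"index":{"_id":"%d"}}\n' % i)
    sample_lines.append('{"name":"doc%d","pmcid":%d,"text":"..."}\n' % (i, i))

bodies = list(bulk_batches(sample_lines, docs_per_batch=2))
print(len(bodies))
```

Because the function accepts any iterable of lines, an open file object can be passed directly, so the whole file never has to be held in memory. Each yielded body can then be posted to the same _bulk endpoint used above.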