Create ElasticSearch PubMed indexes (NDS-953)

This process was carried out on a sample of the PubMed data (15 documents). The whole PubMed dataset includes more than 700k documents; however, a similar process can be applied to index the full dataset with minor changes to the script (e.g., paths).

1. Create the PubMed index using the same settings.json as the biocaddie index.

Script: create-pubmed-index.sh in biocaddie/elasticsearch

thphan@biocaddie-dev:/data/thphan/biocaddie/elasticsearch$ ./create-pubmed-index.sh
{
  "acknowledged" : true,
  "shards_acknowledged" : true
}
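The contents of create-pubmed-index.sh are not reproduced on this page. As a rough sketch only, a script of this kind might send settings.json to the index-creation endpoint with a PUT request. The function name create_index, the DRY_RUN switch, and the variable names below are hypothetical, and the location of settings.json is an assumption.

```shell
# Hypothetical sketch of an index-creation script; NOT the actual create-pubmed-index.sh.
ES_URL="${ES_URL:-localhost:9200}"      # host and port as used elsewhere on this page
INDEX="${INDEX:-pubmed}"
SETTINGS="${SETTINGS:-settings.json}"   # assumption: settings file sits in the current directory

create_index() {
  # With DRY_RUN=1 the curl command is printed instead of executed.
  ${DRY_RUN:+echo} curl -H 'Content-Type: application/json' \
    -XPUT "$ES_URL/$INDEX?pretty" --data-binary "@$SETTINGS"
}

# Usage (commented out so that sourcing this sketch does not contact the server):
# create_index
```

Running it with DRY_RUN=1 first is a cheap way to confirm the target index name and settings path before anything is sent to the server.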

Verify that the pubmed index was created.

thphan@biocaddie-dev:/data/thphan/biocaddie/elasticsearch$  curl -XGET 'localhost:9200/_cat/indices?v&pretty'
health status index     uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   biocaddie 5CCZ1JQ8Ty2a3TWO6Cwdzw   1   0     794929            0      4.8gb          4.8gb
green  open   pubmed    2RU56tAJTESXayIymZIbDw   1   0          0            0       130b           130b

2. Convert PubMed nxml data into json format for indexing in ElasticSearch.

The PubMed test dataset is stored on the SDSC server at /shared/pubmed/test and includes 15 documents.

To index PubMed documents using ElasticSearch, we need to convert the documents into json format.

ElasticSearch can also load a whole dataset in a single request through the _bulk API (https://www.elastic.co/guide/en/kibana/current/tutorial-load-dataset.html), so we don't need to index documents one by one.

To use the _bulk API, we first need to convert all documents into a single json file, pubmeddata.json.

pubmeddata.json has the following format:

{"index":{"_id":"<pmcid 1>"}}
{"name":"<filename 1>","pmcid": <pmcid 1>,"text": "<document text 1>"}

{"index":{"_id":"<pmcid 2>"}}
{"name":"<filename 2>","pmcid": <pmcid 2>,"text": "<document text 2>"}

...etc...

We create pubmeddata.json with the script ~/biocaddie/scripts/xml2json-pubmed.sh. The script extracts the pmcid and uses it for both the "_id" and "pmcid" values, uses the filename for the "name" value, and uses the document text (with xml tags stripped and special characters and newlines removed) for the "text" value.
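The conversion described above can be sketched roughly as follows. This is not the actual xml2json-pubmed.sh: the function names nxml_to_bulk and build_bulk_file are hypothetical, the sketch assumes the PMC id is stored as <article-id pub-id-type="pmc">NNN</article-id>, and it treats every character outside a small whitelist as "special". It also assumes filenames contain no quotes or backslashes.

```shell
# Hypothetical sketch; NOT the actual xml2json-pubmed.sh.

# Print the two bulk lines (action line + document line) for one .nxml file.
nxml_to_bulk() {
  local file="$1" name pmcid text
  name=$(basename "$file")
  # Assumption: the id appears as <article-id pub-id-type="pmc">NNN</article-id>.
  pmcid=$(sed -n 's/.*<article-id pub-id-type="pmc">\([0-9][0-9]*\)<\/article-id>.*/\1/p' "$file" | head -n 1)
  [ -n "$pmcid" ] || return 1
  # Join lines, strip tags, replace anything outside the whitelist with a space
  # (this also removes quotes and backslashes, keeping the JSON valid),
  # then squeeze repeated spaces and trim both ends.
  text=$(tr '\n' ' ' < "$file" \
    | sed -e 's/<[^>]*>/ /g' \
    | tr -c '[:alnum:] .,;:()-' ' ' \
    | tr -s ' ' \
    | sed -e 's/^ //' -e 's/ $//')
  printf '{"index":{"_id":"%s"}}\n' "$pmcid"
  printf '{"name":"%s","pmcid":%s,"text":"%s"}\n' "$name" "$pmcid" "$text"
}

# Concatenate the bulk lines for every .nxml file in a directory into one file.
build_bulk_file() {
  local src_dir="$1" out_file="$2" f
  : > "$out_file"
  for f in "$src_dir"/*.nxml; do
    [ -e "$f" ] || continue
    nxml_to_bulk "$f" >> "$out_file" || echo "skipped $f: no pmcid found" >&2
  done
}
```

A call such as build_bulk_file /shared/pubmed/test pubmeddata.json would then produce one action line and one document line per input file. Every line, including the last, ends with a newline, which the _bulk API requires.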

3. Index pubmeddata.json with ElasticSearch.

Run the bulk request: curl -H 'Content-Type: application/json' -XPOST 'localhost:9200/pubmed/dataset/_bulk?pretty' --data-binary "@/data/thphan/pubmed/json_test/pubmeddata.json"

thphan@biocaddie-dev:/data/thphan/biocaddie/elasticsearch$ curl -H 'Content-Type: application/json' -XPOST 'localhost:9200/pubmed/dataset/_bulk?pretty' --data-binary "@/data/thphan/pubmed/json_test/pubmeddata.json"
{
  "took" : 405,
  "errors" : false,
  "items" : [
    {
      "index" : {
        "_index" : "pubmed",
        "_type" : "dataset",
        "_id" : "2606187",
        "_version" : 1,
        "result" : "created",
        "_shards" : {
          "total" : 1,
          "successful" : 1,
          "failed" : 0
        },
        "created" : true,
        "status" : 201
      }
    },
    {
      "index" : {
        "_index" : "pubmed",
        "_type" : "dataset",
        "_id" : "4263610",
        "_version" : 1,
        "result" : "created",
        "_shards" : {
          "total" : 1,
          "successful" : 1,
          "failed" : 0
        },
        "created" : true,
        "status" : 201
      }
    },
    {
      "index" : {
        "_index" : "pubmed",
        "_type" : "dataset",
        "_id" : "2606182",
        "_version" : 1,
        "result" : "created",
        "_shards" : {
          "total" : 1,
          "successful" : 1,
          "failed" : 0
        },
        "created" : true,
        "status" : 201
      }
    },


   ...

  ]
}
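For a larger load the response is too long to inspect by eye, so the top-level "errors" flag is the quickest thing to check. A minimal sketch, assuming the response was saved to a file (for example by redirecting the curl output to bulk-response.json); the function name bulk_ok and that file name are hypothetical:

```shell
# Return success (0) if a saved _bulk response reports "errors" : false.
# Works on both pretty-printed and compact responses.
bulk_ok() {
  grep -Eq '"errors"[[:space:]]*:[[:space:]]*false' "$1"
}

# Usage (commented out; bulk-response.json is a hypothetical file name):
# bulk_ok bulk-response.json && echo "bulk load reported no errors"
```

If the flag is true, the individual entries under "items" carry the per-document status and need to be inspected for the failing ids.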

Rerunning the index check shows that 15 documents have been added to the pubmed index.

thphan@biocaddie-dev:/data/thphan/biocaddie/elasticsearch$ curl -XGET 'localhost:9200/_cat/indices?v&pretty'
health status index     uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   biocaddie 5CCZ1JQ8Ty2a3TWO6Cwdzw   1   0     794929            0      4.8gb          4.8gb
green  open   pubmed    2RU56tAJTESXayIymZIbDw   1   0         15            0      1.3mb          1.3mb
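When this check is scripted (for instance after loading the full dataset), the docs.count value can be pulled out of the _cat/indices output rather than read by eye. A minimal sketch, assuming the output includes the header row produced by the v parameter; the function name docs_count is hypothetical:

```shell
# Print docs.count for one index; reads `_cat/indices?v` output on stdin.
docs_count() {
  awk -v idx="$1" '
    # Header row: remember which columns hold the index name and the doc count.
    NR == 1 {
      for (i = 1; i <= NF; i++) {
        if ($i == "index") ic = i
        if ($i == "docs.count") dc = i
      }
      next
    }
    # Data rows: print the count for the requested index.
    $ic == idx { print $dc }
  '
}

# Usage (commented out so that sourcing this sketch does not contact the server):
# curl -s 'localhost:9200/_cat/indices?v' | docs_count pubmed
```

Looking the columns up by header name rather than by fixed position keeps the sketch working if the column order differs between ElasticSearch versions.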
