Search: Dataset Search

"The NDS will allow users to easily search for data across disciplinary boundaries. As users hone in on data of interest, they can easily switch to discipline-specific tools."

Access the DataDNS UI to search across registered repositories and datasets.

Registered dataset metadata is searchable in the prototype DataDNS UI, allowing others to discover and work on your dataset more easily.

The primary link to the dataset is taken from the landing_url field in the metadata, if present. Links to cite the dataset and view its raw metadata are currently provided.

If the dataset has the "girder" field set in its metadata (i.e. it "has" the Girder + ythub "capability"), then the user can launch a notebook with the data mounted inside.

Any links included in the metadata could easily be scraped and surfaced explicitly in the UI (although this is not currently done).
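A minimal sketch of what such link scraping might look like, assuming a hypothetical `scrape_links` helper (not part of the prototype) walking a metadata record like the one shown below under "POST Body: Dataset Metadata":

```python
import re

# Matches any http(s) URL embedded in a string value.
URL_PATTERN = re.compile(r"https?://[^\s\"']+")

def scrape_links(node):
    """Recursively collect every http(s) URL found in a metadata record."""
    links = []
    if isinstance(node, dict):
        for value in node.values():
            links.extend(scrape_links(value))
    elif isinstance(node, list):
        for item in node:
            links.extend(scrape_links(item))
    elif isinstance(node, str):
        links.extend(URL_PATTERN.findall(node))
    return links

record = {"dataset": {"label": "Test Dataset 2",
                      "landing_url": "http://landing.page.com/2/"}}
print(scrape_links(record))   # ['http://landing.page.com/2/']
```

The UI could then render each collected link alongside the existing cite and raw-metadata links.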


Publish: TBD

"The NDS will connect users to tools for building and sharing collections of data. It will help users find and deliver data to the best repository for data-publishing."

See Publishing Big Data

Link: Dataset Registration

"The NDS will create robust connections between data and published articles. When researchers reference an article, they have ready access to the underlying data."

Registering dataset metadata allows others to discover and work on your dataset more easily. While this current "connection" is not yet "robust", it's a step in the right direction toward interoperability.

If an identifier is entered into the search, all metadata records referencing that identifier will come up. This primitive "full-text search" emerged entirely by accident, and could undoubtedly be improved further. (Thank you, AngularJS filters!)
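The behavior of that accidental search can be sketched as a simple substring filter over each record's serialized form, which is essentially what an AngularJS filter does (the `filter_records` helper and sample records here are illustrative, not part of the prototype):

```python
import json

def filter_records(records, query):
    """Keep any record whose serialized metadata contains the query string,
    so searching for an identifier surfaces every record referencing it."""
    query = query.lower()
    return [r for r in records if query in json.dumps(r).lower()]

records = [
    {"id": "test2", "metadata": {"relatedIdentifiers": ["doi:10.1000/xyz"]}},
    {"id": "other", "metadata": {}},
]
print([r["id"] for r in filter_records(records, "10.1000/xyz")])   # ['test2']
```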


To opt in to the service, one need only follow the Girder Deploy instructions at the site where the data is located.

These instructions assume a Docker-enabled host that can access your dataset(s), and will walk you through:

  1. Running and configuring an instance of Girder at your site via Docker Compose
  2. Building and registering metadata for your dataset(s)

Benefits (respectively)

  1. Running Girder + ythub next to your data will allow users to launch Jupyter notebooks to perform real analysis on your data
  2. Registering your datasets allows them to show up in the prototype DataDNS UI, allowing others to discover and work on your dataset more easily

NOTE - There are two distinct objectives here, although in most cases both would be desirable:

  • Discovery: You can register your dataset(s) without running the Girder + ythub stack, but you will not be able to launch tools
  • Operability: You can run the Girder + ythub stack to launch Jupyter notebooks on your dataset(s) without registering them / making them discoverable externally
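For a registered record, the two opt-in levels above can be sketched as a small capability check (the `capabilities` helper and its labels are illustrative; only the presence of the "girder" metadata block comes from the prototype):

```python
def capabilities(record):
    """Return the capability set for a registered metadata record.
    Registration alone enables discovery; a "girder" block pointing at a
    Girder + ythub deployment additionally enables tool launching."""
    caps = {"discovery"}
    if "girder" in record.get("metadata", {}):
        caps.add("operability")
    return caps

print(capabilities({"id": "test2", "metadata": {"girder": {}}}))
```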

POST Body: Dataset Metadata

For more information, see Girder Deploy#MetadataFields

Example

Although multiple datasets can be posted in bulk, the POST body for a single posted metadata entry might look like the following:

{
  "id": "test2",
  "metadata": {
    "dataset": {
      "label": "Test Dataset 2",
      "landing_url": "http://landing.page.com/2/",
      "authors": [
        {
          "email": "test@author.com",
          "firstName": "Test",
          "lastName": "Author 1",
          "orcId": "XXXX-XXXX-XXXX-XXXX"
        },
        {
          "email": "test2@author.com",
          "firstName": "Test",
          "lastName": "Author 2"
        },
        ...
      ]
    },
    "girder": {
      "api_protocol": "http://",
      "api_host": "141.142.208.127",
      "api_port": ":8080",
      "api_suffix": "/api/v1",
      "tmpnb_proxy_port": "",
      "folder_id": "5814ec2830c4eb000199d09a",
      "guest_user": "username",
      "guest_pass": "password"
    }
  }
}
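Note how the "girder" fields compose into the API base URL (the `girder_api_url` helper is a sketch; the concatenation order follows the field names, and `api_port` already carries its leading ":" in the example above):

```python
def girder_api_url(girder):
    """Assemble the Girder API base URL from a "girder" metadata block."""
    return (girder["api_protocol"] + girder["api_host"]
            + girder["api_port"] + girder["api_suffix"])

girder = {"api_protocol": "http://", "api_host": "141.142.208.127",
          "api_port": ":8080", "api_suffix": "/api/v1"}
print(girder_api_url(girder))   # http://141.142.208.127:8080/api/v1
```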

Possible Expansions

  • More dataset metadata
    • Are any obvious fields missing?
  • Include more tool-launching agents
    • Labs Workbench
    • The Clowder/Dataverse ToolManager (which is itself extensible)
    • Raw Kubernetes / Docker?
    • Others?
  • Include data repositories for ingest (potentially useful if this could be queued / automated)
    • Dataverse
    • DSpace
    • Globus Publish
    • Clowder
    • Others?
  • Include a section for arbitrary metadata - allows dataset maintainers to link:
    • source code repositories
    • running simulation instances
    • related documentation
    • published articles
    • domain-specific metadata that would help people find this metadata?

Example

{
  "id": "",
  "label": "",
  "landing_url": "",
  "authors": [ ],
       . . .
  "girder": {  },
  "toolmanager": {  },
  "other_tool_runner_agents": {  },
       . . .
  "dataverse": {  },
  "dspace": {  },
  "globus": {  },
  "clowder": {  },
  "other_metadata_sources_to_ingest": {  },
       . . .
  "arbitrary_metadata_fields": {
    "relatedIdentifiers": ["paper_doi", "id_of_the_parent_dataset", "etc"],
    "sourceCode": ["git_url", "svn_url"],
    "domain_specific": {  },
         . . .
  }
}

Reuse: Dataset Resolution

"The NDS will not only provide access to data for download, it will provide tools for transferring data to processing platforms or allow analysis to be attached to the data."

The DataDNS UI allows users to easily find and access datasets. The resolution REST endpoint then allows nearly any service to utilize the API to launch Jupyter notebooks next to eligible datasets.

One need only be able to issue HTTP requests, which nearly any language or toolchain can do.

"Eligible" datasets might include those that compute resources attached to them that allow launching of tools, and/or those that have transfer mechanisms to somehow import or mount data onto external compute.

For now, the only tool that the prototype can launch is a Jupyter notebook via Girder's tmpnb, but this could easily be expanded to cover a host of different tool-launching agents and allocation spaces by creating a new metadata entry for each agent (i.e. "girder", "toolmanager", etc.).
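The dispatch the previous paragraph describes can be sketched as a lookup table keyed by agent metadata entry (the `resolve` and launcher functions here are stubs for illustration; a real launcher would call the agent's REST API):

```python
def launch_girder(config):
    """Stub launcher for the Girder + ythub agent; a real implementation
    would call Girder's tmpnb endpoint and return the notebook object."""
    return {"agent": "girder", "folder_id": config.get("folder_id")}

# Each known agent key maps to its launcher; new agents ("toolmanager",
# etc.) would only need a new entry here plus a metadata block.
AGENTS = {"girder": launch_girder}

def resolve(record):
    metadata = record.get("metadata", {})
    for key, launcher in AGENTS.items():
        if key in metadata:
            return launcher(metadata[key])
    raise ValueError("no known tool-launching agent in metadata")

print(resolve({"metadata": {"girder": {"folder_id": "5814ec28"}}}))
```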

Resolve

From the DataDNS UI

Click the "Launch Notebook" button next to your desired dataset to call the /resolve endpoint, and popping open your new tool in a new tab.

From any page in your browser via bookmarklet

Tired of visiting the DataDNS UI just to launch tools when you already know your dataset's ID?

Just add the following as a bookmark in your browser:

bookmarklet
javascript:DATASET_id=""+(window.getSelection?window.getSelection():document.getSelection?document.getSelection():document.selection.createRange().text);DATASET_id=DATASET_id.replace(/\r\n|\r|\n/g," ,");if(!DATASET_id)DATASET_id=prompt("Enter a Dataset ID","");if(DATASET_id!=null)window.open('http://141.142.208.142:8082/api/resolve/'+DATASET_id,'_newtab');void 0

To use the bookmarklet:

  • Highlight an ID with your cursor and click the bookmarklet to resolve it and launch a notebook
  • If no text is highlighted, you will be prompted to enter the ID to resolve

Response Data

After resolving the selected dataset, the response object (in the Girder + ythub case) will be the notebook object returned from the Girder API, wrapped up with the url to access that tool.

It is easy to see how this same pattern could be used to encapsulate calls to any REST API in this fashion, or to encapsulate any other protocol that provides a Python client library.

In the case of the DataDNS UI, the resulting tool url is automatically displayed in the UI and opened in a new tab.

In the bookmarklet example, the resulting url is printed to the screen and (currently) must be copied and pasted into your browser to access the tool.

Example

{
    "notebook": {
        "_accessLevel": 2, 
        "_id": "581516e8bd2af0000156de8d", 
        "_modelType": "notebook", 
        "containerId": "b81f6fd4a2ebd1088b29e34326410f61804ddbc573490e6ff947d38bbf7ea621", 
        "containerPath": "user/Fm28Rm4U0buF", 
        "created": "2016-10-29T21:38:46.046000+00:00", 
        "folderId": "5813c451bd2af0000156de85", 
        "lastActivity": "2016-10-29T21:38:46.046000+00:00", 
        "mountPoint": "/var/lib/docker/volumes/5813c451bd2af0000156de85_admin/_data", 
        "status": 0, 
        "userId": "581124c0bd2af000015c7e44", 
        "when": "2016-10-29T21:38:46.046000+00:00"
    }, 
    "url": "http://141.142.208.142/user/Fm28Rm4U0buF"
}
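A client consuming that response only needs the wrapped "url" field to open the launched tool; the "notebook" object is the raw record returned by the Girder API. A minimal sketch:

```python
import json

# Parse an abbreviated /resolve response (fields from the example above).
response = json.loads("""
{
  "notebook": {"_id": "581516e8bd2af0000156de8d", "status": 0},
  "url": "http://141.142.208.142/user/Fm28Rm4U0buF"
}
""")

# The wrapped "url" is all the client needs to open the tool.
tool_url = response["url"]
print(tool_url)   # http://141.142.208.142/user/Fm28Rm4U0buF
```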
