...

Polyglot

: The Brown Dog component responsible for file format conversions. Built on Software Servers and Daffodil, Polyglot is a highly distributed and extensible service that brings together and manages conversion capabilities, both the needed computation and the data movement, from other software, tools, and services under a broadly accessible REST API.
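Since clients talk to Polyglot purely over HTTP, a conversion request is just a URL naming the desired output format and the input file. As a minimal sketch, assuming a DAP-style `/convert/<output-format>/<input-file-URL>` endpoint (the path pattern, host, and port here are illustrative assumptions, not documented values), a client might build such a URL like this:

```python
from urllib.parse import quote

def conversion_url(host, output_format, input_url):
    """Build a request URL for a hypothetical Polyglot-style conversion
    endpoint; the input file URL is percent-encoded so it can be embedded
    as a single path segment."""
    return "http://{}/convert/{}/{}".format(
        host, output_format, quote(input_url, safe=""))

print(conversion_url("polyglot.example.org:8184", "pdf",
                     "http://example.org/paper.doc"))
# http://polyglot.example.org:8184/convert/pdf/http%3A%2F%2Fexample.org%2Fpaper.doc
```

An HTTP GET on such a URL would then return (or redirect to) the converted file, which is what lets arbitrary tools and services sit behind the one uniform API.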

pyClowder2

: Python utility library that simplifies adding an analysis tool as an extractor, wrapping most interactions with Clowder as Python functions.

Software Server

: Lightweight web utility used to add a REST API onto arbitrary software and tools. A component of the Polyglot framework/repository used in the creation of converters.

Uncurated Data

: Think of a dump of some random hard drive. Without meaningful file names and a meaningful directory structure, it is difficult to find information without examining each and every file. File formats, in particular old and/or proprietary ones, hinder the situation further by making it difficult to open a given file without the needed software installed on your machine. Metadata is another way of providing insight into the contents of a file: consider a document tagged with the keywords "paper, large dynamic groups", indicating a paper submission for a social science study into the behavior of large groups of people.

Curated data is data that has been stored and diligently named, organized, and tagged so that others, both today and long into the future, can utilize it. Uncurated data, on the other hand, has little of this and is essentially a big mess for others to go through. A significant amount of digital data, if not most, is uncurated. In the scientific world this is sometimes referred to as "long tail" data, linking it to the tail of the distribution of project sizes: the vast majority of smaller projects lack the resources to properly manage the data they produce.

The bottom line is that curation is a cumbersome process, and creating new data is both faster and more rewarding, at least in the short term, than going back and organizing old data. As science hinges on reproducibility and on building on past results, however, these problems must be addressed.

...