Brown Dog Data Transformation Service (DTS):
Clowder: The Brown Dog component responsible for extracting novel, often higher-level, data from file contents (e.g. metadata, tags, signatures, and other derived products) in order to index, compare, and further analyze collections of data through a broadly accessible REST API. Clowder is a web-based research data management system designed to support multiple research domains and the diverse data types used across those domains. In addition to data sharing and organizational functionality, it contains major extension points for the preprocessing, processing, previewing, and publication of data. When new data is added to the system, whether via the web front-end or through the REST API, preprocessing serving as a form of auto-curation is off-loaded to cloud-based extraction services that analyze the data's contents to extract appropriate data and metadata. These extractors, triggered based on the type of the data, analyze its contents to tag it (e.g. flood basins found in images, trees in LIDAR) and/or create lightweight web-accessible previews of large files (e.g. an image pyramid), allowing users to examine and compare the contents of one or more datasets. Complemented by a number of features supporting community-based social curation, this combined raw and derived metadata is presented to the user in the Clowder web interface and used to navigate stored collections.
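The triggering of extractors by file type described above can be sketched as a toy dispatcher. The extractor functions and registry here are purely illustrative assumptions, not Clowder's actual extractor API (which communicates over a message bus):

```python
import mimetypes

# Toy extractor registry: MIME type prefix -> metadata-producing function.
# These names are illustrative only, not Clowder's real extractors.
def image_extractor(filename):
    return {"tags": ["image"], "preview": filename + ".pyramid"}

def text_extractor(filename):
    return {"tags": ["document"], "keywords": ["paper"]}

EXTRACTORS = {"image/": image_extractor, "text/": text_extractor}

def run_extractors(filename):
    """Dispatch extractors by file type, mimicking how Clowder
    triggers preprocessing when new data arrives."""
    mime, _ = mimetypes.guess_type(filename)
    metadata = {}
    for prefix, extractor in EXTRACTORS.items():
        if mime and mime.startswith(prefix):
            metadata.update(extractor(filename))
    return metadata

print(run_extractors("basin.png"))
# → {'tags': ['image'], 'preview': 'basin.png.pyramid'}
```

The derived metadata returned here is what would be attached to the file record and surfaced in the web interface.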
Data Conversion: A transformation on digital data that largely preserves the entirety of the data. An example in the case of Brown Dog would be the transformation of a file in one 3D file format to another 3D file format. As file formats typically vary slightly, and the transformations themselves can be imperfect, variations can occur in the form of information loss. However, the intent is for the resulting data to be as intact as possible. Conversions allow one to access data more easily when the original format is not understood or is difficult to work with. This is analogous to translating between languages.
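The possibility of information loss can be seen in a minimal sketch: converting an OBJ-style 3D mesh fragment to a bare XYZ point list keeps the vertex coordinates but drops comments and face connectivity. This is a toy conversion for illustration, not one of Brown Dog's actual converters:

```python
def obj_to_xyz(obj_text):
    """Convert Wavefront-OBJ-style vertex records to a bare XYZ point list.

    Only 'v' (vertex) lines survive; comments ('#') and face records
    ('f') are discarded, illustrating how a format conversion can be
    mostly faithful yet still lose information.
    """
    points = []
    for line in obj_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "v":
            points.append(" ".join(parts[1:4]))
    return "\n".join(points)

obj = """# a single triangle
v 0.0 0.0 0.0
v 1.0 0.0 0.0
v 0.0 1.0 0.0
f 1 2 3"""
print(obj_to_xyz(obj))
```

The output format is readable by more tools, but the triangle itself (the `f` record) is gone, exactly the kind of variation the definition above warns about.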
Data Extraction: A transformation that creates new data from the given data. An example in the case of Brown Dog would be the execution of analysis code on an image file's contents to determine if a particular species of plant is present. We use extraction to automatically generate metadata and/or signatures from a file's contents and provide users with a means of finding, relating, and using data that might otherwise be difficult to work with.
Metadata: Simply data about data (e.g. tags or keywords).
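As a made-up example (not a Clowder schema), a metadata record for a tagged document might look like:

```json
{
  "file": "study.pdf",
  "tags": ["paper", "large dynamic groups"],
  "source": "keyword extractor"
}
```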
Polyglot: The Brown Dog component responsible for file format conversions. Utilizing Software Servers and Daffodil, Polyglot is a highly distributed and extensible service that brings together and manages conversion capabilities, both the needed computation and data movement, from other software, tools, and services under a broadly accessible REST API.
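Because no single tool converts between every pair of formats, a service like this can chain converters: if nothing goes directly from format A to C, but something goes A to B and B to C, a path exists. A minimal sketch of that routing idea, over a hypothetical converter graph (the formats and edges below are assumptions for illustration):

```python
from collections import deque

# Hypothetical converter graph: input format -> directly reachable outputs.
CONVERTERS = {
    "stl": ["obj"],
    "obj": ["ply", "x3d"],
    "ply": ["obj"],
}

def conversion_path(src, dst):
    """Breadth-first search for the shortest chain of conversions
    from src to dst, or None if no chain exists."""
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in CONVERTERS.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(conversion_path("stl", "x3d"))
# → ['stl', 'obj', 'x3d']
```

Shortest chains matter here because each extra hop adds computation, data movement, and another chance for information loss.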
pyClowder2: A Python utility library that simplifies the process of adding an analysis tool as an extractor, wrapping most interactions with Clowder as Python functions.
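The shape of "wrap an analysis routine so the service can call it" can be illustrated with a toy registration pattern. This mimics the idea only; pyClowder2's actual API differs (it provides an extractor base class and handles the messaging with Clowder for you):

```python
# Toy analogue of registering analysis code as a named extractor.
# Not pyClowder2's real API; purely to illustrate the wrapping idea.
EXTRACTOR_REGISTRY = {}

def extractor(name):
    """Decorator that registers a plain function under an extractor name."""
    def register(func):
        EXTRACTOR_REGISTRY[name] = func
        return func
    return register

@extractor("plant-species")
def detect_species(file_bytes):
    # Stand-in for real analysis code run on the file's contents.
    return {"species": "sunflower"} if b"sunflower" in file_bytes else {}

print(EXTRACTOR_REGISTRY["plant-species"](b"sunflower field"))
# → {'species': 'sunflower'}
```

The point of such a wrapper library is that the analysis author writes only the function body; message handling, file fetching, and posting results back are taken care of by the library.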
Software Server: A lightweight web utility used to add a REST API onto arbitrary software and tools. A component within the Polyglot framework/repository used in the creation of converters.
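The idea of putting a REST API onto an arbitrary tool can be sketched with the standard library alone. Here a plain function stands in for the wrapped tool, and the endpoint path is a made-up example, not Software Server's actual interface:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

def word_count(text):
    """Stand-in for an arbitrary local tool being wrapped."""
    return len(text.split())

class ToolHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/tool/wordcount":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length).decode("utf-8")
        result = json.dumps({"words": word_count(body)}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(result)))
        self.end_headers()
        self.wfile.write(result)

    def log_message(self, *args):
        pass  # silence per-request logging

# Bind to an ephemeral port and serve in the background.
server = HTTPServer(("127.0.0.1", 0), ToolHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# The tool is now reachable over HTTP like any REST service.
req = urllib.request.Request(
    f"http://127.0.0.1:{port}/tool/wordcount",
    data=b"brown dog data transformation", method="POST")
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()))
server.shutdown()
```

Once a tool is behind HTTP like this, a framework such as Polyglot can treat it as just another conversion capability it can route work to.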
Uncurated Data: Think of a dump of some random hard drive. Without meaningful file names and a meaningful directory structure it will be difficult to find information without examining each and every file. File formats, in particular old and/or proprietary file formats, hinder the situation further by making it difficult to open a given file without the needed software installed on your machine. Metadata is another way of providing insight into the contents of a file. Consider a document tagged with the keywords "paper, large dynamic groups", indicating a paper submission for a social science study looking into the behavior of large groups of people. Curated data is data that has been stored and diligently named, organized, and tagged so that others, both today and long in the future, can use it. Uncurated data, on the other hand, lacks much of this and is essentially a big mess for others to go through. A significant amount of digital data, if not most, is uncurated. In the scientific world this is sometimes referred to as "long tail" data, suggesting it is linked with the tail of the distribution of project sizes, with the vast majority of smaller projects not having the resources to properly manage the data they produce. The bottom line is that curation is a cumbersome process, and creating new data is both faster and more rewarding, at least in the short term, than going back and organizing old data. As science hinges on reproducibility and building on past results, however, these problems must be addressed.
Unstructured Data: Data that does not have a pre-defined data model or is not organized in a pre-defined manner. Unstructured data can be text-based but can also involve sensor data or data that quantifies some physical object or phenomenon (e.g. images, video, audio, 3D models, etc.). Such data is typically difficult to understand using traditional computer programs. Images are a good example of this. To a computer, images are nothing more than an array of numbers representing pixel intensities or colors. Though images are extremely informative to us as human beings, for a computer to make any use of them some form of preprocessing must be run. An example would be using computer vision to recognize faces within an image and then output their locations as numerical values along with a textual tag identifying these areas as faces. With information such as this, a computer is then more readily able to carry out a search or other process involving the contents of such data.
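The image example above can be made concrete with a tiny sketch: a grayscale "image" is just nested lists of pixel intensities, and a simple threshold scan (standing in for real computer vision) turns those raw numbers into structured, searchable records of locations plus a textual tag:

```python
def bright_regions(image, threshold=200):
    """Scan a tiny grayscale image (rows of pixel intensities, 0-255)
    and report bright-pixel locations with a textual tag, converting
    raw unstructured numbers into structured, searchable records."""
    hits = []
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            if value >= threshold:
                hits.append({"x": x, "y": y, "tag": "bright"})
    return hits

image = [
    [10, 12, 250],
    [11, 240, 13],
]
print(bright_regions(image))
# → [{'x': 2, 'y': 0, 'tag': 'bright'}, {'x': 1, 'y': 1, 'tag': 'bright'}]
```

A real face detector is vastly more sophisticated, but its output has exactly this character: numerical locations plus a tag, which is what makes the image's contents usable by downstream searches and processes.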