
As part of the National Science Foundation's data efforts, Brown Dog aims to provide and preserve long-term access to data within collections of unstructured and uncurated files.  Two services will be developed.  The Data Access Proxy (DAP) is an extensible and distributed file format conversion service that aims to make the web agnostic to the formats in which information is stored, allowing one to easily access the contents of a file.  The Data Tilling Service (DTS) will serve as an active repository for analysis tools and provide a service that automatically extracts metadata and content-based signatures from files over the web.  Together these two services stand as a building block between raw data collections and the applications that would search, organize, relate, and curate the contents of such collections.

As a graduate student on the project, your aim is first and foremost to carry out the research directed by your adviser, addressing a specific scientific question within your field that requires examining collections of unstructured and/or uncurated data.  Think of unstructured data as data types that don't involve text (e.g. images, video, audio, 3D models, etc.).  Images are a good example.  To a computer, an image is nothing more than an array of numbers representing pixel intensities or colors.  Though images are extremely informative to us as human beings, a computer can make use of them only after some form of pre-processing has been run on them.  An example would be to use computer vision to recognize faces within the image and then output their locations as numerical values, along with a textual tag identifying these areas as faces.  With information such as this, a computer is more readily able to carry out a search or other process involving the contents of such data.  With regard to uncurated data, think of a dump of some random hard drive.  Without meaningful file names and a meaningful directory structure, it will be difficult to find information on the drive without examining each and every file.  File formats, in particular old and/or proprietary file formats, hinder the situation further by making it difficult to open a given file unless the required software is installed on your machine.  Metadata is another way of providing insight into the contents of a file.  Consider a document tagged with the keywords "paper, large dynamic groups", indicating a paper submission for a social science study looking into the behavior of large groups of people.  Curated data is data that has been stored and diligently named, organized, and tagged so that others, both today and long in the future, can utilize it.  Uncurated data, on the other hand, has little of this and is essentially a big mess for others to go through.
A significant amount of digital data, if not most, is uncurated.  In the scientific world this is sometimes referred to as the "long tail".
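
To make the image example above concrete, here is a minimal sketch in Python.  It is not part of the DAP or DTS; the function name `find_bright_regions`, the tag `bright_region`, and the record layout are all hypothetical, and a simple brightness threshold stands in for a real computer vision step such as face detection.  It only illustrates the idea that an extractor turns an array of pixel values into numerical locations plus a textual tag that a computer can then search.

```python
# To the computer, a grayscale image is only a grid of pixel intensities (0-255).
image = [
    [12, 15, 200, 210],
    [10, 14, 205, 215],
    [11, 13, 20, 18],
    [9, 12, 19, 17],
]


def find_bright_regions(pixels, threshold=128):
    """Hypothetical stand-in for a real extractor (e.g. a face detector).

    Returns a list with one record: the bounding box of all pixels brighter
    than `threshold`, plus a textual tag. Returns an empty list if no pixel
    exceeds the threshold.
    """
    hits = [
        (r, c)
        for r, row in enumerate(pixels)
        for c, value in enumerate(row)
        if value > threshold
    ]
    if not hits:
        return []
    rows = [r for r, _ in hits]
    cols = [c for _, c in hits]
    return [
        {
            "tag": "bright_region",  # assumed tag name, for illustration only
            "top": min(rows),
            "left": min(cols),
            "bottom": max(rows),
            "right": max(cols),
        }
    ]


def search(records, tag):
    """Return the records carrying the given textual tag."""
    return [record for record in records if record["tag"] == tag]


metadata = find_bright_regions(image)
matches = search(metadata, "bright_region")
```

Once the extracted records exist, a search no longer needs to look at the pixels at all; it only has to compare tags and coordinates.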

As part of your work you will do one or both of the following.  First, you will develop your own analysis tools for the specific collections you will be looking at, aimed at the specific scientific questions you are addressing.  These tools will be built such that they can be included in the DTS for preservation, reproducibility of your results, and reuse by others in the scientific community (possibly for purposes completely different from the way in which you used them).  Second, you will build applications that use the DAP and/or DTS services to make use of the information within those collections.

Kenton McHenry

Brown Dog PI
