
This page describes the extractors and converters that we plan to implement next at the UMD iSchool. These reflect priorities around curating mostly born-digital archival collections.

action

input

  • data type
  • extensions
  • mime-type

output

format / fields

use case

notes
Extract
  • unknown
  • .xml
  • text/xml, application/xml

schema

schema-type (DTD/XSD)

well-formed?

schema retrievable?

valid?

Meaningful search over XML files in the archives will hinge on the schema employed. By extracting the schema, we can index on it. This is analogous to file characterization within the XML world.
Extract
  • Geospatial Feature Data
  • .kml, .kmz
  • application/vnd.google-earth.kml+xml,
    application/vnd.google-earth.kmz

geospatial bounding box

Allows geospatial search and discovery of relevant archives.
Extract
  • Geospatial Feature Data
  • .shp, .shx

geospatial bounding box

Allows geospatial search and discovery of relevant archives.
Extract
  • HyperDocument
  • .html, .htm

title

hyperlinks (href, text)

Allows us to create an index of all of the web pages in a web archive of a site, for example a federal agency's site. The text of links can be used to describe the page referenced, becoming additional keywords. Lets us try PageRank scoring in archives.
Extract
  • Document
  • .txt, ??
  • text/plain

content based creation date

dates in content

Often, files that have been moved through archival deposit workflows, or from computer to computer prior to deposit, no longer have reliable metadata on the creation date of a document. The algorithm would produce a best guess at a document's creation date, based on the various dates used in its text.
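
A rough sketch of how the "dates in content" and "content based creation date" fields might be produced follows. The page does not specify the algorithm, so everything here is an assumption: the two date formats recognized (ISO `YYYY-MM-DD` and "Month D, YYYY"), the function names, and above all the heuristic of taking the latest date mentioned as the best guess. A real extractor would probably weigh date positions and frequencies as well.

```python
import re
from datetime import date

# Hypothetical lookup table: English month names to month numbers.
MONTHS = {
    name: number
    for number, name in enumerate(
        ["january", "february", "march", "april", "may", "june", "july",
         "august", "september", "october", "november", "december"],
        start=1,
    )
}

ISO_RE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")
LONG_RE = re.compile(r"\b([A-Z][a-z]+) (\d{1,2}), (\d{4})\b")


def dates_in_content(text):
    """Return every recognizable date in the text, sorted oldest first."""
    found = []
    for year, month, day in ISO_RE.findall(text):
        try:
            found.append(date(int(year), int(month), int(day)))
        except ValueError:
            pass  # looks like a date but is not one, e.g. month 13
    for month_name, day, year in LONG_RE.findall(text):
        month = MONTHS.get(month_name.lower())
        if month is None:
            continue  # capitalized word that is not a month name
        try:
            found.append(date(int(year), month, int(day)))
        except ValueError:
            pass
    return sorted(found)


def guess_creation_date(text):
    """Assumed heuristic: the latest date mentioned is the best guess."""
    found = dates_in_content(text)
    return found[-1] if found else None
```

The latest-date heuristic treats the text as written no earlier than the most recent date it mentions. This fails for documents that refer to future events, so the result should be stored as a guess rather than as authoritative metadata.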


Proposing that UMD build some of the following extractors. Let me know what you think.

XML - is it well-formed? what is the schema/DTD? is it valid?
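
As an illustration of the XML checks above, here is a minimal sketch using only the Python standard library. It reports whether a document is well-formed and which schema it points to (a DTD named in a DOCTYPE, or an XSD named through the `xsi:schemaLocation` or `xsi:noNamespaceSchemaLocation` attributes). The function name and result keys are hypothetical, the DOCTYPE regular expression is a simplification that only handles double-quoted identifiers, and validation and schema retrieval are left out because they need a validating parser and network access.

```python
import re
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"

# Simplified pattern: captures the system identifier of a DOCTYPE declaration.
DOCTYPE_RE = re.compile(
    r'<!DOCTYPE\s+\S+\s+(?:SYSTEM|PUBLIC\s+"[^"]*")\s+"([^"]*)"'
)


def characterize_xml(text):
    """Return well-formedness plus the schema type and location, if any."""
    result = {"well_formed": False, "schema_type": None, "schema_location": None}
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return result  # not well-formed, so nothing else can be extracted
    result["well_formed"] = True

    match = DOCTYPE_RE.search(text)
    if match:
        result["schema_type"] = "DTD"
        result["schema_location"] = match.group(1)
        return result

    location = (
        root.get(f"{{{XSI}}}schemaLocation")
        or root.get(f"{{{XSI}}}noNamespaceSchemaLocation")
    )
    if location:
        result["schema_type"] = "XSD"
        # schemaLocation holds namespace/location pairs; keep the last location.
        result["schema_location"] = location.split()[-1]
    return result
```

For example, `characterize_xml('<!DOCTYPE note SYSTEM "note.dtd"><note/>')` reports a well-formed document with a DTD located at `note.dtd`.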

KML - basic geographic stuff (bounding box?) and embedded metadata fields
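
One possible way to get a KML bounding box is to collect every `coordinates` element and take the minimum and maximum longitude and latitude. The sketch below assumes that approach. The function name is hypothetical, and it does not handle KMZ (which would first need to be unzipped) or the embedded metadata fields.

```python
import xml.etree.ElementTree as ET


def kml_bounding_box(kml_text):
    """Return (min_lon, min_lat, max_lon, max_lat), or None if no coordinates."""
    root = ET.fromstring(kml_text)
    lons, lats = [], []
    for elem in root.iter():
        # Compare the local tag name so any KML namespace version is accepted.
        if elem.tag.rsplit("}", 1)[-1] != "coordinates" or not elem.text:
            continue
        # KML coordinate tuples are "lon,lat[,alt]" separated by whitespace.
        for tuple_text in elem.text.split():
            parts = tuple_text.split(",")
            if len(parts) >= 2:
                lons.append(float(parts[0]))
                lats.append(float(parts[1]))
    if not lons:
        return None
    return (min(lons), min(lats), max(lons), max(lats))
```

This simple min/max gives a misleading box for features that cross the antimeridian, which a production extractor would need to treat separately.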

SHP/SHX (QGIS) - see above
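
For shapefiles the bounding box does not have to be computed, because the fixed 100-byte header of a `.shp` (and `.shx`) file already stores it. The sketch below reads it with the standard `struct` module, following the published ESRI header layout: file code 9994 as a big-endian integer at byte 0, then Xmin, Ymin, Xmax, Ymax as little-endian doubles at bytes 36 to 67. The function name is hypothetical, and the coordinate reference system (kept in a separate `.prj` file) is ignored.

```python
import struct


def shp_bounding_box(header_bytes):
    """Return (xmin, ymin, xmax, ymax) from the first 100 bytes of a .shp file."""
    if len(header_bytes) < 100:
        raise ValueError("a shapefile header is 100 bytes long")
    (file_code,) = struct.unpack(">i", header_bytes[0:4])
    if file_code != 9994:
        raise ValueError("not a shapefile: unexpected file code")
    # Four little-endian doubles: Xmin, Ymin, Xmax, Ymax.
    return struct.unpack("<4d", header_bytes[36:68])
```

A caller would typically pass the result of reading 100 bytes from a file opened in binary mode. The values are in whatever coordinate system the dataset uses, so they are only directly comparable to a KML bounding box when the shapefile is in longitude/latitude.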

HTML - page title, link text and hrefs
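
The HTML fields named above (page title, link text, and hrefs) could be pulled out with the standard library's `html.parser`, as in this minimal sketch. The class and function names are hypothetical, nested anchors are not handled, and relative hrefs are returned as written rather than resolved against a base URL.

```python
from html.parser import HTMLParser


class TitleAndLinkParser(HTMLParser):
    """Collects the page title and (href, text) pairs for each anchor."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []          # list of (href, link text) tuples
        self._in_title = False
        self._href = None        # href of the anchor currently open, if any
        self._text = []          # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_title_and_links(html_text):
    """Return (title, [(href, text), ...]) for one HTML document."""
    parser = TitleAndLinkParser()
    parser.feed(html_text)
    parser.close()
    return parser.title.strip(), parser.links
```

The (href, text) pairs are the raw material for both uses mentioned in the table: link text as extra keywords for the target page, and the link graph needed for any PageRank-style scoring.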
