This is a description of the extractors and converters that we plan to implement next at the UMD iSchool. They reflect our priorities around curating mostly born-digital archival collections.

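Each entry below lists, in order:

  • tool (action)
  • input: data type, extensions, mime-type (the first three bullets of the entry)
  • output: format / fields (the remaining bullets)
  • use case
  • effort
  • notes
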
XML indexing (Extract)
  • unknown
  • .xml
  • text/xml, application/xml
  • schema
  • schema-type (DTD/XSD)
  • is well-formed?
  • is schema retrievable?
  • is valid?
Meaningful search over XML files in the archives will hinge on the schema employed. By extracting the schema we can index it. This is analogous to file characterization within the XML world.
Effort: ??
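As a rough illustration of the first three output fields (schema, schema-type, well-formedness), a minimal sketch using only the Python standard library might look like the following. The function name `describe_xml` and the returned dictionary keys are hypothetical, chosen for this example. Checking whether the schema is retrievable and whether the document is valid would need network access and a validating parser, and is not attempted here.

```python
import re
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"

# Simplified pattern for a DOCTYPE with a double-quoted system identifier.
# Assumption: single-quoted identifiers and internal subsets are ignored.
DOCTYPE_RE = re.compile(
    r'<!DOCTYPE\s+\S+\s+(?:SYSTEM|PUBLIC\s+"[^"]*")\s+"([^"]*)"'
)


def describe_xml(text):
    """Report well-formedness and any DTD or XSD reference found in the text."""
    info = {"well_formed": False, "schema_type": None, "schema": None}
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return info  # not well-formed, nothing more to extract
    info["well_formed"] = True

    match = DOCTYPE_RE.search(text)
    if match:
        info["schema_type"] = "DTD"
        info["schema"] = match.group(1)
        return info

    location = root.get("{%s}schemaLocation" % XSI)
    no_ns_location = root.get("{%s}noNamespaceSchemaLocation" % XSI)
    if location:
        # schemaLocation holds namespace/URL pairs; report the first URL.
        parts = location.split()
        info["schema_type"] = "XSD"
        info["schema"] = parts[1] if len(parts) > 1 else parts[0]
    elif no_ns_location:
        info["schema_type"] = "XSD"
        info["schema"] = no_ns_location
    return info
```

The schema identifier returned here is the string an index would store; grouping files by that string is what would make schema-aware search possible.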
Locations (Extract)
  • Geospatial Feature Data
  • .kml, .kmz, .shp
  • application/vnd.google-earth.kml+xml,
    application/vnd.google-earth.kmz, application/octet-stream
  • geospatial bounding box
Allows geospatial search and discovery of relevant archives.
Effort: Medium
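For the KML case, a bounding box can be sketched with the standard library alone, as below. The function name `kml_bounding_box` is hypothetical. The sketch assumes uncompressed KML text (a .kmz would first need to be unzipped, and .shp files need a different reader) and it does not handle features that cross the antimeridian.

```python
import xml.etree.ElementTree as ET


def kml_bounding_box(kml_text):
    """Return (min_lon, min_lat, max_lon, max_lat), or None if no coordinates."""
    root = ET.fromstring(kml_text)
    lons, lats = [], []
    for element in root.iter():
        # Compare the local tag name so any KML namespace version matches.
        if element.tag.rsplit("}", 1)[-1] != "coordinates" or not element.text:
            continue
        # KML coordinates are whitespace-separated "lon,lat[,alt]" tuples.
        for tuple_text in element.text.split():
            parts = tuple_text.split(",")
            lons.append(float(parts[0]))
            lats.append(float(parts[1]))
    if not lons:
        return None
    return (min(lons), min(lats), max(lons), max(lats))
```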
Web links (Extract)
  • HyperDocument
  • .html, .htm
  • text/html
  • title
  • hyperlinks (href, text)
Allows us to create an index of all of the web pages in a web archive of a site, for example one for a federal agency. The text of links can be used to describe the page referenced, becoming additional keywords.
Effort: Low
Notes: Lets us try PageRank scoring in archives.
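The two output fields (title and hyperlinks with their text) could be pulled out with Python's built-in `html.parser`, roughly as sketched here. The class name `LinkExtractor` and the helper `extract_links` are hypothetical names for this example, and the sketch does not resolve relative URLs or handle nested anchors.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the page title and (href, link text) pairs."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []        # list of (href, text) tuples
        self._in_title = False
        self._href = None      # href of the <a> currently open, if any
        self._text = []        # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href, self._text = href, []

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a" and self._href is not None:
            # Collapse runs of whitespace in the link text.
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.links
```

The (href, text) pairs double as the edge list of a link graph, which is the input a PageRank-style scoring experiment would need.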
Creation date (Extract)
  • Document
  • .txt, .pdf, .doc
  • text/plain, application/pdf, application/msword
  • content-based creation date
  • dates found
Files that have passed through archival deposit workflows, or that were moved from computer to computer prior to deposit, often no longer carry reliable creation-date metadata. The algorithm would produce a best guess at a document's creation date, based on the various dates used in the text.
Effort: Low
Notes: Concerned this won't work very often...
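One possible shape for this extractor is sketched below. It recognises only two date formats (ISO `YYYY-MM-DD` and English long form such as "March 3, 1998"), and the "best guess" rule is an assumption made for this example, not a decided design: it picks the latest date mentioned, on the reasoning that a document rarely cites dates after it was written. The names `find_dates` and `guess_creation_date` are hypothetical.

```python
import re
from datetime import date

MONTH_NAMES = ["January", "February", "March", "April", "May", "June", "July",
               "August", "September", "October", "November", "December"]

ISO_RE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")
LONG_RE = re.compile(r"\b(%s)\s+(\d{1,2}),\s+(\d{4})\b" % "|".join(MONTH_NAMES))


def find_dates(text):
    """Return all recognisable dates in the text, sorted oldest first."""
    found = []
    for year, month, day in ISO_RE.findall(text):
        try:
            found.append(date(int(year), int(month), int(day)))
        except ValueError:
            pass  # looks like a date but is not a real calendar day
    for name, day, year in LONG_RE.findall(text):
        try:
            found.append(date(int(year), MONTH_NAMES.index(name) + 1, int(day)))
        except ValueError:
            pass
    return sorted(found)


def guess_creation_date(text):
    """Assumed heuristic: the latest date mentioned is the best guess."""
    dates = find_dates(text)
    return dates[-1] if dates else None
```

Returning both the full list and the guess matches the two output fields above ("dates found" and "content-based creation date"), so a reviewer can see what the guess was based on.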
Provenance... As a result, duplicates and/or revised versions of documents are present within an archive.  A Versus signature based on the File2Learn code will allow us to compare documents to see how similar they are based on text, images, and vector graphics.  It can potentially be used to reconstruct the order in which the documents were edited.
Effort: Medium


Would need to get the File2Learn code from Rob Kooper.  It was in our SVN repo, I believe...
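Until the File2Learn code is available, the text part of the comparison can be illustrated with a stand-in technique. The sketch below is not the Versus or File2Learn signature; it is a plain Jaccard similarity over word shingles, covers text only (no images or vector graphics), and uses the hypothetical names `shingles` and `similarity`.

```python
def shingles(text, k=3):
    """Return the set of k-word sequences (shingles) in the text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def similarity(text_a, text_b, k=3):
    """Jaccard similarity of the two shingle sets, from 0.0 to 1.0."""
    set_a, set_b = shingles(text_a, k), shingles(text_b, k)
    if not set_a and not set_b:
        return 1.0  # two empty documents are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)
```

A score of 1.0 flags exact textual duplicates, while high but imperfect scores point to likely revisions of the same document.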

Digitized documents (Extract)
  • Scanned Document Image
  • *
  • image/*
  • confidence that image is a scanned form
  • metrics for document layout recognition
A federal agency will have legacy paper records. These are often scanned into digital form, but they rarely become useful as structured data records. Layout recognition metrics, such as the offsets of horizontal and vertical lines, can be used to classify images as depicting a particular type of paper form. These might be tax forms, census forms, or any kind of routine paper record from the pre/post-digital era. Recognition of the layout lets you apply a template that identifies document regions for OCR processing.
Effort: Medium/High

Would probably depend on the layout of the document, i.e. would need a template for each type?  Gregory Jansen is there a type of scanned document we might start with?  Sandeep Puthanveetil Satheesan I believe we could throw in the Census forms here (as they are in CIBER).  Any others though?

Sandeep: Yes, I can only think of those as well.
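To make the "offsets of horizontal and vertical lines" metric concrete, here is a toy sketch. It assumes the scan has already been binarised into a list of rows of 0/1 values (1 = dark pixel), which a real pipeline would get from an imaging library. The function name `line_offsets` and the `min_fill` threshold are hypothetical choices for this example.

```python
def line_offsets(pixels, min_fill=0.8):
    """Return (rows, cols): indices of rows and columns that are mostly dark.

    pixels is a list of equal-length rows of 0/1 values. A row or column
    counts as a ruled line when at least min_fill of its pixels are dark.
    """
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    if not height or not width:
        return [], []
    rows = [y for y, row in enumerate(pixels) if sum(row) / width >= min_fill]
    cols = [
        x for x in range(width)
        if sum(pixels[y][x] for y in range(height)) / height >= min_fill
    ]
    return rows, cols
```

Comparing these offsets against the known offsets of a form template, such as a census form, is one way a per-type template match could be scored.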


Converter
  • Project management documents
  • MPP, MPX
  • application/<various>
  • project name
  • team names
  • dates
  • keywords
The archive is filled with files that are not easily examined.  To help with this we will create converters from these difficult formats to formats that are more readily openable, either on one's own machine or, better yet, over the web.
Effort: Low

Probably have to use the original or modern project management software, running in a VM, or else parse an XML file (within a ZIP)

Access (Converter)
  • ??
The archive is filled with files that are not easily examined.  To help with this we will create converters from these difficult formats to formats that are more readily openable, either on one's own machine or, better yet, over the web.
Effort: Low

Gregory Jansen can you put a few here?
