Planned Tool Implementations at UMD

This page describes the extractors and converters that we plan to implement next at the UMD iSchool. They reflect our priorities around curating mostly born-digital archival collections.

action	input data type; extensions; mime-type	output format / fields	use case	notes
Extract	unknown; .xml; text/xml, application/xml	schema; schema-type (DTD/XSD); is well-formed?; is schema retrievable?; is valid?	Meaningful search over XML files in the archives will hinge on the schema employed. By extracting the schema we can index it. This is analogous to file characterization within the XML world.
Extract	Geospatial Feature Data; .kml, .kmz, .shp; `application/vnd.google-earth.kml+xml`, `application/vnd.google-earth.kmz`, `application/octet-stream`	geospatial bounding box	Allows geospatial search and discovery of relevant archives.
Extract	HyperDocument; .html, .htm; text/html	title; hyperlinks (href, text)	Allows us to create an index of all of the web pages in a web archive of a site, such as that of a federal agency. The text of a link can be used to describe the page it references, becoming additional keywords for that page.	Lets us try PageRank scoring in archives.
Extract	Document; .txt, .pdf, .doc; text/plain, application/pdf, application/msword	content-based creation date; dates within content	Files that have passed through archival deposit workflows, or that were moved from computer to computer prior to deposit, often no longer carry reliable creation-date metadata. The algorithm would produce a best guess at a document's creation date, based on the various dates used in the text.
Extract	Scanned Document Image; *; image/*	confidence that image is a scanned form; metrics for document layout recognition	A federal agency will have legacy paper records, and these are often scanned into digital form but rarely become useful as structured data records. Layout recognition metrics, such as the offsets of horizontal and vertical lines, can be used to classify images as depicting a particular type of paper form. These might be tax forms, census forms, or any kind of routine paper record from the pre/post-digital era. Recognizing the layout lets you apply a template that identifies document regions for OCR processing.
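
To make the HyperDocument row in the table above more concrete, here is a minimal sketch of what such an extractor might look like. It is not the planned implementation; it only illustrates the two output fields listed there (title, and hyperlinks as href/text pairs) using Python's standard `html.parser` module. The names `TitleAndLinkExtractor` and `extract_hyperdocument` are hypothetical and chosen for this example.

```python
from html.parser import HTMLParser


class TitleAndLinkExtractor(HTMLParser):
    """Collect the page title and an (href, text) pair for each <a href=...>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title = ""
        self.links = []        # list of (href, link text) tuples
        self._in_title = False
        self._href = None      # href of the <a> element currently open, if any
        self._text = []        # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:           # skip named anchors and empty hrefs
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a" and self._href is not None:
            # Collapse runs of whitespace so the link text is usable as keywords.
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def extract_hyperdocument(html_text):
    """Return {'title': str, 'links': [(href, text), ...]} for one HTML page."""
    parser = TitleAndLinkExtractor()
    parser.feed(html_text)
    parser.close()
    return {"title": parser.title.strip(), "links": parser.links}
```

Calling `extract_hyperdocument(page_source)` on each page of a web archive would yield the link graph (from the hrefs) and the candidate keywords (from the link text) that the use case describes. Resolving relative hrefs against the page URL is left out of this sketch.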
