This is a description of the extractors and converters that we plan to implement next at the UMD iSchool. These reflect priorities around curating mostly born-digital archival collections.

Columns: tool, action, input (data type, extensions, mime-type), output (format / fields), use case, effort, notes.
Tool: Creation date. Action: Extract.
  • Document
  • .txt, .pdf, .doc
  • text/plain, application/pdf, application/msword
  • content-based creation date
  • mean date
  • dates found
Often, files that have passed through archival deposit workflows, or have been moved from computer to computer prior to deposit, no longer have reliable creation-date metadata. The algorithm would produce a best guess at a document's creation date, based on the various dates used in the text.
Effort: Low

Concerned this won't work very often...

That is certainly a concern for the creation date. An extract of the content dates would still be useful for indexing purposes.

Look at Bill Underwood's work.
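
A minimal sketch of the "dates found" and "mean date" outputs, assuming only three date spellings (ISO, US numeric with the month first, and "Month D, YYYY"). The function names and patterns are illustrative assumptions, not an existing extractor; real documents would need many more formats.

```python
import re
from datetime import date

# Assumed spellings: 2004-03-15, 3/15/2004 (month first), and "March 15, 2004".
MONTHS = {name: i for i, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}

ISO = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")
US = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")
LONG = re.compile(r"\b([A-Za-z]+) (\d{1,2}), (\d{4})\b")


def find_dates(text):
    """Return every date found in the text (ISO first, then US, then long form)."""
    found = []
    for y, m, d in ISO.findall(text):
        found.append((int(y), int(m), int(d)))
    for m, d, y in US.findall(text):
        found.append((int(y), int(m), int(d)))
    for name, d, y in LONG.findall(text):
        if name.lower() in MONTHS:
            found.append((int(y), MONTHS[name.lower()], int(d)))
    dates = []
    for y, m, d in found:
        try:
            dates.append(date(y, m, d))
        except ValueError:
            pass  # skip impossible dates such as 2004-02-31
    return dates


def mean_date(dates):
    """Average the dates via their ordinals; None when no dates were found."""
    if not dates:
        return None
    return date.fromordinal(round(sum(d.toordinal() for d in dates) / len(dates)))
```

Even when the mean is a poor guess at the creation date, the list returned by find_dates is what would feed the index.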

Provenance... As a result duplicates and/or revised versions of documents are present within an archive.  A Versus signature based on the File2Learn code will allow us to compare documents to see how similar they are based on text, images, and vector graphics.  It can potentially be used to reconstruct the order in which the documents were edited.
Effort: Medium


Would need to get the File2Learn code from Rob Kooper.  It was in our SVN repo, I believe.  Take a look at the TACC paper on paragraph alignment and clustering.
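
Until the File2Learn code is in hand, a rough stand-in for the text part of the comparison might look like the sketch below. It is not the Versus/File2Learn signature: it swaps in Jaccard similarity over word shingles, ignores images and vector graphics, and the function names are assumptions.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles in the text (lowercased)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b, k=3):
    """Jaccard similarity of two texts' shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two empty documents are treated as identical
    return len(sa & sb) / len(sa | sb)
```

Exact duplicates score 1.0 and lightly revised versions score somewhere in between, which would be enough to cluster candidate versions before attempting to order them.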

Tool: Access. Action: Convert.
  • WP, WPD
  • MS Word
There is no converter to move documents out of these legacy WordPerfect formats. Once they can be converted to Word or to plain text, they will be accessible in many forms. Roughly 4000 documents in CI-BER match this description so far.
Effort: Low

We can use an older Windows VM to walk these files to Word.

Tool: Locations. Action: Extract.
  • Geospatial Feature Data
  • .kml, .kmz, .shp
  • application/vnd.google-earth.kml+xml,
    application/vnd.google-earth.kmz, application/octet-stream
  • geospatial bounding box
Allows geospatial search and discovery of relevant archives.
Effort: Medium
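
For the .kml case, the bounding box could be computed roughly as follows. This is a sketch using only the Python standard library; the function name is an assumption, and .kmz (zipped KML) and .shp inputs would need unzipping or a Shapefile reader first. It relies on KML coordinates being written as lon,lat[,alt] tuples separated by whitespace.

```python
import xml.etree.ElementTree as ET


def kml_bounding_box(kml_text):
    """Return (min_lon, min_lat, max_lon, max_lat), or None if no coordinates."""
    root = ET.fromstring(kml_text)
    lons, lats = [], []
    for elem in root.iter():
        # Match <coordinates> regardless of which KML namespace is in use.
        if elem.tag.rsplit("}", 1)[-1] != "coordinates" or not elem.text:
            continue
        for tup in elem.text.split():
            parts = tup.split(",")
            if len(parts) >= 2:
                lons.append(float(parts[0]))
                lats.append(float(parts[1]))
    if not lons:
        return None
    return (min(lons), min(lats), max(lons), max(lats))
```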
Tool: File Dependencies. Action: Extract.
  • shp
  • mdb
  • ...

Determine relationships/dependencies between files that contain one or more records. For example, it would be nice to know that a particular .dbf file was part of a Shapefile and should not automatically be converted using a tool like SIARD to keep the Shapefile usable. As another example, it would be helpful to know what .mdx or .idx files were associated with a particular .mdb file. Another example would be to know the relationships between all of the files that make up a particular website to ensure you capture everything you want to capture.
Effort: Low
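
The Shapefile case could start from something as simple as grouping files by directory and base name, as in this sketch. The function name and the sidecar extension list are assumptions (the list is not exhaustive), and database or website dependencies would need their own rules.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Sidecar extensions commonly found next to a .shp file (assumed, not exhaustive).
SHAPEFILE_PARTS = {".shp", ".shx", ".dbf", ".prj", ".cpg", ".sbn", ".sbx"}


def shapefile_groups(paths):
    """Map each .shp path to the sibling files sharing its directory and base name."""
    by_stem = defaultdict(list)
    for p in map(PurePosixPath, paths):
        if p.suffix.lower() in SHAPEFILE_PARTS:
            by_stem[(p.parent, p.stem.lower())].append(p)
    groups = {}
    for members in by_stem.values():
        shp = [m for m in members if m.suffix.lower() == ".shp"]
        if shp:  # only groups that actually contain a .shp count as Shapefiles
            groups[str(shp[0])] = sorted(str(m) for m in members if m != shp[0])
    return groups
```

A .dbf that appears in one of these groups would be held back from SIARD conversion, while a .dbf with no matching .shp would remain a candidate.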
Tool: Collection Summary. Action: Extract.

Help in getting the big picture across all our holdings. This shows up in places like helping patrons to find content across our holdings that is relevant to their queries, or trying to identify content that should not be released because of national security, personal privacy, etc. Having tools that could summarize/cluster the content without the archivists having to read every single page would be very helpful. If the results could be presented as a visualization or in some other form so that the user could quickly absorb the information, that would be even better.
Effort: Medium
Tool: XML indexing. Action: Extract.
  • unknown
  • .xml
  • text/xml, application/xml
  • schema
  • schema-type (DTD/XSD)
  • is well-formed?
  • is schema retrievable?
  • is valid?
Meaningful search over XML files in the archives will hinge on the schema employed. By extracting the schema, we can index it. This is analogous to file characterization within the XML world.
Effort: Medium
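
A first pass at the "is well-formed?", "schema" and "schema-type" fields might look like the sketch below, using only the Python standard library. The function name and the returned keys are assumptions. It does not fetch the schema or validate against it, so "is schema retrievable?" and "is valid?" are left out, and the DOCTYPE pattern only handles double-quoted identifiers.

```python
import re
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"
# Captures the system identifier of an external DTD, if one is declared.
DOCTYPE = re.compile(
    r'<!DOCTYPE\s+[^\s>\[]+(?:\s+(?:SYSTEM|PUBLIC\s+"[^"]*")\s+"([^"]*)")?')


def characterize_xml(xml_text):
    """Report well-formedness and any DTD or XSD reference in the document."""
    info = {"well_formed": False, "schema_type": None, "schema": None}
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return info
    info["well_formed"] = True
    location = (root.get("{%s}schemaLocation" % XSI)
                or root.get("{%s}noNamespaceSchemaLocation" % XSI))
    doctype = DOCTYPE.search(xml_text)
    if location:
        info["schema_type"] = "XSD"
        # schemaLocation holds namespace/location pairs; keep the last location.
        info["schema"] = location.split()[-1]
    elif doctype:
        info["schema_type"] = "DTD"
        info["schema"] = doctype.group(1)  # None for an internal-only DTD
    return info
```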
Tool: Web links. Action: Extract.
  • HyperDocument
  • .html, .htm
  • text/html
  • title
  • hyperlinks (href, text)
Allows us to create an index of all of the web pages in a web archive of a site for a federal agency, etc. The text of links can be used to describe the page referenced, becoming additional keywords.
Effort: Low
Notes: Lets us try PageRank scoring in archives.
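
The title and hyperlinks (href, text) outputs can be sketched with the standard library's html.parser. The class and function names here are assumptions, and nested or unclosed anchor tags in messy archived HTML are not handled.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the page title and (href, link text) pairs."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []        # list of (href, link text)
        self._in_title = False
        self._href = None      # href of the <a> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href, self._text = href, []

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a" and self._href is not None:
            # Collapse whitespace so the link text is usable as keywords.
            self.links.append((self._href, " ".join("".join(self._text).split())))
            self._href = None


def extract_links(html):
    """Return (title, [(href, text), ...]) for one HTML page."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.links
```

The (href, text) pairs double as the edge list that a link-based score would need.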
Tool: Digitized documents. Action: Extract.
  • Scanned Document Image
  • *
  • image/*
  • confidence that image is a scanned form
  • metrics for document layout recognition
A federal agency will have legacy paper records, and these are often scanned into digital form but rarely become useful as structured data records. Layout recognition metrics, such as the offsets of horizontal and vertical lines, can be used to classify images as depicting a particular type of paper form. These might be tax forms, census forms, or any kind of routine paper record from the pre/post-digital era. Recognition of the layout lets you apply a template that identifies document regions for OCR processing.
Effort: Medium/High

Would probably depend on the layout of the document, i.e. would need a template for each type?  Gregory Jansen, is there a type of scanned document we might start with?  Sandeep Puthanveetil Satheesan, I believe we could throw in the Census forms here (as they are in CI-BER).  Any others though?

Sandeep: Yes, I can only think of those as well.

Greg: Any form with boxes would work. I have some other forms from WWII that are interesting to us at UMD.

Tool: Map Recognizer. Action: Extract.
  • Image
  • image/*
  • Identify when an image is a map (historical or modern)
  • Identify the geographic area depicted
Seems like a very useful tool for building geographic indexes over historical collections or government reports that include maps.
Effort: High
Notes: I have only the slightest idea of how this would work. Shape recognition on edges of some kind. Recognizing particular areas would require an index of known areas...
Tool: Compelling document thumbnails. Action: Extract.
  • Document
  • application/pdf
  • page thumbnail that includes graphic or photo
Scholarly and mixed-use digital repositories often generate document thumbnails that show the front page, which is usually devoid of images and boring. This extractor will generate an image for the most colorful page in a multi-page document, falling back to the front page strategy.
Effort: Medium

A colleague implemented this as a one-off algorithm for the Carolina Digital Repository's "peek at the repository" feature: https://cdr.lib.unc.edu/#p

The result is much more interesting than the usual thumbnails, for almost any kind of document.

Prior art here: https://github.com/UNC-Libraries/peek-data
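
One way to pick the "most colorful" page, sketched under assumptions: the pages have already been rasterized elsewhere into lists of (r, g, b) pixels, colorfulness is approximated by mean HSV saturation, and the 0.05 fallback threshold is an arbitrary placeholder. This is not the Carolina Digital Repository algorithm, whose details are not described here.

```python
import colorsys


def mean_saturation(pixels):
    """Average HSV saturation of an iterable of (r, g, b) tuples in 0..255."""
    pixels = list(pixels)
    if not pixels:
        return 0.0
    total = sum(colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)[1]
                for r, g, b in pixels)
    return total / len(pixels)


def pick_thumbnail_page(pages, min_saturation=0.05):
    """Return the index of the most colorful page, or 0 if none is colorful enough."""
    if not pages:
        return None
    scores = [mean_saturation(p) for p in pages]
    best = max(range(len(scores)), key=scores.__getitem__)
    # Fall back to the front page when every page is essentially grayscale.
    return best if scores[best] >= min_saturation else 0
```

Scoring a downsampled copy of each page would keep this cheap for long documents.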

Tool: Access. Action: Convert.
  • Proprietary Databases
  • MDB, DB, DBF
  • SIARD Software-Independent Archiving of Relational Databases
2378 MDB (MS Access) files in CI-BER with no converter. 2013 DB files (Paradox / XTreeGold / dbvista / Oracle / XoftSpySE). 432802 DBF files.
Ensure that we have the means to access the database tables in all the CI-BER collections. Instrumenting SIARD migration will enable a vendor neutral access format and offer advantages for archives implementing pro-active database migration for long term access.
Effort: Medium
Tool: Access. Action: Convert.
  • Adobe Photoshop Images
  • PSD
  • TIFF (or similar)
PSD files are common in born-digital archives, but currently they have no conversion path to standard image formats.
Effort: Low
Notes: AutoHotkey for a Windows VM? Can we put it in a Docker container somehow?