
The method described in McHenry 2009 (pdf) does not scale because it:

  • requires a dataset exemplifying the data distribution of an archive,
  • requires a dataset made up of file types that can be directly opened for the before-and-after comparison,
  • and requires significant computation to fill in the weights of the I/O-graph.

We propose an alternative approach that can accommodate an unknown sample set and whose computation can be distributed over time (i.e., spread across incoming job requests).  The approach works as follows (a code sketch follows the list):

  • Keep a registry of file types that can be directly opened by the comparison tool(s).
  • For each job request converting from format A to format B:
    • Find a format alpha that can be reached from A and is within the set of loadable formats.
    • Find a format beta that can be reached from B and is within the set of loadable formats.
    • If both alpha and beta exist, carry out the conversions from A to alpha and from B to beta.
    • Compare the files of type alpha and beta.  If the difference between alpha and beta is below some threshold, record this edge as a good edge within the I/O-graph.

The above algorithm assumes that it is HIGHLY UNLIKELY (proof required) for the conversions to alpha and beta to undo information loss incurred in the conversion from A to B.
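For concreteness, the following is a minimal Python sketch of the per-job procedure above.  The registry contents, conversion graph, threshold value, and the convert()/difference() helpers are illustrative assumptions, not an existing API.

```python
# Sketch of the per-job edge validation described above.
# All names here (registry entries, CONVERSIONS, convert(), difference(),
# LOSS_THRESHOLD) are illustrative assumptions.

# Registry of formats the comparison tool(s) can open directly.
LOADABLE_FORMATS = {"png", "tiff", "obj"}  # illustrative entries

# Conversion graph: format -> formats reachable in one conversion step.
CONVERSIONS = {
    "dwg": {"dxf", "obj"},
    "dxf": {"obj"},
}

LOSS_THRESHOLD = 0.05  # assumed acceptable normalized difference


def reachable_loadable(fmt):
    """Breadth-first search for a loadable format reachable from fmt."""
    seen, frontier = {fmt}, [fmt]
    while frontier:
        current = frontier.pop(0)
        if current in LOADABLE_FORMATS:
            return current
        for nxt in CONVERSIONS.get(current, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return None  # no loadable format reachable


def record_good_edge(fmt_a, fmt_b, loss):
    """Record a validated edge in the I/O-graph (stubbed as a print)."""
    print(f"good edge: {fmt_a} -> {fmt_b} (difference {loss:.3f})")


def handle_job(file_a, fmt_a, fmt_b, convert, difference):
    """Run the requested A -> B conversion and, when possible, score the edge.

    convert(path, src_fmt, dst_fmt) -> path and
    difference(path1, path2) -> float in [0, 1] are supplied by the
    deployment; both signatures are assumptions for this sketch.
    """
    file_b = convert(file_a, fmt_a, fmt_b)   # the actual job request
    alpha = reachable_loadable(fmt_a)        # loadable proxy for A
    beta = reachable_loadable(fmt_b)         # loadable proxy for B
    if alpha is None or beta is None:
        return file_b, None                  # edge cannot be scored yet
    file_alpha = convert(file_a, fmt_a, alpha)
    file_beta = convert(file_b, fmt_b, beta)
    loss = difference(file_alpha, file_beta)
    if loss < LOSS_THRESHOLD:
        record_good_edge(fmt_a, fmt_b, loss)
    return file_b, loss
```

Note that when A (or B) is itself loadable, the search returns it immediately and the "conversion" to alpha (or beta) is a no-op, which matches the degenerate case of the algorithm.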

[Figure: alphabetaloss]

We implement this measure of information loss as follows:
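As an illustration of what the comparison step could look like for image formats, here is a minimal difference measure using Pillow; the RGB conversion, resize dimensions, and normalization are assumptions of the sketch, not a prescribed measure.

```python
# One possible difference() for the sketch above, assuming image content.
# The resizing and normalization policy are illustrative choices.

from PIL import Image


def difference(path1, path2, size=(256, 256)):
    """Mean absolute per-channel pixel difference, normalized to [0, 1]."""
    img1 = Image.open(path1).convert("RGB").resize(size)
    img2 = Image.open(path2).convert("RGB").resize(size)
    total = 0
    for p1, p2 in zip(img1.getdata(), img2.getdata()):
        total += sum(abs(a - b) for a, b in zip(p1, p2))
    channels = size[0] * size[1] * 3
    return total / (channels * 255.0)
```

A format-aware measure (e.g., a mesh comparison for 3D formats) could slot in behind the same difference(path1, path2) interface.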

 
