...
Java
The main class sets up the comparison. It specifies the two files to compare, the adapter that loads each file, the extractor that extracts a feature from the loaded file, and the measure that compares the two features. The comparison is then submitted to an ExecutionEngine, which reports the outcome through a ComparisonStatusHandler.
Code Block |
---|
language | java |
---|
title | Main |
---|
|
import java.io.File;
import java.util.UUID;
// Imports for the framework classes (PairwiseComparison, ExecutionEngine,
// ComparisonStatusHandler, ...) are omitted here.

// The class name Main is illustrative.
public class Main {
    public static void main(String[] args) {
        // Describe the comparison: the two files, plus the adapter,
        // extractor and measure to apply to them.
        PairwiseComparison comparison = new PairwiseComparison();
        comparison.setId(UUID.randomUUID().toString());
        comparison.setFirstDataset(new File("data/test1.txt"));
        comparison.setSecondDataset(new File("data/test2.txt"));
        comparison.setAdapterId(TextAdapter.class.getName());
        comparison.setExtractorId(TextHistogramExtractor.class.getName());
        comparison.setMeasureId(LabelHistogramEuclidianDistanceMeasure.class.getName());

        // Submit the comparison; the handler is called back with the outcome.
        ExecutionEngine ee = new ExecutionEngine();
        ee.submit(comparison, new ComparisonStatusHandler() {
            @Override
            public void onStarted() {
                System.out.println("STARTED : ");
            }

            @Override
            public void onFailed(String msg, Throwable e) {
                System.out.println("FAILED : " + msg);
                e.printStackTrace();
                // A non-zero exit code signals the failure to the caller.
                System.exit(1);
            }

            @Override
            public void onDone(double value) {
                System.out.println("DONE : " + value);
                System.exit(0);
            }

            @Override
            public void onAborted(String msg) {
                System.out.println("ABORTED : " + msg);
                System.exit(0);
            }
        });
    }
} |
The text adapter reads a text file, splits its contents into words, and returns a list of all words in the text. The words keep their original order, so the content of the file can still be read by going through the words in the order returned by getWords().
Code Block |
---|
language | java |
---|
title | Text Adapter |
---|
|
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
// Imports for the framework interfaces (FileLoader, HasText) are omitted here.

public class TextAdapter implements FileLoader, HasText {
    private File file;
    private List<String> words;

    public TextAdapter() {}

    // ----------------------------------------------------------------------
    // FileLoader
    // ----------------------------------------------------------------------
    @Override
    public void load(File file) {
        // Only remember the file; it is read lazily in getWords().
        this.file = file;
    }

    @Override
    public String getName() {
        return "Text Document";
    }

    @Override
    public List<String> getSupportedMediaTypes() {
        List<String> mediaTypes = new ArrayList<String>();
        mediaTypes.add("text/*");
        return mediaTypes;
    }

    // ----------------------------------------------------------------------
    // HasText
    // ----------------------------------------------------------------------
    @Override
    public List<String> getWords() {
        if (words == null) {
            words = new ArrayList<String>();
            // try-with-resources closes the reader even if reading fails
            try (BufferedReader br = new BufferedReader(new FileReader(file))) {
                String line;
                while ((line = br.readLine()) != null) {
                    // Split each line on single spaces, preserving word order.
                    String[] w = line.split(" ");
                    words.addAll(Arrays.asList(w));
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        return words;
    }
} |
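The splitting step can be tried in isolation. The following is a minimal, stand-alone sketch of it; the class WordSplitSketch and its method splitWords are hypothetical names used only for illustration and are not part of the framework. It assumes the same rule as the adapter above: each line is split on single spaces.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone sketch; not part of the framework.
public class WordSplitSketch {
    // Split every line on single spaces and collect the words in order.
    static List<String> splitWords(List<String> lines) {
        List<String> words = new ArrayList<>();
        for (String line : lines) {
            words.addAll(Arrays.asList(line.split(" ")));
        }
        return words;
    }

    public static void main(String[] args) {
        List<String> words = splitWords(Arrays.asList("the quick brown", "fox jumps"));
        System.out.println(words);                   // [the, quick, brown, fox, jumps]
        // Reading the words in order gives back the text (line breaks become spaces).
        System.out.println(String.join(" ", words)); // the quick brown fox jumps
    }
}
```

Because the split is on a single space, two consecutive spaces produce an empty string in the list, which would later be counted as a word of its own.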
The extractor takes the words returned by the adapter and counts the occurrences of each word. The result is a histogram of all words and how often each occurs in the text. The text itself can no longer be read from the histogram, because the information about the order of the words is lost.
Code Block |
---|
language | java |
---|
title | Text Histogram Extractor |
---|
|
import java.util.HashSet;
import java.util.Set;
// Imports for the framework types (Extractor, Adapter, Descriptor, HasText,
// LabelHistogramDescriptor, UnsupportedTypeException) are omitted here.

public class TextHistogramExtractor implements Extractor {
    @Override
    public Adapter newAdapter() {
        throw new RuntimeException("Not supported.");
    }

    @Override
    public String getName() {
        return "Text Histogram Extractor";
    }

    @Override
    public Set<Class<? extends Adapter>> supportedAdapters() {
        // Any adapter that can return words can be used with this extractor.
        Set<Class<? extends Adapter>> adapters = new HashSet<Class<? extends Adapter>>();
        adapters.add(HasText.class);
        return adapters;
    }

    @Override
    public Class<? extends Descriptor> getFeatureType() {
        return LabelHistogramDescriptor.class;
    }

    @Override
    public Descriptor extract(Adapter adapter) throws Exception {
        if (adapter instanceof HasText) {
            // One bin per distinct word; each occurrence increases its bin.
            LabelHistogramDescriptor desc = new LabelHistogramDescriptor();
            for (String word : ((HasText) adapter).getWords()) {
                desc.increaseBin(word);
            }
            return desc;
        } else {
            throw new UnsupportedTypeException();
        }
    }

    @Override
    public boolean hasPreview() {
        return false;
    }

    @Override
    public String previewName() {
        return null;
    }
} |
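To illustrate why the word order is lost, here is a minimal, stand-alone sketch of the counting step. It uses a plain Map in place of LabelHistogramDescriptor; the class WordHistogramSketch and its method histogram are hypothetical names used only for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-alone sketch; a Map stands in for LabelHistogramDescriptor.
public class WordHistogramSketch {
    // Count how often each word occurs. A TreeMap keeps the labels sorted
    // so the printed output is deterministic.
    static Map<String, Integer> histogram(List<String> words) {
        Map<String, Integer> bins = new TreeMap<>();
        for (String word : words) {
            bins.merge(word, 1, Integer::sum);
        }
        return bins;
    }

    public static void main(String[] args) {
        Map<String, Integer> a = histogram(Arrays.asList("to", "be", "or", "not", "to", "be"));
        Map<String, Integer> b = histogram(Arrays.asList("be", "to", "not", "or", "be", "to"));
        System.out.println(a);           // {be=2, not=1, or=1, to=2}
        // The same words in a different order give the same histogram.
        System.out.println(a.equals(b)); // true
    }
}
```

Two texts that contain the same words in a different order are indistinguishable once reduced to a histogram, so the measure can only compare word frequencies.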
To compare two texts we use the Euclidean distance between their two histograms. First we normalize each histogram, so that a large text can be compared with a small one. Next we compare the two histograms bin by bin. If a bin is missing from either histogram, it is assumed to have a value of 0.
Code Block |
---|
language | java |
---|
title | Euclidean Distance Measure |
---|
|
import java.util.HashSet;
import java.util.Set;
// Imports for the framework types (Measure, Descriptor, Similarity,
// SimilarityNumber, SimilarityPercentage, LabelHistogramDescriptor,
// UnsupportedTypeException) are omitted here.

public class LabelHistogramEuclidianDistanceMeasure implements Measure {
    @Override
    public SimilarityPercentage normalize(Similarity similarity) {
        return new SimilarityPercentage(1 - similarity.getValue());
    }

    @Override
    public String getFeatureType() {
        return LabelHistogramDescriptor.class.getName();
    }

    @Override
    public String getName() {
        return "Histogram Distance";
    }

    @Override
    public Class<LabelHistogramEuclidianDistanceMeasure> getType() {
        return LabelHistogramEuclidianDistanceMeasure.class;
    }

    // Euclidean distance between the two normalized histograms
    @Override
    public Similarity compare(Descriptor desc1, Descriptor desc2) throws Exception {
        if ((desc1 instanceof LabelHistogramDescriptor) && (desc2 instanceof LabelHistogramDescriptor)) {
            LabelHistogramDescriptor lhd1 = (LabelHistogramDescriptor) desc1;
            LabelHistogramDescriptor lhd2 = (LabelHistogramDescriptor) desc2;

            // get all possible labels
            Set<String> labels = new HashSet<String>();
            labels.addAll(lhd1.getLabels());
            labels.addAll(lhd2.getLabels());

            // normalize
            lhd1.normalize();
            lhd2.normalize();

            // compute distance; a bin missing from one histogram counts as 0
            double sum = 0;
            for (String s : labels) {
                Double b1 = lhd1.getBin(s);
                Double b2 = lhd2.getBin(s);
                if (b1 == null) {
                    sum += b2 * b2;
                } else if (b2 == null) {
                    sum += b1 * b1;
                } else {
                    sum += (b1 - b2) * (b1 - b2);
                }
            }
            return new SimilarityNumber(Math.sqrt(sum), 0, 1, 0);
        } else {
            throw new UnsupportedTypeException();
        }
    }
} |
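The effect of the normalization can be seen in a small stand-alone sketch. It uses plain Maps in place of LabelHistogramDescriptor, and the class HistogramDistanceSketch with its methods normalize and distance are hypothetical names used only for illustration. The implementation of LabelHistogramDescriptor.normalize() is not shown on this page, so the sketch assumes that normalizing means dividing every bin by the total word count.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-alone sketch; Maps stand in for LabelHistogramDescriptor.
public class HistogramDistanceSketch {
    // ASSUMPTION: normalizing divides each bin by the total count,
    // so the bins of a non-empty histogram sum to 1.
    static Map<String, Double> normalize(Map<String, Integer> counts) {
        double total = 0;
        for (int c : counts.values()) {
            total += c;
        }
        Map<String, Double> out = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            out.put(e.getKey(), e.getValue() / total);
        }
        return out;
    }

    // Euclidean distance over the union of labels; a missing bin counts as 0.
    static double distance(Map<String, Integer> c1, Map<String, Integer> c2) {
        Map<String, Double> h1 = normalize(c1);
        Map<String, Double> h2 = normalize(c2);
        Set<String> labels = new HashSet<>(h1.keySet());
        labels.addAll(h2.keySet());
        double sum = 0;
        for (String label : labels) {
            double d = h1.getOrDefault(label, 0.0) - h2.getOrDefault(label, 0.0);
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        Map<String, Integer> small = new HashMap<>();
        small.put("a", 1);
        small.put("b", 1);
        Map<String, Integer> large = new HashMap<>();
        large.put("a", 10);
        large.put("b", 10);
        // Same proportions, different sizes: the distance is 0.
        System.out.println(distance(small, large)); // 0.0

        Map<String, Integer> onlyA = new HashMap<>();
        onlyA.put("a", 1);
        Map<String, Integer> onlyB = new HashMap<>();
        onlyB.put("b", 1);
        // No words in common: each text contributes one missing bin.
        System.out.println(distance(onlyA, onlyB)); // about 1.414 (the square root of 2)
    }
}
```

Under this assumed normalization, two texts with no words in common are at distance √2, which is the largest value the distance can take.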