Provenance

Tupelo provides an implementation of the Open Provenance Model (OPM). The OPM is a model for recording information about processes as they occur. It includes constructs for representing causal and dependency relationships between sub-processes and the data items or other artifacts that they use or produce. Tupelo can read and write OPM information in a Context as RDF metadata, and it provides a high-level representation of the OPM that can be extended to support other serializations of the OPM as they become available.

For details about the OPM, please familiarize yourself with the document describing it.

The OPM models causal relationships as a graph whose nodes are of several kinds, including Processes, Artifacts, and Agents. It also includes the concepts of Accounts and Roles. Tupelo's API provides representations of these OPM concepts, along with representations of the arcs that connect them in OPM graphs.
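
The node and arc kinds can be pictured with a small stand-in model. The sketch below is plain Java and is not Tupelo's API: every name in it (OpmModelSketch, Node, Arc, isWellTyped) is hypothetical. It assumes the five causal dependency kinds defined by the OPM and checks that an arc joins the node kinds its dependency allows.

```java
/** Stand-in OPM model in plain Java. All names here are hypothetical, not Tupelo classes. */
final class OpmModelSketch {
    enum NodeKind { ARTIFACT, PROCESS, AGENT }

    /** The five causal dependencies of the OPM; each arc points from an effect to its cause. */
    enum ArcKind {
        USED(NodeKind.PROCESS, NodeKind.ARTIFACT),
        WAS_GENERATED_BY(NodeKind.ARTIFACT, NodeKind.PROCESS),
        WAS_CONTROLLED_BY(NodeKind.PROCESS, NodeKind.AGENT),
        WAS_TRIGGERED_BY(NodeKind.PROCESS, NodeKind.PROCESS),
        WAS_DERIVED_FROM(NodeKind.ARTIFACT, NodeKind.ARTIFACT);

        final NodeKind effectKind;
        final NodeKind causeKind;

        ArcKind(NodeKind effectKind, NodeKind causeKind) {
            this.effectKind = effectKind;
            this.causeKind = causeKind;
        }
    }

    static final class Node {
        final NodeKind kind;
        final String name;

        Node(NodeKind kind, String name) {
            this.kind = kind;
            this.name = name;
        }
    }

    static final class Arc {
        final ArcKind kind;
        final Node effect;
        final Node cause;
        final String role;    // may be null for arcs that carry no role
        final String account; // the account under which the arc is asserted

        Arc(ArcKind kind, Node effect, Node cause, String role, String account) {
            this.kind = kind;
            this.effect = effect;
            this.cause = cause;
            this.role = role;
            this.account = account;
        }
    }

    /** True if the arc joins the node kinds that its dependency kind requires. */
    static boolean isWellTyped(Arc arc) {
        return arc.effect.kind == arc.kind.effectKind
            && arc.cause.kind == arc.kind.causeKind;
    }

    public static void main(String[] args) {
        Node transform = new Node(NodeKind.PROCESS, "xslt transform");
        Node sheet = new Node(NodeKind.ARTIFACT, "stylesheet");
        Arc used = new Arc(ArcKind.USED, transform, sheet, "stylesheet", "example account");
        System.out.println(used.kind + " well-typed: " + isWellTyped(used));
    }
}
```

Putting the allowed endpoint kinds on the enum keeps the check to a single comparison per end of the arc. A real implementation would likely use distinct types per node kind, as Tupelo's API does.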

Creating OPM metadata

To create a provenance graph in Tupelo, use ProvenanceContextFacade and configure it with a backing Context that will be used for storing and retrieving the RDF metadata representing the provenance graph:

Context c = ...
ProvenanceContextFacade graph = new ProvenanceContextFacade();
graph.setContext(c);

A typical use of the provenance API is to record information about a process as it occurs. In the OPM, each causality relationship is represented by an arc that is associated with an Account, so before creating any arcs you must instantiate and "assert" an account:

ProvenanceAccount account = graph.newAccount("My account");
graph.assertAccount(account);
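
Because every arc is tied to an account, one graph can hold more than one description of the same events, and a reader can select the description of interest. The sketch below illustrates that idea in plain Java. It is a stand-in under stated assumptions: AccountSketch, its Arc type, the arcsIn method, and the account names "coarse" and "detailed" are hypothetical and do not reflect how Tupelo stores accounts.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-in for account-scoped arcs. All names are hypothetical, not Tupelo classes. */
final class AccountSketch {
    static final class Arc {
        final String effect;
        final String kind;
        final String cause;
        final String account;

        Arc(String effect, String kind, String cause, String account) {
            this.effect = effect;
            this.kind = kind;
            this.cause = cause;
            this.account = account;
        }

        @Override
        public String toString() {
            return effect + " " + kind + " " + cause + " [" + account + "]";
        }
    }

    /** Returns only the arcs asserted under the given account, in insertion order. */
    static List<Arc> arcsIn(List<Arc> graph, String account) {
        List<Arc> selected = new ArrayList<Arc>();
        for (Arc arc : graph) {
            if (arc.account.equals(account)) {
                selected.add(arc);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<Arc> graph = new ArrayList<Arc>();
        // A coarse account: a single process turned the document into the result.
        graph.add(new Arc("publish", "used", "doc.xml", "coarse"));
        graph.add(new Arc("result.xml", "wasGeneratedBy", "publish", "coarse"));
        // A more detailed account of the same events.
        graph.add(new Arc("xslt transform", "used", "doc.xml", "detailed"));
        graph.add(new Arc("xslt transform", "used", "style.xsl", "detailed"));
        graph.add(new Arc("result.xml", "wasGeneratedBy", "xslt transform", "detailed"));

        for (Arc arc : arcsIn(graph, "detailed")) {
            System.out.println(arc);
        }
    }
}
```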

In RDF, the account will be represented as an RDF subject. If you want to specify the subject URI reference, you can do that when you create the account.

ProvenanceAccount account = graph.newAccount("My account", Resource.uriRef("http://example.org/account1"));

The same pattern applies to other OPM objects, such as Artifacts:

ProvenanceArtifact artifact = graph.newArtifact("input file 1");
graph.assertArtifact(artifact);

In the following example, an XML document is processed with an XSLT stylesheet to produce a new XML document. The resulting provenance graph records the two input Artifacts (the document and the stylesheet), the XSLT transformation Process, and the output Artifact (the result document).

MemoryContext mc = new MemoryContext();
ResourceContext rc = new ResourceContext("http://example.org/data/","/provenanceExample/");
UnionContext context = new UnionContext();
context.addChild(mc);
context.addChild(rc);
ProvenanceContextFacade pcf = new ProvenanceContextFacade(mc);
ProvenanceAccount account = pcf.newAccount("example account");
 
Resource sheet = Resource.uriRef("http://example.org/data/style.xsl");
Resource doc = Resource.uriRef("http://example.org/data/doc.xml");
 
ProvenanceArtifact docArtifact = pcf.newArtifact("source doc", doc);
ProvenanceArtifact sheetArtifact = pcf.newArtifact("stylesheet", sheet);
 
ByteArrayOutputStream outBuffer = new ByteArrayOutputStream();
Xml.transform(context.read(doc), context.read(sheet), outBuffer);
 
ByteArrayInputStream inBuffer = new ByteArrayInputStream(outBuffer.toByteArray());
Resource result = Resource.uriRef("http://example.org/data/result.xml");
context.write(result, inBuffer);
 
// the process has completed, now let's record the provenance
ProvenanceArtifact resultArtifact = pcf.newArtifact("transform result", result);
ProvenanceProcess transformProcess = pcf.newProcess("xslt transform");
 
pcf.assertAccount(account);
pcf.assertArtifact(sheetArtifact);
pcf.assertArtifact(docArtifact);
pcf.assertProcess(transformProcess);
pcf.assertArtifact(resultArtifact);
 
// the input document and stylesheet are two different kinds of inputs for the transform
// process, so each has its own role
ProvenanceRole inputDocumentRole = pcf.newRole("input document");
ProvenanceRole stylesheetRole = pcf.newRole("stylesheet");
ProvenanceRole outputRole = pcf.newRole("output");
 
pcf.assertUsed(transformProcess, docArtifact, inputDocumentRole, account);
pcf.assertUsed(transformProcess, sheetArtifact, stylesheetRole, account);
pcf.assertGeneratedBy(resultArtifact, transformProcess, outputRole, account);

Annotating OPM metadata

Provenance graphs can be made more useful by annotating them with descriptive metadata. For example, to record who wrote the stylesheet, make an assertion about the stylesheet artifact. First, find the subject Resource that identifies the artifact:

Resource sheetSubject = ((RdfProvenanceArtifact)sheetArtifact).getSubject();

Now insert a triple on that subject indicating that the stylesheet was written by "Joe Futrelle":

context.addTriple(sheetSubject, Dc.AUTHOR, "Joe Futrelle");

Navigating an OPM graph

A typical use of provenance metadata is to discover something about the antecedent processes and intermediate artifacts that contributed to a given result. The causal links between these entities can be followed using the ReadableProvenanceGraph API. For example, we can use the API to follow the arc from the result artifact in the previous example to the XSLT transformation that generated it:

for(ProvenanceGeneratedArc generatedBy : pcf.getGeneratedBy(resultArtifact)) {
    if(generatedBy.getRole().getName().equals("output")) {
        // found the generatedBy arc
        ProvenanceProcess xsltTransform = generatedBy.getProcess();
        // ... now we can do something with that transformation process
    }
}
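
Following a single arc generalizes to walking the whole causal history of a result. The sketch below shows that traversal on a stand-in graph in plain Java. It does not use ReadableProvenanceGraph: the LineageSketch class, its causes map, and the ancestors method are hypothetical, and the sketch assumes each node name maps to the names of its direct causes.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Stand-in lineage walk. All names are hypothetical, not Tupelo classes. */
final class LineageSketch {
    /** Breadth-first walk over effect-to-cause edges; returns every node the start depends on. */
    static Set<String> ancestors(Map<String, List<String>> causes, String start) {
        Set<String> seen = new LinkedHashSet<String>();
        Deque<String> queue = new ArrayDeque<String>();
        queue.add(start);
        while (!queue.isEmpty()) {
            String node = queue.remove();
            List<String> direct = causes.get(node);
            if (direct == null) {
                continue; // this node has no recorded causes
            }
            for (String cause : direct) {
                // Set.add returns false for nodes already visited, so each node is queued once.
                if (seen.add(cause)) {
                    queue.add(cause);
                }
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        Map<String, List<String>> causes = new HashMap<String, List<String>>();
        causes.put("transform result", Arrays.asList("xslt transform"));
        causes.put("xslt transform", Arrays.asList("source doc", "stylesheet"));
        // prints [xslt transform, source doc, stylesheet]
        System.out.println(ancestors(causes, "transform result"));
    }
}
```

The visited set keeps the walk finite even if the recorded graph happens to contain a cycle.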

Searching for OPM nodes by annotation

Annotated provenance graphs can be searched using ordinary RDF queries. For example, since there's a dc:author annotation in our example on the stylesheet artifact, we can search for it using TripleMatcher:

for(Triple t : context.match(null, Dc.AUTHOR, "Joe Futrelle")) {
    ProvenanceArtifact artifact = pcf.getArtifact(t.getSubject());
    // ... now we can do something with the artifact we found
}
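
The match call above passes null for the subject, which appears to act as a wildcard. The sketch below illustrates that style of pattern matching over an in-memory list of triples in plain Java. It is a stand-in under stated assumptions: TripleMatchSketch, its Triple type, and its match method are hypothetical and say nothing about how Tupelo's Context or TripleMatcher evaluate queries, and the subject URNs and the second author name are placeholders.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-in triple matcher. All names are hypothetical, not Tupelo classes. */
final class TripleMatchSketch {
    static final class Triple {
        final String subject;
        final String predicate;
        final String object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
    }

    /** A null pattern term matches any value in that position. */
    private static boolean termMatches(String pattern, String value) {
        return pattern == null || pattern.equals(value);
    }

    /** Returns every stored triple that agrees with all non-null pattern terms. */
    static List<Triple> match(List<Triple> store, String subject, String predicate, String object) {
        List<Triple> hits = new ArrayList<Triple>();
        for (Triple t : store) {
            if (termMatches(subject, t.subject)
                    && termMatches(predicate, t.predicate)
                    && termMatches(object, t.object)) {
                hits.add(t);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Triple> store = new ArrayList<Triple>();
        store.add(new Triple("urn:example:sheetArtifact", "dc:author", "Joe Futrelle"));
        store.add(new Triple("urn:example:docArtifact", "dc:author", "Someone Else"));
        // Find the subjects of all annotations naming this author.
        for (Triple t : match(store, null, "dc:author", "Joe Futrelle")) {
            System.out.println(t.subject);
        }
    }
}
```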
