
This page is for discussion of ways to provide enhanced support for managing metadata terms, and annotating resources, with a new design. Given that a new design may/may not be able to support all desirable features simultaneously, the discussion should include design options and prioritization of features if/when conflicts occur. Similarly, some advanced functionality may be useful to contemplate in terms of design but might not be something that gets implemented.

Goals:

The redesign should retain the benefits of the current design (e.g. ability to add custom terms to a space, provenance regarding the origin of specific annotations, ability to add widgets for input/display, ability to send notices of changes to the message bus) and make it easier to do things such as:

add some or all terms from existing third-party vocabularies
provide definitions/best practice guidance associated with each term
allow metadata to be marked as recommended and/or indicate which metadata will be required during publishing (specified directly by space admins and/or based on available/preferred/default registered repositories)
organize metadata entries by vocabulary and/or alphabetically and/or by purpose
clearly distinguish extracted (and therefore ~redundant) metadata from provided values
allow some extracted terms to be propagated through publishing (right now only provided metadata is published, but space admins or repositories may want to keep high value/hard to reproduce extracted metadata as well)
emphasize the term/definition/value for metadata and de-emphasize/hide by default the provenance of who entered it when
support metadata that references external URLs/identifiers and/or internal identifiers (e.g. with Collections/Datasets/Folders/Files)
enable metadata values to be edited
deletions/edits of metadata should be captured in provenance
manage JSON-LD context(s) at the space level (i.e. assuring that all instances of a label refer to the same predicate, all predicates have one label, even as space admins make changes, and handling metadata entered via the API if/when it has conflicting context info)
support the exchange of contexts/metadata 'profiles' between spaces, or the use of metadata profiles from one space in a read-only manner from another (i.e. one space controls the label/URL/type/definition info)
support (entry/display) for structured metadata values (e.g. json/json-ld)
support cardinality constraints (number, order?)
support consistency of metadata for all publications in a space (for a given space config)
enable blacklisting of predicates/labels used internally in the software (e.g. for title, description, owner, dates, license, etc.)

Current Design:

Individual metadata entries are stored as Mongo documents and are associated with a file or dataset (comments say metadata can be attached to collections but this is not implemented in the GUI) and a JSON-LD context.
Entries through the GUI usually result in one term/value per entry, but some types (e.g. geospatial) produce more than one term/value/a json value that is displayed with structure. Entries through the API can include multiple term/values which can only be deleted as a unit.
JSON-LD contexts are stored as another document collection and can be referenced by multiple metadata documents, but these contexts are not aligned with the space metadata choices (e.g. existing metadata docs and contexts are not updated with changes to a space's choices for metadata terms; such changes only affect the terms available as input options in the GUI).
Metadata management allows space admins to add/edit/delete entries starting from an instance default list controlled by the overall sys/instance admins. Entries have a label, a formal URI predicate (the two together form the context entry; if a URI is not given, a unique URI is generated), and a type which controls the input widget used for entry (but not display?). Some types connect to a 'third party' earth cube geosemantics server to retrieve controlled vocabulary lists or to invoke a geo name to coordinates mapping service, etc. Simple internal types include string and date. Metadata entries do not include definitions.
Metadata definitions are separate docs associated with a space

Proposed Design

Space-based Managed metadata doc

The primary artifact is a JSON doc of the current metadata values, referencing a space JSON-LD context and associated by id with a dataset/folder/file. CRUD operations from the GUI or API would update this document via a service. Log docs would capture updates (documenting not only new adds, but edits and deletes as well).
Context/definitions would be maintained, one per space, as part of a doc containing all current terms (each term has label/URI and definition, type, and any cardinality, recommended/required info, etc.)
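
A minimal sketch of the two artifacts described above, assuming Python dataclasses stand in for the stored JSON docs. Every name here (`TermDefinition`, `MetadataDoc`, `apply_update`, the field names, the in-memory `log` list) is hypothetical and chosen for illustration; none of it is taken from the existing code.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional


@dataclass
class TermDefinition:
    """One entry in the per-space context/definition doc (field names assumed)."""
    label: str
    uri: str
    definition: str = ""
    type: str = "string"
    max_count: Optional[int] = None   # cardinality limit; None means unlimited
    required: bool = False
    recommended: bool = False


@dataclass
class MetadataDoc:
    """Current metadata values for one dataset/folder/file, keyed by predicate URI."""
    resource_id: str
    space_id: str
    values: Dict[str, List[Any]] = field(default_factory=dict)


def apply_update(doc, log, op, uri, user, value=None, old_value=None):
    """Apply an add/edit/delete to the doc and append a log entry describing it."""
    if op not in ("add", "edit", "delete"):
        raise ValueError(f"unknown operation: {op}")
    # Work on a copy so a failed edit/delete leaves the doc untouched.
    current = list(doc.values.get(uri, []))
    if op == "add":
        current.append(value)
    elif op == "edit":
        current[current.index(old_value)] = value
    else:
        current.remove(old_value)
    if current:
        doc.values[uri] = current
    else:
        doc.values.pop(uri, None)
    # The log doc records who did what and when, including edits and deletes.
    log.append({
        "resource_id": doc.resource_id, "op": op, "uri": uri,
        "value": value, "old_value": old_value, "user": user,
        "at": datetime.now(timezone.utc).isoformat(),
    })
```

In this sketch the metadata doc only ever holds the current state, while the trail of edits for the provenance widget would be read from the log entries.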

How it would work:

- Metadata mgmt GUI would edit context/definition doc, doc would include any new fields such as term definition, cardinality, etc. as needed. JSON-LD context would be generated from this doc when needed
- Metadata entry would use this doc from the global instance or the space the entry is in to populate the add metadata box. (With sharing on, a decision would be needed as to whether to allow annotation based on one space's info or to allow annotation in terms of other spaces.) Type/definition, cardinality and other info would be used to enhance the GUI.
- Display of existing terms would be changed ~per the GUI redesign and mockups done by Michael Iannaccone (see SEAD-1042 and https://opensource.ncsa.illinois.edu/jira/secure/attachment/21290/21290_metadata-GUI.jpg) - presenting metadata values by term alphabetically or by namespace/category, with a widget to expand the provenance of the entry (which would include any trail of edits).
- Edits to a metadata definition would tie to entries via the formal predicate, e.g. for an entry for 'method'/http://example.org/method, a change of the label for that term to "Experimental Method" would show in the display, whereas an attempt to create a new term "method": "http:otherorg/method" would either be blocked due to the label conflict until the original label was updated, or would trigger a confirm that the existing metadata should be updated.
- Changes to the space metadata definitions would not be tracked (could be tracked via update docs as with metadata values)
- Metadata added through the API would be matched to existing terms: it would be required to be JSON-LD, the predicate would be matched to the space context, and that label would be used (i.e. the API would not generate an alternate label for existing space definitions). New terms seen in incoming API requests would be added to the context doc as 'view-only' terms - this allows the API and extractors to add any terms they use. Admins could add additional info (a definition, type, etc.) to such generated definitions and/or allow them to appear as add choices in the GUI.
- Movement of items between spaces would be handled as though the API had been used to enter the metadata as part of the transfer, assuming no space sharing. (With sharing on, the decision of how to handle annotation would also have to cover an item being removed from one of the spaces.)
- Metadata edits would invoke the same type-specific entry widget as adding a new value
- Cardinality of 1 would remove an option from the add menu if a value exists
- A new type of metadata would be a link/relationship, and there would be a widget(s) that would recognize URLs and internal IDs/ID URLs and make them live links (a URL might just become live, but a urn:<id> value would get replaced by a dataset name or path and be a link to the page for that item). Other GUIs could be added (e.g. once bulk ops exist, one could select 2 items and then create a link between them, etc.) - these could populate the same metadata field, i.e. relationships would not be a separate type/managed separately. (It would be possible to do the lookup from Clowder URN to item via javascript, which would enable them to be displayed by name in other apps as well if desired - this is what is done for ORCID IDs where we share a javascript across Clowder and repositories.)
- Required/recommended metadata would be identified by color/icon in the add metadata list (perhaps required fields could be shown with empty values and could be edited)
- Metadata mgmt GUI could be extended to allow viewing/selection from other vocabs (e.g. Dublin Core, from an external website (in their format or a JSON conversion we or the geosemantics server maintain), or another space profile (e.g. reading that space's context/definition doc and listing the entries for selection - perhaps sortable by label/URI/type/category, etc.)). Once selected, there would be no live link to the original vocab/profile.
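
Several of the steps above (generating the JSON-LD context, blocking label conflicts, matching API input by predicate, and dropping cardinality-1 terms from the add menu) can be sketched as follows. This is an illustrative sketch only: terms are plain dicts, and the function names and the `view_only`/`max_count` keys are assumptions rather than existing code. Falling back to the URI as the label for an unknown predicate whose label is already taken is also an assumption made for the sketch, not a decision from this page.

```python
def build_context(terms):
    """Generate a JSON-LD context (label -> predicate URI) from the term definitions."""
    return {"@context": {t["label"]: t["uri"] for t in terms}}


def label_conflict(terms, label, uri):
    """Return the existing term that uses this label for a different URI, if any."""
    for t in terms:
        if t["label"] == label and t["uri"] != uri:
            return t
    return None


def match_api_metadata(terms, incoming_context, incoming_values):
    """Match API-submitted values to space terms by predicate, not by label.

    Unknown predicates are appended to `terms` as view-only definitions, so
    they are displayed but not offered in the add-metadata box.
    """
    by_uri = {t["uri"]: t for t in terms}
    matched = []
    for incoming_label, value in incoming_values.items():
        uri = incoming_context[incoming_label]
        term = by_uri.get(uri)
        if term is None:
            label = incoming_label
            if label_conflict(terms, label, uri):
                label = uri  # assumption: fall back to the URI if the label is taken
            term = {"label": label, "uri": uri, "view_only": True}
            terms.append(term)
            by_uri[uri] = term
        # The space's label wins over whatever label the API caller used.
        matched.append({"label": term["label"], "uri": uri, "value": value})
    return matched


def add_menu_labels(terms, current_values):
    """Labels offered in the add-metadata box for an item with `current_values`."""
    labels = []
    for t in terms:
        if t.get("view_only"):
            continue
        max_count = t.get("max_count")
        if max_count is not None and len(current_values.get(t["uri"], [])) >= max_count:
            continue  # e.g. cardinality of 1 and a value already exists
        labels.append(t["label"])
    return sorted(labels)
```

Because matching is done on the predicate URI, an extractor that submits `method` for http://example.org/method would be displayed under the space's label (e.g. "Experimental Method") without the API call needing to know that label.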

Pros

This design builds on the strengths of the existing one in terms of maintaining (and improving) provenance, supporting typed metadata, and enabling space-level customization. Many of the GUI-level changes are design agnostic and could probably be used/adapted for other storage designs. However, changing the low-level design to focus on a single space-level context/config artifact and pre-integrating the metadata for a given item provides a clear mechanism to handle edits and context differences between sources (different extractors, different spaces, changes in space config over time). That will avoid confusion going forward: cases where different entries for the same predicate get shown with different labels, where a label is shown twice because it is used for two different predicates, or where it would be unclear whether type or cardinality rules apply only to the correct label/predicate combo, to matching labels, or to matching predicates regardless of label. It would also clean up these types of issues for publication, and would make it easier to share the context/config of one space with another. Aside from changing how information is stored, which would be best to do early, most of the other functionality could be pursued incrementally.
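
To make the confusion cases concrete, here is a small sketch that scans a set of independently stored contexts (as in the current design) and reports the two problems named above. The function name, the flat label-to-URI dict shape, and the example URIs are assumptions for illustration only.

```python
from collections import defaultdict


def find_context_conflicts(contexts):
    """Report predicates shown under several labels and labels used for several predicates.

    `contexts` is a list of flat {label: predicate URI} dicts, one per stored context.
    """
    labels_by_uri = defaultdict(set)
    uris_by_label = defaultdict(set)
    for ctx in contexts:
        for label, uri in ctx.items():
            labels_by_uri[uri].add(label)
            uris_by_label[label].add(uri)
    return {
        "predicates_with_multiple_labels": {
            uri: sorted(labels) for uri, labels in labels_by_uri.items() if len(labels) > 1
        },
        "labels_with_multiple_predicates": {
            label: sorted(uris) for label, uris in uris_by_label.items() if len(uris) > 1
        },
    }
```

With a single space-level context/config doc, both result sets would be empty by construction, which is the consistency property the proposed design is after.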

Cons

In some sense, the benefits of this redesign only accrue to projects that actually use the flexibility and customizability of the system. For instances with a single space or no customization of metadata per space, where all of the metadata comes from the GUI rather than the API, where metadata is added by an authoritative user and isn't edited, etc., the current design is OK - the issues pointed out above could happen, but probably won't in practice. (Over time, and with more independent groups using an instance, these issues can and do become real though.) Some groups may thus see this design as overkill/premature. Aside from the added work to implement it, it's not clear that there are big cons. Adding a new metadata value means editing a document rather than just storing a new one, but retrieving the existing values would be one document recall rather than a scan across separate docs. Extracted metadata could be managed (with the space determining the label and/or enforcing type/cardinality rules), which would make the interface look more consistent but raises the issue of what happens if extracted metadata doesn't meet constraints. (Similar for other uses of the API.) Most of these are relatively neutral, but they make it clear that this design would have impacts outside the core of user-entered metadata. To be fair, once spaces have access control and customizable metadata, the issues of how extractors and API users handle those customizations, and how features such as being able to share datasets across spaces interact with them, have to be faced under any design (if two spaces have different metadata terms, which terms show up in the add metadata list? is it affected by whether I have access to one space or the other? what happens for extractors that work on behalf of a user through their key (a capability in development I think)? etc.).
I think this design can help answer those questions over time, but it does favor/flavor the answers towards a space-centric view of management/control.

Other design ideas:

GUI-centric

Leave the current metadata and context storage ~as is and focus on the GUI changes:

info about the annotations of a given object would occur as they do now, but there would be more logic to organize the display alphabetically, etc.
metadata management page could be updated to support new types, cardinality, etc. Enforcing cardinality would require tracking existing entries (and deciding what constitutes a match)
editing would require some update to the model (is it delete and add? what happens when a user wants to edit one value and the original entry contains several (the API allows entry of several terms in one call which is stored as one doc))

Pros

potentially less work up front

Cons

Would still leave open questions of consistent labels as metadata is edited unless some design changes are made (e.g. when a label is updated, do existing entries get scanned and changed? are they changed dynamically for display?). Similarly, a means of tracking deletes is required if edit is delete/add and/or if we want to track provenance for deletes. And, if deletes are tracked, assembling the set of metadata to display involves playing the sequence of operations forward. One can keep going through all of the other required/desirable features above, asking similar questions and asking whether this approach, by the time it addresses those requirements, remains simpler/easier/faster to implement than the proposed design. I.e., it's not clear that this design really avoids any issues in the proposed one - if we want the functionality listed, some decisions and changes are needed, and the question really becomes whether it is easier to implement given this underlying storage/service design.
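
As a rough illustration of "playing the sequence of operations forward", the sketch below rebuilds the set of entries to display from a list of tracked add/delete operations, with an edit modelled as a delete followed by an add. The operation shape (`seq`, `op`, `entry_id`, `content`) and the function name are hypothetical.

```python
def replay(operations):
    """Rebuild the entries to display by applying tracked operations in order.

    Each operation is a dict with a sequence number "seq", an "op" of "add" or
    "delete", the "entry_id" it refers to, and (for adds) the entry "content".
    """
    live = {}
    for op in sorted(operations, key=lambda o: o["seq"]):
        if op["op"] == "add":
            live[op["entry_id"]] = op["content"]
        elif op["op"] == "delete":
            live.pop(op["entry_id"], None)
    return live
```

Every display of an item would need this pass (or a cached result of it), which is part of the work the GUI-centric option would still have to take on.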