The following are additional features proposed by T.M. for inclusion in the QC workflow. Short explanations of why each might be useful are included.
- Option to provide input data to the QC workflow as a CSV file (e.g., from a DwC archive).
- Currently the QC workflow uses a query provided as a command-line option to retrieve input data from a MongoDB database.
- Users may want to provide data as a CSV file so that loading the input data into a MongoDB instance is not needed.
- When a MongoDB query is used to provide input to the workflow, option to write out that input data set as a CSV file.
- When the workflow is run using data in MongoDB as input, no record is made of the actual data passed into the workflow.
- Given that the data in MongoDB could change after the workflow run, provenance is lost.
- It could also be useful for users to subset their input data set manually using a CSV file and then run the workflow again on this subset (see the CSV input option above).
- Preservation of original data values in records passed between actors.
- Data validation actors in the QC workflow currently overwrite the original values in the record fields for which they propose updated values.
- Although the original values are recorded in comments added to the records as new fields, they are not easily read programmatically, e.g. by a user processing the report spreadsheet.
- Overwriting values also means that downstream actors cannot access the original values and propose alternative values based on the originals.
- Actor for outputting the results spreadsheet (or CSV file) automatically at the end of the workflow run.
- Currently the QC workflow writes its output to a MongoDB instance. A separate program is used to generate the report spreadsheet from these results in MongoDB.
- Users may want to evaluate the results of a workflow run without having to query a MongoDB database.
- Based on command-line options, the workflow could output a report spreadsheet, a CSV file, or both.
- Inclusion of original fields in the results spreadsheet to ease direct comparison with input data.
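
To make several of these proposals concrete, the following is a minimal Python sketch, not the QC workflow's actual code. It reads records from CSV text, lets a validation step propose a new value while keeping the original in a separate field, and writes all records back out as CSV with the original values alongside the updated ones. The function names, the `original_` prefix, the `_comment` suffix, and the sample rows are all hypothetical and chosen only for illustration; `catalogNumber` and `country` are Darwin Core terms.

```python
import csv
import io


def read_records(csv_text):
    """Parse CSV text into a list of dict records, one per row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def propose_update(record, field, new_value, comment):
    """Return a copy of the record with a proposed value for `field`.

    The value seen first is kept under the hypothetical key
    'original_<field>' so that downstream steps can still read it.
    """
    updated = dict(record)
    # setdefault keeps the earliest original if the field is updated twice.
    updated.setdefault(f"original_{field}", record.get(field, ""))
    updated[field] = new_value
    updated[f"{field}_comment"] = comment
    return updated


def write_records(records):
    """Serialize records to CSV text, using the union of all field names."""
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()


# Illustrative input only; these rows are made up for the example.
records = read_records("catalogNumber,country\n1001,USA\n1002,Brasil\n")
records[1] = propose_update(records[1], "country", "Brazil",
                            "Spelling normalized")
report_csv = write_records(records)
print(report_csv)
```

Storing the original value in its own column, rather than only in a free-text comment, is one way a spreadsheet user or a downstream actor could compare input and proposed values directly.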