...

There are two possible architectures:

Current "Standard" Approach: Create a Box client application that will handle the additional functionality not supported by Box Skills (e.g. polling, metadata mapping) involved with interacting with Fence, likely using bd.py:

draw.io Diagram: BoxSkillsWithBD.py

Box Tuned/Potentially Streamlined: Enhance pyclowder so that it can download files from Box and upload metadata to Box, in addition to Clowder (eliminating data movement that isn't required). Eventually support would be added for Google Drive, Dataverse, etc.
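The metadata mapping mentioned for the Box client application could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name, the shape of the input (a document with a "content" dictionary), and the card layout (modeled loosely on a Box Skills keyword card) are all assumptions and would need to be checked against the Box Skills documentation and the actual extractor output.

```python
from typing import Any, Dict, List


def clowder_metadata_to_keyword_card(
    metadata: Dict[str, Any], title: str, skill_id: str, invocation_id: str
) -> Dict[str, Any]:
    """Flatten one extractor metadata document into a keyword-style card.

    Hypothetical helper: both the input shape and the card layout are
    assumptions made for illustration, not a confirmed Box or Clowder schema.
    """
    content = metadata.get("content", {})
    entries: List[Dict[str, str]] = []
    for key, value in sorted(content.items()):
        # Lists become one entry per item; scalars become a single entry.
        values = value if isinstance(value, list) else [value]
        for item in values:
            entries.append({"type": "text", "text": f"{key}: {item}"})
    return {
        "type": "skill_card",
        "skill_card_type": "keyword",
        "skill_card_title": {"message": title},
        "skill": {"type": "service", "id": skill_id},
        "invocation": {"type": "skill_invocation", "id": invocation_id},
        "entries": entries,
    }
```

Keeping the mapping in one pure function like this would let either architecture reuse it, whether it ends up in the BoxClient or in the extractor.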

...

When a Skill is registered with a Box account, the invocation URL is provided. This URL will resolve to an endpoint in Fence.

The second architecture is to create a Box client application that will interact with Brown Dog using bd.py.

Comparison of Approaches

Current "Standard" Approach

Box Tuned/Potentially Streamlined

Box in Pyclowder

bd.py

If using the /extractions/file endpoint the file is transferred three times:

From Box to BoxClient
From BoxClient to Clowder (via fence)
From Clowder to extractor container

If using the /extractions/url endpoint the file is transferred ?? times (not sure we could use the URL endpoint, since the Box file won't be publicly accessible and has to be downloaded via the Box API):

??

File lives in clowder until the cleanup script is run

File is downloaded once from box to the extractor container

File lives in the extractor container and is deleted at the end of process_message by PyClowder (is this correct, i.e. by PyClowder?)

Box SDK only lives in the BoxClient service. No changes are required elsewhere; however, this client will need to be maintained (by us or an external party).

Box SDK has to be introduced into the Pyclowder library. Any other repositories we want to support would also have to be included in the future.

Box SDK only lives in the BoxClient service. No changes are required for pyclowder.

Burden of maintenance/adding new supported external services is squarely on our end (vs on the client's end)

Question: Is Pyclowder the right place for this? PyClowder is just a convenient wrapper mechanism for creating "some" extractors that happen to be written or wrapped in Python. This will make it more heavyweight and leave out extractors written in other languages.

An automated translation of Clowder metadata to Box Skills cards would have to be developed, which may or may not be difficult (it sounds like there are only 4 or 5 types).

Custom metadata structure for box would be implemented in the extractor

What happens when an extractor doesn't support a specific service (e.g. Box, Dataverse)?

Potential Bottlenecks for massive scaling:

Fence
RabbitMQ
Extractors
BoxClient
Clowder
MongoDB

Notes:

We can't rely on threading in the BoxClient to do the polling since we would risk running out of threads.

We can run a small experiment to get some numbers here (e.g. average time per request, CPU hours, memory utilized, network I/O).

Potential Bottlenecks for massive scaling:

Fence
RabbitMQ
Extractors

Notes:

Everything apart from Rabbit is stateless and can be horizontally scaled.

We can run a small experiment to get some numbers here (e.g. average time per request, CPU hours, memory utilized, network I/O).

Would need to deploy a new service and proxy it behind Apache.

BoxClient would need to be allocated a service account and handle Brown Dog tokens

Would need to add some endpoints to fence

Limited error logging and reporting

Unsure about retry if an extraction fails

Eventually we could create an app where the user logs in with their Box credentials to see the history of extractions.

Errors can be reported in Clowder (not visible to the Box user)

The BoxClient could potentially retry
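One way to avoid a thread per polled extraction, as raised in the notes on the BoxClient, is to drive all polls from a single loop ordered by next-poll time. The sketch below is only an illustration: poll_all, the check_status callback, and the status strings "DONE" and "FAILED" are hypothetical names, not part of bd.py or Fence.

```python
import heapq
import time
from typing import Callable, Dict, List, Tuple


def poll_all(
    job_ids: List[str],
    check_status: Callable[[str], str],
    interval: float = 5.0,
    max_polls: int = 20,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, str]:
    """Poll every job from one loop and return the final status per job.

    check_status is a hypothetical callback that asks the service for a
    job's state; "DONE" and "FAILED" are assumed terminal states.
    """
    results: Dict[str, str] = {}
    # Heap entries: (next poll time, polls used so far, job id).
    pending: List[Tuple[float, int, str]] = [(clock(), 0, j) for j in job_ids]
    heapq.heapify(pending)
    while pending:
        due, polls, job_id = heapq.heappop(pending)
        delay = due - clock()
        if delay > 0:
            sleep(delay)  # Wait until the earliest job is due again.
        status = check_status(job_id)
        if status in ("DONE", "FAILED"):
            results[job_id] = status
        elif polls + 1 >= max_polls:
            results[job_id] = "TIMEOUT"  # Give up after max_polls attempts.
        else:
            heapq.heappush(pending, (clock() + interval, polls + 1, job_id))
    return results
```

The number of outstanding extractions is then bounded by memory for the heap rather than by available threads. A persistent queue would still be needed if pending polls must survive a restart.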

Skills Invocation

When a file is uploaded to a Box folder that has an attached skill, the following payload is POSTed to the fence endpoint:

...

Wireframe: BalsamiqProject_132842923

Leverage ideas from Binder to facilitate community extractor development and registration. Allow the tools catalog to use an underlying Git repo for storing extractor_info documents and for keeping track of versions, issues, branches, and pull requests. It will download the extractor_info.json file to populate information on the page. This will additionally furnish Box enterprise admins with URLs that expose the tool as a skill. Initially they will have to copy and paste the URL. Once Box exposes management of Skills through an API, this can be further automated.
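As a rough illustration of how the catalog might turn an extractor_info.json document into a page entry plus a skill URL for admins to copy, consider the sketch below. The catalog_entry function, the "/skills/<name>" URL layout, and the example host are hypothetical. The fields read from the file (name, version, description) are assumed to be present in extractor_info.json.

```python
import json
from typing import Any, Dict
from urllib.parse import quote


def catalog_entry(extractor_info_json: str, base_url: str) -> Dict[str, Any]:
    """Build a catalog page entry from the text of an extractor_info.json file."""
    info = json.loads(extractor_info_json)
    name = info["name"]
    return {
        "name": name,
        "version": info.get("version", "unknown"),
        "description": info.get("description", ""),
        # Hypothetical URL layout; the real Fence endpoint path is not
        # specified in this document.
        "skill_url": f"{base_url.rstrip('/')}/skills/{quote(name, safe='')}",
    }


if __name__ == "__main__":
    example = '{"name": "ncsa.langid", "version": "1.0", "description": "Language ID"}'
    print(catalog_entry(example, "https://fence.example.org/"))
```

Because the entry is derived entirely from the file in the Git repo, a new version or pull request would only require re-running this step to refresh the page.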

...

Langid
DBPedia
Census From Cell
Handwritten Decimals
Killed Photos
Mean Grey
Faces
Eyes
Profiles
Closeups
NLTK Summary
Stanford CoreNLP
Tesseract
Tika
Versus
VLFeat
Generalized exif/image metadata extractor

Scientific Communities to be Seeded in Tools Catalog

OpenCV - Grad student to curate?
Critical Zone Observatory (via ESIP?)
Data Driven Ag
Bisque - Counting Cells in a microscope image??
Cosmology
