Current "Standard" Approach | Box Tuned/Potentially Streamlined

If using the /extractions/file endpoint the file is transferred three times:

  1. From Box to BoxClient
  2. From BoxClient to Clowder (via fence)
  3. From Clowder to extractor container

If using the /extractions/url endpoint the file is transferred ?? times. (Not sure we could use the URL endpoint, since the Box file won't be publicly accessible; it has to be downloaded via the Box API.)

  1. ??

File lives in Clowder until the cleanup script is run

File is downloaded once from Box to the extractor container

File lives in the extractor container and is deleted at the end of process_message by PyClowder (is this correct, i.e. is the deletion done by PyClowder?)
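The download-once-then-delete lifecycle described above can be sketched as a context manager. This is only an illustration: `downloaded_file` and the `download` callable are hypothetical names, and whether PyClowder itself performs the deletion is still the open question noted above.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def downloaded_file(download, suffix=""):
    """Write a remote file to a temp path and delete it when the block exits.

    `download` is a hypothetical callable that writes the file's bytes to the
    binary handle it receives (e.g. a thin wrapper around a Box API download).
    """
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as handle:
            download(handle)
        yield path
    finally:
        # Runs on success and on failure, so the file never outlives the job.
        if os.path.exists(path):
            os.remove(path)
```

An extractor would run its processing inside `with downloaded_file(...) as path:`, so the file is removed even if processing raises.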

Box SDK only lives in the BoxClient service. No changes are required elsewhere; however, this client will need to be maintained (by us or an external party)

Box SDK has to be introduced into the PyClowder library. Any other repositories we want to support in the future would also have to be included

Burden of maintenance/adding new supported external services is squarely on our end (vs on the client's end)

Question: Is PyClowder the right place for this? PyClowder is just a convenient wrapper mechanism for creating "some" extractors that happen to be written/wrapped in Python. This would make it more heavyweight and leave out other languages.

An automated translation of Clowder metadata to Box Skills cards would have to be developed, which may or may not be difficult (sounds like there are only 4 or 5 card types)

Custom metadata structure for Box would be implemented in the extractor
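To give a feel for the size of such a translation, here is a minimal sketch that turns a flat metadata dictionary into a keyword-style card. The input shape, the function name, and its parameters are assumptions made for illustration; the card field names follow our reading of the Box Skills card format and should be verified against the Box documentation before use.

```python
def metadata_to_keyword_card(metadata, title, skill_id, invocation_id):
    """Build a keyword-style skills card from a flat metadata dict.

    Assumes `metadata` maps string keys to scalars or lists of scalars;
    real Clowder metadata may be nested and need flattening first.
    """
    entries = []
    for key, value in sorted(metadata.items()):
        values = value if isinstance(value, list) else [value]
        for item in values:
            entries.append({"text": f"{key}: {item}"})
    return {
        "type": "skill_card",
        "skill_card_type": "keyword",
        "skill_card_title": {"message": title},
        "skill": {"type": "service", "id": skill_id},
        "invocation": {"type": "skill_invocation", "id": invocation_id},
        "entries": entries,
    }
```

The other card types would each need a similar small mapping function, which is why the overall effort depends mostly on how many types have to be supported.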

What happens when an extractor doesn't support a specific service (e.g. Box, Dataverse)?

Potential Bottlenecks for massive scaling:

  1. Fence
  2. RabbitMQ
  3. Extractors
  4. BoxClient
  5. Clowder
  6. MongoDB

Notes:

We can't rely on threading in the BoxClient to do the polling since we would risk running out of threads.
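One way to avoid a thread per extraction is a single loop that keeps pending jobs in a priority queue ordered by next poll time. The sketch below assumes a hypothetical `check_status` callable and hypothetical status strings ("DONE", "FAILED"); the clock and sleep functions are passed in so the loop can be tested without waiting.

```python
import heapq
import itertools
import time


def poll_all(job_ids, check_status, now=time.monotonic, sleep=time.sleep,
             interval=5.0, max_polls=10):
    """Poll every job from one loop; returns {job_id: final_status}."""
    counter = itertools.count()  # tie-breaker so heap entries always compare
    heap = [(now(), next(counter), job_id, 0) for job_id in job_ids]
    heapq.heapify(heap)
    results = {}
    while heap:
        due, _, job_id, polls = heapq.heappop(heap)
        delay = due - now()
        if delay > 0:
            sleep(delay)  # wait only until the earliest job is due
        status = check_status(job_id)
        if status in ("DONE", "FAILED"):
            results[job_id] = status
        elif polls + 1 >= max_polls:
            results[job_id] = "TIMEOUT"
        else:
            heapq.heappush(
                heap, (now() + interval, next(counter), job_id, polls + 1)
            )
    return results
```

Memory use grows with the number of pending jobs rather than with the number of threads, though a slow `check_status` call still delays every job behind it.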

We can run a small experiment to get some numbers here (e.g. average time per request, cpu hours/memory utilized, network I/O)
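A minimal sketch of such an experiment, covering only wall-clock time per request; CPU, memory, and network I/O would need separate tooling. `request_fn` is a hypothetical zero-argument callable that performs one end-to-end extraction request.

```python
import statistics
import time


def measure(request_fn, n=20, timer=time.perf_counter):
    """Call request_fn n times and summarize the wall-clock durations."""
    durations = []
    for _ in range(n):
        start = timer()
        request_fn()
        durations.append(timer() - start)
    return {
        "count": n,
        "mean_s": statistics.mean(durations),
        "max_s": max(durations),
    }
```

Running the same harness against both approaches with the same files would give directly comparable averages.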

Potential Bottlenecks for massive scaling:

  1. Fence
  2. RabbitMQ
  3. Extractors

Notes:

Everything apart from RabbitMQ is stateless and can be horizontally scaled.

We can run a small experiment to get some numbers here (e.g. average time per request, cpu hours/memory utilized, network I/O)

Would need to deploy a new service and proxy it behind Apache

BoxClient would need to be allocated a service account and handle Brown Dog tokens

Would need to add some endpoints to fence

Errors can be reported in Clowder (not visible to the Box user)

The BoxClient could potentially retry
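If retries were added to the BoxClient, a bounded retry with exponential backoff is one option. This is a sketch only: `submit` is a hypothetical zero-argument callable that submits one extraction and raises on failure, and the attempt count and delays are placeholder values.

```python
import time


def submit_with_retry(submit, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call submit() until it succeeds or max_attempts is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return submit()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Wait 1x, 2x, 4x, ... the base delay before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

In practice the `except` clause should be narrowed to errors that are worth retrying (e.g. timeouts), so that permanent failures are reported immediately.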

Limited error logging and reporting

Unsure about retry if an extraction fails

Eventually we could create an app where the user logs in with their Box credentials to see the history of extractions
