Proposal submitted: XSEDE16TutorialProposal_v1.docx
Abstract
When: July 18th
Where
Tutorial Session Design
Introduction to Brown Dog (30m)
This is a presentation and demo of the Brown Dog project and services
How to use Brown Dog Services (1h 30m)
20-30 mins setup
2 problems + 1 optional problem
1 extraction - Problem 1: face, OCR, and audio extractors
1 conversion - Problem 3: a collection of images/audio in old file formats
1 combined - Problem 6: a single script combining extraction and conversion
1 optional - Problem 5
This is a session to teach how to use Brown Dog Services
- Participants will use their own laptops for this part
- We will provide a VM with everything pre-installed in it through Nebula.
- Rob Kooper will talk to Doug about whether we can spawn 50 VMs on Nebula for the tutorial session. (DONE) We will get 50 VMs on Nebula.
- Smruti Padhy: Order 50+ flash drives for backup that will contain the VMs
- Create a VM with everything installed in it and take a snapshot which will then be deployed within Nebula. Approx. time required - 2 days
- Make a list of all software required and the directory structure for the tutorial
- local installation of fence and local authentication.
- 50 concurrent users to perform conversion/extractions tests
- Luigi Marini: Determine the max size of files to be uploaded
- Not sure of Jetstream yet.
- Provide clear instructions on how to access the VMs in Nebula with proper credentials (e.g., through ssh), from different OSes.
- Do we need training accounts on nebula?
- (Before tutorial: wiki pages with clear instructions) Install Python/R/MATLAB/cURL to use the BD service, along with the required libraries, in case anyone is interested in using the BD services in the future.
- Create wiki pages with clear instructions
- Demonstration of use of BD Fiddle
- Sign up for Brown Dog Service
- Obtain a key/token using curl or Postman or use of IPython notebook
- Use the token and the BD Fiddle interface to see BD in action.
- Copy-paste the Python code snippet and use it in the application to be explained next.
- Create a document for the demo with step-by-step screenshots
- Fix the CORS error for file url option (I think it is a known issue)
- Delay when file is uploaded from local directory
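The sign-up/key/token steps above could be sketched in Python as below. The gateway URL, the endpoint paths (/keys, /keys/<key>/tokens), and the response field names are assumptions to be checked against the current BD REST API documentation before the tutorial:

```python
import base64
import json
import urllib.request

BD_API = "https://bd-api.ncsa.illinois.edu"  # assumed gateway URL

def keys_url(base):
    """Endpoint used to request an API key (assumed path)."""
    return base.rstrip("/") + "/keys"

def tokens_url(base, key):
    """Endpoint used to exchange a key for a session token (assumed path)."""
    return base.rstrip("/") + "/keys/" + key + "/tokens"

def basic_auth_header(username, password):
    """HTTP Basic auth header, as curl -u would send it."""
    cred = base64.b64encode((username + ":" + password).encode()).decode()
    return {"Authorization": "Basic " + cred}

def post_json(url, headers):
    """POST with an empty body and parse the JSON response."""
    req = urllib.request.Request(url, data=b"", headers=headers, method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def get_token(base, username, password):
    """Obtain a key, then a token, mirroring the curl/Postman steps."""
    hdrs = basic_auth_header(username, password)
    key = post_json(keys_url(base), hdrs)["api-key"]        # assumed field name
    return post_json(tokens_url(base, key), hdrs)["token"]  # assumed field name
```

The same two calls can be shown in curl or Postman on the wiki page; the snippet is only meant to match the copy-paste step above.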
- Create applications using BD services
Three applications:
- Problem 1: Given a collection of images with text embedded in them, search the images based on their content. (Emphasizes extraction from unstructured data, indexing, and content-based retrieval)
- Images can be uploaded from a local directory or obtained via an external web service.
- Create an example dataset of images with interesting queries
- Provide a code snippet using an external service to obtain images, e.g., the Flickr API.
- This will only be provided as an example and will not be used for the rest of the code.
- Let the participants use the BD Python library to obtain a key/token and submit requests to the BD-API gateway
- Provide the link to the current BD REST API and create a document/wiki page showing step-by-step screenshots of obtaining a key/token using the Python library.
- Write a Python script that will serve as a stub for the BD client
- The participants will fill in the BD REST API calls to submit their requests.
- Make sure OCR and face extractor are running before starting the demo
- Python
- Make sure Elasticsearch is started before the example files are submitted to the BD service
- Provide instructions to start Elasticsearch and start a web client to it for visualization.
- Make sure the cluster name in the config.yml differs for each participant.
- Once technical metadata is obtained from BD, index its tags and technical metadata in a locally running Elasticsearch.
- Write a python script that will index the technical metadata in ES
- Search for the image using ES query
- Provide ES query for search
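One possible shape for the indexing step of the Problem 1 stub is sketched below. The metadata field names ('extractor', 'content', 'tags') are assumptions about the BD extractor output, and the actual upload/extraction call is left to the stub script:

```python
def es_doc(filename, metadata):
    """Build the document indexed into Elasticsearch for one image.

    `metadata` is assumed to be a list of per-extractor results returned
    by BD; the 'extractor', 'content', and 'tags' keys are illustrative."""
    # Keep only OCR output as the searchable text body.
    text = " ".join(m.get("content", "") for m in metadata
                    if m.get("extractor", "").endswith("ocr"))
    # Collect tags from all extractors (e.g., the face extractor).
    tags = [t for m in metadata for t in m.get("tags", [])]
    return {"filename": filename, "ocr_text": text, "tags": tags}

def match_query(term):
    """Elasticsearch query body used to search images by embedded text."""
    return {"query": {"match": {"ocr_text": term}}}
```

The stub can then PUT each document into the local Elasticsearch (index name, document type, and ES endpoint to be fixed once the VM setup is decided) and run match_query against it.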
- Problem 2: Given a collection of text files from a survey or reviews for a book/movie, use the sentiment analysis extractor to calculate the sentiment value for each file and group similar values together. (Emphasizes extraction from unstructured data and useful analysis)
- A collection of text files with reviews
- Obtain an examples dataset from the web.
- Let the participants use the BD Python library to obtain a key/token and submit requests to the BD-API gateway
- Provide the link to the current BD REST API and create a document/wiki page showing step-by-step screenshots of obtaining a key/token using the Python library.
- Write a Python script that will serve as a stub for the BD client
- The participants will fill in the BD REST API calls to submit their requests.
- Make sure the Sentiment Analysis extractor is running
- Save the results for each text file in a single file with the corresponding values
- Provide code for this in stub script
- Create separate folders and move the file based on the sentiment value
- Provide code in the stub that will perform the above action
- (Optional) Index text files along with the sentiment values and use ES visualization tool to search for documents with sentiment value less than some number.
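The grouping step for the Problem 2 stub could look like the sketch below. The sentiment scale and the 0.5 bin width are assumptions; they should match whatever the sentiment analysis extractor actually returns:

```python
import os
import shutil

def bucket(value, width=0.5):
    """Folder name for a given sentiment score (bin width of 0.5 assumed)."""
    return "sentiment_%.1f" % (int(value // width) * width)

def group_files(scores, root="."):
    """Move each review file into the folder for its sentiment bucket.

    `scores` maps filename -> sentiment value, as parsed from the
    (assumed) sentiment extractor metadata."""
    for fname, value in scores.items():
        folder = os.path.join(root, bucket(value))
        os.makedirs(folder, exist_ok=True)
        shutil.move(fname, os.path.join(folder, os.path.basename(fname)))
```

The optional Elasticsearch step can then index each filename with its sentiment value and query with a range filter (e.g., sentiment less than some number).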
- Problem 3: Use BD conversion to convert a collection of images/ps/odp files to png/pdf/ppt. This demonstrates that if you have a directory with files in old file formats, you can just use BD to get them all converted. (Emphasizes conversion)
- Provide a Python script for this and let participants use the Python library to use the BD service
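The batch-conversion script could be organized around two small helpers like these. The format mapping is an example set, and the /dap/convert/<format>/ endpoint path is an assumption to be verified against the BD conversion API docs:

```python
import os

# Example mapping of legacy input formats to modern targets (assumed set).
CONVERSIONS = {".ps": ".pdf", ".odp": ".pptx", ".bmp": ".png"}

def target_name(path, conversions=CONVERSIONS):
    """Output filename for a legacy-format file, or None if not converted."""
    root, ext = os.path.splitext(path)
    new_ext = conversions.get(ext.lower())
    return root + new_ext if new_ext else None

def convert_url(base, out_fmt):
    """Assumed BD conversion endpoint: POST the file to
    <gateway>/dap/convert/<output-format>/ with the token header."""
    return base.rstrip("/") + "/dap/convert/" + out_fmt.lstrip(".") + "/"
```

The stub then walks the directory, skips files where target_name returns None, and POSTs the rest to convert_url with the participant's token.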
- Problem 4: Given a collection of *.xlsx files, obtain some results based on some column values. (Emphasizes extraction and analysis of scientific data)
An example could be: given a *.xlsx file with max and min temperature for each day of a month, calculate the monthly average max/min temperature and standard deviation.
- Convert the *.xlsx files to *.csv using the conversion API so that you can see the content of the files on the VM. We are not installing any office software on the VM.
- Use the extraction API to extract columns from the file
- Perform some analysis and add the results to the technical metadata
- Write an extractor/converter for this problem. This should be an enticing yet simple problem that can handle many spreadsheets and produce a result.
- Ideas
- An algebra 101, traveling trains problem. 2 trains leave 2 different stations on tracks heading toward a junction. Given a spreadsheet with departure times, distances, velocities, etc., upload all the spreadsheets and determine if they will crash.
- This problem is simple and would provide the user an easily understood problem that can clearly be scaled to much more involved traffic problems.
- However, it doesn't really present a cool new idea. It may be preferable to think of something involving more cutting-edge technology
- A bacterial growth model. Given a culture with varied conditions, eg. pH, stored in multiple spreadsheets, determine the growth rate. Might be able to base this on http://mathinsight.org/bacteria_growth_initial_model
- This would require a few minutes of explanation of the model and would require some learning by the developer of the extractor.
- Still maybe not that enticing.
- Better Ideas?
- Provide a Python script for obtaining the input files and use the BD REST API to obtain the result.
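The analysis half of the Problem 4 example (after the xlsx-to-csv conversion) could be as small as this. The column names date/max_temp/min_temp and ISO dates are assumptions about the example spreadsheet, which doesn't exist yet:

```python
import csv
import io
import statistics

def monthly_stats(csv_text, column="max_temp"):
    """Per-month mean and population stdev of one temperature column.

    Assumes the xlsx-to-csv conversion yields columns date,max_temp,min_temp
    with ISO dates; adjust to the real spreadsheet layout."""
    by_month = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        month = row["date"][:7]  # 'YYYY-MM'
        by_month.setdefault(month, []).append(float(row[column]))
    return {m: (statistics.mean(v), statistics.pstdev(v))
            for m, v in by_month.items()}
```

If the spreadsheet analysis instead lives in an extractor (as suggested above), this same function would run inside the extractor and its result would be attached as technical metadata.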
- Problem 5: Obtain Ameriflux data and convert it into the *.clim format (similar to CSV, but tab-separated) for the SNIPET model. Calculate the average air temperature and its standard deviation. (This will emphasize both conversion and analysis)
- Write an R/Python script to call the BD conversion API, get the data in *.clim format, and calculate the average air temperature. Also plot a graph of the data.
- Installation of the RStudio server version.
- (Optional) To calculate the average temperature, call the BD extraction service. For this, write an extractor that accepts a *.clim file and outputs the average temperature.
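The analysis step on the converted *.clim file could be sketched as follows; the column position of air temperature in the tab-separated output is an assumption to be fixed once the converter is written:

```python
import statistics

def air_temp_stats(clim_text, col=2):
    """Mean and population stdev of the air-temperature column in a
    tab-separated *.clim file; the column index is an assumption."""
    temps = [float(line.split("\t")[col])
             for line in clim_text.strip().splitlines()]
    # A plot could be added here with matplotlib, e.g. plt.plot(temps).
    return statistics.mean(temps), statistics.pstdev(temps)
```

The optional extractor variant would wrap this same computation and return the result as metadata instead of printing it.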
- Problem 6: Audio converter and speech-to-text extractor combined in a single script
How to add Your Tool to Brown Dog Services (1h 30m)
This is a session to teach how to add a user's tool to Brown Dog Services.
- Part 1: Write an extractor
- Start with the bd-template extractor, which is the word count extractor.
- Ask participants to modify the extractor to use 'grep' to find a specific pattern within the file.
- Include yes/no in the metadata depending on whether the pattern is found.
- Give a brief description of JSON-LD support.
- Provide the intuition behind JSON-LD and an example
- As another example, write the extractor that will be used for Problem 4.
- Provide Step-by-step instructions/screenshots of updating the extractor and the output as seen at the Clowder GUI.
- This needs to be more simplified than what we have, targeting users at beginner/intermediate level
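The grep-style modification and its JSON-LD-flavored result could look like the sketch below. The @vocab URL and the metadata field names are hypothetical, meant only to illustrate the yes/no output and the idea of a JSON-LD @context:

```python
import re

# Hypothetical vocabulary URL for the JSON-LD @context.
CONTEXT = {"@vocab": "http://example.org/tutorial/"}

def check_pattern(text, pattern):
    """The grep-style check participants add: 'yes' if the pattern occurs."""
    return "yes" if re.search(pattern, text) else "no"

def to_jsonld(filename, pattern, result):
    """Wrap the result as JSON-LD-style metadata; the field names are
    illustrative of what an extractor attaches, not the exact schema."""
    return {"@context": CONTEXT,
            "file": filename,
            "pattern": pattern,
            "pattern_found": result}
```

In the actual bd-template extractor, to_jsonld's output would be what gets posted back as metadata and shown in the Clowder GUI.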
- Part 2: Write a converter
- Start with the bd-template for converters: ImageMagick
- Ask the participants to modify the converter input/output formats in the comment section, and see the result using the Polyglot web UI for POST and GET
- Think of another software for which creating a converter is easy and interesting.
- Provide step-by-step instructions/screenshots of modifying imagemagick
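The comment-section modification could be illustrated with a wrapper sketch like the one below. The comment header (tool name, data type, input formats, output formats) mimics the bd-template convention described above, but the exact header field order must be checked against the actual template:

```python
#!/usr/bin/env python
#ImageMagick
#image
#bmp, gif, jpg, png, tif
#bmp, gif, jpg, png, tif
import subprocess
import sys

def convert_command(src, dst):
    """ImageMagick call; the output format is inferred from dst's extension."""
    return ["convert", src, dst]

if __name__ == "__main__" and len(sys.argv) == 3:
    subprocess.check_call(convert_command(sys.argv[1], sys.argv[2]))
```

Participants would edit only the two format lines in the header and watch the change appear in the Polyglot web UI.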
- Part 3: Uploading a converter or an extractor to locally installed Tools Catalog.
- Step-by-step procedure to upload a tool, an input file and an output file without a docker file
- Part 4 (Optional - For advanced user): Dockerize the tool
Use Contributors Landing Page for this part of the session.
Participants will be provided with a VM with all the required setup so that they can create their own tools.
Wrap up
- Tutorial feedback form
- Announcement of next user workshop