This page captures information about our support of Stephen Downie's faculty fellowship.
Overview
Title: Modeling the Massive HathiTrust Corpus: Creating Concept-based Representations of 15 Million Volumes
PI: J. Stephen Downie
Co-PIs: Peter Organisciak (HTRC, U. Denver); Boris Capitanu (HTRC); Craig Willis (NCSA)
In short, the fellowship is exploring the creation of reduced-dimensional term-topic matrices for the HathiTrust collection. This includes the exploration of scalable methods for dimension reduction/topic modeling (LSA/pLSA, LDA, autoencoders) for the full collection.
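As a point of reference for the LSA option named above, here is a minimal sketch of how a reduced-dimensional term-topic representation falls out of a truncated SVD of a document-term count matrix. It is illustrative only and not the fellowship's implementation: the function name `lsa_term_topic`, the toy matrix, and the use of plain power iteration with deflation (standing in for a scalable SVD such as the one in Spark) are all assumptions made for the example.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def norm(vector):
    return math.sqrt(sum(x * x for x in vector))


def lsa_term_topic(doc_term, n_topics, iters=200, seed=0):
    """Approximate the top singular triplets of a document-term matrix.

    Returns (singular_values, topic_terms, doc_coords), where
    topic_terms[k] holds one weight per term for topic k and
    doc_coords[k] holds one coordinate per document for topic k.
    """
    rng = random.Random(seed)
    residual = [[float(x) for x in row] for row in doc_term]
    n_terms = len(residual[0])
    singular_values, topic_terms, doc_coords = [], [], []
    for _topic in range(n_topics):
        # Random positive start vector, normalized to unit length.
        v = [rng.random() + 0.1 for _ in range(n_terms)]
        scale = norm(v)
        v = [x / scale for x in v]
        residual_t = transpose(residual)
        # Power iteration on (A^T A) converges to the top right singular vector.
        for _step in range(iters):
            w = matvec(residual_t, matvec(residual, v))
            scale = norm(w)
            if scale < 1e-12:
                break
            v = [x / scale for x in w]
        projected = matvec(residual, v)  # equals sigma * u
        sigma = norm(projected)
        if sigma < 1e-9:
            break  # nothing left to explain
        u = [x / sigma for x in projected]
        singular_values.append(sigma)
        topic_terms.append(v)
        doc_coords.append(projected)
        # Deflate: remove the component just found before the next topic.
        residual = [
            [residual[i][j] - sigma * u[i] * v[j] for j in range(n_terms)]
            for i in range(len(u))
        ]
    return singular_values, topic_terms, doc_coords


# Hypothetical toy data: 3 documents x 4 terms (flour, sugar, engine, piston).
doc_term = [
    [2, 2, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 3, 3],
]
singular_values, topic_terms, doc_coords = lsa_term_topic(doc_term, n_topics=2)
```

Each document is then described by `n_topics` coordinates instead of one count per vocabulary term, which is the reduction the fellowship is after. At the scale of the full collection this pure-Python loop would be replaced by a distributed implementation.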
Updates
12/14/2017
- BW access is finally in place as of 12/12. Can start the transfer process but need to enable a Globus endpoint for HT data.
- Allocation will be used for two different projects related to HTRC: the faculty fellowship and n-gramming of HT data. Will meet with both project teams on 12/15 to coordinate.
12/6/2017
- BW allocation approved, still waiting for access.
- Will work with Capitanu on sync'ing initial data for evaluation of deeplearning4j by end of week.
- Will meet with Co-PI Bhattacharyya on 12/11 about the BW project we are piggybacking on.
11/27/2017
- Conference call (Willis, Capitanu)
- Still waiting for BW allocation
- Boris explored deploying TensorFlow on a TORQUE cluster and concluded that it is too complicated, given that deeplearning4j's Spark integration already has a variational autoencoder implementation.
- Will focus on deeplearning4j for now. Craig to request update on BW access.
11/20/2017
- Conference call (Willis, Capitanu)
- Discussed TensorFlow vs. deeplearning4j for scalable autoencoder implementations
- Spark has support for SVD and LDA. Deeplearning4j adds autoencoders for Spark.
- Both can use GPUs
- Autoencoders
- Proposing to use Sparse autoencoders
- Hinton paper appears to be the motivation for applying autoencoders to text
Hinton and Salakhutdinov. Reducing the Dimensionality of Data with Neural Networks
Lecture on YouTube: https://www.youtube.com/watch?v=ARQ6PZh8vgE
Compare results to LSA only (on Reuters collection)
- TensorFlow has VariationalAutoEncoder implementation as does deeplearning4j
- For next meeting, will prepare the following:
- Shared access to either BW, ROGER, or IU (HTRC) cluster
- Download and prepare Ted's 100K English volumes (need collection information)
- Preliminary scaling of TensorFlow and deeplearning4j autoencoder with either Ted's or other collection
- Access to BW allocation, if possible
11/17/2017
- Peter delivered sample autoencoder implementation (set of Jupyter Notebooks)
- BW allocation approved. Will need to send project information to initiate accounts.
11/13/2017
- Conference call (Willis, Capitanu)
- Spark has scalable SVD implementation (also LDA)
- IU has cluster with identical architecture to BW
- Ran original feature extraction code on BW
- Will review options and check in next week
11/12/2017
- Conference call (Downie, Organisciak, Willis)
- Most successful autoencoder implementation is from Google/TensorFlow
- LSA may be possible at scale
- Peter has an example running on a small set of cookbooks on an HTRC server
- Trimmed vocabulary
- Might keep subset (e.g., ~4 million most common words)
- Run over sub-batches of extracted features to determine which words are more useful
- Should run at page (not volume) level
- Published extracted feature dataset
- https://wiki.htrc.illinois.edu/display/COM/Extracted+Features+Dataset
- Does not have extracted features
- Extracted features
- rsync just volume IDs
- Get Peter's code running in that environment
- Will need to efficiently sample from those books
- out of the box autoencoder code
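The vocabulary trimming discussed in this meeting (count words over sub-batches of extracted features, keep only the most common ones, and work at the page level) could be sketched roughly as follows. This is an illustration only: it assumes each page is available as a token-to-count mapping, and the function names, the toy cookbook-style pages, and the tiny vocabulary size are hypothetical rather than taken from Peter's code or the HTRC tooling.

```python
from collections import Counter


def count_batch(pages):
    """Sum token counts over one sub-batch of pages (token -> count dicts)."""
    counts = Counter()
    for page in pages:
        counts.update(page)  # adds the per-page counts to the running total
    return counts


def build_vocabulary(batches, max_size):
    """Merge per-batch counts and keep the max_size most common tokens."""
    total = Counter()
    for batch in batches:
        total.update(count_batch(batch))
    return [token for token, _ in total.most_common(max_size)]


def trim_page(page, vocabulary):
    """Drop tokens that are outside the trimmed vocabulary."""
    return {token: n for token, n in page.items() if token in vocabulary}


# Hypothetical page-level token counts, split into two sub-batches.
batch_1 = [
    {"flour": 3, "sugar": 2, "the": 10},
    {"the": 8, "oven": 1},
]
batch_2 = [
    {"the": 5, "flour": 1, "whisk": 1},
]

vocabulary = build_vocabulary([batch_1, batch_2], max_size=3)
trimmed = trim_page(batch_2[0], set(vocabulary))
```

Because only the merged counter has to be held in memory, the sub-batches can be processed one at a time (or in parallel and merged afterwards). Note that ranking purely by frequency keeps function words such as "the"; deciding which words are actually "more useful" would need an additional criterion that the notes do not specify.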
11/7/2017
- Attended Faculty Fellows reception at NCSA.
9/17/2017
- Submitted Blue Waters proposal (in cooperation with Sayan Bhattacharyya and José Eduardo González)
- Title: Text Analysis of Books from the HathiTrust Digital Library to Characterize Descriptivity in Writing
7/31/2017
Kick off
By Aug 2 - 'To Read' list (Peter)
This week - Peter and Boris touch base on initial workflow -> tractability of Peter's initial tinkering
By Aug 11 - Wrap up lit review
Aug 14 - Tentative meeting
References
- Geoffrey Hinton, Ruslan Salakhutdinov. Reducing the Dimensionality of Data with Neural Networks. Science, Volume 313, 2006.
- Ruslan Salakhutdinov, Geoffrey Hinton. Semantic Hashing. International Journal of Approximate Reasoning, Volume 50, Issue 7, 2009, Pages 969-978, ISSN 0888-613X, https://doi.org/10.1016/j.ijar.2008.11.006.
- Dipayan Dev. Deep Learning with Hadoop. http://proquest.safaribooksonline.com.proxy2.library.illinois.edu/book/databases/hadoop/9781787124769