This page tracks our support of J. Stephen Downie's NCSA faculty fellowship.

Overview

Title: Modeling the Massive HathiTrust Corpus: Creating Concept-based Representations of 15 Million Volumes

Link: http://www.ncsa.illinois.edu/about/fellows_awardees/modeling_the_massive_hathitrust_corpus_creating_concept_based_representatio

PI: J. Stephen Downie

Co-PIs: Peter Organisciak (HTRC, U. Denver); Boris Capitanu (HTRC); Craig Willis (NCSA)

In short, the fellowship is exploring the creation of reduced-dimensional term-topic matrices for the HathiTrust collection, including scalable methods for dimensionality reduction and topic modeling (LSA/pLSA, LDA, autoencoders) across the full collection.
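As a rough, hypothetical illustration of the idea (not the project's actual pipeline), the Python sketch below reduces a toy sparse term-count matrix to a low-rank "concept" space with truncated SVD, the LSA-style approach listed above; all sizes and library choices are assumptions.

```python
# Hypothetical, minimal illustration of the concept-representation idea:
# reduce a sparse term-count matrix to a low-rank "concept" space via
# truncated SVD (LSA). Sizes and libraries are assumptions, not the pipeline.
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

n_volumes, n_terms, n_concepts = 1000, 20000, 100  # toy sizes, not 15M volumes

# Stand-in for a term-count matrix built from HTRC extracted features
X = sparse_random(n_volumes, n_terms, density=0.001, format="csr", random_state=0)

svd = TruncatedSVD(n_components=n_concepts, random_state=0)
volume_concepts = svd.fit_transform(X)   # volumes x concepts
term_concepts = svd.components_.T        # terms x concepts
print(volume_concepts.shape, term_concepts.shape)
```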

Updates

12/14/2017

  • BW access finally in place as of 12/12; we can start the transfer process but still need to enable a Globus endpoint for the HT data.
  • The allocation will be used for two HTRC-related projects: the faculty fellowship and ngramming of HT data. Will meet with both project teams on 12/15 to coordinate.

12/6/2017

  • BW allocation approved, still waiting for access.
  • Will work with Capitanu on syncing initial data for evaluation of deeplearning4j by end of week.
  • Will meet with Co-PI Bhattacharyya on 12/11 about the BW project we are piggybacking on.

11/27/2017

  • Conference call (Willis, Capitanu)
  • Still waiting for BW allocation
  • Boris explored deploying TensorFlow on the TORQUE cluster and concluded it's too complicated, especially since deeplearning4j's Spark integration already includes a variational autoencoder implementation (an illustrative sketch follows this entry).
  • Will focus on deeplearning4j for now. Craig to request an update on BW access.
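For orientation only: the project is evaluating deeplearning4j's Spark VAE (Java), but the minimal Keras sketch below shows what a variational autoencoder computes over term-count vectors. The Keras API choice, layer sizes, and bag-of-words framing are illustrative assumptions, not the deeplearning4j configuration.

```python
# Minimal VAE sketch in Keras, illustrating the technique only; the project
# itself targets deeplearning4j's Spark VAE. All sizes are toy assumptions.
import tensorflow as tf
from tensorflow import keras

vocab_size, latent_dim = 10000, 64  # assumed toy dimensions

class Sampling(keras.layers.Layer):
    """Reparameterization trick: z = mu + sigma * eps, plus the KL loss term."""
    def call(self, inputs):
        mu, log_var = inputs
        # KL divergence between q(z|x) and a standard normal prior
        self.add_loss(-0.5 * tf.reduce_mean(
            tf.reduce_sum(1.0 + log_var - tf.square(mu) - tf.exp(log_var), axis=-1)))
        eps = tf.random.normal(tf.shape(mu))
        return mu + tf.exp(0.5 * log_var) * eps

# Encoder: term-count vector -> parameters of a latent "concept" Gaussian
inputs = keras.Input(shape=(vocab_size,))
h = keras.layers.Dense(256, activation="relu")(inputs)
z_mean = keras.layers.Dense(latent_dim)(h)
z_log_var = keras.layers.Dense(latent_dim)(h)
z = Sampling()([z_mean, z_log_var])

# Decoder: latent vector -> reconstructed term distribution
outputs = keras.layers.Dense(vocab_size, activation="softmax")(z)

vae = keras.Model(inputs, outputs)
vae.compile(optimizer="adam", loss="categorical_crossentropy")
```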

11/20/2017

  • Conference call (Willis, Capitanu)
  • Discussed TensorFlow vs. deeplearning4j for scalable autoencoder implementations (a minimal Spark sketch follows this entry)
    • Spark has built-in support for SVD and LDA; deeplearning4j adds autoencoders for Spark.
    • Both can use GPUs
  • For next meeting, will prepare the following:
    • Shared access to either BW, ROGER, or IU (HTRC) cluster
    • Download and prepare Ted's 100K English volumes (need collection information)
    • Preliminary scaling of the TensorFlow and deeplearning4j autoencoders with either Ted's or another collection
    • Access to BW allocation, if possible
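A minimal pyspark sketch of the built-in Spark SVD mentioned above, for orientation only; the toy matrix, sizes, and app name are made-up assumptions, not project data.

```python
# Hedged sketch of Spark's distributed SVD (RowMatrix.computeSVD).
# Row contents stand in for per-volume term-count vectors.
from pyspark.sql import SparkSession
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix

spark = SparkSession.builder.appName("svd-sketch").getOrCreate()

rows = spark.sparkContext.parallelize([
    Vectors.dense([1.0, 0.0, 7.0, 0.0]),
    Vectors.dense([2.0, 0.0, 3.0, 4.0]),
    Vectors.dense([4.0, 0.0, 0.0, 6.0]),
])
mat = RowMatrix(rows)

# Truncated SVD: U (volumes x k), s (k singular values), V (terms x k)
svd = mat.computeSVD(2, computeU=True)
print(svd.s)
spark.stop()
```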


11/13/2017

  • Conference call (Willis, Capitanu)
  • Spark has a scalable SVD implementation (also LDA); a minimal LDA sketch follows this entry
  • IU has a cluster with an architecture identical to BW's
  • Ran original feature extraction code on BW
  • Will review options and check in next week
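And a matching hedged pyspark sketch of Spark's distributed LDA, the other method noted above; the toy corpus and parameters (k, maxIter) are assumptions for illustration only.

```python
# Hedged sketch of Spark's distributed LDA on toy term-count vectors.
from pyspark.sql import SparkSession
from pyspark.ml.clustering import LDA
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("lda-sketch").getOrCreate()

# (id, term-count vector) rows standing in for volumes or pages
df = spark.createDataFrame([
    (0, Vectors.dense([1.0, 2.0, 0.0, 0.0])),
    (1, Vectors.dense([0.0, 1.0, 3.0, 1.0])),
    (2, Vectors.dense([4.0, 0.0, 0.0, 2.0])),
], ["id", "features"])

lda = LDA(k=2, maxIter=10)       # k topics, assumed parameters
model = lda.fit(df)
model.describeTopics(3).show()   # top terms per topic
spark.stop()
```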


11/12/2017

  • Conference call (Downie, Organisciak, Willis)
  • The most successful autoencoder implementation is from Google/TensorFlow
  • LSA may be possible at scale
  • Peter has an example running on a small set of cookbooks on the HTRC server (the vocabulary-trimming idea is sketched after this list)
    • Trimmed vocabulary
    • Might keep a subset (e.g., the ~4 million most common words)
    • Run over sub-batches of extracted features to determine which words are most useful
    • Should run at the page (not volume) level
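A small, hypothetical sketch of the vocabulary-trimming idea above: stream sub-batches of page-level extracted-feature counts and keep only the most common terms. The data shapes, the trim_vocabulary helper, and the cutoff are invented for illustration.

```python
# Sketch of vocabulary trimming over sub-batches of page-level term counts.
# Shapes and the keep=3 cutoff are toy assumptions, not project settings.
from collections import Counter

def trim_vocabulary(page_batches, keep=100):
    """page_batches yields lists of {term: count} dicts, one dict per page."""
    totals = Counter()
    for batch in page_batches:
        for page_counts in batch:
            totals.update(page_counts)
    return {term for term, _ in totals.most_common(keep)}

# Toy usage: two sub-batches of two pages each
batches = [
    [{"the": 9, "whale": 2}, {"the": 7, "ship": 1}],
    [{"the": 5, "sea": 3}, {"whale": 4, "sea": 2}],
]
print(trim_vocabulary(batches, keep=3))  # e.g., {'the', 'whale', 'sea'}
```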

11/7/2017

  • Attended Faculty Fellows reception at NCSA.


9/17/2017

  • Submitted Blue Waters proposal (in cooperation with Sayan Bhattacharyya and José Eduardo González)
  • Title: Text Analysis of Books from the HathiTrust Digital Library to Characterize Descriptivity in Writing


7/31/2017 

  • Kick-off
  • By Aug 2 - 'To Read' list (Peter)
  • This week - Peter and Boris touch base on initial workflow -> tractability of Peter's initial tinkering
  • By Aug 11 - Wrap up lit review
  • Aug 14 - Tentative meeting


