This page captures information about our support of Stephen Downie's faculty fellowship.
Overview
Title: Modeling the Massive HathiTrust Corpus: Creating Concept-based Representations of 15 Million Volumes
Link: http://www.ncsa.illinois.edu/about/fellows_awardees/modeling_the_massive_hathitrust_corpus_creating_concept_based_representatio
PI: J. Stephen Downie
Co-PIs: Peter Organisciak (HTRC, U. Denver); Boris Capitanu (HTRC); Craig Willis (NCSA)
In short, the fellowship is exploring the creation of reduced-dimensional term-topic matrices for the HathiTrust collection. This includes the exploration of scalable methods for dimension reduction/topic modeling (LSA/pLSA, LDA, autoencoders) for the full collection.
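As a minimal illustration of the core idea (not the project's actual code), the LSA-style reduction can be sketched as a truncated SVD over a toy term-document matrix; the data and dimensions below are invented for illustration:

```python
import numpy as np

# Toy term-document count matrix: rows = terms, columns = documents.
# (In the real project this would be built from HTRC extracted features.)
X = np.array([
    [3, 0, 1, 0],
    [2, 0, 0, 1],
    [0, 4, 0, 2],
    [0, 3, 1, 3],
    [1, 0, 5, 0],
], dtype=float)

# Truncated SVD: X is approximated by U_k @ diag(s_k) @ Vt_k,
# keeping only the top-k singular values.
k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Reduced-dimensional representations: each term and each document
# becomes a k-dimensional "topic" vector.
term_topics = U_k * s_k        # shape (n_terms, k)
doc_topics = Vt_k.T * s_k      # shape (n_docs, k)

print(term_topics.shape, doc_topics.shape)  # (5, 2) (4, 2)
```

Scaling this same factorization to 15 million volumes is exactly the part that needs the distributed (Spark) implementations discussed below.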
Updates
12/14/2017
- BW access finally in place as of 12/12; can start transfer process but need to enable Globus endpoint for HT data.
- Allocation will be used for two different projects related to HTRC – faculty fellowship and ngramming of HT data. Will meet with both project teams on 12/15 to coordinate.
12/6/2017
- BW allocation approved, still waiting for access.
- Will work with Capitanu on sync'ing initial data for evaluation of deeplearning4j by end of week.
- Will meet with Co-PI Bhattacharyya 12/11 about the BW project we are piggy-backing on.
11/27/2017
- Conference call (Willis, Capitanu)
- Still waiting for BW allocation
- Boris explored deploying TensorFlow on the TORQUE cluster and concluded that it's too complicated, given that deeplearning4j on Spark already has a variational autoencoder implementation.
- Will focus on deeplearning4j for now. Craig to request an update on BW access.
11/20/2017
- Conference call (Willis, Capitanu)
- Discussed TensorFlow vs. deeplearning4j for scalable autoencoder implementations
- Spark has support for SVD and LDA. Deeplearning4j adds autoencoders for Spark.
- Both can use GPUs
- Autoencoders
- For next meeting, will prepare the following:
- Shared access to either BW, ROGER, or IU (HTRC) cluster
- Download and prepare Ted's 100K English volumes (need collection information)
- Preliminary scaling of TensorFlow and deeplearning4j autoencoders with either Ted's or another collection
- Access to BW allocation, if possible
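Spark MLlib is the planned route for SVD and LDA at scale; as a small single-machine stand-in (not Spark code, and not the project's actual pipeline), scikit-learn's LDA shows the shape of the computation on synthetic page-level counts:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Synthetic page-level term counts: rows = pages, columns = vocabulary terms.
# Sizes are invented; real inputs would come from HTRC extracted features.
rng = np.random.default_rng(0)
counts = rng.integers(0, 5, size=(20, 50))

# Fit a small LDA model; n_components is the number of topics.
lda = LatentDirichletAllocation(n_components=3, random_state=0)
page_topics = lda.fit_transform(counts)  # each row: the page's topic distribution

print(page_topics.shape)  # (20, 3)
```

The distributed versions (Spark MLlib's SVD/LDA, deeplearning4j's Spark autoencoders) expose essentially the same matrix-in, low-dimensional-representation-out interface.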
11/17/2017
- Peter delivered sample autoencoder implementation (set of Jupyter Notebooks)
- BW allocation approved. Will need to send project information to initiate accounts.
11/13/2017
- Conference call (Willis, Capitanu)
- Spark has scalable SVD implementation (also LDA)
- IU has cluster with identical architecture to BW
- Ran original feature extraction code on BW
- Will review options and check in next week
11/12/2017
- Conference call (Downie, Organisciak, Willis)
- Most successful autoencoder implementation is from Google/TensorFlow
- LSA may be possible at scale
- Peter has an example running on a small set of cookbooks on the HTRC server
- Trimmed vocabulary
- Might keep subset (e.g., ~4 million most common words)
- Run over sub-batches of extracted features to determine which words are more useful
- Should run at page (not volume) level
- Published extracted feature dataset
- Will need to efficiently sample from those books
- Out-of-the-box autoencoder code
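None of the notebook code from this call is reproduced here, but a minimal NumPy autoencoder sketch (one tanh encoder layer, linear decoder, plain gradient descent on synthetic data) illustrates the dimensionality-reduction step under discussion; all sizes and hyperparameters are invented:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "page vectors": 64 pages x 32-term vocabulary, values in [0, 1).
X = rng.random((64, 32))

n_in, n_hidden = X.shape[1], 8           # compress 32 dims -> 8
W1 = rng.normal(0, 0.1, (n_in, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(0, 0.1, (n_hidden, n_in))
b2 = np.zeros(n_in)

def forward(X):
    H = np.tanh(X @ W1 + b1)             # encoder: low-dimensional codes
    Y = H @ W2 + b2                      # linear decoder: reconstruction
    return H, Y

lr = 0.5
first_loss = None
for step in range(300):
    H, Y = forward(X)
    err = Y - X                          # reconstruction error
    loss = (err ** 2).mean()
    if first_loss is None:
        first_loss = loss
    # Backprop for mean squared error with a tanh hidden layer.
    gY = 2 * err / err.size
    gW2 = H.T @ gY
    gb2 = gY.sum(axis=0)
    gH = (gY @ W2.T) * (1 - H ** 2)
    gW1 = X.T @ gH
    gb1 = gH.sum(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

H, Y = forward(X)
final_loss = ((Y - X) ** 2).mean()
print(first_loss, final_loss)            # reconstruction loss drops with training
```

The trained encoder output `H` plays the role of the reduced-dimensional representation; TensorFlow or deeplearning4j would replace this hand-rolled loop at scale.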
11/7/2017
- Attended Faculty Fellows reception at NCSA.
9/17/2017
- Submitted Blue Waters proposal (in cooperation with Sayan Bhattacharyya and José Eduardo González)
- Title: Text Analysis of Books from the HathiTrust Digital Library to Characterize Descriptivity in Writing
7/31/2017
Kick off
By Aug 2 - 'To Read' list (Peter)
This week - Peter and Boris touch base on initial workflow -> tractability of Peter's initial tinkering
By Aug 11 - Wrap up lit review
Aug 14 - Tentative meeting
References
- Geoffrey Hinton, Ruslan Salakhutdinov. Reducing the Dimensionality of Data with Neural Networks. Science, Volume 313, 2006.
- Ruslan Salakhutdinov, Geoffrey Hinton. Semantic hashing. International Journal of Approximate Reasoning, Volume 50, Issue 7, 2009, Pages 969-978, ISSN 0888-613X, https://doi.org/10.1016/j.ijar.2008.11.006.
- Dipayan Dev. Deep Learning with Hadoop. http://proquest.safaribooksonline.com.proxy2.library.illinois.edu/book/databases/hadoop/9781787124769