TORCHLITE Hackathon Handbook

About TORCHLITE

Tools for Open Research and Computation with HathiTrust: Leveraging Intelligent Text Extraction (TORCHLITE)

The TORCHLITE project extends volume-level, page-level, and fine-grained access to our Extracted Features (EF) dataset, which contains metadata and statistical information extracted from the full-text data representing 17.5 million digitized volumes in the HathiTrust Digital Library. As part of TORCHLITE, HTRC has developed a well-documented API that allows our user community to develop its own tools for interacting with the data. We are also developing a web-based, interactive visualization dashboard that uses the API.

Through this hackathon, we aim to develop and promote these tools and the API. The EF dataset contains nearly 3 trillion tokens representing over 6 billion pages, making it arguably the largest open dataset of its kind in the world that is readily available to digital humanities scholars and other researchers.

Potential Use Cases

There are many possible uses for the Extracted Features API. It allows retrieval of highly targeted subsets of the EF dataset, down to the level of individual volumes or even pages, if desired. The API supports retrieval of specific volume-level metadata elements, such as title, publisher, date of publication, and genre. It also supports retrieval of volume-level and page-level features: token counts, parts of speech, and line and sentence counts.
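
As a rough illustration of how page-level and volume-level token counts relate, the sketch below sums page-level counts into a single volume-level tally. It makes no API call. The sample_volume dictionary is hand-written, and its field names (metadata, features, pages, body, tokenPosCount, pubDate) are assumptions modeled on the EF file format, not a confirmed API response. Consult the API documentation for the actual request and response shapes.

```python
from collections import Counter

# Hypothetical, hand-written sample. The field names are assumptions modeled
# on the EF file format; they are not a confirmed API response.
sample_volume = {
    "metadata": {
        "title": "An Example Volume",
        "publisher": "Example Press",
        "pubDate": 1900,
        "genre": "fiction",
    },
    "features": {
        "pages": [
            {
                "seq": "00000001",
                "body": {"tokenPosCount": {"whale": {"NN": 2}, "the": {"DT": 3}}},
            },
            {
                "seq": "00000002",
                "body": {
                    "tokenPosCount": {
                        "whale": {"NN": 1},
                        "sea": {"NN": 4},
                        "sail": {"NN": 1, "VB": 2},
                    }
                },
            },
        ]
    },
}


def page_token_counts(page):
    """Return token -> count for one page, summing across part-of-speech tags."""
    counts = Counter()
    body = page.get("body") or {}  # tolerate pages with no body section
    for token, pos_counts in (body.get("tokenPosCount") or {}).items():
        counts[token.lower()] += sum(pos_counts.values())
    return counts


def volume_token_counts(volume):
    """Return token -> count for a whole volume by adding up its pages."""
    total = Counter()
    for page in volume["features"]["pages"]:
        total.update(page_token_counts(page))
    return total


counts = volume_token_counts(sample_volume)
print(sample_volume["metadata"]["title"])
for token, count in sorted(counts.items()):
    print(token, count)
```

A tally like this could feed a word cloud or a frequency chart in a code notebook.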

Widgets and code notebooks that make use of HathiTrust data via this API can potentially be embedded in blog posts and articles, enabling authors to give their readers access to live, interactive illustrations of their analyses. Library catalogs or other public-facing databases might likewise embed a lightweight data visualization (for example, a word cloud) that offers a new, rich view of the text described in a particular bibliographic record.

Community members can develop against the API in a variety of ways, including using code notebooks such as Jupyter, Colab, or Observable to create data visualizations and analyses. More advanced users might develop their own widgets or applications that consume data from the EF API.

With the help of our user community, we hope to expand our own hosted set of TORCHLITE widgets and tools and present them in the TORCHLITE dashboard. But perhaps the main purpose of the TORCHLITE EF APIs and frameworks is to allow the community to imagine and create its own ways to make use of HathiTrust’s extensive open-access data.