2  Extracted Features

2.1 HTRC Extracted Features (EF) Dataset

2.2 What are “Extracted Features”?

Extracted features are computationally accessible data elements “extracted” (i.e., “derived”) from the raw full text of volumes in the HathiTrust Digital Library. They include:

  • Volume- and page-level metadata
  • Textual and statistical data and word-level metadata

EF gives researchers access to the data they need and lets them begin their text analysis with some standard natural language and statistical preprocessing already done.

2.3 Per-volume features

Per-volume features are excerpted from catalog metadata, including:

  • Title
  • Author
  • Language
  • Publication information
  • Identifiers
  • [in future releases: Subjects]

Diagram of an EF file.

JSON pretty print view of an EF file

Newly available as part of the TORCHLITE framework (but not yet included in older EF versions):

  • Token counts aggregated by volume
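
To illustrate what aggregating token counts by volume means, here is a minimal sketch that sums page-level counts into a single count per token for the whole volume. The page data and the aggregate_token_counts helper are made up for illustration; they are not the actual EF schema or TORCHLITE code.

```python
from collections import Counter


def aggregate_token_counts(pages):
    """Sum per-page token counts into one volume-level Counter."""
    volume_counts = Counter()
    for page_counts in pages:
        # Counter.update adds the counts from each page mapping.
        volume_counts.update(page_counts)
    return volume_counts


# Hypothetical page-level counts (illustrative only, not real EF data).
pages = [
    {"whale": 2, "sea": 1},
    {"whale": 1, "ship": 3},
]
volume_counts = aggregate_token_counts(pages)
```

With volume-level counts already provided, a user would not need to perform this summation over pages themselves.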

2.4 Extracted Features API

The Extracted Features API provides programmatic access to the Extracted Features files. It powers HTRC’s TORCHLITE dashboard and also lets users connect their own code notebooks, widgets, or applications.

The EF API Documentation fully documents the available API calls and allows users to request sample calls in a number of different coding languages.

2.4.1 API CALLS

  • GET EF data for a volume by volume id

    • https://data.htrc.illinois.edu/ef-api/volumes/{clean-htid}
  • Check if a volume exists (HEAD)

    • https://data.htrc.illinois.edu/ef-api/volumes/{clean-htid}
  • GET volume metadata by volume id

    • https://data.htrc.illinois.edu/ef-api/volumes/{clean-htid}/metadata
  • GET subset of pages of volume by volume id

    • https://data.htrc.illinois.edu/ef-api/volumes/{clean-htid}/pages
  • Create workset (POST)

    • https://data.htrc.illinois.edu/ef-api/worksets
  • DELETE workset by workset id

    • https://data.htrc.illinois.edu/ef-api/worksets/{workset-id}
  • GET workset

    • https://data.htrc.illinois.edu/ef-api/worksets/{workset-id}
  • GET workset volumes by workset id

    • https://data.htrc.illinois.edu/ef-api/worksets/{workset-id}/volumes
  • GET workset volumes (aggregated) by workset id

    • https://data.htrc.illinois.edu/ef-api/worksets/{workset-id}/volumes/aggregated
  • GET workset volumes metadata by workset id

    • https://data.htrc.illinois.edu/ef-api/worksets/{workset-id}/metadata
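
The calls above could be wrapped in a few small helpers. The following is a minimal sketch using only the Python standard library, not an official client. The helper names (volume_url, workset_url, get_json, volume_exists) are hypothetical, and the sketch assumes that responses are JSON and that a missing volume answers a HEAD request with HTTP 404; check the EF API Documentation for the actual response formats, status codes, and any query parameters.

```python
import json
import urllib.error
import urllib.request

BASE_URL = "https://data.htrc.illinois.edu/ef-api"


def volume_url(clean_htid, part=""):
    """Build a volume URL; part may be "", "metadata", or "pages"."""
    url = f"{BASE_URL}/volumes/{clean_htid}"
    return f"{url}/{part}" if part else url


def workset_url(workset_id, part=""):
    """Build a workset URL; part may be "", "volumes",
    "volumes/aggregated", or "metadata"."""
    url = f"{BASE_URL}/worksets/{workset_id}"
    return f"{url}/{part}" if part else url


def get_json(url):
    """Send a GET request and parse the body (assumed to be JSON)."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def volume_exists(clean_htid):
    """Send a HEAD request; treating 404 as "missing" is an assumption."""
    req = urllib.request.Request(volume_url(clean_htid), method="HEAD")
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise
```

For example, get_json(volume_url(clean_htid, "metadata")) would request the metadata for one volume, and get_json(workset_url(workset_id, "volumes/aggregated")) would request the aggregated volumes of a workset. The identifiers are inserted into the URL as given, so clean_htid must already be in the cleaned form the API expects.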

2.5 OBSERVABLE Notebooks

OBSERVABLE Documentation

OBSERVABLE Documentation (Shorter)