2 Extracted Features
2.1 HTRC Extracted Features (EF) Dataset
- Open access under a Creative Commons Attribution 4.0 International License, and freely downloadable
- Structured data consisting of human-created (catalog) metadata and algorithmically-derived features
- Representing 17.1 million volumes, including those still under copyright (coverage is not quite in sync with the HathiTrust Digital Library)
- Linked-data compliant (JSON-LD)
- Complete documentation available at https://go.illinois.edu/EF20_documentation
2.2 What are “Extracted Features”?
Extracted features are computationally accessible data elements “extracted” (i.e., “derived”) from the HathiTrust Digital Library. They include:
- Volume- and page-level metadata
- Textual and statistical data and word-level metadata
- Extracted from the raw full text of volumes in the Digital Library
EF lets researchers access the data they need and begin analyzing text with some standard natural language and statistical preprocessing already done for them.
2.3 Per-volume features
Excerpted from catalog metadata, including:
- Title
- Author
- Language
- Publication information
- Identifiers
- [in future releases: Subjects]
Diagram of an EF file.
JSON pretty print view of an EF file
Newly available as part of the TORCHLITE framework (but not yet included in older EF versions):
- Token counts aggregated by volume
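To make "token counts aggregated by volume" concrete, here is a minimal sketch of how page-level counts could be rolled up into a single per-volume tally. It assumes each page carries a `body.tokenPosCount` mapping of token to part-of-speech counts; that layout and the sample data are assumptions for illustration, so check the EF documentation for the actual schema. The function name `aggregate_token_counts` is hypothetical.

```python
from collections import Counter


def aggregate_token_counts(pages):
    """Sum page-level token counts into one volume-level Counter.

    Assumes (not confirmed by this document) that each page looks like
    {"body": {"tokenPosCount": {token: {pos_tag: count}}}}.
    """
    totals = Counter()
    for page in pages:
        body = page.get("body") or {}  # tolerate pages with no body section
        for token, pos_counts in (body.get("tokenPosCount") or {}).items():
            # Collapse the part-of-speech breakdown into one count per token.
            totals[token] += sum(pos_counts.values())
    return totals


# Made-up sample pages, for illustration only.
sample_pages = [
    {"seq": "00000001", "body": {"tokenPosCount": {"whale": {"NN": 2}, "the": {"DT": 5}}}},
    {"seq": "00000002", "body": {"tokenPosCount": {"whale": {"NN": 1, "NNP": 1}}}},
    {"seq": "00000003", "body": None},
]

totals = aggregate_token_counts(sample_pages)
print(totals.most_common(2))  # [('the', 5), ('whale', 4)]
```

With the newer aggregated data this roll-up would already be done for you; the sketch only shows what the aggregation amounts to.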
2.4 Extracted Features API
The Extracted Features API provides programmatic access to the Extracted Features files. It powers HTRC’s TORCHLITE dashboard and also lets users connect their own code notebooks, widgets, or applications.
The EF API Documentation fully documents the available API calls and allows users to request sample calls in a number of different coding languages.
2.4.1 API CALLS
GET EF data for a volume by volume id
- https://data.htrc.illinois.edu/ef-api/volumes/{clean-htid}
Check if a volume exists (HEAD)
- https://data.htrc.illinois.edu/ef-api/volumes/{clean-htid}
GET volume metadata by volume id
- https://data.htrc.illinois.edu/ef-api/volumes/{clean-htid}/metadata
GET subset of pages of volume by volume id
- https://data.htrc.illinois.edu/ef-api/volumes/{clean-htid}/pages
Create workset (POST)
- https://data.htrc.illinois.edu/ef-api/worksets
DELETE workset by workset id
- https://data.htrc.illinois.edu/ef-api/worksets/{workset-id}
GET workset
- https://data.htrc.illinois.edu/ef-api/worksets/{workset-id}
GET workset volumes by workset id
- https://data.htrc.illinois.edu/ef-api/worksets/{workset-id}/volumes
GET workset volumes (aggregated) by workset id
- https://data.htrc.illinois.edu/ef-api/worksets/{workset-id}/volumes/aggregated
GET workset volumes metadata by workset id
- https://data.htrc.illinois.edu/ef-api/worksets/{workset-id}/metadata
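The volume endpoints above can be called from a notebook with nothing more than the Python standard library. The sketch below is a hedged illustration, not the official client. It assumes that `{clean-htid}` follows the usual HathiTrust filename-safe convention (after the library prefix, `:` becomes `+`, `/` becomes `=`, and `.` becomes `,`), that a missing volume yields HTTP 404, and that responses are JSON. The helper names (`clean_htid`, `volume_url`, `volume_exists`, `get_json`) are hypothetical.

```python
import json
import urllib.error
import urllib.request

BASE_URL = "https://data.htrc.illinois.edu/ef-api"


def clean_htid(htid: str) -> str:
    """Convert a HathiTrust volume id to its filename-safe ("clean") form.

    Assumed convention: only the part after the library prefix is rewritten.
    """
    lib, _, vol = htid.partition(".")
    vol = vol.replace(":", "+").replace("/", "=").replace(".", ",")
    return f"{lib}.{vol}"


def volume_url(htid: str, resource: str = "") -> str:
    """Build a volume endpoint URL; resource may be "", "metadata", or "pages"."""
    url = f"{BASE_URL}/volumes/{clean_htid(htid)}"
    return f"{url}/{resource}" if resource else url


def volume_exists(htid: str, timeout: float = 30) -> bool:
    """Issue a HEAD request; treat 404 as "not found" (an assumption)."""
    request = urllib.request.Request(volume_url(htid), method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise  # surface other HTTP errors instead of hiding them


def get_json(url: str, timeout: float = 30):
    """GET a URL and parse the body as JSON."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

A typical use would be `get_json(volume_url(htid, "metadata"))` for a volume id you already know. The workset endpoints follow the same URL pattern under `/worksets/{workset-id}`, but the request body for creating a workset is not described here, so consult the EF API Documentation before issuing POST calls.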
2.5 OBSERVABLE Notebooks
OBSERVABLE Documentation (Shorter)