HTRC Workset Toolkit Development Library
HTRC Data Capsule Service
The HTRC Data Capsule Service provisions virtual machines (VMs) to researchers within the HTRC secure environment. The VM and software environment (including the SDK) together form a Capsule. Each researcher has exclusive use of the Capsule for a period of weeks or months during which they can configure their own environment for performing research on HathiTrust Digital Library texts, including both in-copyright and public domain volumes.
Each Capsule has both a maintenance mode and a secure mode. In secure mode, network access is restricted to the HTRC Data API and some HTDL resources, allowing text and image data to be downloaded to the Capsule.
Any changes made on the non-secure volumes are reverted when leaving secure mode, so persistent code changes must occur in maintenance mode. The SDK addresses these connectivity issues with the htrc.mock library.
Mock Testing
Mock testing uses simulated objects or functions to mimic the behavior of real code in controlled ways.
The HTRC Workset Toolkit implements a mock of the Data API access layer in htrc.mock.volumes. The Data API server is only accessible from a Capsule in secure mode. Because the mock provides functions with the same call signatures that return the same data types, workflows that rely on the Data API can be tested either in Capsule maintenance mode or on a user's own computer.
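The idea of a mock that shares a call signature with the real function can be sketched as follows. This is a minimal illustration, not the toolkit's implementation: `get_volumes_real`, `get_volumes_mock`, and the archive entry names are hypothetical stand-ins.

```python
import io
import zipfile


def get_volumes_real(token, volume_ids, concat=False):
    """Hypothetical network-backed function; fails outside secure mode."""
    raise ConnectionError("the Data API is reachable only from secure mode")


def get_volumes_mock(token, volume_ids, concat=False):
    """Same call signature and return type (zip bytes), but no network access."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for volume_id in volume_ids:
            # Placeholder content; real sample data will differ.
            archive.writestr(f"{volume_id}.txt", "sample text")
    return buffer.getvalue()


# Callers can swap implementations without changing any other code.
get_volumes = get_volumes_mock
data = get_volumes(token=None, volume_ids=["htrc.testid"])
names = zipfile.ZipFile(io.BytesIO(data)).namelist()
print(names)
```

Because both functions accept the same arguments and return the same type, the rest of a workflow does not need to know which one it is calling.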
An easy way to use this pattern is shown below.
Example
if __debug__:
    # This code will execute when running `python script.py`
    import htrc.mock.volumes as volumes
else:
    # This code will execute when running `python -O script.py`
    # The -O argument turns on optimizations, setting __debug__ = False.
    import htrc.volumes as volumes

# The following is just to make a running script
volume_ids = ['htrc.testid']  # any list will do
output_dir = 'htrc_data'  # any path will do

# download volumes
volumes.download(volume_ids, output_dir)
This script relies on the python -O switch, which controls the __debug__ global variable:
When run in the development environment, which does not have secure access to the Data API, the program is run with python script.py, setting __debug__ = True. This means that volumes.download(volume_ids, output_dir) calls htrc.mock.volumes.download(volume_ids, output_dir).
When run in secure mode of the Data Capsule, the program is executed with python -O script.py, setting __debug__ = False. The statement volumes.download(volume_ids, output_dir) then calls htrc.volumes.download(volume_ids, output_dir).
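The effect of the -O switch on __debug__ can be checked directly. The sketch below starts two fresh interpreters and reports the value each one sees; the helper name `debug_flag` is hypothetical, and the PYTHONOPTIMIZE environment variable is removed so that an inherited setting does not change the result.

```python
import os
import subprocess
import sys


def debug_flag(*interpreter_args):
    """Run a fresh interpreter and return the value of __debug__ it prints."""
    # Drop PYTHONOPTIMIZE so only the command-line switch decides the result.
    env = {k: v for k, v in os.environ.items() if k != "PYTHONOPTIMIZE"}
    result = subprocess.run(
        [sys.executable, *interpreter_args, "-c", "print(__debug__)"],
        capture_output=True,
        text=True,
        check=True,
        env=env,
    )
    return result.stdout.strip()


normal = debug_flag()         # comparable to `python script.py`
optimized = debug_flag("-O")  # comparable to `python -O script.py`
print(normal, optimized)
```

Note that -O also removes assert statements, so a script that selects its imports this way should not depend on assertions for validation in secure mode.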
Modules
htrc.metadata
htrc.metadata.get_bulk_metadata(ids, marc=False)
Retrieve item metadata from the HathiTrust Bibliographic API.
Parameters:
ids: HTIDs for the volumes to be retrieved.
marc: Retrieve MARC-XML within the JSON return value.
htrc.metadata.get_metadata(ids, output_file=None)
Retrieves metadata for a folder of folders, where each subfolder is named for a HathiTrust ID. This is the default structure extracted from a Data API request (see htrc.volumes.get_volumes).
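The folder-of-folders layout described above can be illustrated with a short sketch that collects subfolder names as candidate IDs. The helper `ids_from_folder` and the folder names are hypothetical; this is not how get_metadata is necessarily implemented.

```python
import os
import tempfile


def ids_from_folder(path):
    """Return the names of the immediate subfolders of path, sorted."""
    return sorted(entry.name for entry in os.scandir(path) if entry.is_dir())


with tempfile.TemporaryDirectory() as root:
    # Build a throwaway layout: one subfolder per (made-up) volume ID.
    for name in ("htrc.testid", "htrc.otherid"):
        os.makedirs(os.path.join(root, name))
    # A stray file at the top level is ignored by the helper.
    open(os.path.join(root, "notes.txt"), "w").close()
    found = ids_from_folder(root)

print(found)
```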
htrc.metadata.get_volume_metadata(id, marc=False)
Retrieve item metadata from the HathiTrust Bibliographic API.
Parameters:
id: HTID for the volume to be retrieved.
marc: Retrieve MARC-XML within the JSON return value.
htrc.metadata.safe_bulk_metadata(ids, marc=False, sleep_time=1)
Retrieve bulk item metadata from the HathiTrust Bibliographic API.
Unlike get_bulk_metadata, this function returns an empty dictionary rather than raising an error when metadata is missing.
Parameters:
ids: HTIDs for the volumes to be retrieved.
marc: Retrieve MARC-XML within the JSON return value.
htrc.metadata.safe_volume_metadata(id, marc=False, sleep_time=1)
Retrieve item metadata from the HathiTrust Bibliographic API.
Unlike get_volume_metadata, this function returns an empty dictionary rather than raising an error when metadata is missing.
Parameters:
id: HTID for the volume to be retrieved.
marc: Retrieve MARC-XML within the JSON return value.
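The "safe" behavior described above, returning an empty dictionary instead of raising, follows a common wrapper pattern. The sketch below shows that pattern with hypothetical names (`safe_call`, `fake_fetch`); treating sleep_time as a pause after a failed lookup is an assumption, since its role is not documented here.

```python
import time


def safe_call(fetch, volume_id, sleep_time=0):
    """Call fetch(volume_id); return {} instead of raising when it fails."""
    try:
        return fetch(volume_id)
    except (KeyError, ValueError):
        # Assumed behavior: pause briefly before reporting "no metadata".
        time.sleep(sleep_time)
        return {}


def fake_fetch(volume_id):
    """Stand-in for a metadata lookup; raises KeyError for unknown IDs."""
    records = {"htrc.testid": {"title": "Sample"}}
    return records[volume_id]


found = safe_call(fake_fetch, "htrc.testid")
missing = safe_call(fake_fetch, "htrc.unknown")
print(found, missing)
```

A caller can then test the result with `if not metadata:` rather than wrapping every lookup in its own try block.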
htrc.metadata.volume_solr_metadata(id, sleep_time=0.1)
Retrieve metadata from the HTRC Solr API.
The HTRC Solr instance is used only for certain extracted features unavailable in the main HathiTrust Bibliographic API. If you are a recipient of an HTRC Advanced Collaborative Support (ACS) grant, you may have to use this function.
htrc.mock
htrc.mock.volumes
htrc.mock.volumes
Contains functions to test the volume retrieval from the HTRC Data API. The download functions will return a sample zip file.
See the core documentation for an example of how to use this library.
htrc.mock.volumes.credentials_from_config(path)
Retrieves the username and password for the Data API from a config file. DOES NOT raise an EnvironmentError if path is invalid. See also: credential_prompt
htrc.mock.volumes.get_pages(token, page_ids, concat=False)
Returns a ZIP file containing specific pages.
Parameters:
token: An OAuth2 token for the app.
page_ids: A list of page IDs.
concat: If True, return a single file per volume. If False, return a single file per page (default).
htrc.mock.volumes.get_volumes(token, volume_ids, concat=False)
Returns volumes from the Data API as a raw zip stream.
Parameters:
token: An OAuth2 token for the app.
volume_ids: A list of volume_ids.
concat: If True, return a single file per volume. If False, return a single file per page (default).
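The concat flag described above changes how many files the returned archive holds. The sketch below builds two in-memory archives as stand-ins for a response and lists their entries. The helper names and the entry paths are illustrative assumptions, not the Data API's actual layout.

```python
import io
import zipfile


def make_zip(entries):
    """Build zip bytes from a mapping of entry name to text content."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, text in entries.items():
            archive.writestr(name, text)
    return buffer.getvalue()


def list_entries(zip_bytes):
    """Open raw zip bytes in memory and return the entry names."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return archive.namelist()


# concat=False: assumed layout with one file per page.
per_page = make_zip({
    "htrc.testid/00000001.txt": "page one",
    "htrc.testid/00000002.txt": "page two",
})
# concat=True: assumed layout with one file per volume.
per_volume = make_zip({"htrc.testid.txt": "page one\npage two"})

print(list_entries(per_page))
print(list_entries(per_volume))
```

Wrapping the bytes in io.BytesIO lets a workflow inspect a zip stream without first writing it to disk.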
htrc.volumes
htrc.volumes
Contains functions to retrieve volumes from the HTRC Data API.
The functions in this package will not operate unless they are executed from an HTRC Data Capsule in Secure Mode. The module htrc.mock.volumes contains Patch objects for testing workflows.
htrc.volumes.get_pages(data_api_config: htrc.config.HtrcDataApiConfig, page_ids, concat=False, mets=False, buffer_size=128)
Returns a ZIP file containing specific pages.
Parameters:
data_api_config: The configuration data of the Data API endpoint.
page_ids: A list of page IDs.
concat: If True, return a single file per volume. If False, return a single file per page (default).
htrc.volumes.get_volumes(data_api_config: htrc.config.HtrcDataApiConfig, volume_ids, concat=False, mets=False, buffer_size=128)
Returns volumes from the Data API as a raw zip stream.
Parameters:
data_api_config: The configuration data of the Data API endpoint.
volume_ids: A list of volume_ids.
concat: If True, return a single file per volume. If False, return a single file per page (default).