HTRC Workset Toolkit Development Library

HTRC Data Capsule Service

The HTRC Data Capsule Service provisions virtual machines (VMs) to researchers within the HTRC secure environment. The VM and software environment (including the SDK) together form a Capsule. Each researcher has exclusive use of the Capsule for a period of weeks or months during which they can configure their own environment for performing research on HathiTrust Digital Library texts, including both in-copyright and public domain volumes.

Each Capsule has both a maintenance mode and a secure mode. In secure mode, network access is restricted to the HTRC Data API and some HTDL resources, allowing text and image data to be downloaded to the Capsule.

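Any changes made on the non-secure volumes are reverted when leaving secure mode, so persistent code changes must occur in maintenance mode. Because the Data API is reachable only in secure mode, code written in maintenance mode cannot call it directly. The SDK addresses this connectivity gap with the htrc.mock library.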

Mock Testing

Mock testing uses simulated objects or functions to mimic the behavior of real code in controlled ways.

The HTRC Workset Toolkit implements a mock of the Data API access layer in htrc.mock.volumes. The Data API server is only accessible via a Capsule in secure mode. By implementing a function with the same call signature that returns the same data types, workflows that rely on the Data API can be tested either in Capsule maintenance mode or on a user’s own computer.
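
The same-signature idea can be sketched in a few lines. This is a minimal illustration, not the toolkit's code: real_download, mock_download, and run_workflow are hypothetical names, and the returned paths are invented for the example.

```python
from typing import Callable, List


def real_download(volume_ids: List[str], output_dir: str) -> List[str]:
    """Hypothetical stand-in for a call that needs the secure network."""
    raise ConnectionError("Data API is only reachable from secure mode")


def mock_download(volume_ids: List[str], output_dir: str) -> List[str]:
    """Mock with the same call signature and return type; no network needed."""
    return [f"{output_dir}/{vid}.zip" for vid in volume_ids]


def run_workflow(download: Callable[[List[str], str], List[str]]) -> int:
    """The workflow depends only on the signature, not on the implementation."""
    paths = download(["htrc.testid"], "htrc_data")
    return len(paths)


count = run_workflow(mock_download)
print(count)
```

Because run_workflow only relies on the shared signature, the mock can be swapped for the real function without touching the workflow itself.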

An easy way to use this pattern is shown below.

Example

if __debug__:
    # This code will execute when running `python script.py`
    import htrc.mock.volumes as volumes
else:
    # This code will execute when running `python -O script.py`
    # The -O argument turns on optimizations, setting __debug__ = False.
    import htrc.volumes as volumes

# The following is just to make a running script
volume_ids = ['htrc.testid']    # any list will do
output_dir = 'htrc_data'        # any path will do

# download volumes
volumes.download(volume_ids, output_dir)

This script relies on the python -O switch, which controls the built-in __debug__ constant:

  • In the development environment, which has no secure access to the Data API, the program is run with python script.py, so __debug__ is True and volumes.download(volume_ids, output_dir) calls htrc.mock.volumes.download(volume_ids, output_dir).

  • In secure mode of the Data Capsule, the program is run with python -O script.py, so __debug__ is False and volumes.download(volume_ids, output_dir) calls htrc.volumes.download(volume_ids, output_dir).
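
The effect of the switch can be checked directly. The sketch below starts a child interpreter with and without -O and reports the value of __debug__ each one sees. The helper name debug_flag is hypothetical, and the PYTHONOPTIMIZE environment variable is removed so that only the command-line switch matters.

```python
import os
import subprocess
import sys


def debug_flag(*interpreter_args: str) -> str:
    """Run a child interpreter and return the value of __debug__ it prints."""
    # Drop PYTHONOPTIMIZE so the result depends only on the arguments given.
    env = {k: v for k, v in os.environ.items() if k != "PYTHONOPTIMIZE"}
    result = subprocess.run(
        [sys.executable, *interpreter_args, "-c", "print(__debug__)"],
        capture_output=True,
        text=True,
        check=True,
        env=env,
    )
    return result.stdout.strip()


print(debug_flag())      # equivalent to `python -c ...`
print(debug_flag("-O"))  # equivalent to `python -O -c ...`
```

Note that -O also strips assert statements, so a script that selects its imports this way should not rely on assert for validation in secure mode.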

Modules

htrc.metadata

htrc.metadata.get_bulk_metadata(ids, marc=False)[source]

Retrieve item metadata from the HathiTrust Bibliographic API.

Parameters
  • ids – HTIDs for the volumes to be retrieved.

  • marc – Retrieve MARC-XML within the JSON return value.

htrc.metadata.get_metadata(ids, output_file=None)[source]

Retrieves metadata for a folder of folders, where each subfolder is named for a HathiTrust ID. This is the default structure extracted from a Data API request (see htrc.volumes.get_volumes).
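
The folder-of-folders layout can be walked with the standard library. The sketch below only collects subfolder names; ids_from_folder is a hypothetical helper, the folder names are invented, and any decoding of folder names back into HTIDs is not covered here.

```python
import tempfile
from pathlib import Path
from typing import List


def ids_from_folder(root) -> List[str]:
    """Return the names of the immediate subfolders of root, sorted."""
    return sorted(p.name for p in Path(root).iterdir() if p.is_dir())


# Build a throwaway layout: two volume folders and one stray file.
with tempfile.TemporaryDirectory() as root:
    for name in ("vol.a", "vol.b"):
        (Path(root) / name).mkdir()
    (Path(root) / "notes.txt").write_text("not a volume")
    found = ids_from_folder(root)

print(found)
```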

htrc.metadata.get_volume_metadata(id, marc=False)[source]

Retrieve item metadata from the HathiTrust Bibliographic API.

Parameters
  • id – HTID for the volume to be retrieved.

  • marc – Retrieve MARC-XML within the JSON return value.
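
To make the marc flag concrete, the sketch below builds request URLs without contacting any server. The base URL and path layout are assumptions drawn from the public Bibliographic API documentation, not from this toolkit's source, and volume_url and bulk_url are hypothetical helper names.

```python
from typing import Iterable

# Assumed base URL of the public HathiTrust Bibliographic API.
BIB_API = "https://catalog.hathitrust.org/api/volumes"


def volume_url(htid: str, marc: bool = False) -> str:
    """Build a single-volume URL; 'full' is assumed to include MARC-XML."""
    level = "full" if marc else "brief"
    return f"{BIB_API}/{level}/htid/{htid}.json"


def bulk_url(htids: Iterable[str], marc: bool = False) -> str:
    """Build a multi-volume URL with pipe-separated 'htid:<id>' terms."""
    level = "full" if marc else "brief"
    query = "|".join(f"htid:{htid}" for htid in htids)
    return f"{BIB_API}/{level}/json/{query}"


print(volume_url("htrc.testid"))
print(bulk_url(["htrc.testid", "htrc.testid2"], marc=True))
```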

htrc.metadata.record_metadata(id, sleep_time=1)[source]

Retrieve metadata for a HathiTrust Record.

htrc.metadata.safe_bulk_metadata(ids, marc=False, sleep_time=1)[source]

Retrieve bulk item metadata from the HathiTrust Bibliographic API.

Unlike get_bulk_metadata, this function returns an empty dictionary rather than raising an error when metadata is missing.

Parameters
  • ids – HTIDs for the volumes to be retrieved.

  • marc – Retrieve MARC-XML within the JSON return value.

See https://www.hathitrust.org/bib_api
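
The "safe" behavior can be pictured as a thin wrapper around a fetch function. This is a sketch under stated assumptions, not the toolkit's implementation: safe_call and both fetchers are hypothetical, the broad except clause is a guess at which errors are swallowed, and sleep_time is assumed to be a pause after each successful request.

```python
import time
from typing import Callable, Dict, List


def safe_call(fetch: Callable[[List[str]], Dict],
              ids: List[str],
              sleep_time: float = 0.0) -> Dict:
    """Call fetch(ids); return {} instead of propagating an error."""
    try:
        result = fetch(ids)
    except Exception:  # assumed: any failure counts as missing metadata
        return {}
    if sleep_time:
        time.sleep(sleep_time)  # assumed: pause between requests
    return result


def working_fetch(ids: List[str]) -> Dict:
    """Hypothetical fetcher that always succeeds."""
    return {i: {"title": "Sample"} for i in ids}


def failing_fetch(ids: List[str]) -> Dict:
    """Hypothetical fetcher that fails as if metadata were missing."""
    raise KeyError("metadata missing")


print(safe_call(working_fetch, ["htrc.testid"]))
print(safe_call(failing_fetch, ["htrc.testid"]))
```

The trade-off is that a caller can no longer tell a missing record from another kind of failure, so an empty dictionary should be checked for explicitly.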

htrc.metadata.safe_volume_metadata(id, marc=False, sleep_time=1)[source]

Retrieve item metadata from the HathiTrust Bibliographic API.

Unlike get_volume_metadata, this function returns an empty dictionary rather than raising an error when metadata is missing.

Parameters
  • id – HTID for the volume to be retrieved.

  • marc – Retrieve MARC-XML within the JSON return value.

See https://www.hathitrust.org/bib_api

htrc.metadata.volume_solr_metadata(id, sleep_time=0.1)[source]

Retrieve metadata from HTRC Solr API.

The HTRC Solr instance is used only for certain extracted features unavailable in the main HathiTrust Bibliographic API. If you are a recipient of an HTRC Advanced Collaborative Support (ACS) grant, then you may have to use this function.

htrc.mock

htrc.mock.volumes

Contains functions to test the volume retrieval from the HTRC Data API. The download functions will return a sample zip file.

See the core documentation for an example of how to use this library.

htrc.mock.volumes.credentials_from_config(path)[source]

Retrieves the username and password from a config file for the Data API. DOES NOT raise an EnvironmentError if path is invalid. See also: credential_prompt
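
Reading credentials without failing on a bad path can be sketched with the standard configparser module, whose read method skips files it cannot open. The section name "data", the key names, and the helper read_credentials are all hypothetical; the toolkit's real config layout may differ.

```python
import configparser
import tempfile
from pathlib import Path

# Hypothetical section name; the toolkit's real config layout may differ.
SECTION = "data"


def read_credentials(path):
    """Return (username, password), or (None, None) if nothing can be read."""
    parser = configparser.ConfigParser()
    parser.read(path)  # ConfigParser.read skips files it cannot open
    username = parser.get(SECTION, "username", fallback=None)
    password = parser.get(SECTION, "password", fallback=None)
    return username, password


with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "sample.ini"
    config_path.write_text("[data]\nusername = alice\npassword = s3cret\n")
    found = read_credentials(config_path)
    missing = read_credentials(Path(tmp) / "does_not_exist.ini")

print(found)
print(missing)
```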

htrc.mock.volumes.get_oauth2_token(username, password)[source]

Returns a sample OAuth2 token.

htrc.mock.volumes.get_pages(token, page_ids, concat=False)[source]

Returns a ZIP file containing specific pages.

Parameters
  • token – An OAuth2 token for the app.

  • page_ids – A list of page IDs.

  • concat – If True, return a single file per volume. If False, return a single file per page (default).

htrc.mock.volumes.get_volumes(token, volume_ids, concat=False)[source]

Returns volumes from the Data API as a raw zip stream.

Parameters
  • token – An OAuth2 token for the app.

  • volume_ids – A list of volume IDs.

  • concat – If True, return a single file per volume. If False, return a single file per page (default).
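
A raw zip stream can be handled entirely in memory with the standard library. The sketch below builds its own sample archive rather than calling the toolkit, so the member names and the helpers make_sample_zip and read_members are hypothetical; the real archive layout may differ.

```python
import io
import zipfile
from typing import Dict


def make_sample_zip() -> bytes:
    """Build a small in-memory archive with one text file per page."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("htrc.testid/00000001.txt", "page one")
        archive.writestr("htrc.testid/00000002.txt", "page two")
    return buffer.getvalue()


def read_members(data: bytes) -> Dict[str, str]:
    """Map each member name in the zip bytes to its decoded text."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return {name: archive.read(name).decode("utf-8")
                for name in archive.namelist()}


members = read_members(make_sample_zip())
print(sorted(members))
```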

htrc.volumes

Contains functions to retrieve volumes from the HTRC Data API.

The functions in this package will not operate unless they are executed from an HTRC Data Capsule in Secure Mode. The module htrc.mock.volumes contains Patch objects for testing workflows.

htrc.volumes.get_pages(data_api_config: htrc.config.HtrcDataApiConfig, page_ids, concat=False, mets=False, buffer_size=128)[source]

Returns a ZIP file containing specific pages.

Parameters
  • data_api_config – The configuration data of the Data API endpoint.

  • page_ids – A list of page IDs.

  • concat – If True, return a single file per volume. If False, return a single file per page (default).

htrc.volumes.get_volumes(data_api_config: htrc.config.HtrcDataApiConfig, volume_ids, concat=False, mets=False, buffer_size=128)[source]

Returns volumes from the Data API as a raw zip stream.

Parameters
  • data_api_config – The configuration data of the Data API endpoint.

  • volume_ids – A list of volume IDs.

  • concat – If True, return a single file per volume. If False, return a single file per page (default).

htrc.util

htrc.util.split_items(seq, split_size)[source]

Returns a generator that yields portions of seq, each containing at most split_size items. Useful when chunking requests to bulk endpoints.

Parameters
  • seq – A sequence to split.

  • split_size – The maximum size of each split.
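
The chunking behavior described above can be reproduced with a short generator. This is an independent sketch, not the library's source; the name chunked and the ValueError check are choices made for the example.

```python
from typing import Iterator, Sequence


def chunked(seq: Sequence, split_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of seq, each at most split_size items long."""
    if split_size < 1:
        raise ValueError("split_size must be at least 1")
    for start in range(0, len(seq), split_size):
        yield seq[start:start + split_size]


for chunk in chunked(list(range(7)), 3):
    print(chunk)
```

The last slice is simply shorter when the length of seq is not a multiple of split_size, which suits bulk endpoints that cap the number of IDs per request.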