HTRC Workset Toolkit

The HTRC Workset Toolkit provides a command line interface for interacting with and analyzing volumes in the HathiTrust Digital Library:

  • Volume Download (htrc download)

  • Metadata Download (htrc metadata)

  • Pre-built Analysis Workflows (htrc run)

  • Export of volume lists (htrc export)

Workset Path

Each of these commands takes a workset path. The valid types of workset path, with an example of each:

  • HathiTrust ID: mdp.39015078560078

  • HathiTrust Catalog ID: 001423370

  • HathiTrust URL: https://babel.hathitrust.org/cgi/pt?id=mdp.39015078560078;view=1up;seq=13

  • Handle.org Volume URL: https://hdl.handle.net/2027/mdp.39015078560078

  • HathiTrust Catalog URL: https://catalog.hathitrust.org/Record/001423370

  • HathiTrust Collection Builder URL: https://babel.hathitrust.org/shcgi/mb?a=listis;c=696632727

  • Local volumes file: /home/dcuser/Downloads/collections.txt
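The path types above can be sketched as a small classifier. The function below is purely illustrative (it is not the toolkit's actual parser), and its heuristics are assumptions made for this example:

```python
import re

def classify_workset_path(path):
    """Guess which kind of workset path a string is.

    Hypothetical helper for illustration only; the toolkit's real
    parsing logic may differ.
    """
    if re.fullmatch(r"\d{9}", path):
        return "HathiTrust Catalog ID"
    if path.startswith("https://catalog.hathitrust.org/Record/"):
        return "HathiTrust Catalog URL"
    if path.startswith("https://hdl.handle.net/2027/"):
        return "Handle.org Volume URL"
    if "babel.hathitrust.org" in path and "mb?" in path:
        return "HathiTrust Collection Builder URL"
    if "babel.hathitrust.org" in path:
        return "HathiTrust URL"
    # HathiTrust IDs look like "<namespace>.<identifier>", e.g. "mdp.39015078560078"
    if re.fullmatch(r"[a-z0-9]+\.[^\s/]+", path):
        return "HathiTrust ID"
    return "Local volumes file"

print(classify_workset_path("mdp.39015078560078"))  # HathiTrust ID
```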

Volume Download

The htrc download command retrieves volumes from the HTRC Data API; it is available only within the secure mode of the HTRC Data Capsule Service.

Note

This command will return an error when run on a non-HTRC computer or on a Capsule running in maintenance mode.

Arguments

usage: htrc download [-h] [-f] [-o OUTPUT] [-hf] [-hfc] [-w N] [-msr N] [-s] [--parallelism N] [--batch-size N] [-c] [-m] [-pg] [-t TOKEN] [-dh DATAHOST] [-dp DATAPORT] [-de DATAEPR] [-dc DATACERT] [-dk DATAKEY] [file]

Positional Arguments

file

Workset path[s]

Default: <_io.TextIOWrapper name='<stdin>' mode='r' encoding='utf-8'>

Named Arguments

-f, --force

Remove folder if exists

Default: False

-o, --output

Output directory

Default: "/media/secure_volume/workset/"

-hf, --remove-headers-footers

Remove headers and footers from individual pages and save in a separate csv file for inspection

Default: False

-hfc, --remove-headers-footers-and-concat

Remove headers and footers from individual pages, save them in a separate csv file for inspection, then concatenate the pages

Default: False

-w, --window-size

How many pages ahead the header/footer extraction algorithm looks to find potential matching headers/footers (a higher value can give more accurate results on lower-quality OCR volumes, at the expense of runtime)

Default: 6

-msr, --min-similarity-ratio

The minimum string similarity ratio required for the Levenshtein-distance fuzzy-matching algorithm to declare two headers 'the same' (the higher the value, up to a maximum of 1.0, the stricter the matching; lower values allow more fuzziness to accommodate OCR errors)

Default: 0.7

-s, --skip-removed-hf

Skip creating a saved report of the removed headers and footers for each page for inspection

Default: False

--parallelism

The max number of concurrent tasks to start when downloading or removing headers/footers

Default: 10

--batch-size

The max number of volumes to download at a time from DataAPI

Default: 250

-c, --concat

Concatenate a volume’s pages into a single file

Default: False

-m, --mets

Add volume’s METS file

Default: False

-pg, --pages

Download the given page numbers of a volume.

Default: False

-t, --token

JWT for volume download.

-dh, --datahost

Data API host.

-dp, --dataport

Data API port.

-de, --dataepr

Data API EPR.

-dc, --datacert

Client certificate file for mutual TLS with Data API.

-dk, --datakey

Client key file for mutual TLS with Data API.
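As a rough illustration of how -w and -msr interact: the extractor compares candidate header lines across a sliding window of pages and treats two lines as the same repeated header when their similarity meets the threshold. The sketch below uses difflib's SequenceMatcher as a stand-in for the Levenshtein-based ratio the options describe; the function and data are invented for illustration, not taken from the toolkit's source:

```python
from difflib import SequenceMatcher

def looks_like_header(line, pages, page_idx, window=6, min_ratio=0.7):
    """Windowed fuzzy header detection (illustrative sketch).

    Compare `line` (the top line of page `page_idx`) against the top
    lines of the next `window` pages; if any pair's similarity ratio
    meets `min_ratio`, treat it as a repeated header.
    """
    for other_page in pages[page_idx + 1 : page_idx + 1 + window]:
        ratio = SequenceMatcher(None, line, other_page[0]).ratio()
        if ratio >= min_ratio:
            return True
    return False

pages = [
    ["CHAPTER I", "It was the best of times..."],
    ["CHAPTER 1", "it was the worst of times..."],  # OCR confused I and 1
    ["A completely different line", "more body text"],
]
print(looks_like_header("CHAPTER I", pages, 0))                  # True
print(looks_like_header("CHAPTER I", pages, 0, min_ratio=0.95))  # False
```

A stricter -msr (closer to 1.0) rejects the OCR-garbled match above, while a larger -w lets the detector look past isolated pages that lack the header.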

Bibliographic API Access

htrc metadata retrieves metadata from the HathiTrust Bibliographic API. Unlike htrc download, this command can be run from any computer or network.

Arguments

usage: htrc metadata [-h] path [path ...]

Positional Arguments

path

Workset path[s]

Analysis Workflows

The HTRC Workset Toolkit also provides the command line tool htrc run. Like volume download, htrc run takes a workset path.

Topic Modeling

There are two implementations of LDA topic modeling supported by the toolkit; the MALLET implementation is invoked with htrc run mallet.

Arguments

usage: htrc run mallet [-h] -k K [--iter ITER] [--workset-path WORKSET_PATH] [path]

Positional Arguments

path

Default: "/media/secure_volume/workset/"

Named Arguments

-k

Number of topics

--iter

Number of iterations

Default: 200

--workset-path

Location to store workset download.

Default: "/media/secure_volume/workset/"

Use Cases and Examples

The following use cases and examples illustrate the htrc commands inside the HTRC Data Capsule.

command: htrc download

capsule mode: secure

  • Download the volumes in a volume id list to the default path (/media/secure_volume/workset):

    htrc download /home/dcuser/HTRC/htrc-id

  • Download the volumes in a HathiTrust collection URL to the default path (/media/secure_volume/workset):

    htrc download "https://babel.hathitrust.org/cgi/mb?a=listis&c=1337751722"

  • Download volumes to a specific location:

    htrc download /home/dcuser/HTRC/htrc-id -o /media/secure_volume/my-workset

  • Download volumes to a specific location with the concatenation option (this concatenates all the pages of each volume into one txt file):

    htrc download /home/dcuser/HTRC/htrc-id -o /media/secure_volume/my-workset -c

  • Download specific pages from a single volume:

    htrc download -pg coo.31924089593846[5,10,15,20,25,30]

  • Download volumes and then extract headers/footers from them:

    htrc download -hf /home/dcuser/HTRC/htrc-id

  • Download volumes, extract headers/footers from the volume pages, then concatenate the pages (this concatenates all the pages of each volume into one txt file):

    htrc download -hfc /home/dcuser/HTRC/htrc-id

  • Download volumes, extract headers/footers, and skip saving the .csv files containing the removed headers and footers:

    htrc download -hf -s /home/dcuser/HTRC/htrc-id

  • Download volumes, extract headers/footers, and change the window of pages used by the extractor algorithm (the default is 6; lower values increase speed but reduce accuracy):

    htrc download -hf -w 3 /home/dcuser/HTRC/htrc-id

  • Download volumes, extract headers/footers, and change the minimum similarity ratio for lines to be considered a header or footer (the default is 0.7, i.e. 70%: a line that is at least 70% similar to lines on other pages within the window is labeled a header or footer and removed):

    htrc download -hf -msr .9 /home/dcuser/HTRC/htrc-id

  • Download volumes, extract headers/footers, and change the max number of concurrent tasks (note that the only options are 1 or 2):

    htrc download -hf --parallelism 2 /home/dcuser/HTRC/htrc-id


command: htrc metadata

capsule mode: secure and maintenance

  • Download the metadata of volumes from a HathiTrust collection URL:

    htrc metadata "https://babel.hathitrust.org/cgi/mb?a=listis&c=1853042514"

  • Download the metadata of volumes from a volume id list:

    htrc metadata /home/dcuser/HTRC/htrc-id

  • Download the metadata associated with a volume id (here, volume 1 of The Works of Jonathan Swift):

    htrc metadata mdp.39015078560078

    Note that this retrieves only the first volume. To download metadata for all 8 volumes, use the catalog identifier:

    htrc metadata 001423370

    Each command can also be used with the corresponding URL (note the quote marks around each URL):

    htrc metadata "https://babel.hathitrust.org/cgi/pt?id=mdp.39015078560078;view=1up;seq=13"

    htrc metadata "https://catalog.hathitrust.org/Record/001423370"

    This URL support makes it easy to browse hathitrust.org and copy links for computational analysis using the SDK.
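For scripting outside the toolkit, the Bibliographic API that htrc metadata wraps can also be queried directly over HTTP. The URL patterns below are my reading of the API's brief-record endpoints; treat them as assumptions to verify against the current API documentation:

```python
# Assumed URL patterns for the HathiTrust Bibliographic API's brief-record
# endpoints; verify against the current API documentation before relying
# on them.
BIB_API = "https://catalog.hathitrust.org/api/volumes"

def brief_volume_url(htid):
    """URL for the brief JSON record of a single volume (HathiTrust ID)."""
    return f"{BIB_API}/brief/htid/{htid}.json"

def brief_record_url(record_id):
    """URL for the brief JSON record of a catalog record (all its volumes)."""
    return f"{BIB_API}/brief/recordnumber/{record_id}.json"

print(brief_volume_url("mdp.39015078560078"))
# Fetch with e.g. urllib.request.urlopen(brief_volume_url("mdp.39015078560078"))
```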


command: htrc metadata

capsule mode: secure

  • Download the metadata of volumes from an already-downloaded volumes path:

    htrc metadata /media/secure_volume/workset


command: htrc metadata

capsule mode: maintenance

  • Download the metadata of volumes from an already-downloaded volumes path. (Sample volumes are available in capsules created with the ubuntu-16-04-with-sample-volumes image; they are provided as zip files. Unzip them before use, because the metadata function gets volume ids from the volume directory names.)

    mkdir /home/dcuser/unzipped_volumes

    cp /home/dcuser/HTRC/data/sample_volumes/fiction/<zip_files> /home/dcuser/unzipped_volumes

    unzip /home/dcuser/unzipped_volumes/'*.zip' -d /home/dcuser/unzipped_volumes && rm /home/dcuser/unzipped_volumes/*.zip

    htrc metadata /home/dcuser/unzipped_volumes


command: htrc export

capsule mode: secure and maintenance

  • Export volume ids from a downloaded HathiTrust collection and create a workset containing only volume ids:

    Go to the following link in a browser:

    https://babel.hathitrust.org/cgi/mb?a=listis&c=1853042514

    Download the metadata in tab-separated format (Download Item Metadata: Tab-Delimited Text (TSV)), then run:

    htrc export mb-9.txt > volumes.txt

  • Export volume ids from a HathiTrust collection URL and create a workset containing only volume ids (works in both secure and maintenance modes):

    htrc export "https://babel.hathitrust.org/cgi/mb?a=listis&c=1853042514" > volumes.txt
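Conceptually, htrc export reduces the collection metadata to its volume-id column. A minimal standalone sketch of that reduction, assuming the TSV uses an htid column (the column name and sample rows are assumptions about the collection-builder export layout, not taken from a real file):

```python
import csv
import io

# Hypothetical stand-in for `htrc export mb-9.txt`: pull the volume-id
# ("htid") column out of a collection-builder TSV.  The column name is
# an assumption about the export layout.
SAMPLE_TSV = (
    "htid\ttitle\n"
    "mdp.39015078560078\tThe works of Jonathan Swift. v.1\n"
    "coo.31924089593846\tAnother sample volume\n"
)

def extract_volume_ids(tsv_text):
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row["htid"] for row in reader]

print("\n".join(extract_volume_ids(SAMPLE_TSV)))
```

The printed lines have the same shape as the volumes.txt workset file the export commands produce: one volume id per line.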


command: htrc run mallet

capsule mode: secure

  • Run mallet on already-downloaded volumes:

    htrc run mallet /media/secure_volume/workset -k 20

  • Run mallet on a volume id list:

    htrc run mallet /home/dcuser/HTRC/htrc-id -k 20

  • Run mallet on a HathiTrust collection:

    htrc run mallet "https://babel.hathitrust.org/cgi/mb?a=listis&c=1853042514" -k 20


command: htrc run mallet

capsule mode: maintenance

  • Run mallet on already-downloaded volumes. (Sample volumes are available in capsules created with the ubuntu-16-04-with-sample-volumes image; they are provided as zip files. Unzip them before use, because volume ids are read from the volume directory names.)

    htrc run mallet /home/dcuser/unzipped_volumes -k 20