HTRC Workset Toolkit

The HTRC Workset Toolkit provides a command line interface for interacting with and analyzing volumes in the HathiTrust Digital Library:

  • Volume Download (htrc download)
  • Metadata Download (htrc metadata)
  • Pre-built Analysis Workflows (htrc run)
  • Export of volume lists (htrc export)

Workset Path

Each of these commands takes a workset path. Valid types of workset paths and examples of each are:

Identifier Type                      Example
HathiTrust ID                        mdp.39015078560078
HathiTrust Catalog ID                001423370
HathiTrust URL                       https://babel.hathitrust.org/cgi/pt?id=mdp.39015078560078;view=1up;seq=13
Handle.org Volume URL                https://hdl.handle.net/2027/mdp.39015078560078
HathiTrust Catalog URL               https://catalog.hathitrust.org/Record/001423370
HathiTrust Collection Builder URL    https://babel.hathitrust.org/shcgi/mb?a=listis;c=696632727
Local volumes file                   /home/dcuser/Downloads/collections.txt
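
The identifier types above can be told apart by their shape. The following is a minimal sketch, assuming the patterns inferred from the examples in the table (for instance, that a catalog ID is nine digits). The function name classify_workset_path is hypothetical and this is not the toolkit's own resolver; it only covers the forms shown above.

```python
import re
from urllib.parse import urlparse


def classify_workset_path(path: str) -> str:
    """Guess which identifier type a workset path string belongs to.

    Hypothetical helper: the rules below are assumptions drawn from the
    examples in the table, not from the toolkit's source code.
    """
    parsed = urlparse(path)
    if parsed.scheme in ("http", "https"):
        host = parsed.netloc.lower()
        if host == "hdl.handle.net":
            return "Handle.org Volume URL"
        if host == "catalog.hathitrust.org":
            return "HathiTrust Catalog URL"
        if host == "babel.hathitrust.org":
            # In the examples, collection URLs end in /mb and volume URLs in /pt.
            if parsed.path.endswith("/mb"):
                return "HathiTrust Collection Builder URL"
            if parsed.path.endswith("/pt"):
                return "HathiTrust URL"
        return "unknown URL"
    # Assumption: catalog IDs are exactly nine digits, as in the example.
    if re.fullmatch(r"\d{9}", path):
        return "HathiTrust Catalog ID"
    # Anything that looks like a file path is treated as a local volumes file.
    if "/" in path or path.endswith(".txt"):
        return "Local volumes file"
    # Assumption: a volume ID is "<namespace>.<local id>", as in the example.
    if re.fullmatch(r"[a-z0-9]+\.\S+", path):
        return "HathiTrust ID"
    return "Local volumes file"
```

For example, classify_workset_path("001423370") yields "HathiTrust Catalog ID", and the Collection Builder URL from the table yields "HathiTrust Collection Builder URL".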

Volume Download

The htrc download command retrieves volumes from the HTRC Data API to the secure mode of the HTRC Data Capsule Service.

Note

This command will return an error when run on a non-HTRC computer or on a Capsule running in maintenance mode.

Arguments

usage: htrc download [-h] [-f] [-o OUTPUT] [-c] [file]

Positional Arguments

file

workset path[s]

Default: standard input (stdin)

Named Arguments

-f, --force

remove folder if exists

Default: False

-o, --output

output directory

Default: “/media/secure_volume/workset/”

-c, --concat

concatenate a volume’s pages into a single file

Default: False
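
When htrc download is driven from a script, it can help to assemble the argument list programmatically. The sketch below only builds the command line from the options documented above; build_download_command is a hypothetical helper name, and nothing is executed.

```python
import shlex


def build_download_command(workset_path, output=None, concat=False, force=False):
    """Return the argv list for an `htrc download` call.

    Hypothetical helper: it uses only the flags listed in the usage line
    above and does not run the command.
    """
    argv = ["htrc", "download", workset_path]
    if output is not None:
        argv += ["--output", output]
    if concat:
        argv.append("--concat")
    if force:
        argv.append("--force")
    return argv


argv = build_download_command(
    "/home/dcuser/HTRC/htrc-id",
    output="/media/secure_volume/my-workset",
    concat=True,
)
# shlex.join quotes arguments (such as URLs containing '&') for safe pasting
# into a shell.
print(shlex.join(argv))
```

Inside a capsule in secure mode, the list could then be handed to subprocess.run(argv, check=True); on any other machine the command is expected to fail, as noted above.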

Bibliographic API Access

htrc metadata retrieves metadata from the HathiTrust Bibliographic API. This command has no limitations on which computer or network executes it.

Arguments

usage: htrc metadata [-h] path [path ...]

Positional Arguments

path workset path[s]

Analysis Workflows

The HTRC Workset Toolkit also provides the command line tool htrc run. Like volume download, the

Topic Modeling

There are two implementations of LDA topic modeling supported by the

Arguments

usage: htrc run mallet [-h] -k K [--iter ITER] [--workset-path WORKSET_PATH]
                       [path]

Positional Arguments

path Default: “/media/secure_volume/workset/”

Named Arguments

-k

number of topics

--iter

number of iterations

Default: 200

--workset-path

Location to store workset download.

Default: “/media/secure_volume/workset/”

usage: htrc run topicexplorer [-h] -k K [K ...] [--iter ITER]
                              [--workset-path WORKSET_PATH]
                              [path]

Positional Arguments

path Default: “/media/secure_volume/workset”

Named Arguments

-k

number of topics

--iter

number of iterations

Default: 200

--workset-path

Location to store workset download.

Default: “/media/secure_volume/workset”
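
The two usage lines differ in one detail: htrc run mallet takes a single -k value, while htrc run topicexplorer accepts one or more. The following sketch, under the assumption that only the documented flags are needed, builds either command and checks that difference. build_run_command is a hypothetical name and the command is not executed.

```python
def build_run_command(tool, path, k, iterations=None):
    """Return the argv list for `htrc run mallet` or `htrc run topicexplorer`.

    Hypothetical helper based only on the usage lines above.
    `k` may be a single int or a sequence of ints.
    """
    if tool not in ("mallet", "topicexplorer"):
        raise ValueError(f"unsupported tool: {tool!r}")
    ks = [k] if isinstance(k, int) else list(k)
    if not ks:
        raise ValueError("at least one -k value is required")
    if tool == "mallet" and len(ks) != 1:
        raise ValueError("htrc run mallet takes exactly one -k value")
    # The path goes before -k so that a multi-valued -k cannot swallow it.
    argv = ["htrc", "run", tool, path, "-k", *[str(n) for n in ks]]
    if iterations is not None:
        argv += ["--iter", str(iterations)]
    return argv
```

For example, build_run_command("topicexplorer", "/media/secure_volume/workset", [20, 40]) produces a command that trains two models, one per topic count.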

Use Cases and Examples

Following are the use cases and examples of htrc commands inside the HTRC Data Capsule.

command: htrc download capsule mode: secure
  • Download volumes of volume id list to default path : (/media/secure_volume/workset)

    htrc download /home/dcuser/HTRC/htrc-id

  • Download volumes of hathi collection url to default path : (/media/secure_volume/workset)

    htrc download "https://babel.hathitrust.org/cgi/mb?a=listis&c=1337751722"

  • Download volumes to specific location :

    htrc download /home/dcuser/HTRC/htrc-id -o /media/secure_volume/my-workset

  • Download volumes to specific location with concatenation option - (This will concatenate all the pages of the volume into one txt file.) :

    htrc download /home/dcuser/HTRC/htrc-id -o /media/secure_volume/my-workset -c


command: htrc metadata capsule mode: secure and maintenance
  • Download the metadata of volumes by giving hathi collection url :

    htrc metadata "https://babel.hathitrust.org/cgi/mb?a=listis&c=1853042514"

  • Download the metadata of volumes by giving volume id list :

    htrc metadata /home/dcuser/HTRC/htrc-id

  • Download the metadata associated with volume id : volume 1 of The Works of Jonathan Swift

    htrc metadata mdp.39015078560078

    Note that this would only retrieve the first volume. If you want to download metadata for all 8 volumes, the catalog identifier would be used:

    htrc metadata 001423370

    Each command can be used with the URL as well (note the quote marks around each URL):

    htrc metadata "https://babel.hathitrust.org/cgi/pt?id=mdp.39015078560078;view=1up;seq=13"

    htrc metadata "https://catalog.hathitrust.org/Record/001423370"

    This URL support makes it easy to browse hathitrust.org and copy links for computational analysis using the SDK.
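
The URL forms above carry the same identifiers as the bare volume ID and catalog ID commands. As a rough illustration of that correspondence, the sketch below pulls the identifier out of each URL type. id_from_url is a hypothetical helper, not part of the toolkit, and the URL layouts are assumed from the examples in this document.

```python
import re
from urllib.parse import unquote, urlparse


def id_from_url(url):
    """Return the volume or catalog ID embedded in a HathiTrust-style URL.

    Hypothetical helper; returns None when the URL shape is not recognized.
    """
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host == "babel.hathitrust.org":
        # Query fields in the examples are separated by ';' (or '&').
        for field in re.split(r"[;&]", parsed.query):
            key, _, value = field.partition("=")
            if key == "id" and value:
                return unquote(value)
    if host == "hdl.handle.net" and parsed.path.startswith("/2027/"):
        return unquote(parsed.path[len("/2027/"):])
    if host == "catalog.hathitrust.org" and parsed.path.startswith("/Record/"):
        return parsed.path[len("/Record/"):]
    return None
```

With the volume URL shown earlier, the function returns mdp.39015078560078; with the catalog URL, it returns 001423370.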


command: htrc metadata capsule mode: secure
  • Download the metadata of volumes by giving already downloaded volumes path :

    htrc metadata /media/secure_volume/workset


command: htrc metadata capsule mode: maintenance
  • Download the metadata of volumes by giving already downloaded volumes path - (Sample volumes are available in capsules created with the ubuntu-16-04-with-sample-volumes image. They are provided as zip files. Unzip them before use, because the metadata function gets volume ids from volume directory names.) :

    mkdir /home/dcuser/unzipped_volumes

    cp /home/dcuser/HTRC/data/sample_volumes/fiction/<zip_files> /home/dcuser/unzipped_volumes

    unzip '/home/dcuser/unzipped_volumes/*.zip' -d /home/dcuser/unzipped_volumes && rm /home/dcuser/unzipped_volumes/*.zip

    htrc metadata /home/dcuser/unzipped_volumes
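
The unzip step can also be done from Python, which avoids shell quoting around the wildcard. This is a minimal sketch using only the standard library. unzip_all is a hypothetical helper, and it leaves the original zip files in place rather than deleting them.

```python
import zipfile
from pathlib import Path


def unzip_all(zip_dir, target_dir):
    """Extract every .zip file found in zip_dir into target_dir.

    Hypothetical helper; returns the names of the archives it extracted.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    extracted = []
    for archive in sorted(Path(zip_dir).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)
        extracted.append(archive.name)
    return extracted
```

In the capsule, a call such as unzip_all("/home/dcuser/HTRC/data/sample_volumes/fiction", "/home/dcuser/unzipped_volumes") would prepare the directory that htrc metadata is then pointed at.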


command: htrc export capsule mode: secure and maintenance
  • Export volume ids from downloaded hathi collection and create workset with only volume ids :

    Go to the following link in the browser

    https://babel.hathitrust.org/cgi/mb?a=listis&c=1853042514

    Download metadata in tab separated format (Download Item Metadata: Tab-Delimited Text (TSV)), then -

    htrc export mb-9.txt > volumes.txt

  • Export volume ids from hathi collection url and create workset with only volume ids (works in both secure and maintenance modes) :

    htrc export "https://babel.hathitrust.org/cgi/mb?a=listis&c=1853042514" > volumes.txt
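
Once volumes.txt exists, it can be read back for further scripting. The sketch below assumes the exported file holds one volume id per line, which this document does not state explicitly. read_volume_ids is a hypothetical helper.

```python
def read_volume_ids(path):
    """Read volume ids from a text file, assuming one id per line.

    Hypothetical helper: blank lines are skipped and duplicates are dropped
    while keeping the original order.
    """
    seen = set()
    ids = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            volume_id = line.strip()
            if volume_id and volume_id not in seen:
                seen.add(volume_id)
                ids.append(volume_id)
    return ids
```

A call such as read_volume_ids("volumes.txt") then gives a list that can be counted, filtered, or written out as a smaller workset file.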


command: htrc run mallet capsule mode: secure
  • Run mallet on already downloaded volumes :

    htrc run mallet /media/secure_volume/workset -k 20

  • Run mallet on volume id list :

    htrc run mallet /home/dcuser/HTRC/htrc-id -k 20

  • Run mallet on hathi collection :

    htrc run mallet "https://babel.hathitrust.org/cgi/mb?a=listis&c=1853042514" -k 20


command: htrc run mallet capsule mode: maintenance
  • Run mallet on already downloaded volumes - (Sample volumes are available in capsules created with the ubuntu-16-04-with-sample-volumes image. They are provided as zip files. Unzip them before use, because the metadata function gets volume ids from volume directory names.) :

    htrc run mallet /home/dcuser/unzipped_volumes -k 20


command: htrc run topicexplorer capsule mode: secure
  • Run topicexplorer on already downloaded volumes :

    htrc run topicexplorer /media/secure_volume/workset -k 20

  • Run topicexplorer on volume id list :

    htrc run topicexplorer /home/dcuser/HTRC/htrc-id -k 20

  • Run topicexplorer on hathi collection :

    htrc run topicexplorer "https://babel.hathitrust.org/cgi/mb?a=listis&c=1853042514" -k 20


command: htrc run topicexplorer capsule mode: maintenance
  • Run topicexplorer on already downloaded volumes - (Sample volumes are available in capsules created with the ubuntu-16-04-with-sample-volumes image. They are provided as zip files. Unzip them before use, because the metadata function gets volume ids from volume directory names.) :

    htrc run topicexplorer /home/dcuser/unzipped_volumes -k 20