HTRC Workset Toolkit¶
The HTRC Workset Toolkit provides a command line interface for interacting with and analyzing volumes in the HathiTrust Digital Library:
- Volume Download (htrc download)
- Metadata Download (htrc metadata)
- Pre-built Analysis Workflows (htrc run)
- Export of volume lists (htrc export)
Workset Path¶
Each of these commands takes a workset path. Valid types of workset paths and examples of each are:
Identifier Type | Example
---|---
HathiTrust ID | mdp.39015078560078
HathiTrust Catalog ID | 001423370
HathiTrust URL | https://babel.hathitrust.org/cgi/pt?id=mdp.39015078560078;view=1up;seq=13
Handle.org Volume URL | https://hdl.handle.net/2027/mdp.39015078560078
HathiTrust Catalog URL | https://catalog.hathitrust.org/Record/001423370
HathiTrust Collection Builder URL | https://babel.hathitrust.org/cgi/mb?a=listis&c=1337751722
Local volumes file | /home/dcuser/HTRC/htrc-id
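A local volumes file is a plain-text file with one volume identifier per line. A minimal sketch (the filename htrc-id matches the examples later in this document; the two IDs are the ones used elsewhere on this page):

```shell
# Create a local volumes file: one HathiTrust volume ID per line.
cat > htrc-id <<'EOF'
mdp.39015078560078
coo.31924089593846
EOF

# Any of the htrc commands can then take this file as its workset path,
# e.g.: htrc download htrc-id
wc -l < htrc-id
```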
Volume Download¶
The htrc download command retrieves volumes from the HTRC Data API into the secure mode of the HTRC Data Capsule Service.
Note
This command will return an error when run on a non-HTRC computer or on a Capsule running in maintenance mode.
Arguments¶
usage: htrc download [-h] [-f] [-o OUTPUT] [-hf] [-hfc] [-w N] [-msr N] [-s] [--parallelism N] [--batch-size N] [-c] [-m] [-pg] [-t TOKEN] [-dh DATAHOST] [-dp DATAPORT] [-de DATAEPR] [-dc DATACERT] [-dk DATAKEY] [file]
Positional Arguments¶
- file
Workset path[s]
Default: standard input (stdin)
Named Arguments¶
- -f, --force
Remove the output folder if it already exists
Default: False
- -o, --output
Output directory
Default: "/media/secure_volume/workset/"
- -hf, --remove-headers-footers
Remove headers and footers from individual pages and save them in a separate CSV file for inspection
Default: False
- -hfc, --remove-headers-footers-and-concat
Remove headers and footers from individual pages, save them in a separate CSV file for inspection, then concatenate the pages
Default: False
- -w, --window-size
How many pages ahead the header/footer extraction algorithm looks to find potentially matching headers/footers (a higher value can give more accurate results on lower-quality OCR volumes, at the expense of runtime)
Default: 6
- -msr, --min-similarity-ratio
The minimum string similarity ratio required for the Levenshtein-distance fuzzy-matching algorithm to consider two headers 'the same' (higher values, up to a maximum of 1.0, make the matching stricter; lower values allow more fuzziness to account for OCR errors)
Default: 0.7
- -s, --skip-removed-hf
Skip creating a saved report of the removed headers and footers for each page for inspection
Default: False
- --parallelism
The max number of concurrent tasks to start when downloading or removing headers/footers
Default: 10
- --batch-size
The max number of volumes to download at a time from DataAPI
Default: 250
- -c, --concat
Concatenate a volume’s pages into a single file
Default: False
- -m, --mets
Add volume’s METS file
Default: False
- -pg, --pages
Download the given page numbers of a volume.
Default: False
- -t, --token
JWT for volume downloads.
- -dh, --datahost
Data API host.
- -dp, --dataport
Data API port.
- -de, --dataepr
Data API EPR.
- -dc, --datacert
Client certificate file for mutual TLS with Data API.
- -dk, --datakey
Client key file for mutual TLS with Data API.
Bibliographic API Access¶
htrc metadata
retrieves metadata from the HathiTrust Bibliographic API.
Unlike htrc download, this command can be run from any computer on any network.
Arguments¶
usage: htrc metadata [-h] path [path ...]
Positional Arguments¶
- path
Workset path[s]
Analysis Workflows¶
The HTRC Workset Toolkit also provides the command line tool htrc run. Like volume download, it takes a workset path as its argument.
Topic Modeling¶
There are two implementations of LDA topic modeling supported by the toolkit. The MALLET implementation is invoked with htrc run mallet.
Arguments¶
usage: htrc run mallet [-h] -k K [--iter ITER] [--workset-path WORKSET_PATH] [path]
Positional Arguments¶
- path
Default: "/media/secure_volume/workset/"
Named Arguments¶
- -k
Number of topics (required)
- --iter
Number of iterations
Default: 200
- --workset-path
Location to store the workset download.
Default: "/media/secure_volume/workset/"
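Putting these flags together, an illustrative invocation (the values are examples, not recommendations; like htrc download, this only runs inside an HTRC Data Capsule in secure mode):

```shell
# Train a 20-topic model for 500 iterations on an already-downloaded workset.
htrc run mallet /media/secure_volume/workset -k 20 --iter 500
```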
Use Cases and Examples¶
Following are the use cases and examples of htrc
commands inside the HTRC Data Capsule.
command: htrc download | capsule mode: secure
Download volumes from a volume id list to the default path (/media/secure_volume/workset):
htrc download /home/dcuser/HTRC/htrc-id
Download volumes from a HathiTrust collection URL to the default path (/media/secure_volume/workset):
htrc download "https://babel.hathitrust.org/cgi/mb?a=listis&c=1337751722"
Download volumes to a specific location:
htrc download /home/dcuser/HTRC/htrc-id -o /media/secure_volume/my-workset
Download volumes to a specific location with the concatenation option (this concatenates all the pages of each volume into one txt file):
htrc download /home/dcuser/HTRC/htrc-id -o /media/secure_volume/my-workset -c
Download specific pages from a single volume:
htrc download -pg coo.31924089593846[5,10,15,20,25,30]
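Because square brackets are shell glob characters, it is safer to quote the volume-page specification so the shell passes it to htrc unchanged (this is a note on shell behavior, not an htrc requirement):

```shell
# Quote the spec: an unquoted [...] could be expanded by the shell
# if a matching filename happens to exist in the current directory.
spec='coo.31924089593846[5,10,15,20,25,30]'
printf '%s\n' "$spec"
# htrc download -pg "$spec"
```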
Download volumes and then extract headers/footers from them:
htrc download -hf /home/dcuser/HTRC/htrc-id
Download volumes, extract headers/footers from the volume pages, then concatenate the pages (this concatenates all the pages of each volume into one txt file):
htrc download -hfc /home/dcuser/HTRC/htrc-id
Download volumes, extract headers/footers from the volumes, and skip creating the .csv files containing the removed headers and footers:
htrc download -hf -s /home/dcuser/HTRC/htrc-id
Download volumes, extract headers/footers from the volumes, and change the window of pages used by the extractor algorithm (the default is 6; lower values increase speed but are less accurate):
htrc download -hf -w 3 /home/dcuser/HTRC/htrc-id
Download volumes, extract headers/footers from the volumes, and change the minimum similarity ratio for lines on pages to be considered a header or footer (the default is 0.7, i.e. 70%: if a line is 70% similar to lines on other pages within the window of pages, it is labeled a header or footer and removed):
htrc download -hf -msr .9 /home/dcuser/HTRC/htrc-id
Download volumes, extract headers/footers from the volumes, and change the maximum number of concurrent tasks (note that the only options are 1 or 2):
htrc download -hf --parallelism 2 /home/dcuser/HTRC/htrc-id
command: htrc metadata | capsule mode: secure and maintenance
Download the metadata of volumes from a HathiTrust collection URL:
htrc metadata "https://babel.hathitrust.org/cgi/mb?a=listis&c=1853042514"
Download the metadata of volumes from a volume id list:
htrc metadata /home/dcuser/HTRC/htrc-id
Download the metadata associated with a single volume id (volume 1 of The Works of Jonathan Swift):
htrc metadata mdp.39015078560078
Note that this retrieves only the first volume. To download metadata for all 8 volumes, use the catalog identifier:
htrc metadata 001423370
Each command can be used with the URL as well (note the quote marks around each URL):
htrc metadata "https://babel.hathitrust.org/cgi/pt?id=mdp.39015078560078;view=1up;seq=13"
htrc metadata "https://catalog.hathitrust.org/Record/001423370"
This URL support makes it easy to browse hathitrust.org and copy links for computational analysis using the SDK.
command: htrc metadata | capsule mode: secure
Download the metadata of volumes from a path of already-downloaded volumes:
htrc metadata /media/secure_volume/workset
command: htrc metadata | capsule mode: maintenance
Download the metadata of volumes from a path of already-downloaded volumes. (Sample volumes are available in capsules created with the ubuntu-16-04-with-sample-volumes image. The sample volumes are provided as zip files; unzip them before use, because the metadata function gets volume ids from the volume directory names.)
mkdir /home/dcuser/unzipped_volumes
cp /home/dcuser/HTRC/data/sample_volumes/fiction/<zip_files> /home/dcuser/unzipped_volumes
unzip '/home/dcuser/unzipped_volumes/*.zip' -d /home/dcuser/unzipped_volumes && rm /home/dcuser/unzipped_volumes/*.zip
htrc metadata /home/dcuser/unzipped_volumes
command: htrc export | capsule mode: secure and maintenance
Export volume ids from a downloaded HathiTrust collection and create a workset containing only volume ids:
Go to the following link in a browser:
https://babel.hathitrust.org/cgi/mb?a=listis&c=1853042514
Download the item metadata in tab-separated format (Download Item Metadata: Tab-Delimited Text (TSV)), then:
htrc export mb-9.txt > volumes.txt
Export volume ids from a HathiTrust collection URL and create a workset with only volume ids (works in both secure and maintenance modes):
htrc export "https://babel.hathitrust.org/cgi/mb?a=listis&c=1853042514" > volumes.txt
command: htrc run | capsule mode: secure
Run mallet on already-downloaded volumes:
htrc run mallet /media/secure_volume/workset -k 20
Run mallet on a volume id list:
htrc run mallet /home/dcuser/HTRC/htrc-id -k 20
Run mallet on a HathiTrust collection:
htrc run mallet "https://babel.hathitrust.org/cgi/mb?a=listis&c=1853042514" -k 20
command: htrc run | capsule mode: maintenance
Run mallet on already-downloaded volumes. (Sample volumes are available in capsules created with the ubuntu-16-04-with-sample-volumes image. The sample volumes are provided as zip files; unzip them before use, because the toolkit gets volume ids from the volume directory names.):
htrc run mallet /home/dcuser/unzipped_volumes -k 20