HTRC Workset Toolkit
======================
The HTRC Workset Toolkit provides a command line interface for interacting with
and analyzing volumes in the HathiTrust Digital Library:

- Volume Download (``htrc download``)
- Metadata Download (``htrc metadata``)
- Pre-built Analysis Workflows (``htrc run``)
- Export of volume lists (``htrc export``)

Workset Path
--------------
Each of these commands takes a *workset path*. Valid types of workset paths and
examples of each are:

================================== ==============================================================================
Identifier Type                    Example
================================== ==============================================================================
HathiTrust ID                      mdp.39015078560078
HathiTrust Catalog ID              001423370
HathiTrust URL                     https://babel.hathitrust.org/cgi/pt?id=mdp.39015078560078;view=1up;seq=13
Handle.org Volume URL              https://hdl.handle.net/2027/mdp.39015078560078
HathiTrust Catalog URL             https://catalog.hathitrust.org/Record/001423370
HathiTrust Collection Builder URL  https://babel.hathitrust.org/shcgi/mb?a=listis;c=696632727
Local volumes file                 ``/home/dcuser/Downloads/collections.txt``
================================== ==============================================================================

Volume Download
--------------------
The ``htrc download`` command retrieves volumes from the `HTRC Data API`_
to the secure mode of the :ref:`HTRC Data Capsule Service`.

.. note::

   This command will return an error when run on a non-HTRC computer or on a
   Capsule running in maintenance mode.

.. _HTRC Data API: https://wiki.htrc.illinois.edu/display/COM/HTRC+Data+API+Users+Guide

Arguments
'''''''''''
.. argparse::
   :module: htrc.__main__
   :func: download_parser
   :prog: htrc download

Bibliographic API Access
--------------------------
``htrc metadata`` retrieves metadata from the `HathiTrust Bibliographic API`_.
This command can be run from any computer or network.

.. _HathiTrust Bibliographic API: https://www.hathitrust.org/bib_api

Arguments
'''''''''''
.. argparse::
   :module: htrc.__main__
   :func: add_workset_path
   :prog: htrc metadata

Analysis Workflows
--------------------
The HTRC Workset Toolkit also provides the command line tool ``htrc run``.
Like `volume download`_, the

Topic Modeling
''''''''''''''''
There are two implementations of LDA topic modeling supported by the

Arguments
'''''''''''
.. argparse::
   :module: htrc.tools.mallet
   :func: populate_parser
   :prog: htrc run mallet

Use Cases and Examples
--------------------------------------------
The following use cases and examples show ``htrc`` commands run inside the
HTRC Data Capsule.

+---------------------------------+---------------------------+
| command: ``htrc download``      | capsule mode: **secure**  |
+---------------------------------+---------------------------+

* Download the volumes in a volume ID list to the default path
  (``/media/secure_volume/workset``):

  ``htrc download /home/dcuser/HTRC/htrc-id``

* Download the volumes in a HathiTrust collection URL to the default path
  (``/media/secure_volume/workset``):

  ``htrc download "https://babel.hathitrust.org/cgi/mb?a=listis&c=1337751722"``

* Download volumes to a specific location:

  ``htrc download /home/dcuser/HTRC/htrc-id -o /media/secure_volume/my-workset``

* Download volumes to a specific location with the concatenation option
  (this concatenates all the pages of each volume into one txt file):

  ``htrc download /home/dcuser/HTRC/htrc-id -o /media/secure_volume/my-workset -c``

* Download specific pages from a single volume:

  ``htrc download -pg coo.31924089593846[5,10,15,20,25,30]``

* Download volumes and then extract headers/footers from the volumes:

  ``htrc download -hf /home/dcuser/HTRC/htrc-id``

* Download volumes, extract headers/footers from the volume pages, then
  concatenate the pages (this concatenates all the pages of each volume into
  one txt file):

  ``htrc download -hfc /home/dcuser/HTRC/htrc-id``

* Download volumes, extract headers/footers from the volumes, and skip
  downloading the .csv files containing the removed headers and footers:

  ``htrc download -hf -s /home/dcuser/HTRC/htrc-id``

* Download volumes, extract headers/footers from the volumes, and change the
  window of pages used by the extractor algorithm (the default is 6; lower
  numbers increase speed but are less accurate):

  ``htrc download -hf -w 3 /home/dcuser/HTRC/htrc-id``

* Download volumes, extract headers/footers from the volumes, and change the
  minimum similarity rate for lines on pages to be considered a header or
  footer (the default is .7, or 70%: a line that is at least 70% the same as
  lines on other pages within the window of pages is labeled a header or
  footer and removed):

  ``htrc download -hf -msr .9 /home/dcuser/HTRC/htrc-id``

* Download volumes, extract headers/footers from the volumes, and change the
  maximum number of concurrent tasks (note that the only options are 1 or 2):

  ``htrc download -hf --parallelism 2 /home/dcuser/HTRC/htrc-id``

|

+---------------------------------+-----------------------------------------------+
| command: ``htrc metadata``      | capsule mode: **secure** and **maintenance**  |
+---------------------------------+-----------------------------------------------+

* Download the metadata of volumes, given a HathiTrust collection URL:

  ``htrc metadata "https://babel.hathitrust.org/cgi/mb?a=listis&c=1853042514"``

* Download the metadata of volumes, given a volume ID list:

  ``htrc metadata /home/dcuser/HTRC/htrc-id``

* Download the metadata associated with a single volume ID, here volume 1 of
  `The Works of Jonathan Swift`_:

  ``htrc metadata mdp.39015078560078``

  Note that this retrieves only the first volume. To download metadata for
  all 8 volumes, use the catalog identifier instead:

  ``htrc metadata 001423370``

  Each command can be used with the URL as well (*note the quote marks around
  each URL*):

  ``htrc metadata "https://babel.hathitrust.org/cgi/pt?id=mdp.39015078560078;view=1up;seq=13"``

  ``htrc metadata "https://catalog.hathitrust.org/Record/001423370"``

  This URL support makes it easy to browse `hathitrust.org`_ and copy links
  for computational analysis using the SDK.

.. _The Works of Jonathan Swift: https://hdl.handle.net/2027/mdp.39015078560078
.. _hathitrust.org: https://www.hathitrust.org/

|

+---------------------------------+-----------------------------------------------+
| command: ``htrc metadata``      | capsule mode: **secure**                      |
+---------------------------------+-----------------------------------------------+

* Download the metadata of volumes, given the path to already downloaded
  volumes:

  ``htrc metadata /media/secure_volume/workset``

|

+---------------------------------+-----------------------------------------------+
| command: ``htrc metadata``      | capsule mode: **maintenance**                 |
+---------------------------------+-----------------------------------------------+

* Download the metadata of volumes, given the path to already downloaded
  volumes. Sample volumes are available in capsules created with the
  ubuntu-16-04-with-sample-volumes image. They are provided as zip files, so
  unzip them before use, because the metadata function gets volume IDs from
  the volume directory names:

  ``mkdir /home/dcuser/unzipped_volumes``

  ``cp /home/dcuser/HTRC/data/sample_volumes/fiction/*.zip /home/dcuser/unzipped_volumes``

  ``unzip '/home/dcuser/unzipped_volumes/*.zip' -d /home/dcuser/unzipped_volumes && rm /home/dcuser/unzipped_volumes/*.zip``

  ``htrc metadata /home/dcuser/unzipped_volumes``

|

+---------------------------------+-----------------------------------------------+
| command: ``htrc export``        | capsule mode: **secure** and **maintenance**  |
+---------------------------------+-----------------------------------------------+

* Export the volume IDs from a downloaded HathiTrust collection and create a
  workset containing only volume IDs. Open the following link in the browser:

  https://babel.hathitrust.org/cgi/mb?a=listis&c=1853042514

  Download the metadata in tab-separated format (Download Item Metadata:
  Tab-Delimited Text (TSV)), then run:

  ``htrc export mb-9.txt > volumes.txt``

* Export the volume IDs from a HathiTrust collection URL and create a workset
  containing only volume IDs (works in both secure and maintenance modes):

  ``htrc export "https://babel.hathitrust.org/cgi/mb?a=listis&c=1853042514" > volumes.txt``

|

+---------------------------------+-----------------------------------------------+
| command: ``htrc run mallet``    | capsule mode: **secure**                      |
+---------------------------------+-----------------------------------------------+

* Run mallet on already downloaded volumes:

  ``htrc run mallet /media/secure_volume/workset -k 20``

* Run mallet on a volume ID list:

  ``htrc run mallet /home/dcuser/HTRC/htrc-id -k 20``

* Run mallet on a HathiTrust collection:

  ``htrc run mallet "https://babel.hathitrust.org/cgi/mb?a=listis&c=1853042514" -k 20``

|

+---------------------------------+-----------------------------------------------+
| command: ``htrc run mallet``    | capsule mode: **maintenance**                 |
+---------------------------------+-----------------------------------------------+

* Run mallet on already downloaded volumes. Sample volumes are available in
  capsules created with the ubuntu-16-04-with-sample-volumes image. They are
  provided as zip files, so unzip them before use, because volume IDs are
  taken from the volume directory names:

  ``htrc run mallet /home/dcuser/unzipped_volumes -k 20``
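
The workset path types listed in the `Workset Path`_ table can largely be told
apart by their shape alone. The following is a minimal sketch of that idea, not
the toolkit's own resolution logic: ``classify_workset_path`` is a hypothetical
helper, and the ID patterns are loose assumptions based only on the examples in
the table (a relative file name such as ``volumes.txt`` would be misread as a
HathiTrust ID, so a real implementation would probably check for an existing
file first).

```python
import re
from urllib.parse import urlparse


def classify_workset_path(path: str) -> str:
    """Guess the workset path type from its shape (hypothetical helper).

    The labels match the "Identifier Type" column of the Workset Path table.
    """
    parsed = urlparse(path)
    if parsed.scheme in ("http", "https"):
        host = parsed.netloc.lower()
        if host == "hdl.handle.net":
            return "Handle.org Volume URL"
        if host == "catalog.hathitrust.org":
            return "HathiTrust Catalog URL"
        if host == "babel.hathitrust.org":
            # Assumption: collection builder URLs end in /mb, page-turner URLs in /pt.
            if parsed.path.endswith("/mb"):
                return "HathiTrust Collection Builder URL"
            if parsed.path.endswith("/pt"):
                return "HathiTrust URL"
        return "Unrecognized URL"
    # Assumption: catalog IDs are all digits, e.g. 001423370.
    if re.fullmatch(r"\d+", path):
        return "HathiTrust Catalog ID"
    # Assumption: volume IDs look like <namespace>.<local id>, e.g. mdp.39015078560078.
    if re.fullmatch(r"[a-z0-9]+\.\S+", path):
        return "HathiTrust ID"
    # Anything else is treated as a path to a local volumes file.
    return "Local volumes file"


if __name__ == "__main__":
    for example in (
        "mdp.39015078560078",
        "001423370",
        "https://hdl.handle.net/2027/mdp.39015078560078",
        "/home/dcuser/Downloads/collections.txt",
    ):
        print(example, "->", classify_workset_path(example))
```

Because URLs contain characters such as ``&`` and ``;`` that the shell
interprets, they still need to be quoted on the command line before a helper
like this would ever see them intact.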