Workset Creation through Image Analysis of Document Pages
Texas A&M University
PI: Keith Biggers
The image analysis project has four main technical achievements: an extensible framework for configuring and executing image analysis workflows; implementations of several algorithms from the research literature for image cleaning, manipulation, and segmentation; a library that encapsulates access to HathiTrust APIs and data structures for ease of use within Java applications; and an application, designed to run on HTRC servers, that reads a list of items, loads the corresponding page images, and executes an image analysis workflow.
The core technical component of our system is DataTrax, a framework for executing user-configurable image analysis workflows. DataTrax, the library of document image analysis algorithms, and the HathiTrust Software Development Kit (which connects new technology to the existing infrastructure and resources provided by HathiTrust) are integral to the prototype grant, but they are implemented as separate libraries that can be, and are being, used independently. Together, they provide a framework for working with the large and diverse collection of page images housed at the HathiTrust Digital Library, and the library-oriented development process is intended to make this work useful to a broad range of projects beyond the scope of WCSA.
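To make the idea of a user-configurable workflow concrete, the following is a minimal Java sketch of how an ordered chain of image analysis steps might be assembled and applied to a list of pages. It is an illustration only and does not reflect the actual DataTrax or HathiTrust SDK APIs: every name in it (WorkflowSketch, PageImage, Workflow, binarize, invert) is hypothetical, the page is reduced to a small grayscale pixel grid, and the in-memory list stands in for loading page images from HathiTrust.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

/** Hypothetical sketch; none of these names come from DataTrax or the HathiTrust SDK. */
public class WorkflowSketch {

    /** A page image reduced to a grayscale pixel grid (values 0-255). */
    static final class PageImage {
        final String itemId;
        final int[][] pixels;

        PageImage(String itemId, int[][] pixels) {
            this.itemId = itemId;
            this.pixels = pixels;
        }
    }

    /** A workflow is an ordered, user-configured list of named steps. */
    static final class Workflow {
        private final List<String> names = new ArrayList<>();
        private final List<UnaryOperator<PageImage>> steps = new ArrayList<>();

        /** Appends a step and returns this workflow so calls can be chained. */
        Workflow then(String name, UnaryOperator<PageImage> step) {
            names.add(name);
            steps.add(step);
            return this;
        }

        /** Applies every step in order, feeding each result to the next step. */
        PageImage run(PageImage page) {
            PageImage current = page;
            for (UnaryOperator<PageImage> step : steps) {
                current = step.apply(current);
            }
            return current;
        }

        List<String> stepNames() {
            return new ArrayList<>(names);
        }
    }

    /** Example cleaning step: fixed-threshold binarization (an assumed, simple choice). */
    static PageImage binarize(PageImage page, int threshold) {
        int[][] out = new int[page.pixels.length][];
        for (int y = 0; y < page.pixels.length; y++) {
            out[y] = new int[page.pixels[y].length];
            for (int x = 0; x < page.pixels[y].length; x++) {
                out[y][x] = page.pixels[y][x] >= threshold ? 255 : 0;
            }
        }
        return new PageImage(page.itemId, out);
    }

    /** Example manipulation step: invert every pixel value. */
    static PageImage invert(PageImage page) {
        int[][] out = new int[page.pixels.length][];
        for (int y = 0; y < page.pixels.length; y++) {
            out[y] = new int[page.pixels[y].length];
            for (int x = 0; x < page.pixels[y].length; x++) {
                out[y][x] = 255 - page.pixels[y][x];
            }
        }
        return new PageImage(page.itemId, out);
    }

    public static void main(String[] args) {
        // Configure the workflow: the step order is chosen by the user.
        Workflow workflow = new Workflow()
                .then("binarize", p -> binarize(p, 128))
                .then("invert", WorkflowSketch::invert);

        // Stand-in for reading a list of items and loading their page images.
        List<PageImage> pages = new ArrayList<>();
        pages.add(new PageImage("item-1", new int[][] {{10, 200}, {130, 90}}));

        for (PageImage page : pages) {
            PageImage result = workflow.run(page);
            System.out.println(result.itemId + " " + workflow.stepNames()
                    + " -> " + Arrays.deepToString(result.pixels));
        }
    }
}
```

Keeping each step as an independent function from page to page is one way to let the workflow runner and the algorithm library live in separate modules, in the spirit of the separation described above; the real framework may be organized quite differently.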
A copy of the project’s final report is available via IDEALS.