I have a couple of downloadable PDF versions available now - the rest of the set will be available later on. Here are the links:
- Herons and Cobblestones, 11MB download
- Oakland Township: Two Hundred Years, Volume I, 34MB download
The files are a bit bigger than I'd like, but they're within reason and have full OCR text.
I produced these using a script I wrote that yokes together a few different tools to produce layered PDFs. Layering helps keep the file sizes down: the text gets an efficient bitonal compression, while the illustrations are downscaled to 100 DPI and compressed as medium-quality JPEG. I've received permission from my employer to release the script under the GPL, so I'll make it available soon. I just have a few improvements to make before I do that. The steps it performs are:
- Split Scan Tailor images into separate TIFF files for bitonal text and illustrations, using ST Separator (Note: this isn't part of the script yet, but I'm hoping to integrate that functionality into it)
- Make the white background in the illustration files transparent, and convert them to PDF with medium-quality JPEG
- Encode the bitonal text to DjVu and back to TIFF, for symbol merging
- Encode text to PDF (currently using Group4 - I'm not sure if there's an open-source solution for JBIG2 encoding)
- Merge text and illustration into a single page with two layers
- Merge all pages into a single document
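The text-layer and merging steps above could be sketched roughly as follows. This is a minimal illustration, not the script itself: it assumes DjVuLibre (`cjb2`, `ddjvu`), libtiff (`tiffcp`, `tiff2pdf`) and `pdftk` as the underlying tools, none of which are named in this post. The function names, the file-naming scheme and the 600 DPI default are hypothetical. It only builds the command lines, and it assumes the Scan Tailor separation and the illustration-to-PDF conversion have already been done.

```python
from pathlib import Path


def page_commands(text_tif, illus_pdf, workdir, dpi=600):
    """Build the external commands for one page (hypothetical helper).

    text_tif  -- bitonal TIFF holding the page's text
    illus_pdf -- illustration layer, already converted to PDF
    workdir   -- directory for intermediate files
    dpi       -- scan resolution passed to cjb2 (600 is an assumed default)

    Returns (list of argument lists, path of the merged page PDF).
    """
    work = Path(workdir)
    stem = Path(text_tif).stem
    djvu = work / f"{stem}.djvu"
    merged_tif = work / f"{stem}.merged.tif"
    g4_tif = work / f"{stem}.g4.tif"
    text_pdf = work / f"{stem}.text.pdf"
    page_pdf = work / f"{stem}.page.pdf"

    commands = [
        # Encode to DjVu; -lossy lets cjb2 substitute similar-looking symbols.
        ["cjb2", "-dpi", str(dpi), "-lossy", str(text_tif), str(djvu)],
        # Decode back to TIFF so the merged symbols are baked into the image.
        ["ddjvu", "-format=tiff", str(djvu), str(merged_tif)],
        # Make sure the TIFF is Group4-compressed before wrapping it in a PDF.
        ["tiffcp", "-c", "g4", str(merged_tif), str(g4_tif)],
        # Wrap the TIFF in a single-page PDF.
        ["tiff2pdf", "-o", str(text_pdf), str(g4_tif)],
        # Overlay the illustration page on top of the text page.
        ["pdftk", str(text_pdf), "stamp", str(illus_pdf),
         "output", str(page_pdf)],
    ]
    return commands, page_pdf


def book_command(page_pdfs, out_pdf):
    """Build the command that concatenates all page PDFs into one document."""
    return ["pdftk", *[str(p) for p in page_pdfs], "cat", "output", str(out_pdf)]
```

Each argument list can be handed to `subprocess.run(cmd, check=True)` in order, so a failing tool stops the run instead of silently producing a broken page.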
Unfortunately, I don't currently have an open-source solution for OCR; I've been using Acrobat for that. If there's an acceptable open-source way to do it, I'd love to integrate it into the script too.
The opinions expressed in this post are my own and do not necessarily represent those of the Canadian Museum for Human Rights.