Help!
Do you have a similar post-processing-heavy small-batch workflow? In the plan outlined below, have I underestimated how much work will be required, and/or am I in danger of burdening my organization with a project no one will want to continue after me? What cross-platform, open-source software would best suit the project described below?
The guestbook:
The guestbook is a waterproof, top-bound booklet housed with writing implements in an enclosure at a loop junction & popular resting place near the start of the trail system. We provide trail users with suggestions for things to put in their entries that we might find helpful, but the rest is up to them. So, in among the more mundane and workaday tallies of number of hikers encountered, off-leash dogs, trash, and so forth, there are some really creative, intriguing, or just bizarre entries. More than just a conduit of information back to us from our users, it is a creative outlet our users seem to enjoy that is qualitatively different from the way they relate to us online. Because of this, I think it has the potential to be turned into something really great to share with the world, which I intend to explore through this project.
Figure 1: The logbook front cover. Note that it is written on. It is desirable to preserve this in the dataset and e-book, too.
The e-book:
Earlier this year, we had even considered dropping the physical guestbook entirely, as it requires manpower to maintain & process, and it was argued that people can submit comments more efficiently online. But I strongly defended keeping it, arguing successfully that without it we would almost certainly never hear from certain people who barely go near a computer, and that the actual work required to keep it up wouldn't be too burdensome. Besides which, these handwritten entries, poems, sketches, and the like put an appealingly human face on our efforts. Making it all into an e-book is something nice I'd like to do in order to connect with this diverse, eager, & observant community centered around our beloved trail. If it is well received, I'd like to make the process painless enough that I can hand it off to others.
Figure 2: A representative page from the book, showing typical orientation and tidy, chronologically sequenced entries
Extracting the data:
I have just finished the first of the following intermediate work products, and am seeking advice from this community about tackling the rest using all open source software (on Linux, but other platforms are desirable to support):
- The raw scans of page pairs: 51 scans including cover. Note that page orientation flips vertically on alternating pages. Finished! See fig. 3.
- The scans sliced into pages with orientation corrected. See fig. 2 & 3.
- The pages sliced into individual cropped fragments belonging to an entry, anticipating such problems as: non-rectangular entries, overlapping, continued on next page. See fig. 3 for a particularly challenging page.
- A data set with all of the raw data, tabulated into one row per entry, with:
  - One or more cropped image fragments per entry.
  - Full text of entry. I anticipate no OCR will be possible, so this will all be hand entered from the scans.
  - Metadata, e.g. name(s) & # of people in party, date/time, hometown, loops/distance hiked, counts of people, dogs, trash, and whatever tags we feel are helpful, such as: 'poem', 'sketch', 'thanks', 'report', etc.
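To make the "one row per entry" idea concrete, here is a minimal sketch in Python using only the standard library. Everything in it is an assumption for illustration: the field names, the choice of CSV as an interchange format (LibreOffice Calc can open it), and the use of ';' to pack several fragment file names or tags into one cell. The real columns would follow whatever metadata the organization settles on.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class Entry:
    """One guestbook entry. All field names are hypothetical."""
    entry_id: str
    fragments: list = field(default_factory=list)  # cropped image file names
    text: str = ""       # hand-typed transcription of the entry
    date: str = ""       # ISO 8601 date (YYYY-MM-DD), empty if unknown
    party_size: int = 0  # number of people in the party, 0 if unknown
    hometown: str = ""
    tags: list = field(default_factory=list)       # e.g. 'poem', 'sketch'


COLUMNS = ["entry_id", "fragments", "text", "date",
           "party_size", "hometown", "tags"]


def write_entries(entries, handle):
    """Write one CSV row per entry; list fields are joined with ';'."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    for e in entries:
        writer.writerow({
            "entry_id": e.entry_id,
            "fragments": ";".join(e.fragments),
            "text": e.text,
            "date": e.date,
            "party_size": e.party_size,
            "hometown": e.hometown,
            "tags": ";".join(e.tags),
        })


def read_entries(handle):
    """Read the CSV back into Entry objects, splitting the list fields."""
    entries = []
    for row in csv.DictReader(handle):
        entries.append(Entry(
            entry_id=row["entry_id"],
            fragments=[f for f in row["fragments"].split(";") if f],
            text=row["text"],
            date=row["date"],
            party_size=int(row["party_size"]),
            hometown=row["hometown"],
            tags=[t for t in row["tags"].split(";") if t],
        ))
    return entries


if __name__ == "__main__":
    sample = Entry("e001", ["p03_a.png", "p04_a.png"],
                   "Saw 4 hikers, 1 off-leash dog.", "2017-05-14",
                   2, "Springfield", ["report"])
    buffer = io.StringIO()
    write_entries([sample], buffer)
    print(buffer.getvalue())
```

Keeping the fragment file names in the same row as the transcription means an entry that continues onto the next page still stays a single record, which is the case the list above anticipates.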
Putting it all together:
Ultimately, I aim to assemble all of this into these final products:
- A spreadsheet document containing the data set that can be used by our organization to help produce reports, etc.
- An e-book, which should:
  - Preserve all, or substantially all of the image content from the guestbook.
  - Follow some reasonable ordering, which is probably not the original order of the entries, since they tend to get jumbled up as the book fills and people look for partial pages to write new entries on. Chronological ordering seems like a natural choice.
  - Be suitable for viewing on the web & on mobile devices.
  - Be searchable.
  - Be accessible (i.e. screen-reader-friendly).
  - Preserve look-and-feel of the original guestbook.
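As a rough illustration of how the ordering, searchability, and accessibility goals could fit together, the sketch below sorts entries chronologically and renders them as a single HTML page in which each scanned fragment is paired with its hand-typed transcription as real text. It is only a sketch under stated assumptions: the dictionary keys (`entry_id`, `date`, `fragments`, `text`), the function names, and the page layout are all hypothetical, and dates are assumed to be stored as ISO 8601 strings so that plain string sorting is chronological.

```python
from html import escape


def sort_key(entry):
    """Sort dated entries by ISO date string; undated entries go last."""
    date = entry.get("date", "")
    return (date == "", date)


def render_entry(entry):
    """Return one <article>: the scanned fragments plus the transcription.

    The transcription is emitted as ordinary text, so it can be searched
    and read aloud by a screen reader; the images keep the handwriting.
    """
    entry_id = escape(entry["entry_id"])
    parts = ['<article id="%s">' % entry_id]
    parts.append("<h2>%s</h2>" % escape(entry.get("date") or "Undated"))
    for name in entry.get("fragments", []):
        parts.append('<img src="%s" alt="Scan of handwritten entry %s" />'
                     % (escape(name), entry_id))
    parts.append("<p>%s</p>" % escape(entry.get("text", "")))
    parts.append("</article>")
    return "\n".join(parts)


def render_book(entries, title="Trail guestbook"):
    """Render all entries, in chronological order, as one HTML page."""
    body = "\n".join(render_entry(e) for e in sorted(entries, key=sort_key))
    return (
        "<!DOCTYPE html>\n"
        '<html lang="en">\n<head>\n<meta charset="utf-8" />\n'
        '<meta name="viewport" content="width=device-width, initial-scale=1" />\n'
        "<title>%s</title>\n</head>\n<body>\n<h1>%s</h1>\n%s\n</body>\n</html>\n"
        % (escape(title), escape(title), body)
    )


if __name__ == "__main__":
    demo = [
        {"entry_id": "e002", "date": "2017-08-02",
         "fragments": ["p07_b.png"], "text": "Lovely trail <3"},
        {"entry_id": "e003", "date": "",
         "fragments": ["p09_a.png"], "text": "A sketch of a heron."},
        {"entry_id": "e001", "date": "2017-05-14",
         "fragments": ["p03_a.png"], "text": "Saw 4 hikers."},
    ]
    print(render_book(demo))
```

A plain HTML page like this is already viewable on the web and on phones, and it seems a reasonable intermediate format to feed into an e-book conversion tool, though whether that conversion preserves the look-and-feel well enough is something I would expect to need testing.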
My hardware:
Figure 3 is a flatbed-scanner image of the particularly challenging page mentioned above, produced with xsane. Once I got the hang of making small adjustments to the position of the slightly-too-big-for-the-scanning-area book without losing anything valuable off the top/bottom margins, the scanning took about 8 minutes per 10 pages. Given the infrequency of this job & the small number of pages to scan, as well as having a zero-dollar budget for the project, I'm not really interested in optimizing this at this time, so I'll stick with the hardware I have and want to focus mainly on the post-processing workflow.
My software:
Software I'm using now, or have used in the past, for this kind of work:
- Debian 9 "stretch".
- xsane - the most efficient way to do a batch of numbered page images with exactly the level of control I require
- digikam - my photo management software (using this more for trail photo documentation than this project, but I also have an album of the scans in digikam)
- gimp - in the past, I've used this for photo & scan editing, but these days I'm doing more with digikam's image editor, which is a bit simpler for routine photo editing jobs
- gocr / tesseract - not very useful for this job, as noted above (though I've used it before to scan a book for a blind friend)
- calibre - I've used this before for certain kinds of scripted epub processing; no idea if it will be any help on this project
- evince - my preferred PDF viewer
- libreoffice - Calc for the spreadsheet
Outcomes:
I'm after more than just a single work product as the outcome, here. I want a repeatable process. I want something that will not just help me to produce this tiny book, but to guide "future me" and others in my organization in making next year's edition, and for others undertaking similar work to revise & improve upon. I did a cursory scan of the web to try to find something like that I could follow, and after an hour or so of effort, the most promising lead was this website.
I spent some time browsing the forums, especially the HOWTOs, and after satisfying myself that nothing quite matched, I started drafting this post. In spite of not having found any similar stories in the forums (yet), it looks like many of you may have already covered at least some of the ground I'll need to cover soon. I'd like to pull together from your collective experiences whatever I can that would aid in my success. Any pointers to articles / threads here I may have overlooked, guides here or elsewhere that may have already been written, or comments on any of the ideas outlined above would be greatly appreciated!