100-page handwritten guestbook workflow?


Ben Armstrong
Posts: 5
Joined: 17 Jun 2018, 07:20
E-book readers owned: Moon+Reader Pro (Android); Kindle 7
Number of books owned: 100
Country: Canada

100-page handwritten guestbook workflow?

Post by Ben Armstrong »

My first book scanning project this year is a very post-processing-heavy job (and therefore no special hardware needed): take a wilderness trail's 100-page guestbook, scan it, catalog the data, and pull it all together into an e-book.

Help!

Do you have a similar post-processing-heavy small-batch workflow? In the plan outlined below, have I underestimated how much work will be required, and/or am I in danger of burdening my organization with a project no one will want to continue after me? What cross-platform, open-source software would best suit the project described below?

The guestbook:

The guestbook is a waterproof, top-bound booklet housed with writing implements in an enclosure at a loop junction & popular resting place near the start of the trail system. We provide trail users with suggestions for things to put in their entries that we might find helpful, but the rest is up to them. So, in among the more mundane and workaday tallies of number of hikers encountered, off-leash dogs, trash, and so forth, there are some really creative, intriguing, or just bizarre entries. More than just a conduit of information back to us from our users, it is a creative outlet our users seem to enjoy that is qualitatively different from the way they relate to us online. Because of this, I think it has the potential to be turned into something really great to share with the world, which I intend to explore through this project.

Figure 1:
IMG_20180617_101407_v1.jpg
The logbook front cover. Note that it is written on. It is desirable to preserve this in the dataset and e-book, too.

The e-book:

Earlier this year, we had even considered dropping the physical guestbook entirely, as it requires manpower to maintain & process it, and it was argued that people can submit comments more efficiently online. But I strongly defended keeping it, arguing successfully that without it we would almost certainly never hear from certain people who barely go near a computer, and that the actual work required to keep it up wouldn't be too burdensome. Besides which, these handwritten entries, poems, sketches, and the like put an appealingly human face on our efforts. Making it all into an e-book is something nice I'd like to do in order to connect with this diverse, eager, & observant community centered around our beloved trail. If it is well received, I'd like to make the process as painless as possible, so that I can hand it off to others once we have a process that is not burdensome.

Figure 2:
IMG_20180617_101514_v2.jpg
A representative page from the book, showing typical orientation and tidy, chronologically sequenced entries

Extracting the data:

I have just finished the first of the following intermediate work products, and am seeking advice from this community about tackling the rest using all open source software (on Linux, but other platforms are desirable to support):
  • The raw scans of page pairs: 51 scans including cover. Note that page orientation flips vertically on alternating pages. Finished! See fig. 3.
  • The scans sliced into pages with orientation corrected. See fig. 2 & 3.
  • The pages sliced into individual cropped fragments belonging to an entry, anticipating such problems as: non-rectangular entries, overlapping, continued on next page. See fig. 3 for a particularly challenging page.
  • A data set with all of the raw data, tabulated into one row per entry, with:
    • One or more cropped image fragments per entry.
    • Full text of entry. I anticipate no OCR will be possible, so this will all be hand entered from the scans.
    • Metadata, e.g. name(s) & # of people in party, date/time, hometown, loops/distance hiked, counts of people, dogs, trash, and whatever tags we feel are helpful, such as: 'poem', 'sketch', 'thanks', 'report', etc.
Figure 3:
page0003.png
A particularly challenging page, illustrating some of my anticipated problems.
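
The "one row per entry" data set described above could be sketched roughly as follows. This is only an illustration: the field names, the ';' separator for multi-valued cells, and the sample entry are all assumptions of mine, not a settled schema.

```python
import csv
import io
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Entry:
    """One guestbook entry; every field name here is an illustrative assumption."""
    entry_id: int
    fragments: List[str]                 # cropped image files, in reading order
    text: str                            # hand-transcribed full text
    written_on: Optional[date] = None    # None when the entry is undated
    names: str = ""
    party_size: Optional[int] = None
    hometown: str = ""
    tags: List[str] = field(default_factory=list)


COLUMNS = ["entry_id", "fragments", "text", "written_on",
           "names", "party_size", "hometown", "tags"]


def to_row(entry: Entry) -> dict:
    """Flatten an Entry into one spreadsheet row (lists are joined with ';')."""
    return {
        "entry_id": entry.entry_id,
        "fragments": ";".join(entry.fragments),
        "text": entry.text,
        "written_on": entry.written_on.isoformat() if entry.written_on else "",
        "names": entry.names,
        "party_size": "" if entry.party_size is None else entry.party_size,
        "hometown": entry.hometown,
        "tags": ";".join(entry.tags),
    }


def write_csv(entries, stream) -> None:
    """Write a header line plus one row per entry to an open text stream."""
    writer = csv.DictWriter(stream, fieldnames=COLUMNS)
    writer.writeheader()
    for entry in entries:
        writer.writerow(to_row(entry))


# Made-up sample entry whose text spans two cropped fragments.
sample = Entry(1, ["page0003_a.png", "page0003_b.png"],
               "Saw 4 hikers, 1 off-leash dog.", date(2018, 5, 20),
               tags=["report"])

buffer = io.StringIO()
write_csv([sample], buffer)
print(buffer.getvalue())
```

A CSV file like this opens directly in LibreOffice Calc, and keeping the fragment filenames in the row is what ties each transcription back to its image.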

Putting it all together:

Ultimately, I aim to assemble all of this into these final products:
  • A spreadsheet document containing the data set that can be used by our organization to help produce reports, etc.
  • An e-book.
The spreadsheet is pretty straightforward. The e-book will require some more thought as to layout & features. Here are some early ideas about what I'd like to see:
  • Preserve all, or substantially all of the image content from the guestbook.
  • Follow some reasonable ordering, which is probably not the original order of the entries, since they tend to get jumbled up as the book fills and people look for partial pages to write new entries on. Chronological ordering seems like a natural choice.
  • Suitable for viewing on the web & on mobile devices.
  • Searchable.
  • Accessible (i.e. screen-reader-friendly).
  • Preserve look-and-feel of the original guestbook.
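
As a rough sketch of the chronological ordering and a by-tag index, assuming each entry has already been catalogued with an id, its physical page, an optional date, and a list of tags (the dictionary keys and sample rows below are made up for illustration):

```python
from collections import defaultdict
from datetime import date


def chronological(entries):
    """Return entries sorted by date; undated ones keep book order at the end."""
    dated = [e for e in entries if e["date"] is not None]
    undated = [e for e in entries if e["date"] is None]
    # sorted() is stable, so same-day entries stay in their original book order.
    return sorted(dated, key=lambda e: e["date"]) + undated


def tag_index(entries):
    """Map each tag to the ids of the entries that carry it."""
    index = defaultdict(list)
    for e in entries:
        for tag in e["tags"]:
            index[tag].append(e["id"])
    return dict(index)


# Made-up sample rows, listed in the order they appear in the physical book.
book_order = [
    {"id": 1, "page": 3, "date": date(2018, 6, 2), "tags": ["report"]},
    {"id": 2, "page": 3, "date": None, "tags": ["sketch"]},
    {"id": 3, "page": 4, "date": date(2018, 5, 20), "tags": ["poem", "thanks"]},
]

for e in chronological(book_order):
    print(e["id"], e["page"], e["date"])
print(tag_index(book_order))
```

Putting undated entries at the end is just one possible policy; they could equally be slotted in next to their neighbours on the same physical page.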

My hardware:

Figure 3 is a flatbed-scanner image of that page, produced with xsane. Once I got the hang of making small adjustments to the position of the slightly-too-big-for-the-scanning-area book without losing anything valuable off the top/bottom margins, scanning took me about 8 minutes per 10 pages. Given the infrequency of this job & small number of pages to scan, as well as having zero dollars budget for the project, I'm not really interested in optimizing this at this time, so I'll stick with the hardware I have and want to focus mainly on the post-processing workflow.

My software:

So far, software I'm using, or have used in the past for this kind of work:
  • Debian 9 "stretch".
  • xsane - the most efficient way to do a batch of numbered page images with exactly the level of control I require
  • digikam - my photo management software (using this more for trail photo documentation than this project, but I also have an album of the scans in digikam)
  • gimp - in the past, I've used this for photo & scan editing, but these days I'm doing more with digikam's image editor, which is a bit simpler for routine photo editing jobs
  • gocr / tesseract - not very useful for this job, as noted above (though I've used it before to scan a book for a blind friend)
  • calibre - I've used this before for certain kinds of scripted epub processing; no idea if it will be any help on this project
  • evince - my preferred PDF viewer
  • libreoffice - Calc for the spreadsheet
Notably absent from this list is any software that would help assemble the images into a PDF (or other format, but considering how image-rich it is, I have reservations about anything else) e-book, along with full text of the entries, index by tag, etc. I'm also keeping in mind that not everyone runs Debian on their desktop, and also that I may have personal biases that won't be shared by people who follow in my footsteps, so it's good to have alternatives catering to those differing needs & desires.

Outcomes:

I'm after more than just a single work product as the outcome, here. I want a repeatable process. I want something that will not just help me to produce this tiny book, but to guide "future me" and others in my organization in making next year's edition, and for others undertaking similar work to revise & improve upon. I did a cursory scan of the web to try to find something like that I could follow, and after an hour or so of effort, the most promising lead was this website.

I spent some time browsing the forums, especially the HOWTOs, and after satisfying myself that nothing quite matched, I started drafting this post. In spite of not having found any similar stories in the forums (yet), it looks like many of you may have already covered at least some of the ground I'll need to cover soon. I'd like to pull together from your collective experiences whatever I can that would aid in my success. Any pointers to articles / threads here I may have overlooked, guides here or elsewhere that may have already been written, or comments on any of the ideas outlined above would be greatly appreciated!
BillGill
Posts: 139
Joined: 18 Dec 2016, 17:13
E-book readers owned: Calibre, FBReader
Number of books owned: 7000
Country: USA

Re: 100-page handwritten guestbook workflow?

Post by BillGill »

I don't have any real help for you. I just wanted to say that it sounds like a great project. As far as I can see you have the process well planned and you should be able to get a good output.

The biggest problem that I can see will be keeping the momentum going after your first book. Things will always come up that need to be done and you (or somebody) will have to put it off 'until I get caught up.'

Bill
dpc
Posts: 379
Joined: 01 Apr 2011, 18:05
Number of books owned: 0
Location: Issaquah, WA

Re: 100-page handwritten guestbook workflow?

Post by dpc »

If I was doing this I'd just save it as a PDF. Almost every device has a PDF viewer and it will allow the finished product to accurately represent the original.

Have you looked through the forums at Mobileread.com? This site is more about the hardware side of book scanning and producing a collection of quality images that can be used for creating a variety of digital documents, while the folks at Mobileread seem to focus on a variety of tools and processes that turn those images into ebooks (and others).
L.Willms
Posts: 134
Joined: 21 Sep 2016, 10:51
E-book readers owned: Tolino Shine
Country: Germany
Location: Frankfurt/Main, Germany

Re: 100-page handwritten guestbook workflow?

Post by L.Willms »

I am missing Scan Tailor in the list of your software. See the section on Scan Tailor in the "Software and processing" group of this forum.

This is very useful for de-skewing the images and finding the actual borders of the paper.

As to the form of publication, I allow myself to list some thoughts of mine:

1. Avoid Scan Tailor's offer to reduce the images to binary black and white, and keep the color. In gray scale images the shadows look awful, as one can see in the examples shown above. Color is quite probably a feature of various entries.

2. Keep the order of the pages, but add a secondary chain of links in chronological order, which might jump back and forth thru the physical pages. This can be done in PDF files, at least when using Adobe's Acrobat Pro.
Ben Armstrong

Re: 100-page handwritten guestbook workflow?

Post by Ben Armstrong »

BillGill wrote: 19 Jun 2018, 09:25 The biggest problem that I can see will be keeping the momentum going after your first book. Things will always come up that need to be done and you (or somebody) will have to put it off 'until I get caught up.'
I agree momentum may be an issue. I just picked up the 2nd book last week, for instance, and have yet to even start processing my scans from the first book. Then again, it has been a nice summer, diverting my energy and attention since scanning the 1st book mostly to outdoor pursuits, ignoring entirely the desk work that is better suited to the winter months ... and forgetting to check back here again after 2 days of no responses! Sorry about that ... I really do appreciate every response I got.

Thanks for your kind words of encouragement!
Last edited by Ben Armstrong on 19 Sep 2018, 17:10, edited 1 time in total.
Ben Armstrong

Re: 100-page handwritten guestbook workflow?

Post by Ben Armstrong »

L.Willms wrote: 25 Jun 2018, 03:33 I am missing Scan Tailor in the list of your software. See the section on Scan Tailor in the "Software and processing" group of this forum.

This is very useful for de-skewing the images and finding the actual borders of the paper.
I read about Scan Tailor and even downloaded and ran it, but it was not immediately obvious to me whether it was going to be much help. Perhaps if I spend a bit more time with it, I'll see some way it can help.
L.Willms wrote: As to the form of publication, I allow myself to list some thoughts of mine:

1. Avoid Scan Tailor's offer to reduce the images to binary black and white, and keep the color. In gray scale images the shadows look awful, as one can see in the examples shown above. Color is quite probably a feature of various entries.

2. Keep the order of the pages, but add a secondary chain of links in chronological order, which might jump back and forth thru the physical pages. This can be done in PDF files, at least when using Adobe's Acrobat Pro.
As for grey scale scans, I selected that in xsane during my scans, not as a post-processing step, and have given the physical book to a fellow board member for storage, so if I want to change that now, I'll need to get the book back.

I see what you mean about the shadows, though! My normal process for personal scans of printed material is to scan them gray scale, then boost contrast to clean up the shadows and "lint" from the page. So long as the illustrations don't suffer too much ... But perhaps I'll scan the 2nd book in color and compare. Then if I like the result better, I'll ask for the 1st book back again to rescan.
Last edited by Ben Armstrong on 19 Sep 2018, 17:02, edited 1 time in total.
Ben Armstrong

Re: 100-page handwritten guestbook workflow?

Post by Ben Armstrong »

dpc wrote: 19 Jun 2018, 13:58 If I was doing this I'd just save it as a PDF. Almost every device has a PDF viewer and it will allow the finished product to accurately represent the original.
Of course, that's the perfectly sane thing to do. But perhaps I'm not perfectly sane?
dpc wrote: Have you looked through the forums at Mobileread.com? This site is more about the hardware side of book scanning and producing a collection of quality images that can be used for creating a variety of digital documents, while the folks at Mobileread seem to focus on a variety of tools and processes that turn those images into ebooks (and others).
Yes, I'm quite familiar with mobileread.com, and I also noted the hardware slant here, and thought I might have chosen the wrong place to post even before I had submitted it. However, my book plans are so completely unconventional I felt my project idea was probably equally off the beaten path on both sites. I ultimately chose to post here because of the production focus overall, whereas mobileread.com seems to have more of a consumer focus. I thought I'd meet up with more people with experience thinking about how to design the processing pipeline, and the fact that I wanted to implement one largely in software was beside the point, since the physical challenges pretty much prevent most of the ebook scanning/conversion tricks you'd learn about on mobileread from being effective at all. So far, I've not been disappointed in my choice to post here.
zbgns
Posts: 61
Joined: 22 Dec 2016, 06:07
E-book readers owned: Tolino, Kindle
Number of books owned: 600
Country: Poland

Re: 100-page handwritten guestbook workflow?

Post by zbgns »

I also think that PDF is probably the best solution. When it comes to workflow and tools, my approach would probably be as follows:

1. Color options

I would suggest scanning in color rather than in grayscale. Flatbed scanners usually scan in color and in grayscale at the same speed (correct me if I'm wrong); only b&w is significantly faster. Even if there is no need to preserve color in the final images, they can be painlessly desaturated (or binarized), usually with a better result than when obtaining grayscale (or b&w) images directly from the scanner.

It is usually better to save images in a lossless format. So I would avoid JPEG and choose TIFF or PNG in the XSane output preferences. 300 DPI should be sufficient.

2. Batch processing of images

I find Scan Tailor superior for this task. In particular, the Scan Tailor Advanced fork (https://github.com/4lex4/scantailor-advanced) has features that seem useful for your project. I mean especially the color processing functions such as color segmentation, normalization, posterization and so on. These allow for reduction of background noise as well as improved readability. The output should be a group of pictures with consistent sizes, margins and so on.

In addition, I found this project: https://mzucker.github.io/2016/09/20/noteshrink.html, which is potentially useful for your project (however, I haven't tested it). Incredible things can also be done with ImageMagick, but some experience is necessary to get the desired results. There is also a GIMP extension (BIMP) that allows batch processing of images.

3. Sizes and DPI

Two parameters matter here: the number of pixels in width and height, and the DPI. 300 DPI seems to be the best choice. The number of pixels must correspond to the DPI: when the pixel count in each dimension is divided by the DPI, the result in inches should match the physical dimensions of the paper pages that were scanned.
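
The arithmetic can be written down as a small check. This is only a sketch: the 5 x 7 inch page size and the 0.1 inch tolerance are assumptions for illustration, since the real dimensions of the guestbook are not given in this thread.

```python
def physical_size_inches(width_px, height_px, dpi):
    """Pixels divided by DPI gives the physical size in inches."""
    return width_px / dpi, height_px / dpi


def expected_pixels(width_in, height_in, dpi):
    """Inches multiplied by DPI gives the pixel dimensions a scan should have."""
    return round(width_in * dpi), round(height_in * dpi)


def matches_paper(width_px, height_px, dpi, paper_in, tolerance_in=0.1):
    """True if the image, at the claimed DPI, is about the size of the paper."""
    w_in, h_in = physical_size_inches(width_px, height_px, dpi)
    return (abs(w_in - paper_in[0]) <= tolerance_in
            and abs(h_in - paper_in[1]) <= tolerance_in)


# Hypothetical 5 x 7 inch page scanned at 300 DPI.
print(expected_pixels(5, 7, 300))                # (1500, 2100)
print(physical_size_inches(1500, 2100, 300))     # (5.0, 7.0)
print(matches_paper(1500, 2100, 150, (5, 7)))    # False: 150 DPI implies 10 x 14 in
```

A mismatch usually means the DPI recorded in the file is wrong, or that the image was resized somewhere along the way.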

4. Image format and compression

As Scan Tailor's output files are in TIFF format, they are relatively large and not well suited in this form to being merged into a PDF. Thus they must be converted into some other format. I would say that there are only 3 real choices: JPEG, PNG and JPEG2000. JPEG is the most popular format and should pose no compatibility problems. However, it is lossy and adds its own compression artifacts. PNG gives better quality, but its compression is not as efficient.

For this reason I would try JPEG2000 as the best compromise between quality and file size. There are two programs under Linux that are able to perform real batch conversion of images into JPEG2000 (.jp2 extension) format: opj_compress and ImageMagick. The second one was not able to deal with JPEG2000 files by default on my system, and it took me a lot of time to force it to work, so I rely mainly on opj_compress. Details on usage are under the following link: https://github.com/uclouvain/openjpeg/wiki/DocJ2KCodec.
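
A batch conversion with opj_compress might be scripted along these lines. Treat it as a sketch under assumptions: the directory layout, the .tif extension and the compression ratio of 20 are placeholders to experiment with, not recommended values. By default the function only returns the commands it would run.

```python
import shutil
import subprocess
from pathlib import Path


def opj_command(src: Path, out_dir: Path, ratio: int = 20) -> list:
    """Build the opj_compress call for one image; -r sets the compression ratio."""
    dst = out_dir / (src.stem + ".jp2")
    return ["opj_compress", "-i", str(src), "-o", str(dst), "-r", str(ratio)]


def convert_all(in_dir: Path, out_dir: Path, ratio: int = 20, dry_run: bool = True):
    """Convert every .tif in in_dir; with dry_run=True only return the commands."""
    commands = [opj_command(p, out_dir, ratio)
                for p in sorted(in_dir.glob("*.tif"))]
    if not dry_run:
        if shutil.which("opj_compress") is None:
            raise RuntimeError("opj_compress not found on PATH")
        out_dir.mkdir(parents=True, exist_ok=True)
        for cmd in commands:
            subprocess.run(cmd, check=True)   # stop at the first failure
    return commands


# Show the command for one hypothetical Scan Tailor output file.
print(opj_command(Path("out") / "page0003.tif", Path("jp2")))
```

Calling convert_all(Path("out"), Path("jp2"), dry_run=False) would then run the real conversion, one file at a time.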

5. Combining pictures into PDF file

There is a program called img2pdf which I strongly recommend. JPEG, PNG and JPEG2000 images can be wrapped into a PDF with no recompression. As a result there is no loss in quality, and the process is very fast. The command (executed from the directory where the images are gathered) would be as follows:

    img2pdf -o output.pdf --imgsize 300dpix300dpi -i *.jp2

Details are under this link: https://gitlab.mister-muffin.de/josch/img2pdf.
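
One thing to watch with the *.jp2 glob is page order: the shell expands it lexicographically, which is fine for zero-padded names such as page0003 but would put page10 before page2. A possible way around that, sketched here with made-up filenames, is to sort the names "naturally" and pass them to img2pdf explicitly as positional arguments:

```python
import re


def natural_key(name: str):
    """Split a filename into text and integer chunks so 'page2' sorts before 'page10'."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


def img2pdf_command(files, output="output.pdf", dpi=300):
    """Build an img2pdf call with the images listed in natural page order."""
    ordered = sorted(files, key=natural_key)
    return ["img2pdf", "-o", output, "--imgsize", f"{dpi}dpix{dpi}dpi", *ordered]


# Made-up filenames without zero padding, to show the difference.
names = ["page10.jp2", "page2.jp2", "page1.jp2"]
print(sorted(names))                    # plain sort: page1, page10, page2
print(sorted(names, key=natural_key))   # natural sort: page1, page2, page10
print(img2pdf_command(names))
```

The resulting list can be handed to subprocess.run() as is; no shell globbing is involved, so the order is exactly the one shown.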

6. Bookmarks and metadata

I use the booky script for that (https://github.com/SiddharthPant/booky), especially for adding a table of contents to books in PDF format (even those professionally created and bought as e-books from official internet book stores). However, I imagine it could also be used to create more advanced things based on bookmarks. Metadata may be added using PDFMtEd or exiftool.

7. Publication on the Internet

A small optimization is possible when a PDF is to be published on the Internet: linearization (web optimization). A linearized PDF file shows its contents in a web browser before it is fully downloaded. This is especially useful for heavy PDF files containing raster images. It can be done using the qpdf tool with the --linearize option.
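
If the whole pipeline ends up scripted, the qpdf step could be wrapped like this. It is a minimal sketch, and the file names are placeholders:

```python
import subprocess


def linearize_command(src: str, dst: str) -> list:
    """qpdf reads src and writes a linearized (web-optimized) copy to dst."""
    return ["qpdf", "--linearize", src, dst]


def check_command(pdf: str) -> list:
    """Ask qpdf to verify the linearization data of an existing file."""
    return ["qpdf", "--check-linearization", pdf]


def linearize(src: str, dst: str) -> None:
    """Run both steps; raises CalledProcessError if qpdf reports a failure."""
    subprocess.run(linearize_command(src, dst), check=True)
    subprocess.run(check_command(dst), check=True)


print(linearize_command("guestbook.pdf", "guestbook-web.pdf"))
```

Writing to a separate output file keeps the original PDF untouched in case the result needs to be regenerated.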

8. Conclusions

It may look quite complex, but it is not as difficult or time-consuming as it may seem at the beginning. The advantage for me is that I'm able to control each step of the processing and can apply adjustments where necessary to get the desired result. It is also a fully Linux-based solution, although it can be performed under Windows as well (at least a big part of it). Not all of the software described is available in repositories, and some of the programs need to be built from source (e.g. Scan Tailor Advanced).

I think that the only real commercial alternative would be ABBYY FineReader, which is able to perform some steps in a more automatic way. As a bonus, it also implements the MRC compression method, which gives the best possible compression results (https://en.wikipedia.org/wiki/Mixed_raster_content). Adobe Acrobat (which I have access to and know relatively well) is rather useless when it comes to processing raster graphics, so it may be used to perform only step 5 and partly step 4.

Sorry for being quite late with my response. I started to write it a long time ago but lost it by accident and had to write it again from the beginning. I hope that you will be able to find something useful in it anyway.
Ben Armstrong

Re: 100-page handwritten guestbook workflow?

Post by Ben Armstrong »

Thanks so much for your detailed reply, zbgns. These all seem like reasonable suggestions, so I'll keep them in mind and start from the top with improving my scans when I tackle book 2, then go back and get book 1 to rescan properly.