My expectations are the following:
1. Searchable, so there must be OCR performed;
2. Table of contents allowing for easy navigation over the book;
3. Usability for TTS (text to speech);
4. Small filesize.
The epub format meets all these requirements and seems to be the best solution. However, the problem is that how long it takes to create a properly formatted epub depends on the specifics of each individual book. It may work mainly for fiction books with simple formatting and no footnotes/endnotes. In the case of scientific books (or non-fiction books in general) it may be too laborious. I do not want to spend more time on processing a book than on reading it. I have already described my approach to creating epub files. It is available under this link: viewtopic.php?f=3&t=3440#p20852
However, when it comes to more complex books (footnotes, endnotes, various styles, various fonts, symbols, tables etc.), I give up and am content with the pdf format. So I would like to describe my workflow, indicating also how much time and effort is necessary to successfully finalize the work. This is a Linux based workflow, but it may also be adapted to Windows machines. The software I use is open source, and Windows versions of all the key tools are available.
The description is based on actual work I did a few days ago. I captured 352 images of a book from 1984. The paper is yellow by now and the print quality does not meet current standards. It is not consistent, and there are “lighter” and “darker” pages and areas. However, it was my aim to base the test on a book that is not the easiest example. Below are all the steps I had to perform. I indicate how much time it took to finish each stage. I tried to measure it carefully, but in some cases it is an approximation only. In the case of command line programs I was able to measure it precisely (with the ‘time’ command), but as it takes some time to navigate between directories, paste commands, etc., I add 10 extra seconds in each case.
1. Transfer of images – 03:00
I used cable transfer and copied all files from my iPhone to my computer. The time is an approximation only (I forgot to start the stopwatch, but it probably took a little less than 3 minutes).
2. Check double/omitted pages – 02:30
I used gThumb to check every 10th page to verify whether they are in the proper order. It has a very convenient bottom panel with thumbnails. I can set ten thumbnails in a row, so every tenth page should end up in the same column. There were only two doubled pages, so this stage took only two and a half minutes.
3. Scan Tailor Advanced – 27:46 in total
Input pages look like this (I applied slightly higher compression to reduce the filesize in order to meet the filesize limit on this forum).
3.1 Loading images – 00:20
I chose the directory containing the images and set the DPI (466 in my case).
3.2 Fix orientation – 00:38
Odd pages had to be rotated 90 degrees clockwise, whereas even pages had to be rotated 90 degrees counterclockwise.
3.3 Split pages – 01:25
Fully automatic process.
3.4 Deskew – 02:02
Also fully automatic, no manual intervention was needed.
3.5 Select content – 10:07
This was the most laborious stage, as there are spots on the paper and Scan Tailor selected the content incorrectly on some pages, so I had to correct it. Scan Tailor can show pages ordered by increasing height or increasing width, which is a big help. Manual corrections took 07:34.
3.6 Margins – 07:00
I chose margins that fit the actual size of the pages. For the front cover and back cover I set no margins. As some pages are not filled with a full block of text (beginnings of chapters, ends of chapters and so on), some manual adjustments were necessary. Changing the order from ‘natural’ to ‘by increasing height’ also helps very much.
3.7 Output – 06:14
Resolution was set to 300 DPI. The front cover and back cover output is Color/Grayscale (I also checked ‘Equalize illumination’). The remaining pages with text were reduced to ‘black and white’. I did not change the threshold, but for this specific book I might reduce the thickness a little bit. I also did not touch ‘despeckling’, although it might be useful to despeckle a bit, as binarization did not eliminate all the spots, stains and similar unwanted artifacts.
This is a sample output page (converted from tif to png format, lossy compression, but the visual quality is similar).
4. Creating additional working directory – 00:10
I created an additional directory, named it ‘pdf’ and moved the front cover and back cover into it.
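In shell terms this boils down to something like the command below; the cover filenames are only an assumption, as the actual names depend on the Scan Tailor output:
Code: Select all
mkdir pdf && mv cover.tif bcover.tif pdf/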
5. OCR – 19:35
Tesseract is the only real option under Linux. It is an open source command line program and is under active development now. The newest version is 4.0.0 Release Candidate 4 and I recommend it, although it is not the final release. There are also Windows binaries available.
Tesseract is able to create two-layer pdf files. One layer contains the scanned images, whereas the second layer is invisible and contains the OCR-ed text. However, Tesseract is not capable of applying the most efficient compression to the images, namely the jbig2 algorithm, so it is necessary to do this compression in a separate step. At the current stage I am interested only in the text layer, so I need to execute the following command from the directory where the Scan Tailor black-and-white output pages are gathered (the color front cover and back cover have been moved to a separate directory):
Code: Select all
ls *.tif >output.txt && tesseract -l pol -c textonly_pdf=1 output.txt text pdf
The result is a 1.5 MB pdf file. It seems to be empty, as all pages are white, but there is invisible text that may be copied, and it is also searchable.
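If someone wants to quickly confirm that the invisible text layer is really there, it can be extracted with pdftotext (part of poppler-utils); this is just a sanity check and is not included in the timings above:
Code: Select all
pdftotext text.pdf - | less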
6. Compressing b&w images – 00:40
I mentioned I want to apply the most efficient compression available. Tesseract itself is only able to use CCITT Group 4 compression, which is less effective, so I use a jbig2 encoder. An open source implementation is available under this link:
https://github.com/agl/jbig2enc
The command, executed from the same directory as in the Tesseract case, is the following:
Code: Select all
jbig2 -s -p -v *.tif && pdf.py output >lossy.pdf
7. Joining two layers – 00:11
Now I need to join the two pdf files (one created by Tesseract and one by jbig2enc) into a single two-layer pdf file containing both images and text. I need pdftk for that. The command is the following:
Code: Select all
pdftk lossy.pdf multibackground text.pdf output jbig2lossyocr.pdf
For convenience, steps 5 to 7 may be combined into a single command line:
Code: Select all
ls *.tif >output.txt && tesseract -l pol -c textonly_pdf=1 output.txt text pdf && jbig2 -s -p -v *.tif && pdf.py output >lossy.pdf && pdftk lossy.pdf multibackground text.pdf output jbig2lossyocr.pdf && rm output.*
8. Covers – 00:15 (compression) + 00:10 (changing to pdf)
The front cover and back cover are images with full color preserved, and I want them to remain in color in the book. At the same time they must be compressed efficiently, as I do not want to increase the filesize of the final book. So I need to convert the covers from tif to the jpeg2000 format. I use the opj_compress command line program for that. Instructions on how to use it are under the following link: https://github.com/uclouvain/openjpeg/wiki/DocJ2KCodec.
I use the following command:
Code: Select all
opj_compress -r 200 -i cover.tif -o cover.jp2
It must be applied twice, once for the front cover and once for the back cover (all four cover commands are gathered together below).
Afterwards, both files need to be saved as separate pdf files. There is the img2pdf command line program, which is able to convert jpeg2000 files to pdf format with no recompression. The command (executed for the front cover and the back cover separately) should be similar to this:
Code: Select all
img2pdf --imgsize 300dpix300dpi -o cover.pdf cover.jp2
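Assuming the back cover image is named bcover.tif (the name is only my example; the actual filenames depend on the earlier steps), the whole covers sub-step consists of these four commands:
Code: Select all
opj_compress -r 200 -i cover.tif -o cover.jp2
opj_compress -r 200 -i bcover.tif -o bcover.jp2
img2pdf --imgsize 300dpix300dpi -o cover.pdf cover.jp2
img2pdf --imgsize 300dpix300dpi -o bcover.pdf bcover.jp2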
9. Joining the covers and the book
The front cover, the contents of the book and the back cover are now in three separate pdf files, and they need to be joined in the proper order. I used pdftk for that, with the following command:
Code: Select all
pdftk cover.pdf jbig2lossyocr.pdf bcover.pdf cat output book.pdf
10. Table of contents – over 07:00
The ‘book.pdf’ is almost ready; it lacks only the table of contents. I am able to insert it using the booky script: https://github.com/SiddharthPant/booky
There are also instructions there on how to use it.
First, it is necessary to prepare a text file containing the chapter titles and page numbers, separated by commas (an example is shown below). Usually I copy the OCR-ed text from the page containing the ToC to a text file and adjust it to the required format, using regular expressions if possible. Then it may be incorporated into the pdf file. I needed over 7 minutes to complete this stage, which is relatively long. However, in the case of a complex ToC, generating such a text file should be significantly faster than navigating through the whole book, selecting each chapter title and adding them one by one.
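To give an idea of the format, such a text file could look like this (the chapter titles below are invented just for illustration):
Code: Select all
Introduction, 5
Chapter 1. Beginnings, 9
Chapter 2. The main argument, 47
Notes, 301
Index, 344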
11. Metadata – 02:27
This is not necessary, but it is good to have the book title and so on in the metadata. PDFMtEd
https://github.com/glutanimate/PDFMtEd
helps me to do this. It is a graphical tool that makes the available metadata fields editable.
12. Summary
The whole work took over 1 hour and 40 minutes. In more detail:
- ‘Hardware’ part, covering taking all the photos, transferring the images to the computer and verifying that all images are in the proper order – 00:46:21
- Software processing – 00:56:15
A question may also be raised whether the OCR is of sufficiently good quality to use it for TTS, especially taking into consideration the lack of proofreading in any form. There are errors, of course, but not a very big number (let us say 1 or 2 per page on average). So I can listen to such an ‘audiobook’, and the errors that occur are not so disturbing and annoying as to make it pointless.
I am also planning to compare the above method with an Adobe Acrobat based solution.