My expectations are the following:
1. Searchable, so there must be OCR performed;
2. Table of contents allowing for easy navigation over the book;
3. Usability for TTS (text to speech);
4. Small filesize.
The epub format meets all these requirements and seems to be the best solution. However, the problem is that how long it takes to create a properly formatted epub depends on the specifics of each individual book. It may work mainly for fiction books with simple formatting and no footnotes/endnotes. In the case of scientific books (or non-fiction books in general) it may be too laborious. I do not want to spend more time on processing a book than on reading it. I have already described my approach to creating epub files. It is available under this link: viewtopic.php?f=3&t=3440#p20852
However, when it comes to more complex books (footnotes, endnotes, various styles, various fonts, symbols, tables etc.), I give up and am content with the pdf format. So I would like to describe my workflow, indicating also how much time and effort is necessary to successfully finalize the work. This is a Linux based workflow, but it may also be adapted to Windows machines. The software I use is open source, and Windows versions of all the key tools are available.
The description is based on actual work I did a few days ago. I captured 352 images of a book from 1984. The paper is yellow by now and the print quality does not meet current standards. It is not consistent, and there are “lighter” and “darker” pages and areas. However, it was my aim to base the test on a book that is not the easiest example. Below are all the steps I had to perform. I indicate how much time it took to finish each stage. I tried to measure it carefully, but in some cases it is an approximation only. In the case of command line programs I was able to measure it precisely (with the ‘time’ command), but as it takes some time to navigate between directories, paste commands, etc., I add 10 extra seconds in each case.
1. Transfer of images – 03:00
I used cable transfer and copied all files from my iPhone to my computer. The time is an approximation only (I forgot to start the stopwatch, but it probably took a little less than 3 minutes).
2. Check double/omitted pages – 02:30
I used gThumb to check every 10th page to verify whether they are in the proper order. It has a very convenient bottom panel with thumbnails. I can set ten thumbnails in a row, so every tenth page should end up in the same column. There were only two doubled pages, so this stage took only two and a half minutes.
3. Scan Tailor Advanced – 27:46 in total
Input pages look like this (I applied slightly higher compression to reduce the filesize in order to meet the filesize limit on this forum).
3.1 Loading images – 00:20
I chose the directory containing the images and set the DPI (466 in my case).
3.2 Fix orientation – 00:38
Odd pages had to be rotated 90 degrees clockwise, whereas even pages had to be rotated 90 degrees counterclockwise.
3.3 Split pages – 01:25
Fully automatic process.
3.4 Deskew – 02:02
Also fully automatic, no manual intervention was needed.
3.5 Select content – 10:07
This was the most laborious stage, as there are spots on the paper and Scan Tailor selected the content incorrectly on some pages, so I had to correct it. Scan Tailor can show pages ordered by increasing height or increasing width, which is a big help. Manual corrections took 07:34.
3.6 Margins – 07:00
I chose margins that fit the actual size of the pages. For the front cover and back cover I set no margins. As some pages are not filled with a full block of text (beginnings of chapters, ends of chapters and so on), some manual adjustments were necessary. Changing the order from ‘natural’ to ‘by increasing height’ also helps very much.
3.7 Output – 06:14
Resolution was set to 300 DPI. The front cover and back cover output is Color/Grayscale (I also checked ‘Equalize illumination’). The remaining pages with text were reduced to ‘black and white’. I did not change the threshold, but for this specific book I might reduce the thickness a little bit. I also did not touch ‘despeckling’, although it might be useful to despeckle a bit, as binarization did not eliminate all the spots, stains and similar unwanted artifacts.
This is a sample output page (converted from tif to png format, lossy compression, but the visual quality is similar).
4. Creating additional working directory – 00:10
I created an additional directory, named it ‘pdf’ and moved the front cover and back cover into it.
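In shell terms this boils down to something like the command below; the cover filenames are only an assumption, as the actual names depend on the Scan Tailor output:
Code: Select all
mkdir pdf && mv cover.tif bcover.tif pdf/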
5. OCR – 19:35
Tesseract is the only real option under Linux. It is an open source command line program and is under active development now. The newest version is 4.0.0 Release Candidate 4 and I recommend it, although it is not the final release. There are also Windows binaries available.
Tesseract is able to create two-layer pdf files. One layer contains the scanned images, whereas the second layer is invisible and contains the OCR-ed text. However, Tesseract is not capable of applying the most efficient compression to the images, namely the jbig2 algorithm, so it is necessary to do this compression in a separate step. At the current stage I am interested only in the text layer, so I need to execute the following command from the directory where the Scan Tailor black-and-white output pages are gathered (the color front cover and back cover have been moved to a separate directory):
Code: Select all
ls *.tif >output.txt && tesseract -l pol -c textonly_pdf=1 output.txt text pdf
The result is a 1.5 MB pdf file. It seems to be empty, as all pages are white, but there is invisible text that may be copied, and it is also searchable.
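If someone wants to quickly confirm that the invisible text layer is really there, it can be extracted with pdftotext (part of poppler-utils); this is just a sanity check and is not included in the timings above:
Code: Select all
pdftotext text.pdf - | less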
6. Compressing b&w images – 00:40
I mentioned I want to apply the most efficient compression available. Tesseract itself is only able to use CCITT Group 4 compression, which is less effective, so I use a jbig2 encoder. An open source implementation is available under this link:
https://github.com/agl/jbig2enc
The command, executed from the same directory as in the Tesseract case, is the following:
Code: Select all
jbig2 -s -p -v *.tif && pdf.py output >lossy.pdf
7. Joining two layers – 00:11
Now I need to join the two pdf files (one created by Tesseract and one by jbig2enc) into a single two-layer pdf file containing both images and text. I need pdftk for that. The command is the following:
Code: Select all
pdftk lossy.pdf multibackground text.pdf output jbig2lossyocr.pdf
For convenience, steps 5 to 7 may be combined into a single command line:
Code: Select all
ls *.tif >output.txt && tesseract -l pol -c textonly_pdf=1 output.txt text pdf && jbig2 -s -p -v *.tif && pdf.py output >lossy.pdf && pdftk lossy.pdf multibackground text.pdf output jbig2lossyocr.pdf && rm output.*
8. Covers – 00:15 (compression) + 00:10 (changing to pdf)
The front cover and back cover are images with full color preserved, and I want them to remain in color in the book. At the same time they must be compressed efficiently, as I do not want to increase the filesize of the final book. So I need to convert the covers from tif to the jpeg2000 format. I use the opj_compress command line program for that. Instructions on how to use it are under the following link: https://github.com/uclouvain/openjpeg/wiki/DocJ2KCodec.
I use the following command:
Code: Select all
opj_compress -r 200 -i cover.tif -o cover.jp2
It must be applied twice, once for the front cover and once for the back cover (all four cover commands are gathered together below).
Afterwards, both files need to be saved as separate pdf files. There is the img2pdf command line program, which is able to convert jpeg2000 files to pdf format with no recompression. The command (executed for the front cover and the back cover separately) should be similar to this:
Code: Select all
img2pdf --imgsize 300dpix300dpi -o cover.pdf cover.jp2
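Assuming the back cover image is named bcover.tif (the name is only my example; the actual filenames depend on the earlier steps), the whole covers sub-step consists of these four commands:
Code: Select all
opj_compress -r 200 -i cover.tif -o cover.jp2
opj_compress -r 200 -i bcover.tif -o bcover.jp2
img2pdf --imgsize 300dpix300dpi -o cover.pdf cover.jp2
img2pdf --imgsize 300dpix300dpi -o bcover.pdf bcover.jp2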
9. Joining the covers and the book
The front cover, the contents of the book and the back cover are now in three separate pdf files, and they need to be joined in the proper order. I used pdftk for that, with the following command:
Code: Select all
pdftk cover.pdf jbig2lossyocr.pdf bcover.pdf cat output book.pdf
10. Table of contents – over 07:00
The ‘book.pdf’ is almost ready; it lacks only the table of contents. I am able to insert it using the booky script: https://github.com/SiddharthPant/booky
There are also instructions there on how to use it.
First, it is necessary to prepare a text file containing the chapter titles and page numbers, separated by commas (an example is shown below). Usually I copy the OCR-ed text from the page containing the ToC to a text file and adjust it to the required format, using regular expressions if possible. Then it may be incorporated into the pdf file. I needed over 7 minutes to complete this stage, which is relatively long. However, in the case of a complex ToC, generating such a text file should be significantly faster than navigating through the whole book, selecting each chapter title and adding them one by one.
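To give an idea of the format, such a text file could look like this (the chapter titles below are invented just for illustration):
Code: Select all
Introduction, 5
Chapter 1. Beginnings, 9
Chapter 2. The main argument, 47
Notes, 301
Index, 344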
11. Metadata – 02:27
This is not necessary, but it is good to have the book title and so on in the metadata. PDFMtEd
https://github.com/glutanimate/PDFMtEd
helps me to do this. It is a graphical tool that makes the available metadata fields editable.
12. Summary
The whole work took over 1 hour and 40 minutes. In more detail:
- ‘Hardware’ part, covering taking all the photos, transferring the images to the computer and verifying that all images are in the proper order – 00:46:21
- Software processing – 00:56:15
A question may also be raised whether the OCR is of sufficiently good quality to use it for TTS, especially taking into consideration the lack of proofreading in any form. There are errors, of course, but not a very big number (let us say 1 or 2 per page on average). So I can listen to such an ‘audiobook’, and the errors that occur are not so disturbing and annoying as to make it pointless.
I am also planning to compare the above method with an Adobe Acrobat based solution.