How to convert a book to searchable pdf using open source software

Share your software workflow. Write up your tips and tricks on how to scan, digitize, OCR, and bind ebooks.

Moderator: peterZ

zbgns
Posts: 61
Joined: 22 Dec 2016, 06:07
E-book readers owned: Tolino, Kindle
Number of books owned: 600
Country: Poland

Re: How to convert a book to searchable pdf using open source software

Post by zbgns »

I performed a small comparison of the free method with the Adobe Acrobat (Pro DC 15) output. It was much trickier than I had expected. The comparison is based on the same book I previously scanned and processed. This is what I obtained:

1. Speed
_____________ Adobe Acrobat ____ Open software
Combining ____ 00:01:50 ________ 00:00:30
Optimization __ 00:00:30 ________ -
OCR __________ 00:06:59 ________ 00:19:35
ToC __________ 00:06:27 ________ 00:07:18
Metadata _____ 00:02:27 ________ 00:02:27
Total ________ 00:18:13 ________ 00:29:50

Acrobat's OCR was faster and this made the difference, although it partly results from the more powerful hardware (i5-6300U, 16 GB RAM vs. i5-4210U, 8 GB RAM). As regards creating the ToC (15 chapters), two different methods were used. In Acrobat I had to go to each page with a chapter heading, select it and add the bookmarks one by one. The free software method required preparing an input text file, so I needed to copy the contents of the page containing the ToC and rework it into the required format. The Acrobat method turned out to be faster, but with a bigger ToC I would expect the opposite result.
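For reference, the exact format of that input text file is described in the earlier post with the full workflow (not quoted here). If, for example, the bookmarks and metadata are applied with pdftk's update_info (pdftk already appears in the pipeline used in this thread), such a file looks roughly like this (titles and page numbers below are made up):

Code: Select all

InfoBegin
InfoKey: Title
InfoValue: Example Book Title
BookmarkBegin
BookmarkTitle: Chapter 1
BookmarkLevel: 1
BookmarkPageNumber: 9
BookmarkBegin
BookmarkTitle: Chapter 2
BookmarkLevel: 1
BookmarkPageNumber: 27
It would then be applied with something like: pdftk book.pdf update_info toc.txt output book-with-toc.pdf (filenames are placeholders).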

2. File sizes

I compared file sizes in two categories: the efficiency of the jbig2 lossy and lossless modes.
a) jbig2 lossy

Open source method: 5.0 MB (3.2 MB without OCR layer);
Acrobat: 5.9 MB (4.0 MB without OCR layer).

It is worth mentioning that Acrobat also compressed the covers using the MRC method (the front and back covers were each assembled from 3 images of different DPI to increase the level of compression). As a result, the front cover compressed by Acrobat is only 24.7 kB in size vs. 62.3 kB in the case of the free software.

b) jbig2 lossless

Open source method: 13.6 MB
Acrobat: 14.4 MB

Acrobat is capable of recreating fonts on the basis of the scanned document and replacing the raster text with synthesized vector fonts.
In the past this method was called ClearScan; in Acrobat DC Pro 15 it was renamed to "Editable text and images". I applied it to the book as well for comparison. The document I obtained is 13.7 MB, which surprised me, as I had expected a much smaller size. This is due to the large number of different fonts embedded in the file.

3. OCR quality

In order to compare the number of OCR errors, I copied 10 pages of OCR-ed text created by Acrobat and Tesseract into a word processor and ran an automatic comparison, so all differences are directly visible:
[attachment: b.png]
The sample was randomly chosen (pages 164–174, 3937 words, 26074 characters). I do not know whether it is fully representative of the whole book, but it seems a quite typical part. Afterwards I counted the errors made by Acrobat and Tesseract and classified them into categories, as some are more and some less inconvenient and annoying.

a) wrong letters – A: 12; T: 5
b) added or omitted diacritics – A: 2; T: 6
c) misrecognized superscripts – A: 4; T: 9
d) joined words – A: 0; T: 8
I ignored punctuation errors.

My conclusion is that Tesseract made a larger number of mistakes (28 in total vs. 18), but they are less "critical" than those made by Acrobat.

As I use books in pdf format as a source for TTS, I also compared how they are read by Moon+ Reader Pro + the Ivona TTS engine on my Android phone. The main problem was erroneous additional paragraph breaks in random places, which make listening less fluent and comfortable. In the sample part the free software produced 21 such false breaks, whereas Acrobat inserted 48 of them, a significant difference. I also checked how the ClearScan document behaves. The false breaks disappeared, but additional spaces in the middle of words showed up. I counted 58 such unwanted spaces in the sample and they are definitely more annoying than the mentioned paragraph breaks.
babul
Posts: 3
Joined: 14 Jul 2018, 20:17
Number of books owned: 0
Country: Poland

Re: How to convert a book to searchable pdf using open source software

Post by babul »

Hey, great guide!

I have a question though: is it possible to somehow correct OCR output from Tesseract other than by editing the pdf in Acrobat (I don't feel like paying monthly for that)? I've been trying to do OCR with gImageReader on Linux. It uses Tesseract and you can preview the txt file, but to actually correct it you need to edit the hOCR file directly in the application or in some text editor, which is kinda troublesome due to navigating between html tags, but it can be made easier with Vim or something.

If errors can happen in neatly prepared pdfs, then much more correcting is needed when you try to convert some typewriter stuff. I usually remake those in LaTeX, so copying plain text and correcting it is fine, but I wonder if you can correct it in another way than by editing the hOCR file.
zbgns
Posts: 61
Joined: 22 Dec 2016, 06:07
E-book readers owned: Tolino, Kindle
Number of books owned: 600
Country: Poland

Re: How to convert a book to searchable pdf using open source software

Post by zbgns »

babul wrote: 18 Mar 2019, 05:11 I have a question though: is it possible to somehow correct OCR output from Tesseract other than by editing the pdf in Acrobat (I don't feel like paying monthly for that)? I've been trying to do OCR with gImageReader on Linux. It uses Tesseract and you can preview the txt file, but to actually correct it you need to edit the hOCR file directly in the application or in some text editor, which is kinda troublesome due to navigating between html tags, but it can be made easier with Vim or something.
Please note that the method described does not rely on hOCR created by Tesseract (or gImageReader) and incorporated into a pdf as the 'text' layer. It is a kind of 'generic' text-only pdf that Tesseract is capable of creating. I'm using this mode instead of hOCR because it provides better (almost perfect) positioning of text. After merging, the text ends up in the right place under the graphics layer. This precision is not available when the hOCR-based method is used. But the most important issue for me is that the method described in this thread gives pdf files which work well as sources for TTS. hOCR-based pdfs are not useful for this at all (correct me if I'm wrong).
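For reference, a minimal sketch of that text-only approach (the language code and the name images.pdf are placeholders; images.pdf stands for an image-only pdf of the same pages, and the full pipeline with jbig2 compression appears further down in this thread):

Code: Select all

# OCR the listed page images into a pdf that contains only the (invisible) text layer,
# then put that text layer under the image-only pdf of the same pages
ls *.tif > output.txt
tesseract -l eng -c textonly_pdf=1 output.txt text pdf
pdftk images.pdf multibackground text.pdf output searchable.pdf
The overlay only lines up if the page order in output.txt matches the page order of images.pdf (see the later post about file naming).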
babul wrote: 18 Mar 2019, 05:11 If errors can happen in neatly prepared pdfs, then much more correcting is needed when you try to convert some typewriter stuff. I usually remake those in LaTeX, so copying plain text and correcting it is fine, but I wonder if you can correct it in another way than by editing the hOCR file.
When it comes to proofreading and typo correction, I would say it is hardly possible to perform this without tools like Acrobat. I'm not 100% sure, but it seems that e.g. Master PDF Editor or Qoppa PDF Studio can also do this. They are cheaper than Acrobat and have Linux versions. The hardcore option is to decompress the text-only Tesseract output and edit the source code of such a pdf. I tried it only once (with a good result, by the way, as it was a batch change of one sequence of letters that was incorrectly recognized), since it was too extreme for me. I'm able to accept some amount of OCR errors in pdfs, as full proofreading is very time consuming and isn't worth the effort. Tesseract is a very good OCR tool, so if everything is done correctly (some experience is necessary) recognition errors are relatively rare and not very annoying.
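One possible way to get an editable source for that hardcore route is a sketch with qpdf (just one tool that can rewrite a pdf into an uncompressed, editable form; the filenames are placeholders):

Code: Select all

# rewrite the pdf in QDF mode so its content streams are uncompressed and human-readable
qpdf --qdf --object-streams=disable text.pdf editable.pdf
# edit editable.pdf in a text editor (e.g. batch-replace a systematically misrecognized string),
# then repair stream lengths after hand-editing and write a normal, compressed pdf again
fix-qdf editable.pdf > edited.pdf
qpdf edited.pdf text-fixed.pdf
The corrected text-fixed.pdf would then go through the pdftk multibackground step again.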

However, if you need a perfectly looking document, you may try to take the plain text and recreate the layout in e.g. MS Word or LO Writer, also correcting typos on this occasion. Afterwards you may save it as a pdf or any other kind of file (e.g. epub). I tried this as well but find it too laborious under normal conditions.
i8dcI32QWRfiwVj
Posts: 12
Joined: 26 Jul 2018, 09:28
Number of books owned: 0
Country: Germany

Re: How to convert a book to searchable pdf using open source software

Post by i8dcI32QWRfiwVj »

dtic wrote: 26 Oct 2018, 12:34
zbgns wrote: 25 Oct 2018, 11:24
dtic wrote: 25 Oct 2018, 04:28 First crop all images to include only the book page. My https://github.com/nod5/BookCrop is a quick but rough Windows tool for that but there are other methods out there.
Also Scan Tailor Advanced offers this crop function - 'Page Box' at the 'Select Content' stage. Unfortunately this is not able to fully solve the problem. First, the position of pages changes as I try to keep the book in the middle (lenses in phone cameras produce less distortion in the middle than at the edges; it is also due to the construction of my scanner). The second issue is that there are usually very small mistakes, like an omitted page number, or a selected content area that is only a bit bigger than necessary. It is difficult to eliminate this problem by cropping pages.
Yes, BookCrop only works well if all pages are at roughly the same position in the photos. One solution is to have some part of the scanner setup hold the book cover/spine in a fixed position when shooting. Otherwise some other, smarter tool that reliably detects the page edges in the photo is needed. I use a python/OpenCV script for that situation which produces around one error per 200 pages, but it is very sensitive to the specific lighting, background and scanner setup, so I haven't released it. I hope someone much better at OpenCV than me will one day release a general tool that can very reliably crop whole pages from a wide range of book page photos.
zbgns wrote: 25 Oct 2018, 11:24 I'm aware of the Scan Tailor CLI mode, however I didn't even test it. The main reason is that I like to have control over the process and check whether there are no errors, even if it takes some time.
If you do find a way to first successfully crop all whole pages, then in my experience there won't ever be any errors at all when you then use Scan Tailor Enhanced CLI processing simply to binarize images containing only text or black-and-white graphs. Besides, you can just wait for all pages to finish and then review the thumbnails and redo only those images that have errors, if any.
Does anyone know of useful software for cropping scans that runs on Mac OS?
zbgns
Posts: 61
Joined: 22 Dec 2016, 06:07
E-book readers owned: Tolino, Kindle
Number of books owned: 600
Country: Poland

Re: How to convert a book to searchable pdf using open source software

Post by zbgns »

i8dcI32QWRfiwVj wrote: 05 Aug 2019, 11:57 Does anyone know of useful software for cropping scans that runs on Mac OS?
I think this thread more or less answers your question: viewtopic.php?f=24&p=21785&sid=0ab34fa7 ... f21#p21785

In short, I would recommend Scan Tailor Advanced. It is able to crop scans and runs on Mac OS.
i8dcI32QWRfiwVj
Posts: 12
Joined: 26 Jul 2018, 09:28
Number of books owned: 0
Country: Germany

Re: How to convert a book to searchable pdf using open source software

Post by i8dcI32QWRfiwVj »

zbgns wrote: 05 Aug 2019, 15:42
i8dcI32QWRfiwVj wrote: 05 Aug 2019, 11:57 Does anyone know of useful software for cropping scans that runs on Mac OS?
I think this thread more or less answers your question: viewtopic.php?f=24&p=21785&sid=0ab34fa7 ... f21#p21785

In short, I would recommend Scan Tailor Advanced. It is able to crop scans and runs on Mac OS.
Thanks! I am already using Scantailor for cropping, but I am looking for software that gets rid of excess white space around the text, and of fingers, to clean up my scans before I start working with Scantailor.
zbgns
Posts: 61
Joined: 22 Dec 2016, 06:07
E-book readers owned: Tolino, Kindle
Number of books owned: 600
Country: Poland

Re: How to convert a book to searchable pdf using open source software

Post by zbgns »

i8dcI32QWRfiwVj wrote: 06 Aug 2019, 07:41 Thanks! I am already using Scantailor for cropping, but I am looking for software that gets rid of excess white space around the text, and of fingers, to clean up my scans before I start working with Scantailor.
I use Scan Tailor Advanced for this, i.e. I manually select the area where STA then automatically looks for the desired content. It works very well provided that the pages are in the same position. I also find it more efficient to use one tool for this instead of various ones. But it depends on your specific needs and your workflow. If you need to crop pages before proceeding with Scan Tailor, you may give ImageMagick a try, especially if the pages are in a stable position in all pictures. Otherwise it may be necessary to look for something more advanced, like for example this: viewtopic.php?f=24&p=21785#p21785
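If the pages really do sit at roughly the same spot in every photo, a fixed-geometry crop with ImageMagick can be enough. A rough sketch (the geometry numbers and the .jpg extension are placeholders to adjust to your own material):

Code: Select all

# cut a 1600x2400 px window, offset 300 px from the left and 200 px from the top,
# out of every photo and write the results into the cropped/ subfolder
mkdir -p cropped
mogrify -path cropped -crop 1600x2400+300+200 +repage *.jpg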
i8dcI32QWRfiwVj
Posts: 12
Joined: 26 Jul 2018, 09:28
Number of books owned: 0
Country: Germany

Re: How to convert a book to searchable pdf using open source software

Post by i8dcI32QWRfiwVj »

zbgns wrote: 06 Aug 2019, 12:38
i8dcI32QWRfiwVj wrote: 06 Aug 2019, 07:41 Thanks! I am already using Scantailor for cropping, but I am looking for software that gets rid of excess white space around the text, and of fingers, to clean up my scans before I start working with Scantailor.
I use Scan Tailor Advanced for this, i.e. I manually select the area where STA then automatically looks for the desired content. It works very well provided that the pages are in the same position. I also find it more efficient to use one tool for this instead of various ones. But it depends on your specific needs and your workflow. If you need to crop pages before proceeding with Scan Tailor, you may give ImageMagick a try, especially if the pages are in a stable position in all pictures. Otherwise it may be necessary to look for something more advanced, like for example this: viewtopic.php?f=24&p=21785#p21785
True, STA is not too bad, and I wasn't aware of the 'sort pages by ...' option you mentioned in your first post. This really makes finding pages where text hasn't been properly recognised easier, thanks for that! Shall also give the AI option mentioned in the post you refer to a try should I find myself with some spare time on my hands.
zbgns
Posts: 61
Joined: 22 Dec 2016, 06:07
E-book readers owned: Tolino, Kindle
Number of books owned: 600
Country: Poland

Re: How to convert a book to searchable pdf using open source software

Post by zbgns »

zbgns wrote: 14 Nov 2018, 20:35 As I use books in pdf format as a source for TTS, I also compared how they are read by Moon+ Reader Pro + the Ivona TTS engine on my Android phone. The main problem was erroneous additional paragraph breaks in random places, which make listening less fluent and comfortable.
I need to correct my comments about incorrect paragraph breaks in OCR-ed pdf files created by Tesseract or Adobe Acrobat. The issue has nothing to do with Tesseract's pdf output - apparently these files are correctly formatted and in line with the pdf specification. The incorrect reflow of text extracted from pdf files results from the work of the interpreter that renders the text.

I recently found an Android app called Librera PRO that does a great job with OCR-ed pdf files, provides a TTS function and can join lines into sentences based on end-of-sentence marks like ".?!". There is still an issue with sentences that do not end with these marks (like titles, subtitles etc.), but I find this mode less problematic anyway. Moreover, Librera PRO can join a line at the end of one page with the beginning of the next page, so there is no interruption during TTS when a page is turned.

So now I prefer Librera PRO over Moon+ Reader Pro as my favorite e-book reader program (I also like that it is open source software). That doesn't mean, however, that I devalue Moon+ Reader Pro in any way – it does a good job, too.

Moreover, I would also like to describe a method of making Tesseract pdf files more TTS-ready. Let's say there is a book with a heading at the top of each page and a page number at the bottom.
[attachment: Capture.PNG]
I would like to have these elements omitted during TTS. To achieve this, it is possible to indicate the specific sections of the scanned images that should be recognized by Tesseract during the OCR process. Tesseract can read text files with the "uzn" extension which describe sections of a scanned image. The format looks like this:

Code: Select all

left top width height freetext
Details of this format can be found under the following link: https://github.com/OpenGreekAndLatin/gr ... uzn-format
So the simplest example would be e.g.:

Code: Select all

94  195  1512  1952
In order to create such a file it is necessary to measure the dimensions (in pixels) of a content box that covers the desired area of a page. This may be done using tools like GIMP or even MS Paint. Personally I use gThumb, as it is my default image browser. Its crop function gives the necessary information after selecting a block of text. Afterwards the information about the dimensions of the selection area needs to be written to a text file in the appropriate order.
[attachment: 1.png]
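For a single region the ".uzn" file can be written directly from the terminal, e.g. with the values from the example above:

Code: Select all

echo "94  195  1512  1952" > 1.uzn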
The name of the ".uzn" file must be the same as the name of the image file. As there are many pages in a scanned book, the ".uzn" file must be multiplied so as to obtain ".uzn" files with names corresponding to the source image names. This may be done in the terminal with the following commands:
1.

Code: Select all

ls *.tif | cut  -d "." -f 1 > list.txt
– it creates a text file "list.txt" containing the names of all files in the folder with the ".tif" extension, but without the extensions.
2.

Code: Select all

while read line; do cp 1.uzn "$line".uzn; done < list.txt
– this command sequentially copies the source ".uzn" file created previously ("1.uzn" in the example) and gives the copies the correct names with the ".uzn" extension.

After this each image file in the folder has its own “.uzn” file.

In the next step the images may be OCR-ed using Tesseract and the unwanted areas will be omitted, provided that a single-column segmentation mode is used. This means that the additional parameter "--psm 4" or "--psm 6" must be given. So the tesseract command must look like this:

Code: Select all

tesseract -l pol+fra --psm 4 -c textonly_pdf=1 output.txt text pdf
(where "output.txt" is a text file listing the images that need to be OCR-ed).

Of course, the above command may be incorporated into the shell script I described previously, so that, on the basis of the source images (Scan Tailor output) and two text files: "toc" (containing bookmarks and metadata) and "1.uzn" (containing the description of the recognition area(s)), the book in pdf format is created automatically.
[attachment: pdf.png]
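Putting it together, the automated run might look roughly like this (a sketch only: the language codes are placeholders, "jbig2enc_pdf.py" is the jbig2enc pdf helper script under the name used elsewhere in this thread, and the toc/metadata step is left out):

Code: Select all

# assumes the .tif pages and the matching .uzn files are in the current folder
ls *.tif > output.txt
tesseract -l pol+fra --psm 4 -c textonly_pdf=1 output.txt text pdf   # text-only OCR layer, restricted to the .uzn boxes
jbig2 -s -p -v *.tif                                                 # jbig2-compress the binarized page images
jbig2enc_pdf.py output > lossy.pdf                                   # wrap the jbig2 output into an image-only pdf
pdftk lossy.pdf multibackground text.pdf output book.pdf             # put the invisible text layer under the images
rm output.*                                                          # clean up temporary files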
This method has the following limitations:
1. The block of text where recognition is performed must be in the same place on each page, which means that processing with Scan Tailor needs to be done carefully.
2. Tesseract does not use automatic segmentation but relies on the ".uzn" files, so e.g. two-column text cannot be recognized properly unless the columns are directly indicated in the ".uzn" file.

By the way, I described the method with only one content area, but if necessary a larger number of recognition areas may be described in an ".uzn" file. Moreover, it is also possible to adapt this method to a case where, for example, even and odd pages differ in layout and must be treated separately. I tested this as well but haven't found any case where it would be useful in practice.
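As an illustration, a hypothetical ".uzn" describing two side-by-side columns (made-up coordinates, one "left top width height" line per region) would simply contain two lines:

Code: Select all

94   195   740  1952
860  195   746  1952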
i8dcI32QWRfiwVj
Posts: 12
Joined: 26 Jul 2018, 09:28
Number of books owned: 0
Country: Germany

Re: How to convert a book to searchable pdf using open source software

Post by i8dcI32QWRfiwVj »

Hi!

This might be a tesseract issue, but I thought I would ask my question here nonetheless. I've been using your workflow to convert a folder of tiffs into one PDF. I ran the following script on the folder:

Code: Select all

ls *.tif >output.txt && tesseract -l eng -c textonly_pdf=1 output.txt text pdf && jbig2 -s -p -v *.tif && jbig2enc_pdf.py output >lossy.pdf && pdftk lossy.pdf multibackground text.pdf output jbig2lossyocr.pdf && rm output.*
When tesseract processed the files, it did so in the following order:

Code: Select all

Tesseract Open Source OCR Engine v4.0.0 with Leptonica
Page 0 : 1-koe_2R.tif
Page 1 : 10-koe_1L.tif
Page 2 : 10-koe_2R.tif
Page 3 : 100-koe_1L.tif
Page 4 : 100-koe_2R.tif
Page 5 : 101-koe_1L.tif
Page 6 : 101-koe_2R.tif
Page 7 : 102-koe_1L.tif
Page 8 : 102-koe_2R.tif
Page 9 : 103-koe_1L.tif
Page 10 : 103-koe_2R.tif
Page 11 : 104-koe_1L.tif
Page 12 : 104-koe_2R.tif
Page 13 : 105-koe_1L.tif
Page 14 : 105-koe_2R.tif
Page 15 : 106-koe_1L.tif
... and so on. As you can imagine, this created a completely jumbled PDF file. Do you by any chance know how to remedy this?

Best,
x

EDIT: I found a solution myself, and it is embarrassingly simple. If the files are renamed with zero-padded page numbers, tesseract processes them all fine. For instance ...

Code: Select all

Tesseract Open Source OCR Engine v4.0.0 with Leptonica
Page 0 : 00001-koe.tif
Page 1 : 00002-koe.tif
Page 2 : 00003-koe.tif
Page 3 : 00004-koe.tif
Page 4 : 00005-koe.tif
Page 5 : 00006-koe.tif
Page 6 : 00007-koe.tif
Page 7 : 00008-koe.tif
Detected 327 diacritics
Page 8 : 00009-koe.tif
Detected 294 diacritics
Page 9 : 00010-koe.tif
Detected 248 diacritics
Page 10 : 00011-koe.tif
Detected 245 diacritics
Page 11 : 00012-koe.tif
Detected 422 diacritics
Page 12 : 00013-koe.tif
Page 13 : 00014-koe.tif
... and so on.
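For anyone hitting the same problem, a rough sketch of one way to do the zero-padded rename in the shell (this assumes the original names have a numeric prefix before the first dash, like 1-koe_2R.tif; adjust the pattern and padding width to your own naming scheme). Renaming the files, rather than just sorting the list fed to tesseract, also keeps the jbig2 image layer in the same, corrected order, since jbig2 is run on the *.tif glob:

Code: Select all

# zero-pad the numeric prefix of every .tif so that lexicographic order equals page order
for f in *.tif; do
  num=${f%%-*}                              # numeric prefix before the first dash
  rest=${f#*-}                              # the rest of the filename
  mv -n "$f" "$(printf '%05d' "$((10#$num))")-$rest"
done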