How to convert a book to searchable pdf using open source software

zbgns
Posts: 61
Joined: 22 Dec 2016, 06:07
E-book readers owned: Tolino, Kindle
Number of books owned: 600
Country: Poland

Post by zbgns »

As you already wrote, it was a sorting problem caused by inconsistent file naming. BTW, this is not a Tesseract issue: Tesseract cannot process a batch of separate files directly, so a workaround is needed, namely a list of files in the right order for Tesseract to follow. This list was created with the 'ls' command, which lists all 'tif' files in the working directory in name order; the result was saved to a text file ('output.txt' in this case). You could potentially avoid renaming the files by changing the sort order, e.g. sorting by creation date (assuming the files were created in the correct sequence). Tesseract relies fully on this list and processes the files one by one in the given order.
Please also note that the scripts in this thread are very basic and prone to problems like this. They definitely do not follow good shell scripting practices, and I have not presented them as a complete working solution but rather as examples of my individual approach. A more sophisticated implementation would be required before offering this to others as a kind of 'universal' software.
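
As a minimal sketch of the ordering issue described above: the file names below are made up, and `sort -V` assumes a sort implementation with version sorting (e.g. GNU coreutils). It builds the list in natural numeric order instead of plain name order:

```shell
# Demo in a scratch directory; the file names are invented for illustration.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch page1.tif page2.tif page10.tif

# Plain name order puts page10 before page2.
ls *.tif

# Natural (version) order, assuming `sort -V` is available.
printf '%s\n' *.tif | sort -V > output.txt
cat output.txt

# Alternative, oldest modification time first:
#   ls -tr *.tif > output.txt
```

Whichever way the list is produced, it is worth looking at output.txt before running Tesseract on it.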
cosinus
Posts: 2
Joined: 15 Apr 2020, 12:12
E-book readers owned: kindle pw4
Number of books owned: 0
Country: Norway

Post by cosinus »

Thank you so much for posting this workflow.
It was really helpful.

I have a few modifications. :-)
First, I think the cover should be the same size as the other pages when scrolling through the pdf file. I had some problems because I scanned the covers at a higher resolution.

After creating the jbig2lossyocr.pdf file, I checked the resolution of the text pages and gave the cover the same width and ppi:

Code: Select all

$ pdfimages -list -l 2 jbig2lossyocr.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    1908  3110  gray    1   1  jbig2  no      1316  0   400   401 3899B 0.5%
   2     1 image    1908  3110  gray    1   1  jbig2  no      1319  0   400   401 4926B 0.7%

$ mogrify -units PixelsPerInch -density 400 -resize 1908x cover.tif

I have only created two books so far, and I was lucky with the chapter titles: they are named "Kapittel 1" and count up from there. I was able to create the toc this way, with the help of http://manpages.ubuntu.com/manpages/tru ... ine.1.html

Code: Select all

$ echo "0 1 Book Title" > book.toc
$ pdftotext book.pdf - | awk -v RS=$'\f' -v NAME="Kapittel" 'index($0, NAME) { printf "1 %d %s\n", NR, NAME }' | grep -n '^' | awk -F':' '{ print $2 " " $1 }' >> book.toc

$ head book.toc 
0 1 Book Title
1 2 Kapittel 1
1 15 Kapittel 2
1 23 Kapittel 3
1 32 Kapittel 4
1 40 Kapittel 5

$ pdfoutline book.pdf book.toc book-toc.pdf
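
The same kind of table of contents can probably be produced in a single awk pass. The sketch below is a variant under assumptions, not the method used above: it runs on a synthetic stand-in for the pdftotext output (pages separated by form feeds), and the names pages.txt and demo.toc are made up for the demonstration.

```shell
tmp=$(mktemp -d)

# Synthetic stand-in for `pdftotext book.pdf -`: four pages separated by form feeds.
printf 'Cover\fKapittel 1\ntext\fmore text\fKapittel 2\ntext\f' > "$tmp/pages.txt"

echo "0 1 Book Title" > "$tmp/demo.toc"
# NR is the page number (one record per page); n counts the chapters found so far.
awk -v RS='\f' -v NAME="Kapittel" \
    'index($0, NAME) { printf "1 %d %s %d\n", NR, NAME, ++n }' \
    "$tmp/pages.txt" >> "$tmp/demo.toc"
cat "$tmp/demo.toc"
```

Like the pipeline above, this matches every page that contains the word, so a page mentioning "Kapittel" in running text would also end up in the list.
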
I also think the page numbers should match between the scanned pages and the pdf file.
I solved this with http://jpdftweak.sourceforge.net/
I made the following modification. It sets the first text page to 7 and makes the pdf open full page.

Code: Select all

Page Numbers tab
1; i,ii; cover 1
2; 1,2,3;;7
Interaction tab
x Set viewer preferences
x Fit window to pdf

Metadata. I found a source for bib files here in Norway. For a more international source, I think this is a good starting point: https://davetang.org/muse/2014/06/30/co ... to-bibtex/, especially the OttoBib link.
I download the bib file and open it with JabRef from Nautilus. In JabRef, I right-click the bib entry and attach the pdf file. Then, in JabRef:
Tools - Write xmp metadata to pdf’s
I don't use JabRef for anything else.

I also found a way to create PDF/A-2B from the file with toc and metadata. It's an online tool.
https://www.pdftron.com/pdf-tools/pdfa-converter/
It worked well, and the JBIG2 and JPX images are kept, so the file size is about the same. It did not work when jpdftweak was the last step, so there may be some bugs in jpdftweak. I also think it was necessary to have the xmp metadata from JabRef attached to the pdf file.

Post by zbgns »

Thank you for your comments and for sharing the details of your workflow. Nice to see that someone found the thread I wrote useful.
cosinus wrote: 05 May 2020, 09:34 First I think the cover should be at the same size when scrolling the pdf file. I had some problem since I scanned the covers at higher resolutions.
You are right. The scripts are adapted to work with 300 DPI images. For a higher resolution it would be necessary to adjust them accordingly. It is also possible to use a higher resolution for the covers than for the remaining contents and still have the same page sizes in the pdf file. What matters is the combination of DPI and pixel count, which together determine the “physical” size (measured in centimeters or inches): the size in inches is the number of pixels divided by the DPI.
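
To make the relation concrete, here is a small sketch of the arithmetic. The page values (1908 px wide at 400 ppi) come from the pdfimages listing earlier in the thread; the 600 ppi cover resolution is only an assumed example.

```shell
# Width and resolution of a text page (from the pdfimages listing).
page_px=1908
page_ppi=400
# Assumed cover resolution, for illustration only.
cover_ppi=600

# Physical width in inches = pixels / ppi.
width_in=$(awk -v px="$page_px" -v ppi="$page_ppi" 'BEGIN { printf "%.2f", px / ppi }')

# Pixel width the cover needs at its own resolution to keep the same physical width
# (adding 0.5 before truncation rounds to the nearest pixel).
cover_px=$(awk -v px="$page_px" -v ppi="$page_ppi" -v c="$cover_ppi" \
    'BEGIN { printf "%d", px * c / ppi + 0.5 }')

echo "page width: $width_in in, cover width at $cover_ppi ppi: $cover_px px"
```

With these numbers the text page is 4.77 in wide, so a 600 ppi cover would have to be 2862 px wide to produce a page of the same width.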
cosinus wrote: 05 May 2020, 09:34 I was able to create the toc these way, with help of http://manpages.ubuntu.com/manpages/tru ... ine.1.html
Thanks for indicating this tool. I was not aware that this exists.
cosinus wrote: 05 May 2020, 09:34 In also think the page numbers should match between the scanned pages and the pdf file.
I solved this with http://jpdftweak.sourceforge.net/
Apart from jpdftweak, I have also used pagelabels-py (https://github.com/lovasoa/pagelabels-py) for this. But usually the simpler the better. It may be sufficient to remove some blank pages at the beginning so that the numbers printed on the pages match the sequential page numbers in the pdf file.
cosinus wrote: 05 May 2020, 09:34 Metadata. I found a source for bib files here in Norway. More internationally I think this is a good starting point. https://davetang.org/muse/2014/06/30/co ... to-bibtex/ and especial the OttoBib link.
I didn't even know about this possibility. I will try it, as there are some bibtex files that match my books.
cosinus wrote: 05 May 2020, 09:34 I also found a way to create PDF/A-2B from the created file with toc and metadata. It's online.
OCRmyPDF (https://github.com/jbarlow83/OCRmyPDF) seems to be able to convert to PDF/A format if you want to avoid online tools.
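
A minimal sketch of such a conversion, assuming OCRmyPDF is installed. The helper name to_pdfa and the file names in the usage line are hypothetical. I have not checked what this does to the image compression inside the file, so compare file sizes afterwards.

```shell
# Hypothetical helper: $1 = input pdf, $2 = output pdf.
to_pdfa() {
    # --skip-text: do not OCR pages that already have a text layer.
    # --output-type pdfa-2: ask for PDF/A-2b output.
    ocrmypdf --skip-text --output-type pdfa-2 "$1" "$2"
}

# Usage (file names are placeholders):
#   to_pdfa book-toc.pdf book-pdfa.pdf
```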

Post by cosinus »

zbgns wrote: 06 May 2020, 09:34 OCRmyPDF https://github.com/jbarlow83/OCRmyPDF seems to be able to convert to PDF/A format if you want to avoid online tools.
Thanks.
Yes, I have looked at OCRmyPDF, but it doesn't support JBIG2 or JPEG 2000 images in PDF/A.
I think it's Ghostscript that lacks that functionality.

Quote from the OCRmyPDF web page:

Code: Select all

PDFs containing JBIG2-encoded content will be converted to CCITT Group4 encoding, which has lower compression ratios, if Ghostscript PDF/A is enabled.
PDFs containing JPEG 2000-encoded content will be converted to JPEG encoding, which may introduce compression artifacts, if Ghostscript PDF/A is enabled.

Post by zbgns »

My bad. I was convinced that OCRmyPDF supports jbig2 but apparently this applies only to regular pdfs.
Noitaenola
Posts: 5
Joined: 02 Jun 2020, 13:29
Number of books owned: 0
Country: Rather

Post by Noitaenola »

I'm not sure about JPEG2000, but OCRmyPDF seems to support JBIG2 if you first install jbig2enc; see "Installing the JBIG2 encoder" in the OCRmyPDF documentation.
i8dcI32QWRfiwVj
Posts: 12
Joined: 26 Jul 2018, 09:28
Number of books owned: 0
Country: Germany

Post by i8dcI32QWRfiwVj »

To my knowledge, when you feed the script posted by zbgns (thanks again for sharing, by the way) with coloured *.tiff files, e.g. colour scans of book covers, these files will turn black in the conversion process. Is there any handy solution for adding coloured book covers of the same size as the text pages to the final book *.pdf?

Post by zbgns »

Actually, each book I have created using the described method has a color front cover and back cover. The contents between the covers are binarized (B&W). Color pictures may be added as well, but it would be necessary to convert them manually to an appropriate format, turn them into pdf, and then insert them into the final pdf file.
My workflow is as follows:

1. Save all images to a directory. The first and the last image are in color. The remaining ones may already be B&W or not; they will be binarized anyway.

2. Create a subfolder:

Code: Select all

mkdir -p pdf
3. The first and the last image are moved to the subfolder and renamed appropriately:

Code: Select all

mv "$(ls *.tif | head -n 1)" pdf/fcover.tif
mv "$(ls *.tif | tail -n 1)" pdf/bcover.tif
4. If the areas for OCR should be restricted (in order to omit headers and footers), I copy the uzn file for every image (it must be prepared earlier):

Code: Select all

for f in *.tif; do cp 1.uzn "${f%.tif}.uzn"; done
5. Tesseract OCR recognition (invisible text layer):

Code: Select all

ls *.tif > output.txt && tesseract -l pol+fra --psm 4 -c textonly_pdf=1 output.txt text pdf
6. Binarization and jbig2 compression of images (visible layer of the pdf file):

Code: Select all

jbig2 -s -p -v *.tif && pdf.py output > lossy.pdf
7. Joining the layers together in order to have one pdf file with the images and the text layer underneath:

Code: Select all

pdftk lossy.pdf multibackground text.pdf output jbig2lossyocr.pdf
qpdf may be used instead of pdftk (the syntax of the command is then completely different, of course).
Now we have the B&W contents of the book and can move on to the cover(s), where color needs to be preserved.

8. Go to the subfolder where the covers were moved (step 3):

Code: Select all

cd pdf
9. Apply jpeg2000 compression:

Code: Select all

opj_compress -r 200 -i fcover.tif -o fcover.jp2
opj_compress -r 200 -i bcover.tif -o bcover.jp2
Jpeg compression may be applied instead of jpeg2000 (then Imagemagick may be useful).

10. Wrap color images into pdf container (no matter jpeg2000 or jpeg, the tool and the method may be the same):

Code: Select all

img2pdf -o fcover.pdf --imgsize 300dpix300dpi fcover.jp2
img2pdf -o bcover.pdf --imgsize 300dpix300dpi bcover.jp2
The density, i.e. the DPI value, must be indicated ('--imgsize 300dpix300dpi' in the example above). If your DPI is different, e.g. a cover has 600 DPI, it must be adjusted accordingly. It is possible to have e.g. 300 DPI for the contents of a book and 600 DPI for the cover(s) at the same time, but the dimensions in pixels divided by the DPI must give the same 'physical' size. Otherwise the pdf will contain pages of different sizes, which looks terrible.

11. Move the contents of the book (created in step 7) to the subfolder where the covers are located:

Code: Select all

mv ../jbig2lossyocr.pdf ./
12. Join: front cover + contents of the book + back cover into one pdf file

Code: Select all

pdftk fcover.pdf jbig2lossyocr.pdf bcover.pdf cat output book.pdf
qpdf can do that as well if one wants replacement for pdftk.

As a result, the output pdf file is an almost complete book with a color front cover and back cover, and with the contents binarized (black letters on a white background) and OCRed. The next step would be to add indexes (toc) and metadata.
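
For those who prefer qpdf to pdftk, steps 7 and 12 could look roughly like the sketch below. The function names merge_layers and join_pdfs are made up for this example, and I am assuming a qpdf version that supports the --underlay option.

```shell
# Step 7 with qpdf: put the text-only pdf under the image pdf, page by page.
merge_layers() {
    # $1 = image layer, $2 = text layer, $3 = output file
    qpdf "$1" --underlay "$2" -- "$3"
}

# Step 12 with qpdf: concatenate several pdfs.
join_pdfs() {
    # $1 = output file, remaining arguments = input files in order
    out=$1
    shift
    qpdf --empty --pages "$@" -- "$out"
}

# Usage with the file names from the steps above:
#   merge_layers lossy.pdf text.pdf jbig2lossyocr.pdf
#   join_pdfs book.pdf fcover.pdf jbig2lossyocr.pdf bcover.pdf
```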

I gave examples using CLI tools because all the commands can be put into a script that runs all the steps in one pass. But of course other tools may be used instead, including GUI ones.
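
If the commands go into a script, the cover selection from step 3 may be worth hardening, since parsing `ls` output breaks on unusual file names. A possible sketch that lets the shell expand the glob instead (the scratch directory and file names are made up so that nothing real is touched):

```shell
# Scratch directory with invented file names.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch 001.tif 002.tif 003.tif 004.tif
mkdir -p pdf

# The glob expands in name order, like `ls` does.
set -- *.tif
first=$1
for last in "$@"; do :; done   # after the loop, $last holds the final match

mv -- "$first" pdf/fcover.tif
mv -- "$last" pdf/bcover.tif
ls pdf
```

This assumes there are at least two images; with a single file, the first and the last match are the same and the second mv fails.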

Post by i8dcI32QWRfiwVj »

This is really impressive. Thank you very much!
Merlijn
Posts: 3
Joined: 16 Jul 2021, 14:57
Number of books owned: 0
Country: Netherlands

Post by Merlijn »

Hope this post is not too little too late, but I wanted to remark that I've in the past year written tooling that does exactly this. It takes as input a stack of images and a hOCR file for the OCR (generated by tesseract), and produces a PDF, compressed with JPEG2000 images (with separate foreground and background images) and JBIG2 (or CCITT) compression for the foreground mask. It can easily lead to a 10x reduction if the input files are also JPEG2000 files, more otherwise. You can tweak the quality params if the quality is not acceptable.

There are some examples on how it does MRC, here: https://archive.org/~merlijn/projects/a ... c-examples

It's AGPLv3 and you can find it here: https://git.archive.org/merlijn/archive-pdf-tools (to create the combined hocr file, use `hocr-combine-stream` from https://git.archive.org/merlijn/archive-hocr-tools / https://archive.org/~merlijn/archive-ho ... ine-stream)

Cheers,
Merlijn

E: The PDFs should also pass PDF/A 3b and most of PDF/UA (checked with VeraPDF)