Scantools for Linux - convert to PDF with OCR

Krokkie
Posts: 2
Joined: 16 Jan 2020, 05:05
E-book readers owned: Nook
Number of books owned: 0
Country: UK

Scantools for Linux - convert to PDF with OCR

Post by Krokkie »

Some users in the community may be interested in producing OCR'd PDFs. There are already some solutions for this (such as pdfbeads or pdf.py), but how about adding OCR on the fly while converting an existing scan to PDF, or simply adding OCR to an existing PDF?

Scantools is a set of Linux PDF/A tools with the ability to perform OCR.

Scantools
https://cplx.vm.uni-freiburg.de/scantools/

Scantools

Scantools is a high-quality library and a matching set of command line programs for the handling and manipulation of scanned documents. The library is written in C++ and makes heavy use of Qt5.

At present, the library can convert image files to PDF/A. Files in JBIG2, JPEG and JPEG2000 format are directly included into the PDF, other files are compressed in a lossless manner. HOCR files, which are produced by optical character recognition programs such as ‘tesseract’, can be used to make the PDF file searchable. The resulting files comply with the ISO PDF/A standard for long-term archiving of digital documents and offer compression rates comparable to those of the DJVU file format.

There are currently three command line utilities.

image2pdf: converts images to a PDF/A compliant PDF file.
hocr2any: converts HOCR files to text, or renders them as raster graphics or PDF files.
ocrPDF: adds a text layer to a graphics-only PDF file, without re-encoding graphics data or otherwise modifying file content.
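
The HOCR files mentioned above have to come from an OCR engine first. As a rough sketch of that step only, here is a small Python wrapper around the tesseract command line, which writes an .hocr file when "hocr" is given as the output config. The helper names are made up for illustration, and the default language "eng" is an assumption. How the resulting file is then handed to image2pdf or hocr2any is not shown, so check their manual pages for that.

```python
import subprocess
from pathlib import Path


def tesseract_hocr_command(image: Path, lang: str = "eng") -> list[str]:
    """Build the tesseract call that writes <image stem>.hocr next to the image."""
    # tesseract takes an output *base* name and appends ".hocr" itself
    out_base = image.with_suffix("")
    return ["tesseract", str(image), str(out_base), "-l", lang, "hocr"]


def make_hocr(image: Path, lang: str = "eng") -> Path:
    """Run tesseract (must be installed) and return the expected hOCR path."""
    subprocess.run(tesseract_hocr_command(image, lang), check=True)
    return image.with_suffix(".hocr")
```

Calling make_hocr(Path("page1.png")) should leave a page1.hocr next to the image, provided tesseract and the language data are installed.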



Downloads here
https://software.opensuse.org/package/scantools
fjkraan
Posts: 1
Joined: 07 Dec 2019, 06:02
E-book readers owned: IRex, Kobo, BeBook, ...
Number of books owned: 0
Country: Netherlands
Contact:

Re: Scantools for Linux - convert to PDF with OCR

Post by fjkraan »

This is very interesting, and I will look into it when I have my D.I.Y. scanner upgrade completed. I have worked with tesseract and hocr before, but what I didn't find was a tool for correcting the OCR text before adding it to the PDF. Editing hocr files directly is possible, but not very convenient. Does scantools have any support for this?

Greetings,

Fred Jan
Noitaenola
Posts: 5
Joined: 02 Jun 2020, 13:29
Number of books owned: 0
Country: Rather

Re: Scantools for Linux - convert to PDF with OCR

Post by Noitaenola »

I'd also like a tool to easily edit hocr files. I haven't tried this yet, but it seems to ease at least some of the work: PoCoTo: The CIS OCR PostCorrectionTool.
Crispin
Posts: 1
Joined: 26 Jul 2022, 12:45
E-book readers owned: Kindle 7, jail-broken
Number of books owned: 66
Country: Serenia

Re: Scantools for Linux - convert to PDF with OCR

Post by Crispin »

SCANTOOLS - update

The main link above is no longer valid.
Here is the author's page:

Prof. Dr. Stefan Kebekus | Software - Scantools
https://cplx.vm.uni-freiburg.de/softwar ... #scantools

Scantools homepage
https://kebekus.gitlab.io/scantools/

DOWNLOADS:
Binaries for Linux (.deb, .rpm, .ymp):
https://kebekus.gitlab.io/scantools/download/
Packages for all common Linux distributions.
In addition to the binaries, the packages also include manual pages, development files and API documentation.

Linux snaps from the Snap Store:
https://snapcraft.io/scantools
The binaries exposed by the snap are called scantools.image2pdf, scantools.hocr2any and scantools.ocrPDF.
Manual pages are not included; use the option “--help” for usage information.
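
Since the snap exposes the programs under prefixed names, a script that should work with both the distribution packages and the snap has to try both spellings. A minimal Python sketch, assuming the plain name should be preferred (the function names are my own, not part of scantools):

```python
import shutil
from typing import Optional


def candidate_names(tool: str) -> list[str]:
    """Plain package name first, then the name exposed by the snap."""
    return [tool, f"scantools.{tool}"]


def find_tool(tool: str) -> Optional[str]:
    """Return the full path of the first variant found on PATH, else None."""
    for name in candidate_names(tool):
        path = shutil.which(name)
        if path:
            return path
    return None
```

For example, find_tool("ocrPDF") returns the path of ocrPDF or scantools.ocrPDF, whichever is installed, or None if neither is.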

Binaries for other platforms:
"I expect that scantools should compile on all modern platforms, desktop or mobile. Due to limited time, I support Linux only."

Source code:
https://gitlab.com/kebekus/scantools

- - -
INTERESTING FACTS (compiled from FAQ, man pages)
image2pdf - produces PDF/A compliant output files (PDF/A-2b, more precisely).
It seems that pdf.py, jbig2enc and JPEG2000 encoders have issues (e.g. they do not interpret XMP data correctly and do not pass PDF/A compliance tests).
image2pdf seems to correct or avoid these issues.
JPEG2000 .jp2 files should be converted* to .jpx for image2pdf, or just renamed from .jp2 to .jpx (but full PDF/A compliance is then lost).
This is because .jpx is a backward-compatible extension of .jp2 and is required for full compliance with the ISO/IEC 15444-2 standard (PDF/A-2b).
image2pdf supports only the Windows-1252 encoding, so adding OCR data currently works for western languages only.
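
To get a feeling for whether a given OCR text fits into that limit, one can check it against Python's cp1252 codec, which implements Windows-1252. This is only a rough pre-check of my own and says nothing about how image2pdf handles such characters internally:

```python
def unsupported_chars(text: str) -> list[str]:
    """Return the distinct characters that Windows-1252 cannot represent."""
    bad: list[str] = []
    for ch in text:
        try:
            ch.encode("cp1252")
        except UnicodeEncodeError:
            if ch not in bad:
                bad.append(ch)
    return bad
```

An empty list means the whole text can be encoded; German or French text usually passes, Greek or Cyrillic text does not.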

*Unfortunately, no open-source library supports this yet - only the (free) Kakadu binaries are able to produce a compliant .jpx image.
(see https://en.wikipedia.org/wiki/JPEG_2000#Libraries)
Chasys Draw IES (Viewer, Artist, Converter+batch) and FastStone Image Viewer (Converter+batch) have full support for JPEG2000 JPX.
Both are free Windows GUI utilities, but they work fine under Wine on Linux.
https://www.jpchacha.com/chasysdraw/index.php
https://www.faststone.org/
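
To tell a real .jpx from a merely renamed .jp2, one can look at the brand field in the file header: a JPEG 2000 file starts with a 12-byte signature box, followed by a file type box ('ftyp') whose brand is 'jp2 ' for plain JP2 and 'jpx ' for JPX. A small Python sketch (function names are my own; it assumes the 'ftyp' box comes right after the signature and does not validate anything else in the file):

```python
from typing import Optional

# 12-byte JPEG 2000 signature box found at the start of .jp2/.jpx files
JP2_SIGNATURE = bytes.fromhex("0000000c6a5020200d0a870a")


def jpeg2000_brand(header: bytes) -> Optional[str]:
    """Return the 4-character brand ('jp2 ', 'jpx ', ...) or None."""
    if len(header) < 24 or header[:12] != JP2_SIGNATURE:
        return None
    # bytes 12-15: box length, 16-19: box type, 20-23: brand
    if header[16:20] != b"ftyp":
        return None
    return header[20:24].decode("ascii", errors="replace")


def file_brand(path: str) -> Optional[str]:
    """Read only the first 24 bytes of a file and report its brand."""
    with open(path, "rb") as f:
        return jpeg2000_brand(f.read(24))
```

A file that was only renamed from .jp2 to .jpx will still report 'jp2 ' here.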

hocr2any - renders HOCR to PDF or to a sequence of images, or exports a text file (UTF-8).
The author claims "hocr2any produces output that is better than that of many of the competing programs. But judge for yourself!"
Not all features of HOCR are implemented because the "HOCR specification is a bit vague at times".
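
For a quick look at the text inside an HOCR file (for example to proofread the OCR result), the words can also be pulled out with a few lines of Python. This is not how hocr2any works internally, just a simplified sketch that relies on the 'ocr_line' and 'ocrx_word' classes as written by tesseract and ignores everything else in the format:

```python
from html.parser import HTMLParser


class HocrText(HTMLParser):
    """Collect the words of each 'ocr_line' element of an hOCR document."""

    def __init__(self) -> None:
        super().__init__()
        self.lines: list[list[str]] = []
        self._in_word = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "ocr_line" in classes:
            self.lines.append([])          # start a new text line
        elif "ocrx_word" in classes:
            if not self.lines:             # word outside of any line
                self.lines.append([])
            self._in_word = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_word = False

    def handle_data(self, data):
        if self._in_word and data.strip():
            self.lines[-1].append(data.strip())


def hocr_to_text(hocr: str) -> str:
    """Return the recognised text, one output line per hOCR line."""
    parser = HocrText()
    parser.feed(hocr)
    parser.close()
    return "\n".join(" ".join(words) for words in parser.lines if words)
```

Feeding it the content of an .hocr file read as UTF-8 gives plain text that is easy to compare against the scan.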

ocrPDF - adds a text layer to an image-only PDF file. The tesseract OCR engine is used for text detection.
Windows-1252 encoding only, so it works for western languages only.
> "Why not use the “pdf” output mode of tesseract directly? Why not use the program OCRmyPDF?"
"While both competing programs do a fine job in general, they both re-compress the image data found in the PDF file. For PDFs that contain JBIG2 encoded data, this will often lead to an increase in file size by a factor of about 10.
The program ocrPDF is able to deal with JBIG2 data and increases file size only moderately, to the extent needed to include the text data."
^ It seems that OCRmyPDF now supports JBIG2:
https://ocrmypdf.readthedocs.io/en/latest/jbig2.html