Scantools for Linux - convert to PDF with OCR
Posted: 16 Jan 2020, 06:20
It may interest some users in the community to produce OCR'd PDFs. There are already some solutions for this (such as pdfbeads or pdf.py), but how about adding OCR on the fly while converting an existing scan to PDF, or adding OCR to an existing PDF?
Scantools is a set of Linux PDF/A tools with the ability to perform OCR.
Scantools
https://cplx.vm.uni-freiburg.de/scantools/
Scantools
Scantools is a high-quality library and a matching set of command line programs for the handling and manipulation of scanned documents. The library is written in C++ and makes heavy use of Qt5.
At present, the library can convert image files to PDF/A. Files in JBIG2, JPEG and JPEG2000 format are included directly in the PDF; other files are compressed in a lossless manner. HOCR files, which are produced by optical character recognition programs such as ‘tesseract’, can be used to make the PDF file searchable. The resulting files comply with the ISO PDF/A standard for long-term archiving of digital documents and offer compression rates comparable to those of the DjVu file format.
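For anyone who has not produced an HOCR file before, here is a minimal Python sketch of that step. It only covers the tesseract side (the `tesseract IMAGE OUTPUTBASE -l LANG hocr` invocation) and does not call any Scantools program, since I have not checked their exact options. The function names and the file names `page1.png` / `page1` are made up for illustration, and it assumes tesseract with the chosen language data is installed.

```python
import shutil
import subprocess
from pathlib import Path


def hocr_command(image, out_base, lang="eng"):
    """Build the tesseract command line that writes <out_base>.hocr."""
    # "hocr" is tesseract's config name for HOCR output; "-l" selects the language.
    return ["tesseract", str(image), str(out_base), "-l", lang, "hocr"]


def hocr_output_path(out_base):
    """Return the file tesseract creates for a given output base name."""
    return Path(str(out_base) + ".hocr")


def make_hocr(image, out_base, lang="eng"):
    """Run tesseract on one image and return the path of the HOCR file."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on PATH")
    subprocess.run(hocr_command(image, out_base, lang), check=True)
    return hocr_output_path(out_base)


# Example (hypothetical file names, needs tesseract installed):
# hocr_file = make_hocr("page1.png", "page1")
```

The resulting `.hocr` file is the kind of input the description above refers to when it talks about making the PDF searchable.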
There are currently three command line utilities:
image2pdf: converts images to a PDF/A-compliant PDF file.
hocr2any: converts HOCR files to text, or renders them as raster graphics or PDF files.
ocrPDF: adds a text layer to a graphics-only PDF file, without re-encoding graphics data or otherwise modifying file content.
Downloads here
https://software.opensuse.org/package/scantools