Scan Tailor Advanced

zbgns · Post by **zbgns** » 18 Jun 2018, 04:21

antwoorden wrote: ↑15 Jun 2018, 03:20 But there's one change I'd advise against: moving the dewarping from the final to the third stage. At least there should be an ability to keep it in stage 3. The reason is that dewarping often crashes with many of my scanned books. When this happens in the 3rd stage, it crashes my whole project.

I noticed that the problem with dewarping in Scan Tailor Experimental occurs when you have blank pages in the project. The crash happens when STE tries to apply dewarping to such blank pages. So if blank pages are excluded from dewarping (no geometric distortion correction applied, or deskew only), there should be no crash.
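As a rough illustration of the idea of excluding blank pages, here is a minimal Python sketch. It is not Scan Tailor code: the function names, the thresholds, and the mode strings `"deskew-only"` and `"dewarp"` are all hypothetical, and a page is represented simply as rows of 8-bit grayscale values.

```python
from typing import Sequence


def dark_fraction(pixels: Sequence[Sequence[int]], dark_threshold: int = 128) -> float:
    """Return the share of pixels darker than dark_threshold (0 = black, 255 = white)."""
    total = sum(len(row) for row in pixels)
    if total == 0:
        return 0.0
    dark = sum(1 for row in pixels for value in row if value < dark_threshold)
    return dark / total


def is_blank_page(pixels: Sequence[Sequence[int]],
                  dark_threshold: int = 128,
                  max_dark_fraction: float = 0.002) -> bool:
    """Treat a page as blank when almost no pixel is dark (thresholds are guesses)."""
    return dark_fraction(pixels, dark_threshold) <= max_dark_fraction


def distortion_mode(pixels: Sequence[Sequence[int]]) -> str:
    """Pick a hypothetical per-page setting: skip dewarping on blank pages."""
    return "deskew-only" if is_blank_page(pixels) else "dewarp"


# A white 100x100 page, and the same page with 20 horizontal "text" lines.
blank = [[255] * 100 for _ in range(100)]
text = [[255] * 100 for _ in range(100)]
for y in range(10, 90, 4):
    for x in range(10, 90):
        text[y][x] = 0

print(distortion_mode(blank), distortion_mode(text))
```

Real scans have noise and dark borders, so a practical check would probably look only at the content area and use a tuned threshold.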

Thus I do not think that the problem you encountered with Scan Tailor Experimental is actually an argument against moving the dewarping function to the third stage. I am really in favor of this change in Scan Tailor Advanced.

L.Willms wrote: ↑16 Jun 2018, 03:12
No, that is not Scan Tailor's main purpose.

Its purpose is the production of clear and crisp images ready for OCR or other processes from original scans or photographs.

The unique functionality of Scan Tailor is the ability to prepare a set of scans or photos for further OCR; after processing they are also ready to be combined into a PDF or DjVu file. However, no additional functions, such as extracting pictures from PDFs or binding output pictures into a PDF, would make Scan Tailor better at this main function.

By the way, when it comes to extracting images from PDF files, the best tool I know is pdfimages from the Poppler set. It really extracts the embedded images, whereas Adobe Acrobat (at least the Pro DC 2015 version I have access to) renders them.

Combining images into a PDF file may be done using the jbig2 encoder (bi-tonal images) or img2pdf (color ones). The text layer from tesseract may easily be added using pdftk. So it is quite easy to get a two-layer PDF file with efficiently compressed images (JBIG2 for bi-tonal pictures of text and JPEG 2000 for color images) and a text layer underneath. It takes more time and effort than using FineReader or Acrobat, but the results are comparable or even better, as you have better control over each step of processing.
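For readers who want to script the extraction and recombination steps, the following Python sketch only builds the command lines and prints them; it does not run anything. The file names are made up, and the flags (`-all` for pdfimages, `-o` for img2pdf) are as I understand them from the tools' help output, so please check them against your installed versions. The jbig2 and pdftk steps are left out because their exact invocation depends on the setup.

```python
import shlex
from typing import Iterable, List


def pdfimages_command(pdf_path: str, out_prefix: str) -> List[str]:
    """Build a pdfimages call that writes embedded images in their native format."""
    return ["pdfimages", "-all", pdf_path, out_prefix]


def img2pdf_command(image_paths: Iterable[str], out_pdf: str) -> List[str]:
    """Build an img2pdf call that wraps the given images into one PDF."""
    return ["img2pdf", *image_paths, "-o", out_pdf]


# Hypothetical file names, for illustration only.
extract = pdfimages_command("book.pdf", "extracted/page")
combine = img2pdf_command(["out/0001.jpg", "out/0002.jpg"], "book-color.pdf")

for command in (extract, combine):
    # shlex.join quotes arguments so the line can be pasted into a shell.
    print(shlex.join(command))

# To actually run a command: subprocess.run(command, check=True)
```

Keeping the commands as argument lists avoids shell quoting problems with file names that contain spaces.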

L.Willms · Post by **L.Willms** » 25 Jun 2018, 03:04

zbgns wrote: ↑18 Jun 2018, 04:21 By the way, when it comes to extracting images from PDF files, the best tool I know is pdfimages from the Poppler set. It really extracts the embedded images, whereas Adobe Acrobat (at least the Pro DC 2015 version I have access to) renders them.

Ah, thanks. That is interesting, since I use Acrobat 8 Pro for this.

BTW, at the next Black Friday sales I might buy ABBYY FineReader 14, which works directly in image PDFs.

Sallen112 · Post by **Sallen112** » 26 Jun 2018, 10:18

zbgns wrote: ↑18 Jun 2018, 04:21
L.Willms wrote: ↑16 Jun 2018, 03:12
No, that is not Scan Tailor's main purpose.

Its purpose is the production of clear and crisp images ready for OCR or other processes from original scans or photographs.

The unique functionality of Scan Tailor is the ability to prepare a set of scans or photos for further OCR; after processing they are also ready to be combined into a PDF or DjVu file. However, no additional functions, such as extracting pictures from PDFs or binding output pictures into a PDF, would make Scan Tailor better at this main function.

By the way, when it comes to extracting images from PDF files, the best tool I know is pdfimages from the Poppler set. It really extracts the embedded images, whereas Adobe Acrobat (at least the Pro DC 2015 version I have access to) renders them.

Combining images into a PDF file may be done using the jbig2 encoder (bi-tonal images) or img2pdf (color ones). The text layer from tesseract may easily be added using pdftk. So it is quite easy to get a two-layer PDF file with efficiently compressed images (JBIG2 for bi-tonal pictures of text and JPEG 2000 for color images) and a text layer underneath. It takes more time and effort than using FineReader or Acrobat, but the results are comparable or even better, as you have better control over each step of processing.

I'm sorry, but I still do not see why we cannot have this feature implemented. Just because its main purpose is to process scans and pictures DOES NOT MEAN that extracting images from a PDF and putting them back into a PDF cannot be coded into Scantailor (nothing is off limits for open source). If there is some kind of limitation in the program that prevents a PDF extractor and inputter from being coded in, I would like to hear it, rather than just that it was not Scantailor's original goal (Scantailor Advanced already has more features than the original one). Adobe is already a very large program with tons of features in it. I am suggesting EXACTLY the opposite: why should we need third-party software for one task that I think Scantailor should now have built in? (It would only be a very minor feature in the program and optional to use.)

It even says on the Scantailor website here: http://scantailor.org/ that the goal IS TO prepare images into a PDF (or DJVU).

L.Willms · Post by **L.Willms** » 26 Jun 2018, 13:04

Sallen112 wrote: ↑26 Jun 2018, 10:18
I'm sorry, but I still do not see why we cannot have this feature implemented. Just because its main purpose is to process scans and pictures DOES NOT MEAN that extracting images from a PDF and putting them back into a PDF cannot be coded into Scantailor (nothing is off limits for open source).

Because there are dozens of programs available that do it in excellent ways.

I want the Scan Tailor developers to concentrate on the unique features of Scan Tailor instead of reproducing in Scan Tailor what other programs already perform excellently. As a completely new challenge for Scan Tailor, it could only be mediocre, and if improved, then at the cost of the unique features of Scan Tailor.

I don't want programs that do 100 different tasks in a mediocre way; I prefer to have 100 programs that each do their task in an excellent way.

cday · Post by **cday** » 26 Jun 2018, 13:09

You are asking for PDF input and output functions to be added to ScanTailor, fair enough, but meanwhile there is freeware software that can create a PDF file from a folder of images, and also extract images from a PDF file...

Examples are IrfanView (Windows) and XnView MP (cross-platform): both can create PDF files directly, but require the freeware Ghostscript to be installed in order to rasterise the images in existing PDF files at a specified DPI.

dtic · Post by **dtic** » 26 Jun 2018, 14:13

zbgns wrote: ↑18 Jun 2018, 04:21 By the way, when it comes to extracting images from PDF files, the best tool I know is pdfimages from the Poppler set.

I agree pdfimages in Poppler (https://poppler.freedesktop.org/) is useful. Does anyone know of a Windows version of the latest stable release, poppler-0.66.0?

While on the topic of PDF to images: in my experience some PDF files unpack to images with inverted black and white. To fix that, run GraphicsMagick's convert command on the images with the -negate parameter.
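To show what the negate step does, and how one might decide automatically whether an extracted image needs it, here is a small Python sketch on plain 8-bit grayscale values. The "mostly dark means inverted" rule and the function names are my own assumptions for illustration; on the command line the fix itself would be something like `gm convert in.png -negate out.png`.

```python
from typing import List

Gray = List[List[int]]


def negate(pixels: Gray) -> Gray:
    """Invert every 8-bit grayscale value, as a negate operation would."""
    return [[255 - value for value in row] for row in pixels]


def looks_inverted(pixels: Gray, dark_threshold: int = 128) -> bool:
    """Guess that a text page is inverted when most of its pixels are dark."""
    total = sum(len(row) for row in pixels)
    if total == 0:
        return False
    dark = sum(1 for row in pixels for value in row if value < dark_threshold)
    return dark / total > 0.5


def fix_polarity(pixels: Gray) -> Gray:
    """Negate only the images that appear to be white-on-black."""
    return negate(pixels) if looks_inverted(pixels) else pixels


inverted_page = [[0, 0, 0, 255], [0, 0, 0, 0]]          # white "ink" on black
normal_page = [[255, 255, 0, 255], [255, 255, 255, 255]]  # black ink on white

print(fix_polarity(inverted_page))
print(fix_polarity(normal_page))
```

The heuristic assumes pages that are mostly background, which holds for ordinary text but not for, say, a full-page dark photograph.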

Sallen112 wrote: ↑26 Jun 2018, 10:18 I'm sorry, but I still do not see why we cannot have this feature implemented

https://en.wikipedia.org/wiki/Feature_creep

L.Willms · Post by **L.Willms** » 26 Jun 2018, 15:11

dtic wrote: ↑26 Jun 2018, 14:13 https://en.wikipedia.org/wiki/Feature_creep

Thank you for that!

This reminded me of a classic on software engineering: "The Mythical Man-Month" by Frederick P. Brooks, Jr., published in 1975 by Addison-Wesley (ISBN 0-201-00650-2).

The author was a project manager for the IBM System/360 and for OS/360, its operating system. He begins a chapter warning against "The Second-System Effect" with a quote from Ovid:

Adde parvum parvo magnus acervus erit -- Add little to little and there will be a big pile.

I refrain from adding more quotes from the actual text.

Except this: "How does a project get to be a year late? -- One day at a time!"

0kelvin · Post by **0kelvin** » 26 Jun 2018, 17:42

Can Scantailor detect corrupted files? My HDD is corrupting images at random. When a project happens to have one corrupted image file, Scantailor overloads the HDD to the point that Windows almost completely freezes. I can't even load Task Manager to terminate Scantailor because it takes like 10 minutes to load.

zbgns · Post by **zbgns** » 26 Jun 2018, 17:45

dtic wrote: ↑26 Jun 2018, 14:13
While on the topic of PDF to images: in my experience some PDF files unpack to images with inverted black and white. To fix that, run GraphicsMagick's convert command on the images with the -negate parameter.

I guess this happens in case of MRC compressed pdf files.
https://en.wikipedia.org/wiki/Mixed_raster_content
Images with inverted colors (black background and white letters) obtained by unpacking images from PDFs are masks placed on top of color backgrounds. It is the most effective way to obtain crisp color text on a white background. Typically there is a JP2-compressed color background and a mask with JBIG2 compression. For such PDFs it may be better to render an image of the whole page instead of extracting the pictures from it. Otherwise you may get several pictures, each representing only a part of the input page, which are usually useless for further processing such as preparation for OCR.
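To make the layer idea concrete, here is a toy Python sketch of how a mask selects between a foreground and a background layer. It is a simplification under my own assumptions: all three layers have the same size here (real MRC files often store them at different resolutions), the mask uses 1 for "text" pixels, and the name `compose_mrc` is hypothetical.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Layer = List[List[Pixel]]


def compose_mrc(background: Layer, foreground: Layer, mask: List[List[int]]) -> Layer:
    """Build the visible page: mask bit 1 shows the foreground, 0 the background."""
    if not (len(background) == len(foreground) == len(mask)):
        raise ValueError("all layers must have the same height")
    page = []
    for bg_row, fg_row, mask_row in zip(background, foreground, mask):
        if not (len(bg_row) == len(fg_row) == len(mask_row)):
            raise ValueError("all layers must have the same width")
        page.append([fg if bit else bg
                     for bg, fg, bit in zip(bg_row, fg_row, mask_row)])
    return page


WHITE: Pixel = (255, 255, 255)
RED: Pixel = (200, 0, 0)

background = [[WHITE] * 3 for _ in range(2)]   # smooth paper/picture layer
foreground = [[RED] * 3 for _ in range(2)]     # text color layer
mask = [[0, 1, 0],
        [1, 1, 0]]                             # bi-tonal shape of the letters

page = compose_mrc(background, foreground, mask)
print(page)
```

This also shows why extracting the pieces separately is of little use: none of the three layers on its own looks like the page.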

Does anyone know of an open implementation of this MRC method that may be used for non-commercial projects? I guess there are only proprietary implementations so far. E.g. ABBYY FineReader is able to produce such MRC PDFs, which is a big advantage of this tool.

seldor · Post by **seldor** » 27 Jun 2018, 04:43

0kelvin wrote: ↑26 Jun 2018, 17:42 Can Scantailor detect corrupted files? My HDD is corrupting images at random. When a project happens to have one corrupted image file, Scantailor overloads the HDD to the point that Windows almost completely freezes. I can't even load Task Manager to terminate Scantailor because it takes like 10 minutes to load.

Simple: get a new HDD, and quickly, unless you like corrupted files. (Eventually an OS file will get corrupted, and then you'll not only need a new HDD but also have to reinstall the OS.)
When an OS tries to read a bad sector, it retries again and again after failing, hoping that one of the many attempts will succeed. This slows everything down and is nothing ScanTailor can fix, because the OS cannot deliver the bytes of the file. Maybe it would be able to detect a corrupted image afterwards, but that heavily depends on the image format and the library used.
But just get a new HDD (or better, an SSD).
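As an example of how format-dependent such a check is: PNG stores a CRC-32 for every chunk, so silent bit damage can be detected once the bytes have been read. The Python sketch below walks the chunks and compares checksums. It is only an illustration (the helper names are made up, and this is not how ScanTailor loads images), and it does nothing about the slow retries on bad sectors described above.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_is_intact(data: bytes) -> bool:
    """Check the PNG signature and every chunk CRC up to and including IEND."""
    if not data.startswith(PNG_SIGNATURE):
        return False
    pos = len(PNG_SIGNATURE)
    while pos + 12 <= len(data):            # 4 length + 4 type + 4 CRC at least
        (length,) = struct.unpack(">I", data[pos:pos + 4])
        chunk_type = data[pos + 4:pos + 8]
        end = pos + 8 + length
        if end + 4 > len(data):
            return False                    # chunk runs past the end: truncated
        body = data[pos + 8:end]
        (stored_crc,) = struct.unpack(">I", data[end:end + 4])
        # The PNG CRC covers the chunk type and the chunk data.
        if zlib.crc32(chunk_type + body) & 0xFFFFFFFF != stored_crc:
            return False
        if chunk_type == b"IEND":
            return True
        pos = end + 4
    return False                            # never reached IEND


def _chunk(chunk_type: bytes, body: bytes) -> bytes:
    """Serialize one PNG chunk with a correct CRC (helper for the demo)."""
    crc = zlib.crc32(chunk_type + body) & 0xFFFFFFFF
    return struct.pack(">I", len(body)) + chunk_type + body + struct.pack(">I", crc)


def tiny_png() -> bytes:
    """Build a valid 1x1 8-bit grayscale PNG in memory."""
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    idat = zlib.compress(b"\x00\xff")       # filter byte 0, one white pixel
    return (PNG_SIGNATURE + _chunk(b"IHDR", ihdr)
            + _chunk(b"IDAT", idat) + _chunk(b"IEND", b""))


good = tiny_png()
damaged = bytearray(good)
damaged[20] ^= 0xFF                         # flip one byte inside the IHDR data

print(png_is_intact(good), png_is_intact(bytes(damaged)), png_is_intact(good[:-5]))
```

Formats without per-chunk checksums would need a full decode attempt instead, which is why the answer depends so much on the format and library.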
