Scan Tailor

Scan Tailor specific announcements, releases, workflows, tips, etc. NO FEATURE REQUESTS IN THIS FORUM, please.

Re: Scan Tailor

Postby phaedrus » 23 Nov 2009, 22:03

Hi Tulon, FYI I have just tried running the images through Scantailor again, varying the DPI setting.

With the setting at 600 x 600 it would completely fail to select any content on around 20% or a little more of the images. At 300 x 300 it would select content from all the images (but still with the page number missing from something like 50%), so the latter DPI setting is much better. My feeling is that it's probably a fundamental contrast issue - these were photographs taken under standard library overhead lighting plus some natural light from a nearby window, with no additional illumination on the pages. That said, they looked fine, and there's no appreciable difference between the failing pages and the rest, which is what left me puzzling over the reason for the problem.

I had read the post you referred me to previously and was interested to see that others had requested something similar. What I was wondering about, though, was the feasibility of selecting an amount to 'add' at the bottom of each content box to include the page number where it is missing - slightly different from setting a fixed content box and simply propagating that throughout. Another way might be to detect the bottom edge of the page (would depend on the scan I guess) and simply measure up a pre-determined distance from there. While I realise that every scan has the potential to be different (placement/size/skew), I suspect the way it's being done here with a camera/stand means that the positioning is likely to be more uniform, and the possibility of propagating a single selection throughout is likely to meet with more success than with traditional methods. In this particular case, for instance, I think position and scale are sufficiently similar between images that a fixed content box would work fine - or at least the need to go through and fix up individual pages would be reduced somewhat. Certainly if I had any coding skills I'd gladly take you up on your suggestion to include it; I think it'd be useful as an option. Sadly I don't, so I can only contribute through testing & reporting on results :-) If I become familiar enough with ST I'll have a crack at a help file of sorts if that's of any use.
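To make the two ideas above concrete, here is a minimal Python sketch of the geometry only. It is not Scan Tailor code: the `Box` type and both function names are hypothetical, and it assumes pixel coordinates with the origin at the top-left of the page.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned rectangle in pixel coordinates (origin at top-left)."""
    left: int
    top: int
    right: int
    bottom: int


def extend_bottom(box: Box, extra: int, page_height: int) -> Box:
    """Grow the content box downward by `extra` pixels, clamped to the page."""
    new_bottom = min(box.bottom + extra, page_height)
    return box._replace(bottom=new_bottom)


def bottom_from_page_edge(box: Box, page_bottom: int, offset: int) -> Box:
    """Alternative: put the bottom edge `offset` pixels above the detected
    bottom edge of the page, never shrinking the existing box."""
    return box._replace(bottom=max(box.bottom, page_bottom - offset))


# Hypothetical numbers: a detected box that stops short of the page number.
detected = Box(left=100, top=200, right=1500, bottom=2000)
print(extend_bottom(detected, extra=150, page_height=2400))
print(bottom_from_page_edge(detected, page_bottom=2400, offset=120))
```

Either variant would only help when position and scale are roughly uniform across the images, which is the camera/stand case described above.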

Rgds, P.
phaedrus
 
Posts: 56
Joined: 06 Oct 2009, 05:47
Location: New Zealand

Re: Scan Tailor

Postby phaedrus » 23 Nov 2009, 22:48

Well, regarding my last statement I thought I'd have a go at translating the 'quick start guide' and 'tips for scanning' from Russian to English. Here's the result with some extra stuff added in by me for the quick start guide:

Quick Start guide.

1. Open ScanTailor & select ‘new project’ to begin with a new set of images.
2. Browse to select the directory that contains the images you wish to process
3. Select the images as required.
4. ScanTailor will default to using the same directory appended with an ‘out’ directory for processed images but you can select a different directory at this stage if you wish.
5. At the next stage it’s necessary to set the DPI level. Click on ‘all pages’, set or use drop arrow in ‘custom’ to select the level. For accurate content selection use a lower setting (300 x 300).
6. Set the orientation of the first image, if necessary click ‘apply to’ and select ‘all pages’ to propagate that setting to the rest of the images.
7. Click to ‘split pages’. If the image is of two pages of a book for instance ScanTailor will attempt to split the image into the two pages. Click on the > arrow to have ScanTailor batch process split for the remainder of the images. Splits can be set using the page layout section.
8. In all cases individual pages can be inspected in the thumbnails to the right of the main page, clicking on a thumbnail will load that page to the main area and allow for adjustment of various parameters.
9. Select ‘deskew’ if it’s needed and ScanTailor will attempt to straighten up text and images based on the edges of the text/image. Once again use the > arrow to batch process as needed and use the thumbnails to quickly check each page.
10. Click on ‘select content’ and batch process (> right arrow). Check thumbnails and adjust by selecting that thumbnail and altering main page as needed (click and drag edges of the content box). If no content is selected a content box may be manually entered by selecting that image and right-click in the main window, add content box and adjust as needed. Likewise one may remove a content box from a page by right-click and delete.
11. Then move to ‘page layout’ and adjust the various parameters according to your requirements. ScanTailor defaults work well for most cases.
12. Finally click on ‘output’ and select the type of image output desired. Black and white will produce a clean output image for simple text and line drawings. If some text appears to be ‘missing’ try adjusting the line thickness to see if it appears (increase). Once set this may be applied to all pages and run through the batch process. Alternatively for images with photographs etc one may select ‘mixed’ or even colour/grayscale. In all cases it’s wise to check the thumbs after processing particularly in black & white or mixed mode to ensure that all the required content is included.
13. Files will be saved as TIFFs in the selected directory and may be further manipulated or used as required from there.


Tips for scanning.

To get a good result with Scan Tailor, and minimise errors in the automatic processing, follow these rules when scanning:

- Don’t scan in black and white mode - use grayscale or, if necessary, colour.
- Don’t scan at a resolution lower than 300 DPI; typically scan at 300 or 600 DPI.
- Do not scan to JPEG - the format is lossy, and converting to other formats afterwards will not recover the lost quality. (Camera users will most likely need to use JPG mode, but should use the best resolution possible.)
- When scanning, typically use TIFF format, but be careful: a TIFF file can itself use JPEG compression.
- If you need to compress the image, choose LZW - a lossless compression method.
- If unable to use TIFF, scan to PNG - it is guaranteed to use lossless compression - or, in extreme cases, to BMP (the file will be uncompressed and therefore enormous, and should preferably be converted to a format that uses lossless compression).
- Avoid the "Document" scanning mode, and generally try to disable any options that claim to improve the scans.
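As a rough way to check the format tips above, the following Python sketch classifies a scan by its leading bytes and, for TIFF, reads the Compression tag (259) from the first image directory to spot JPEG-in-TIFF. The function names and the returned labels are my own; the sketch only looks at the first directory and does not handle BigTIFF.

```python
import struct

# TIFF Compression tag values that indicate lossy JPEG data.
JPEG_IN_TIFF = {6, 7}  # 6 = old-style JPEG, 7 = JPEG
LZW = 5


def tiff_compression(data: bytes) -> int:
    """Return the Compression tag (259) of the first IFD in a TIFF file."""
    order = "<" if data[:2] == b"II" else ">"  # II = little-endian, MM = big
    (ifd_offset,) = struct.unpack_from(order + "I", data, 4)
    (entry_count,) = struct.unpack_from(order + "H", data, ifd_offset)
    for i in range(entry_count):
        entry = ifd_offset + 2 + 12 * i  # each IFD entry is 12 bytes
        tag, field_type = struct.unpack_from(order + "HH", data, entry)
        if tag == 259 and field_type == 3:  # type 3 = SHORT
            (value,) = struct.unpack_from(order + "H", data, entry + 8)
            return value
    return 1  # tag absent: the TIFF default is "no compression"


def classify_scan(data: bytes) -> str:
    """Label the image data as lossy or lossless based on its header."""
    if data.startswith(b"\xff\xd8\xff"):
        return "JPEG (lossy)"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "PNG (lossless)"
    if data.startswith(b"BM"):
        return "BMP"
    if data[:4] in (b"II*\x00", b"MM\x00*"):
        code = tiff_compression(data)
        if code in JPEG_IN_TIFF:
            return "TIFF with JPEG compression (lossy)"
        if code == LZW:
            return "TIFF with LZW compression (lossless)"
        return "TIFF (compression code %d)" % code
    return "unknown"
```

For a file on disk, something like `classify_scan(open(path, "rb").read())` would do; the path is whatever your scanner produced.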

I'll have a go at uploading it to SF if I can find out how :-)

UPDATE: Tulon, I've mailed you via SF to ask for permission to edit the ST English Wiki.

Cheers, P.
phaedrus
 
Posts: 56
Joined: 06 Oct 2009, 05:47
Location: New Zealand

Re: Scan Tailor

Postby Tulon » 24 Nov 2009, 18:26

Phaedrus,

To get Wiki editing permissions, please visit the wiki while logged in to SourceForge, then tell me your username there. Visiting the wiki while logged in will make it recognize your SourceForge username, so that I can give you editing permissions.
When Scan Tailor asks you to enter DPIs manually, never enter arbitrary values. The video tutorial shows how to estimate the real DPI.
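
The video tutorial's exact method is not reproduced in this thread, but the underlying arithmetic is just pixels divided by physical size in inches. A minimal Python sketch, with hypothetical function names and made-up example measurements:

```python
def mm_to_inches(mm: float) -> float:
    """Convert millimetres to inches (1 inch = 25.4 mm)."""
    return mm / 25.4


def estimate_dpi(pixels: int, physical_inches: float) -> float:
    """Estimate DPI from a length measured in pixels in the image and the
    same length measured on the physical page."""
    if physical_inches <= 0:
        raise ValueError("physical size must be positive")
    return pixels / physical_inches


# Hypothetical example: a page 148 mm wide spans 1748 pixels in the photo.
print(round(estimate_dpi(1748, mm_to_inches(148))))
```

Measure the page (or a ruler in the shot) rather than the whole photo, since a camera image usually includes background around the page.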
Tulon
 
Posts: 536
Joined: 03 Oct 2009, 06:13
Location: London, UK

Re: Scan Tailor

Postby phaedrus » 25 Nov 2009, 02:06

Hi Tulon, I'll PM you from here - I already sent a message via SF but had a feeling it didn't work :-)

Cheers, P.
phaedrus
 
Posts: 56
Joined: 06 Oct 2009, 05:47
Location: New Zealand

Re: Scan Tailor

Postby Tulon » 25 Nov 2009, 03:53

Phaedrus,

I kind of received your PM via SourceForge, except it ended up in my spam folder. Gmail doesn't seem to like mail from sourceforge.net very much.
Anyway, I gave you editing permissions.
BTW, the easiest way to create a new page is to do a search for its name and then follow the red link indicating a non-existing page.
Tulon
 
Posts: 536
Joined: 03 Oct 2009, 06:13
Location: London, UK

Re: Scan Tailor

Postby DSpider » 26 Nov 2009, 13:35

I thought Scan Tailor would help with OCR-ing. It didn't. On the contrary... ABBYY FineReader 10 - considered the best OCR program right now (well, at least for Romanian) actually performed better from the original rather than the processed file...


I may have found a bug in Scan Tailor. The top left corner is missing.

Would you like to see ?

Archive.7z (2.2 MB)
Sorry I couldn't attach it here (maximum size is 2 MB).

Take a look at Scan Tailor Proccessed.tiff and you'll see the first line doesn't include a hyphen (dialogue dash "-").


Also, after deskewing with ABBYY FineReader the recognition was improved. But this may be an issue with FineReader itself, not Scan Tailor... Because I also noticed that manually selecting the text area gives worse results than the automatic analysis. If you take a look at From the original, ABBYY deskewed, automatic selection of the text area.PNG the first hyphen ("-") was set as a... table ? I mean it's blue... Text areas are green. Wtf.

OCR-ing is such a b*tch... But this is not a topic on FineReader or OCR-ing. What I wanted to ask was if there's a "better" alternative to this. Actually, no. Which is the "best" deskewer available right now (either open-source or commercial) ? For pre-OCR processing that is.
DSpider
 
Posts: 48
Joined: 26 Nov 2009, 09:56

Re: Scan Tailor

Postby spamsickle » 26 Nov 2009, 14:46

DSpider, I always turn off (uncheck) the Despeckle box in black and white mode. I think that will take care of your bug too. For most of what I scan, "speckles" aren't a problem, and the de-speckling in Scan Tailor is kind of aggressive, swallowing lots of dashes and tildes and equals signs and page numbers that aren't really specks at all. Maybe when I scan some of my 18th century books, I'll need the despeckling if I want to OCR them, but for now, I just turn it off.
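
To illustrate why a despeckler can swallow dashes, here is a small Python sketch of one common approach: erase every connected blob of ink below an area threshold. This is an assumption for illustration only; the thread does not describe Scan Tailor's actual despeckling algorithm, and `despeckle` and `min_area` are hypothetical names.

```python
from collections import deque


def despeckle(image, min_area):
    """Return a copy of a binary image (1 = ink, 0 = paper) with every
    4-connected ink component smaller than `min_area` pixels erased."""
    height, width = len(image), len(image[0])
    result = [row[:] for row in image]
    seen = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first search collects one connected component.
            component = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                component.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and image[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(component) < min_area:
                for cy, cx in component:
                    result[cy][cx] = 0
    return result


# A 3-pixel "hyphen", a 6-pixel "letter" and a 1-pixel speck.
page = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 1, 1, 0],
    [1, 0, 0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
gentle = despeckle(page, min_area=2)      # removes only the speck
aggressive = despeckle(page, min_area=4)  # also removes the hyphen
```

With a purely size-based rule like this, a hyphen and a speck differ only in pixel count, so any threshold high enough to catch larger specks also takes small punctuation with it.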

One thing I've noticed, though -- if I select "color or grey scale", Scan Tailor seems to ignore the content selection, and gives me more than I've marked as the "page". Is this a bug or a feature of page formatting?
spamsickle
 
Posts: 577
Joined: 06 Jun 2009, 23:57

Re: Scan Tailor

Postby Tulon » 26 Nov 2009, 16:37

DSpider wrote:I thought Scan Tailor would help with OCR-ing. It didn't. On the contrary... ABBYY FineReader 10 - considered the best OCR program right now (well, at least for Romanian) actually performed better from the original rather than the processed file...

This shouldn't surprise you. Scan Tailor's output might look better for the eyes, but from the point of view of FineReader, it contains less information than the original. I imagine it would do better if you would output in "Color / Grayscale" mode.

DSpider wrote:I may have found a bug in Scan Tailor. The top left corner is missing.

It's a case of overly aggressive despeckling. Improving it will be the focus of Scan Tailor's next major release.
Tulon
 
Posts: 536
Joined: 03 Oct 2009, 06:13
Location: London, UK

Re: Scan Tailor

Postby Tulon » 26 Nov 2009, 16:45

spamsickle wrote:One thing I've noticed, though -- if I select "color or grey scale", Scan Tailor seems to ignore the content selection, and gives me more than I've marked as the "page". Is this a bug or a feature of page formatting?

It's a feature. There is a "White margins" option in this mode that will clear everything outside of the content box. Chances are you wouldn't like the results though, because there will be a very visible transition between content area and margins. That is, unless your page background is clear white.
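
A minimal Python sketch of what clearing everything outside the content box amounts to, and why the transition shows on off-white paper. This is not Scan Tailor's implementation; the function name, the box convention and the grey values are assumptions for illustration.

```python
def white_margins(image, left, top, right, bottom, white=255):
    """Return a copy of a grayscale image (rows of 0-255 values) in which
    every pixel outside the box [left, right) x [top, bottom) is white."""
    return [
        [pixel if (top <= y < bottom and left <= x < right) else white
         for x, pixel in enumerate(row)]
        for y, row in enumerate(image)
    ]


# Hypothetical 4x4 page: paper background at 200 (off-white), one dark pixel.
page = [[200] * 4 for _ in range(4)]
page[1][1] = 50
cleared = white_margins(page, left=1, top=1, right=3, bottom=3)
for row in cleared:
    print(row)
```

Inside the box the paper stays at 200 while the margins jump to 255, which is the visible edge described above; it disappears only when the paper itself is already pure white.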
Tulon
 
Posts: 536
Joined: 03 Oct 2009, 06:13
Location: London, UK

Re: Scan Tailor

Postby spamsickle » 26 Nov 2009, 19:21

Hmmm, what I really want is a "No margins" option in this case, I think. Often, my color pages are just the front and back cover, and what I want on that page is just the "content" of those images. What I'm doing now is often interactively specifying a tiny sliver as "content" and trying to get the image I want that way, but it's hard to predict when what I pick bears little resemblance to what I get. Just shaving away my selection doesn't really work either, because ST doesn't seem to just shave an equal amount off of my output -- there are discontinuities, where removing a little of my selection causes the output to "jump" quite a bit, and other cases in which removing a lot of the selection doesn't seem to have any effect at all.

It's my fault. I shouldn't be treating ST as a black box. I have the source, but I still haven't started digging into it. It's just so darn useful as a tool, and I have so darn many books to get through. Worst case, I'll just stop having ST process these cover images, and convert a straight JPEG selection into my final PDF.

Right now, my biggest "problem" is not with ST, but with the loss of quality introduced by the programs I'm using to convert ST's TIFFs to the PDFs I'm using in my final ebook. When there are images on the page as images (ST Mixed mode), they tend to make my text more ragged than I'd like.
spamsickle
 
Posts: 577
Joined: 06 Jun 2009, 23:57
