Processing for a newbie - help needed

Post by Heelgrasper » 26 Feb 2012, 23:22

I’m a total newbie, so it’s no wonder I sometimes get a little confused and perhaps just can’t find the right place where something is mentioned here. Bear with me if that’s the case. And I know you’ll have to bear with me for making this so long, in pretty crappy English.

I use Windows 7 and will likely use some sort of Windows OS until that seems as odd as someone using MS-DOS today. My OS is also about the only software I use that actually costs money. It’s not some political statement, and it doesn’t matter to me whether I’m using freeware or free software. I’m just cheap, and if I can legally get something for free that is just as good, or nearly as good, as something I have to pay for, then I see no reason to pay. To make it clear: I find it perfectly fair that you have to pay people for a job they do, including making software, and I wouldn’t bitch at all about having to pay $10 for some software. I just try hard to avoid it.

I’m guessing I’m not the only one thinking that way so hopefully helping me means helping a lot of others too.

From my starting point I’ve tried looking a little at how to do the processing work after the actual scanning. I had a couple of articles I had scanned/photographed that I practised a bit on. For me the goal is to end up with a usable PDF and I managed to get that but I’m sure the process can be perfected. I’m just having trouble finding out exactly how.

So now I’ll try to line up four post-processing procedures with a few comments along the way. I hope some of you will then jump in and tell me what should/could be used at the different stages in the different procedures. I’ve tried to make a numbering system so it should be easy to identify which stage we’re talking about. As far as I can see, those four procedures should cover pretty much anything you would want to do with your images.

Keep in mind, free software as much as possible.

Process 1: Images directly to PDF
This is the shortest way from camera to PDF. You keep as much information as possible but end up with big PDFs.
1.1. Create PDF from images. Note: I’ve used i2pdf and it worked fine. I haven’t tried alternatives, though, so other programs might work better. At Early European Books ( http://eeb.chadwyck.com/ ) they seem to use this procedure “on the fly”. I know most of you can't get to see anything from there, so you'll just have to take my word for it.

Process 2: Clean up images before making a PDF
For making a PDF that looks like a book printed on bright white paper, and for making the image files as small as possible by reducing pure text to b/w, resulting in a smaller PDF.
2.1. Process the images (could include splitting pages, deskewing, cropping and making the resulting images the same size). Could be done in ScanTailor or Book Scan Wizard. If you just want to batch crop the images, I suspect there’s something simpler that could be used. Note: I’ve only tried ScanTailor, but to my understanding BSW does much of the same.
2.2. Create PDF. Note: This stage is the same as 1.1.
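
As a rough illustration of why “reducing pure text to b/w” in Process 2 shrinks the files, here is a minimal Python sketch. It is a toy under stated assumptions: a fixed global threshold of 128, a made-up 2 x 8 pixel grid, and hypothetical helper names (binarize, pack_bits). ScanTailor and Book Scan Wizard use their own, more elaborate methods, which this does not try to reproduce.

```python
def binarize(gray_rows, threshold=128):
    """Map 8-bit gray values (0-255) to 1-bit pixels: 0 = black ink, 1 = white paper.

    The fixed threshold of 128 is an assumption for illustration only.
    """
    return [[1 if value >= threshold else 0 for value in row] for row in gray_rows]


def pack_bits(bw_rows):
    """Pack each row of 0/1 pixels into bytes, 8 pixels per byte.

    Rows whose width is not a multiple of 8 are padded with zero bits.
    """
    packed = bytearray()
    for row in bw_rows:
        for start in range(0, len(row), 8):
            chunk = row[start:start + 8]
            byte = 0
            for bit in chunk:
                byte = (byte << 1) | bit
            byte <<= 8 - len(chunk)  # pad a short final chunk
            packed.append(byte)
    return bytes(packed)


# A tiny made-up "scan": yellowish paper (values above 200) with dark ink (values below 70).
page = [
    [220, 215, 40, 35, 210, 225, 50, 230],
    [218, 30, 45, 222, 228, 60, 216, 221],
]

bw = binarize(page)
packed = pack_bits(bw)

gray_bytes = sum(len(row) for row in page)  # one byte per gray pixel
print("gray bytes:", gray_bytes)
print("b/w bytes: ", len(packed))
```

Each 8-bit gray pixel becomes a single bit, so the raw data shrinks by a factor of eight before any compression is applied. The paper also turns pure white, which is what gives the “printed on bright white paper” look.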

Process 3: Add OCR layer to PDF
For making a PDF you can search etc. Note: I haven’t tried this yet.
3.1. Process the images. Note: Same as 2.1.
3.2. Do OCR. Note: Might need some manual error correcting.
3.3. Create PDF with OCR layer

Process 4: Create PDF based on OCR
For creating a PDF with the text from the book in a format to your liking.
4.1. Process the images. Note: Same as 2.1 and 3.1.
4.2. Do OCR. Note: Same as 3.2
4.3. Put the text from the OCR into an editor, make the text look the way you want it, and then save/export it as PDF. Note: If I were doing it, I would use OpenOffice Writer.

So far I’ve tried process 1 and 2 with some success. The only “problem” has been ending up with rather large PDFs, but as far as I can tell that has a lot to do with picking the right DPI and image compression. Still, I feel I’m missing something. I have a PDF of a 138-page book from The Internet Archive, a “process 2” work with all pages in color, and the file is just under 8 MB. I’m nowhere near that.
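
To get a feel for how much DPI and colour depth matter, the arithmetic can be sketched in Python. The 6 x 9 inch page size and the function name raw_page_bytes are assumptions for illustration only; the 138 pages and 8 MB come from the Internet Archive example above (taking 8 MB as 8,000,000 bytes).

```python
def raw_page_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of one page image at a given DPI and bit depth."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes


# Assumed page size (hypothetical): 6 x 9 inches.
PAGE_W_IN, PAGE_H_IN = 6, 9

for dpi in (150, 300, 600):
    for label, bits in (("colour", 24), ("gray", 8), ("b/w", 1)):
        size = raw_page_bytes(PAGE_W_IN, PAGE_H_IN, dpi, bits)
        print(f"{dpi:>3} DPI {label:<6} {size / 1_000_000:8.2f} MB uncompressed")

# Per-page budget of the Internet Archive example: about 8 MB over 138 pages.
budget_per_page = 8_000_000 / 138
colour_300 = raw_page_bytes(PAGE_W_IN, PAGE_H_IN, 300, 24)
print(f"budget per page: {budget_per_page / 1000:.0f} kB")
print(f"compression needed for 300 DPI colour: about {colour_300 / budget_per_page:.0f}x")
```

Doubling the DPI quadruples the raw size, and going from colour to b/w divides it by 24, so under these assumptions both choices matter at least as much as which program builds the PDF. A file that small with every page in colour suggests very heavy image compression, whatever the exact method used.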

It’s likely I’ll never use “process 4”, but that’s just because in general I want to scan books where the PDF should look close to the original book, often with pictures and figures. If I were interested in scanning novels etc., I would be likely to use it.

Feel free to call me a nut, show me the error of my ways, or give suggestions on software to use, short instructions on how to use it, or links to earlier threads etc. that I should read.
---
Jakob Øhlenschlæger
Randers, Denmark

The past is a foreign country: they do things differently there
L. P. Hartley