Alternative Software Workflow

Post by jradi » 19 Jul 2009, 20:05

I just thought I'd post some alternatives for those looking for a different end product, or maybe having a problem with postprocessor or pagebuilder.

1. After I take all the photos, I copy them into two separate directories, c:\right and c:\left. I rotate the images by selecting everything in the directory, right-clicking, and choosing "rotate clockwise" or "rotate counterclockwise." This feature is built into Vista.
(4 minutes)

2. I use Ant Renamer to rename all the left pages: start at 1, pad with zeros to 3 digits, and increment by 2 (001, 003, 005, ...). I rename all the right pages the same way, starting at 2 (002, 004, 006, ...).
(<1 minute)

http://www.antp.be/software/renamer

3. I use JPEGCrops to crop all the left hand images first. There's a feature called "synchronize crops." I turn this on and load up all the images. I set the crop for the first page and page through the other thumbnails to see if it looks fine. I pay particular attention to not grab too much of the opposing page and limit the amount of border around the page. In general, there's very little variance from page to page, so it's pretty easy to set a single crop zone that will work all the way across.

With really thick books, there's a little drift due to the thickness of the book. For these, I only import 1/3 or 1/2 of the pages into JPEGCrops at a time. A random page that falls out of bounds can be cropped separately from the rest of the synchronized crops; that's why I page through all the thumbnails, to see whether any need individual tweaking.

(about 10 min)
http://ekot.dk/programmer/JPEGCrops/

4. Then I copy all the cropped images into a single directory, c:\both. Actually, JPEGCrops outputs to the "both" directory, so I skip this step. I do a spot check to make sure my numbering of alternating pages worked: I look at a few images at random to make sure consecutive numbers really are facing pages. I usually spot check the beginning, middle, and end.

5. Lastly, I import all the images into ABBYY Express 9.0 and perform a conversion into "other." I prefer to have my OCR output as vanilla text rather than Word or PDF. ABBYY does a really good job of keeping the formatting, which I hate. I really just want plain old text to format later.

(2-3 hours per 100 photos - this is an overnight step)

6. I save the output of abbyy to a text file, then I look for warnings. Usually, if there's a page that was out of focus or something, abbyy will tell you that there was no text to capture. I also just quickly look through the images for pages that have very little text to see if abbyy handled them ok. If not, I make a note of those page numbers as well. Then I do a word search of the text file and manually "repair" any problems. Usually this involves manually typing 2-3 sentences per blurry page, always a handful per book.

If I'm paying attention, I usually can skip step 6 by making sure to put a page with text on it in front of the camera during the focusing step. However, I'm usually doing something else (listening to the radio, an audiobook, watching tv), and I zone out and miss some pages...

Anyway, that's my workflow using almost all free software. The abbyy is the expensive part, $50, but you can skip the ocr if all you want is to output to a pdf document.
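
For anyone who would rather script the numbering from step 2 than use a renaming tool, here is a minimal Python sketch. It only builds a rename plan of (old path, new path) pairs and does not touch any files. The function names `interleaved_name`, `rename_plan`, and `missing_pages` are made up for illustration, and it assumes the photos in each directory sort into shooting order by filename and all end in .jpg. `missing_pages` is a rough stand-in for the spot check in step 4.

```python
from pathlib import Path


def interleaved_name(index: int, start: int, ext: str = ".jpg") -> str:
    """Name for the index-th photo (0-based) of one side.

    start=1 gives 001, 003, 005, ...; start=2 gives 002, 004, 006, ...
    """
    page = start + 2 * index
    return f"{page:03d}{ext}"


def rename_plan(folder: Path, start: int) -> list[tuple[Path, Path]]:
    """Pair every .jpg in folder with its new zero-padded name.

    Assumption: sorting by filename matches the order the photos were taken.
    """
    photos = sorted(p for p in folder.iterdir() if p.suffix.lower() == ".jpg")
    return [
        (photo, photo.with_name(interleaved_name(i, start, photo.suffix)))
        for i, photo in enumerate(photos)
    ]


def missing_pages(names: list[str]) -> list[int]:
    """Page numbers absent from a merged listing such as the "both" directory.

    Assumption: every name is purely numeric before the extension.
    """
    pages = {int(Path(name).stem) for name in names}
    if not pages:
        return []
    return [n for n in range(1, max(pages) + 1) if n not in pages]
```

Something like `rename_plan(Path(r"c:\left"), 1)` and `rename_plan(Path(r"c:\right"), 2)` would give the two plans. Applying them with `old.rename(new)` is left out on purpose, since in-place renames can collide if the camera's filenames already look like 001.jpg.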
jradi
 
Posts: 82
Joined: 06 Jun 2009, 21:31
Location: DC - NoVa

Re: Alternative Software Workflow

Post by daniel_reetz » 19 Jul 2009, 22:08

I had no idea JpegCrops existed. Another excellent, free renaming utility is "Bulk Rename Utility".
daniel_reetz
 
Posts: 2485
Joined: 03 Jun 2009, 13:56

Re: Alternative Software Workflow

Post by you1 » 20 Jul 2009, 02:12

Thanks Jradi.

daniel_reetz wrote: "I had no idea JpegCrops existed. Another excellent, free renaming utility is "Bulk Rename Utility"."


Ya, imagine that; like an idiot, I wrote my own piece.

Half the battle is knowing what's out there, and the other half is remembering to use them.
you1
 
Posts: 92
Joined: 08 Jun 2009, 18:55
Location: Central California

Re: Alternative Software Workflow

Post by jradi » 21 Jul 2009, 08:04

I still think a task-specific program would be better; I just couldn't get it working, and I wanted to get on with scanning books. The above solution is OK, but not perfect. ABBYY somehow rotates the page so all the lines of text are perfectly horizontal. I would love to see that feature in a JPEGCrops type of program. I would also like to automate the whole process, since my way takes too much time: about 30 minutes of manual labor per book. Still, it's worth it, but if I could limit my manual labor to the scanning process and just hit a button for the software side, I could do 3 times the scanning...
jradi
 
Posts: 82
Joined: 06 Jun 2009, 21:31
Location: DC - NoVa

Re: Alternative Software Workflow

Post by daniel_reetz » 21 Jul 2009, 08:18

Surya is employing your method now (I hear from him via email much more often than the forum).

I know you've got a method worked out and everything, but PageBuilder will do your first two steps -- crop and rotate. From those JPGs you could use ABBYY. All you need is PageBuilder 2 and the Matlab Component Runtime. Apologies if you've already tried it. Just be sure to check the JPG output radio button.
daniel_reetz
 
Posts: 2485
Joined: 03 Jun 2009, 13:56

Re: Alternative Software Workflow

Post by spamsickle » 21 Jul 2009, 21:03

I appreciate the tip on JPEGcrops. I'd been a bit frustrated by "book jitter" when cropping images with ImageMagick. This program conveniently eliminates that issue.

I switch to "free" on the aspect ratio, click "synchronize crops" and frame one page. This puts a page-sized frame on each of the pages. Then I unclick "synchronize crops" and scroll through them using the thumbwheel on my mouse, and nudge the frames into place if necessary.

The only problem I've run into is that it runs out of memory (on a 4-gig machine) if I try to do several hundred pages at once.
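
One way to work around the memory limit (and the drift in thick books mentioned earlier in the thread) is to feed the program a slice of the pages at a time. Here is a minimal sketch of the splitting, assuming a plain list of filenames. `batches` and `split_evenly` are hypothetical helper names, not part of JPEGCrops, and the batch size that actually fits in memory would have to be found by trial.

```python
def batches(items: list, size: int) -> list[list]:
    """Cut items into consecutive runs of at most `size` entries."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


def split_evenly(items: list, parts: int) -> list[list]:
    """Cut items into `parts` consecutive runs of near-equal length.

    Earlier runs get the leftover entries; with fewer items than parts,
    some runs come back empty.
    """
    if parts < 1:
        raise ValueError("parts must be at least 1")
    base, extra = divmod(len(items), parts)
    result, start = [], 0
    for i in range(parts):
        end = start + base + (1 if i < extra else 0)
        result.append(items[start:end])
        start = end
    return result
```

`split_evenly(pages, 3)` matches the "1/3 of the pages at a time" approach, while `batches(pages, 100)` caps each load at a fixed count.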
spamsickle
 
Posts: 572
Joined: 06 Jun 2009, 23:57

Re: Alternative Software Workflow

Post by jradi » 24 Jul 2009, 07:02

You learn something every day. I didn't realize there was a way to use JPEGCrops without a fixed aspect ratio, and I also didn't realize I could toggle crop synchronization on and off.
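
The synchronize-then-nudge approach described above amounts to one shared frame plus a few per-page offsets. Here is a rough Python sketch of that idea, purely as an illustration. The `Crop` class and `crops_for` function are invented names, and this is not a description of how JPEGCrops works internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Crop:
    """A crop frame in pixels, measured from the top-left corner."""
    left: int
    top: int
    width: int
    height: int

    def nudged(self, dx: int, dy: int) -> "Crop":
        """Same-sized frame shifted by (dx, dy) pixels."""
        return Crop(self.left + dx, self.top + dy, self.width, self.height)


def crops_for(pages: list[str], shared: Crop,
              nudges: dict[str, tuple[int, int]]) -> dict[str, Crop]:
    """Give every page the shared frame, shifted only where a nudge is listed."""
    return {page: shared.nudged(*nudges.get(page, (0, 0))) for page in pages}


shared = Crop(left=100, top=50, width=1200, height=1800)
result = crops_for(["001.jpg", "003.jpg"], shared, {"003.jpg": (12, -4)})
```

Here `result["001.jpg"]` is the shared frame unchanged, and `result["003.jpg"]` is the same size moved 12 pixels right and 4 pixels up.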

I'll give pagebuilder a shot this weekend. I'm glad to have a working system, but anything to automate it is appreciated. I'm just looking for a no hassle way to get the jpegs into abbyy and then into my kindle. For that, I don't need the cleanest jpegs, just the quickest.
jradi
 
Posts: 82
Joined: 06 Jun 2009, 21:31
Location: DC - NoVa

Re: Alternative Software Workflow

Post by xylon » 24 Jul 2009, 08:01

Is anyone else sometimes getting an error in ABBYY that appears to result from using JPEGCrops? The error is "internal program error: .\src\imageinfoimpl.cpp, 378". It does not happen every time I use JPEGCrops.
Last edited by xylon on 24 Jul 2009, 08:26, edited 2 times in total.
xylon
 
Posts: 27
Joined: 01 Jul 2009, 15:29

Re: Alternative Software Workflow

Post by jradi » 24 Jul 2009, 08:05

Yeah, I get a random error and I can't figure it out. For those pages, I end up having to go back to the original page, rotate it, crop it using a different program, and insert it into my batch before reimporting into abbyy.

Luckily, abbyy flags bad jpegs right off the bat so it's not like I've wasted hours waiting on the ocr before this error occurs.
jradi
 
Posts: 82
Joined: 06 Jun 2009, 21:31
Location: DC - NoVa

Re: Alternative Software Workflow

Post by xylon » 24 Jul 2009, 08:26

Setting JPEGCrops to the 13x19 cm aspect ratio seems to fix the problem.
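
For reference, a fixed 13x19 aspect just means the frame's width and height stay in a 13:19 ratio. Below is a small sketch of that arithmetic, assuming the image size is known in pixels. `largest_box` is a hypothetical helper, and nothing here explains why the fixed aspect avoids the ABBYY error.

```python
def largest_box(img_w: int, img_h: int,
                ratio_w: int = 13, ratio_h: int = 19) -> tuple[int, int, int, int]:
    """Largest centered frame with an exact ratio_w:ratio_h shape.

    Returns (left, top, width, height) in pixels.
    """
    # Biggest whole-number scale at which the frame still fits both ways.
    scale = min(img_w // ratio_w, img_h // ratio_h)
    if scale < 1:
        raise ValueError("image is too small for this aspect ratio")
    width, height = ratio_w * scale, ratio_h * scale
    # Center the frame; any odd leftover pixel ends up on the right/bottom.
    return (img_w - width) // 2, (img_h - height) // 2, width, height
```

For a 2000x3000 pixel photo this gives a 1989x2907 frame offset by (5, 46).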
xylon
 
Posts: 27
Joined: 01 Jul 2009, 15:29
