DIY scanner and Scan Tailor processed books on Google Books

Misty
Posts: 481
Joined: 06 Nov 2009, 12:20
Number of books owned: 0
Location: Frozen Wasteland

Re: DIY scanner and Scan Tailor processed books on Google Bo

Post by Misty »

Thanks, Dingo. It's unfortunate that this requires Python, though. If I'm going to distribute this script, I want to keep dependencies to a minimum, so I'd rather people didn't need Python installed. Is there a way to compile Python code that would eliminate that dependency?

Tim, I thought this Slashdot story was timely: http://linux.slashdot.org/story/10/07/2 ... art_pos=26

Apparently it's possible to use Cuneiform and ExactImage together to achieve something like what we're looking for, but the WatchOCR Linux LiveCD project is pretty vague about how it actually combines them: there's no source download and no documentation on how they're configured.
The opinions expressed in this post are my own and do not necessarily represent those of the Canadian Museum for Human Rights.
dingodog
Posts: 110
Joined: 22 Jul 2010, 18:19
Number of books owned: 1000
Country: on the net
Location: on the net

Re: DIY scanner and Scan Tailor processed books on Google Bo

Post by dingodog »

Python is used to join all the symbols generated by the jbig2 encoding into one PDF.

Anyway, to people who want to use jbig2 and the pdf.py script, I can suggest the following:

- use *Puppy Linux* (it can run entirely in RAM as a live CD, without installing)
- http://dokupuppylinux.co.cc/

- add the jbig2enc encoder
- http://dokupuppylinux.co.cc/programs:encoders (I built it from source myself, based on the latest source code and the Leptonica libs)

- then add Python
- http://dokupuppylinux.co.cc/programs:python

(the *python 2.5* pet version is recommended; simply download it and click on it in Puppy Linux)

I use this environment (Puppy Linux plus the jbig2enc and Python packages) together with Scan Tailor to process my books.

I have found jbig2 to be very strong at compressing B/W images.
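
For anyone who would rather script those two steps, here is a rough Python sketch. It assumes the usage shown in the jbig2enc README: `jbig2 -s -p` writes a shared symbol file plus one file per page under the default basename `output`, and `pdf.py output` prints the joined PDF on stdout. The function names and file paths are made up for illustration, and it assumes a modern Python 3 and a pdf.py that runs under it (not the Python 2.5 pet mentioned above), so check it against your own build.

```python
import subprocess
import sys


def jbig2_command(pages):
    """Build the jbig2enc call: -s = symbol mode, -p = PDF-ready output."""
    return ["jbig2", "-s", "-p"] + list(pages)


def pdf_command(basename="output", pdf_py="pdf.py"):
    """Build the pdf.py call that joins the symbol file and the pages."""
    return [sys.executable, pdf_py, basename]


def build_pdf(pages, out_pdf, pdf_py="pdf.py"):
    """Encode bitonal page images, then write the joined PDF to out_pdf."""
    # Assumption: jbig2 and pdf.py are reachable from the current directory.
    subprocess.run(jbig2_command(pages), check=True)
    with open(out_pdf, "wb") as out:
        subprocess.run(pdf_command(pdf_py=pdf_py), stdout=out, check=True)
```

Calling `build_pdf(["page001.tif", "page002.tif"], "book.pdf")` would then do the same thing as running the two commands by hand.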
Tulon
Posts: 687
Joined: 03 Oct 2009, 06:13
Number of books owned: 0
Location: London, UK

Re: DIY scanner and Scan Tailor processed books on Google Bo

Post by Tulon »

I was quite late to notice this thread. Obviously, I am very pleased.

It prompted me to update Scan Tailor's front page and change development status on SourceForge from Beta to Production/Stable.
Scan Tailor experimental doesn't output 96 DPI images. It's just what your software shows when DPI information is missing. Usually what you get is input DPI times the resolution enhancement factor.
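
As a quick illustration of that arithmetic (the numbers here are made up, and the function name is just for this example):

```python
def output_dpi(input_dpi, enhancement_factor):
    """Expected output resolution: input DPI times the enhancement factor."""
    return input_dpi * enhancement_factor


# Hypothetical example: a 300 DPI scan with a 2x enhancement factor.
print(output_dpi(300, 2))  # 600
```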
Mondotofu
Posts: 6
Joined: 04 Mar 2014, 00:52

Tesseract without compilation

Post by Mondotofu »

Tim wrote:
There are three decent (depending on your needs and skills) options for open source OCR right now. Tesseract, Ocropus, and Cuneiform.
Tesseract - http://code.google.com/p/tesseract-ocr/ - the development version has to be built from source in order to get page layout analysis. See http://code.google.com/p/tesseract-ocr/wiki/ReadMe
@ Tim --
Depending upon your development platform, you may have choices. For example, with Ubuntu and some other Debian based distributions, you can get packaged distributions of tesseract.

As of this writing,

sudo apt-get install tesseract-ocr
will set up version 2.04 on your machine. By the way, later versions of Ubuntu are supported besides the one listed at the Google site. I'm currently on Ubuntu 10.04 x86.

If you're running Windows, you might not need to compile either. Just go to http://code.google.com/p/tesseract-ocr/downloads/list and download the Windows executables, which are also for version 2.04.

Similarly, Red Hat and its clones (such as CentOS), as well as the bleeding-edge Fedora project, have packages (rpm archives) to install. Websites such as rpmfind.net and rpm.pbone.net are good search engines for finding recent tesseract-ocr rpms and the prerequisites for installing them.

If you're running a Mac, it looks like DarwinPorts offers an easy way to access the functionality ( http://tesseract.darwinports.com/ ).
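
Whichever route you take, a script can check that the install worked before it tries to run OCR. This is only a sketch: it assumes the package puts an executable named `tesseract` on your PATH, and `find_tesseract` is a name I made up.

```python
import shutil


def find_tesseract(name="tesseract"):
    """Return the full path of the tesseract executable, or None if absent."""
    return shutil.which(name)


path = find_tesseract()
if path is None:
    print("tesseract not found; install a package for your platform")
else:
    print("tesseract found at", path)
```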

Mondotofu
Tim

Re: Tesseract without compilation

Post by Tim »

Mondotofu wrote: @ Tim --
Depending upon your development platform, you may have choices. For example, with Ubuntu and some other Debian based distributions, you can get packaged distributions of tesseract.
Indeed. What I was pointing out is that to get the new features, particularly the page layout analysis in the tesseract 3.0 development tree, you currently have to build from source until it is released. Page layout analysis allows handling things other than single-column text pages. Cuneiform also has some layout analysis, but it hasn't worked that well for what I've tried it on so far.
Digitizer
Posts: 9
Joined: 18 Jan 2011, 11:58

Re: DIY scanner and Scan Tailor processed books on Google Bo

Post by Digitizer »

Misty wrote:I produced these using a script I wrote that yokes together a few different tools to produce layered PDFs - that helps keep the filesizes down, by using an efficient bitonal compression on the text while downscaling the illustrations to 100DPI and compressing them with medium JPEG. I've received permission from my employer to release the script as GPL, so I'll make that available soon. I just have a few improvements to make before I do that.
Hi,

Where can I get the script you mentioned? I'm thinking about my own workflow now and would appreciate any suggestions.
Thanks for sharing this with us!

Cheers,
Marcus.
Misty
Posts: 481
Joined: 06 Nov 2009, 12:20
Number of books owned: 0
Location: Frozen Wasteland

Re: DIY scanner and Scan Tailor processed books on Google Bo

Post by Misty »

Hi Marcus,

My script was publicly released as PDFMaker. However, I'd recommend trying pdfbeads instead of PDFMaker. It's more advanced and has functionality I didn't include in PDFMaker, including OCR support.

While I did release PDFMaker as GPL, I've discontinued development on the current version. If I need to contribute anything, I would either contribute to pdfbeads or do a rewrite of PDFMaker.