by spamsickle » 12 Nov 2009, 12:24
I've mentioned elsewhere the tools that I'm using, but I might as well repeat them here.
I use ImageMagick to convert from image files to PDFs. Formerly, that meant JPGs to PDFs; now that I'm using Scan Tailor, it's TIFFs to PDFs, but ImageMagick still handles both just fine. The mogrify command
mogrify -format PDF *.TIFF
will convert all the TIFF files in a directory to PDF files, while retaining the TIFFs. If you're going from JPGs, just change TIFF to JPG...
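For anyone who would rather drive this step from a script, here is a minimal Python sketch of the same conversion. The function names are made up for illustration, and it assumes mogrify is on your PATH. It expands the wildcard itself and hands mogrify an explicit file list, so it doesn't depend on how your shell treats *.TIFF.

```python
import glob
import os
import subprocess


def build_mogrify_command(directory, ext="TIFF"):
    """Build the mogrify argument list for every *.ext file in directory.

    Hypothetical helper: it only assembles the command, it runs nothing.
    Matching is case-sensitive on Linux/macOS, case-insensitive on Windows.
    """
    pattern = os.path.join(directory, "*." + ext)
    files = sorted(glob.glob(pattern))
    # Lowercase "pdf" here; the post's command uses "PDF", which selects
    # the same output format.
    return ["mogrify", "-format", "pdf"] + files


def convert_to_pdf(directory, ext="TIFF"):
    """Run mogrify on the matching files; the source images are left alone."""
    cmd = build_mogrify_command(directory, ext)
    files = cmd[3:]
    if not files:
        return []  # nothing matched, so don't call mogrify at all
    subprocess.run(cmd, check=True)
    return files
```

Calling convert_to_pdf(".") handles TIFFs in the current directory; pass ext="JPG" if you're going from JPGs.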
Once all my files are PDFs, I use PDFTK (PDF toolkit) to batch them together:
pdftk *.pdf output mybook
ren mybook mybook.pdf
If you're using Scan Tailor with standard output, and your book is fewer than 1000 pages, you can collapse this into a single step:
pdftk 0*.pdf output mybook.pdf
The first "script" takes all the PDF files in the directory and outputs a new PDF. If you name the new (output) file something.pdf, it will look like an "input" file and match the *.pdf wildcard, which will turn into an infinite loop -- not what you want. So output to some non-PDF name, and rename it after you're done.
The second takes only the PDF files whose names begin with "0", and outputs a new PDF. As long as your new PDF's name doesn't also begin with "0", it won't match your input wildcard, and the command will work as you want. Scan Tailor outputs TIFF files whose names begin with 4-digit numbers, so for a book under 1000 pages every page name starts with "0", and ImageMagick carries those names over to the PDFs by default.
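The same "keep the output out of the input list" idea can be sketched in Python. This is only an illustration under a few assumptions: the function name is hypothetical, pdftk is assumed to be on your PATH, and the sketch spells out pdftk's documented cat operation explicitly instead of relying on the shorter form used above. Because the inputs are filtered before the command is built, the output can be named mybook.pdf directly, whatever the page files are called.

```python
import glob
import os
import subprocess


def build_pdftk_command(directory, output_name="mybook.pdf", pattern="*.pdf"):
    """Build a pdftk argument list that joins the PDFs in directory.

    Hypothetical helper: the output file is removed from the input list,
    so a leftover mybook.pdf from an earlier run is never fed back in.
    """
    out_path = os.path.abspath(os.path.join(directory, output_name))
    out_key = os.path.normcase(out_path)
    inputs = sorted(
        p for p in glob.glob(os.path.join(directory, pattern))
        if os.path.normcase(os.path.abspath(p)) != out_key
    )
    if not inputs:
        raise ValueError("no input PDFs found in " + directory)
    # Zero-padded names (0001.pdf, 0002.pdf, ...) sort into page order.
    return ["pdftk"] + inputs + ["cat", "output", out_path]


if __name__ == "__main__":
    # Joins every PDF in the current directory into mybook.pdf.
    subprocess.run(build_pdftk_command("."), check=True)
```

The sort only gives page order because the Scan Tailor names are zero-padded; files named 1.pdf, 2.pdf, 10.pdf would need a numeric sort instead.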
I'm still ending up with "image" files rather than nice tight PDFs with embedded fonts. In theory, I can run them through ABBYY to get text, and then use some text-to-PDF tool to convert that into a compact PDF. In practice, most of the books I'm scanning are not pure text; they contain images, tables, and such, which would have to be maintained as images anyway. I'm willing to give up storage space to save conversion time; your values may lead to different choices. Currently, I'm not even bothering with the ABBYY step for most of my books, and for those that do get the OCR treatment, it's still "text under image" for the "final" product.
I am saving all of my original DIY-scanner images separately, so if newer and better tools become available, I can go back and create a different result some time down the road.