First, thanks for the quick response and for just being all-around awesome. Second, if it's appropriate to do so and no one has a build lying around, I'll ask again in a week or so, after the ISP craziness gets sorted out.
My workflow right now looks like this:
1) Take high-resolution pictures of documents of arbitrary size and orientation lying in a folder on a tabletop, with varying light and background. (A custom console program USB-triggers a Canon Rebel Ti.)
This results in thousands of JPEGs spread across hundreds of folders, where each folder name records that document's location in the archive.
2) Do content selection, splitting, orientation, and cropping (Scan Tailor)
This results in color TIFFs cropped nicely to just the text on the front/back of each document, which generally fits on a letter-sized page. I'm still looking for a way to get rid of pages that contain only junk, like my hand or the blank back of a document.
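In case it helps with the junk-page problem, here's a rough heuristic sketch, not a tested solution: classify a page by its fraction of dark pixels. A blank back has almost none, a hand filling the frame has a huge fraction, and a normal text page sits in between. The thresholds below are pure guesses that would need tuning on real scans; the grayscale pixel values could come from something like Pillow's `Image.open(path).convert("L").getdata()`.

```python
def classify_page(gray_pixels, dark_thresh=96, blank_max=0.002, junk_min=0.25):
    """Rough page triage from grayscale pixel values (0-255).

    Assumed thresholds (tune on real scans):
      - almost no dark pixels -> probably a blank back
      - a huge dark fraction  -> probably junk (e.g. a hand in frame)
      - otherwise             -> keep as a text page
    """
    dark = sum(1 for p in gray_pixels if p < dark_thresh)
    frac = dark / len(gray_pixels)
    if frac < blank_max:
        return "blank"
    if frac > junk_min:
        return "junk"
    return "text"
```

Lighting variation will throw off fixed thresholds, so a per-batch calibration pass (or Otsu-style adaptive thresholding) would probably be needed in practice.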
3) Use a script to convert those TIFFs into a single PDF for each folder, named after the folder.
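For step 3, a sketch of how that script could build one conversion command per folder. I'm assuming the `img2pdf` command-line tool here (`tiff2pdf` or ImageMagick would work too); the function only constructs the commands, so the actual tool choice is easy to swap:

```python
from pathlib import Path

def pdf_jobs(root):
    """For each subfolder of `root` that contains TIFFs, build one
    img2pdf command bundling them into <folder name>.pdf.
    (img2pdf is an assumption; any TIFF-to-PDF tool would do.)
    """
    jobs = []
    for folder in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        tiffs = sorted(folder.glob("*.tif")) + sorted(folder.glob("*.tiff"))
        if tiffs:
            out = folder / f"{folder.name}.pdf"
            jobs.append(["img2pdf", *map(str, tiffs), "-o", str(out)])
    return jobs
```

Each job list can then be handed to `subprocess.run(job, check=True)`.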
4) Pull all the PDFs out into a single folder.
5) Run OCR on the PDFs (Acrobat Professional / OmniPage).
Because the layout of the material inside a folder can vary wildly, there's no practical way to handle it by hand: it could be two A-size sheets side by side, a legal sheet lying horizontally, a legal sheet lying vertically, the folder may be visible underneath or completely obscured, and the table might be a clean white or one of these horrible speckled ones. I take about 10,000 pages in a week on my short visits, and the trick now is to get that from 100 GB of JPEGs into nice, small, skimmable, and searchable PDFs. The whole process works manually with Scan Tailor; I just need to be able to script it from the command line.
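If a command-line build does turn up, the batch driver I have in mind would be something like the sketch below: one run per capture folder, with the output folder mirroring the input folder name so the archive-location naming survives. I'm assuming a basic `scantailor-cli <input dir> <output dir>` invocation; the real options would have to come from that build's `--help`, so the function only constructs the commands:

```python
from pathlib import Path

def scantailor_jobs(jpeg_root, out_root):
    """Sketch of a batch driver: one hypothetical scantailor-cli run per
    capture folder. Check the actual build's --help for real options."""
    jobs = []
    for folder in sorted(p for p in Path(jpeg_root).iterdir() if p.is_dir()):
        out = Path(out_root) / folder.name
        jobs.append(["scantailor-cli", str(folder), str(out)])
    return jobs
```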