Tim wrote:
JDSimmons wrote:
Thanks for the information. I'll have to check out that documentation a little more. I have a high opinion of Scan Tailor, and if I wasn't trying to make e-books to donate to the Internet Archive (where they like to have the page images look like the originals) I wouldn't use any other method of creating e-books.
What part of the Scantailor cleanup process do they object to? It might be something you can skip or compensate for in Scantailor. Or maybe IA could be persuaded or knows of someone that can contribute code to Scantailor to get output the way they like it.
Actually the Internet Archive will take anything you give them as long as it's in the public domain or properly licensed. Submissions from people like me are called "Community Texts". I'm trying to make my submissions look as much as possible like the books that IA scans themselves with their famous Scribe workstations. Nobody asked me to do that, it's just something I feel the need to do. This means that I need to crop my pictures to get page images, and I have to deskew the picture before cropping, etc.
Scan Tailor does not deskew or crop pages. Instead, it figures out where the content is on the page, deskews just that rectangular area, and sticks it on a brand new white page with new margins. The end result is very attractive, but does not look like the original page any more. The color of the page and the margins are different from the original book. Most of the time this is a great improvement, but it just isn't what the original page looked like.
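To make the difference concrete, here is a minimal Python sketch of the two approaches on a toy grayscale grid. It is only an illustration under my own assumptions, not Scan Tailor's actual algorithm: the function names (`content_box`, `crop`, `repage`), the threshold of 128, and the paper tone of 200 are all made up for the example, and deskewing is left out entirely.

```python
# Toy model (hypothetical): a page is a grid of gray values, 0 = ink, 255 = pure white.
WHITE = 255

def content_box(page, threshold=128):
    """Bounding box (top, left, bottom, right) of ink pixels; bottom/right are exclusive."""
    rows = [r for r, row in enumerate(page) if any(p < threshold for p in row)]
    cols = [c for c in range(len(page[0])) if any(row[c] < threshold for row in page)]
    if not rows:
        return None  # blank page, no content found
    return rows[0], cols[0], rows[-1] + 1, cols[-1] + 1

def crop(page, box):
    """Plain crop: keeps the original pixels, including paper color and margins."""
    top, left, bottom, right = box
    return [row[left:right] for row in page[top:bottom]]

def repage(page, margin, threshold=128):
    """Lift the content box and paste it onto a fresh white page with new margins."""
    box = content_box(page, threshold)
    if box is None:
        return [[WHITE] * len(row) for row in page]
    # Paper-colored pixels inside the content box are whitened (an assumption of this sketch).
    content = [[p if p < threshold else WHITE for p in row] for row in crop(page, box)]
    width = len(content[0]) + 2 * margin
    out = [[WHITE] * width for _ in range(margin)]
    for row in content:
        out.append([WHITE] * margin + row + [WHITE] * margin)
    out += [[WHITE] * width for _ in range(margin)]
    return out

# A yellowed page (tone 200) with a little ink near the middle.
page = [
    [200, 200, 200, 200, 200, 200],
    [200, 200,   0,   0, 200, 200],
    [200, 200,   0, 200, 200, 200],
    [200, 200, 200, 200, 200, 200],
    [200, 200, 200, 200, 200, 200],
]

faithful = crop(page, (0, 1, 4, 5))  # hand-picked page edges; paper tone survives
cleaned = repage(page, margin=1)     # new white page; original margins and tone are gone
```

In this sketch `faithful` still carries the 200 paper tone and the original spacing around the ink, which is the look I'm after for IA, while `cleaned` comes out on a pure white page with uniform margins, which is the kind of result that suits Distributed Proofreaders.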
If you check out the links to my three donated books (you can read them online without downloading them) you'll get a better idea of what I'm talking about.
Now when I submitted "Ancient Manners" to Distributed Proofreaders the look of the original pages was never an issue, because they're going to create Plain Text files and HTML. They just needed the cleanest pages and illustrations I could give them, and Scan Tailor is right on the money for that purpose.
Internet Archive also contains books digitized by Google and Microsoft, and the quality of their scans is noticeably worse than what IA does itself. My donations aren't up to IA standards, but I'm working on it.