CZUR ET24 Pro to epub?


Post by johnwhelan »

Yes, I have a couple of Canon A810s which I experimented with, but I'm lazy and eventually bought a CZUR ET24 Pro, which does the job without my having to worry about lighting and so forth. I have a few elderly books with small print; some pages are numbered, and the printing has faded in places.

It scans fairly quickly and gives me a set of .jpg files, one for each page.

Using the OCR that came with the device, I note some pages are perfect and others need cleaning up. Its dictionary is possibly missing a few words. The process seems to be scan to .jpg, then OCR to .txt, Word, or possibly .pdf.

If I export to Word 97, WordPerfect loses much of the text. Open the export in LibreOffice and you get a very odd document that seems to set the font differently for different pages, plus you get hard carriage returns stuck in the document.

As for .pdf, I just prefer not to go there. The document size is rather large, to say the least.

With .txt I can strip out the hard carriage returns in Notepad++ before loading the text into LibreOffice, running a spell check, and comparing the text to the original. Then I load the .odt file into Calibre and convert to .epub.
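For anyone who would rather script the carriage-return cleanup than do it by hand, here is a minimal Python sketch. It assumes the OCR text uses blank lines between paragraphs and a single hard return at the end of each printed line; the function name `unwrap_paragraphs` and the rule for rejoining hyphenated words are illustrative assumptions, not anything the CZUR software provides.

```python
import re


def unwrap_paragraphs(text: str) -> str:
    """Join hard-wrapped lines into paragraphs; blank lines separate paragraphs."""
    # Normalise Windows and old Mac line endings to plain "\n" first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    cleaned = []
    for para in re.split(r"\n\s*\n", text):
        lines = [line.strip() for line in para.split("\n") if line.strip()]
        if not lines:
            continue
        joined = lines[0]
        for line in lines[1:]:
            if joined.endswith("-") and line[0].islower():
                # Assumption: a trailing hyphen followed by a lowercase letter
                # is a word split across two printed lines, so rejoin it.
                joined = joined[:-1] + line
            else:
                joined += " " + line
        cleaned.append(joined)
    return "\n\n".join(cleaned) + "\n"


sample = "The quick brown fox jum-\nped over the\r\nlazy dog.\n\nSecond para-\ngraph here.\n"
print(unwrap_paragraphs(sample))
```

The hyphen rule will wrongly merge genuine compounds that happen to break at a line end (for example "well-" / "known"), so the result still needs the comparison against the original pages.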

Any thoughts on improving the process?

Thanks John
Re: CZUR ET24 Pro to epub?

Post by brett »

I've been learning how to digitally translate out-of-print books to ebooks, mostly related to Buckminster Fuller, like his last book Cosmography. Some of the materials I've collected are fragile, which is how I got here: to begin building a scanning rig that is gentle on the spines.

The books I'm working with were printed during the 1930s or later. The layouts are not very complex. They have some math formulas, tables of data, and some images for figures. I want a digital version that will re-flow to whatever screen I'm using and is searchable. I'm not interested in archival quality or perfect reproduction of pages. I want a source manuscript that can be translated to whatever ebook format comes along; ePub3 is the current winner. Because I need to lay out math equations for display in an ebook, I ended up in the LaTeX world to create the manuscripts.

I also have an ET24 Pro on a Mac. In September, CZUR updated their scanner software to support Apple M1 chips. I get the best scanning results from it using the low setting on the main lights with the side lights on, in a dark room. For OCR, I scan side-by-side pages in grayscale or color mode. Auto-enhance seems to introduce extra red color noise around edges in my setup. I put time into "Adjust the central seams" in the manual correction tools to get better page images.

I like to strip content down to data and remove as much layout/formatting as possible to avoid translation errors downstream. Word processors, like Word and WordPerfect, embed layout info that often gets garbled on the way to ePub. Once all the pages are imaged, I export as Word OCR (.docx). To get the cleanest text for the HTML that Calibre generates for ePubs, I run the .docx through pandoc, which extracts the images and translates the text to .tex. LaTeX is usually a step too far, but it all depends on the goal. Printed books are too heavy to ship to Mars, so only ebooks will make the trip and live on.
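The pandoc step described above could be scripted roughly as follows. This is a sketch under assumptions: the file name `cosmography.docx`, the `build` output directory, and the helper names are made up for illustration, and the exact options used in the post are not stated. The flags shown (`--from`, `--to`, `--extract-media`, `--wrap`, `-o`) are standard pandoc options.

```python
import subprocess
from pathlib import Path


def build_pandoc_command(docx_path: str, out_dir: str) -> list[str]:
    """Build a pandoc command that converts a .docx file to LaTeX.

    Hypothetical helper: the paths and layout are illustrative only.
    """
    docx = Path(docx_path)
    out = Path(out_dir)
    return [
        "pandoc",
        str(docx),
        "--from=docx",
        "--to=latex",
        f"--extract-media={out}",  # embedded images are written under this directory
        "--wrap=none",             # keep each paragraph on one line for easier regex work
        "-o",
        str(out / (docx.stem + ".tex")),
    ]


def convert(docx_path: str, out_dir: str) -> None:
    """Run pandoc; requires pandoc to be installed and on the PATH."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_pandoc_command(docx_path, out_dir), check=True)


# Example (not run here): convert("cosmography.docx", "build")
print(" ".join(build_pandoc_command("cosmography.docx", "build")))
```

If LaTeX is a step too far, changing `--to=latex` to `--to=html` or `--to=markdown` (and the output extension to match) gives a plainer intermediate that Calibre can also ingest.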

To help with proofreading, I recommend viewing text in a fixed-width font and using an editor/IDE like VS Code or WebStorm/IntelliJ for text manipulation. To fix the quirks in the output from the OCR engines (ABBYY or Tesseract), I've been learning regular expressions and using their pattern matching to correct scannos (l <-> 1 or m <-> rn) and replace odd special characters.
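As a rough illustration of the scanno corrections mentioned above, here is a small Python sketch. The patterns and the `known_words` set are assumptions chosen for the example: a digit 1 between letters is treated as a lowercase L, a lowercase L between digits is treated as a 1, and "rn" is only swapped for "m" when that turns an unknown word into a known one. A real word list would come from a dictionary file.

```python
import re

# A digit 1 sandwiched between letters is probably a lowercase L ("wor1d").
ONE_IN_WORD = re.compile(r"(?<=[A-Za-z])1(?=[A-Za-z])")
# A lowercase L sandwiched between digits is probably a 1 ("19l4").
L_IN_NUMBER = re.compile(r"(?<=\d)l(?=\d)")
# Any alphabetic word containing "rn" is a candidate for the rn -> m fix.
WORD_WITH_RN = re.compile(r"[A-Za-z]*rn[A-Za-z]*")


def fix_scannos(text: str, known_words: set[str]) -> str:
    """Apply a few conservative OCR fixes; known_words is a lowercase word list."""
    text = ONE_IN_WORD.sub("l", text)
    text = L_IN_NUMBER.sub("1", text)

    def fix_rn(match: re.Match) -> str:
        word = match.group(0)
        candidate = word.replace("rn", "m")
        # Only swap when the original is unknown and the candidate is known,
        # so real words such as "modern" are left alone.
        if word.lower() not in known_words and candidate.lower() in known_words:
            return candidate
        return word

    return WORD_WITH_RN.sub(fix_rn, text)


words = {"world", "many", "modern"}
print(fix_scannos("The wor1d in 19l4 saw rnany modern changes.", words))
```

Guarding the "rn" swap with a word list matters because a blind replace would turn "modern" into "modem"; the same check-before-replace idea works inside an editor by reviewing each regex match rather than using replace-all.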

ePub is really just HTML/CSS, so a simple layout is better for the broadest viewing.

Hope that helps.
