Displaying paper page as HTML

Moderator: peterZ

allranger
Posts: 5
Joined: 04 Mar 2014, 00:52

Post by allranger »

I am building a scanner to preserve old books. My current plan is to make PDF documents, because in some cases the page itself is as important as the text.
For the future, though, I was wondering whether anyone knows of a tool that could take a page, OCR it, and display it as searchable HTML text.
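
To make the idea concrete, here is a minimal sketch of the "searchable HTML" half, assuming the OCR step has already been done by some other tool and has produced a list of paragraph strings. The function name `page_to_html` and the `ocr-text` class are made up for illustration; nothing here performs OCR.

```python
from html import escape


def page_to_html(paragraphs, image_src, title="Scanned page"):
    """Build one HTML page: the page scan followed by its OCR text.

    `paragraphs` is assumed to be a list of strings that an OCR tool
    has already produced; this function only does the HTML wrapping.
    """
    # Escape the text so quotes, ampersands, and angle brackets from the
    # book cannot break the markup.
    body = "\n".join(f"    <p>{escape(p)}</p>" for p in paragraphs)
    return (
        "<!DOCTYPE html>\n"
        "<html>\n<head>\n"
        '  <meta charset="utf-8">\n'
        f"  <title>{escape(title)}</title>\n"
        "</head>\n<body>\n"
        # Keep the scan visible, since the page itself matters too.
        f'  <img src="{escape(image_src)}" alt="{escape(title)}">\n'
        '  <div class="ocr-text">\n'
        f"{body}\n"
        "  </div>\n"
        "</body>\n</html>\n"
    )
```

Because the text ends up as ordinary `<p>` elements, the browser's own find-in-page works on it, and the image keeps the look of the original page next to it.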
rob
Posts: 773
Joined: 03 Jun 2009, 13:50
E-book readers owned: iRex iLiad, Kindle 2
Number of books owned: 4000
Country: United States
Location: Maryland, United States

Re: Displaying paper page as HTML

Post by rob »

Well, ABBYY FineReader can OCR into a number of formats, including HTML. You can choose "plain" HTML, or "exact" HTML, which does CSS, paragraph spacing, and so on. The exact export sometimes loses formatting (paragraph indentation, in my case), so I'm afraid it's not perfect, but it's close.

Here's the support email I sent. The response I got back was the usual corporate "We're not gonna fix it" bullshit: they said that the ABBYY output is "not going to be a mirror image of the original document", and that "The software only converts the text image to editable text". My email:
Hi,

I'm using FineReader 9.0.0.882.

I have a jpg file of a single page from a fiction book. FineReader performs well in recognition, and very well in formatting the document, but it looks like sometimes the formatting of the recognized document in the Text window is not exported properly. In other words, although the Text window shows the correct formatting, the exported document is not formatted properly. There is no limiting factor in the export format, which is why I think there could be a bug in FineReader export. Could you please investigate the issue? I love FineReader, I think it's better than any other OCR package out there, but this export issue makes me cry :)

I'm using exact HTML export with Full CSS. Most of the time the export is correct, but sometimes, annoyingly, the export is incorrect. There doesn't seem to be anything about the HTML export that would prevent the format from being correct.

I have included the original jpg image, along with the exported HTML and PDF output. I draw your attention to these paragraphs:

(indented properly) "Eighty-five is the best I can do."
(indented properly) "Okay, I'll talk him into eighty-five. But just for you. I wouldn't do it for anybody else."
(NOT indented properly) "You're a sweetheart."

...

(indented properly) "Not earned out yet? Are you sure?"
(indented properly) "Sad but true."
(indented properly) "Hmm. Well, I guess Sheldon can live with a million until the next royalty checks come in. In his tax bracket, it isn't so bad."
(NOT indented properly) "The self-discipline will be good for him."
(NOT indented properly) "But how about making it a two-book deal?"
(rest of page NOT indented properly, except for last paragraph)
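
A check like the one above can be automated rather than done by eye. The sketch below assumes the exported HTML expresses paragraph indentation as an inline `text-indent` style on each `<p>` element; that is an assumption for illustration, not a description of what FineReader actually writes (it may use CSS classes instead). The names `IndentChecker` and `unindented_paragraphs` are hypothetical.

```python
import re
from html.parser import HTMLParser

# Assumption: indentation appears as an inline "text-indent: <number><unit>".
INDENT_RE = re.compile(r"text-indent\s*:\s*(-?\d*\.?\d+)", re.IGNORECASE)


class IndentChecker(HTMLParser):
    """Collect (is_indented, text) for every <p> element in a document."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._indented = False
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            style = dict(attrs).get("style") or ""
            match = INDENT_RE.search(style)
            # Indented only if a positive text-indent value is present.
            self._indented = bool(match) and float(match.group(1)) > 0
            self._in_p = True
            self._text = []

    def handle_data(self, data):
        if self._in_p:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            text = "".join(self._text).strip()
            self.paragraphs.append((self._indented, text))
            self._in_p = False


def unindented_paragraphs(html_text):
    """Return the text of every paragraph lacking a positive text-indent."""
    checker = IndentChecker()
    checker.feed(html_text)
    checker.close()
    return [text for indented, text in checker.paragraphs if not indented]
```

Running `unindented_paragraphs` on the exported file's contents would list the paragraphs to compare against the Text window. If the export uses stylesheet classes rather than inline styles, the check in `handle_starttag` would need to look up the class instead.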

I checked the other exports, and found the following (each with Exact Copy selected):

HTML: Not formatted properly
PDF: Formatted properly
RTF: Not formatted properly
DOC: Not formatted properly
XML: Not formatted properly


Here are my settings:

Document:
    Document languages: English
    Document print type: Autodetect
    (all other options not selected)

Scan/Open:
    Automatically read acquired page images
    Image Processing
        X Correct image skew
        X Detect page orientation
        (all other options not selected)

Read:
    Thorough reading
    Table processing
        (all options not selected)
    Training
        Do not use user patterns

Save:
    HTML:
        Retain Layout: Exact copy
        Save mode: Full (use CSS)
        Text Settings:
            X Use solid line as page break
            X Keep headings and footers
            (all other options not selected)
        Picture Settings: Medium (for screen)
        Character encoding:
            Code page: (Automatic)
            Code page type: Windows


Thanks,

--Rob
Attachment: page141.jpg (1 MiB)
The Singularity is Near. ~ http://halfbakedmaker.org ~ Follow me as I build the world's first all-mechanical steam-powered computer.