
ABBYY 12 - why are end-of-line dashes an odd character?

glenleslie
Posts: 17
Joined: 13 Aug 2012, 09:08
E-book readers owned: Kindle - multiple platforms
Number of books owned: 1000
Country: United States

ABBYY 12 - why are end-of-line dashes an odd character?

Post by glenleslie »

I've noted over many scans that dashes at the end of lines turn into an odd character

¬


Character code 172 (¬, U+00AC NOT SIGN) is substituted for the dashes. Is there a way to tell ABBYY to always use a specific character in place of what it thinks it found?

Obviously it's sort of a perfectionist problem. The only time you see this character is if you use a PDF reader to reflow the text or if you export the project to a text format. Wondered if someone knew a quick way to address this.
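For the text-export case, one workaround outside FineReader is a small post-processing script. This is only a sketch: it assumes the exported text really contains character 172 (U+00AC) at those positions, and `replace_marker` is a hypothetical helper name, not an ABBYY feature.

```python
# Hypothetical clean-up helper for exported OCR text; not part of FineReader.
NOT_SIGN = "\u00ac"  # character 172, displayed as ¬


def replace_marker(text: str, replacement: str = "-") -> str:
    """Replace every ¬ (U+00AC) in the text with the chosen replacement."""
    return text.replace(NOT_SIGN, replacement)


if __name__ == "__main__":
    sample = "a perfec\u00ac\ntionist problem"
    # Default: show an ordinary hyphen where the marker was.
    print(replace_marker(sample))
    # Alternative: drop the marker and leave the line break alone.
    print(replace_marker(sample, ""))
```

Passing a different `replacement` lets you pick whichever character you prefer.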

Attachment: orwell_review_ocr_issue.jpg (273.4 KiB)
cday
Posts: 310
Joined: 19 Mar 2013, 14:55
Number of books owned: 0
Country: UK

Re: ABBYY 12 - why are end-of-line dashes an odd character?

Post by cday »

I vaguely recognise the problem but I spend my time in Linux now (since Windows 10... ;)) and rarely have a need to fire up my Windows 7 computer with FineReader on it.

I fully understand perfectionism! I think the symbol is a specialised typographic symbol, probably indicating that the text should be a single word without a space if it is re-flowed to fit between different margins. I haven't been able to find a name for it, though.

I presume those symbols are present in the final output file when it is viewed?

Does FineReader possibly have a 'find and replace' facility you could use to remove them? I would have to dig out the PDF guides to find the answer to that. Or are there some configuration options?

One would think that your issue must be a common one, as it is not at all evident why any normal user would want those symbols, so possibly there is an answer online, or maybe ABBYY has a forum of some kind.
BruceG
Posts: 72
Joined: 14 May 2014, 23:17
Number of books owned: 500
Country: Australia

Re: ABBYY 12 - why are end-of-line dashes an odd character?

Post by BruceG »

This is something I also have noticed. It is a symbol that I have not been able to reproduce.

I just looked at two recently scanned documents. In a newspaper from the 1940s the hyphen was recognized as in your document. In a magazine from 2000 the hyphen was recognized as a plain -. What difference does this make to the output file? I often save magazines with text on top of the image as well as text beneath the image (as the text is edited). Newspapers are only saved with text under the image (as the text is not edited - I only have one life).

Both were saved with text on top to check what happened. A paragraph from each of the ABBYY files was copied and pasted into Word, to see whether it was the same as or different from the PDF output.
What I found was that both symbols in ABBYY produced a - hyphen in the output PDF and when copied and pasted into Word. When the formatting changed in Word, the hyphen disappeared if it was no longer at the end of a line.

One problem I would like to find an answer to is how to fix margins that are not straight.
BillGill
Posts: 128
Joined: 18 Dec 2016, 17:13
E-book readers owned: Calibre, FBReader
Number of books owned: 7000
Country: USA

Re: ABBYY 12 - why are end-of-line dashes an odd character?

Post by BillGill »

I have FineReader 14. I have it output the text file to Word 16 and those marks show up, but only if I have Word set to show hidden characters: carriage returns, page breaks, etc. When I am through proofing the text I convert it to the EPUB format. At that point those characters, whatever they are, disappear, so they aren't a problem for me.

Bill
BillGill
Posts: 128
Joined: 18 Dec 2016, 17:13
E-book readers owned: Calibre, FBReader
Number of books owned: 7000
Country: USA

Re: ABBYY 12 - why are end-of-line dashes an odd character?

Post by BillGill »

Now I think of what I should have added.

My biggest problem is that em dashes are detected as simple dashes. I have to watch for those all the way through.

Bill
L.Willms
Posts: 134
Joined: 21 Sep 2016, 10:51
E-book readers owned: Tolino Shine
Country: Germany
Location: Frankfurt/Main, Germany

Re: ABBYY 12 - why are end-of-line dashes an odd character?

Post by L.Willms »

These are indicators for possible in-word line breaks, syllabification.

Writing in Libre or Open Office Writer, you would enter such syllable breaks by CTRL- (CTRL and dash)

Same in MS Word.

FineReader inserts this special character when it detects a dash at line end, indicating that the last word on that line and the first word of the following line actually are one single word, with a line break at the grammatically correct place between two syllables.
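If you work with a plain-text export and want each such word back in one piece, the marker and the line break after it can be removed together. A minimal sketch, assuming the export keeps ¬ (U+00AC) directly before the line break; `join_split_words` is a made-up name for illustration:

```python
import re

MARKER = "\u00ac"  # the ¬ shown at line ends in the OCR text


def join_split_words(text: str) -> str:
    """Delete a marker at line end plus the following line break,
    so the two halves of the word are joined again."""
    return re.sub(MARKER + r"[ \t]*\r?\n[ \t]*", "", text)


if __name__ == "__main__":
    print(join_split_words("syllabi\u00ac\nfication of words"))
```

Real hyphens, as in "mother-in-law", are left untouched because only the marker character is matched.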
cday
Posts: 310
Joined: 19 Mar 2013, 14:55
Number of books owned: 0
Country: UK

Re: ABBYY 12 - why are end-of-line dashes an odd character?

Post by cday »

L.Willms wrote: 13 Sep 2021, 05:21 FineReader inserts this special character when it detects a dash at line end, indicating that the last word on that line and the first word of the following line actually are one single word, with a line break at the grammatically correct place between two syllables.
But is there no option to display a standard hyphen as in the 'Original' image on the left above?

Is it possible to use Find and Replace to substitute a standard hyphen?

If not, that seems surprising in such a sophisticated product.
L.Willms
Posts: 134
Joined: 21 Sep 2016, 10:51
E-book readers owned: Tolino Shine
Country: Germany
Location: Frankfurt/Main, Germany

Re: ABBYY 12 - why are end-of-line dashes an odd character?

Post by L.Willms »

cday wrote: 13 Sep 2021, 08:20
L.Willms wrote: 13 Sep 2021, 05:21 FineReader inserts this special character when it detects a dash at line end, indicating that the last word on that line and the first word of the following line actually are one single word, with a line break at the grammatically correct place between two syllables.
But is there no option to display a standard hyphen as in the 'Original' image on the left above?
Why would you want to have hyphens in the middle of words in your text, like in "hy-phens in the mid-dle of words in your text"?
You want to have the words as one, and not broken in two pieces, right?

ABBYY's FineReader is so friendly as to recognize that the hyphen at the end of a line is not meant to appear in the actual word, when this is in free flow text.

And when you create an identically looking PDF, you will normally use "text behind image", so that the hyphen is shown as part of the image, but not part of the text.
cday
Posts: 310
Joined: 19 Mar 2013, 14:55
Number of books owned: 0
Country: UK

Re: ABBYY 12 - why are end-of-line dashes an odd character?

Post by cday »

L.Willms wrote: 13 Sep 2021, 10:20 Why would you want to have hyphens in the middle of words in your text, like in "hy-phens in the mid-dle of words in your text"?
You want to have the words as one, and not broken in two pieces, right?
My reasoning was that the default preferred output would match the hyphenation in the original text image: if you look at the 'Original' and 'OCR' images in the post above, the two hyphens in the 'Original' image are correct hyphenation of the words concerned at a line break...

ABBYY's FineReader is so friendly as to recognize that the hyphen at the end of a line is not meant to appear in the actual word, when this is in free flow text.
True, but the text on the page is not free flow text due to the line break...

And when you create an identically looking PDF, you will normally use "text behind image", so that the hyphen is shown as part of the image, but not part of the text.
Yes, "text behind image" is the usual option, but again the hyphen in the text on the page is correct hyphenation of the words concerned.

If someone wants the symbol that indicates that the words wouldn't be hyphenated in free flow text, then sure, let them have the option...

If the OCR result is not used for "text behind image" and may be reflowed, in DTP software for example, then that mark, never seen in printed text, could well be appropriate to ensure correct reflow. But given the possible alternative forms of output, and that "text behind image" is surely a common use, shouldn't it be possible to select which form of output is displayed?
L.Willms
Posts: 134
Joined: 21 Sep 2016, 10:51
E-book readers owned: Tolino Shine
Country: Germany
Location: Frankfurt/Main, Germany

Re: ABBYY 12 - why are end-of-line dashes an odd character?

Post by L.Willms »

cday wrote: 13 Sep 2021, 11:22 My reasoning was that the default preferred output would match the hyphenation in the original text image: if you look at the 'Original' and 'OCR' images in the post above, the two hyphens in the 'Original' image are correct hyphenation of the words concerned at a line break...
It seems that you do not understand the difference between appearance and meaning.

FineReader shows on the left the image which the program is trying to interpret. In the image of the text, the grapheme of a dash "-" does not have a meaning: the grapheme does not say in and of itself whether it links several words, as in "mother-in-law" or "first-rate" or "self-pity", or whether it just indicates that the last word at the end of a line and the first word on the following line are to be read as a contiguous sequence of letters, without a dash in between.

FineReader is "intelligent" enough, using its dictionary and spelling correction, to find out the MEANING of dashes, and thus can ascertain whether a dash at line end merely marks a line break inside a word or is to be preserved in the text.

It then encodes a dash at line end as a "conditional line break" and that is what you see in the OCR panel to the right of the program window. This shows the MEANING of the text, not its image. Once you render this text, the text program will either produce the image of a dash, if a line break occurs in that word, or show nothing at this place within a word when no line break occurs there.
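To make that rendering rule concrete, here is a rough sketch of a greedy line wrapper that treats ¬ as a conditional break: the marker becomes "-" only if the line actually breaks there, and otherwise vanishes. This is an illustration under my own assumptions, not how FineReader or any word processor is implemented; `reflow` and `MARKER` are hypothetical names, and overlong words are simply allowed to overflow.

```python
from __future__ import annotations

MARKER = "\u00ac"  # stand-in for the conditional break shown in the OCR panel


def reflow(text: str, width: int) -> list[str]:
    """Greedy word wrap; a marker turns into '-' only where a line breaks."""
    lines: list[str] = []
    current = ""
    for word in text.split():
        parts = word.split(MARKER)
        plain = "".join(parts)  # the word with all markers dropped
        candidate = plain if not current else current + " " + plain
        if len(candidate) <= width:
            current = candidate
            continue
        # The whole word does not fit: try to break at a marker,
        # preferring the longest head that still fits with a hyphen.
        placed = False
        for i in range(len(parts) - 1, 0, -1):
            head = "".join(parts[:i]) + "-"
            trial = head if not current else current + " " + head
            if len(trial) <= width:
                lines.append(trial)
                current = "".join(parts[i:])
                placed = True
                break
        if not placed:
            # No usable break point: move the whole word to a new line.
            if current:
                lines.append(current)
            current = plain
    if current:
        lines.append(current)
    return lines


if __name__ == "__main__":
    text = "the hy\u00acphens in the mid\u00acdle"
    for w in (7, 80):
        print(w, reflow(text, w))
```

With a wide margin no line break falls inside a word, so no hyphen is shown at all; with a narrow one, a hyphen appears only where a word is actually split.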

Got it?