Need fast help to TiF -> Searchtable PDF (OCR) SWEDISH

Convert page images into searchable text. Talk about software, techniques, and new developments here.

Postby Pelle » 08 May 2012, 11:26

As stated in the topic, I have big problems: after editing the pictures in Scan Tailor, I want to make all the text on the pages searchable in a PDF and also make the background white (the book's background is slightly yellow).

This is a simple question about which app is best for Swedish words. What apps and (free) dictionaries do you use for OCR of Swedish-language books?

Very fast answers are appreciated because I have a university exam very soon, and I want to be able to read lots and lots on my pad instead of always carrying x kg of books...

Best regards, :oops:
Pelle
 
Posts: 13
Joined: 12 Apr 2012, 13:40

Re: Need fast help to TiF -> Searchtable PDF (OCR) SWEDISH

Postby Heelgrasper » 08 May 2012, 11:52

I've used Tesseract and PDFbeads (via Homer, http://bookscanner.pbworks.com/w/page/45602013/How%20to%20install%20the%20software) on Danish texts with what look like good results, so that might work for Swedish too. OCR quality doesn't matter for reading; it only affects how likely a search is to find all the right passages, and how well copy-paste from the book works. As such, it's just an added feature compared to the printed book.
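A minimal sketch of a related route, not the Homer/PDFbeads setup described above: newer Tesseract builds can write a searchable PDF directly when given the `pdf` output config. This assumes a Tesseract version that supports `pdf` output and has the Swedish language data (`swe`) installed; the file names and function names are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="swe"):
    """Return the argument list for a Tesseract run that writes <output_base>.pdf."""
    # Form: tesseract <image> <output base> -l <language> pdf
    return ["tesseract", str(image_path), str(output_base), "-l", lang, "pdf"]


def ocr_to_searchable_pdf(image_path, output_base, lang="swe"):
    """Run Tesseract if it is installed and return the expected PDF path."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    return Path(f"{output_base}.pdf")
```

For Danish, the language code would be `dan` instead of `swe`. Calling `ocr_to_searchable_pdf("page_001.tif", "page_001")` would, under the assumptions above, produce `page_001.pdf` with a hidden text layer over the page image.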

Making the background white would (as far as I know) be something to do in ScanTailor by setting the output to bitonal (b/w). Or mixed if there are illustrations.
---
Jakob Øhlenschlæger
Randers, Denmark

The past is a foreign country: they do things differently there
L. P. Hartley
Heelgrasper
 
Posts: 70
Joined: 19 Feb 2012, 21:04
Location: Randers, Denmark

Re: Need fast help to TiF -> Searchtable PDF (OCR) SWEDISH

Postby Pelle » 08 May 2012, 11:58

I just did a quick try with Tesseract and the VietOCR GUI, with both Swedish and Swedish "Fraktur" (whatever Fraktur is, I don't know...). Anyway, it didn't go very well. Maybe it is because the page is a little bit tilted, do you think?

I found the thing in Scan Tailor you mentioned (to get the background more whitish) and it worked fine. It was called "Equalize illumination" and was found on the last processing step (where you are about to create the actual TIF files), so thanks a lot for that.

If anyone has a good (great) app for getting these Swedish medical books correctly OCR'd, PLEASE write here.

Best regards,
Pelle
 
Posts: 13
Joined: 12 Apr 2012, 13:40

Re: Need fast help to TiF -> Searchtable PDF (OCR) SWEDISH

Postby daniel_reetz » 08 May 2012, 12:15

Hi pelle, I understand English may be your second language, so communication might be a little tough. Generally we don't care about perfect English here but a few things will help you get a better answer.

1. Don't rush people - everyone here already helps as fast as they can.
2. Please post an example page and a clear description of your problem. For example you say the page is a little bit tilted - well, we can't see that -- so we can't help.
3. Please post about your computer. Windows, Linux, Mac? What have you tried that DID work? Also tell us what you've tried that DIDN'T work, so we can figure out what went wrong.

Heelgrasper is very knowledgeable and very well practiced with difficult texts so please heed his advice and spend some time with that software and that approach.

Thanks,
The Management ;)
daniel_reetz
 
Posts: 2485
Joined: 03 Jun 2009, 13:56

Re: Need fast help to TiF -> Searchtable PDF (OCR) SWEDISH

Postby Pelle » 08 May 2012, 16:02

Hello Daniel.

I'm sorry for the rush. I didn't mean to be disrespectful in any way. I just want to get past this last exam and start my summer job without having to redo the exam in August :p
I have now uploaded two versions of the same picture with TinyPic for you to look at: one is called "whitish.tif" and the other, "darker" one is called "normal.tif". As mentioned above, I edited it as Heelgrasper said (to get one of them whiter).

With Tesseract and VietOCR3 (a GUI for Tesseract on Windows) I get this when I run OCR on...

whitish.tif:

[img]http://i47.tinypic.com/op1r0y.jpg[/img]

wuswm
rama 1
um.-iw v
»eu nn-|»=a.|«-mlhvp-I-Ann.-||;«||" ~ '
mm» uum=.1=| 1:
v..a.....\.».=mw u
L.|....Mm~m u
|;=.,.1.....,,;.,,... M
M....,.m......,-.« .»
mmm man-mm. 11
m.N.,,....,.fi...,,. <1
o1.».w,,...w...1m.N... .-
o1.»<.«.m..<.1.,..,.1.,...u... W
o,«.“...,...|..,.d|.,.; ..
mm »,.«|.~........m. W
»<mm> uvgm-1=|mmfing 1:
ß<,..,,,,. r..,m....|:..|>|.|.....,.1.m«.,...,1,., L.
A.1.m..§«m,.v|.|.....,.«1| 1.
mm . ms_«.nm»»«»<|...1-p.»4.m..u U

-------------------------------------------------------------------------------------------------------------------

And this on normal.tif:

[img]http://i49.tinypic.com/1jkzlt.jpg[/img]

|NNEHALL
Förord 7
Inledning 9
KAPITELI Läkemedel 13
vadärenläkemedelï 1;
Laxrrrrrrarlmrrrrr rr
ßeredmrrgrform rr
Adminiszmiornsärr 14
KAMTEL2 Ordination 11
Behërigz an ordinera 17
o1i1r=ryp=r av rrraanrrmrr ra
olika former för ordinazion rg
orairramnshflnalmg 10
Hur en ordirmirm slrrivr 10
KAPHEL3 Läkemedelshantering 23
Begrepp 1 far|ra11.rra= rm lalrrrrrmlslranrrrarrrr 13
Aamarrisrrermg =v1ä|r=rrr=a.| 1,,
KAPITEL 4 FASS - en handbok och en uppslagsverk 27
. mrrrrmrr om rrwrrrrrrrrrmur
J


If you compare it to the picture, it isn't even close to the words in the book/picture. And that is where my problem is.
And yes, you can probably say that my second language is English. I hope there aren't too many misspellings ;(

Best regards,
Pelle
 
Posts: 13
Joined: 12 Apr 2012, 13:40

Re: Need fast help to TiF -> Searchtable PDF (OCR) SWEDISH

Postby daniel_reetz » 08 May 2012, 16:06

No worries. ;) Especially about spelling ;)

It almost seems like the original pictures might be too low-resolution for the OCR engine. How big are the original images in pixels? Can you post a section of a page at the original size?
daniel_reetz
 
Posts: 2485
Joined: 03 Jun 2009, 13:56

Re: Need fast help to TiF -> Searchtable PDF (OCR) SWEDISH

Postby Pelle » 09 May 2012, 07:09

The original picture was 3888x2592 px, but when I took the pictures I had about 10-12 cm of margin "left over". That is, the part of the picture where the pages are is 1176x1764 px (when cropped so that only the page(s) are seen and not my floor).
It was taken with an EOS 400D, so I have the possibility to take all pictures in RAW mode, but that seemed unnecessarily big.

Maybe I should just zoom in a bit more, so the pages are like 3888x2592 instead of 1176x1764 when fixed. Also, I remember that Scan Tailor asked me to crop the files, so I wrote 1200x1200 in the Scan Tailor box. Maybe take the picture at 3888x2592 as I (you) said and then enter 2000x2000 in Scan Tailor instead of 1200x1200?

;| :)
Pelle
 
Posts: 13
Joined: 12 Apr 2012, 13:40

Re: Need fast help to TiF -> Searchtable PDF (OCR) SWEDISH

Postby daniel_reetz » 09 May 2012, 09:59

That's exactly right. Fill the image with the page so all your pixels represent the book. Zoom in on that thing! Your OCR will improve VERY quickly. Try it with just one page, you'll see!
daniel_reetz
 
Posts: 2485
Joined: 03 Jun 2009, 13:56

Re: Need fast help to TiF -> Searchtable PDF (OCR) SWEDISH

Postby abmartin » 09 May 2012, 11:41

As Dan says, the better the picture, the better your results will be.

When doing OCR, I find a significant increase in quality if I get over the 300 dpi line. I too use scantailor before doing OCR.
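To see why filling the frame matters for the 300 dpi line, here is a rough sketch of the arithmetic. The pixel counts are the ones Pelle posted; the 24 cm page height is purely an assumption for illustration, since the real book size isn't stated in the thread.

```python
CM_PER_INCH = 2.54


def effective_dpi(pixels, length_cm):
    """Pixels per inch when `pixels` span a real-world length of `length_cm`."""
    return pixels / (length_cm / CM_PER_INCH)


PAGE_HEIGHT_CM = 24.0  # assumed page height, not given in the thread

# Cropped page as posted: 1764 px along the page height.
cropped_dpi = effective_dpi(1764, PAGE_HEIGHT_CM)

# Page filling the long side of the 3888x2592 frame.
full_frame_dpi = effective_dpi(3888, PAGE_HEIGHT_CM)

print(f"cropped: {cropped_dpi:.0f} dpi, full frame: {full_frame_dpi:.0f} dpi")
```

Under that assumed page size, the cropped image lands well below 300 dpi, while the full frame lands above it.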

One thought: you mention that you were manually entering a value of 1200x1200. If you are doing that at the beginning of the entire process, that is definitely not correct. At the beginning, ScanTailor asks for the number of dots per inch (1 inch = 2.54 cm). To correctly determine the DPI of an image, I like to take a photo with a ruler in it. I can then measure that ruler with GIMP's measuring tool (I expect most image editing software has this capability). If you did enter 1200x1200 at the beginning of the process, you told ScanTailor that the image was less than two inches (~5 cm) wide. That information gets encoded in the final image. Tesseract might then be very confused by the size of the image, trying to read text less than a mm in height.
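The ruler method and the "less than two inches" point can both be sketched in a few lines. The function names are made up for illustration, and the 1000 px over 10 cm ruler measurement is an invented example, not a number from this thread.

```python
CM_PER_INCH = 2.54


def dpi_from_ruler(pixel_distance, ruler_cm):
    """DPI of a photo, given how many pixels a known ruler length covers."""
    return pixel_distance * CM_PER_INCH / ruler_cm


def implied_size_cm(pixels, declared_dpi):
    """Physical size that software will assume for `pixels` at `declared_dpi`."""
    return pixels / declared_dpi * CM_PER_INCH


# Hypothetical measurement: 10 cm of ruler spans 1000 px in the photo.
measured_dpi = dpi_from_ruler(1000, 10.0)

# The 1176x1764 px page declared as 1200 dpi.
width_cm = implied_size_cm(1176, 1200)
height_cm = implied_size_cm(1764, 1200)

print(f"measured: {measured_dpi:.0f} dpi")
print(f"implied page size at 1200 dpi: {width_cm:.1f} x {height_cm:.1f} cm")
```

With those inputs, the declared 1200 dpi makes the whole page come out only a few centimeters across, which is why the text looks microscopic to the OCR engine.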

If you entered that number at the end of the process for the output DPI, I find that unnecessary too. Doing that is asking Scantailor to create pixels. It does make the images look smoother on a screen, but I find that 300 or 600 final DPI gets a better result with Tesseract.


Responding to an earlier question, Tesseract's Swedish Fraktur isn't going to be helpful on that image. Fraktur is an old style of typeface that died out in Scandinavia by the early 20th century (the Germans held on a bit longer). The standard Swedish language is what you will want to use, since the book is in a Roman typeface. https://sv.wikipedia.org/wiki/Frakturstil
abmartin
 
Posts: 41
Joined: 15 Sep 2010, 15:33
Location: Ohio

Re: Need fast help to TiF -> Searchtable PDF (OCR) SWEDISH

Postby Pelle » 09 May 2012, 13:42

Abmartin, let me just say three words: You are wonderful! :p

It worked much better (not 100%, but at least 60-70%) when I didn't enter 1200x1200 at the beginning when ScanTailor asked me to input the size. Instead I just chose 600 dpi, and I made the page whitish at the end before saving the files.

I don't understand that thing with inches, not even a bit. I don't know any other Scandinavian who knows inches either; we always use millimeters, centimeters, decimeters and meters. ( 10 mm = 1 cm | 10 cm = 1 dm | 10 dm = 1 m ) ^^
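For anyone who thinks in metric, a DPI value can be restated as pixels per centimeter. A minimal sketch; the function name is made up for illustration, and 300 and 600 are simply the DPI values mentioned in this thread.

```python
CM_PER_INCH = 2.54


def pixels_per_cm(dpi):
    """Convert dots per inch to dots (pixels) per centimeter."""
    return dpi / CM_PER_INCH


for dpi in (300, 600):
    print(f"{dpi} dpi is about {pixels_per_cm(dpi):.0f} pixels per cm")
```

So 600 dpi means each centimeter of the page is covered by a bit more than 236 pixels.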

Maybe this one should be made sticky for Scandinavians? :p


Again, thanks a lot, Heelgrasper, daniel_reetz and abmartin!

:mrgreen: :D

btw I was looking for a "solved" button but can't find one ;(
Pelle
 
Posts: 13
Joined: 12 Apr 2012, 13:40
