HELP - Scan Tailor Project --> .pdf

Re: HELP - Scan Tailor Project --> .pdf

Post by dingodog » 29 Oct 2010, 11:25

So it is confirmed that even the latest pdf.py does not produce PDFs with a MEDIABOX equal to the REAL size of the image as calculated from its DPI

I saw that jbig2enc.cc
http://github.com/agl/jbig2enc/blob/c70 ... big2enc.cc
http://github.com/agl/jbig2enc

now includes a patch that lets jbig2enc encode the DPI correctly (before it was merged into the main trunk, I applied the DPI fix patch myself), so the problem lies in the program / the way the PDFs are produced from the resulting jbig2 files

it is a matter worth investigating

meanwhile, if you know the physical size of the scanned book, you can apply this workaround

with the Impose tool inside Multivalent
- http://www.ziddu.com/download/1794145/M ... ar.gz.html (old version with tools, newer has only the viewer)
Code:
java -cp /path...to/multivalent.jar tool.pdf.Impose -dim 1x1 -paper widthxheightin file.pdf

in our case, since my scan measures 8.5x11 inches:
Code:
java -cp /path...to/multivalent.jar tool.pdf.Impose -dim 1x1 -paper 8.5x11in file.pdf

this sets the PDF MEDIABOX to the REAL size
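
To check the result, it helps to know what MEDIABOX to expect. PDF user space is measured in points (1/72 inch), so the arithmetic is simple. A minimal sketch, where the helper name `media_box_for_inches` is made up for illustration and is not part of Multivalent or pdf.py:

```python
POINTS_PER_INCH = 72  # PDF user space unit: 1 point = 1/72 inch


def media_box_for_inches(width_in, height_in):
    """Return the /MediaBox value expected for a page of this physical size.

    Hypothetical helper, for illustration only.
    """
    width_pt = width_in * POINTS_PER_INCH
    height_pt = height_in * POINTS_PER_INCH
    return '[ 0 0 %g %g ]' % (width_pt, height_pt)


# An 8.5x11 inch scan should end up with this MediaBox:
print(media_box_for_inches(8.5, 11))  # prints "[ 0 0 612 792 ]"
```

If the MEDIABOX in the output file is much larger than this, the pixel dimensions were probably written as if they were points.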

in my experience, OCR ability is not affected by this little problem: even if the DPI is not set, what matters for good OCR is a CLEAR, BIG (in size) image

Re: HELP - Scan Tailor Project --> .pdf

Post by Misty » 29 Oct 2010, 11:39

That's good to know, thanks! The incorrect value isn't just in /MediaBox. It's also in the scaling matrix inside the q ... Q block of the page's content stream. Both of those need to be right to display the page correctly.

Another workaround is to hardcode the DPI value inside the pdf.py script. In lines 127 and 131, change the width and height values to (width*72)/600 and (height*72)/600

Since most people will always have 600dpi Scan Tailor files, this bit of hardcoding will work despite being an ugly hack. I'd prefer a real solution in the near future though. ;)
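
For reference, here is the arithmetic behind that hack as a standalone sketch. The function names `page_size_in_points` and `page_strings` are made up for illustration and are not part of pdf.py; the two format strings follow the shape of the ones in the script, and the 600 DPI default is the same assumption as above.

```python
def page_size_in_points(width_px, height_px, dpi=600):
    """Convert pixel dimensions to PDF points (1 point = 1/72 inch)."""
    return (width_px * 72.0 / dpi, height_px * 72.0 / dpi)


def page_strings(width_px, height_px, dpi=600):
    """Build the two values that must agree: content stream and MediaBox.

    Hypothetical helper; it only formats strings and does not write a PDF.
    """
    w, h = page_size_in_points(width_px, height_px, dpi)
    # The cm matrix scales the 1x1 image unit square to w x h points.
    contents = 'q %f 0 0 %f 0 0 cm /Im1 Do Q' % (w, h)
    media_box = '[ 0 0 %f %f ]' % (w, h)
    return contents, media_box


# A letter-size page scanned at 600 DPI is 5100x6600 pixels.
contents, media_box = page_strings(5100, 6600)
print(contents)
print(media_box)
```

Passing the real DPI instead of relying on the default would be the cleaner fix, but that requires the script to know the DPI of the input.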
The opinions expressed in this post are my own and do not necessarily represent those of the Canadian Museum for Human Rights.

Re: HELP - Scan Tailor Project --> .pdf

Post by La_Tristesse » 06 Aug 2011, 12:56

Like this?

Code:
# Scale pixel dimensions to PDF points (1/72 inch), assuming 600 DPI
contents = Obj({}, 'q %f 0 0 %f 0 0 cm /Im1 Do Q' % (float(width * 72) / 600, float(height * 72) / 600))
resources = Obj({'ProcSet': '[/PDF /ImageB]',
    'XObject': '<< /Im1 %d 0 R >>' % xobj.id})
# MediaBox must use the same scaled size as the content stream above
page = Obj({'Type': '/Page', 'Parent': '3 0 R',
    'MediaBox': '[ 0 0 %f %f ]' % (float(width * 72) / 600, float(height * 72) / 600),
    'Contents': ref(contents.id),
    'Resources': ref(resources.id)})

Re: HELP - Scan Tailor Project --> .pdf

Post by Misty » 10 Aug 2011, 10:16

Yup, pretty much.

Actually, my post was back in October. Since then, the author of jbig2enc added a hack into pdf.py that causes it to always assume 600 DPI. If you use the version of pdf.py from his Github page, it's done for you.