Learning to Create Tiny DJVU files


Re: Learning to Create Tiny DJVU files

Postby RichardT » 06 May 2012, 18:35

Thanks, hopefully as I get my workflow going smoothly, I'll put all the code I'm using out on my website with some explanations.

Last night I made a custom version of jbig2 that makes the output file names resemble the input file names (instead of just counting up from .0000). This is important because not every page is guaranteed to have a jbig2 mask. Book covers are the notable example, and a cover is what I usually use for page 0000.

The way I work so far is: I split my files into background, text (mask), and foreground. If a page has all three, the filenames will be...
  • page.0012.bg.jp2
  • page.0012.tif
  • page.0012.fg.jp2

Then my version of jbig2 finds the .tif file and spits out a 'page.0012.msk' file.

Now my python scripts pull together the components into a PDF file. Soon they will also accept HOCR inputs for the hidden text layer as well.
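
To make the naming scheme concrete, here is a minimal Python sketch of how the per-page components could be collected by file name before assembling a page. It is not the actual script: the names `group_components`, `mask_name`, `PAGE_RE`, and `ROLE` are hypothetical, and the pattern assumes only the four suffixes shown above.

```python
import re
from collections import defaultdict

# Assumed naming scheme, taken from the example file names above:
#   page.0012.bg.jp2, page.0012.tif, page.0012.fg.jp2, page.0012.msk
PAGE_RE = re.compile(r"^page\.(\d{4})\.(bg\.jp2|fg\.jp2|tif|msk)$")

# Hypothetical role labels for each suffix.
ROLE = {
    "bg.jp2": "background",
    "fg.jp2": "foreground",
    "tif": "mask_source",
    "msk": "mask",
}


def group_components(filenames):
    """Map each page number to the components that exist for it.

    Pages may lack some layers (a cover typically has no mask),
    so missing roles are simply absent from the inner dict.
    """
    pages = defaultdict(dict)
    for name in filenames:
        match = PAGE_RE.match(name)
        if match is None:
            continue  # not part of the page naming scheme
        page, suffix = match.groups()
        pages[page][ROLE[suffix]] = name
    return dict(sorted(pages.items()))


def mask_name(tif_name):
    """Derive the mask file name from the .tif name, keeping the page number."""
    return tif_name[: -len(".tif")] + ".msk"
```

Because the page number travels with every file name, a page without a mask (such as page 0000) just yields a smaller dict instead of shifting the numbering of every later page.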

So the only really manual step in the process is the splitting part. For pages with no foreground layer, it's easy to split files from scantailor's "mixed mode" output. But when there's also a foreground component it's a manual process. So a somewhat automatic splitter would be a big deal. I'm reading some papers on the subject to gather ideas. I think I can get a leg up on the issue by forcing my fg layer to a very low dpi. That will eliminate a lot of choices between whether a pixel can be fg or not right away. Anyway, more on that part when I finally get to it. Hopefully by the end of the week I'll have HOCR and outline support done. Neither seems very hard.
RichardT
 
Posts: 12
Joined: 24 Apr 2012, 10:17

Re: Learning to Create Tiny DJVU files

Postby RichardT » 06 May 2012, 18:49

As a side note, I learned to use tesseract for djvu from djvubind and other websites. They generate "box" and "text" outputs, and then try their best to find the correspondence between them. That means doing the OCR twice, and even though I can hardly believe it, the two outputs don't always match. I think djvubind goes so far as to use difflib to try to find matching subsequences! Now that I've seen HOCR, though, it doesn't seem like djvu people should be going to that level of trouble. HOCR (which tesseract can produce natively) has the words split out already with their associated bounding boxes. You have to parse HTML to use it, which is a bummer, but doing the OCR in one pass seems far superior.
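
As a rough illustration of the one-pass approach, the sketch below pulls words and their bounding boxes out of HOCR text using only the Python standard library. It assumes the usual HOCR convention of word elements with class `ocrx_word` and a `title` attribute containing `bbox x0 y0 x1 y1`; the names `HocrWords` and `parse_hocr` are made up for this example, and real tesseract output may need more care.

```python
import re
from html.parser import HTMLParser

# HOCR stores coordinates in the title attribute, e.g.
#   title="bbox 10 20 50 40; x_wconf 90"
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWords(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every ocrx_word element."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bbox of the word currently being read
        self._text = []     # text fragments of that word
        self._depth = 0     # nesting depth of tags inside the word

    def handle_starttag(self, tag, attrs):
        if self._bbox is not None:
            self._depth += 1  # e.g. <strong> or <em> inside a word
            return
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if "ocrx_word" in classes:
            match = BBOX_RE.search(attributes.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []
                self._depth = 0

    def handle_endtag(self, tag):
        if self._bbox is None:
            return
        if self._depth > 0:
            self._depth -= 1
            return
        text = "".join(self._text).strip()
        if text:
            self.words.append((text, self._bbox))
        self._bbox = None

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)


def parse_hocr(hocr_text):
    """Return a list of (word, bbox) pairs from an HOCR document string."""
    parser = HocrWords()
    parser.feed(hocr_text)
    parser.close()
    return parser.words
```

Each word arrives already paired with its box, so there is nothing to align afterwards and no second OCR run.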
RichardT
 
Posts: 12
Joined: 24 Apr 2012, 10:17

Re: Learning to Create Tiny DJVU files

Postby RichardT » 07 May 2012, 11:02

Today's food for thought. One possible advantage of PDF over DJVU is that you can position small images inside of a page, instead of having a page-sized background layer. I was wondering how much this mattered, since a large white area of a page should compress down pretty well.

So, I screengrabbed an image, and saved it once as-is, and once with added whitespace until the image took up only about a quarter of the total area. I converted them both to jp2 at the same encoding rate. The image-only result was much smaller, but at the same time the quality of the image was much worse. I don't know enough about jp2 to understand that fully, but I guess at the same rate, jp2 can push the "errors" over into the whitespace where it doesn't matter, and preserve more fidelity in the interesting part of the image.

To account for that, I re-encoded the image-only version at a higher rate until the two were visually very similar to me. The final image-only result was still about 25% smaller than the image+space version. So, there is some gain to be had from isolating the images (and remembering where to place them on the PDF page, of course!), rather than just saving page-sized images. A little post-processing of mixed-mode images might be in order.
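
A minimal sketch of that post-processing idea, in plain Python: find the bounding box of the non-white content, crop to it, and work out where the cropped image would sit on the PDF page. The page is modeled here as a list of rows of grayscale values (0 to 255), and the function names, the `tolerance` threshold, and that representation are all assumptions for illustration. The only fixed fact used is that default PDF units are 1/72 inch with the origin at the bottom-left of the page.

```python
def content_bbox(pixels, white=255, tolerance=8):
    """Return (left, top, right, bottom) of the non-white area.

    right and bottom are exclusive. Returns None for a blank page.
    A pixel counts as content if it is darker than white - tolerance.
    """
    limit = white - tolerance
    rows = [y for y, row in enumerate(pixels) if any(v < limit for v in row)]
    if not rows:
        return None
    width = len(pixels[0])
    cols = [x for x in range(width) if any(row[x] < limit for row in pixels)]
    return (cols[0], rows[0], cols[-1] + 1, rows[-1] + 1)


def crop(pixels, bbox):
    """Cut the bounding box out of the page-sized image."""
    left, top, right, bottom = bbox
    return [row[left:right] for row in pixels[top:bottom]]


def pdf_placement(bbox, page_height_px, dpi):
    """Convert a pixel bbox to (x, y, width, height) in PDF points.

    Image rows count down from the top, PDF y counts up from the
    bottom, so the vertical coordinate has to be flipped.
    """
    left, top, right, bottom = bbox
    scale = 72.0 / dpi
    return (
        left * scale,
        (page_height_px - bottom) * scale,
        (right - left) * scale,
        (bottom - top) * scale,
    )
```

Only the cropped image would then be encoded, and the placement tuple tells the PDF writer where to draw it on the page.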
RichardT
 
Posts: 12
Joined: 24 Apr 2012, 10:17

Re: Learning to Create Tiny DJVU files

Postby daniel_reetz » 07 May 2012, 11:17

25% is significant. Thank you for making these notes here. DJVU is not hugely popular in the US, but worldwide (and especially in Russia) it is the de facto standard. It's hard to find good, technical, English-language info on optimizing DJVU.
daniel_reetz
 
Posts: 2490
Joined: 03 Jun 2009, 13:56

Re: Learning to Create Tiny DJVU files

Postby StMichel » 16 Oct 2012, 00:03

RichardT wrote: Thanks, hopefully as I get my workflow going smoothly, I'll put all the code I'm using out on my website with some explanations.

Hi,

I find your workflow very interesting and was thinking of approaching post-processing in a very similar manner. Have you already posted details of the workflow, and if so, could you provide a link to it?

I think I can manage everything up to splitting the files into different layers with convert: the text layer from black pixels, the background layer from everything else, and the foreground manually (that's for masking colours in the text layer, right?). Putting the layers together seems non-trivial to me and I don't know where to start looking: googling "create layered pdf" pointed primarily to Adobe-specific webpages. Can you give me a pointer to where I should look for more information about that? The HOCR and embedded outlines are also very interesting, so if you can tell me where you ended up with those experiments, I would be delighted.

And by the way, thanks for putting up your notes on djvu creation; your write-up on creating the mask layer, for example, is the best tutorial on the subject that I have seen.
StMichel
 
Posts: 1
Joined: 15 Oct 2012, 14:39
