Learning to Create Tiny DJVU files

Post by **RichardT** » 06 May 2012, 18:35

Thanks, hopefully as I get my workflow going smoothly, I'll put all the code I'm using out on my website with some explanations.

Last night I made a custom version of jbig2 that makes the output file names resemble the input file names (instead of just counting up from .0000). This is important because not every page is guaranteed to have a jbig2 mask. Most book covers, which are what I usually use for page 0000, notably don't have one.

The way I work so far is: I split my files into background, text (mask), and foreground. If a page has all three, the filenames will be...

page.0012.bg.jp2
page.0012.tif
page.0012.fg.jp2

Then my version of jbig2 finds the .tif file and spits out a 'page.0012.msk' file.
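The naming scheme above makes it straightforward to work out which layers exist for a given page. What follows is a minimal Python sketch, assuming exactly the suffixes listed above. The names `mask_name` and `group_layers` are hypothetical helpers for illustration; they are not part of jbig2 or of the scripts described in this thread.

```python
import re
from collections import defaultdict

# Assumed layout: page.NNNN.bg.jp2 / page.NNNN.tif / page.NNNN.fg.jp2 / page.NNNN.msk
LAYER_RE = re.compile(r"^(?P<page>.+\.\d{4})\.(?P<kind>bg\.jp2|fg\.jp2|tif|msk)$")
KIND_TO_LAYER = {"bg.jp2": "background", "fg.jp2": "foreground",
                 "tif": "text", "msk": "mask"}


def mask_name(tif_name):
    """Derive the mask file name from the .tif name (page.0012.tif -> page.0012.msk)."""
    if not tif_name.endswith(".tif"):
        raise ValueError("expected a .tif file: " + tif_name)
    return tif_name[:-len(".tif")] + ".msk"


def group_layers(filenames):
    """Map each page id to the layers that actually exist for it."""
    pages = defaultdict(dict)
    for name in filenames:
        match = LAYER_RE.match(name)
        if match:  # files that do not follow the scheme are ignored
            layer = KIND_TO_LAYER[match.group("kind")]
            pages[match.group("page")][layer] = name
    return dict(pages)


files = ["page.0000.bg.jp2", "page.0012.bg.jp2", "page.0012.tif", "page.0012.fg.jp2"]
layers = group_layers(files)
print(layers["page.0000"])          # a cover page with only a background layer
print(mask_name("page.0012.tif"))   # page.0012.msk
```

Keying everything on the page id means a page without a mask (such as a cover) simply has fewer entries, instead of shifting the numbering of every page after it.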

Now my Python scripts pull the components together into a PDF file. Soon they will also accept HOCR inputs for the hidden text layer.

So the only really manual step in the process is the splitting part. For pages with no foreground layer, it's easy to split files from Scan Tailor's "mixed mode" output. But when there's also a foreground component, it's a manual process, so a somewhat automatic splitter would be a big deal. I'm reading some papers on the subject to gather ideas. I think I can get a leg up on the issue by forcing my fg layer to a very low dpi. For a lot of pixels, that will settle right away whether they can be fg or not. Anyway, more on that part when I finally get to it. Hopefully by the end of the week I'll have HOCR and outline support done. Neither seems very hard.

Post by **RichardT** » 06 May 2012, 18:49

As a side note, I learned to use tesseract for djvu from djvubind and other websites. They generate "box" and "text" outputs, and then try their best to find the correspondence between them. That means doing the OCR twice, and even though I can hardly believe it, the two outputs don't always match. I think djvubind goes so far as to use difflib to try to find matching subsequences! Now that I've seen HOCR, though, it doesn't seem like djvu people should be going to that level of trouble. HOCR (which tesseract can produce natively) has the words split out already with their associated bounding boxes. You do have to parse HTML to use it, which is a bummer, but doing the OCR in one pass seems far superior.
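To give an idea of how little parsing is involved, here is a hedged Python sketch that pulls words and bounding boxes out of HOCR using only the standard library. It assumes words are marked up as `span` elements with class `ocrx_word` and a `title` attribute containing `bbox x0 y0 x1 y1`. The class name `HocrWords`, the function `parse_hocr_words` and the sample markup are made up for illustration.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWords(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bbox of the word span we are inside, if any
        self._depth = 0     # depth of tags nested inside that span (e.g. <em>)
        self._text = []

    def handle_starttag(self, tag, attrs):
        if self._bbox is not None:
            self._depth += 1
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(g) for g in match.groups())
                self._depth = 0
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._bbox is None:
            return
        if self._depth > 0:
            self._depth -= 1
            return
        # Closing tag of the word span itself: store the word.
        self.words.append(("".join(self._text).strip(), self._bbox))
        self._bbox = None


def parse_hocr_words(hocr_text):
    parser = HocrWords()
    parser.feed(hocr_text)
    return parser.words


sample = """<span class='ocr_line' title='bbox 36 92 173 116'>
<span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 90'>Hello</span>
<span class='ocrx_word' title='bbox 109 92 173 116; x_wconf 88'><em>world</em></span></span>"""
for text, bbox in parse_hocr_words(sample):
    print(text, bbox)
```

Each word arrives with its box in a single pass, so there is nothing to match up afterwards.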

Post by **RichardT** » 07 May 2012, 11:02

Today's food for thought. One possible advantage of PDF over DJVU is that you can position small images inside of a page, instead of having a page-sized background layer. I was wondering how much this mattered, since a large white area of a page should compress down pretty well.

So, I screengrabbed an image and saved it twice: once as-is, and once with added whitespace, so that the image took up only about a quarter of the total area. I converted them both to jp2 at the same encoding rate. The image-only result was much, much smaller, but at the same time the quality of the image was much worse. I don't know enough about jp2 to understand that fully, but I guess at the same rate, jp2 can push the "errors" over into the whitespace where it doesn't matter, and preserve more fidelity in the interesting part of the image.

To account for that, I re-encoded the image-only version at a higher rate until the two were visually very similar to me. The final image-only result was still about 25% smaller than the image+space version. So, there is some gain to be had from isolating the images (and remembering where to place them on the PDF page, of course!), rather than just saving page-sized images. A little post-processing of mixed-mode images might be in order.
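Isolating an image and remembering where it goes comes down to finding the bounding box of the non-white pixels. Here is a minimal sketch in pure Python on a grid of grayscale values. The names `content_bbox` and `crop` are hypothetical, and a real scan would need a lower `white` cutoff to tolerate near-white noise.

```python
def content_bbox(pixels, white=255):
    """Return (left, top, right, bottom) of the non-white area, or None.

    `pixels` is a list of rows of grayscale values; right and bottom are exclusive.
    """
    rows = [y for y, row in enumerate(pixels) if any(v < white for v in row)]
    if not rows:
        return None  # the page is entirely white
    cols = [x for x in range(len(pixels[0]))
            if any(row[x] < white for row in pixels)]
    return (cols[0], rows[0], cols[-1] + 1, rows[-1] + 1)


def crop(pixels, bbox):
    """Cut the bounding box out of the page."""
    left, top, right, bottom = bbox
    return [row[left:right] for row in pixels[top:bottom]]


page = [
    [255, 255, 255, 255, 255],
    [255, 255,  10, 200, 255],
    [255, 255,  90, 255, 255],
    [255, 255, 255, 255, 255],
]
bbox = content_bbox(page)
print(bbox)              # where to place the image on the page
print(crop(page, bbox))  # the small image that actually gets encoded
```

The box gives the placement on the PDF page, and only the cropped region has to be encoded.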

Post by **daniel_reetz** » 07 May 2012, 11:17

25% is significant. Thank you for making these notes here. DJVU is not hugely popular in the US, but worldwide (and especially in Russia) it is the de facto standard. It's hard to find good, technical, English-language info on optimizing DJVU.

Post by **StMichel** » 16 Oct 2012, 00:03

RichardT wrote: Thanks, hopefully as I get my workflow going smoothly, I'll put all the code I'm using out on my website with some explanations.

Hi,

I find your workflow very interesting and was thinking of approaching post-processing in a very similar manner. Have you already posted details of the workflow, and if so, could you provide a link to it?

I think I can manage up to the splitting of files into different layers with convert: text layer from black pixels, background layer from everything else, and foreground manually (that's for masking colours in the text layer, right?). Putting the layers together seems non-trivial to me and I don't know where to start looking: googling "create layered pdf" pointed primarily to Adobe-specific webpages. Can you give me a pointer to where I should look for more information about that? The HOCR and embedded outlines are also very interesting, so if you can tell me where you ended up with those experiments, I will be delighted.

And by the way, thanks for putting up your notes on djvu creation; your write-up on mask layer creation, for example, is the best tutorial on the matter that I have seen.

Post by **kempelen** » 26 Feb 2014, 19:36

Thanks for sharing that Richard.

It was interesting to see how you turned away from DjVu to PDF.

What are others' opinions on DjVu? Is it still worth spending time on? I like the format very much, and the free tools available. I think there are good reader tools on mobile OSes too.

I saw Daniel wrote it's rare in the US. Here in Hungary, some organizations use it (e.g. the "national electronic library"), but they always accompany the file with other formats. I'm posting in this old topic because it's a very good example case for DjVu.

Post by **RichardT** » 10 Mar 2014, 13:36

kempelen wrote: It was interesting to see how you turned away from DjVu to PDF.

Sorry I haven't been back here for quite some time. After some trial and error manually making PDFs as I showed in my posts, I finally decided life was too short and bought Acrobat. I then destructively scanned all my books over the next year and a half into ClearScan PDFs. I am OK with the results.

BUT, I have to say, I still like DjVu the best in theory. So recently I tried to switch back. But it is frustrating.

I took the time to get gsdjvu to build on modern Ghostscript, so I can use its fg/bg layer splitting. But then I realized that I can't share JB2 dictionaries that way. csepdjvu doesn't do that; only the proprietary msepdjvu does. And is there any way to purchase msepdjvu? If so, I can't tell.

So, after re-investigating... it seems to me that the community really needs to enhance csepdjvu to create shared JB2 dictionaries or to enhance minidjvu to work with gsdjvu output. Otherwise if you stick with free tools you will be compromising size and quality.

Post by **RichardT** » 10 Mar 2014, 15:54

Incidentally, while I'm here, recently I've taken quite a few bad scans with tons of jpeg artifacts and made them pretty compressible with a command like this:

Code:

 convert.exe example.png -white-threshold 85% -ordered-dither o2x2,2 -morphology Erode:1 Disk:1 +dither -remap map.gif out/example.png

The problem was, the text was so "soft" that tools like djvudigital put everything in the background on most pages. The screenshot below shows the problem (left) and solution (right):

[Attachment: closeup.PNG, a close-up of the text before (left) and after (right) processing]

Let's break it down:

-white-threshold 85%: Gets rid of most of the pale gray speckles, but it also gets rid of a fair number of pixels that were previously making up our letters. We'll fix that later. For now, we're happy to have a pretty white background.
-ordered-dither o2x2,2: Cuts down on the levels of gray, and replaces them with a tight, ordered dithering pattern. Now, in a super close-up like the one above, the ordered dithering looks bad compared to ImageMagick's default dither. But I believe an ordered dither will compress better, and possibly help JB2 match characters against one another better. And at normal viewing resolution you can't really see it anyway.
-morphology Erode:1 Disk:1: Remember that the white-threshold cut out some of our lettering? The ordered dither created some speckling as well. So we morphologically Erode the white areas, which grows the black areas together. This helps create more solid letter shapes and reduces the dithered look a lot.
+dither -remap map.gif: Now we remap colors to just black and white (map.gif just has a single white pixel and a single black pixel). I added +dither because I don't want ImageMagick to do any more dithering during the remapping process. I'm not sure if it does dither during a remap or not, but with +dither I don't have to think about it.
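To make the threshold and erode steps concrete, here is a toy model in pure Python on a tiny grayscale grid. It is a sketch of the idea only, not ImageMagick's implementation: the function names are hypothetical, and treating `Disk:1` as a plus-shaped neighbourhood (a pixel and its four direct neighbours) is an assumption.

```python
def white_threshold(grid, percent):
    """Set every pixel brighter than `percent` of full scale (255) to pure white."""
    limit = 255 * percent / 100.0
    return [[255 if v > limit else v for v in row] for row in grid]


def erode_plus(grid):
    """Grayscale erosion: each pixel becomes the minimum of itself and its
    four direct neighbours, so white shrinks and black grows."""
    height, width = len(grid), len(grid[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            neighbours = [(y, x), (y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)]
            row.append(min(grid[ny][nx] for ny, nx in neighbours
                           if 0 <= ny < height and 0 <= nx < width))
        out.append(row)
    return out


# Pale speckles above 85% brightness are pushed to white; darker pixels stay.
print(white_threshold([[230, 200, 40]], 85))

# A horizontal stroke with a one-pixel hole in the middle.
stroke = [
    [255, 255, 255, 255, 255],
    [0,   0,   255, 0,   0],
    [255, 255, 255, 255, 255],
]
for row in erode_plus(stroke):
    print(row)
```

In the eroded result the hole in the stroke is filled and the stroke has thickened by one pixel above and below, which is the "grows black areas together" effect described above.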

The results compressing the processed files are MUCH better than what I started with!

BTW to make the color map just use:

Code:

  convert -size 2x1 xc:black -draw 'fill white point 0,0'  map.gif

Post by **RichardT** » 12 Mar 2014, 11:31

As I mentioned earlier in this thread, djvumake is very particular about the size of subsampled images, relative to the original. I used to pad/chop ALL of the pages to be divisible by my sampling factor, but that's kind of a pain. Since then, I've changed my workflow so it's not necessary.

I thought maybe djvumake was too particular, so I loosened up its checks. Didn't help. BUT, in the process I learned what exact width and height DjVu expects.

If you take a 600dpi image of size WxH and subsample the background layer to 150dpi (a factor of 4), then your background image must be (W+4-1)/4 x (H+4-1)/4 pixels in size, using integer division (in other words, W/4 and H/4 rounded up). No more, no less. So, I made a little script to look up the right size for me. It uses djvused if you give it a djvu file, and ImageMagick otherwise:

Code:

#!/bin/bash

# Usage: subsamp.sh infile sample-divisor
# e.g. "subsamp.sh in.djvu 4" prints the size needed to subsample by a factor of 4.
if [ $# -lt 2 ] ; then
  echo "Usage: subsamp.sh infile sample-divisor" >&2
  exit 1
fi

# Read the full-resolution size as WIDTHxHEIGHT.
SIZE=
if [ "${1##*.}" == "djvu" ] ; then
   # Keep only the width/height of the first page reported by djvused.
   SIZE=$(djvused -e 'size' "$1" | sed -n -e 's/.*width=\([0-9]*\) height=\([0-9]*\).*/\1x\2/p' | head -n 1)
else
   SIZE=$(identify -format '%wx%h' "$1")
fi

WIDTH=${SIZE%%x*}
HEIGHT=${SIZE##*x}

# Integer division rounded up: (N + divisor - 1) / divisor.
(( chwid = (WIDTH + $2 - 1) / $2 ))
(( chhgt = (HEIGHT + $2 - 1) / $2 ))

echo "${chwid}x${chhgt}"

Now when I make the background image, the end of the 'convert' line can do something like:

Code:

  convert infile  ...stuff...  -resize $(~/bin/subsamp.sh infile 4)\! outfile

NOTE the escaped exclamation mark at the end of the new size. That's important. Otherwise IM will "helpfully" preserve your aspect ratio and give you a slightly different size. This drove me nuts until I found out what was going on.
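A worked example shows why the exclamation mark matters. The Python sketch below computes the required size with the round-up rule and compares it with a simplified model of an aspect-preserving resize. The page size 2481x3508 is a made-up example, and `aspect_fit` is only an approximation: ImageMagick's exact rounding is not reproduced here.

```python
def subsampled_size(width, height, factor):
    """Round-up division: (W + factor - 1) / factor in integer arithmetic."""
    return ((width + factor - 1) // factor, (height + factor - 1) // factor)


def aspect_fit(width, height, box_w, box_h):
    """Simplified model of an aspect-preserving resize (an assumption, not
    ImageMagick's code): scale by the smaller ratio so the result fits inside
    the box, then round to whole pixels."""
    scale = min(box_w / width, box_h / height)
    return (max(1, round(width * scale)), max(1, round(height * scale)))


page_w, page_h = 2481, 3508          # hypothetical 600dpi page
need = subsampled_size(page_w, page_h, 4)
got = aspect_fit(page_w, page_h, *need)
print("required:", need)
print("aspect-preserving:", got)
```

With these numbers the required size is 621x877, but the aspect-preserving model yields 620x877: the height divides evenly by 4, so the scale is exactly 0.25, and 2481 x 0.25 = 620.25 rounds down. Being one pixel short in width is exactly the kind of mismatch described above, and forcing the exact size avoids it.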

So now I can leave my inputs their natural size, and produce any subsampling I want, and it works every time.

Post by **daniel_reetz** » 12 Mar 2014, 21:32

Nice! Thanks for the follow-up, RichardT!

DIY Book Scanner
