Last night I made a custom version of jbig2 that makes the output file names resemble the input file names (instead of just counting up from .0000). This is important because not every page is guaranteed to have a jbig2 mask. Most book covers, notably, don't have one, and a cover is what I usually use for page 0000.
The way I work so far is: I split my files into background, text (mask), and foreground. If a page has all three, the filenames will be...
- page.0012.bg.jp2
- page.0012.tif
- page.0012.fg.jp2
Now my Python scripts pull together the components into a PDF file. Soon they will also accept HOCR inputs for the hidden text layer.
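To make the naming scheme concrete, here is a minimal sketch of how a script might group files into per-page layers before assembling the PDF. It is not my actual script: the function name `group_layers`, the regex, and the assumption of four-digit page numbers are all made up for illustration, based only on the example file names above.

```python
import re
from collections import defaultdict

# Assumed convention, taken from the example names above:
#   page.NNNN.bg.jp2 -> background
#   page.NNNN.tif    -> text mask
#   page.NNNN.fg.jp2 -> foreground
LAYER_NAME = re.compile(
    r"^(?P<stem>.+)\.(?P<page>\d{4})(?:\.(?P<layer>bg|fg))?\.(?P<ext>jp2|tif)$"
)


def group_layers(filenames):
    """Map page number -> {"bg"/"mask"/"fg": filename}, keeping only layers that exist."""
    pages = defaultdict(dict)
    for name in filenames:
        match = LAYER_NAME.match(name)
        if match is None:
            continue  # not a layer file, skip it
        layer = match.group("layer")
        ext = match.group("ext")
        if layer is None and ext == "tif":
            key = "mask"
        elif layer is not None and ext == "jp2":
            key = layer
        else:
            continue  # e.g. page.0012.jp2 or page.0012.bg.tif: not part of the scheme
        pages[int(match.group("page"))][key] = name
    # Return pages in page-number order.
    return dict(sorted(pages.items()))
```

A cover that only has a background file simply ends up with a single `"bg"` entry, which is the case the filename change was meant to handle.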
So the only really manual step in the process is the splitting. For pages with no foreground layer, it's easy to split files from scantailor's "mixed mode" output. But when there's also a foreground component, it's a manual process, so a somewhat automatic splitter would be a big deal. I'm reading some papers on the subject to gather ideas. I think I can get a leg up on the issue by forcing my fg layer to a very low dpi. That will eliminate, right away, a lot of the choices about whether a pixel can be fg or not. Anyway, more on that part when I finally get to it.
Hopefully by the end of the week I'll have HOCR and outline support done. Neither seems very hard.
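To put rough numbers on the low-dpi fg idea, here is a small back-of-the-envelope sketch. It is only one way to read the idea: if the fg layer is stored at a fraction of the page resolution, each fg pixel covers a whole block of page pixels, so the fg-or-not decision is made per block instead of per pixel. The function names and the dpi values used afterwards are assumptions for illustration, not settings I have actually picked.

```python
def fg_block_size(page_dpi, fg_dpi):
    """Side length, in full-resolution pixels, of the block one fg pixel covers."""
    if fg_dpi <= 0 or page_dpi % fg_dpi != 0:
        raise ValueError("this sketch assumes page_dpi is a whole multiple of fg_dpi")
    return page_dpi // fg_dpi


def fg_decisions(width_px, height_px, page_dpi, fg_dpi):
    """Number of fg-or-not decisions when deciding per fg pixel rather than per page pixel."""
    block = fg_block_size(page_dpi, fg_dpi)
    # Ceiling division so partial blocks at the right and bottom edges still count.
    cols = -(-width_px // block)
    rows = -(-height_px // block)
    return cols * rows
```

For example, with a hypothetical 600 dpi letter-size scan (5100 x 6600 pixels) and a 100 dpi fg layer, each fg pixel covers a 6 x 6 block, so there are 935,000 decisions to make instead of 33,660,000.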