Software Requests (Besides Dewarping)?
Moderator: peterZ
-
- Posts: 290
- Joined: 20 Jun 2009, 12:19
- E-book readers owned: SONY PRS-505, Kindle DX
- Number of books owned: 9999
- Location: Grand Rapids, MI
Re: Software Requests (Besides Dewarping)?
Hey, guys, we've talked about it in other threads, but what about page renumbering? You've got two cameras generating randomly named (but sequential) page image files. Generally, we put the images into left and right directories. It is really nice to have them renamed into something sequential that you can then feed to ScanTailor.
This problem has been solved before with a bit of scripting here or there. I don't quite think we have a "canonical" solution, that'll merge two directories' image files into an output directory with everything numbered exactly like you like it.
Sometimes when I'm scanning I find that I goof up and either double-clutch (imaging a pair of pages twice), or skip. After I discover this, I need to insert or delete some pages. And it's annoying to rename files afterwards.
Some tasks are inherently difficult, e.g. OCR, while others are more tractable but annoyingly tedious. If you're looking for a target of opportunity, I think a definitive "file name renumberator" could be easy enough to pull off, and tricky enough to be interesting. I'm interested enough in doing something like this that I'll help, but I'm not going to bother if nobody else thinks it's worth the bother.
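A minimal sketch of such a merge in Python, for illustration only. It assumes that sorting each camera's file names reproduces capture order, that the left-hand image of each spread comes before the right-hand one, and that both directories hold the same number of images. The function names (`list_images`, `plan_merge`, `merge`) and the four-digit naming scheme are hypothetical, not an existing tool from this thread.

```python
import shutil
from pathlib import Path

# Assumption: these are the only image types the cameras produce.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def list_images(directory):
    """Return the image files in a directory, sorted by file name."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


def plan_merge(left_dir, right_dir, start=1, width=4):
    """Pair left/right images in order and return (source, new_name) tuples.

    Refuses to run when the two sides differ in length, because a single
    missing image would shift every page that follows it.
    """
    left, right = list_images(left_dir), list_images(right_dir)
    if len(left) != len(right):
        raise ValueError(
            f"left has {len(left)} images, right has {len(right)}; "
            "fix the missing or extra shots before merging"
        )
    plan = []
    number = start
    for left_img, right_img in zip(left, right):
        # Assumption: left page first, then right page, for each spread.
        for src in (left_img, right_img):
            plan.append((src, f"{number:0{width}d}{src.suffix.lower()}"))
            number += 1
    return plan


def merge(left_dir, right_dir, out_dir, **kwargs):
    """Copy the images into out_dir under their new sequential names."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for src, name in plan_merge(left_dir, right_dir, **kwargs):
        shutil.copy2(src, out / name)  # copy, so the originals stay untouched
```

Copying rather than renaming keeps the separate left and right originals around, which makes later insertions or deletions easier to redo.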
- daniel_reetz
- Posts: 2812
- Joined: 03 Jun 2009, 13:56
- E-book readers owned: Used to have a PRS-500
- Number of books owned: 600
- Country: United States
Re: Software Requests (Besides Dewarping)?
It's not quite what you want, but Matti wrote a renamer for our Instructable using glass:
http://www.instructables.com/id/Bargain ... ox/#step10
There's also Bulk Rename Utility:
http://www.bulkrenameutility.co.uk/
As well as the recently introduced File Wrangler:
http://diybookscanner.org/forum/viewtop ... =674#p6473
And Anonymous's own OCR Page Namer:
http://diybookscanner.org/forum/viewtop ... =674#p6335
I point these out not to discourage anyone, but rather to put in one place the many approaches to renaming that we have so far. A pre-processor for Scan Tailor could easily make this its first function.
-
- Posts: 496
- Joined: 04 Mar 2014, 00:53
Re: Software Requests (Besides Dewarping)?
...which is why I started a post about clocking cameras so that upon import, files are already sorted based on their capture time: http://www.diybookscanner.org/forum/vie ... ?f=3&t=627
Why make your CPU work any extra when sorting can be done in real time?
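A minimal sketch of ordering by time in Python, for illustration only. It uses each file's modification time as a stand-in for the capture time, which is an assumption: it only holds if both cameras' clocks are synchronized and the copy from the cards preserved the timestamps. The function name `images_by_time` is hypothetical.

```python
from pathlib import Path


def images_by_time(*directories):
    """Return the files from all directories, oldest first.

    Assumption: the modification time reflects when the shot was taken.
    Ties are broken by file name so the result is deterministic.
    """
    files = [
        p for d in directories
        for p in Path(d).iterdir()
        if p.is_file()
    ]
    return sorted(files, key=lambda p: (p.stat().st_mtime, p.name))
```

Because both cameras fire at nearly the same moment, the two images of one spread can land in either order when the timestamps are coarse, so the left/right order within a pair may still need checking.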
-
- Posts: 596
- Joined: 06 Jun 2009, 23:57
Re: Software Requests (Besides Dewarping)?
StevePoling wrote: This problem has been solved before with a bit of scripting here or there. I don't quite think we have a "canonical" solution, that'll merge two directories' image files into an output directory with everything numbered exactly like you like it.
I guess that depends on what you mean by "exactly like you like it." For me, if the pages are in the same order they were in in the book, that's exactly like I like it. I've seen some comments -- maybe from you, I don't remember -- which suggest that some people want page "iii" in the book named "iii", and page "43" named "43". I asked in that OCR page naming thread if there is a reader which recognizes "page name" in this manner, and so far nobody's said yes. Unless there is such a reader, I don't see the point for myself of doing anything more than keeping all the pages with information in the proper sequence.
StevePoling wrote: Sometimes when I'm scanning I find that I goof up and either double-clutch (imaging a pair of pages twice), or skip. After I discover this, I need to insert or delete some pages. And it's annoying to rename files afterwards.
Yes, and this is why I think there may be a need for the kind of software you're proposing here. The batch renamers -- including the scripts I use myself -- work fine as long as there is a complete set of R and L pages. And really, with Scan Tailor and a "sequence is sufficient" attitude, pages which are shot twice are not a problem -- you just delete the duplicates from the ST project, and they disappear from the final product.
The real problem comes in when pages are missing. If you're missing a single image, say Right-42, your batch renamer has probably messed up the order for everything that comes after -- now Left-41 is followed by Right-44, which is followed by Left-43, etc. This can happen if one camera fails to fire for some reason, and it can be difficult to correct if you only notice it after a batch rename/merge and you haven't kept separate L and R originals.
If you're missing a pair of pages, because you turned two pages instead of one, you have to insert a pair of pages.
If your batch rename uses names like 0001L 0001R etc. and you've kept separate originals, inserting can be relatively painless -- just add the images to the L or R directory in the proper place (i.e., add 0043M between 0043L and 0044L in the "Left" directory), run your batch renamer making 0043M the new 0044L etc., and re-merge. This can still be a problem if you've already done Scan Tailor processing and saved the project, since Scan Tailor will keep information on each of the images by name, and you've now renamed them behind its back.
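The insert-then-renumber step described above can be sketched in Python as follows, for illustration only. It assumes one directory per side, that sorting by file name gives the intended order (so `0043M` falls between `0043L` and `0044L`), and that nothing else in the directory should be left alone. The names `plan_renumber` and `renumber` and the `.renumber-N` temporary names are hypothetical.

```python
from pathlib import Path


def plan_renumber(directory, side="L", start=1, width=4):
    """Return (old_path, new_path) pairs that number the files sequentially.

    Files are taken in name order, so an inserted 0043M sorts after 0043L
    and before 0044L.
    """
    files = sorted(p for p in Path(directory).iterdir() if p.is_file())
    return [
        (p, p.with_name(f"{i:0{width}d}{side}{p.suffix}"))
        for i, p in enumerate(files, start)
    ]


def renumber(directory, **kwargs):
    """Apply the plan in two passes and return the (old, new) name mapping.

    Going through temporary names first avoids overwriting a file whose
    old name is another file's new name (0044L becoming 0045L, and so on).
    """
    plan = plan_renumber(directory, **kwargs)
    staged = []
    for i, (src, dst) in enumerate(plan):
        tmp = src.with_name(f".renumber-{i}{src.suffix}")
        src.rename(tmp)
        staged.append((tmp, dst))
    for tmp, dst in staged:
        tmp.rename(dst)
    return [(src.name, dst.name) for src, dst in plan]
```

The returned mapping records which old name became which new name, which is the information a saved Scan Tailor project would no longer agree with after the rename.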
In such cases, I think the proper thing to do is to add the missing image to the Scan Tailor project with a new name, without renaming images ST has already processed. Then, do your sequencing rename on the output, maybe even after the TIFF files have been converted to PDFs.
I like software that works all the time. I don't like to have to worry about "exceptions" or discover them after the fact. It's possible, by keeping left and right originals and employing a sensible renaming scheme, to handle deletions and insertions with a batch renamer. The problem is, it requires me to stop and think, when what I'd really like to do is say "Page 47 missing? I've got your page 47 right here..." and have the software take care of all the messy details behind the scenes. Those messy details may include a Scan Tailor project file keyed by name to a lot of processing which is already complete.
Hmmm, now that I think about it, "I've got your page 47 right here" does seem to make a case for naming page 47 something like 00047, whether the reader recognizes it or not... I still don't like an OCR renamer, because as I said, I want software that works every time, without requiring me to stop and think about exceptions. I have books in which page 46 is followed by a dozen unnumbered picture pages before page 47. I have art books in which almost none of the pages display a page number. An OCR renamer can keep them in order, but that implies I've already done one rename that PUT them in order.
I'm rambling, thinking in text here. I guess what I'm saying is, I think there is a need for something like you propose, but doing it right -- covering all the bases, so it always works and I never have to think beyond "insert this here" to use it -- may be more trouble than it's worth. And if you can only do it almost right, I'm not sure I wouldn't prefer to stay with a batch renamer (script, in my case) and a system I understand.
Re: Software Requests (Besides Dewarping)?
Yeah, the batch renamers are really useful for complete collections of pages, but they fail when a single page is missing (I actually use that feature to figure out what pages I'm missing; when the page numbers != file numbers, I just work backwards to find out the culprit). I've found Métamorphose (it's completely open-source, written in Python) to have the most complete feature set (the Beta is amazing, and not a single crash yet!): http://file-folder-ren.sourceforge.net/.
As for the Scan Tailor page-splitting issues, may I ask how ST does it? Before I discovered ST, I wrote a script with ImageMagick and Bash which basically graphs the colors of the image and finds the maximum (it's pretty cool how it looks; the text is a jagged line, then there is a smooth break, a pointed curve in the middle, and the jagged lines again). I've never gotten around to implementing this fully in any program, but I'll post a sample graph. It's pretty cool.
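A rough sketch of the same idea in plain Python rather than ImageMagick and Bash, for illustration only. It is not how Scan Tailor splits pages, and it is not the script mentioned above. It assumes the image has already been loaded as a grayscale grid (a list of rows of 0-255 values; loading would need an imaging library, not shown) and that the seam is the most extreme column near the middle. The names `column_profile` and `find_seam` and the `margin` parameter are hypothetical.

```python
def column_profile(image):
    """Mean intensity of each pixel column.

    `image` is a list of equal-length rows, each a list of 0-255 values.
    """
    height = len(image)
    return [sum(column) / height for column in zip(*image)]


def find_seam(profile, margin=0.25, darkest=True):
    """Index of the most extreme column in the central part of the profile.

    `margin` (must be below 0.5) is the fraction of the width ignored on
    each side, so text columns near the edges are not mistaken for the seam.
    Whether the seam shows up as the darkest or the brightest column depends
    on the lighting, hence the `darkest` switch.
    """
    lo = int(len(profile) * margin)
    hi = len(profile) - lo
    pick = min if darkest else max
    return pick(range(lo, hi), key=profile.__getitem__)
```

Restricting the search to the central band is a design shortcut that only makes sense when the book is roughly centered in the frame.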
Re: Software Requests (Besides Dewarping)?
Okay, I've actually implemented the graphing into a quick Python script.
Here is the original image (thanks Google):
Here is the intensity graph of that same image (normal):
Here is the intensity graph of the image in bitonal:
You can clearly see the interesting parts of the book from the graphs. I'm seeing if I can use this to detect text, seams, etc.
Re: Software Requests (Besides Dewarping)?
Here's another sample. This one is a bit more clear, and I applied a Gaussian Smoothing function to the data. It is really obvious now.
All that remains is to take the derivative of that function and analyze it. It is quite fun!
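A small sketch of the smoothing and derivative steps in plain Python, for illustration only. It assumes the intensity graph is a list of numbers with at least two entries, uses a Gaussian kernel truncated at three sigma with edge values repeated, and approximates the derivative with finite differences. These choices, and the names `gaussian_kernel`, `smooth`, `derivative` and `local_minima`, are assumptions and not the script behind the graphs.

```python
import math


def gaussian_kernel(sigma):
    """Normalized Gaussian weights, truncated at about three sigma."""
    radius = max(1, int(3 * sigma))
    weights = [
        math.exp(-(i * i) / (2 * sigma * sigma))
        for i in range(-radius, radius + 1)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(profile, sigma=2.0):
    """Convolve the profile with a Gaussian, repeating the edge values."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(profile)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)  # clamp at both ends
            acc += w * profile[j]
        out.append(acc)
    return out


def derivative(profile):
    """Central differences inside, one-sided differences at the two ends."""
    n = len(profile)
    result = []
    for i in range(n):
        lo, hi = max(i - 1, 0), min(i + 1, n - 1)
        result.append((profile[hi] - profile[lo]) / (hi - lo))
    return result


def local_minima(profile):
    """Indices where the slope changes from negative to non-negative."""
    d = derivative(profile)
    return [i for i in range(1, len(d)) if d[i - 1] < 0 <= d[i]]
```

Calling `local_minima(smooth(profile))` gives candidate dips; flipping the comparison would give peaks instead, depending on how the seam appears in the graph.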
- daniel_reetz
- Posts: 2812
- Joined: 03 Jun 2009, 13:56
- E-book readers owned: Used to have a PRS-500
- Number of books owned: 600
- Country: United States
Re: Software Requests (Besides Dewarping)?
This is a really neat demo... but will it work on pages from our camera-based scanners? Can someone share some scans/data with Anonymous?
We really need a DIY Book Scanner dataset of page images for projects like this.
Re: Software Requests (Besides Dewarping)?
I went as DIY as I possibly could; I took a picture of a book lying on my floor, cropped it, and centered the seam. The results are almost identical (here's a composite):
I'm just going to see how I can extract that data from the graph mathematically, as I can easily do it visually...
Re: Software Requests (Besides Dewarping)?
Hi,
Could you post the last image without the red graph? I could also try a combination of filters to extract the data.