When I started scanning books, I didn't care about browsing them afterwards. I just scanned and stored. But I scan a lot of scientific literature, and for me it is very important to be able to quickly locate the sections I want to read or to print in the document, which isn't always easy in 500+ page books.
One day, though, I was playing with FineReader's PDF creation settings and ticked "Create Outline". I was lucky enough to do it on a book with an (apparently) very OCR-friendly typography, and FineReader nailed it. I ended up with the whole book almost perfectly bookmarked. Every chapter and subsection was there.
It was very good, but not perfect, and I spent a whole hour with JPDFBookmarks fixing the minor things FR had missed or left out. I thought then that having the whole book bookmarked was not really worth the effort of doing it manually. I later found out that autobookmarking in FR's PDF creation relies, the MS Word way, on document styles. Obviously, it is very rare to get good bookmarks that way.
But I still wanted bookmarks, and I finally found a way to get them quickly and with little to no effort. I want to share this.
It's usually a 5-minute job. The accuracy and depth of the bookmarks are as good (or as poor) as the book I'm scanning allows, because the bookmarks are simply the book's own index (its table of contents) embedded in the PDF.
The quick and easy way to get bookmarks is just to format that index so that it can be embedded in a PDF as bookmarks. I'll try to summarise it. Everything can be done with free software, even though my explanation mentions Excel and ABBYY FineReader, which are not free.
1. First step: extract the index from the book as plain text and load it in a text editor. It is done by simply copying and pasting from the OCR'd images of the index. My text editor of choice is Notepad++, but any text editor will work as long as it supports regular expressions.
2. Second step: format the text and adjust page numbers.
The whole trick in creating bookmarks this way lies in the fact that the list of PDF bookmarks can be saved as a plain text file, which can be exported from or imported into PDFs using the right software. JPDFBookmarks is able to import and export bookmarks, but it uses a format which I find a little bit messy to work with, so I'm using JPDFTweak, which manipulates PDFs through a library called iText and works with a much handier bookmarks format.
So let's imagine that our OCR program outputs this:
1. First chapter........................................5
1.1. First section...................................5
1.1.1. First subsection.........................7
1.1.2. Second subsection......................8
1.2. Second section...............................9
The JPDFTweak format is a semicolon-separated plain text file which consists, AFAIK, of the following fields. I wasn't able to find detailed info on this, but maybe it is buried somewhere in the iText documentation:
Depth;State and style;Name;Page the bookmark points at<space>Position in the page, zoom, etc.
The state and style of the bookmark tell the PDF reader how to display it. AFAIK, possible values are O (show that branch of the tree open), I (italics) and B (bold). They can be combined, so the second field can look like "O", "OI", "OB", "OIB", etc.
The name is the text we'll see for that bookmark in the bookmarks panel of our reader.
The page is... well, the page of the PDF. There is a (big) chance that the page number in the PDF will be different from the page number of the book. For example, if we add a cover page, the page number 5 in the book will be the page number 6 in the PDF. We will fix that with Excel.
The positioning and zoom info comes after the page number, and is separated from it not by a semicolon but by a space. Here you can point to specific sections of the page, set levels of zoom, etc., but this requires manual editing of every bookmark. I just put "Fit" in this field, which simply fits the page to the PDF reader window.
So the former example formatted for JPDFTweak will look like this:
1;O;1. First chapter;5 Fit
2;O;1.1. First section;5 Fit
3;O;1.1.1. First subsection;7 Fit
3;O;1.1.2. Second subsection;8 Fit
2;O;1.2. Second section;9 Fit
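For anyone who prefers scripting, one line of that format could be put together in Python roughly like this. It is only a sketch based on the field layout as I understand it above; the function name `bookmark_line` and its parameters are made up for illustration and are not part of JPDFTweak.

```python
def bookmark_line(depth, name, page, state="O", view="Fit"):
    """Build one line as Depth;State;Name;Page<space>View."""
    # Assumption: a name must not contain ';', because that character
    # separates the fields. I don't know whether the format has an escape.
    if ";" in name:
        raise ValueError("bookmark name contains the field separator ';'")
    return f"{depth};{state};{name};{page} {view}"


print(bookmark_line(1, "1. First chapter", 5))
print(bookmark_line(2, "1.1. First section", 5))
```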
The first thing I always do is clean up the text that comes from the PDF. This mainly involves three things:
a) Remove blank lines. This is done automatically in Notepad++ by selecting "Edit > Line Operations > Remove Blank Lines".
b) Get rid of tabs. "Edit > Blank Operations > TAB to Space".
c) Remove double spaces. "Search > Replace..." (or Ctrl + H). In the dialog that pops up, select "Normal mode". Search for two spaces ("  ") and replace with one space (" "). Repeat until no more double spaces are found.
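The same cleanup can also be scripted. This is a minimal Python sketch of the three operations above, not what Notepad++ does internally; `clean_index` is a name I made up, and it additionally trims spaces at the start and end of each line.

```python
import re


def clean_index(text):
    """Drop blank lines, turn tabs into spaces, collapse repeated spaces."""
    cleaned = []
    for line in text.splitlines():
        line = line.replace("\t", " ")      # b) tabs to spaces
        line = re.sub(r" {2,}", " ", line)  # c) runs of spaces to one space
        line = line.strip()                 # extra: trim both ends
        if line:                            # a) skip blank lines
            cleaned.append(line)
    return "\n".join(cleaned)


raw = "1. First chapter\t....5\n\n  1.1.  First section....5\n"
print(clean_index(raw))
```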
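The second step is getting the page number next to the text, that is, replacing the row of dots between the title and the page number with a single semicolon. This is easy with regular expressions.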
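As an illustration, here is how that could look as a Python regular expression. It is a sketch that assumes the leader is an unbroken run of at least two dots, as in my example; OCR output with spaced dots (". . . .") would need a different pattern. The names `LEADER` and `join_title_and_page` are my own.

```python
import re

# Title, then a run of two or more dots (spaces around it allowed),
# then the page number at the end of the line.
LEADER = re.compile(r"^(.*?)\s*\.{2,}\s*(\d+)\s*$")


def join_title_and_page(line):
    """Turn 'Title.......5' into 'Title;5'; leave other lines untouched."""
    match = LEADER.match(line)
    if match is None:
        return line
    return f"{match.group(1)};{match.group(2)}"


print(join_title_and_page("1.1. First section...................................5"))
```

Lines that don't match are returned unchanged, so they are easy to spot and fix by hand afterwards.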
The third step is building the depth and state fields. Almost always, the section names show a pattern which we can use to get the depth. In the former example, level 1 is always <start of the line><number><dot><space><capital letter>. With regular expressions we can store parts of the search string and reuse them in the replace string, so we can search for that pattern and replace it with "1;O;<number><dot><space><capital letter>". It might look confusing, but it is easy once you are familiar with regular expressions.
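A scripted variant of this step might look like the sketch below. Note that it takes a slightly different route from the search and replace per level described above: it counts the number groups at the start of the line ("1." is depth 1, "1.1.2." is depth 3). It assumes numbering in the style of my example, and `NUMBERING` and `add_depth` are names I invented.

```python
import re

# Leading section number such as "1." or "1.1.2.", followed by a space.
NUMBERING = re.compile(r"^((?:\d+\.)+) ")


def add_depth(line, state="O"):
    """Prefix a line with 'depth;state;' based on its section number."""
    match = NUMBERING.match(line)
    if match is None:
        return line  # unnumbered entry: leave it for manual handling
    depth = match.group(1).count(".")  # one dot per numbering level
    return f"{depth};{state};{line}"


print(add_depth("1.1.2. Second subsection;8"))
```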
The fourth step is importing the file into Excel or any other spreadsheet software as a semicolon-delimited file. There we adjust the page numbers (adding, for example, 1 to all of them) and export the file as CSV.
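If you would rather skip the spreadsheet, the page shift is simple to script. The sketch below assumes every line already ends in ";<page number>" and that the same offset applies to the whole book; `shift_pages` is a hypothetical helper name.

```python
def shift_pages(lines, offset):
    """Add offset to the last semicolon-separated field of each line."""
    shifted = []
    for line in lines:
        # Split at the last ';' so semicolon-free titles stay intact.
        head, _, page = line.rpartition(";")
        shifted.append(f"{head};{int(page) + offset}")
    return shifted


entries = ["1;O;1. First chapter;5", "2;O;1.1. First section;5"]
for entry in shift_pages(entries, 1):  # e.g. one cover page was added
    print(entry)
```

A line whose last field is not a number makes `int()` raise an error, which is a handy way to catch OCR mistakes in the page numbers.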
The fifth step is adding the positioning and zoom field. Back in Notepad++ with the number-corrected CSV, just search (in regular expression mode) for "$" (the end-of-line anchor) and replace it with " Fit".
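The scripted equivalent is a one-liner. In this sketch I go through the text line by line instead of matching "$", so that a trailing newline at the end of the file doesn't produce a stray entry; `add_view` is just an illustrative name.

```python
def add_view(text, view="Fit"):
    """Append ' <view>' to every non-empty line of the bookmarks text."""
    return "\n".join(
        f"{line} {view}" for line in text.splitlines() if line.strip()
    )


print(add_view("1;O;1. First chapter;6\n2;O;1.1. First section;6\n"))
```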
3. Third and final step: embedding the bookmarks into the PDF. JPDFTweak does this, and it is self-explanatory once you are in the program. It will also complain about any mistakes we might have made. Afterwards, the only remaining thing is reviewing the file to check that all the bookmarks point where they should. I've even found books whose indexes contained errors and pointed to the wrong page.
I'm using this for bookmarking my PDFs, but the same principle can be applied to any other document format supporting bookmarks. It's just a matter of finding the right software and adapting the bookmarks format.
OK, it's been a long post, but I hope that it can be useful to us bookmarkers.
Bye.