Quick and easy bookmarks

Benedictus · Post by **Benedictus** » 25 Apr 2013, 21:23

Hello everybody.

When I started scanning books, I didn't care about browsing them afterwards. I just scanned and stored. But I scan a lot of scientific literature, and for me it is very important to be able to quickly locate the sections I want to read or to print in the document, which isn't always easy in 500+ page books.

One day, though, I was playing with FineReader's PDF creation settings and ticked "Create Outline". I was lucky enough to do it on a book with an (apparently) very OCR-friendly typography, and FineReader nailed it. I ended up with the whole book almost perfectly bookmarked. Every chapter and subsection was there.

It was very good, but not perfect, and I spent a whole hour with JPDFBookmarks fixing the minor things FR had missed or left out. I thought then that having the whole book bookmarked was not really worth the effort of doing it manually. I later found out that auto-bookmarking in FR's PDF creation relies, the MS Word way, on document styles. Obviously, it is very rare to get good bookmarks that way.

But I still wanted bookmarks, and I finally found a way to get them quickly and with little to no effort. I want to share this.

It's usually a 5-minute job. The accuracy and depth of the bookmarks are as good (or as poor) as the book I'm scanning allows, because the bookmarks are simply the book's own index (its table of contents) embedded in the PDF.

The way to get quick and easy bookmarks is just to format that index so that it can be embedded in a PDF as bookmarks. I'll try to summarise it. Everything can be done with free software, even though my explanation mentions Excel and ABBYY FineReader, which are non-free.

1. First step: extract the index from the book as plain text and load it in a text editor. It is done by simply copying and pasting from the OCR'd images of the index. My text editor of choice is Notepad++, but any text editor will work as long as it supports regular expressions.

2. Second step: format the text and adjust page numbers.

The whole trick in creating bookmarks this way lies in the fact that the list of PDF bookmarks can be saved as a plain text file, which can be exported from or imported into PDFs using the right software. JPDFBookmarks is able to import and export bookmarks, but it uses a format which I find a little bit messy to work with, so I'm using JPDFTweak, which manipulates PDFs through a library called iText and works with a much handier bookmarks format.

So let's imagine that our OCR program outputs this:

Code: Select all

1. First chapter........................................5
    1.1. First section...................................5
        1.1.1. First subsection.........................7
        1.1.2. Second subsection......................8
    1.2. Second section...............................9

What we have to do now is adapt this to the JPDFTweak format, so we end up with, for example, a bookmark named "1. First chapter" pointing to the right page in the PDF, another called "1.1. First section" pointing to its page, etc.

The JPDFTweak format is a semicolon-separated plain text file which consists, AFAIK, of the following fields. I wasn't able to find detailed info on this, but maybe it is buried somewhere in the iText documentation:

Code: Select all

Depth;State and style;Name;Page the bookmark points at<space>Position in the page, zoom, etc.

The depth is the level of the bookmark in the hierarchy tree. In our example "1..." is 1, "1.1..." and "1.2..." are 2 and "1.1.1..." and "1.1.2..." are 3.

The state and style of the bookmark tell the PDF reader how to display it. AFAIK, possible values are O (show the branch of the tree opened), I (italics) and B (bold). They can be combined, so the second field can look like "O", "OI", "OB", "OIB", etc.

The name is the name of the bookmark: the text we'll see for it in the bookmarks panel of our reader.

The page is... well, the page of the PDF. There is a (big) chance that the page number in the PDF will be different from the page number of the book. For example, if we add a cover page, the page number 5 in the book will be the page number 6 in the PDF. We will fix that with Excel.

The positioning and zoom info comes after the page number, and is separated from it not by a semicolon but by a space. Here you can point to specific sections of the page, set levels of zoom, etc., but this requires manual editing of every bookmark. I just put "Fit" in this section, which simply fits the page to the PDF reader window.

So the earlier example, formatted for JPDFTweak, will look like this:

Code: Select all

1;O;1. First chapter;5 Fit
2;O;1.1. First section;5 Fit
3;O;1.1.1. First subsection;7 Fit
3;O;1.1.2. Second subsection;8 Fit
2;O;1.2. Second section;9 Fit
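
For those who prefer a script to hand-editing, here is a minimal Python sketch of the layout above. The function name `bookmark_line` and its parameters are made up for illustration, and the field order is only what is described in this post, not something checked against JPDFTweak or iText documentation.

```python
def bookmark_line(depth, name, page, flags="O", view="Fit"):
    """Build one line as depth;flags;name;page<space>view."""
    # Assumption: a literal semicolon inside the name would break the
    # field layout. Escaping is not handled in this sketch, so refuse it.
    if ";" in name:
        raise ValueError("semicolon in bookmark name is not handled: " + name)
    return f"{depth};{flags};{name};{page} {view}"


# The example table of contents from this post as (depth, name, page).
entries = [
    (1, "1. First chapter", 5),
    (2, "1.1. First section", 5),
    (3, "1.1.1. First subsection", 7),
    (3, "1.1.2. Second subsection", 8),
    (2, "1.2. Second section", 9),
]

lines = [bookmark_line(depth, name, page) for depth, name, page in entries]
print("\n".join(lines))
```

The printed lines should match the listing above, one bookmark per line.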

Using regular expressions, the OCR text can be transformed into the JPDFTweak format in just a couple of minutes. I can't go into details about this because it requires some learning, but I think it is worth spending a couple of hours on it. I'll just put here some tips from my own experience. Everything here is done with Notepad++ and deals with the output of ABBYY FineReader. Its quirks (tabs, double spaces, etc.) may differ from those of other OCR programs.

The first thing I always do is clean up the text that comes from the PDF. This mainly means three things:

a) Remove blank lines. This is done automatically in Notepad++ by selecting "Edit > Line Operations > Remove Blank Lines".
b) Get rid of tabs. "Edit > Blank Operations > TAB to Space".
c) Remove double spaces. "Search > Replace..." (or Ctrl + H). In the dialog that pops up, select "Normal mode". Search for two spaces ("  ") and replace with one space (" "). Repeat until no more double spaces are found.
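
The same three clean-up operations can be sketched in Python for anyone not using Notepad++. The name `clean_toc_text` is hypothetical. One difference from the manual steps is labelled in the comments: this sketch also strips leading and trailing spaces, on the assumption that the depth will be derived later from the section numbering rather than from indentation.

```python
import re


def clean_toc_text(raw):
    """Apply steps a), b) and c) to pasted OCR text."""
    cleaned = []
    for line in raw.splitlines():
        line = line.replace("\t", " ")       # b) tabs become spaces
        line = re.sub(r" {2,}", " ", line)   # c) collapse runs of spaces
        # Assumption (not part of the manual steps): indentation is not
        # needed later, so leading and trailing spaces are dropped too.
        line = line.strip()
        if line:                             # a) skip blank lines
            cleaned.append(line)
    return "\n".join(cleaned)


sample = "1. First\tchapter ....  5\n\n   1.1.  First section .... 5\n"
print(clean_toc_text(sample))
```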

Second step is getting the page number close to the text, that is, getting rid of whatever the OCR left between the name and the page number (the row of dots in the example). Easy with regular expressions.

Third step is building the depth and state sections. Almost always, the section names show a pattern which we can use to get the depth. In the earlier example, level 1 is always <start of the line><number><dot><space><capital letter>. With regular expressions we can store parts of the search string and reuse them in the replace string, so we can search for that pattern and replace it with "1;O;<number><dot><space><capital letter>". It might look confusing, but once you are familiar with regular expressions it is easy.
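
As an illustration of this step, here is one possible regular expression, written in Python rather than Notepad++. It assumes each line ends with the page number in plain digits and that only dots and spaces sit between the name and that number. The names `LEADER` and `split_title_and_page` are invented for this sketch.

```python
import re

# Name (as short as possible), then any run of dots and spaces,
# then the page digits at the very end of the line.
LEADER = re.compile(r"^(?P<title>.*?)[ .]*(?P<page>\d+)\s*$")


def split_title_and_page(line):
    """Return (title, page) for one table-of-contents line."""
    match = LEADER.match(line.strip())
    if match is None:
        raise ValueError("no trailing page number in: " + line)
    return match.group("title"), int(match.group("page"))


title, page = split_title_and_page(
    "1.1. First section...................................5"
)
print(f"{title};{page}")
```

Lines where the OCR misread the page number (an "l" for a "1", for example) will raise an error here, which is a convenient way of finding them.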
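
A small Python sketch of the same idea, assuming the book numbers its sections like the example ("1.", "1.1.", "1.1.1."). The depth is simply the number of numeric groups at the start of the name. The names `NUMBERING`, `depth_of` and `with_depth_and_state` are hypothetical, and treating unnumbered entries as level 1 is my own assumption.

```python
import re

# Dotted numbering at the start of the name, e.g. "1", "1.2" or "1.2.3",
# optionally followed by a final dot, then a space.
NUMBERING = re.compile(r"^(\d+(?:\.\d+)*)\.? ")


def depth_of(title):
    """Depth 1 for "1.", 2 for "1.1.", 3 for "1.1.1.", and so on."""
    match = NUMBERING.match(title)
    if match is None:
        # Assumption: unnumbered entries (Preface, Index...) are top level.
        return 1
    return match.group(1).count(".") + 1


def with_depth_and_state(title, state="O"):
    """Prefix the name with the depth and state fields."""
    return f"{depth_of(title)};{state};{title}"


for name in ["1. First chapter", "1.1. First section", "1.1.1. First subsection"]:
    print(with_depth_and_state(name))
```

Books that mark levels differently (Roman numerals, "Chapter 1", typography only) need a different pattern for each level.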

Fourth step is importing into Excel or any other spreadsheet software as a semicolon-delimited file. There we adjust the page numbers (adding, for example, 1 to all page numbers) and export the file as csv.
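
If no spreadsheet is at hand, the page shift can also be sketched in a few lines of Python. This assumes every line still ends with ";<page>" (the positioning part has not been added yet) and that one constant offset applies to the whole book, which will not hold for books with unnumbered plates or separately numbered front matter. The function name `shift_pages` is made up.

```python
def shift_pages(lines, offset):
    """Add a constant offset to the last semicolon-separated field."""
    shifted = []
    for line in lines:
        # Split on the last semicolon only, so the name is left untouched.
        head, page = line.rsplit(";", 1)
        shifted.append(f"{head};{int(page) + offset}")
    return shifted


book_pages = ["1;O;1. First chapter;5", "2;O;1.1. First section;5"]
# One cover page was added, so book page 5 is PDF page 6.
print("\n".join(shift_pages(book_pages, 1)))
```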

Fifth step is adding the positioning and zoom section. Back in Notepad++ with the number-corrected csv, just search (in regular expressions mode) for "$" (which matches the end of each line) and replace it with " Fit".

3. Third and final step: embedding the bookmarks into the PDF. JPDFTweak does this. It is self-explanatory once you are in the program. It will also complain about any mistakes we might have made. Afterwards, the only remaining thing is reviewing the file to check that all the bookmarks point where they should. I've even found books whose indexes contained errors and pointed to the wrong page.
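
Before importing, a quick sanity check of the text file can save some trial and error. The sketch below is not what JPDFTweak checks internally (I don't know that); it only tests the layout described in this post and flags depth jumps of more than one level as suspicious, which is my own assumption. The names `LINE` and `check_bookmarks` are invented.

```python
import re

# depth;state;name;page, optionally followed by a space and the view part.
# Only the state letters mentioned in this post (O, I, B) are accepted.
LINE = re.compile(r"^(\d+);([OIB]*);(.+);(\d+)( .*)?$")


def check_bookmarks(lines):
    """Return a list of human-readable problems, with line numbers."""
    problems = []
    previous_depth = 0
    for number, line in enumerate(lines, start=1):
        match = LINE.match(line)
        if match is None:
            problems.append(f"line {number}: not in depth;state;name;page form")
            continue
        depth = int(match.group(1))
        if depth < 1 or depth > previous_depth + 1:
            problems.append(
                f"line {number}: unexpected depth {depth} after depth {previous_depth}"
            )
        if int(match.group(4)) < 1:
            problems.append(f"line {number}: page number must be 1 or more")
        previous_depth = depth
    return problems


for problem in check_bookmarks(["1;O;A;5 Fit", "3;O;B;6 Fit", "garbage"]):
    print(problem)
```

This does not replace reviewing the bookmarks in the PDF itself, since it cannot tell whether a page number is the right one.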

I'm using this for bookmarking my PDFs, but the same principle can be applied to any other document format supporting bookmarks. It's just a matter of finding the right software and adapting the bookmarks format.

OK, it's been a long post, but I hope that it can be useful to us bookmarkers.

Bye.

dtic · Post by **dtic** » 26 Apr 2013, 18:29

Benedictus wrote:JPDFBookmarks is able to import and export bookmarks, but it uses a format which I find a little bit messy to work with, so I'm using JPDFTweak, which manipulates PDFs through a library called iText and works with a much handier bookmarks format.

The JPDFBookmarks format can be made very slim. The help page gives this example

Code: Select all

Chapter 1/23
[TAB]Para 1.1/25,FitWidth,96
[TAB][TAB]Para 1.1.1/26,FitHeight,43
Chapter 2/30,TopLeft,120,42
[TAB]Para 2.1/32,FitPage.

But if you only need the bookmark to link to entire pages and won't use bold, italic, text color or any such fancy stuff you can trim it down to

Code: Select all

Chapter 1/23
[TAB]Paragraph A/25
[TAB]Paragraph B/26
Chapter 2/30
Index/32

And if you don't need hierarchy among the bookmarks just use

Code: Select all

Chapter 1/23
Paragraph A/25
Paragraph B/26
Chapter 2/30
Index/32
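
For anyone who already has the depth, name and page of each bookmark, the trimmed-down layout above can be produced with a few lines of Python. This is only a sketch: it assumes that "[TAB]" in the examples stands for a real tab character, one per level below the first, and it does not handle names containing a slash. The function name `to_jpdfbookmarks` is hypothetical.

```python
def to_jpdfbookmarks(entries):
    """Turn (depth, name, page) tuples into tab-indented name/page lines."""
    lines = []
    for depth, name, page in entries:
        # Assumption: "/" separates name and page, so a slash inside
        # the name would be ambiguous and is rejected here.
        if "/" in name:
            raise ValueError("slash in bookmark name is not handled: " + name)
        lines.append("\t" * (depth - 1) + f"{name}/{page}")
    return "\n".join(lines)


toc = [
    (1, "Chapter 1", 23),
    (2, "Paragraph A", 25),
    (2, "Paragraph B", 26),
    (1, "Chapter 2", 30),
    (1, "Index", 32),
]
print(to_jpdfbookmarks(toc))
```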

Benedictus · Post by **Benedictus** » 27 Apr 2013, 01:27

Interesting.

JPDFBookmarks also allows the user to set the symbols that separate each part of the bookmark, and also makes it unnecessary to escape certain symbols that must be escaped in the JPDFTweak format (the semicolon, for example).

But I went to JPDFTweak mainly because I've seen JPDFBookmarks dump incomplete bookmarks several times. I don't know why. I've never seen JPDFTweak do it. I'll give it one more chance.

The deciding factor might be error handling. JPDFTweak just pops up a rather obscure error dialog with a summary of what it had done until it failed, but it shows neither the actual error nor where in the file it occurred. Does JPDFBookmarks do this any better?

dtic · Post by **dtic** » 27 Apr 2013, 08:22

I haven't used jpdfbookmarks much and haven't gotten any errors yet.

I made an autohotkey script for Windows to automate some steps in the jpdfbookmarks import. Download here (zip with exe and source)

Requires: jpdfbookmarks
Setup: on first run the ini file is shown. Add paths to it.
Use: drop a pdf on toc.exe to import bookmarks from a previously saved toc.txt
- first line in toc.txt must be blank or a positive or negative offset value e.g. 5 or -4 or...
- additional lines in toc.txt should be in jpdfbookmarks bookmark data format, but / can be replaced with [space]
Sample toc.txt

Code: Select all

5
foreword/4
chapter 1 17
index 71

The script automatically offsets and adds / where needed:

Code: Select all

foreword/9
chapter 1/22
index/76

So all the user does manually is:
- copy table of contents from the pdf or the web into toc.txt
- make sure a page number is at the end of each line, with a forward slash or space before it
- write an offset value to the first line or leave the first line blank
- save toc.txt
- drop a pdf on toc.exe
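
For readers not on Windows, the offset-and-slash rule described above can be sketched in Python. This is not the AutoHotkey script itself and it does not call jpdfbookmarks; it only rewrites the text, and it handles only the plain "title/page" or "title page" form, not lines with extra fields such as ",FitWidth,96". The names `ENTRY` and `apply_toc` are made up for illustration.

```python
import re

# Title, then a slash or space, then the page digits at the end of the line.
ENTRY = re.compile(r"^(?P<title>.*?)[ /](?P<page>\d+)$")


def apply_toc(text):
    """First line: blank or an offset. Other lines: title, '/' or ' ', page."""
    lines = text.splitlines()
    if not lines:
        return ""
    offset = int(lines[0]) if lines[0].strip() else 0
    result = []
    for line in lines[1:]:
        if not line.strip():
            continue
        match = ENTRY.match(line.rstrip())
        if match is None:
            raise ValueError("no page number at end of line: " + line)
        page = int(match.group("page")) + offset
        # Leading tabs (hierarchy) stay part of the title and are kept.
        result.append(f"{match.group('title')}/{page}")
    return "\n".join(result)


sample_toc = "5\nforeword/4\nchapter 1 17\nindex 71\n"
print(apply_toc(sample_toc))
```

With the sample toc.txt from above, the printed result should be the same three lines as the offset listing.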

arronlee · Post by **arronlee** » 13 May 2013, 23:14

Yes, I do think it's a piece of good news for us.
Now many PDF readers offer more and more apps for simplifying our reading process.
I believe that in the near future they will allow users to do even more.
Thanks again for your sharing.
