Uploading books to the Internet Archive

Discussion about Steve DeVore's Book Scan Wizard, a power-user package to automate scan processing.

Moderator: peterZ

steve1066d
Posts: 296
Joined: 27 Nov 2010, 02:26
E-book readers owned: PRS-505
Number of books owned: 1250
Location: Minneapolis, MN
Contact:

Uploading books to the Internet Archive

Post by steve1066d »

Wouldn't it be nice to be able to share those out-of-copyright books that you've saved from moldering bookshelves by scanning them? Or perhaps you would like a simple way to convert scans of your books to text, PDF, DjVu, or other formats for your Kindle or other e-reader? The Internet Archive and Book Scan Wizard make it possible to do just that.

To upload an item you need an Internet Archive “library card” (basically an account). It’s easy to get one, but be aware that whatever email address you use will be listed as the originator of the document and will be publicly visible. So if privacy is a concern, you may want to use a throwaway email address.

What can be uploaded:

All uploads using this interface to the Archive are public and may be downloaded by anyone. However, both the BSW form and the Archive upload form have a checkbox to indicate that an upload is a test item. Test items will be processed, OCR’ed, and made available, but will not be indexed (or searchable). They will also be deleted after 30 days. Marking an upload as a test item can be useful for testing the process, or if you are uploading things only to OCR them and they shouldn’t become part of the permanent archive.

Uploading books using the Archive.org website:

One option is to upload a book using the Archive’s own interface. Click the “upload” button at the top of the Archive site. Its URL is http://www.archive.org/create/
The Archive recommends uploading a PDF file. However, a zip file of the page images can also be used, as long as its name ends with _images.zip. The Archive will accept a zip with JPEG, TIFF, or JPEG 2000 images. As part of the upload process you fill in metadata (things like the author, title, date, etc.).
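
If you are building the _images.zip by hand rather than with BSW, something like the following Python sketch would do it. The function name, the folder layout, and the choice to sort pages by file name are my own assumptions for illustration, not anything the Archive requires beyond the _images.zip suffix.

```python
import zipfile
from pathlib import Path

# Image types the Archive is described as accepting inside the zip.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".tif", ".tiff", ".jp2"}


def build_images_zip(image_dir, identifier, out_dir="."):
    """Pack the page images in image_dir into <identifier>_images.zip.

    Pages are added in file-name order, so name them so that they sort
    correctly (0001.jpg, 0002.jpg, ...). This ordering is an assumption.
    """
    pages = sorted(
        p for p in Path(image_dir).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
    if not pages:
        raise ValueError(f"no page images found in {image_dir}")

    zip_path = Path(out_dir) / f"{identifier}_images.zip"
    # The images are already compressed, so store them without recompressing.
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_STORED) as zf:
        for page in pages:
            zf.write(page, arcname=page.name)
    return zip_path
```

Calling build_images_zip("scans", "mybook") would write mybook_images.zip into the current folder, skipping anything in "scans" that is not one of the listed image types.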

Uploading books using Book Scan Wizard:

Book Scan Wizard has a new feature that allows you to easily upload books to the Internet Archive. It can be run either interactively or as part of a batch process. The easiest way to start it is by using the Web Start version which can be accessed from this link: http://bookscanwizard.sourceforge.net/run

For an example of a book created with the upload feature, see this book. It was created using a “New Standard” book scanner, Book Scan Wizard, and a pair of Canon A480 cameras.

Here’s the process: In the menu under tools, choose “Prepare for Uploading…” and it will bring up the following screen:
upload.png

Fill in the information for the book, and it will add to the BSW script the metadata and commands to create a zip file for uploading to the Archive.

The access key and secret key are a special ID and password used only for transfers. You get them from here. (Or press the “Lookup Keys” button, which will also bring you to the right page.)

The identifier becomes part of the URL for the book. For the Archive's own books it is usually a combination of the title and the author, but it can be whatever you want. Letters, numbers, periods (.), hyphens (-), and underscores (_) are the permitted characters for the identifier. All other fields can accept any characters. If needed, a field can be repeated on multiple lines. For example, if there are multiple authors, you can add the additional authors by adding additional “creator” lines to the other metadata section.
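
To check an identifier before uploading, a small Python sketch like this one covers the character rule described above. The function name is made up for illustration, and the Archive may enforce additional rules (such as length limits or uniqueness) that this does not check.

```python
import re

# Only letters, digits, periods, hyphens, and underscores are allowed.
_IDENTIFIER_RE = re.compile(r"[A-Za-z0-9._-]+")


def is_valid_identifier(identifier):
    """Return True if identifier uses only the permitted characters."""
    return _IDENTIFIER_RE.fullmatch(identifier) is not None
```

For example, is_valid_identifier("BigBookOfFairyTalesA") passes, while "Big Book of Fairy Tales" fails because of the spaces.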

Once you press Ok, the following configuration will be added automatically:

Metadata = identifier: BigBookOfFairyTalesA
Metadata = title: Big Book of Fairy Tales
Metadata = creator: Gustave Doré
Metadata = date: 1896
Metadata = subject: Childrens fairy tales
Metadata = description: Hardcover title is Favorite Fairy Tales
Metadata = keywords: childrens, fairy tales
CreateArchiveZip = archive.zip 10:1
# Uncomment the following line to send to the archive as part of this job.
#SaveToArchive = archive.zip xxxxxxxxxxxxxxxxx xxxxxxxxxxx
To actually send it, you can do it as part of the processing by uncommenting SaveToArchive. Or, if you have previously created the zip file, you can upload it by choosing Tools, Upload to the archive from the menu. Another option is to queue up your books and send them as a batch by using the -upload feature from the command line. (See the command line help for more information.)
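
If you would rather generate the Metadata lines for many books than type them into the dialog, the format shown in the configuration above is simple enough to produce from a script. This Python sketch is only an illustration: the function name is hypothetical, and it relies on nothing beyond the "Metadata = key: value" layout, repeating a key once per value for cases such as multiple creators.

```python
def metadata_lines(fields):
    """Turn a dict of metadata into BSW-style 'Metadata = key: value' lines.

    A value may be a single string or a list/tuple of strings; a list
    produces one line per entry (for example, several creators).
    """
    lines = []
    for key, value in fields.items():
        values = value if isinstance(value, (list, tuple)) else [value]
        for v in values:
            lines.append(f"Metadata = {key}: {v}")
    return lines


book = {
    "identifier": "BigBookOfFairyTalesA",
    "title": "Big Book of Fairy Tales",
    "creator": ["Gustave Doré"],
    "date": "1896",
}
print("\n".join(metadata_lines(book)))
```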

You can also create a zip file some other way, then use the command line option to send it to the Archive. To do that, zip up your images and include an XML file with the metadata. The images can be named whatever you like and will be saved in alphabetical order.

If you want to see an estimate of the size the zip file will be, you can right-click the CreateArchiveZip line. It will return this:
size.png


Then adjust the compression setting (the 10:1 in the example above) until you have a result you like.

How to Scan Books for the Archive:
While the Archive will accept any sort of scans, it is nice to provide scans that match the Archive's own works. For that, it is best if the scans meet the following criteria:
  • They should have a resolution of 300-600 DPI.
  • They should be full color images that closely resemble the actual book pages. The Internet Archive prefers color images because they have found people like reading the book with the original look intact.
  • The pages should be deskewed and cropped.
  • You should provide good metadata such as title, author, date, subject, keywords, etc.
Tips for creating good scans to send to the Archive:

Full color images often take a bit of tweaking before they look really good. Ideally you want the left and right pages to be consistent with each other, and the colors to match the original. BSW can help with that.

Once you have corrected for perspective distortion and cropped the image, it is good to increase the contrast of the image a bit. Try right-clicking the image and choosing “autolevels.” This will give you a good starting point, but feel free to adjust the black and white levels until they appear accurate. The books done with the Internet Archive’s Scribe scanners use the equivalent of the following, which may be helpful as a starting point if you are starting with well-exposed images:
Levels = 12 94
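
For anyone curious what a levels adjustment does to the pixels, here is a rough Python sketch. It assumes the two numbers are the black point and white point expressed as percentages of full brightness; that reading of the Levels arguments is my assumption here, so check the BSW command help before relying on it. The function is not BSW's code, just the usual linear stretch.

```python
def apply_levels(value, black_pct, white_pct):
    """Stretch one 8-bit channel value between a black and a white point.

    black_pct and white_pct are assumed to be percentages of full scale:
    anything at or below the black point becomes 0, anything at or above
    the white point becomes 255, and values in between are scaled linearly.
    """
    black = 255 * black_pct / 100
    white = 255 * white_pct / 100
    scaled = (value - black) / (white - black) * 255
    return max(0, min(255, round(scaled)))
```

With a black point of 12 and a white point of 94, slightly gray paper is pushed to pure white and slightly washed-out ink is pushed to pure black, which is where the extra contrast comes from.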

Also, if the saturation doesn’t look right (for example, there is more color in the image than there was in the original), the Saturation command can be used. Or if the brightness is off, try adjusting it with the Brightness command. If your lighting isn’t quite consistent, it is sometimes necessary to adjust only the left or right images to make them match better. It’s pretty much trial and error until you get the results looking the way you like. The good thing is that once you figure out the settings that work for you, you will not need to adjust them much for other books.

It’s recommended that you use a lossy compression ratio between 10:1 and 20:1 for the transfer. For example, at 10:1, an image that was a 10 meg uncompressed TIFF would be about a 1 meg .jp2 file. BSW will default to a 10:1 compression, which works well for 300 DPI images. If you are providing scans closer to 600 DPI you will probably want to use a higher compression to keep the transfer sizes manageable.
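
The arithmetic behind that advice can be sketched in a few lines of Python. The 6 x 9 inch page size is just an illustrative assumption, and the functions are not part of BSW; real .jp2 sizes will vary a little from the target ratio.

```python
def uncompressed_bytes(width_in, height_in, dpi, channels=3):
    """Raw size of one page image: pixels wide x pixels high x channels."""
    return round(width_in * dpi) * round(height_in * dpi) * channels


def compressed_bytes(raw_bytes, ratio):
    """Approximate size after lossy compression at ratio:1."""
    return raw_bytes / ratio


# A hypothetical 6 x 9 inch page in full color (3 bytes per pixel).
for dpi in (300, 600):
    raw = uncompressed_bytes(6, 9, dpi)
    for ratio in (10, 20):
        size_mb = compressed_bytes(raw, ratio) / 1_000_000
        print(f"{dpi} DPI at {ratio}:1 is about {size_mb:.2f} MB per page")
```

Doubling the resolution quadruples the pixel count, so a 600 DPI page at 20:1 is still about twice the size of the same page at 300 DPI and 10:1.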

The Archive will accept a zip file containing JPEG, TIFF, and JP2 files. BSW uses JP2 because it gives the most control over the file size and somewhat better compression than JPEG.

While it is preferable to transfer color images, there may be times when you need to do the transfer as grayscale or black and white. Color images are quite large, and if you have a slow connection it might not be feasible to transfer them. Grayscale images are about a third the size of full color, and black and white images are even smaller. Or, if you can’t get a good color image, it may be best to save it as grayscale or black and white.

How long will it take to process?

Depending on what kind of compression you are using and the length of the book, the zip files will be around 200-800 megs, so they can take quite a while to transfer, depending on your connection.
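
For a rough idea of how long that is, divide the file size by your upload speed. The Python sketch below does that; the function name and the example speeds are illustrative assumptions, and it ignores protocol overhead and any throttling, so treat the result as a lower bound.

```python
def transfer_hours(size_megabytes, upload_mbps):
    """Estimate upload time in hours.

    size_megabytes is the zip size in megabytes (1 MB = 8,000,000 bits);
    upload_mbps is the upload speed in megabits per second.
    """
    bits = size_megabytes * 8_000_000
    seconds = bits / (upload_mbps * 1_000_000)
    return seconds / 3600


for size in (200, 800):
    for speed in (1, 10):
        hours = transfer_hours(size, speed)
        print(f"{size} MB at {speed} Mbit/s: about {hours:.1f} hours")
```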

After the file is uploaded, it sets in motion a series of steps that end with the book OCR’ed and converted to PDF, DjVu, Kindle, and other formats. The process will take anywhere from an hour or so to a few days, depending on how backed up the Archive is. You can check on the progress by logging into the Archive, choosing patron info, then choosing tasks that are not yet completed.


For further information:

For more information about uploading books to the archive you can check these links out:

General overview on uploading content:
http://www.archive.org/about/faqs.php#Uploading_Content

Information on the _images.zip format:
http://raj.blog.archive.org/2011/02/24/ ... e-uploads/

Detailed information for Internet Archive partners. This has some good information on the Internet Archive process for scanning documents:
http://www.archive.org/details/ProcessDocument

Information on the protocol Book Scan Wizard uses to communicate with the Archive:
http://www.archive.org/help/abouts3.txt
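
As a rough picture of what that protocol looks like, the Python sketch below builds the URL and headers for an S3-style PUT without sending anything. It reflects my reading of the document linked above (the s3.us.archive.org endpoint, the "LOW accesskey:secret" authorization header, and x-archive-meta-* metadata headers), so treat those details as assumptions and check them against that page. The function name is hypothetical, this is not BSW's actual code, and it assumes plain ASCII metadata values.

```python
def build_upload_request(identifier, filename, access_key, secret_key, metadata):
    """Return (url, headers) for uploading one file to an Archive item.

    Nothing is sent; the caller would PUT the file body to url with
    these headers. The endpoint and header names are assumptions taken
    from the Archive's S3-like API notes.
    """
    url = f"http://s3.us.archive.org/{identifier}/{filename}"
    headers = {
        # Keys from the Archive account, used only for transfers.
        "authorization": f"LOW {access_key}:{secret_key}",
        # Ask the Archive to create the item if it does not exist yet.
        "x-archive-auto-make-bucket": "1",
    }
    for key, value in metadata.items():
        headers[f"x-archive-meta-{key}"] = value
    return url, headers


url, headers = build_upload_request(
    "BigBookOfFairyTalesA",
    "BigBookOfFairyTalesA_images.zip",
    "ACCESSKEY",   # placeholder, not a real key
    "SECRETKEY",   # placeholder, not a real key
    {"title": "Big Book of Fairy Tales", "date": "1896", "mediatype": "texts"},
)
print(url)
```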
Steve Devore
BookScanWizard, a flexible book post-processor.
StevePoling
Posts: 290
Joined: 20 Jun 2009, 12:19
E-book readers owned: SONY PRS-505, Kindle DX
Number of books owned: 9999
Location: Grand Rapids, MI
Contact:

Re: Uploading books to the Internet Archive

Post by StevePoling »

Bravo!
daniel_reetz
Posts: 2812
Joined: 03 Jun 2009, 13:56
E-book readers owned: Used to have a PRS-500
Number of books owned: 600
Country: United States
Contact:

Re: Uploading books to the Internet Archive

Post by daniel_reetz »

Unbelievable! Fantastic!

I want to point out to everyone here that this is a free OCR option for everyone on the forum - upload a book using Book Scan Wizard and the Archive will return an OCR'd copy.
Ann

Re: Uploading books to the Internet Archive

Post by Ann »

Hey Dan - This is great for the DIY community.

After reading the whole thing, it describes what I'm already doing, so I don't know if I'll use it - I have my system down with unskewing, WB, Tone, Clarity, etc. And, since no OCR program can OCR hand-writing, I'm stuck with transcribing everything.

Is there something I'm missing, though? Is this different than what I've been doing? Thanks, Ann
Misty
Posts: 481
Joined: 06 Nov 2009, 12:20
Number of books owned: 0
Location: Frozen Wasteland

Re: Uploading books to the Internet Archive

Post by Misty »

This is totally awesome forever, Steve.
The opinions expressed in this post are my own and do not necessarily represent those of the Canadian Museum for Human Rights.
rob
Posts: 773
Joined: 03 Jun 2009, 13:50
E-book readers owned: iRex iLiad, Kindle 2
Number of books owned: 4000
Country: United States
Location: Maryland, United States
Contact:

Re: Uploading books to the Internet Archive

Post by rob »

It's not just for OCR! Uploading to the Internet Archive lets everyone access your book (or other paper ephemera, which differentiates IA from Google Books). I've been uploading old postcards from the late 19th and early 20th centuries. One day someone might want to look at them, or sample them for their fonts, or use them for historical research. The point is that if I hadn't uploaded these postcards, they would be inaccessible, or could be lost forever.

Well, those are my reasons, anyway.

As for having your own workflow, that's not a problem: when you're done, just give the images to Book Scan Wizard, have it do no processing, and then you can go directly to the upload step.
The Singularity is Near. ~ http://halfbakedmaker.org ~ Follow me as I build the world's first all-mechanical steam-powered computer.