Just as my books sitting in the corner have no utility if they are not being read, book scan images have no value if they are not shared and archived.
Needed:
1. Space to host images, websites, and databases (for example, CouchDB, which replicates easily to other databases).
2. A uniform image naming system that retains important information about the scan.
Suggested sample:
(BT)BookTitle_(ALN)AuthorLastName_(AFN)AuthorFirstName_(ISBN10)ISBN10_(ISBN13)ISBN13_(PN)PageNumber_(PT)PageTotal_(PB)Publisher_(PBD)PublicationDate_(SID)IDofPersonScanning_(L)Language_(SYMD)DateScanned_F.Format
BT-Hackers:-Heroes-of-the-Computer-Revolution_ALN-Levy_AFN-Steven_ISBN10-0141000511_ISBN13-978-0141000510_PN-1_PT-464_PB-Penguin_PBD-2001-01-02_SID-123456789_L-en_SYMD-2010-03-21_F.tiff
BT Hackers Heroes of the Computer Revolution
ALN Levy
AFN Steven
ISBN10 0141000511
ISBN13 978-0141000510
PN 1
PT 464
PB Penguin
PBD 2001-01-02
SID 123456789
L en
SYMD 2010-03-21
F tiff
Title: Hackers: Heroes of the Computer Revolution
Paperback: 464 pages
Publisher: Penguin (Non-Classics); Updated edition (January 2, 2001)
Language: English
ISBN-10: 0141000511
ISBN-13: 978-0141000510
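To show how a filename like the sample could be assembled from the book details, here is a minimal Python sketch. The function name build_scan_name and the FIELD_ORDER list are my own placeholders, not part of any existing tool. It assumes every field is written as TAG-value (so the publication date is PBD-2001-01-02, with a hyphen), that spaces in a value become hyphens, and that underscores are rejected inside values because they are reserved as the field separator.

```python
# Tag order follows the sample filename; the names are illustrative only.
FIELD_ORDER = ["BT", "ALN", "AFN", "ISBN10", "ISBN13", "PN", "PT",
               "PB", "PBD", "SID", "L", "SYMD"]


def build_scan_name(fields, file_format):
    """Join TAG-value pairs with underscores and end with F.<format>."""
    parts = []
    for tag in FIELD_ORDER:
        # Spaces become hyphens so the name stays a single token per field.
        value = str(fields[tag]).strip().replace(" ", "-")
        if "_" in value:
            # Underscores separate fields, so they cannot appear in a value.
            raise ValueError(f"underscore not allowed in {tag}: {value!r}")
        parts.append(f"{tag}-{value}")
    parts.append(f"F.{file_format}")
    return "_".join(parts)


fields = {
    "BT": "Hackers: Heroes of the Computer Revolution",
    "ALN": "Levy",
    "AFN": "Steven",
    "ISBN10": "0141000511",
    "ISBN13": "978-0141000510",
    "PN": 1,
    "PT": 464,
    "PB": "Penguin",
    "PBD": "2001-01-02",
    "SID": "123456789",
    "L": "en",
    "SYMD": "2010-03-21",
}
name = build_scan_name(fields, "tiff")
print(name)
```

One thing this sketch does not handle: the colon in the title is kept as in the sample, but some file systems (Windows, for instance) do not allow colons in file names, so a group would have to agree on a replacement character.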
This naming convention can be easily split along the underscores and the output loaded into a database with a script.
It is also human readable.
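As a rough illustration of the splitting step, here is a short Python sketch. The names parse_scan_name and TAGS and the scans table are hypothetical, and SQLite (from the Python standard library, in memory) is only a stand-in for whatever database the group settles on. It assumes every field is written as TAG-value with a hyphen, including PBD-2001-01-02, and that the last field is F.<format>.

```python
import sqlite3

# Tags in the order they appear in the sample filename (illustrative only).
TAGS = ["BT", "ALN", "AFN", "ISBN10", "ISBN13", "PN", "PT",
        "PB", "PBD", "SID", "L", "SYMD", "F"]


def parse_scan_name(filename):
    """Split a scan filename on underscores into a dict of tag -> value."""
    tokens = filename.split("_")
    last = tokens.pop()
    if not last.startswith("F."):
        raise ValueError(f"expected the name to end with F.<format>: {last!r}")
    fields = {"F": last[2:]}
    for token in tokens:
        # Only the first hyphen separates tag from value; later hyphens
        # (in titles, ISBN-13s, and dates) stay part of the value.
        tag, sep, value = token.partition("-")
        if not sep or not value:
            raise ValueError(f"field is not in TAG-value form: {token!r}")
        fields[tag] = value
    return fields


name = ("BT-Hackers:-Heroes-of-the-Computer-Revolution_ALN-Levy_AFN-Steven_"
        "ISBN10-0141000511_ISBN13-978-0141000510_PN-1_PT-464_PB-Penguin_"
        "PBD-2001-01-02_SID-123456789_L-en_SYMD-2010-03-21_F.tiff")
record = parse_scan_name(name)

# Load the parsed fields into an in-memory table, one column per tag.
conn = sqlite3.connect(":memory:")
columns = ", ".join(f'"{tag}" TEXT' for tag in TAGS)
conn.execute(f"CREATE TABLE scans ({columns})")
placeholders = ", ".join("?" for _ in TAGS)
conn.execute(f"INSERT INTO scans VALUES ({placeholders})",
             [record.get(tag) for tag in TAGS])
row = conn.execute('SELECT "ALN", "PN", "PT" FROM scans').fetchone()
print(row)
```

The values are stored exactly as they appear in the filename, so the title keeps its hyphens; turning them back into spaces would be a separate step, since hyphens are also legitimate inside dates and ISBNs.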
Reasons for a standardized file naming convention:
If a group is to share images, for example for OCR, the work can be divided: one part of the group could scan, another handle naming, another run OCR, another do quality control, another handle archiving, another replication and backup, and others access and sharing of either the images or the final output.
A shared naming convention lets each of these groups work from the same files without losing track of which book and page an image belongs to.
I have some space in an Ubuntu 9.04 VPS that could be used for testing purposes.
I have root access and can install any software needed for image processing or OCR (tesseract, ocrad).
We will need people to generate and contribute high-quality images with a naming format that is robust, uniform, human readable, and machine processable.
Is anybody interested in collaborating?
Please post comments here if interested in contributing to the building of a distributed digital library.
