Creating smaller PDFs

cday
Posts: 451
Joined: 19 Mar 2013, 14:55
Number of books owned: 0
Country: UK

Re: Creating smaller PDFs

Post by cday »

I have now had a closer look at the Abbyy FineReader output files above, which BruceG created using a later version than my own. While I haven't made a detailed side-by-side comparison with the Adobe Acrobat Standard XI files I posted, I am certainly impressed, and can see that one or more of the FineReader files could probably provide an equivalent or better solution to Oliver's issue, with the advantage of a significantly smaller file size. However, Acrobat has also been further developed, and I believe the current version may attempt to identify the font(s) used in the source file, which would likely bring file sizes down towards those of the FineReader versions.

I assume that the alternative PDF outputs probably all compress the image content in the same way, although that might not be the case, and that any file size reduction comes from a reduction in the size of the text content. In that respect I would be interested to hear from BruceG which file format and compression options he set for the images, and also which font he selected for the text: unless the font used has been embedded, the text appearance and possibly the line breaks might depend on the computer on which the output is viewed. To be sure, I would also like to know whether he needed to edit the FineReader text recognition results, which could substantially increase the time required to process each source file. And finally, whether he was able to set up FineReader so that multiple source files could be processed without any further changes to the settings.

On a small point, when I was assisting with a club archive project I avoided content being saved in MS Word formats, as when viewed on a shared computer the content could be inadvertently modified and then resaved. That shouldn't be an issue for a single user, and I don't know whether the docx format has an option to lock the content to avoid that issue.
BruceG
Posts: 99
Joined: 14 May 2014, 23:17
Number of books owned: 500
Country: Australia

Re: Creating smaller PDFs

Post by BruceG »

Christmas day has started here, so I will not have time today. Tomorrow I will have more time.
dpc
Posts: 379
Joined: 01 Apr 2011, 18:05
Number of books owned: 0
Location: Issaquah, WA

Re: Creating smaller PDFs

Post by dpc »

cday wrote: I assume that the alternative PDF outputs probably all compress the image content in the same way, although that might not be the case, and that any filesize reduction comes from a reduction of the size of the text content.
It would be interesting to remove the four pages in that sample PDF that contain the images and then see what sort of file sizes you get.

I'm assuming that if you hand Adobe Acrobat an image to add to a PDF, that any sort of compression it performs on that image will be lossless? Might be worthwhile to lower the JPEG quality before adding the image to the PDF and see what that gains you.
cday
Posts: 451
Joined: 19 Mar 2013, 14:55
Number of books owned: 0
Country: UK

Re: Creating smaller PDFs

Post by cday »

dpc wrote: 25 Dec 2022, 02:35
[cday wrote] I assume that the alternative PDF outputs probably all compress the image content in the same way, although that might not be the case, and that any filesize reduction comes from a reduction of the size of the text content.
It would be interesting to remove the four pages in that sample PDF that contain the images and then see what sort of file sizes you get.
When FND excerpt.pdf above was first posted I actually split a copy into separate PDF files for the text and image pages; reassuringly, the two file sizes added up to the size of the original file. As I am mainly in Linux now I used the free basic version of PDFsam, which is also available for Windows. I then opened the images file in my go-to image editor XnView MP, which uses Ghostscript to rasterise PDF files, and set the rasterisation resolution to 300 DPI, the declared value for the scans it contained.

My idea was to test resizing the original four-page file with different JPEG quality values, and I soon came to the conclusion from the file sizes obtained that the Abbyy FineReader output was probably already well compressed. But testing higher and higher compression settings, even improbably high values, I managed to further reduce it without very much obvious loss of quality.

The PDF format supports most common image formats, although the publicly available ISO 32000 specification refers to compression options rather than file formats. Be warned that it is not an easy read without some background knowledge! The JPEG 2000 format, which can produce smaller file sizes for colour images, is supported although it never gained much traction, and on a slow (now probably mainly older) computer the substantial extra decoding required can result in a visible delay when stepping through the pages of a PDF file.

dpc wrote: I'm assuming that if you hand Adobe Acrobat an image to add to a PDF, that any sort of compression it performs on that image will be lossless? Might be worthwhile to lower the JPEG quality before adding the image to the PDF and see what that gains you.
I wouldn't assume that, although that might be the case! In the case of the JPEG format (or the compression method used in JPEG files) 'lossless' probably doesn't have any meaning, and as a practical matter very high quality values will rarely be appropriate when file size is important: the increase in file size is quite disproportionate to any perceived gain in quality.
cday
Posts: 451
Joined: 19 Mar 2013, 14:55
Number of books owned: 0
Country: UK

Re: Creating smaller PDFs

Post by cday »

Second post today:

Returning to Oliver's file size issue, when the text content is vectorised by whichever software, the file size of the final PDF will be largely determined by the size of the images it contains. Based on quick tests that I have performed, I suspect that acceptable image quality, about equal to the image quality in the original PDF file uploaded, and a substantial reduction in file size, could probably be obtained by scanning at a lower resolution of 200 or even 150 DPI, and adding a small sharpening stage to the processing used.

Scanning at a lower resolution is unlikely to significantly affect the quality of vectorised text, based on my past experience with Adobe Acrobat at least, and would probably more than halve the file size of the final PDF image content, and on a flatbed scanner should also be usefully quicker.

Strip 1.jpg

In the above combined image, the image on the left is the image from the original example file, rasterised in XnView MP at the original 300 DPI at which it was created, and the image on the right is the same image down-sampled to 150 DPI with a small amount of sharpening applied. The sharpening seems to adequately recover the slight loss of focus seen in my earlier test of the same image down-sampled to 150 DPI, and the file size of the processed image is 40% of that of the original uploaded. So, if confirmed in further tests, this could give a substantial reduction in the overall size of the final archive PDFs when combined with vectorised text. It would be best confirmed with the same page rescanned at 150 or 200 DPI, though, or better with a short run of image pages.
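As a rough sanity check on the resolutions discussed above, the following sketch compares pixel counts at 300, 200 and 150 DPI. The 6 x 9 inch page size is a hypothetical value chosen only for illustration, and the function names are not from any tool mentioned in the thread. Pixel count scales with the square of the DPI, but JPEG file size does not track pixel count exactly, which fits the test image coming out at 40% of the original file size rather than 25%.

```python
def pixel_dimensions(width_in, height_in, dpi):
    """Return (width, height) in pixels for a page scanned at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


def pixel_ratio(new_dpi, old_dpi):
    """Fraction of the original pixel count left after changing resolution."""
    return (new_dpi / old_dpi) ** 2


if __name__ == "__main__":
    # Hypothetical page size in inches; not taken from the thread.
    page_w, page_h = 6.0, 9.0
    for dpi in (300, 200, 150):
        w, h = pixel_dimensions(page_w, page_h, dpi)
        share = pixel_ratio(dpi, 300)
        print(f"{dpi} DPI: {w} x {h} px, {share:.0%} of the 300 DPI pixel count")
```

Whether the saving on a real rescan matches a down-sampled test still has to be measured, as noted above.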
Oliver
Posts: 7
Joined: 19 Dec 2022, 14:14
E-book readers owned: Kindle
Number of books owned: 300
Country: Deutschland

Re: Creating smaller PDFs

Post by Oliver »

Hello everyone,

thank you all for your rich input. Searchability is also very important for me. I realised that converting the scanned text into Word and then into a PDF again isn't an option for me, because it creates too many errors. Over the last few days I have tried to understand how to use OCRmyPDF, and in my opinion this program is really good for making a PDF searchable.

The approach of saving the images at 150 dpi and then sharpening them seems to be a good method for reducing file sizes in my example. I'll try that out, especially because another scan I did of a different book of the same series, about the same genus of fungi, turned out with very bad black and white drawings.

But overall you are all right that big files aren't that big a problem anymore in 2022. That's why I won't do everything possible to get the file as small as possible, but because I like to use those files online and share them privately, small sizes are very handy in that respect.

Greetings
Oliver
Oliver
Posts: 7
Joined: 19 Dec 2022, 14:14
E-book readers owned: Kindle
Number of books owned: 300
Country: Deutschland

Re: Creating smaller PDFs

Post by Oliver »

That's an excerpt of the file where the drawings look pretty bad. I'll look into sharpening PDFs and will tell you in the next few days how it went and how much I could save on file sizes.
FNDexcerpt_2ocr.pdf
(349.19 KiB) Downloaded 36 times
I really wasn't expecting book scanning to be such a complex topic overall, but it's fascinating how much is possible today.
cday
Posts: 451
Joined: 19 Mar 2013, 14:55
Number of books owned: 0
Country: UK

Re: Creating smaller PDFs

Post by cday »

Oliver wrote: 25 Dec 2022, 17:54 ... another scan I did of a different book of the same series about the same genus of fungi turned out with very bad black and white drawings.
If you scan in black and white you will probably need to scan at higher resolution, 400 DPI or even 600 DPI, in order to obtain reasonable quality, as there will be no anti-aliasing at the edges of characters. That will be slower on a flatbed scanner, but if you then save the black and white files with CCITT G4 compression (or better JBIG2 if available) the resulting file sizes could be much smaller. But scanning in grayscale at a lower resolution will probably produce better results if the file sizes are acceptable.
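To put rough numbers on that trade-off, here is a small sketch of the uncompressed raster size for each scanning mode. The 6 x 9 inch page is a hypothetical size used only for illustration, and the bit depths are the usual 1, 8 and 24 bits per pixel. These are sizes before any compression; CCITT G4 or JBIG2 would then shrink the black and white data further, by an amount that depends on the page content.

```python
# Usual bit depths per pixel for each scanning mode.
BITS_PER_PIXEL = {"black and white": 1, "grayscale": 8, "colour": 24}


def raw_size_bytes(width_in, height_in, dpi, mode):
    """Uncompressed raster size in bytes for one page."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * BITS_PER_PIXEL[mode] // 8


if __name__ == "__main__":
    # Hypothetical page size in inches; not taken from the thread.
    page = (6.0, 9.0)
    for dpi, mode in [(600, "black and white"), (400, "black and white"),
                      (300, "grayscale"), (200, "grayscale")]:
        size_mb = raw_size_bytes(*page, dpi, mode) / 1_000_000
        print(f"{mode} at {dpi} DPI: {size_mb:.2f} MB uncompressed")
```

On these assumptions, a 600 DPI black and white page holds about half the raw data of a 300 DPI grayscale page, even before the compression step.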
BruceG
Posts: 99
Joined: 14 May 2014, 23:17
Number of books owned: 500
Country: Australia

Re: Creating smaller PDFs

Post by BruceG »

Hi
When I first started doing OCR I used a version of Omnipage that came with the scanner I was using at the time. I then volunteered to convert 90+ years of magazines so that material could easily be searched and extracted for a 100-year anniversary book. The more recent years had been saved digitally, so scanning them was not required. A paid version of Omnipage was used for this. Omnipage only had a single layer of text and pictures. Fonts were a problem, as a single line could contain a number of fonts as a result of OCR. Within 'text boxes', changes to formatting and justification were easy to make, which was good because the edited layer is what was used to produce the pdf. A fair amount of editing was required. Each year's magazines were bound with a hard cover, which, together with the low dpi chosen, did not help those doing the scanning and the OCR. As there was only one layer, every year looked the same: text and pictures on a white background. It was hard to see that some issues were nearly 100 years old. Searchability was the priority, so index files were created with Acrobat so that all the years could be searched at the same time by anyone with Adobe Reader.
For myself, I OCR-ed books for use with my ereader, so text again was the thing. The image was not required. Fonts in books are usually simple, a few from the same family, and for an ereader this is ideal.
I then volunteered at an archive where the image of the magazine, book, newspaper etc. was equally important, and this is when I found that a different OCR program was needed. I found Abbyy FineReader and have used it ever since. Instead of a single layer it has two layers. Omnipage may have two layers, the image seen on the monitor and the text that can be copied and pasted (i.e. both being the same). The original image of the document is discarded. In Abbyy they are different; this is also true for Adobe Acrobat and all pdf readers. Both layers can be edited separately in a few programs, and Abbyy is one of these. Abbyy also allows either layer to be on top, whereas in most programs the image layer is on top. This is why it is difficult to see how good the OCR is unless you copy and paste into something like Word.
Scanning quality is important, so I usually scan books and magazines at 400 dpi, and newspapers or small fonts at 600 dpi, in colour. Cday asked about compression in Abbyy: there are 4 settings for images, Best, Balanced, Compact and Custom. Custom has more options, including dpi and whether loss is allowed or not. The custom one was set to 300 dpi with loss not allowed.
Examples of all:
First 7 pages of text, a very small file
FND excerptAbbyy7pageTonT.pdf
(110.79 KiB) Downloaded 32 times
The 3 image levels
FND excerpt. 4pageCompactpdf.pdf
(742.82 KiB) Downloaded 33 times
FND excerpt 4pageBalanced.pdf
(1.08 MiB) Downloaded 28 times
FND excerpt 4pageBest.pdf
(1.42 MiB) Downloaded 38 times
Custom settings: 300 dpi, no loss of quality. This increased the size by about 15 times.
FND excerpt 4pageCustomNoLoss.pdf
(18.67 MiB) Downloaded 34 times
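For reference, the sketch below works out how much larger the no-loss file is than each of the three presets, using the attachment sizes listed above. The post does not say which file the "about 15 times" figure is measured against, so all three baselines are shown; the names in the code are only illustrative.

```python
KIB_PER_MIB = 1024

# Attachment sizes as listed in the post, converted to MiB.
sizes_mib = {
    "Compact": 742.82 / KIB_PER_MIB,
    "Balanced": 1.08,
    "Best": 1.42,
    "Custom, no loss": 18.67,
}


def size_factor(sizes, name, baseline):
    """How many times larger the file `name` is than the file `baseline`."""
    return sizes[name] / sizes[baseline]


if __name__ == "__main__":
    for preset in ("Compact", "Balanced", "Best"):
        factor = size_factor(sizes_mib, "Custom, no loss", preset)
        print(f"Custom (no loss) is {factor:.1f}x the size of {preset}")
```

The quoted factor of about 15 sits between the ratios for the Best and Balanced files.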
While saving to pdf, Abbyy suggested that scanning at 600 dpi would produce better results.
As for format, it is plain pdf; I do not save as PDF/A or PDF/UA. MRC compression can be on or off.
My settings are Image quality = Balanced, with MRC compression.
I let Abbyy decide what fonts to use. I see that 31 are ticked, but I am not sure if this is the default. Sometimes I try to match a font in magazines by selecting text and then selecting the font while editing. It is here that symbols are added if not auto-selected. For the document, Arial, CourierNew and TimesNewRoman were chosen. I do not find fonts much of a problem, unlike in Omnipage. Fonts can be embedded as well. If the image is on top, fonts do not matter much except for copy and paste.
Text on top is not something I have seen other than with Abbyy; I find it useful sometimes when editing. Text on top will show the image of the text through the text layer if there is no text where it should be. An example is p4 of the document.
File size is a problem when emailing. Dropbox, OneDrive or a similar product will overcome this problem. Dropbox is free for 2GB. A newspaper I worked on ranges from 60+ to 400+ MB per year, making 20GB including the index, so I use OneDrive.
I edit all scans to make sure text and pictures have been picked up correctly. Sometimes text goes across columns instead of up and down, or a picture is seen as text, etc.
For those documents where the text is important, I let Abbyy show me the words it thinks may be wrong so that I can accept them or correct the error.
For a few documents I look at each page to see if there are errors Abbyy missed.

Editing does take time - I might do it too much. Having a project file as Abbyy does I can return to it at any time to do more editing.

I do not change settings between jobs as I have not seen the need. Sometimes I turn off deskew and input the pages again, mostly for magazine covers, because of angled lines or text.
For the magazine I first mentioned, most of the scanning had already been completed and a searchable version was needed quickly, so I made incremental improvements over 11 versions instead of finishing one year before going on to the next.
BruceG
Posts: 99
Joined: 14 May 2014, 23:17
Number of books owned: 500
Country: Australia

Re: Creating smaller PDFs

Post by BruceG »

The subject of making pdfs smaller has raised my interest. I have wondered for a while now about having or not having pictures in what I call the text or not-seen layer. Pictures are not searchable, so what value are they? You can see them when you read a pdf file, so is there any value in having them in the second layer along with text that can be copied and pasted? If they weren't there, would it save on the size of a pdf?

Today was the day to find out. I picked a year of monthly magazines from 1997: 417 pages in all, including 1 about the coming year. 81,261 KB.
Off I started with a copy of the Abbyy project file, deleting all pictures or turning them into text if that was what they really were. As I progressed I wondered if this was a waste of time. I had thought of doing the 'FND excerpt' file but thought, stay in the real world with original scans at a decent dpi. As I continued my wondering got the better of me, so I attacked Oliver's file.
This is the result
FND excerptSmallestNormal.pdf
(1.13 MiB) Downloaded 30 times
1156 KB. The original was 2439 KB and my smallest previous one 1211 KB.
It looked like the afternoon was going to be a waste of time if this was a typical saving.
I pressed on, thinking that I was doing a real-life test and that it was worthwhile to find out the truth, which would change depending on what percentage of pictures were in the original material. Was there any value in having pictures in this layer, for copy and paste maybe? Have you ever copied and pasted a picture from a pdf?
Original size 81,261 KB
With pictures removed 41,026 KB
Not quite a 50% reduction.
Would I make this part of my process? I would have to think about that some more. Is there more to it that I haven't thought of yet?
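The savings quoted in this post can be checked with a few lines of arithmetic. This is only a sketch using the KB figures given above; the function name is illustrative.

```python
def reduction_percent(before_kb, after_kb):
    """Percentage by which a file shrank, relative to its starting size."""
    return (before_kb - after_kb) / before_kb * 100


if __name__ == "__main__":
    # (description, size before in KB, size after in KB), all from the post.
    cases = [
        ("1997 magazines, pictures removed", 81_261, 41_026),
        ("FND excerpt, original vs new smallest", 2_439, 1_156),
        ("FND excerpt, previous smallest vs new smallest", 1_211, 1_156),
    ]
    for label, before, after in cases:
        print(f"{label}: {reduction_percent(before, after):.1f}% smaller")
```

The third case is the one that made the afternoon look like a waste of time: against the previous smallest version of Oliver's file, removing the pictures from the text layer saved under 5%, while on the full year of magazines it saved just under half.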