Web optimised PDF

Scribus' PDF output is optimized for print shops, and Scribus takes pains to ensure that the printed output looks identical on different printing presses. The price for this consistency is a file several times larger than a straightforward PDF would be. This tip shows a way to reduce the file size so you can put the PDF file on the web or even distribute it by e-mail.

My wife edits and lays out a newsletter that, being distributed mostly via e-mail and the web, must stay below about 1 MB, preferably even smaller. Since the Scribus options for minimizing the size (subsetting fonts and downsampling images) did not meet that requirement, I looked for a route to PostScript and back to PDF that would give better compression. The following does a surprisingly good job:


 * 1) Export as PDF (1.3 or 1.4), embedding all fonts, no font subsetting, no image subsampling → newsletter_scribus.pdf [~2.8 MB]
 * 2) Convert to PS using `pdftops -level3' → newsletter.ps [huge]
 * 3) Convert back to PDF using ghostscript (subsetting fonts, subsampling images) → newsletter_compact.pdf [~500 kB]. (You can use ps2pdf13 [or ps2pdf14] for this step; a more fine-grained solution is to use my script below.)
 * 4) If you feel like it, use pdfopt to linearize the PDF, so Acroread can start showing the first pages while the rest is still being downloaded (I have never tried this feature).
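Steps 2 to 4 above can also be run by hand from a shell. The following is only a minimal sketch, assuming that pdftops, ps2pdf14 and pdfopt are installed and on the PATH; the names `run`, `compress_pdf` and `DRY_RUN` and the intermediate file names are made up for illustration. With `DRY_RUN=1` the function only prints the commands it would run.

```shell
# Sketch of steps 2-4: PDF -> PostScript -> compact PDF -> linearized PDF.
# run, compress_pdf and DRY_RUN are illustrative names, not part of any tool.

run() {
    # Print the command in dry-run mode, otherwise execute it.
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$@"
    else
        "$@"
    fi
}

compress_pdf() {
    in=$1                      # e.g. newsletter_scribus.pdf
    base=${in%.pdf}            # strip the .pdf suffix
    # Step 2: convert to Level 3 PostScript
    run pdftops -level3 "$in" "$base.ps" || return 1
    # Step 3: back to PDF with Ghostscript's default ps2pdf14 settings
    run ps2pdf14 "$base.ps" "$base.tmp.pdf" || return 1
    # Step 4 (optional): linearize
    run pdfopt "$base.tmp.pdf" "${base}_compact.pdf"
}
```

To preview the commands without touching any files, set `DRY_RUN=1` and call `compress_pdf newsletter_scribus.pdf`.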

=Notes=
 * As you see, the resulting PDF is more than 5 times smaller.
 * For step 2, Acroread's print-to-file PS export will not work; you need to use Xpdf/pdftops.
 * For step 3, you need an 8.x version of ghostscript; I have successfully tried AFPL 8.15 and 8.53 and GNU 8.16, while ESP 7.07 does not work.
 * Marking and searching of text works fine in the new file, while there were many gaps between letters in the original (due to the way Scribus ensures precise text placement).
 * I also found that the compressed PDF file looks much nicer in Acroread 7 (but not in Xpdf or Gv), but your mileage may vary:
 * True-type fonts were not antialiased in the original PDF, but were in the compressed one. However, in simple test documents, antialiasing works for both PDF files, so I don't know what the problem was in the first place.
 * Images looked too dark in the original (apparently a transparency bug in Acroread 6 and 7), but are fine in the compressed file. Again, I have some difficulties in reproducing this in a simple test document.
 * There is one downside: as a result of the transformation, non-ASCII letters (like `é' or `½') get lost in Xpdf. Acroread 7 and Gv have no such problem, so it might be a bug in Xpdf/Poppler. For our newsletter, the number of readers using Xpdf is close to one (myself), so this is not much of an issue.
 * Also, the transformation loses meta information (creation date, creator, ...), bookmarks, PDF annotations, hyperlinks, etc. The script below fills in some of the meta information itself; I have tried to extract and restore bookmarks, but the resulting PDFs caused trouble with Acroread 7.
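Since step 3 depends on the Ghostscript version, it can be worth checking the version before converting. The helper below is a rough sketch, assuming that `gs --version` prints a plain version number such as 8.53 and that anything with a major version of 8 or higher is acceptable (only the 8.x versions listed above were actually tried); the function name `gs_is_recent` is hypothetical.

```shell
# gs_is_recent: illustrative helper, succeeds if the given version
# string has a major version of 8 or higher.
gs_is_recent() {
    major=${1%%.*}                  # text before the first dot
    case $major in
        ''|*[!0-9]*) return 1 ;;    # empty or not a number
    esac
    [ "$major" -ge 8 ]
}

# Usage (requires Ghostscript on the PATH):
#   gs_is_recent "$(gs --version)" || echo "Ghostscript 8.x needed" >&2
```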

=Additional information=


 * If anybody is willing to run the resulting file through a PDF checker, you can try the latest newsletter at http://www.calgarymulti.com/index.php?id=79

The following Perl script calls, in sequence:
 * 1) pdftk to extract some meta information
 * 2) pdftops
 * 3) gs
 * 4) pdfopt

If you lack the first or last program, you'll need to comment out the corresponding lines.

You get a usage overview with

compress-newsletter.pl -h

and before using it in production, you should fill in your details in the lines marked with `[Insert ... here]'.

The script is not very polished, but it works for me.
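For orientation, the Ghostscript call at the heart of the script can also be written directly in shell. This is a sketch under assumptions: the option values mirror the ones the script passes to gs (pdfwrite device, /screen settings, bicubic downsampling, a `col:gray:mono` resolution triple), while the helper names `gs_image_opts` and `ps_to_compact_pdf` are invented here and the script's meta information step is left out.

```shell
# gs_image_opts: turn a col:gray:mono resolution spec (in dpi) into
# Ghostscript downsampling options, one per line.
gs_image_opts() {
    spec=${1:-144:300:300}
    case $spec in
        *[!0-9:]*|*:*:*:*)          # stray characters or too many fields
            echo "resolution must be of form col:gray:mono" >&2; return 1 ;;
        ?*:?*:?*) ;;                # three non-empty numeric fields
        *)
            echo "resolution must be of form col:gray:mono" >&2; return 1 ;;
    esac
    col=${spec%%:*}
    rest=${spec#*:}
    gray=${rest%%:*}
    mono=${rest#*:}
    printf '%s\n' \
        "-dColorImageResolution=$col" \
        "-dGrayImageResolution=$gray" \
        "-dMonoImageResolution=$mono"
}

# ps_to_compact_pdf INPUT.ps OUTPUT.pdf [col:gray:mono]
ps_to_compact_pdf() {
    opts=$(gs_image_opts "${3:-}") || return 1
    # $opts is left unquoted on purpose so that it splits into options.
    gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
       -dCompatibilityLevel=1.3 -dPDFSETTINGS=/screen \
       -dEmbedAllFonts=true -dSubsetFonts=true \
       -dColorImageDownsampleType=/Bicubic \
       -dGrayImageDownsampleType=/Bicubic \
       -dMonoImageDownsampleType=/Bicubic \
       $opts "-sOutputFile=$2" -f "$1"
}
```

A call such as `ps_to_compact_pdf newsletter.ps newsletter_compact.pdf 100:200:300` would then downsample color images more aggressively than the default of 144:300:300.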

=compress-newsletter.pl=

#!/bin/sh
#  -*-Perl-*-

# Run the right perl version:
if [ -x /usr/local/bin/perl ]; then
  perl=/usr/local/bin/perl
elif [ -x /usr/bin/perl ]; then
  perl=/usr/bin/perl
else
  perl=`which perl| sed 's/.*aliased to *//'`
fi

exec $perl -x -S $0 "$@"    # -x: start from the following line

#! /Good_Path/perl -w
# line 17

# Name:   compress-newsletter
# Author: wd (Wolfgang.Dobler@ucalgary.ca)
# Date:   03-Oct-2005
# Description:
#   Use ghostscript's pdfwrite device (à la ps2pdf) to reduce the
#   Newsletter's PDF file size, and add meta information like author,
#   date, etc.
#   The preferred route is currently:
#                 [scribus>=1.2.3]
#                     file.pdf
#                 [pdftops>=3.00]
#                     file.ps
#        [ps2pdf14 (gs-gnu-8.16 or higher)]
#                        V
#                     final.pdf
# Usage:
#   compress-newsletter [-i col:gray:mono] Newsletter_big.pdf
# Options:
#   -i col:gray:mono
#   --imgres=col:gray:mono   Set resolution for downsampling color,
#                            grayscale and black-and-white images
#                            (default is 144:300:300)
#   --debug                  Be verbose and keep temporary files around

use strict;
use File::Temp qw/ :mktemp /;

use Getopt::Long;
# Allow for `-Plp' as equivalent to `-P lp' etc:
Getopt::Long::config("bundling");

my (%opts);                     # Options hash for GetOptions
my $doll='\$';                  # Need this to trick CVS

# Process command line
GetOptions(\%opts,
           qw( -h   --help
               -i=s --imgres=s
                    --debug
               -q   --quiet
               -v   --version ));

my $debug = ($opts{'debug'} ? 1 : 0 ); # undocumented debug option
if ($debug) {
    printopts(\%opts);
    print "\@ARGV = `@ARGV'\n";
}

# usage() and version() are defined further down, so call them with
# parentheses to keep `use strict' happy.
if ($opts{'h'} || $opts{'help'})    { die usage();   }
if ($opts{'v'} || $opts{'version'}) { die version(); }

my $quiet  = ($opts{'q'} || $opts{'quiet'}  || ''           );
my $imgres = ($opts{'i'} || $opts{'imgres'} || '144:300:300');

my ($gs,      @gsargs     ) = ('gs'     );
my ($pdftops, @pdftopsargs) = ('pdftops');
my ($pdfopt,  @pdfoptargs ) = ('pdfopt' );

my $infile = shift or die usage();
(my $root=$infile) =~ s/\.(pdf|ps).*//;
(my $outfile=$infile) =~ s/(.*)(\.(pdf|ps))/${1}_new${2}/;
my $tmpfile = mktemp("${root}.tmp_XXXXXX");


# 0. Extract all sorts of information

# Extract Scribus version, creation date, bookmarks from original PDF:
print "Running pdftk ...\n";
print STDERR "pdftk $infile dump_data output\n" if ($debug);
my $meta = `pdftk $infile dump_data output -`;
my ($creator) = ( $meta =~
                  m{InfoKey: Creator\s+InfoValue:\s*(.+)$}m
                );
$creator = 'Scribus 1.2.3' unless defined($creator);
my $datestring = extract_CreationDate($meta);
my @bookmarks = extract_bookmarks($meta);

# Extract desired image resolutions
my ($colres,$grayres,$monores) = ($imgres =~ /([0-9]+):([0-9]+):([0-9]+)/);
die "Image resolution must be of form `col:gray:mono'\n"
    unless defined($monores);

# 1. Run pdftops
push @pdftopsargs, "-level3";
my $psfile = mktemp("${root}.ps_XXXXXX");
push @pdftopsargs, $infile, $psfile;
print "Running pdftops ...\n";
print STDERR "$pdftops @pdftopsargs\n" if ($debug);
system($pdftops,@pdftopsargs);

# 2. Run gs
# a) Prepare options
push @gsargs, qw{-q -dNOPAUSE -dBATCH};
push @gsargs, '-sDEVICE=pdfwrite';
push @gsargs, '-dCompatibilityLevel=1.3';
# One of /printer, /screen, /prepress, /ebook, /default; see Ps2pdf.htm:
push @gsargs, '-dPDFSETTINGS=/screen';
push @gsargs, '-dEmbedAllFonts=true';
push @gsargs, '-dSubsetFonts=true';
push @gsargs, '-dColorImageDownsampleType=/Bicubic';
push @gsargs, "-dColorImageResolution=$colres";
push @gsargs, '-dGrayImageDownsampleType=/Bicubic';
push @gsargs, "-dGrayImageResolution=$grayres";
push @gsargs, '-dMonoImageDownsampleType=/Bicubic';
push @gsargs, "-dMonoImageResolution=$monores";
push @gsargs, "-sOutputFile=$tmpfile";
push @gsargs, "-c .setpdfwrite";

# b) Write meta information to temporary file
# my $metafile = mktemp("metainfo.tmp_XXXXXX");
my $metafile = "${root}.meta";
open(META, "> $metafile") or die "Cannot open $metafile\n";
print META <<"DEAD_PARROT";
% Document information
[% /CreationDate (D:$datestring)
  /ModDate (D:$datestring)
  /Creator ($creator)
  /Title ([Insert your document title here])
  /Subject ([Insert the Subject here])
  /Keywords ([Insert key words here])
  /Author ([Insert author's name here])
  /DOCINFO pdfmark

% Initial view on opening the document
[/View [/Fit] % Fit page in window
  /Page 1
% /PageMode /UseOutlines % /UseNone /UserOutlines /UseThumbs /FullScreen
  /DOCVIEW pdfmark

DEAD_PARROT
close(META);                    # make sure the file is complete before gs reads it

# Bookmarks. [Commented out for acroread 7.0 has problems] Currently at
# the mercy of the original bookmarks (and Scribus 1.2.2 does not allow
# to edit the bookmark names) and the encoding that pdftk understands
# (most quotation marks get mapped to `?').
# Ideally, one would write out the meta information file with
# `compress-newsletter -m CC.pdf' and use it then with
# `compress-newsletter CC.pdf'.
# % Bookmarks: @bookmarks

push @gsargs, '-f', $psfile, $metafile;
print "Running gs ...\n";
print STDERR "$gs @gsargs\n" if ($debug);
system($gs,@gsargs);

# 3. Run pdfopt
print "Running pdfopt ...\n";
print STDERR "$pdfopt @pdfoptargs $tmpfile $outfile\n" if ($debug);
system($pdfopt,@pdfoptargs,$tmpfile,$outfile);

# Some diagnostics:
system('ls', '-l', $infile, $psfile, $tmpfile, $outfile);

END {
    # Clean up even in case of an error:
    unless ($debug) {
        foreach my $file ($psfile,$tmpfile) {
            unlink $file if (defined($file) && -f $file);
        }
    }
}

sub extract_CreationDate {

    use POSIX qw(strftime);

    my $meta = shift;

    my ($cdate) = ( $meta =~
                    m{InfoKey: CreationDate\s+InfoValue:\s*(.+)$}m
                  );
    # Time string: need to splice in "'" after hours and minutes of time zone
    # definition. To me this looks like the technical documentation was taken
    # too literally and now applications (and Acroread 7) insist on these
    # stupid markers.
    my $datestring;
    if (defined($cdate) && $cdate =~ /[0-9]{14}/) {
        # managed to extract CreationDate from $meta
        # (note: the time zone offset is hard-coded here)
        $datestring = "$cdate-06'00'";
    } else {
        # Creation date unknown -- use current date
        my $tz = strftime "%z", localtime;
        $tz =~ s/([0-9][0-9])([0-9][0-9])/$1'$2'/;
        $datestring = strftime "%Y%m%d%H%M%S$tz", localtime;
    }

    $datestring;
}

sub extract_bookmarks {

    my $meta = shift;

    my @bm;

    while ($meta =~ /^BookmarkTitle:     \s* (.*) \n
                      BookmarkLevel:      \s* (.*) \n
                      BookmarkPageNumber: \s* (.*)
                    /xmg) {
        my ($title,$level,$page) = ($1,$2,$3);
        push @bm, "[/Title ($title) /Page $page /OUT pdfmark\n";
    }

    @bm;                        # return the list of pdfmark lines
}

# Print command line options
sub printopts {
    my $optsref = shift;
    my %opts = %$optsref;
    foreach my $opt (keys(%opts)) {
        print STDERR "\$opts{$opt} = `$opts{$opt}'\n";
    }
}

# Extract description and usage information from this file's header.
sub usage {
    my $thisfile = __FILE__;
    local $/ = '';              # Read paragraphs
    open(FILE, "<$thisfile") or die "Cannot open $thisfile\n";
    while (<FILE>) {
        # Paragraph _must_ contain `Description:' or `Usage:'
        next unless /^\s*\#\s*(Description|Usage):/m;
        # Drop `Author:', etc. (anything before `Description:' or `Usage:')
        s/.*?\n(\s*\#\s*(Description|Usage):\s*\n.*)/$1/s;
        # Don't print comment sign:
        s/^\s*# ?//mg;
        last;                   # ignore body
    }
    $_ or "\n";
}

# Return CVS data and version info.
sub version {
    my $doll='\$';              # Need this to trick CVS
    my $cmdname = (split('/', $0))[-1];
    my $rev = '$Revision: 1.8 $';
    my $date = '$Date: 2006/02/02 09:38:52 $';
    $rev =~ s/${doll}Revision:\s*(\S+).*/$1/;
    $date =~ s/${doll}Date:\s*(\S+).*/$1/;
    "$cmdname version $rev ($date)\n";
}

# End of file compress-newsletter