Web optimised PDF

Scribus' PDF is optimized for print shops and Scribus takes pains to ensure that the printed output will look identical on different printing presses. The price for this consistency is a file size several times larger than straight-forward PDF would have. This page shows various ways to reduce the file size so you can put the PDF file on the web or even distribute it by e-mail.

= First tip =

My wife edits and lays out a newsletter that, being mostly distributed via e-mail and the web, must be below ~ 1 MB, desirably even smaller. Since the Scribus options for minimizing the size (subsetting fonts and downsampling images) did not meet that requirement, I looked for a route to PostScript and back to PDF that would give better compression, and the following does a surprisingly good job:


 * 1) Export as PDF (1.3 or 1.4), embedding all fonts, no font subsetting, no image subsampling → newsletter_scribus.pdf [~2.8 MB]
 * 2) Convert to PS using `pdftops -level3' → newsletter.ps [huge]
 * 3) Convert back to PDF using ghostscript (subsetting fonts, subsampling images) → newsletter_compact.pdf [~500 kB]. (You can use ps2pdf13 [or ps2pdf14] for this step; a more fine-grained solution is to use my script below.)
 * 4) If you feel like it, use pdfopt to linearize the PDF, so Acroread can start showing the first pages while the rest is still being downloaded (I have never tried this feature).

compress-newsletter.pl
if [ -x /usr/local/bin/perl ]; then perl=/usr/local/bin/perl elif [ -x /usr/bin/perl ]; then perl=/usr/bin/perl else perl=`which perl| sed 's/.*aliased to *//'` fi
 * 1) !/bin/sh
 * 2)  -*-Perl-*-
 * 3) Run the right perl version:
 * 1) Run the right perl version:

exec $perl -x -S $0 "$@"    # -x: start from the following line
 * 1) ! /Good_Path/perl -w
 * 2) line 17
 * 1) line 17

use strict; use File::Temp qw/ :mktemp /;
 * 1) Name:   compress-newsletter
 * 2) Author: wd (Wolfgang.Dobler@ucalgary.ca)
 * 3) Date:   03-Oct-2005
 * 4) Description:
 * 5)   Use ghostscript's pdfwrite device (à la ps2pdf) to reduce the
 * 6)   Newsletter's PDF file size, and add meta information like author,
 * 7)   date, etc.
 * 8)   The preferred route is currently:
 * 9)                 [scribus>=1.2.3]
 * 10)                    file.pdf
 * 11)                 [pdftops>=3.00]
 * 12)                     file.ps
 * 13)            [pstopdf14 (gs-gnu-8.16 or higher)]
 * 14)                        V
 * 15)                    final.pdf
 * 16) Usage:
 * 17)   compress-newletter [-i col:gray:mono] Newsletter_big.pdf
 * 18) Options:
 * 19)   -i col:gray:mono
 * 20)   --imgres=col:gray:mono   Set resolution for downsampling color,
 * 21)                            grayscale and black-and-white images
 * 22)                            (default is 144:300:300)
 * 23)   --debug                  Be verbose and keep temporary files around
 * 1)   -i col:gray:mono
 * 2)   --imgres=col:gray:mono   Set resolution for downsampling color,
 * 3)                            grayscale and black-and-white images
 * 4)                            (default is 144:300:300)
 * 5)   --debug                  Be verbose and keep temporary files around

use Getopt::Long; Getopt::Long::config("bundling");
 * 1) Allow for `-Plp' as equivalent to `-P lp' etc:

my (%opts);			# Options hash for GetOptions my $doll='\$';			# Need this to trick CVS

GetOptions(\%opts,	  qw( -h   --help -i=s --imgres=s --debug -q  --quiet -v  --version ));
 * 1) Process command line

my $debug = ($opts{'debug'} ? 1 : 0 ); # undocumented debug option if ($debug) { printopts(\%opts); print "\@ARGV = `@ARGV'\n"; }

if ($opts{'h'} || $opts{'help'})   { die usage;   } if ($opts{'v'} || $opts{'version'}) { die version; }

my $quiet = ($opts{'q'} || $opts{'quiet'}  || ''           ); my $imgres = ($opts{'i'} || $opts{'imgres'} || '144:300:300');

my ($gs,     @gsargs     ) = ('gs'     ); my ($pdftops, @pdftopsargs) = ('pdftops'); my ($pdfopt, @pdfoptargs ) = ('pdfopt' );

my $infile = shift or die usage; (my $root=$infile) =~ s/\.(pdf|ps).*//; (my $outfile=$infile) =~ s/(.*)(\.(pdf|ps))/${1}_new${2}/; my $tmpfile = mktemp("${root}.tmp_XXXXXX");


 * 1) 0. Extract all sorts of information

print "Running pdftk ...\n"; print STDERR "pdftk $infile dump_data output\n" if ($debug); my $meta = `pdftk $infile dump_data output -`; my ($creator) = ( $meta =~		 m{InfoKey: Creator\s+InfoValue:\s*(.+)$}m		); $creator = 'Scribus 1.2.3' unless defined($creator); my $datestring = extract_CreationDate($meta); my @bookmarks = extract_bookmarks($meta);
 * 1) Extract Scribus version, creation date, bookmarks from original PDF:

my ($colres,$grayres,$monores) = ($imgres =~ /([0-9]+):([0-9]+):([0-9]+)/); die "Image resolution must be of form `col:gray:mono'\n" unless defined($monores);
 * 1) Extract desired image resolutions

push @pdftopsargs, "-level3"; my $psfile = mktemp("${root}.ps_XXXXXX"); push @pdftopsargs, $infile, $psfile; print "Running pdftops ...\n"; print STDERR "$pdftops @pdftopsargs\n" if ($debug); system($pdftops,@pdftopsargs);
 * 1) 1. Run pdftops

push @gsargs, qw{-q -dNOPAUSE -dBATCH}; push @gsargs, '-sDEVICE=pdfwrite'; push @gsargs, '-dCompatibilityLevel=1.3'; push @gsargs, '-dPDFSETTINGS=/screen'; push @gsargs, '-dEmbedAllFonts=true'; push @gsargs, '-dSubsetFonts=true'; push @gsargs, '-dColorImageDownsampleType=/Bicubic'; push @gsargs, "-dColorImageResolution=$colres"; push @gsargs, '-dGrayImageDownsampleType=/Bicubic'; push @gsargs, "-dGrayImageResolution=$grayres"; push @gsargs, '-dMonoImageDownsampleType=/Bicubic'; push @gsargs, "-dMonoImageResolution=$monores"; push @gsargs, "-sOutputFile=$tmpfile"; push @gsargs, "-c .setpdfwrite";
 * 1) 2. Run gs
 * 2) a) Prepare options
 * 1) One of /printer, /screen, /prepress, /ebook, /default; see Ps2pdf.htm:

my $metafile = "${root}.meta"; open(META, "> $metafile"); print META <<"DEAD_PARROT"; % Document information [% /CreationDate (D:$datestring) /ModDate (D:$datestring) /Creator ($creator) /Title ([Insert your document title here]) /Subject ([Insert the Subject here]) /Keywords ([Insert key words here]) /Author ([Insert author' nsme here]) /DOCINFO pdfmark
 * 1) b) Write meta information to temporary file
 * 2) my $metafile = mktemp("metainfo.tmp_XXXXXX");

% Initial view on opening the document [/View [/Fit] % Fit page in window /Page 1 % /PageMode /UseOutlines % /UseNone /UserOutlines /UseThumbs /FullScreen /DOCVIEW pdfmark

DEAD_PARROT


 * 1) Bookmarks. [Commented out for acroread 7.0 has problems] Currently at
 * 2) the mercy of the original bookmarks (and Scribus 1.2.2 does not allow
 * 3) to edit the bookmark names) and the encoding that pdftk understands
 * 4) (most quotation marks get mapped to `?').
 * 5) Ideally, one would write out the meta information file with
 * 6) `compress-newsletter -m CC.pdf' and use it then with
 * 7) `compress-newsletter CC.pdf'.
 * 8) % Bookmarks: @bookmarks

push @gsargs, '-f', $psfile, $metafile; print "Running gs ...\n"; print STDERR "$gs @gsargs\n" if ($debug); system($gs,@gsargs);

print "Running pdfopt ...\n"; print STDERR "$pdfopt @pdfoptargs $tmpfile $outfile\n" if ($debug); system($pdfopt,@pdfoptargs,$tmpfile,$outfile);
 * 1) 3. Run pdfopt

system('ls', '-l', $infile, $psfile, $tmpfile, $outfile);
 * 1) Some diagnostics:

END { # Clean up even in case of an error: unless ($debug) { foreach my $file ($psfile,$tmpfile) { unlink $file if (defined($file) && -f $file); }   } }

sub extract_CreationDate {

use POSIX qw(strftime);

my $meta = shift;

my ($cdate) = ( $meta =~		   m{InfoKey: CreationDate\s+InfoValue:\s*(.+)$}m		  ); # Time string: need to splice in "'" after hours and minutes of time zone # definition. To me this looks like the technical documentation was taken # too literally and now applications (and Acroread 7) insist on these # stupid markers. my $datestring; if ($cdate =~ /[0-9]{14}/) { # managed to extract CreationDate from $meta $datestring = "$cdate-06'00'"; } else {		        # Creation date unknown -- use current date my $tz = strftime "%z", localtime; $tz =~ s/([0-9][0-9])([0-9][0-9])/$1'$2'/; $datestring = strftime "%Y%m%d%H%M%S$tz", localtime; }

$datestring; } sub extract_bookmarks {

my $meta = shift;

my @bm;

while ($meta =~ /^BookmarkTitle:     \s* (.*) \n                      BookmarkLevel:      \s* (.*) \n                      BookmarkPageNumber: \s* (.*) /xmg) { my ($title,$level,$page) = ($1,$2,$3); push @bm, "[/Title ($title /Page $page /OUT pdfmark\n";   }

} sub printopts { my $optsref = shift; my %opts = %$optsref; foreach my $opt (keys(%opts)) { print STDERR "\$opts{$opt} = `$opts{$opt}'\n"; } } sub usage { my $thisfile = __FILE__; local $/ = '';             # Read paragraphs open(FILE, "<$thisfile") or die "Cannot open $thisfile\n"; while () { # Paragraph _must_ contain `Description:' or `Usage:' next unless /^\s*\#\s*(Description|Usage):/m; # Drop `Author:', etc. (anything before `Description:' or `Usage:') s/.*?\n(\s*\#\s*(Description|Usage):\s*\n.*)/$1/s; # Don't print comment sign: s/^\s*# ?//mg; last;                       # ignore body }   $_ or "\n"; } sub version { my $doll='\$';		# Need this to trick CVS my $cmdname = (split('/', $0))[-1]; my $rev = '$Revision: 1.8 $'; my $date = '$Date: 2006/02/02 09:38:52 $'; $rev =~ s/${doll}Revision:\s*(\S+).*/$1/; $date =~ s/${doll}Date:\s*(\S+).*/$1/; "$cmdname version $rev ($date)\n"; }
 * 1) Print command line options
 * 1) Extract description and usage information from this file's header.
 * 1) Return CVS data and version info.


 * 1) End of file compress-newsletter

=Alternative solutions=

Using Ghostcript or Python, pdftops and ps2pdf14
In the script collection you can find a python script to Reduce_the_size_of_Scribus_generated_PDFs

A simple script
ps2pdf -dEmbedAllFonts=true -dUseFlateCompression=true -sPAPERSIZE=letter $1
 * create a postscript file (directly from Scribus, or by printing the pdf to a file, or by using the "pdftops -level3" command on your PDF file)
 * run the following script:
 * 1) make USLetter-sized PDF files using ps2pdf
 * 2) USAGE: mkpdf " .ps"

or just plain "ps2pdf -dEmbedAllFonts=true -dUseFlateCompression=true" on the PS produced file.

= Trials and results =

First tested file : 128 A5 pages, 36.9 Mo

 * Original file "pppOrg.pdf" was produced by the concatenation of 10 Scribus produced files (using the pdftk cat command).

gs -dPDFSETTINGS=/prepress -dSAFER -dCompatibilityLevel=1.5 -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sstdout=%stderr -dGrayImageResolution=300 -dMonoImageResolution=300 -dColorImageResolution=300 -sOutputFile=PPP_gs_command.pdf -c .setpdfwrite -f pppOrg.pdf
 * Directly through ghostscript command : 20,6 Mo (and many warnings)

gs -dPDFSETTINGS=/prepress -dSAFER -dCompatibilityLevel=1.5 -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sstdout=%stderr -dGrayImageResolution=300 -dMonoImageResolution=300 -dColorImageResolution=300 -sOutputFile=PPPps_gs_command.pdf -c .setpdfwrite -f ppp.ps
 * Converted to PS using pdftops -level3 (size : 327.2 Mo; Name : ppp.ps) and then fed to ghostcript : 18Mo (and no warnings)

* pdftk detects lots of mysterious repeated errors : "Error: Could not parse ligature component "cyrillic" of "cyrillic_otmark" in parseCharName" and "Error: Could not parse ligature component "otmark" of "cyrillic_otmark" in parseCharName" * gs runs ok * pdfopt detects errors such as : "Considering object with an invalid number 1006 as null." * produced pdf seems to look ok
 * Using perl script : 7.9 Mo : WIN !