GSoC 2009 Improving PDF Export by thachtran

= Abstract =

Over the course of developing Scribus, a lot of work has been done to make its output formats satisfy a wide range of user needs. As PDF is the de facto standard for electronic documents, making Scribus's PDF output more flexible in terms of users' requirements is highly desirable. This project aims to address a few important improvements to Scribus's PDF export features. The main focus will be supporting PDF/X-1a and PDF/X-4 export and improving Scribus's PDF embedding feature. Depending on actual progress, the project could be extended to cover embedding font subsets when exporting PDFs.

= Brief Background =

About PDF/X
Scribus is currently able to verify and produce PDF/X-3 documents. PDF/X-3 is one particular standard of PDF/X, a subset of PDF that concentrates on setting a standard (or rather an agreement) for exchanging documents between creators and commercial printers, so that creators can be assured (to some extent) of the fidelity of the final printed document. Basically, PDF/X requires that
 * all fonts are embedded.
 * only printable content is present; any extra live content such as security, forms, annotations, bookmarks, sound, video, etc. is prohibited.
 * printing conditions must be present (output intent).
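
As a rough illustration of these three requirements, a preflight-style check could look like the sketch below. `FontInfo`, `DocumentSummary` and `check_pdfx_basics` are hypothetical names made up for this example; they are not part of Scribus or PoDoFo, and a real verifier would inspect the actual PDF objects.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FontInfo:
    name: str
    embedded: bool


@dataclass
class DocumentSummary:
    """Hypothetical, simplified view of a document for preflight purposes."""
    fonts: List[FontInfo] = field(default_factory=list)
    live_content: List[str] = field(default_factory=list)  # e.g. "annotation", "form"
    output_intent: Optional[str] = None  # name of the intended printing condition


def check_pdfx_basics(doc: DocumentSummary) -> List[str]:
    """Return human-readable problems; an empty list means the basic checks pass."""
    problems = []
    for font in doc.fonts:
        if not font.embedded:
            problems.append(f"font not embedded: {font.name}")
    for item in doc.live_content:
        problems.append(f"live content not allowed: {item}")
    if not doc.output_intent:
        problems.append("missing output intent")
    return problems
```

For example, a document with one unembedded font, one annotation and no output intent would yield three problems, one per rule above.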

PDF/X-3
PDF/X-3 is an ISO standard that allows an ICC color profile to be attached to a document, so that the document's color is managed explicitly rather than left device-dependent. This feature gives users extra assurance in terms of the fidelity of the color representation in the final document. See for more information on how Scribus supports this feature.

PDF/X-1a
This particular standard of PDF/X is a strict subset of PDF/X-3, where the main restriction is that the color data has to be converted to CMYK before the exchange takes place. This works best if the target CMYK color profile is clearly defined (this is usually the case in the US). It is also worth noting that a PDF/X-1a compliant file also satisfies the content requirements of PDF/X-3, since the latter standard permits CMYK-only color data.

PDF/X-4
This, on the other hand, is a superset of PDF/X-3, based on PDF 1.6, in which the main extension is allowing transparency and layers. The main reason for the restriction on transparency in PDF/X-3 is that, over the years, most printers were PostScript printers whose RIPs (Raster Image Processors) worked with PostScript representations of the documents. Since PostScript was developed before the notion of transparency was introduced, it could not handle transparent content (hence the restriction in PDF/X-3). More recently, high-end printers have become capable of working directly with PDFs at the RIP level and can therefore handle transparency and layers in PDF files. PDF/X-4 was created to take advantage of this technology.

PDF Embedding
At the moment, users can select PDF files to be added to the Scribus document's content (via "Insert Image Frame" and choosing the PDFs). At PDF export time, users have the option of "truly" embedding these PDF contents into the output or simply including them as raster images.

Unsurprisingly, choosing to embed these contents is more desirable since it preserves the original content and therefore gives a better result when it comes to high-end printing. This is made possible by including the PDF contents as Form XObjects into the resulting PDF. One main focus of this project is an attempt to improve this feature, especially with respect to color spaces.
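
To make the Form XObject idea more concrete, here is a minimal sketch of the PDF syntax involved. The helper names `form_xobject` and `place_form` and the resource name `Fm0` are made up for this example, and the sketch leaves out the `/Resources` dictionary, compression and object numbering that a real exporter has to handle.

```python
def form_xobject(content: bytes, width: float, height: float) -> bytes:
    """Wrap a page content stream as the body of a Form XObject stream."""
    header = (
        "<< /Type /XObject /Subtype /Form "
        f"/BBox [0 0 {width:g} {height:g}] /Length {len(content)} >>\n"
        "stream\n"
    ).encode("ascii")
    return header + content + b"\nendstream"


def place_form(name: str, x: float, y: float, scale_x: float, scale_y: float) -> str:
    """Content-stream snippet that paints the named form at (x, y) with scaling."""
    # q/Q save and restore the graphics state; cm sets the transformation matrix;
    # Do paints the XObject registered under /name in the page resources.
    return f"q {scale_x:g} 0 0 {scale_y:g} {x:g} {y:g} cm /{name} Do Q"
```

A host page would then contain something like `place_form("Fm0", 72, 144, 0.5, 0.5)` in its own content stream, while the embedded page's drawing operators stay untouched inside the form.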

Font Subsetting in PDF Export
In the current version of Scribus, when exporting to PDF, users have the choice of embedding fonts used in the document into the output PDF. This is again very desirable since it preserves the document's appearance better and in fact, PDF/X flavors of PDF require all fonts to be embedded.

Scribus currently handles this either by embedding the whole font file or by converting the font to outlines: the used glyphs are drawn manually (using path construction operators) in the font's dictionary inside the document's Resources, where the font is defined. A more common and natural way of doing this is to embed a subset of the font, so that only the used glyphs are included.

= Project Goals =

PDF/X-1a and PDF/X-4 Exporting
This part of the project will implement export to PDF/X-1a and PDF/X-4, and possibly PDF/X-5. Upon completion, Scribus should be able to export documents that conform to these standards. It is also meant to extend the functionality of the Preflight Verifier, so that it can verify documents against these standards.

As I mentioned earlier, supporting PDF/X-1a should be quite straightforward, since Scribus already fully supports PDF/X-3. By restricting the output to use only CMYK, I could create an output filter that produces valid PDF/X-1a files.

Supporting PDF/X-4 is a bit more complicated. Still, using the PDF/X-3 rules as the basis, I should be able to allow live transparency and layers to be included in the documents.
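
The differences between the three flavors, as described in this proposal, can be summarised in a small lookup table. This is only a sketch: `PDFX_RULES`, the feature names and `disallowed_features` are made up for this example, and the table is far coarser than the ISO 15930 texts.

```python
# Hypothetical feature table based on the descriptions in this proposal.
PDFX_RULES = {
    "PDF/X-1a": {"icc_managed_color": False, "transparency": False, "layers": False},
    "PDF/X-3": {"icc_managed_color": True, "transparency": False, "layers": False},
    "PDF/X-4": {"icc_managed_color": True, "transparency": True, "layers": True},
}


def disallowed_features(target, used_features):
    """Return, sorted, the features in used_features that the target flavor forbids.

    Features missing from the table are treated as forbidden, which is the
    conservative choice for a preflight check.
    """
    rules = PDFX_RULES[target]
    return sorted(f for f in used_features if not rules.get(f, False))
```

A verifier built this way would only need a new table row to cover another flavor, provided the feature list is detailed enough.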

PDF Embedding
The PDF embedding feature of Scribus basically works at the moment. However, the PDF contents are embedded "as-is": the content is included as a form XObject and is painted exactly as if it were a stand-alone PDF.

A problem arises, however, when we need to explicitly manage the color profile of the final output (e.g. force all colors to CMYK in the case of PDF/X-1a): the colorspaces used in the embedded PDF contents need to be consistent with the rest of the document. At the moment, this is not the case in Scribus, and this part of the project will try to solve that. This will involve converting from one colorspace to another, and I anticipate that this part of the project will be highly complex.
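
To give a feel for what such a conversion involves, the sketch below rewrites the DeviceRGB color operators (`rg`, `RG`) in a very simple content stream into their DeviceCMYK counterparts (`k`, `K`). It uses the naive textbook RGB to CMYK formula instead of ICC profiles, and the helper names are made up for this example. A real implementation would need a proper color management engine and a real content-stream parser, and would also have to handle ICC-based colorspaces, images and shadings.

```python
def rgb_to_cmyk(r, g, b):
    """Naive DeviceRGB -> DeviceCMYK conversion for components in [0, 1]."""
    k = 1.0 - max(r, g, b)
    if k >= 1.0:  # pure black: avoid dividing by zero
        return (0.0, 0.0, 0.0, 1.0)
    scale = 1.0 - k
    return ((1.0 - r - k) / scale, (1.0 - g - k) / scale, (1.0 - b - k) / scale, k)


def convert_rgb_operators(content):
    """Replace 'r g b rg' / 'r g b RG' with the equivalent 'c m y k k' / '... K'.

    Assumption: tokens are separated by whitespace and the stream contains no
    strings or inline images. Whitespace is normalised to single spaces.
    """
    out = []
    for token in content.split():
        if token in ("rg", "RG") and len(out) >= 3:
            try:
                r, g, b = (float(x) for x in out[-3:])
            except ValueError:
                out.append(token)  # operands are not plain numbers; leave untouched
                continue
            del out[-3:]
            out.extend(format(v, ".4g") for v in rgb_to_cmyk(r, g, b))
            out.append("k" if token == "rg" else "K")
        else:
            out.append(token)
    return " ".join(out)
```

With this sketch, a stream that fills a rectangle in pure red, `1 0 0 rg 0 0 10 10 re f`, becomes `0 1 1 0 k 0 0 10 10 re f`.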

Font Subsetting in PDF Export
This idea will be saved as one possible extension of the project if time permits.

As discussed previously, this part of the project will aim to implement font subsetting for PDF export, so that only the glyphs actually used in the document are embedded. This would reduce the size of the PDF output, as the resulting document would not have to contain the entire font file.
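
The first step of subsetting, finding out which characters each font actually needs, can be sketched as follows. The `(font_name, text)` run format and both function names are assumptions made for this example; Scribus's real text model is different, and mapping characters to glyphs and rewriting the font program are left out entirely. The six-uppercase-letter tag followed by `+` is the naming convention PDF uses for subset fonts.

```python
from collections import defaultdict


def used_characters(text_runs):
    """Collect, per font, the sorted list of characters that appear in the document.

    text_runs is an iterable of (font_name, text) pairs.
    """
    used = defaultdict(set)
    for font_name, text in text_runs:
        used[font_name].update(text)
    return {name: sorted(chars) for name, chars in used.items()}


def subset_name(base_name, index):
    """Build a subset font name such as 'AAAAAB+Helvetica' from a counter."""
    letters = []
    for _ in range(6):
        index, rem = divmod(index, 26)
        letters.append(chr(ord("A") + rem))
    return "".join(reversed(letters)) + "+" + base_name
```

For a document that uses only a handful of characters from a large font, the resulting character list is what a subsetter would use to decide which glyphs to keep.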

= Time Line =
 * Now – May 23rd: Preliminary study: PDF, PDF/X, PoDoFo (PDF library used by Scribus), Scribus PDF Exporting source code, Qt.
 * May 23rd – June 15th: Implementing the PDF/X-1a & PDF/X-4 (X-5) export.
 * June 15th – July 15th: Improving the PDF embedding feature.
 * July 15th – Aug. 03rd: If the PDF embedding feature is done, move on to implement the font subsetting feature; if not, carry on with the PDF embedding.
 * Aug. 03rd – Aug. 17th: Finishing off the project, documentation and testing.

= Participant Information =
 * Name / University / current enrollment information
   * Name: Thach Tran
   * University: University of Nottingham, UK
   * Enrollment status: 3rd (last) year, Bachelor of Science (Honours) Computer Science
 * Biographical sketch
   * I'm an undergraduate student at the University of Nottingham, UK. My major is Computer Science. I have a great interest in digital documents, enterprise computing and programming languages in general.
   * I'm finishing the dissertation for my degree at the moment, for which I developed a tool to convert documents from PDFXML (a.k.a. Mars) to PDF. See the project's page at Google Code and my interim report.
   * Last year, I also participated in a group project as part of my studies. The project aimed to develop audio DSP software (called PASTA :-), i.e. Palette Augmented Sound Transformation Application) which can apply different effects to sound signals. Our team finished quite impressively and got a very high grade. See and  for more details.
   * For your information, feel free to take a look at my resume.
 * Did you ever code in C, C++ or Python? Please provide examples of code
   * I have coded in C and C++ for over a year now. However, I have never had a chance to study or code in Python. I hope to start learning it soon.
   * C and C++ code examples. These are pieces of coursework that I did as part of my course.
     * C: Text-based Mastermind game
     * C++: Pacman game
 * Do you use Scribus? Please provide examples if you do.
   * I have never used Scribus before.
 * Do you make other use of Scribus than for laying out articles? Please describe and show
   * As I said earlier, I haven't used Scribus before. Now that I'm aware of it, I certainly hope to have a chance to use Scribus in the near future.
 * Have you been involved in Scribus development in the past? What were your contributions?
   * Unfortunately, no, I haven't.
 * Have you been involved in other Open Source development projects in the past? If yes, please tell us project, when and in what role were you involved.
   * No, I haven't.
   * I have actually used a lot of software and tools from the Open Source community, but the closest I ever got to interacting with the community was subscribing to some mailing lists to participate in discussions. I have been coding for about 4 years now (since I started going to college) and I think now is the time to give something back to the community.
 * Why have you chosen your development idea and what do you expect from your implementation?
   * Since I started using computers to prepare my documents, I have always been impressed by how well PDF preserves the appearance of my documents. Whether I bring my documents to a print shop to get them printed or send them electronically to different people, I always want to be assured that my work has a consistent look.
   * Along that line, I have developed my interest in PDF and in digital documents in general. I had the chance to actually work with PDF in the project I'm doing now at university, and since I enjoyed it so much, I would like to continue in that direction in GSoC 2009.
   * Digital publishing in general is a challenging field that requires great attention to tiny details in order to produce beautiful and professional digital output. Scribus, as well-known software in this field, is a magnificent tool which helps users design compelling page layouts and sensational publications with ease. Since the ultimate purpose of using the software is to "publish" the work, exporting the result to PDF plays a key role in the software.
   * As everyone who has dug into the low-level details of PDF can testify, PDF as well as colorspace management, font embedding and the like are very challenging and might even seem tedious to a lot of people. This is exactly where I find my inspiration; I enjoy working with details and low-level programming.
   * While the PDF exporting feature of Scribus is already in good shape, there are still a lot of things that could be improved to make PDFs produced by Scribus more robust and more "press-ready". This is the main motivation behind the project.
   * I would expect the project, upon completion, to add some useful features to Scribus and consequently satisfy demand from users. As for myself, I think the project will be a chance to learn more about software development from fellow developers and, moreover, to learn more about the great Open Source community.
 * Are you ready and willing to sustain a good level of communication with your mentor and the Scribus Team overall and be open and forthcoming about the progress of your project including coding and personal problems related to your GSoC project?
   * Of course, I'm really looking forward to working with the Scribus Team. I had an extremely good experience preparing this proposal; I have been in contact with the community along the way and I really enjoyed exchanging ideas with people in the team. We did encounter a little conflict at first but it was easily settled, and I'm sure things will be the same over the course of my project (if I get selected, anyway :-) ).
 * Contact details
   * Email: tranngocthachs@gmail.com
   * Phone: +447942606550
   * IMs: tranngocthachs on Skype and thachtran on freenode
   * Time zone: British Summer Time (GMT + 1)
   * I will be staying in the UK up to 23rd of July to attend my graduation ceremony. After that, I might move back to my home country as my study has finished. In that case, my phone number will be +84905211803 and I'll be in the time zone of Vietnam (GMT + 7). I will keep you guys posted on any changes, but I'm sure this will not affect the project whatsoever.