What is a PDF file? | Knowledge Base
Introduction
As you already know from the article about “PDLs”, PDF is a static Page Description Language that has a strict unchangeable structure.
PDF is one of the most popular page description languages, if not the most popular, thanks to the huge variety of features that Adobe developers have added to its specification. Adobe also provides tools for using these features in documents. This article is a brief review of the syntax, structure, and features of PDF.
What is a PDF file?
The initial goal of developing PDF (Portable Document Format) was to create a document format that satisfied the numerous requirements of digital document interchange in device-independent and resolution-independent environments. These requirements include interactive viewing, high-performance navigation, low disk space usage, collaboration on documents, support for different media content, encryption, signing, form creation, presentation, and so on. Although the initial intent was to provide enterprises with a comprehensive format for digital document interchange, high-quality printing features were also added to the specification later.
Syntax of PDF file
PDF has an imaging model derived from that of PostScript. Like the AI format, it uses short operators, typically one or two characters long, and it has a postfix syntax in which all necessary operands come before the operator.
operand1 ... operandm operator
Besides operator length, there are other differences between PDF and PostScript operators. In PDF, all necessary operands must precede the operator, while in PostScript operands are taken from the PostScript stack. A PDF operator does not return a result, as a PostScript operator can. It performs an action that composes the page, such as drawing graphics or text, or it sets a property of the graphics state. In PostScript, operators do all the work.
Most of a PDF file’s content is usually compressed with Flate encoding and is therefore binary. PDF files can also be encrypted to limit access to document content. For these reasons the whole file must be treated as binary. A PDF file can be considered textual only when it is neither compressed nor encrypted and contains no binary content such as images, sound, or video.
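The postfix rule described above can be illustrated with a minimal Python sketch. It assumes a simplified content stream that contains only numbers, names, and operators separated by whitespace; strings, arrays, and dictionaries are deliberately not handled. The function name `parse_operations` is hypothetical and is not part of any PDF library.

```python
def parse_operations(content: str):
    """Group whitespace-separated tokens into (operator, operands) pairs.

    Simplifying assumption: every token is a number, a /Name, or an operator.
    """
    operations = []
    operands = []
    for token in content.split():
        try:
            operands.append(float(token))      # numeric operand
        except ValueError:
            if token.startswith("/"):
                operands.append(token)         # name operand, e.g. /Alpha1
            else:
                # Anything else is treated as an operator: it consumes
                # the operands collected so far.
                operations.append((token, operands))
                operands = []
    return operations


ops = parse_operations("0 0 m 595 0 l 595 842 l h S")
for operator, operands in ops:
    print(operator, operands)
```

Each operator is emitted together with the operands that preceded it, which is all a PDF interpreter needs to know before executing the action.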
PDF specification Objects
In the PDF specification, object is a synonym for type, whereas PostScript has primitive and complex types, and only the latter can be called “objects”. All types in PDF, simple or complex, are therefore objects. The PDF language consists of boolean values, integers, real numbers, names, strings, arrays, dictionaries, and streams. Strings can be written in literal or hexadecimal format, as shown below.
( This is a string ) | literal format
<4E6F762073686D6F7A206B6120706F702E> | hexadecimal format
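As an illustration, the hexadecimal string above can be decoded with a few lines of Python. This is a sketch rather than a full parser: the helper name `decode_hex_string` is hypothetical, and the only rules it applies are that whitespace between digits is ignored and that a missing final digit is assumed to be 0.

```python
def decode_hex_string(token: str) -> bytes:
    """Decode a PDF hexadecimal string such as <4E6F76> into raw bytes."""
    # Drop the enclosing angle brackets and any whitespace between digits.
    hex_digits = "".join(token.strip()[1:-1].split())
    if len(hex_digits) % 2:
        hex_digits += "0"   # an odd final digit is treated as if followed by 0
    return bytes.fromhex(hex_digits)


text = decode_hex_string("<4E6F762073686D6F7A206B6120706F702E>").decode("ascii")
print(text)
```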
Arrays are enclosed in square brackets. A Rectangle is a special kind of array with four elements.
Dictionaries store data as key-value pairs, where the key is a name (or a string, in the case of the Names dictionary) and the value is an object or an object reference. A dictionary is enclosed in double angle brackets. Dictionaries usually have a Type entry that shows what kind of data a given dictionary stores.
<< /Type /Example
   /Subtype /DictionaryExample
   /Version 0.01
   /IntegerItem 12
   /StringItem (a string)
   /Subdictionary << /Item1 0.4
                     /Item2 true
                     /LastItem (not!)
                     /VeryLastItem (OK)
                  >>
>>
endobj
Objects can be direct or indirect. Indirect objects are those that other objects can refer to by their ID.
Streams are objects that usually contain binary or encoded data. They are generally not human-readable and have no length limit. In PDF files, streams usually contain compressed page content, images, or other media. A stream object consists of a direct dictionary, which holds the length of the stream and an array of the filters used to encode it, followed by the encoded data between the keywords stream and endstream.
181 0 obj
<< /Length 473 0 R
   /Subtype /Image
   /Width 2
   /Height 19
   /BitsPerComponent 8
   /ColorSpace /DeviceGray
   /Filter [/ASCII85Decode /FlateDecode]
>>
stream
Gb"[2*s<F2i'/7_!,1%/hZ~>
endstream
endobj
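The Filter array lists the decoding steps in the order in which a reader applies them. The following Python sketch shows that order for the two filters used in the example, relying only on the standard `base64` and `zlib` modules. The function name `decode_stream` is hypothetical, predictor parameters are not handled, and the sample data is generated in the snippet itself rather than taken from a real PDF.

```python
import base64
import zlib


def decode_stream(data: bytes, filters: list[str]) -> bytes:
    """Apply the filters of a stream dictionary in the order they are listed."""
    for name in filters:
        if name == "ASCII85Decode":
            body = data.strip()
            if body.endswith(b"~>"):     # PDF marks the end of ASCII85 data with ~>
                body = body[:-2]
            data = base64.a85decode(body)
        elif name == "FlateDecode":
            data = zlib.decompress(data)
        else:
            raise ValueError(f"filter not covered by this sketch: {name}")
    return data


# Build a sample stream body the way a writer would: compress first, then ASCII85-encode.
original = b"\x00\x7f\xff" * 19
encoded = base64.a85encode(zlib.compress(original)) + b"~>"

decoded = decode_stream(encoded, ["ASCII85Decode", "FlateDecode"])
print(decoded == original)
```

Note that the writer applies the encodings in the reverse order of the Filter array: the data is Flate-compressed first and then wrapped in ASCII85.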
PDF Operators
Operators are keywords that compose page graphics and, as mentioned earlier, are typically one or two characters long. There are two kinds of PDF operators:
* executing actions or setting properties of the graphics state.
PDF operator | Description
x y m | begin a new subpath by moving the current point to coordinates (x, y)
x y l | append a straight line segment from the current point to the point (x, y)
x1 y1 x2 y2 x3 y3 c | append a cubic Bezier curve to the current path
h | close the current subpath
x y width height re | append a rectangle to the current path
a b c d e f cm | modify the current transformation matrix by concatenating the specified matrix
S | stroke the path
s | close and stroke the path
f | fill the path
F | equivalent to f, kept for compatibility
W | modify the current clipping path by intersecting it with the current path
font size Tf | set the text font to font and the text font size to size
charSpace Tc | set the character spacing to charSpace
q | save the current graphics state on the graphics state stack
Q | restore the graphics state from the graphics state stack
lineWidth w | set the line width in the graphics state
lineCap J | set the line cap style in the graphics state
* grouping
PDF operator | Description
BT ... ET | begin and end a text object
BI ... EI | begin and end an image object
BMC ... EMC | begin and end a marked-content sequence
BX ... EX | begin and end a compatibility section
The BX…EX operators are a special kind of grouping operator. They enclose portions of page content in which unrecognized operators must be ignored. Thus, they are the equivalent of AI %_ pseudo-comments.
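To show how the state-setting operators interact with q and Q, here is a small Python sketch of a graphics state stack. It models only the line width and line cap, starts from the default values of 1.0 and 0, and takes already-parsed (operator, operands) pairs as input. The function name `run` and the dictionary layout are illustrative assumptions, not part of any PDF library.

```python
def run(operations):
    """Track line width across q/Q and report the width in effect at each stroke."""
    state = {"line_width": 1.0, "line_cap": 0}   # default graphics state values
    stack = []
    widths_at_stroke = []
    for operator, operands in operations:
        if operator == "q":
            stack.append(dict(state))            # save a copy of the current state
        elif operator == "Q":
            state = stack.pop()                  # restore the most recently saved state
        elif operator == "w":
            state["line_width"] = operands[0]
        elif operator == "J":
            state["line_cap"] = operands[0]
        elif operator == "S":
            widths_at_stroke.append(state["line_width"])
    return widths_at_stroke


widths = run([("q", []), ("w", [0.02]), ("S", []), ("Q", []), ("S", [])])
print(widths)
```

The first stroke uses the width set inside the q ... Q pair, while the second stroke, issued after Q, falls back to the value that was in effect before q.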
PDF file structure
A PDF file has four mandatory structural elements.
- One-line header, where the version of PDF language is written
%PDF-1.5
- Body, which contains the document’s objects. The structure of the body is described later in this article.
- Cross-reference table, which is used for quick random access to the document’s objects. It contains the byte offset of each object from the start of the file.
xref
0 6
0000000003 65535 f
0000000017 00000 n
0000000081 00000 n
0000000000 00007 f
0000000331 00000 n
0000000409 00000 n
- Trailer, which points to the last cross-reference table and contains the total number of objects in the cross-reference tables, the ID of the document, and references to:
- the previous cross-reference table, if there are several in the file;
- the document Root, which is represented by the Catalog dictionary;
- the meta information dictionary with Author, Creator, Title, Keywords, creation date, and modification date fields;
- the Encryption dictionary, if the document is encrypted.
trailer
<< /Size 15 /Root 2 0 R /Info 1 0 R >>
startxref
6224
A new cross-reference table and trailer are added after every update of the document. This is described later in the article.
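The four structural elements can be read back with a short Python sketch. It makes several assumptions: the file is a tiny synthetic one built inside the snippet, the cross-reference table has a single subsection, and the function name `read_structure` is hypothetical. The `%%EOF` end-of-file marker in the sample is added only to make the synthetic file look complete.

```python
import re


def read_structure(data: bytes):
    """Return the PDF version, the offset of the last xref table, and its entries."""
    version = re.match(rb"%PDF-(\d+\.\d+)", data).group(1).decode("ascii")

    # The number after the final "startxref" keyword is the xref table offset.
    tail = data[data.rindex(b"startxref") + len(b"startxref"):]
    xref_offset = int(tail.split()[0])

    lines = data[xref_offset:].splitlines()
    if lines[0].strip() != b"xref":
        raise ValueError("no cross-reference table at the given offset")
    first, count = (int(n) for n in lines[1].split())   # subsection header

    entries = {}
    for number, line in enumerate(lines[2:2 + count], start=first):
        offset, generation, kind = line.split()
        entries[number] = (int(offset), int(generation), kind.decode("ascii"))
    return version, xref_offset, entries


# A synthetic one-object file; the offsets are computed so that they are consistent.
body = b"%PDF-1.5\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
xref = (b"xref\n0 2\n"
        b"0000000000 65535 f \n"
        b"0000000009 00000 n \n")
trailer = b"trailer\n<< /Size 2 /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n" % len(body)
sample = body + xref + trailer

version, xref_offset, entries = read_structure(sample)
print(version, xref_offset, entries)
```

Because each in-use entry stores a byte offset, a reader can jump directly to `sample[entries[1][0]:]` and find the text `1 0 obj` there without scanning the body.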
Document structure
A PDF document has a tree-like structure whose root is the Catalog dictionary.
The Catalog contains references to the page description subtree, the outline subtree, and other document-level subtrees and leaf nodes.
2 0 obj
<< /Type /Catalog
   /Pages 3 0 R
   /Outlines 4 0 R
   /PageMode /UseOutlines
   /ViewerPreferences 5 0 R
   /OpenAction [6 0 R /Fit]
>>
endobj
The pages tree defines the ordering of page-tree nodes and page leaf nodes. It is this tree-like structure of the page set, together with a search algorithm, that allows quick navigation across thousands of pages to find the one needed.
A page dictionary contains a reference to a content stream, which can be compressed, as in the stream object shown earlier, or uncompressed. In the latter case, the PDF operators appear as human-readable text, as in the example below.
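One way such a search can work is sketched below in Python. The sketch assumes that every page-tree node records how many page leaves lie beneath it (the Count entry) and lists its children in a Kids array, so whole subtrees can be skipped without being visited. Plain dictionaries stand in for PDF objects, and the `Name` key and the function `find_page` are hypothetical additions used only to label the sample pages.

```python
def find_page(node: dict, index: int) -> dict:
    """Return the page leaf at zero-based position `index` below `node`."""
    if node["Type"] == "Page":
        if index != 0:
            raise IndexError("page index out of range")
        return node
    for kid in node["Kids"]:
        # A page-tree node covers Count leaves; a page leaf covers exactly one.
        count = kid["Count"] if kid["Type"] == "Pages" else 1
        if index < count:
            return find_page(kid, index)
        index -= count                    # skip this whole subtree
    raise IndexError("page index out of range")


def page(name):
    return {"Type": "Page", "Name": name}


root = {
    "Type": "Pages", "Count": 4,
    "Kids": [
        {"Type": "Pages", "Count": 2, "Kids": [page("p0"), page("p1")]},
        page("p2"),
        {"Type": "Pages", "Count": 1, "Kids": [page("p3")]},
    ],
}

print(find_page(root, 3)["Name"])
```

In a balanced tree the number of nodes visited grows with the depth of the tree rather than with the number of pages, which is what makes lookups in very long documents fast.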
7 0 obj
<< /Length 8 0 R >>
stream
1 0 0 1 0 0 cm
0 0 m
595 0 l
595 842 l
0 842 l
h
W
n
q
/Alpha1 gs
0 0 0 rg
0 0 0 RG
0 J
q
0.96593 0.25882 -0.25882 0.96593 0 0 cm
1 0 0 1 0 0.25882 cm
0.02 w
-0.96593 0 m
0 -0.25882 l
0 -0.25882 0 -0.25882 0 -0.25882 c
0.14294 -0.25882 0.25882 -0.14294 0.25882 0 c
0.25882 0.14294 0.14294 0.25882 0 0.25882 c
h
S
Q
endstream
endobj
Besides an array of child nodes (page-tree or page nodes), the Pages dictionary contains a reference to the Resources dictionary, which in turn refers to fonts, ProcSets, images (XObject), etc.
9 0 obj
<< /ProcSet 10 0 R
   /XObject 11 0 R
   /Font 12 0 R
   /ExtGState 13 0 R
>>
endobj
Annotations and other subtrees are mentioned briefly in the Features section.
Features
Graphics possibilities of PDF format
There is no need to list the graphics and text drawing capabilities that are common to most page description languages. It is enough to say that the range of supported fonts and color spaces is the same as in PostScript.
Fonts:
- Adobe Type 0
- Adobe Type 1
- Compact Fonts (CFF)
- Chameleon
- TrueType
- CID-keyed
Color spaces:
- DeviceGray
- DeviceRGB
- DeviceCMYK
- DeviceN
- Separated colors
- Spot
- CIE-based
Transparency
PDF supports transparency.
External files
Any media or document file can be embedded in a PDF or referred to from a document.
Hyperlinks
Hyperlinks are supported in PDF.
Selective and interactive view
PDF allows showing only the parts of the content, and of its appearance, that are necessary for a certain use, and hiding the others. This is useful, for example, when importing Adobe Illustrator graphics that have layers, some of which are needed for working in Adobe Illustrator but not for viewing in Adobe Acrobat Reader. Another case of selective view is an article written in different languages, or adapted for users with disabilities, but saved in one document. There can also be different views for different uses: viewing, designing, and printing.
An interactive view of a PDF includes the ability:
- to view and add annotations to parts of documents;
- to edit the document with the possibility to view all edits;
- to order the content of the document in different threads of articles gathered by certain subjects.
An annotation is a sort of floating box containing notes, sound, video, or some other content.
Interactive navigation
Navigating between different parts of a document can be done in several ways:
- moving through the document from the start to the necessary page consecutively, page by page;
- jumping from one part of the document to another one with outline items;
- jumping from one page to another one by clicking thumbnail images;
- jumping from one part of the document to another one with article thread items;
- jumping from one part of the document to another one using viewports and moving between them by pressing <Tab>.
Moving between viewports and hiding some parts of the document are implemented by means of the Viewport and NavigationNode dictionaries.
Incremental updates
All changes made to a PDF document are appended to the document without erasing the previous content. Every time the document is changed, a new xref (cross-reference table) and trailer are added. The new cross-reference table contains references to the added or removed objects and to the previous cross-reference table. This mechanism makes it possible to assemble the final document content while still storing the previous states of the document.
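The way a reader assembles the final content from several cross-reference tables can be sketched in Python as follows. The data model is a simplifying assumption: each table is a dictionary keyed by its byte offset, holding its entries and the offset of the previous table (`prev`), and the function name `merge_xref_sections` is hypothetical. The sample offsets are invented for illustration.

```python
def merge_xref_sections(sections: dict, newest_offset: int) -> dict:
    """Walk from the newest xref table to the oldest; the newest entry for an object wins."""
    merged = {}
    offset = newest_offset
    while offset is not None:
        section = sections[offset]
        for obj_num, entry in section["entries"].items():
            merged.setdefault(obj_num, entry)   # keep the entry from the later update
        offset = section.get("prev")            # follow the link to the previous table
    return merged


sections = {
    # Original file: objects 1 and 2 are in use.
    45: {"entries": {1: ("n", 9), 2: ("n", 30)}, "prev": None},
    # First update: object 2 is rewritten at a new offset, object 3 is marked free.
    200: {"entries": {2: ("n", 120), 3: ("f", 0)}, "prev": 45},
}

merged = merge_xref_sections(sections, newest_offset=200)
print(merged)
```

Starting the walk from an older table instead of the newest one yields an earlier state of the document, because the appended bytes never overwrite the original objects.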
Performance
Fast navigation through pages is provided by the tree-like structure of Pages and an effective search algorithm. Performance can be improved further by combining repeated graphics elements into one object, called a Form XObject, and using that one object wherever it is needed. The whole document can also be optimized for fast viewing through linearization. Linearization was originally devised for efficient viewing of PDF documents accessed over the web. A linearized PDF document is read-only: any change to it requires linearizing it again.
Fast navigation between document objects is provided by cross-reference tables, which store object offsets from the start of the file.
Compression
Compression of PDF documents, usually with Flate encoding, allows large documents to occupy relatively little disk space. For example, the PDF specification file, which contains 758 pages with an outline, thumbnails, images, and tables, is about 9 MB in size.
Security
PDF documents can be encrypted to give differentiated access to certain users only, and they can be signed. Digital signing allows the identity of the user and the document’s content to be authenticated. A digital signature binds the state of the document at the time of signing to user information. A digital signature can take any form, from purely mathematical to a retinal scan, provided that a corresponding signature handler is available.
Interactive forms
Interactive forms, also called AcroForms, are used for gathering information from users. They can validate, format, and send user data to a server.
Presentation
There are several means of presentation in PDF:
- actions that are executed when the document is opened;
- actions that are executed when a page is opened;
- duration of showing a page;
- effects that appear during the transition from page to page.
Media content
Images, sounds, movie clips and 3D graphics can be added to PDF documents.
Data extraction
PDF allows adding markup that lets external applications extract the necessary data. A document with such markup is called Tagged PDF.
Prepress support
Preparing for publishing includes printer’s marks, color separation, output intents and trapping.
What is the use of a PDF file?
The main application of PDF documents is electronic document interchange and viewing in different environments.
How do I make a PDF file?
PDF documents can be created and edited in standalone Adobe Acrobat applications.
How do I open a PDF file?
You can open and view PDF files in the standalone Adobe Acrobat Reader application or in the Google Chrome browser with a PDF plugin. Simple utilities such as Sumatra PDF, Foxit Reader, or Free PDFReader can also be used. Another option is to view PDFs online, for example, on Google Drive.