Working with PDF/A or PDF/UA

The PDF/A and PDF/UA formats impose several requirements on document content that cannot be fulfilled automatically when a Word document is converted to PDF. To produce a fully PDF/A or PDF/UA conformant document, these requirements must be verified and corrected either in the Word document before conversion or in the PDF document after conversion.

The basic requirements concern the structure and the fonts of a PDF/A or PDF/UA document. Both are covered in the following sections.

Document Structure Requirements

The requirements in this section apply to the PDF/A-1a, PDF/A-2a, PDF/A-4, and PDF/UA-1 formats.

Aspose.Words behaves in specific ways when converting to the various PDF format standards, and you need to take this behavior into account to get the expected result. The subsections below describe it and the ways to address each requirement.

Structure Type

PDF standard compliance levels within Aspose.Words Presence of requirement
PDF/A-1a tick
PDF/A-2a tick
PDF/UA-1 tick

A PDF document is a sequence of blocks such as headings, paragraphs, tables, and others. These blocks form a document structure, which can be strong or weak.

Both strong and weak structures are valid for PDF/A. Microsoft Word documents have a weak structure by design, so Aspose.Words creates PDF documents with a weak structure as well and generates headings according to the outline levels of paragraphs in the source document.

For a PDF/UA-1 document with a weak structure, it is additionally required that the heading levels go in order without gaps.

To get correct output, make sure that the source document content is properly organized and that outline levels are correctly specified for paragraphs. Otherwise, verify and fix the structure of the output PDF document.
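The "in order without gaps" rule can be checked before conversion once the outline levels of the headings have been collected. The sketch below is a standalone illustration, not part of the Aspose.Words API: the function name `FirstHeadingGap` is hypothetical, levels are assumed to be 1-based, and the first heading is assumed to have to be level 1.

```cpp
#include <cstddef>
#include <vector>

// Returns the index of the first heading that breaks the ordering rule,
// or -1 if the sequence of heading levels has no gaps.
// Assumptions: levels are 1-based and the first heading must be level 1.
int FirstHeadingGap(const std::vector<int>& levels)
{
    int previous = 0; // virtual level that precedes the first heading
    for (std::size_t i = 0; i < levels.size(); ++i)
    {
        // A heading may go at most one level deeper than the previous one;
        // returning to any shallower level is always allowed.
        if (levels[i] < 1 || levels[i] > previous + 1)
            return static_cast<int>(i);
        previous = levels[i];
    }
    return -1;
}
```

For example, the sequence 1, 2, 3, 2 passes, while 1, 3 is reported at index 1 because level 2 is skipped.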

Marking the Content as an Artifact

PDF standard compliance levels within Aspose.Words Presence of requirement
PDF/A-1a tick
PDF/A-2a tick
PDF/UA-1 tick

At the moment, Aspose.Words marks page headers and footers, note separators, repeated table header cells, and decorative images as artifacts. Note that this list may be updated in the future.

If a document contains other content that should be marked as an artifact, or if any content marked as an artifact is in fact real content, this must be fixed in the output PDF.

Natural Language Specification

PDF standard compliance levels within Aspose.Words Presence of requirement
PDF/A-1a tick
PDF/A-2a tick
PDF/UA-1 tick

Text language is specified in Microsoft Word documents. Aspose.Words exports the specified language to the output PDF as a Lang attribute attached either to a marked-content sequence or to a Span tag, depending on the ExportLanguageToSpanTag property. Generally, there are no language issues when the text is entered by a user in Microsoft Word, but the language may be inaccurate if the text is generated automatically.

Figure Caption

PDF standard compliance levels within Aspose.Words Presence of requirement
PDF/A-1a
PDF/A-2a
PDF/UA-1 tick

Microsoft Word documents allow users to add figure captions.

Currently, Aspose.Words cannot export captions with the Caption tag, so captions must be tagged in the output PDF.

Alternate Descriptions

PDF standard compliance levels within Aspose.Words Presence of requirement
PDF/A-1a tick
PDF/A-2a tick
PDF/UA-1 tick

Microsoft Word documents allow users to add alternate text to images, shapes, and tables. Aspose.Words exports this alternate text to the output PDF.

PDF standard compliance levels within Aspose.Words Presence of requirement
PDF/A-1a
PDF/A-2a
PDF/UA-1 tick

In addition to the previous point, Microsoft Word documents also allow users to add alternate text to hyperlinks. Aspose.Words exports this alternate text to the output PDF.

Not every application lets you set an alternate description. For example, Adobe Acrobat currently does not let you set one for hyperlinks. In Microsoft Word, you can do this as follows:

alternate-descriptions-hyperlinks-mw

The Microsoft Word GUI does not let you set alternate text for the auto-generated hyperlinks in a table of contents (TOC). Aspose.Words can update such fields and generate the links itself.

The following code example sets alternate text for TOC hyperlinks using the Aspose.Words Document Object Model (DOM) and saves the document as PDF/UA-1:

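// filename is assumed to be a System::String with the path to the source document.
auto doc = MakeObject<Document>(filename);

// TOC entries are HYPERLINK fields that point to auto-generated "_Toc" bookmarks.
for (const auto& field : System::IterateOver(doc->get_Range()->get_Fields()))
{
    if (field->get_Type() != FieldType::FieldHyperlink)
        continue;

    auto link = System::ExplicitCast<FieldHyperlink>(field);
    // Match on the link target rather than on the visible text (assumed intent).
    auto target = link->get_SubAddress();
    if (!System::String::IsNullOrEmpty(target) && target.Contains(u"_Toc"))
        link->set_ScreenTip(link->get_DisplayResult()); // entry text becomes the alternate text
}

// Save as PDF/UA-1 with the document structure and title exported.
auto opt = MakeObject<PdfSaveOptions>();
opt->set_Compliance(PdfCompliance::PdfUa1);
opt->set_DisplayDocTitle(true);
opt->set_ExportDocumentStructure(true);
opt->get_OutlineOptions()->set_HeadingsOutlineLevels(3);
opt->get_OutlineOptions()->set_CreateMissingOutlineLevels(true);

// Write the result next to the source file with an "_aw.pdf" suffix.
auto outFile = filename.Substring(0, filename.LastIndexOf(u'.')) + u"_aw.pdf";
doc->Save(outFile, opt);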

Table Headers

PDF standard compliance levels within Aspose.Words Presence of requirement
PDF/A-1a
PDF/A-2a
PDF/UA-1 tick

Tables in PDF/UA-1 documents must have headers: column headers, row headers, or both. PDF/A only requires standard table markup, which has no additional restrictions. Note that Aspose.Words generates the standard table markup automatically.

Replacement Text

PDF standard compliance levels within Aspose.Words Presence of requirement
PDF/A-1a tick
PDF/A-2a tick
PDF/UA-1

Microsoft Word documents do not allow users to set replacement text, so this needs to be verified and fixed in the output PDF:

AcrobatReplacementText

Abbreviations and Acronyms Expansions

PDF standard compliance levels within Aspose.Words Presence of requirement
PDF/A-1a tick
PDF/A-2a tick
PDF/UA-1

Microsoft Word documents do not allow users to set expansions for abbreviations and acronyms, so this needs to be verified and fixed in the output PDF:

AcrobatSplitAddExpansionText

Document Title

PDF standard compliance levels within Aspose.Words Presence of requirement
PDF/A-1a
PDF/A-2a
PDF/A-4
PDF/UA-1 tick

A PDF/UA-1 document must have a title.

Font Requirements

PDF standard compliance levels within Aspose.Words Presence of requirement
PDF/A-1a tick
PDF/A-1b tick
PDF/A-2a tick
PDF/A-2b tick
PDF/A-4 tick
PDF/UA-1 tick

Fonts also need attention when converting to the PDF/A-1, PDF/A-2, PDF/A-4, or PDF/UA-1 formats using Aspose.Words. Take the points below into account to avoid problems with the output document.

The sections below describe these points and the ways to address them.

PDF standard compliance levels within Aspose.Words Presence of requirement
PDF/A-1a tick
PDF/A-1b tick
PDF/A-2a tick
PDF/A-2b tick
PDF/A-4 tick
PDF/UA-1 tick

Aspose.Words does not verify the legal restrictions of the fonts used. This is the user's responsibility: do not provide fonts whose license does not permit their use in PDF conversion with Aspose.Words.

.notdef Glyph

PDF standard compliance levels within Aspose.Words Presence of requirement
PDF/A-1a
PDF/A-1b
PDF/A-2a tick
PDF/A-2b tick
PDF/A-4 tick
PDF/UA-1 tick

The use of the .notdef glyph is prohibited. The .notdef glyph appears if a document contains characters that are not present in the selected font and cannot be resolved through the Font Fallback mechanism either.
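The condition above can be modeled as a coverage check. The sketch below is a simplified standalone illustration and does not reproduce how Aspose.Words resolves fonts: each font is represented only by a hypothetical set of code points it covers, and `Coverage` and `UnresolvedCharacters` are illustrative names.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical model: a font is described only by the code points it covers.
using Coverage = std::set<char32_t>;

// Returns each character (once, in order of first appearance) that neither the
// selected font nor any fallback font covers. In this model, these are the
// characters that would end up as the .notdef glyph.
std::u32string UnresolvedCharacters(const std::u32string& text,
                                    const Coverage& selected,
                                    const std::vector<Coverage>& fallbacks)
{
    std::u32string missing;
    for (char32_t cp : text)
    {
        // Try the selected font first, then each fallback font in order.
        bool covered = selected.count(cp) > 0;
        for (std::size_t i = 0; !covered && i < fallbacks.size(); ++i)
            covered = fallbacks[i].count(cp) > 0;

        if (!covered && missing.find(cp) == std::u32string::npos)
            missing.push_back(cp);
    }
    return missing;
}
```

An empty result means that, under these assumptions, every character in the text can be mapped to a real glyph.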

Private Use Area (PUA)

PDF standard compliance levels within Aspose.Words Presence of requirement
PDF/A-1a
PDF/A-1b
PDF/A-2a tick
PDF/A-2b tick
PDF/A-4 tick
PDF/UA-1

Private Use Area (PUA) characters occur mostly with Windows symbolic fonts such as “Symbol”, “Wingdings”, “Webdings”, and others. Microsoft Word formats do not provide an option to store the actual text for such characters.

“Segoe UI Symbol” is a Windows Unicode font that can be used as an alternative to symbolic fonts.
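To find out whether text contains PUA characters before conversion, it is enough to test each code point against the Private Use ranges defined by Unicode. The sketch below is a standalone illustration that works on UTF-32 text and does not use the Aspose.Words API. The names `IsPrivateUse` and `FindPrivateUse` are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// True if the code point lies in one of the Unicode Private Use Areas.
bool IsPrivateUse(char32_t cp)
{
    return (cp >= 0xE000 && cp <= 0xF8FF)       // Private Use Area in the BMP
        || (cp >= 0xF0000 && cp <= 0xFFFFD)     // Supplementary Private Use Area-A
        || (cp >= 0x100000 && cp <= 0x10FFFD);  // Supplementary Private Use Area-B
}

// Returns the positions of all PUA characters in UTF-32 text.
std::vector<std::size_t> FindPrivateUse(const std::u32string& text)
{
    std::vector<std::size_t> positions;
    for (std::size_t i = 0; i < text.size(); ++i)
    {
        if (IsPrivateUse(text[i]))
            positions.push_back(i);
    }
    return positions;
}
```

Each reported position is a candidate for replacement with a regular Unicode character from a font such as “Segoe UI Symbol”.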