PHP Subsetting TrueType font files for PDF embedding

There are cases where we want to subset a TrueType font file before embedding it into a PDF document. The main reason is to reduce its size. A TrueType font file may contain thousands of glyph descriptions, but we may need only a few of them. For example, if our PDF document contains only Latin characters, we will need approximately 100 glyphs. If it contains Latin and Greek characters, we will need approximately 170 glyphs.

The TrueType font file 'arial.ttf', as delivered with Microsoft Windows, is 760 Kbytes long. Here are the tables that exist in the 'arial.ttf' TrueType font file:

	  Name     Offset     Length
	  ==========================
	  DSIG     772540       6012
	  GDEF     720676        706
	  GPOS     721384      44058
	  GSUB     765444       7064
	  JSTF     772508         30
	  LTSH      14284       3421
	  OS/2        520         96
	  PCLT     720620         54
	  VDMX      17708       4500
	  cmap     104296       8798
	  cvt      117752       1620
	  fpgm     113096       1646
	  gasp     720604         16
	  glyf     133044     542892
	  hdmx      22208      82088
	  head        396         54
	  hhea        452         36
	  hmtx        616      13668
	  kern     675936       5472
	  loca     119372      13672
	  maxp        488         32
	  name     681408       2656
	  post     684064      36539
	  prep     114744       3006
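
A listing like the one above can be read from the table directory at the start of the file. The following is a minimal sketch, not the code of the TTF class: the function name "listTtfTables" is hypothetical, and it assumes a single-font file whose 12-byte header is followed by one 16-byte record per table (tag, checksum, offset, length).

```php
<?php
// Sketch: read the table directory of a TrueType font held in a string.
// Assumes PHP 7+ on a 64-bit build; "listTtfTables" is a hypothetical name.
function listTtfTables(string $font): array
{
    // numTables is the uint16 at byte offset 4 of the header.
    $numTables = unpack('n', substr($font, 4, 2))[1];
    $tables = [];
    for ($i = 0; $i < $numTables; $i++) {
        // Each record: 4-byte tag, then checksum, offset and length (uint32 each).
        $rec = unpack('a4tag/NcheckSum/Noffset/Nlength', substr($font, 12 + 16 * $i, 16));
        $tables[$rec['tag']] = ['offset' => $rec['offset'], 'length' => $rec['length']];
    }
    return $tables;
}
```

For a real file, the input would come from something like file_get_contents('arial.ttf') (the path is an assumption).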
	

By far, the largest table is "glyf". This table contains the glyph descriptions, that is, for each glyph, the x and y coordinates of the points of the contours of the glyph, together with instructions (hints) that help rendering at low resolutions. If we subset the "glyf" table, the size of the TrueType font file will be reduced a lot.

We must be careful when we subset the table "glyf". There are two types of glyphs, "simple" glyphs and "composite" glyphs. Simple glyphs have no dependencies. Composite glyphs depend on other glyphs.

For example, in the German alphabet the characters "A", "O", "U", "a", "o" and "u" can accept the "umlaut diacritic". It is common for the font designer to design a glyph for the "umlaut diacritic", and create "a with umlaut" as a composite glyph which consists of two components, namely component "a" and component "umlaut diacritic", with the second component placed on top of the first one. The same applies to "o with umlaut" and "u with umlaut". In this case, the font designer designs only one glyph (the "umlaut diacritic") and creates three new glyphs "by composition". If we subset the "glyf" table and we decide to include "a with umlaut", we must locate the glyph description, find out that it is a composite glyph and include in the subset its two components (component "a" and component "umlaut diacritic").

Another example is the "capital Greek Alpha" letter, which looks exactly the same as the "capital Latin A" letter. In this case, the font designer will create "capital Greek Alpha" as a composite glyph which consists of one component, namely component "capital Latin A". If we decide to create a subset which includes "capital Greek Alpha", the "capital Latin A" must be included as well.
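
The dependency rule can be sketched as a small closure computation. This is an illustration under stated assumptions, not the code of the TTFsubset class: the function names and the "$getGlyph" callback (glyph index to raw glyph bytes) are hypothetical. The flag values are the composite-glyph flags of the "glyf" table format, and PHP 7+ is assumed.

```php
<?php
// Return the glyph indices referenced by one raw glyph description
// (an empty array for simple or empty glyphs).
function componentGlyphs(string $glyph): array
{
    $ARG_1_AND_2_ARE_WORDS    = 0x0001;
    $WE_HAVE_A_SCALE          = 0x0008;
    $MORE_COMPONENTS          = 0x0020;
    $WE_HAVE_AN_X_AND_Y_SCALE = 0x0040;
    $WE_HAVE_A_TWO_BY_TWO     = 0x0080;

    if (strlen($glyph) < 10) {
        return [];                          // empty glyph (for example the space)
    }
    // numberOfContours is an int16; read as uint16, negative means >= 0x8000.
    if (unpack('n', substr($glyph, 0, 2))[1] < 0x8000) {
        return [];                          // simple glyph, no dependencies
    }
    $pos = 10;                              // skip numberOfContours and bounding box
    $components = [];
    do {
        $u = unpack('n2', substr($glyph, $pos, 4));
        $flags = $u[1];
        $components[] = $u[2];              // glyph index of this component
        $pos += 4;
        $pos += ($flags & $ARG_1_AND_2_ARE_WORDS) ? 4 : 2;   // the two arguments
        if ($flags & $WE_HAVE_A_SCALE) {
            $pos += 2;
        } elseif ($flags & $WE_HAVE_AN_X_AND_Y_SCALE) {
            $pos += 4;
        } elseif ($flags & $WE_HAVE_A_TWO_BY_TWO) {
            $pos += 8;
        }
    } while ($flags & $MORE_COMPONENTS);
    return $components;
}

// Expand a set of wanted glyph indices with all (possibly nested) components.
function glyphClosure(array $wanted, callable $getGlyph): array
{
    $seen = [];
    $queue = array_values($wanted);
    while ($queue) {
        $g = array_pop($queue);
        if (isset($seen[$g])) {
            continue;
        }
        $seen[$g] = true;
        foreach (componentGlyphs($getGlyph($g)) as $c) {
            $queue[] = $c;
        }
    }
    $result = array_keys($seen);
    sort($result);
    return $result;
}
```

The loop keeps following components until no new glyph index appears, so a composite whose component is itself a composite is also handled.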

Another reduction in size can be achieved by including only the tables that a "conforming PDF reader" (such as Adobe Acrobat Reader) uses. The PDF specification states that the following tables are used:

  1. "glyf" (glyph data). Contains the glyph descriptions as discussed above. Its size depends on the number of glyphs (drops when the number of glyphs is reduced).
  2. "head" (font header). A 54-byte table that contains global information about the font. We will keep the table as is and only update the field "indexToLocFormat" (see below).
  3. "hhea" (horizontal header). A 36-byte table that contains information for horizontal layout. We will keep the table as is and only update the field "numberOfHMetrics" (see below).
  4. "hmtx" (horizontal metrics). A table that gives the advance width and the left side bearing for each glyph. Its size depends on the number of glyphs (drops when the number of glyphs is reduced).
  5. "loca" (glyph index to location). A mapping table that gives, for each glyph, the offset (from the beginning of the "glyf" table) at which the glyph description starts. Its size depends on the number of glyphs (drops when the number of glyphs is reduced).
  6. "maxp" (maximum profile). A 32-byte table that establishes the maximum requirements for the font. We will keep the table as is and only update the field "numGlyphs" (see below).
  7. "cvt " (control value table). Contains a list of values that can be referenced by instructions. We will keep the table as is.
  8. "fpgm" (font program). Contains a list of instructions that are executed only once, when the font is initialized. We will keep the table as is.
  9. "prep" (control value program). Contains a list of instructions that are executed whenever the font size or the transformation matrix changes. We will keep the table as is.
  10. "cmap" (character to glyph index). A mapping table that gives, for each character code, the glyph index this character code maps to (according to different encodings). This table is used when the TrueType font file is embedded as a simple font.

From the above it is clear that in order to subset the TrueType font file, we have to process ten tables:
  1. Three tables ("cvt ", "fpgm" and "prep") will be kept "as is".
  2. Three tables ("head", "hhea" and "maxp") will be kept "as is" with minor updates.
  3. Four tables ("hmtx", "cmap", "loca" and "glyf") will be completely restructured.

Before we start reconstructing the tables, we must decide which glyphs to include and assign them a new index. If we subset the font just to display a text, we only need the glyphs for the characters in the text. For example, if we want to display the text "Hello World!", we only need nine glyphs, for the characters " ", "!", "H", "W", "d", "e", "l", "o" and "r". If, on the other hand, we subset the font in order to use it in a fillable form, we will need the glyphs for all the characters that the user is expected to type in the fields of the form. If the user is expected to type only Latin characters, we will need approximately 100 glyphs. If the user is expected to type Latin and Greek characters, we will need approximately 170 glyphs.

For each character, the "original glyph index" and the "new glyph index" will be kept. In the original TrueType font file, the character was mapped to a glyph index; this is the "original glyph index". In the new (subsetted) TrueType font file, the character will be mapped to another glyph index; this is the "new glyph index". By convention, glyph index 0 in all TrueType font files is reserved for the "missing character", that is, if a character cannot be mapped in the font, it will be mapped to glyph index 0. (The glyph description for this character is typically a space or a square.) When we assign "new glyph indices" to the characters we will include in the subset, we can assign numbers sequentially starting from 1. We will always include glyph index 0, and the glyph description for this glyph will be kept the same as in the original TrueType font file.
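
The index assignment can be sketched as follows. The function name and both inputs are hypothetical: "$charToOriginal" maps character codes to original glyph indices (as read from the original "cmap"), and "$extraGlyphs" lists original indices that are needed only as components of composite glyphs.

```php
<?php
// Sketch: assign sequential new glyph indices, keeping glyph 0 in place.
// Returns a map: original glyph index => new glyph index.
function assignNewGlyphIndices(array $charToOriginal, array $extraGlyphs = []): array
{
    $originalToNew = [0 => 0];      // the "missing character" glyph stays at 0
    $next = 1;
    ksort($charToOriginal);         // visit characters in character-code order
    foreach (array_merge(array_values($charToOriginal), $extraGlyphs) as $orig) {
        if (!isset($originalToNew[$orig])) {
            $originalToNew[$orig] = $next++;
        }
    }
    return $originalToNew;
}
```

Numbering in character-code order is a design choice of this sketch: consecutive character codes then tend to receive consecutive new glyph indices, which keeps range-based "cmap" encoding tables short.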

Reconstructing the table "hmtx"

This is easy. We have already collected the characters we want to include and, for each character, the original glyph index and the new glyph index. We traverse the original "hmtx" table, and for each "original glyph index" we collect the corresponding advance width and left side bearing. We push these values into a PHP array.

If the font is not monospaced, we can dump this array as the "hmtx" table and declare "numberOfHMetrics" equal to "numGlyphs", that is, define the advance width and the left side bearing for all glyphs. If the font is monospaced, or if we want to take advantage of the two-part nature of the "hmtx" table, we can traverse the array entries from last to first and collect the trailing entries that share the same advance width. This splits the array in two parts. The first part must end with one entry that carries the shared advance width, because a reader applies the last advance width of the first part to every glyph in the second part. We declare "numberOfHMetrics" equal to the number of entries in the first part, dump the advance width and the left side bearing of these entries, and dump only the left side bearing of the entries in the second part.
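
A minimal sketch of the two-part layout, assuming PHP 7.1+ and a hypothetical input "$metrics": a list ordered by new glyph index, each entry being [advanceWidth, leftSideBearing]. The function name is also hypothetical.

```php
<?php
// Sketch: serialize "hmtx" and compute numberOfHMetrics.
// Returns [binary table data, numberOfHMetrics].
function buildHmtx(array $metrics): array
{
    $metrics = array_values($metrics);
    $numberOfHMetrics = count($metrics);
    // Drop trailing entries whose advance width repeats the previous one.
    // One entry with the shared width always stays in the first part.
    while ($numberOfHMetrics > 1
        && $metrics[$numberOfHMetrics - 1][0] === $metrics[$numberOfHMetrics - 2][0]) {
        $numberOfHMetrics--;
    }
    $data = '';
    foreach ($metrics as $i => [$advance, $lsb]) {
        if ($i < $numberOfHMetrics) {
            $data .= pack('n', $advance);       // uint16 advance width
        }
        $data .= pack('n', $lsb & 0xFFFF);      // int16 left side bearing
    }
    return [$data, $numberOfHMetrics];
}
```

For a monospaced font the loop runs down to 1, so the table holds a single advance width followed by left side bearings only.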

Reconstructing the table "cmap"

This is more difficult. The "cmap" table typically contains a few encoding tables. These tables map character codes to glyph indices based on some encoding. The formats covered here are "0", "2", "4" and "6"; newer fonts may also carry other formats (for example format "12"), which are not discussed. When we reconstruct the "cmap" table, we must reconstruct all contained encoding tables, each in its own format.

Format "0" is easy. It is just a table of 256 entries, each entry being a glyph index. To reconstruct it, we traverse the table. If the entry has the value "0" (missing character), we keep the table entry value as is. Otherwise, if the entry has a value found in the "original glyph indices", we substitute the table entry value with the corresponding "new glyph index". Otherwise we set the table entry value to "0" (missing character).
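
The format "0" rule can be sketched in a few lines. Function names are hypothetical; "$originalToNew" is assumed to map original glyph indices to new ones. Since format "0" stores each glyph index in one byte, the sketch also assumes that every new glyph index used here is at most 255.

```php
<?php
// Sketch: remap the 256 entries of a format 0 encoding table.
function remapFormat0(array $glyphIdArray, array $originalToNew): array
{
    $out = [];
    foreach ($glyphIdArray as $code => $orig) {
        // Glyphs that are not part of the subset fall back to 0 (missing character).
        $out[$code] = $originalToNew[$orig] ?? 0;
    }
    return $out;
}

// Serialize: format (0), length (262), language, then 256 one-byte glyph indices.
function packFormat0(array $glyphIdArray, int $language = 0): string
{
    return pack('n3', 0, 262, $language) . pack('C*', ...array_values($glyphIdArray));
}
```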

Format "2" is used mainly for Japanese, Chinese and Korean characters. This part needs an update: I have never come across a format "2" table. If you ever come across a format "2" encoding table, just send me the TrueType font file.

Format "4" is more difficult. It is used when the character codes fall into several contiguous ranges, possibly with holes in some of the ranges. Consecutive character codes that map to consecutive glyph indices are stored efficiently, as a single segment. Character codes that map to unrelated glyph indices can also be stored, through the glyph index array that a segment may refer to. In order to reconstruct, we traverse the array of character codes, original glyph indices and new glyph indices. We locate subsequences in which both the character codes and the new glyph indices are consecutive, and push each subsequence into the new format "4" encoding table. An optimization is possible here when we locate a subsequence of pairs whose new glyph indices do not follow the character codes in sequence.
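
The run detection can be sketched as below. This is a simplified illustration, not the code of the TTFsubset class: it stores every segment through "idDelta" alone (all "idRangeOffset" values are 0), so the glyph-index-array optimization is not implemented. The function name and the input "$codeToNew" (character code, at most 0xFFFE, to new glyph index) are hypothetical; PHP 7.1+ is assumed.

```php
<?php
// Sketch: build a format 4 encoding table from character code => new glyph index.
function buildCmapFormat4(array $codeToNew, int $language = 0): string
{
    ksort($codeToNew);
    $segments = [];                         // each: [startCode, endCode, idDelta]
    foreach ($codeToNew as $code => $glyph) {
        $delta = ($glyph - $code) & 0xFFFF; // idDelta works modulo 65536
        $last = count($segments) - 1;
        if ($last >= 0 && $code === $segments[$last][1] + 1 && $delta === $segments[$last][2]) {
            $segments[$last][1] = $code;    // code and glyph both advance by 1: extend run
        } else {
            $segments[] = [$code, $code, $delta];
        }
    }
    $segments[] = [0xFFFF, 0xFFFF, 1];      // required final segment, maps to glyph 0

    $segCount = count($segments);
    $entrySelector = 0;                     // floor(log2(segCount))
    while ((1 << ($entrySelector + 1)) <= $segCount) {
        $entrySelector++;
    }
    $searchRange = 2 * (1 << $entrySelector);
    $rangeShift = 2 * $segCount - $searchRange;

    $end = $start = $deltas = $rangeOffsets = '';
    foreach ($segments as [$s, $e, $d]) {
        $end .= pack('n', $e);
        $start .= pack('n', $s);
        $deltas .= pack('n', $d);
        $rangeOffsets .= pack('n', 0);      // 0: glyph = (code + idDelta) mod 65536
    }
    $length = 16 + 8 * $segCount;           // 14-byte header + reservedPad + 4 arrays
    return pack('n7', 4, $length, $language, 2 * $segCount, $searchRange, $entrySelector, $rangeShift)
        . $end . pack('n', 0) . $start . $deltas . $rangeOffsets;
}
```

With this approach, a pair that breaks the sequence costs one 8-byte segment of its own, which is the case the optimization mentioned above would improve.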

Format "6" is similar to format "0". However, the table entries are not required to start at character code 0 or to number 256. They start at "firstCode" and there are "entryCount" of them (these are two fields in the encoding table header). The processing is similar to that of format "0" encoding tables. An optimization is possible here when the first and/or last entries are set to the value "0". In this case we can increase "firstCode" and decrease "entryCount", which results in a shorter table.
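
The trimming optimization for format "6" can be sketched as follows (function names are hypothetical; "$glyphIds" is assumed to hold the already remapped new glyph indices, the first one belonging to character code "$firstCode"):

```php
<?php
// Sketch: drop leading and trailing 0 entries of a format 6 glyph index array.
// Returns [new firstCode, trimmed glyph index list].
function trimFormat6(int $firstCode, array $glyphIds): array
{
    $glyphIds = array_values($glyphIds);
    while ($glyphIds && end($glyphIds) === 0) {
        array_pop($glyphIds);               // trailing zeroes: shrink entryCount
    }
    while ($glyphIds && $glyphIds[0] === 0) {
        array_shift($glyphIds);             // leading zeroes: shrink and move firstCode
        $firstCode++;
    }
    return [$firstCode, $glyphIds];
}

// Serialize: format (6), length, language, firstCode, entryCount, glyph indices.
function packFormat6(int $firstCode, array $glyphIds, int $language = 0): string
{
    $n = count($glyphIds);
    return pack('n5', 6, 10 + 2 * $n, $language, $firstCode, $n)
        . ($n ? pack('n*', ...$glyphIds) : '');
}
```

Zero entries in the middle of the array stay in place, since the table has no way to express holes.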

Reconstructing the table "glyf"

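This is easy. From the original "glyf" table, we collect the glyph descriptions we need. For composite glyphs, the component glyph indices stored inside the description refer to the original numbering, so they must be replaced with the corresponding new glyph indices. We concatenate the descriptions and dump them as the new "glyf" table. We keep the offset at which each glyph description starts (this will be used in the reconstruction of the table "loca", see below). Each glyph description must start at a 4-byte boundary, so we may have to right-pad with zeroes.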
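
The concatenation and padding can be sketched as below. The function name and the input "$glyphs" (raw glyph descriptions ordered by new glyph index) are hypothetical, and the sketch assumes that any component indices inside composite glyphs have already been renumbered.

```php
<?php
// Sketch: concatenate glyph descriptions, each starting at a 4-byte boundary.
// Returns [binary "glyf" data, list of offsets with one extra final entry].
function buildGlyf(array $glyphs): array
{
    $data = '';
    $offsets = [];
    foreach ($glyphs as $glyph) {
        $offsets[] = strlen($data);                 // where this glyph starts
        $data .= $glyph;
        $pad = (4 - strlen($data) % 4) % 4;         // right-pad with zeroes
        $data .= str_repeat("\0", $pad);
    }
    $offsets[] = strlen($data);                     // offset just past the last glyph
    return [$data, $offsets];
}
```

An empty description (a glyph with no outline) simply produces two equal consecutive offsets.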

Reconstructing the table "loca"

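If the total size of the table "glyf" is less than 128 Kbytes (strictly, if the final offset does not exceed 131070 bytes, so that half of it fits in 16 bits), the offsets will be 2 bytes long; otherwise they will be 4 bytes long. In the first case we divide each offset by 2 and dump the result; in the second case we dump the offsets unchanged. The choice is recorded in the "indexToLocFormat" field of the "head" table (0 for 2-byte offsets, 1 for 4-byte offsets). An offset immediately after the end of the last glyph description must also be dumped, so this table will contain "numGlyphs+1" entries.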
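
A minimal sketch of the two "loca" layouts, assuming a hypothetical input "$offsets" that already holds "numGlyphs+1" even byte offsets in ascending order (the function name is hypothetical as well):

```php
<?php
// Sketch: serialize "loca" and report the matching indexToLocFormat value.
// Returns [binary table data, indexToLocFormat].
function buildLoca(array $offsets): array
{
    $offsets = array_values($offsets);
    $last = $offsets[count($offsets) - 1];
    if ($last <= 2 * 0xFFFF) {
        // Short format: each entry is offset / 2 stored as uint16.
        $halved = array_map(function ($o) { return intdiv($o, 2); }, $offsets);
        return [pack('n*', ...$halved), 0];
    }
    // Long format: each entry is the offset itself stored as uint32.
    return [pack('N*', ...$offsets), 1];
}
```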

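After the reconstruction of the above 4 tables, 3 more tables ("head", "hhea" and "maxp") will be slightly updated with the new "indexToLocFormat", "numberOfHMetrics" and "numGlyphs" values respectively, and 3 more ("cvt ", "fpgm" and "prep") will be dumped "as is".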
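
The three small updates amount to overwriting one 16-bit field in each table. The sketch below uses a hypothetical function name and relies on the field positions of the table layouts: "indexToLocFormat" at byte 50 of "head", "numberOfHMetrics" at byte 34 of "hhea" and "numGlyphs" at byte 4 of "maxp".

```php
<?php
// Sketch: patch the three header tables; everything else is left untouched.
// Returns [head, hhea, maxp] as binary strings.
function updateHeaderTables(
    string $head, string $hhea, string $maxp,
    int $indexToLocFormat, int $numberOfHMetrics, int $numGlyphs
): array {
    $head = substr_replace($head, pack('n', $indexToLocFormat), 50, 2);
    $hhea = substr_replace($hhea, pack('n', $numberOfHMetrics), 34, 2);
    $maxp = substr_replace($maxp, pack('n', $numGlyphs), 4, 2);
    return [$head, $hhea, $maxp];
}
```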

The 10 tables will be passed as arguments to the "TTF::marshalAll" method, which will produce a new (subsetted) TrueType font file. Each table in the TrueType font file has a checksum (stored in the table directory) and the table "head" has a field "checkSumAdjustment". "TTF::marshalAll" is responsible for calculating these checksums.
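
The code of "TTF::marshalAll" is not reproduced here. As an independent sketch of the arithmetic involved, a table checksum is the sum of the table's big-endian 32-bit words modulo 2^32 (the table being zero-padded to a multiple of 4 bytes), and "checkSumAdjustment" is 0xB1B0AFBA minus the checksum of the whole file, computed while the field itself is set to 0. The function names are hypothetical and a 64-bit PHP build is assumed.

```php
<?php
// Sketch: checksum of a table (or of a whole font file) as an unsigned 32-bit sum.
function ttfChecksum(string $data): int
{
    // Pad with zeroes to a multiple of 4 bytes before summing.
    $data .= str_repeat("\0", (4 - strlen($data) % 4) % 4);
    $sum = 0;
    foreach (unpack('N*', $data) as $word) {
        $sum = ($sum + $word) & 0xFFFFFFFF;     // wrap around at 2^32
    }
    return $sum;
}

// Sketch: value for head.checkSumAdjustment, given the complete font file
// in which that field is still 0.
function checkSumAdjustment(string $font): int
{
    return (0xB1B0AFBA - ttfChecksum($font)) & 0xFFFFFFFF;
}
```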

The new (subsetted) TrueType font file can be embedded in the PDF document "as is". Typically it is embedded as a PDF stream object, "FlateDecode" compressed.
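
A minimal sketch of such a stream, assuming the zlib extension is available. The function name is hypothetical, and the surrounding object wrapper ("N 0 obj ... endobj") and the font descriptor that refers to the stream are left out. "gzcompress" produces zlib-format data, which is what the "FlateDecode" filter expects; "Length1" holds the uncompressed length of the font file.

```php
<?php
// Sketch: wrap a TrueType font file in a FlateDecode-compressed PDF stream.
function fontFileStream(string $ttf): string
{
    $compressed = gzcompress($ttf);
    return '<< /Length ' . strlen($compressed)
        . ' /Length1 ' . strlen($ttf)
        . " /Filter /FlateDecode >>\nstream\n"
        . $compressed
        . "\nendstream";
}
```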

You can find the complete PHP class here: TTFsubset.php

You can find the complete PHP class for TTF reading/writing here: TTF.php

The TTF subset produced as above contains only the tables needed by a PDF viewer. For a "general purpose" TTF subset, the "post", "name" and "OS/2" tables are also output (the first subsetted as it contains the glyph names, the second and third "as is"). You can find the classes here: TTFv2.zip

For comments, inquiries, etc, contact us at: info at 4real dot gr