PHP Macromedia Flash (SWF) file parser

SWF files are organized as a collection of tags.

The file begins with a header containing the signature (3 bytes), the version (1 byte), the file length (4 bytes), the frame size (variable length), the frame rate (2 bytes) and the frame count (2 bytes). To parse it, we first read the first 8 bytes. If the signature is 'CWS', the SWF file is ZLIB compressed; if the signature is 'FWS', it is not compressed. In the first case, we uncompress the rest of the file (from byte 9 to the end) before reading the remaining header fields.
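As an illustration of the first step, here is a minimal sketch that reads the 8-byte prefix and uncompresses the remainder when needed. The function name 'readSwfPrefix' and the returned array keys are assumptions made for this example, not the code in 'SWF.php'. The sketch needs PHP's zlib extension.

```php
<?php
// Sketch only: readSwfPrefix() is a hypothetical helper, not part of SWF.php.
function readSwfPrefix(string $data): array
{
    if (strlen($data) < 8) {
        throw new InvalidArgumentException('SWF data is shorter than 8 bytes');
    }
    // a3 = 3-byte signature, C = unsigned byte, V = unsigned 32-bit little-endian.
    $h = unpack('a3signature/Cversion/VfileLength', substr($data, 0, 8));

    if ($h['signature'] === 'CWS') {
        // Everything after the first 8 bytes is a ZLIB stream.
        $body = gzuncompress(substr($data, 8));
        if ($body === false) {
            throw new RuntimeException('ZLIB decompression failed');
        }
    } elseif ($h['signature'] === 'FWS') {
        // Uncompressed file: the rest can be used as is.
        $body = substr($data, 8);
    } else {
        throw new RuntimeException('Not an FWS or CWS file');
    }

    // 'body' starts with the frame size, frame rate and frame count.
    $h['body'] = $body;
    return $h;
}
```

Calling 'readSwfPrefix(file_get_contents('myfile.swf'))' would return the signature, version and file length, plus the uncompressed remainder under the 'body' key.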

The tags follow the header. Each tag begins with its tag type and its tag length. The tag length allows us to skip tags which we cannot parse.
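The following sketch shows how the type and length can be read and how the length lets a scanner jump from tag to tag. It relies on the SWF record header layout: a 16-bit little-endian value whose upper 10 bits are the type and lower 6 bits the length, with the value 0x3F signalling that a 32-bit length follows. The function name 'scanTags' and the array keys are assumptions for this example and need not match 'SWF.php'.

```php
<?php
// Sketch only: scanTags() is a hypothetical helper, not the code in SWF.php.
// $data is the uncompressed SWF data; $offset points at the first tag.
function scanTags(string $data, int $offset): array
{
    $tags = [];
    $end = strlen($data);
    while ($offset + 2 <= $end) {
        $start = $offset;
        // Short header: upper 10 bits = tag type, lower 6 bits = length.
        $typeAndLength = unpack('v', substr($data, $offset, 2))[1];
        $offset += 2;
        $type = $typeAndLength >> 6;
        $length = $typeAndLength & 0x3F;
        if ($length === 0x3F) {
            // Long header: the real length follows as 32 bits, little-endian.
            if ($offset + 4 > $end) {
                break;
            }
            $length = unpack('V', substr($data, $offset, 4))[1];
            $offset += 4;
        }
        $tags[] = ['type' => $type, 'length' => $length, 'offset' => $start];
        if ($type === 0) {
            break; // tag type 0 is the End tag
        }
        $offset += $length; // skip the tag body without parsing it
    }
    return $tags;
}
```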

The code of the parser is in the 'SWF.php' file, which contains 5 classes:

The 'SWF' class offers a constructor, a 'parseTag()' method, and the 'header' and 'tags' properties (both associative arrays). The constructor parses the header and performs a quick scan of the tags. For each tag, it keeps the type, the length and the offset at which the tag starts, measured from the beginning of the '.swf' file. If the tag is a definition tag, the id it defines is also kept.

Typical usage is as follows:

        $s = new SWF(file_get_contents('myfile.swf'));
        var_dump($s->header);
        var_dump($s->tags);
        foreach ($s->tags as $tag) {
            $ret = $s->parseTag($tag);
            var_dump($ret);
        }

'SWFextractText.php', 'SWFextractImages.php' and 'SWFextractShapes.php' are examples of using the parser.

SWFextractText.php

This is a utility to extract the text from a '.swf' file. We are interested in the following tags:

'DefineText' and 'DefineText2' tags define sets of text records. Each set of text records contains an optional fontID and an array of glyphIDs. If the fontID is present, the glyphs are displayed using the font having this fontID. If the fontID is absent, the fontID of the nearest preceding set of text records that specifies one is used.
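The fontID inheritance rule can be sketched as a single pass that remembers the last fontID seen. The record layout used here (arrays with an optional 'fontId' key and a 'glyphIds' key) and the function name 'resolveFontIds' are assumptions for illustration only. The real structure returned by 'parseTag()' may differ.

```php
<?php
// Sketch only: the record layout and the function name are hypothetical.
function resolveFontIds(array $textRecords): array
{
    $currentFontId = null; // no font seen yet
    $resolved = [];
    foreach ($textRecords as $record) {
        if (isset($record['fontId'])) {
            // This record selects a font; later records inherit it.
            $currentFontId = $record['fontId'];
        }
        $resolved[] = [
            'fontId'   => $currentFontId,
            'glyphIds' => $record['glyphIds'],
        ];
    }
    return $resolved;
}
```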

Knowing the fontID and the glyphIDs is only half the job. We must still map the glyphIDs to Unicode characters. We first locate the font having this fontID. If the font is defined with a 'DefineFont' tag, we locate the 'DefineFontInfo' or 'DefineFontInfo2' tag (defining info for this fontID) and use that tag's 'codeTable'. If the font is defined with a 'DefineFont2' or 'DefineFont3' tag, we use the tag's own 'codeTable'. If the font is defined with a 'DefineFont4' tag, we use the TTF parser to collect the 'cmap' table and build the glyphID to Unicode character mapping table.
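For the 'DefineFont4' case, a 'cmap' table maps character codes to glyph IDs, so it has to be inverted. Below is a minimal sketch of that inversion, assuming the 'cmap' data is already available as a PHP array of character code => glyphID. That array shape and the name 'invertCmap' are assumptions. The actual output of 'TTF.php' may look different.

```php
<?php
// Sketch only: assumes $cmap is an array of character code => glyph ID.
function invertCmap(array $cmap): array
{
    $glyphToCode = [];
    foreach ($cmap as $code => $glyphId) {
        // Several codes may share one glyph; keep the first code seen.
        if (!isset($glyphToCode[$glyphId])) {
            $glyphToCode[$glyphId] = $code;
        }
    }
    return $glyphToCode;
}
```

Keeping the first code for a shared glyph is an arbitrary choice made for this sketch.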

All character codes defined in 'codeTables' are 'UCS-2'. The last step is to translate the character codes to 'UTF-8', using the 'mb_convert_encoding' function.
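A minimal sketch of this last step follows, assuming the 'codeTable' is a PHP array indexed by glyphID that holds integer UCS-2 codes. The function name 'glyphsToUtf8' is hypothetical, and the sketch needs the mbstring extension.

```php
<?php
// Sketch only: assumes $codeTable[glyphId] holds an integer UCS-2 code.
function glyphsToUtf8(array $glyphIds, array $codeTable): string
{
    $ucs2 = '';
    foreach ($glyphIds as $glyphId) {
        if (isset($codeTable[$glyphId])) {
            // 'n' packs an unsigned 16-bit value in big-endian byte order.
            $ucs2 .= pack('n', $codeTable[$glyphId]);
        }
    }
    return mb_convert_encoding($ucs2, 'UTF-8', 'UCS-2BE');
}
```

GlyphIDs that have no entry in the table are silently skipped here, which is a simplification.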

SWFextractImages.php

This is a utility to extract the images from a '.swf' file. We are interested in the following tags:

Tags 'DefineBitsJPEG2', 'DefineBitsJPEG3' and 'DefineBitsJPEG4' can contain JPG, PNG and GIF images (although the tag names suggest JPG only). The 'DefineBits' tag contains JPG data only.

The 'JPEGTables' tag defines the JPEG encoding table for all JPEG images defined using the 'DefineBits' tag. While processing a 'DefineBits' tag, we prepend the 'JPEGTables' data.
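One way to join the two pieces into a single JPEG stream is sketched below. It assumes the 'JPEGTables' data is itself wrapped in start-of-image (FF D8) and end-of-image (FF D9) markers and that the 'DefineBits' data starts with its own start-of-image marker, so the inner pair is dropped. This is an assumption about the data layout and not necessarily how 'SWFextractImages.php' does it. The name 'joinJpegTables' is hypothetical.

```php
<?php
// Sketch only: joinJpegTables() is a hypothetical helper.
function joinJpegTables(string $tables, string $imageData): string
{
    if ($tables === '') {
        return $imageData; // nothing to prepend
    }
    // Drop the EOI marker (FF D9) that closes the tables stream.
    if (substr($tables, -2) === "\xFF\xD9") {
        $tables = substr($tables, 0, -2);
    }
    // Drop the SOI marker (FF D8) that opens the image stream.
    if (strncmp($imageData, "\xFF\xD8", 2) === 0) {
        $imageData = substr($imageData, 2);
    }
    return $tables . $imageData;
}
```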

The 'DefineBitsJPEG2' and 'DefineBitsJPEG3' tags define JPG, PNG or GIF images.
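Since the tag name does not reveal the actual format, the image data can be identified by its leading bytes. The sketch below checks the standard JPEG, PNG and GIF89a signatures. The function name 'detectImageType' is hypothetical, and the input is assumed to be the image data only, without any tag fields that precede it. The extra JPEG check covers older files that carry FF D9 FF D8 in front of the real JPEG data.

```php
<?php
// Sketch only: detectImageType() is a hypothetical helper.
function detectImageType(string $data): string
{
    // JPEG starts with SOI (FF D8); older SWF files may put FF D9 FF D8 first.
    if (strncmp($data, "\xFF\xD8", 2) === 0
        || strncmp($data, "\xFF\xD9\xFF\xD8", 4) === 0) {
        return 'jpg';
    }
    // PNG 8-byte signature.
    if (strncmp($data, "\x89PNG\r\n\x1A\n", 8) === 0) {
        return 'png';
    }
    // GIF89a signature.
    if (strncmp($data, 'GIF89a', 6) === 0) {
        return 'gif';
    }
    return 'unknown';
}
```

The returned string could then be used as the file extension when saving the image.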

The 'DefineBitsLossless' and 'DefineBitsLossless2' tags define RGB and RGBA bitmap data. The bitmap data are converted and saved as PNG (using the GD library). Note that alpha values are defined differently in SWF and GD, so a conversion is needed. In SWF, alpha values range from 0 to 255, with 0 meaning transparent. In GD, alpha values range from 0 to 127, with 0 meaning opaque.
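The alpha conversion can be sketched as follows: the 8-bit SWF value is halved to fit 7 bits and then inverted. The function name 'swfAlphaToGd' is hypothetical, and the rounding (an integer shift) is a choice made for this example.

```php
<?php
// Sketch only: swfAlphaToGd() is a hypothetical helper.
// SWF: 0 = transparent .. 255 = opaque. GD: 0 = opaque .. 127 = transparent.
function swfAlphaToGd(int $swfAlpha): int
{
    // Shift 0..255 down to 0..127, then invert the direction.
    return 127 - ($swfAlpha >> 1);
}
```

The result is in the range expected by GD's 'imagecolorallocatealpha()'.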

SWFextractShapes.php

This is a utility to extract the shapes from a '.swf' file. We are interested in the following tags:

The first pass parses and collects all shapes into an associative array. The second pass constructs a PDF file containing one shape per page. To keep it simple, shapes are only stroked (not filled). Each shape is first scaled and translated to correctly fit inside its page. We process the following shape records:

At the end, a 'close and stroke path' (s) PDF command is output. The state is restored (this cancels the translation and scaling) and a footer containing the shapeID is displayed.
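The scale, translate, stroke and restore sequence can be sketched as a function that emits PDF content-stream operators for a single polyline. Everything about its interface is an assumption made for this example: the name 'shapeToPdfOps', the point and bounding-box arrays, and the choice to centre the shape and flip the y axis (SWF y coordinates grow downwards, PDF y coordinates grow upwards). Curves and the footer are left out.

```php
<?php
// Sketch only: shapeToPdfOps() is a hypothetical helper.
// $points: list of [x, y] pairs in shape coordinates (one polyline).
// $bbox:   [xmin, ymin, xmax, ymax] of the shape.
function shapeToPdfOps(array $points, array $bbox, float $pageW, float $pageH, float $margin): string
{
    if (count($points) < 2) {
        return ''; // nothing to stroke
    }
    [$xmin, $ymin, $xmax, $ymax] = $bbox;
    $w = max($xmax - $xmin, 1);
    $h = max($ymax - $ymin, 1);

    // Uniform scale so the shape fits inside the page margins.
    $scale = min(($pageW - 2 * $margin) / $w, ($pageH - 2 * $margin) / $h);
    // Translation that centres the shape; y is flipped (negative y scale).
    $tx = ($pageW - $scale * $w) / 2 - $scale * $xmin;
    $ty = ($pageH + $scale * $h) / 2 + $scale * $ymin;

    // q saves the graphics state, cm applies the scaling and translation.
    $ops = ['q', sprintf('%.4F 0 0 %.4F %.2F %.2F cm', $scale, -$scale, $tx, $ty)];
    foreach ($points as $i => [$x, $y]) {
        // m starts the path, l appends a straight line segment.
        $ops[] = sprintf('%.2F %.2F %s', $x, $y, $i === 0 ? 'm' : 'l');
    }
    $ops[] = 's'; // close and stroke the path
    $ops[] = 'Q'; // restore the state, cancelling the transformation
    return implode("\n", $ops) . "\n";
}
```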

Here are the sources:

Macromedia Flash (SWF) file parser: SWF.php

Utility to extract the text from a SWF file: SWFextractText.php

Utility to extract the images from a SWF file: SWFextractImages.php

Utility to extract the shapes from a SWF file: SWFextractShapes.php

True type font file parser: TTF.php

For comments, inquiries, etc., contact us at: info at 4real dot gr